# 🧠 LLM Mini Project — Step-by-Step Checklist
---
## 📦 0. Setup Environment
- [ ] Create a new project folder
- [ ] Set up a virtual environment
- [ ] Install core dependencies:
  - [ ] torch
  - [ ] transformers
  - [ ] datasets
  - [ ] accelerate
  - [ ] peft (for LoRA later)
  - [ ] bitsandbytes (for quantization later)
- [ ] Confirm GPU is available (`torch.cuda.is_available()`)
---
## 🔍 1. Understand the Problem (don't skip this)
- [ ] Write down in your own words:
  - [ ] What is a language model?
  - [ ] What does “predict next token” actually mean?
- [ ] Manually inspect:
  - [ ] A sample sentence
  - [ ] Its tokenized form
- [ ] Verify:
  - [ ] Input tokens vs target tokens (shifted by 1)
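The input/target shift can be checked entirely by hand. A minimal sketch, using made-up token IDs:

```python
# Toy token IDs standing in for a tokenized sample sentence
# (made-up values for illustration).
token_ids = [464, 3290, 3332, 319, 262, 2603]

# For next-token prediction, position i of the input is paired with
# the token that follows it: targets are the inputs shifted by one.
input_ids = token_ids[:-1]
target_ids = token_ids[1:]

for inp, tgt in zip(input_ids, target_ids):
    print(f"given {inp} -> predict {tgt}")
```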
---
## 📚 2. Load Dataset
- [ ] Choose dataset:
  - [ ] Start with WikiText-2
- [ ] Load dataset using `datasets`
- [ ] Print:
  - [ ] A few raw samples
- [ ] Check:
  - [ ] Dataset size
  - [ ] Train/validation split
---
## 🔢 3. Tokenization
- [ ] Load GPT-2 tokenizer
- [ ] Tokenize dataset:
  - [ ] Apply truncation
  - [ ] Apply padding
- [ ] Verify:
  - [ ] Shape of tokenized output
  - [ ] Decode tokens back to text (sanity check)
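The real check uses the GPT-2 tokenizer; the shape of the sanity check can be sketched with a toy whitespace tokenizer (the vocab and `max_len` here are made up):

```python
# Toy stand-in for a real tokenizer. With GPT-2 the same round trip is
# tokenizer.decode(tokenizer.encode(text)) giving back the text.
vocab = {"the": 0, "cat": 1, "sat": 2, "<pad>": 3}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text, max_len=6):
    ids = [vocab[w] for w in text.split()]
    ids = ids[:max_len]                              # truncation
    ids += [vocab["<pad>"]] * (max_len - len(ids))   # padding
    return ids

def decode(ids):
    return " ".join(inv_vocab[i] for i in ids if inv_vocab[i] != "<pad>")

ids = encode("the cat sat")
print(ids)          # fixed length after truncation + padding
print(decode(ids))  # should recover the original text
```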
---
## 🧱 4. Prepare Training Data
- [ ] Convert dataset to PyTorch format
- [ ] Create DataLoader:
  - [ ] Set batch size (start small, e.g. 2–8)
- [ ] Confirm:
  - [ ] Batches load correctly
  - [ ] Tensor shapes are consistent
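What "shapes are consistent" means can be sketched without PyTorch: every batch should have the same `(batch_size, seq_len)` shape, except possibly the last (sizes here are made up):

```python
# 10 sequences of length 8, chunked into batches of 4.
dataset = [[i] * 8 for i in range(10)]
batch_size = 4

batches = [dataset[i:i + batch_size]
           for i in range(0, len(dataset), batch_size)]

# All full batches share the same shape; only the last may be smaller.
for batch in batches[:-1]:
    assert len(batch) == batch_size
    assert all(len(seq) == 8 for seq in batch)

print([len(b) for b in batches])  # -> [4, 4, 2]
```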
---
## 🤖 5. Load Model
- [ ] Load pretrained GPT-2 small
- [ ] Move model to GPU (if available)
- [ ] Print:
  - [ ] Model size (parameters)
- [ ] Run a single forward pass to confirm:
  - [ ] No errors
---
## 🔁 6. Build Training Loop (core understanding)
- [ ] Write your own training loop (no Trainer API yet)
- [ ] Include:
  - [ ] Forward pass
  - [ ] Loss calculation
  - [ ] Backpropagation
  - [ ] Optimizer step
- [ ] Print:
  - [ ] Loss every few steps
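The four steps above can be sketched on a trivial scalar model `y = w * x`, so the loop structure is visible without PyTorch. In the real loop the manual gradient becomes `loss.backward()` and the update becomes `optimizer.step()`; the data and learning rate here are made up:

```python
# Tiny dataset following y = 2x; training should drive w toward 2.0.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, lr = 0.0, 0.05

for step in range(100):
    x, y = data[step % len(data)]
    pred = w * x                  # forward pass
    loss = (pred - y) ** 2        # loss calculation
    grad = 2 * (pred - y) * x     # backpropagation (done by hand here)
    w -= lr * grad                # optimizer step
    if step % 20 == 0:
        print(f"step {step}: loss {loss:.4f}")

print(f"learned w = {w:.3f}")  # approaches 2.0 as loss decreases
```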
---
## 📉 7. Observe Training Behaviour
- [ ] Track:
  - [ ] Training loss over time
- [ ] Answer:
  - [ ] Is loss decreasing?
  - [ ] Is it noisy or stable?
- [ ] (Optional)
  - [ ] Plot loss curve
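Raw training loss is usually noisy; a running average makes the trend easier to judge. A sketch with made-up loss values:

```python
# Made-up noisy loss history that trends downward.
losses = [4.0, 3.8, 4.1, 3.5, 3.6, 3.1, 3.3, 2.9, 3.0, 2.7]

def running_mean(xs, window=3):
    # Average each value with its (window - 1) predecessors.
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

smoothed = running_mean(losses)
print(smoothed)  # jitters less than the raw values, clearer downward trend
```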
---
## 🧪 8. Evaluate Model
- [ ] Generate text from model:
  - [ ] Before training
  - [ ] After training
- [ ] Compare:
  - [ ] Coherence
  - [ ] Structure
- [ ] Note:
  - [ ] Any overfitting signs (repetition, memorization)
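One cheap, automatable repetition check: count how many n-grams in the generated text are repeats. A sketch with toy strings (the function name and threshold are made up):

```python
def repeated_ngram_fraction(text, n=3):
    # Fraction of n-grams that are duplicates of an earlier n-gram.
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)

looping = "the model said the model said the model said"
fresh = "the model produced a varied and coherent sentence"
print(repeated_ngram_fraction(looping))  # high: output is stuck in a loop
print(repeated_ngram_fraction(fresh))    # 0.0: every 3-gram is unique
```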
---
## ⚖️ 9. Try LoRA Fine-Tuning
- [ ] Add LoRA using `peft`
- [ ] Freeze base model weights
- [ ] Train only adapter layers
- [ ] Compare vs full fine-tuning:
  - [ ] Speed
  - [ ] Memory usage
  - [ ] Output quality
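Why LoRA is cheaper falls out of a little arithmetic: instead of updating a full `d_out × d_in` weight matrix, it trains two rank-`r` factors. A sketch for one GPT-2-small projection (d = 768), with a hypothetical `r = 8`:

```python
# One square projection matrix in GPT-2 small.
d_in = d_out = 768
r = 8  # hypothetical LoRA rank

full_params = d_in * d_out        # full fine-tuning updates all of these
lora_params = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)

print(full_params)                 # 589824
print(lora_params)                 # 12288
print(full_params // lora_params)  # 48x fewer trainable params per matrix
```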
---
## 🧠 10. Understand Convergence
- [ ] Identify:
  - [ ] When loss plateaus
- [ ] Check validation loss:
  - [ ] Does it increase? (overfitting)
- [ ] Write down:
  - [ ] What “good training” looks like
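"Loss plateaus" can be made concrete with a simple rule: has the loss improved by less than some threshold over the last k evaluations? A sketch with made-up loss histories (the function name and thresholds are invented for illustration):

```python
def has_plateaued(losses, k=3, min_improvement=0.01):
    # Plateau: almost no improvement across the last k evaluations.
    if len(losses) < k + 1:
        return False
    return losses[-k - 1] - losses[-1] < min_improvement

falling = [3.0, 2.5, 2.1, 1.8, 1.6]
flat = [3.0, 1.510, 1.508, 1.506, 1.505]
print(has_plateaued(falling))  # False: still clearly improving
print(has_plateaued(flat))     # True: effectively stuck
```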
---
## ⚙️ 11. Model Saving & Loading
- [ ] Save:
  - [ ] Model weights
  - [ ] Tokenizer
- [ ] Reload model
- [ ] Confirm:
  - [ ] Outputs remain consistent
---
# 🚀 PART 2 — Infrastructure & Serving
---
## 🧠 12. Understand Inference Flow
- [ ] Write down:
  - [ ] Steps from input → output
- [ ] Measure:
  - [ ] Time taken for a single generation
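Timing a single generation is just wall-clock measurement around the call. A sketch where `generate()` is a stand-in for the real `model.generate(...)`:

```python
import time

def generate():
    # Stand-in for model.generate(...); pretend it takes ~50 ms.
    time.sleep(0.05)
    return "some generated text"

start = time.perf_counter()
out = generate()
elapsed = time.perf_counter() - start
print(f"generation took {elapsed * 1000:.1f} ms")
```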
---
## ⚡ 13. Optimize Inference
- [ ] Test batching:
  - [ ] Multiple inputs at once
- [ ] Compare:
  - [ ] Latency vs throughput
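The latency/throughput tradeoff is visible with back-of-envelope arithmetic (all numbers here are made up): a batch of 8 takes longer than a batch of 1, but far less than 8× longer, so throughput rises while each request waits longer.

```python
single_latency = 0.10  # seconds per batch of 1 (hypothetical)
batch8_latency = 0.25  # seconds per batch of 8 (hypothetical)

throughput_single = 1 / single_latency  # requests served per second
throughput_batch = 8 / batch8_latency

print(throughput_single)  # 10.0 requests/s
print(throughput_batch)   # 32.0 requests/s
# ~3x throughput, at the cost of ~2.5x worse latency per request.
```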
---
## 🧮 14. Apply Quantization
- [ ] Load model in:
  - [ ] 8-bit
  - [ ] (Optional) 4-bit
- [ ] Compare:
  - [ ] Memory usage
  - [ ] Speed
  - [ ] Output quality
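The core idea of 8-bit quantization can be sketched on plain floats (this is a simplified symmetric scheme, not what `bitsandbytes` does internally): map values onto 256 integer levels, store the small ints, reconstruct approximately on the fly.

```python
# Made-up weight values for illustration.
weights = [0.3, -1.2, 0.07, 2.5, -0.9]

scale = max(abs(w) for w in weights) / 127       # symmetric int8 range
quantized = [round(w / scale) for w in weights]  # ints in [-127, 127]
dequantized = [q * scale for q in quantized]

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print(quantized)
print(f"max reconstruction error: {max_error:.4f}")
# Storage drops ~4x (int8 vs float32); values change only slightly.
```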
---
## 🖥️ 15. Simulate Real-World Usage
- [ ] Pretend you have:
  - [ ] Multiple users hitting your model
- [ ] Think through:
  - [ ] How would you queue requests?
  - [ ] When would you batch?
  - [ ] When would you scale?
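A minimal mental model of the queue-then-batch pattern (request names and `max_batch` are made up): requests accumulate in a queue, and the server drains them in batches of up to some maximum size.

```python
from collections import deque

# Ten pending requests; serve at most 4 per batch.
queue = deque(f"request-{i}" for i in range(10))
max_batch = 4

served_batches = []
while queue:
    take = min(max_batch, len(queue))
    batch = [queue.popleft() for _ in range(take)]
    served_batches.append(batch)

print([len(b) for b in served_batches])  # -> [4, 4, 2]
```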
---
## ☁️ 16. Understand Infra Concepts
- [ ] Research:
  - [ ] GPU provisioning
  - [ ] Autoscaling
  - [ ] Model warm starts
- [ ] Understand:
  - [ ] Why loading time matters
  - [ ] Why GPUs shouldn't sit idle
---
## 🧬 17. (Bonus) DICOM Exploration
- [ ] Learn:
  - [ ] What DICOM files are
- [ ] Think:
  - [ ] How LLMs could be used with medical data
- [ ] Note:
  - [ ] Privacy + domain challenges
---
## ✍️ 18. Write Your Blog
### Structure
- [ ] Introduction:
  - [ ] What is an LLM really?
- [ ] Training:
  - [ ] Tokenization
  - [ ] Training loop
  - [ ] Loss behaviour
- [ ] Fine-tuning:
  - [ ] Full vs LoRA
- [ ] Challenges:
  - [ ] What went wrong
- [ ] Infrastructure:
  - [ ] Serving challenges
  - [ ] Batching
  - [ ] Quantization
- [ ] Key Learnings:
  - [ ] What surprised you
  - [ ] What actually matters
---
## ✅ Final Deliverables
- [ ] Working training script
- [ ] LoRA vs full fine-tune comparison
- [ ] Basic inference script
- [ ] Blog post (clear + honest)
- [ ] Notes showing your understanding
---
## ⚠️ Keep Yourself Honest
- [ ] Can you explain the training loop without looking?
- [ ] Do you understand why loss decreases?
- [ ] Can you explain batching vs latency tradeoffs?
- [ ] Do you know what would break at scale?