# 🧠 LLM Mini Project — Step-by-Step Checklist
---
## 📦 0. Setup Environment
- [ ] Create a new project folder
- [ ] Set up a virtual environment
- [ ] Install core dependencies:
  - [ ] torch
  - [ ] transformers
  - [ ] datasets
  - [ ] accelerate
  - [ ] peft (for LoRA later)
  - [ ] bitsandbytes (for quantization later)
- [ ] Confirm GPU is available (`torch.cuda.is_available()`)
---
## 🔍 1. Understand the Problem (don't skip this)
- [ ] Write down in your own words:
  - [ ] What is a language model?
  - [ ] What does “predict next token” actually mean?
- [ ] Manually inspect:
  - [ ] A sample sentence
  - [ ] Its tokenized form
- [ ] Verify:
  - [ ] Input tokens vs target tokens (shifted by 1)
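The input/target shift can be checked entirely by hand. A minimal sketch, using made-up token IDs:

```python
# Toy token IDs standing in for a tokenized sample sentence
# (made-up values for illustration).
token_ids = [464, 3290, 3332, 319, 262, 2603]

# For next-token prediction, position i of the input is paired with
# the token that follows it: targets are the inputs shifted by one.
input_ids = token_ids[:-1]
target_ids = token_ids[1:]

for inp, tgt in zip(input_ids, target_ids):
    print(f"given {inp} -> predict {tgt}")
```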
---
## 📚 2. Load Dataset
- [ ] Choose dataset:
  - [ ] Start with WikiText-2
- [ ] Load dataset using `datasets`
- [ ] Print:
  - [ ] A few raw samples
- [ ] Check:
  - [ ] Dataset size
  - [ ] Train/validation split
---
## 🔢 3. Tokenization
- [ ] Load GPT-2 tokenizer
- [ ] Tokenize dataset:
  - [ ] Apply truncation
  - [ ] Apply padding
- [ ] Verify:
  - [ ] Shape of tokenized output
  - [ ] Decode tokens back to text (sanity check)
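The real check uses the GPT-2 tokenizer; the shape of the sanity check can be sketched with a toy whitespace tokenizer (the vocab and `max_len` here are made up):

```python
# Toy stand-in for a real tokenizer. With GPT-2 the same round trip is
# tokenizer.decode(tokenizer.encode(text)) giving back the text.
vocab = {"the": 0, "cat": 1, "sat": 2, "<pad>": 3}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text, max_len=6):
    ids = [vocab[w] for w in text.split()]
    ids = ids[:max_len]                              # truncation
    ids += [vocab["<pad>"]] * (max_len - len(ids))   # padding
    return ids

def decode(ids):
    return " ".join(inv_vocab[i] for i in ids if inv_vocab[i] != "<pad>")

ids = encode("the cat sat")
print(ids)          # fixed length after truncation + padding
print(decode(ids))  # should recover the original text
```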
---
## 🧱 4. Prepare Training Data
- [ ] Convert dataset to PyTorch format
- [ ] Create DataLoader:
  - [ ] Set batch size (start small, e.g. 2–8)
- [ ] Confirm:
  - [ ] Batches load correctly
  - [ ] Tensor shapes are consistent
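What "shapes are consistent" means can be sketched without PyTorch: every batch should have the same `(batch_size, seq_len)` shape, except possibly the last (sizes here are made up):

```python
# 10 sequences of length 8, chunked into batches of 4.
dataset = [[i] * 8 for i in range(10)]
batch_size = 4

batches = [dataset[i:i + batch_size]
           for i in range(0, len(dataset), batch_size)]

# All full batches share the same shape; only the last may be smaller.
for batch in batches[:-1]:
    assert len(batch) == batch_size
    assert all(len(seq) == 8 for seq in batch)

print([len(b) for b in batches])  # -> [4, 4, 2]
```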
---
## 🤖 5. Load Model
- [ ] Load pretrained GPT-2 small
- [ ] Move model to GPU (if available)
- [ ] Print:
  - [ ] Model size (parameters)
- [ ] Run a single forward pass to confirm:
  - [ ] No errors
---
## 🔁 6. Build Training Loop (core understanding)
- [ ] Write your own training loop (no Trainer API yet)
- [ ] Include:
  - [ ] Forward pass
  - [ ] Loss calculation
  - [ ] Backpropagation
  - [ ] Optimizer step
- [ ] Print:
  - [ ] Loss every few steps
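The four steps above can be sketched on a trivial scalar model `y = w * x`, so the loop structure is visible without PyTorch. In the real loop the manual gradient becomes `loss.backward()` and the update becomes `optimizer.step()`; the data and learning rate here are made up:

```python
# Tiny dataset following y = 2x; training should drive w toward 2.0.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, lr = 0.0, 0.05

for step in range(100):
    x, y = data[step % len(data)]
    pred = w * x                  # forward pass
    loss = (pred - y) ** 2        # loss calculation
    grad = 2 * (pred - y) * x     # backpropagation (done by hand here)
    w -= lr * grad                # optimizer step
    if step % 20 == 0:
        print(f"step {step}: loss {loss:.4f}")

print(f"learned w = {w:.3f}")  # approaches 2.0 as loss decreases
```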
---
## 📉 7. Observe Training Behaviour
- [ ] Track:
  - [ ] Training loss over time
- [ ] Answer:
  - [ ] Is loss decreasing?
  - [ ] Is it noisy or stable?
- [ ] (Optional)
  - [ ] Plot loss curve
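Raw training loss is usually noisy; a running average makes the trend easier to judge. A sketch with made-up loss values:

```python
# Made-up noisy loss history that trends downward.
losses = [4.0, 3.8, 4.1, 3.5, 3.6, 3.1, 3.3, 2.9, 3.0, 2.7]

def running_mean(xs, window=3):
    # Average each value with its (window - 1) predecessors.
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

smoothed = running_mean(losses)
print(smoothed)  # jitters less than the raw values, clearer downward trend
```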
---
## 🧪 8. Evaluate Model
- [ ] Generate text from model:
  - [ ] Before training
  - [ ] After training
- [ ] Compare:
  - [ ] Coherence
  - [ ] Structure
- [ ] Note:
  - [ ] Any overfitting signs (repetition, memorization)
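One cheap, automatable repetition check: count how many n-grams in the generated text are repeats. A sketch with toy strings (the function name and threshold are made up):

```python
def repeated_ngram_fraction(text, n=3):
    # Fraction of n-grams that are duplicates of an earlier n-gram.
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)

looping = "the model said the model said the model said"
fresh = "the model produced a varied and coherent sentence"
print(repeated_ngram_fraction(looping))  # high: output is stuck in a loop
print(repeated_ngram_fraction(fresh))    # 0.0: every 3-gram is unique
```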
---
## ⚖️ 9. Try LoRA Fine-Tuning
- [ ] Add LoRA using `peft`
- [ ] Freeze base model weights
- [ ] Train only adapter layers
- [ ] Compare vs full fine-tuning:
  - [ ] Speed
  - [ ] Memory usage
  - [ ] Output quality
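Why LoRA is cheaper falls out of a little arithmetic: instead of updating a full `d_out × d_in` weight matrix, it trains two rank-`r` factors. A sketch for one GPT-2-small projection (d = 768), with a hypothetical `r = 8`:

```python
# One square projection matrix in GPT-2 small.
d_in = d_out = 768
r = 8  # hypothetical LoRA rank

full_params = d_in * d_out        # full fine-tuning updates all of these
lora_params = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)

print(full_params)                 # 589824
print(lora_params)                 # 12288
print(full_params // lora_params)  # 48x fewer trainable params per matrix
```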
---
## 🧠 10. Understand Convergence
- [ ] Identify:
  - [ ] When loss plateaus
- [ ] Check validation loss:
  - [ ] Does it increase? (overfitting)
- [ ] Write down:
  - [ ] What “good training” looks like
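"Loss plateaus" can be made concrete with a simple rule: has the loss improved by less than some threshold over the last k evaluations? A sketch with made-up loss histories (the function name and thresholds are invented for illustration):

```python
def has_plateaued(losses, k=3, min_improvement=0.01):
    # Plateau: almost no improvement across the last k evaluations.
    if len(losses) < k + 1:
        return False
    return losses[-k - 1] - losses[-1] < min_improvement

falling = [3.0, 2.5, 2.1, 1.8, 1.6]
flat = [3.0, 1.510, 1.508, 1.506, 1.505]
print(has_plateaued(falling))  # False: still clearly improving
print(has_plateaued(flat))     # True: effectively stuck
```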
---
## ⚙️ 11. Model Saving & Loading
- [ ] Save:
  - [ ] Model weights
  - [ ] Tokenizer
- [ ] Reload model
- [ ] Confirm:
  - [ ] Outputs remain consistent
---
# 🚀 PART 2 — Infrastructure & Serving
---
## 🧠 12. Understand Inference Flow
- [ ] Write down:
  - [ ] Steps from input → output
- [ ] Measure:
  - [ ] Time taken for a single generation
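Timing a single generation is just wall-clock measurement around the call. A sketch where `generate()` is a stand-in for the real `model.generate(...)`:

```python
import time

def generate():
    # Stand-in for model.generate(...); pretend it takes ~50 ms.
    time.sleep(0.05)
    return "some generated text"

start = time.perf_counter()
out = generate()
elapsed = time.perf_counter() - start
print(f"generation took {elapsed * 1000:.1f} ms")
```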
---
## ⚡ 13. Optimize Inference
- [ ] Test batching:
  - [ ] Multiple inputs at once
- [ ] Compare:
  - [ ] Latency vs throughput
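The latency/throughput tradeoff is visible with back-of-envelope arithmetic (all numbers here are made up): a batch of 8 takes longer than a batch of 1, but far less than 8× longer, so throughput rises while each request waits longer.

```python
single_latency = 0.10  # seconds per batch of 1 (hypothetical)
batch8_latency = 0.25  # seconds per batch of 8 (hypothetical)

throughput_single = 1 / single_latency  # requests served per second
throughput_batch = 8 / batch8_latency

print(throughput_single)  # 10.0 requests/s
print(throughput_batch)   # 32.0 requests/s
# ~3x throughput, at the cost of ~2.5x worse latency per request.
```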
---
## 🧮 14. Apply Quantization
- [ ] Load model in:
  - [ ] 8-bit
  - [ ] (Optional) 4-bit
- [ ] Compare:
  - [ ] Memory usage
  - [ ] Speed
  - [ ] Output quality
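The core idea of 8-bit quantization can be sketched on plain floats (this is a simplified symmetric scheme, not what `bitsandbytes` does internally): map values onto 256 integer levels, store the small ints, reconstruct approximately on the fly.

```python
# Made-up weight values for illustration.
weights = [0.3, -1.2, 0.07, 2.5, -0.9]

scale = max(abs(w) for w in weights) / 127       # symmetric int8 range
quantized = [round(w / scale) for w in weights]  # ints in [-127, 127]
dequantized = [q * scale for q in quantized]

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print(quantized)
print(f"max reconstruction error: {max_error:.4f}")
# Storage drops ~4x (int8 vs float32); values change only slightly.
```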
---
## 🖥️ 15. Simulate Real-World Usage
- [ ] Pretend you have:
  - [ ] Multiple users hitting your model
- [ ] Think through:
  - [ ] How would you queue requests?
  - [ ] When would you batch?
  - [ ] When would you scale?
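A minimal mental model of the queue-then-batch pattern (request names and `max_batch` are made up): requests accumulate in a queue, and the server drains them in batches of up to some maximum size.

```python
from collections import deque

# Ten pending requests; serve at most 4 per batch.
queue = deque(f"request-{i}" for i in range(10))
max_batch = 4

served_batches = []
while queue:
    take = min(max_batch, len(queue))
    batch = [queue.popleft() for _ in range(take)]
    served_batches.append(batch)

print([len(b) for b in served_batches])  # -> [4, 4, 2]
```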
---
## ☁️ 16. Understand Infra Concepts
- [ ] Research:
  - [ ] GPU provisioning
  - [ ] Autoscaling
  - [ ] Model warm starts
- [ ] Understand:
  - [ ] Why loading time matters
  - [ ] Why GPUs shouldn't sit idle
---
## 🧬 17. (Bonus) DICOM Exploration
- [ ] Learn:
  - [ ] What DICOM files are
- [ ] Think:
  - [ ] How LLMs could be used with medical data
- [ ] Note:
  - [ ] Privacy + domain challenges
---
## ✍️ 18. Write Your Blog
### Structure
- [ ] Introduction:
  - [ ] What is an LLM really?
- [ ] Training:
  - [ ] Tokenization
  - [ ] Training loop
  - [ ] Loss behaviour
- [ ] Fine-tuning:
  - [ ] Full vs LoRA
- [ ] Challenges:
  - [ ] What went wrong
- [ ] Infrastructure:
  - [ ] Serving challenges
  - [ ] Batching
  - [ ] Quantization
- [ ] Key Learnings:
  - [ ] What surprised you
  - [ ] What actually matters
---
## ✅ Final Deliverables
- [ ] Working training script
- [ ] LoRA vs full fine-tune comparison
- [ ] Basic inference script
- [ ] Blog post (clear + honest)
- [ ] Notes showing your understanding
---
## ⚠️ Keep Yourself Honest
- [ ] Can you explain the training loop without looking?
- [ ] Do you understand why loss decreases?
- [ ] Can you explain batching vs latency tradeoffs?
- [ ] Do you know what would break at scale?