# 🧠 LLM Mini Project — Step-by-Step Checklist
---
## 📦 0. Setup Environment
- [ ] Create a new project folder
- [ ] Set up a virtual environment
- [ ] Install core dependencies:
  - [ ] torch
  - [ ] transformers
  - [ ] datasets
  - [ ] accelerate
  - [ ] peft (for LoRA later)
  - [ ] bitsandbytes (for quantization later)
- [ ] Confirm a GPU is available (`torch.cuda.is_available()`)
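
The GPU check above can be run as a short sanity script (assuming PyTorch installed successfully):

```python
import torch

# Report whether a CUDA-capable GPU is visible to PyTorch.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```

If this prints `cpu` on a machine that has a GPU, the installed torch build is likely CPU-only.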
---

## 🔍 1. Understand the Problem (don’t skip this)

- [ ] Write down in your own words:
  - [ ] What is a language model?
  - [ ] What does “predict next token” actually mean?
- [ ] Manually inspect:
  - [ ] A sample sentence
  - [ ] Its tokenized form
- [ ] Verify:
  - [ ] Input tokens vs target tokens (shifted by one position)
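
The input-vs-target shift can be verified in plain Python, no model required (the word-level vocabulary here is just a toy stand-in for a real tokenizer):

```python
# Toy "tokenization": one integer id per distinct word.
sentence = "the cat sat on the mat"
vocab = {w: i for i, w in enumerate(dict.fromkeys(sentence.split()))}
tokens = [vocab[w] for w in sentence.split()]

# Causal LM training pairs: input is tokens[:-1], target is tokens shifted by one.
inputs, targets = tokens[:-1], tokens[1:]
for x, y in zip(inputs, targets):
    print(f"given {x} -> predict {y}")
```

At every position the model sees the tokens so far and is scored on the very next one; that is the whole training objective.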
---

## 📚 2. Load Dataset

- [ ] Choose a dataset:
  - [ ] Start with WikiText-2
- [ ] Load it with the `datasets` library
- [ ] Print:
  - [ ] A few raw samples
- [ ] Check:
  - [ ] Dataset size
  - [ ] Train/validation split
---

## 🔢 3. Tokenization

- [ ] Load the GPT-2 tokenizer
- [ ] Tokenize the dataset:
  - [ ] Apply truncation
  - [ ] Apply padding
- [ ] Verify:
  - [ ] Shape of the tokenized output
  - [ ] Decode tokens back to text (sanity check)
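
A sketch of the tokenize-then-decode round trip (the sample sentences are arbitrary; note that GPT-2 has no pad token, so reusing the EOS token is a common workaround):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 defines no pad token by default

batch = tok(
    ["Hello world!", "A much longer second sentence that needs more tokens."],
    truncation=True, padding=True, max_length=16, return_tensors="pt",
)
print(batch["input_ids"].shape)  # (2, padded_length)

# Sanity check: decoding should round-trip back to the original text.
decoded = tok.decode(batch["input_ids"][0], skip_special_tokens=True)
print(decoded)
```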
---

## 🧱 4. Prepare Training Data

- [ ] Convert the dataset to PyTorch format
- [ ] Create a DataLoader:
  - [ ] Set the batch size (start small: 2–8)
- [ ] Confirm:
  - [ ] Batches load correctly
  - [ ] Tensor shapes are consistent
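
The DataLoader check can be done before any real data is ready; here random ids stand in for the tokenized corpus:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a tokenized corpus: 100 sequences of 32 token ids each.
input_ids = torch.randint(0, 50257, (100, 32))
loader = DataLoader(TensorDataset(input_ids), batch_size=4, shuffle=True)

batch, = next(iter(loader))
print(batch.shape)  # torch.Size([4, 32])
```

If the shapes here are inconsistent, fix the data pipeline before touching the model.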
---

## 🤖 5. Load Model

- [ ] Load pretrained GPT-2 small
- [ ] Move the model to the GPU (if available)
- [ ] Print:
  - [ ] Model size (parameter count)
- [ ] Run a single forward pass to confirm:
  - [ ] No errors
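
All three checks fit in a few lines (the dummy input ids are random, which is fine for an errors-only smoke test; the first run downloads the weights):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # GPT-2 small
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # GPT-2 small is roughly 124M

# Single forward pass on dummy token ids to confirm nothing errors.
input_ids = torch.randint(0, model.config.vocab_size, (1, 16), device=device)
with torch.no_grad():
    out = model(input_ids, labels=input_ids)
print(out.loss.item(), tuple(out.logits.shape))
```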
---

## 🔁 6. Build Training Loop (core understanding)

- [ ] Write your own training loop (no Trainer API yet)
- [ ] Include:
  - [ ] Forward pass
  - [ ] Loss calculation
  - [ ] Backpropagation
  - [ ] Optimizer step
- [ ] Print:
  - [ ] Loss every few steps
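
A skeleton of the loop, using a tiny stand-in model so it runs in seconds; the four steps are identical when you swap in GPT-2 and the real DataLoader:

```python
import torch
import torch.nn as nn

# Tiny stand-in LM (embedding + linear head) so the loop runs instantly.
vocab, d = 100, 32
model = nn.Sequential(nn.Embedding(vocab, d), nn.Linear(d, vocab))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = torch.randint(0, vocab, (8, 16))  # one batch of token ids

for step in range(20):
    logits = model(data[:, :-1])                       # forward pass
    loss = nn.functional.cross_entropy(                # next-token loss
        logits.reshape(-1, vocab), data[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()                                    # backpropagation
    opt.step()                                         # optimizer step
    if step % 5 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```

On a single repeated batch the loss should fall quickly from about ln(100) ≈ 4.6; that is memorization, which is exactly why a validation split matters later.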
---

## 📉 7. Observe Training Behaviour

- [ ] Track:
  - [ ] Training loss over time
- [ ] Answer:
  - [ ] Is the loss decreasing?
  - [ ] Is it noisy or stable?
- [ ] (Optional)
  - [ ] Plot the loss curve
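
Raw per-step losses are noisy; a moving average makes the trend easier to judge. A pure-Python sketch with made-up loss values:

```python
# Illustrative per-step losses (replace with the values you record in training).
losses = [4.6, 4.9, 4.2, 4.4, 3.9, 4.1, 3.6, 3.8, 3.3, 3.1]

def moving_average(xs, k=4):
    out = []
    for i in range(len(xs)):
        window = xs[max(0, i - k + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

smoothed = moving_average(losses)
print([round(x, 2) for x in smoothed])
# Decreasing smoothed curve = learning; flat = plateau; rising = divergence.
```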
---

## 🧪 8. Evaluate Model

- [ ] Generate text from the model:
  - [ ] Before training
  - [ ] After training
- [ ] Compare:
  - [ ] Coherence
  - [ ] Structure
- [ ] Note:
  - [ ] Any signs of overfitting (repetition, memorization)
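
The before/after comparison is easiest with one shared generation helper. Greedy decoding keeps runs deterministic and therefore comparable; the prompt below is arbitrary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate(prompt: str) -> str:
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs, max_new_tokens=30,
        do_sample=False,                # greedy: deterministic, comparable runs
        pad_token_id=tok.eos_token_id,  # silence the missing-pad-token warning
    )
    return tok.decode(out[0], skip_special_tokens=True)

text = generate("The history of science")
print(text)  # call once before training and once after, then compare
```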
---

## ⚖️ 9. Try LoRA Fine-Tuning

- [ ] Add LoRA using `peft`
- [ ] Freeze the base model weights
- [ ] Train only the adapter layers
- [ ] Compare against full fine-tuning:
  - [ ] Speed
  - [ ] Memory usage
  - [ ] Output quality
---

## 🧠 10. Understand Convergence

- [ ] Identify:
  - [ ] When the loss plateaus
- [ ] Check validation loss:
  - [ ] Does it increase? (overfitting)
- [ ] Write down:
  - [ ] What “good training” looks like
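
The overfitting check above boils down to one comparison on recorded validation losses. A pure-Python sketch with made-up per-epoch values:

```python
# Illustrative per-epoch validation losses.
val_losses = [3.9, 3.5, 3.3, 3.25, 3.24, 3.3, 3.4]

best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
overfitting = val_losses[-1] > val_losses[best_epoch]
print(f"best epoch: {best_epoch}, overfitting after it: {overfitting}")
```

If validation loss keeps rising past the best epoch while training loss keeps falling, stop early and keep the checkpoint from the best epoch.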
---

## ⚙️ 11. Model Saving & Loading

- [ ] Save:
  - [ ] Model weights
  - [ ] Tokenizer
- [ ] Reload the model
- [ ] Confirm:
  - [ ] Outputs remain consistent
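
The consistency check is the important part. Shown here with a plain torch module to keep it fast; for Hugging Face models the same pattern applies with `save_pretrained` / `from_pretrained`:

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
x = torch.randn(3, 4)
before = model(x)

path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(model.state_dict(), path)          # save weights

reloaded = nn.Linear(4, 2)
reloaded.load_state_dict(torch.load(path))    # reload into a fresh module
after = reloaded(x)

print(torch.allclose(before, after))  # True: outputs are consistent
```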
---

# 🚀 PART 2 — Infrastructure & Serving

---
## 🧠 12. Understand Inference Flow

- [ ] Write down:
  - [ ] The steps from input → output
- [ ] Measure:
  - [ ] Time taken for a single generation
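
A reusable timing helper for the measurement step; the workload below is a trivial stand-in, to be replaced with your actual generation call:

```python
import time

def time_call(fn, n_runs=5):
    """Average wall-clock time of a call over several runs."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Stand-in workload; replace with e.g. lambda: model.generate(**inputs)
avg = time_call(lambda: sum(range(100_000)))
print(f"avg latency: {avg * 1000:.2f} ms")
```

Averaging over several runs matters because the first call often pays one-off costs (caching, CUDA kernel compilation).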
---

## ⚡ 13. Optimize Inference

- [ ] Test batching:
  - [ ] Multiple inputs at once
- [ ] Compare:
  - [ ] Latency vs throughput
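
The tradeoff can be felt even without a full model: a batched matrix multiply stands in for a batched forward pass. Per-call latency typically grows with batch size while per-item cost shrinks:

```python
import time
import torch

weight = torch.randn(256, 256)

def run(batch_size, n_iters=50):
    x = torch.randn(batch_size, 256)
    start = time.perf_counter()
    for _ in range(n_iters):
        _ = x @ weight
    latency = (time.perf_counter() - start) / n_iters
    return latency, latency / batch_size  # per-call latency, per-item cost

for bs in (1, 8, 32):
    latency, per_item = run(bs)
    print(f"batch {bs:2d}: latency {latency * 1e6:8.1f} us, per-item {per_item * 1e6:8.1f} us")
```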
---

## 🧮 14. Apply Quantization

- [ ] Load the model in:
  - [ ] 8-bit
  - [ ] (Optional) 4-bit
- [ ] Compare:
  - [ ] Memory usage
  - [ ] Speed
  - [ ] Output quality
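
A sketch of the 8-bit path via `BitsAndBytesConfig`. Actually loading a quantized model requires a CUDA GPU and the `bitsandbytes` package, so only the config is built here and the load call is left commented:

```python
from transformers import BitsAndBytesConfig

cfg = BitsAndBytesConfig(load_in_8bit=True)
print(cfg.load_in_8bit)  # True

# On a machine with a CUDA GPU and bitsandbytes installed:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "gpt2", quantization_config=cfg, device_map="auto")
```

For the 4-bit variant, `BitsAndBytesConfig(load_in_4bit=True)` is the analogous starting point.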
---

## 🖥️ 15. Simulate Real-World Usage

- [ ] Pretend you have:
  - [ ] Multiple users hitting your model
- [ ] Think through:
  - [ ] How would you queue requests?
  - [ ] When would you batch?
  - [ ] When would you scale?
---

## ☁️ 16. Understand Infra Concepts

- [ ] Research:
  - [ ] GPU provisioning
  - [ ] Autoscaling
  - [ ] Model warm starts
- [ ] Understand:
  - [ ] Why loading time matters
  - [ ] Why GPUs shouldn’t sit idle
---

## 🧬 17. (Bonus) DICOM Exploration

- [ ] Learn:
  - [ ] What DICOM files are
- [ ] Think:
  - [ ] How LLMs could be used with medical data
- [ ] Note:
  - [ ] Privacy and domain challenges
---

## ✍️ 18. Write Your Blog

### Structure

- [ ] Introduction:
  - [ ] What is an LLM, really?
- [ ] Training:
  - [ ] Tokenization
  - [ ] Training loop
  - [ ] Loss behaviour
- [ ] Fine-tuning:
  - [ ] Full vs LoRA
- [ ] Challenges:
  - [ ] What went wrong
- [ ] Infrastructure:
  - [ ] Serving challenges
  - [ ] Batching
  - [ ] Quantization
- [ ] Key learnings:
  - [ ] What surprised you
  - [ ] What actually matters
---

## ✅ Final Deliverables

- [ ] Working training script
- [ ] LoRA vs full fine-tune comparison
- [ ] Basic inference script
- [ ] Blog post (clear and honest)
- [ ] Notes showing your understanding
---

## ⚠️ Keep Yourself Honest

- [ ] Can you explain the training loop without looking?
- [ ] Do you understand why the loss decreases?
- [ ] Can you explain the batching vs latency tradeoff?
- [ ] Do you know what would break at scale?