Created a basic model and ran it for 10 epochs. Used the GPT-2 tokenizer and the base GPT-2 model to tokenize the dataset and retrain the model

Signed-off-by: rodude123 <rodude123@gmail.com>
This commit is contained in:
Rohit Pai
2026-04-12 11:25:38 +01:00
committed by rodude123
parent 703bece135
commit 9315dc352b
4 changed files with 1789 additions and 6 deletions
# 🧠 LLM Mini Project — Step-by-Step Checklist
---
## 📦 0. Setup Environment
- [ ] Create a new project folder
- [ ] Set up a virtual environment
- [ ] Install core dependencies:
- [ ] torch
- [ ] transformers
- [ ] datasets
- [ ] accelerate
- [ ] peft (for LoRA later)
- [ ] bitsandbytes (for quantization later)
- [ ] Confirm GPU is available (`torch.cuda.is_available()`)
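The GPU check can be done in a couple of lines, falling back to CPU if no CUDA device is visible:

```python
import torch

# Report which device PyTorch will use; falls back to CPU without a CUDA GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(torch.cuda.get_device_name(0))
```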
---
## 🔍 1. Understand the Problem (don't skip this)
- [ ] Write down in your own words:
- [ ] What is a language model?
- [ ] What does “predict next token” actually mean?
- [ ] Manually inspect:
- [ ] A sample sentence
- [ ] Its tokenized form
- [ ] Verify:
- [ ] Input tokens vs target tokens (shifted by 1)
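The one-token shift is easiest to see on a hand-made list of token ids (the values here are purely illustrative):

```python
# A causal LM is trained so that each position predicts the NEXT token.
# Targets are therefore the input ids shifted left by one position.
token_ids = [464, 3290, 3332, 319, 262, 2603, 13]  # illustrative ids

inputs = token_ids[:-1]   # what the model sees
targets = token_ids[1:]   # what it must predict
for i, t in zip(inputs, targets):
    print(f"input {i} -> target {t}")
```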
---
## 📚 2. Load Dataset
- [ ] Choose dataset:
- [ ] Start with WikiText-2
- [ ] Load dataset using `datasets`
- [ ] Print:
- [ ] A few raw samples
- [ ] Check:
- [ ] Dataset size
- [ ] Train/validation split
---
## 🔢 3. Tokenization
- [ ] Load GPT-2 tokenizer
- [ ] Tokenize dataset:
- [ ] Apply truncation
- [ ] Apply padding
- [ ] Verify:
- [ ] Shape of tokenized output
- [ ] Decode tokens back to text (sanity check)
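A tokenization sanity check along these lines works; note GPT-2 ships without a pad token, so the EOS token is commonly reused for padding:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default

text = "The quick brown fox jumps over the lazy dog."
enc = tok(text, truncation=True, padding="max_length", max_length=16,
          return_tensors="pt")
print(enc["input_ids"].shape)  # expect (1, 16)
# Decode back to text as a round-trip sanity check.
print(tok.decode(enc["input_ids"][0], skip_special_tokens=True))
```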
---
## 🧱 4. Prepare Training Data
- [ ] Convert dataset to PyTorch format
- [ ] Create DataLoader:
    - [ ] Set batch size (start small, e.g. 8)
- [ ] Confirm:
- [ ] Batches load correctly
- [ ] Tensor shapes are consistent
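The DataLoader step can be verified with stand-in data before wiring up the real tokenized dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for tokenized data: 100 sequences of 32 ids (GPT-2 vocab size 50257).
input_ids = torch.randint(0, 50257, (100, 32))
loader = DataLoader(TensorDataset(input_ids), batch_size=8, shuffle=True)

(batch,) = next(iter(loader))
print(batch.shape)  # consistent (batch_size, seq_len) shapes per batch
```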
---
## 🤖 5. Load Model
- [ ] Load pretrained GPT-2 small
- [ ] Move model to GPU (if available)
- [ ] Print:
- [ ] Model size (parameters)
- [ ] Run a single forward pass to confirm:
- [ ] No errors
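Loading GPT-2 small, counting parameters, and running a single forward pass might look like this; passing `labels` makes the model compute its own LM loss:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
tok = AutoTokenizer.from_pretrained("gpt2")

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # GPT-2 small is ~124M

ids = tok("Hello world", return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    out = model(ids, labels=ids)  # labels=input_ids yields the LM loss directly
print(out.logits.shape, out.loss.item())
```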
---
## 🔁 6. Build Training Loop (core understanding)
- [ ] Write your own training loop (no Trainer API yet)
- [ ] Include:
- [ ] Forward pass
- [ ] Loss calculation
- [ ] Backpropagation
- [ ] Optimizer step
- [ ] Print:
- [ ] Loss every few steps
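The steps above can be sketched as one function. This assumes a Hugging-Face-style model that returns `.loss` when called with `labels`, and a loader yielding batches of input ids:

```python
import torch

def train(model, loader, optimizer, device, epochs=1, log_every=50):
    """Minimal hand-written training loop for a causal LM (no Trainer API)."""
    model.train()
    step = 0
    for _ in range(epochs):
        for (input_ids,) in loader:
            input_ids = input_ids.to(device)
            out = model(input_ids=input_ids, labels=input_ids)  # forward pass + loss
            out.loss.backward()       # backpropagation
            optimizer.step()          # optimizer step
            optimizer.zero_grad()
            if step % log_every == 0:
                print(f"step {step}: loss {out.loss.item():.4f}")
            step += 1
```

With GPT-2 this would be called as `train(model, loader, torch.optim.AdamW(model.parameters(), lr=5e-5), device)`.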
---
## 📉 7. Observe Training Behaviour
- [ ] Track:
- [ ] Training loss over time
- [ ] Answer:
- [ ] Is loss decreasing?
- [ ] Is it noisy or stable?
- [ ] (Optional)
- [ ] Plot loss curve
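The optional loss curve can be plotted with matplotlib; the values below are illustrative placeholders for losses logged during training:

```python
import matplotlib

matplotlib.use("Agg")  # render to file; no display needed
import matplotlib.pyplot as plt

losses = [4.1, 3.6, 3.3, 3.1, 3.0, 2.95, 2.92, 2.90]  # placeholder values
plt.plot(losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.title("Training loss")
plt.savefig("loss_curve.png")
```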
---
## 🧪 8. Evaluate Model
- [ ] Generate text from model:
- [ ] Before training
- [ ] After training
- [ ] Compare:
- [ ] Coherence
- [ ] Structure
- [ ] Note:
- [ ] Any overfitting signs (repetition, memorization)
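Generating a sample takes only a few lines; running it on the untouched pretrained model and again on the fine-tuned one gives the before/after comparison:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # swap in the fine-tuned model after training

prompt = "The history of science"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```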
---
## ⚖️ 9. Try LoRA Fine-Tuning
- [ ] Add LoRA using `peft`
- [ ] Freeze base model weights
- [ ] Train only adapter layers
- [ ] Compare vs full fine-tuning:
- [ ] Speed
- [ ] Memory usage
- [ ] Output quality
---
## 🧠 10. Understand Convergence
- [ ] Identify:
- [ ] When loss plateaus
- [ ] Check validation loss:
- [ ] Does it increase? (overfitting)
- [ ] Write down:
- [ ] What “good training” looks like
---
## ⚙️ 11. Model Saving & Loading
- [ ] Save:
- [ ] Model weights
- [ ] Tokenizer
- [ ] Reload model
- [ ] Confirm:
- [ ] Outputs remain consistent
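Save, reload, and consistency check in one pass (the output directory name here is hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

out_dir = "gpt2-finetuned"  # hypothetical output directory
model.save_pretrained(out_dir)
tok.save_pretrained(out_dir)

reloaded = AutoModelForCausalLM.from_pretrained(out_dir)
ids = tok("Sanity check", return_tensors="pt").input_ids
with torch.no_grad():
    same = torch.allclose(model(ids).logits, reloaded(ids).logits, atol=1e-5)
print("outputs consistent:", same)
```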
---
# 🚀 PART 2 — Infrastructure & Serving
---
## 🧠 12. Understand Inference Flow
- [ ] Write down:
- [ ] Steps from input → output
- [ ] Measure:
- [ ] Time taken for a single generation
---
## ⚡ 13. Optimize Inference
- [ ] Test batching:
- [ ] Multiple inputs at once
- [ ] Compare:
- [ ] Latency vs throughput
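One way to see the tradeoff is to time one prompt at a time versus the same prompts in a single batch (greedy decoding and left-padding keep the comparison simple; the prompts are arbitrary examples):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
tok.padding_side = "left"  # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The weather today", "Once upon a time", "In machine learning"]

# One at a time: this is per-request latency.
start = time.perf_counter()
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    model.generate(ids, max_new_tokens=20, do_sample=False,
                   pad_token_id=tok.eos_token_id)
sequential = time.perf_counter() - start

# All at once: fewer forward passes, higher throughput.
start = time.perf_counter()
enc = tok(prompts, return_tensors="pt", padding=True)
model.generate(**enc, max_new_tokens=20, do_sample=False,
               pad_token_id=tok.eos_token_id)
batched = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, batched: {batched:.2f}s")
```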
---
## 🧮 14. Apply Quantization
- [ ] Load model in:
- [ ] 8-bit
- [ ] (Optional) 4-bit
- [ ] Compare:
- [ ] Memory usage
- [ ] Speed
- [ ] Output quality
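Loading in 8-bit via `bitsandbytes` might look like this; the quantized load itself needs a CUDA GPU with `bitsandbytes` installed, so it is guarded here:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization config; use load_in_4bit=True instead for the optional 4-bit run.
bnb = BitsAndBytesConfig(load_in_8bit=True)

if torch.cuda.is_available():
    # Requires the bitsandbytes package.
    model = AutoModelForCausalLM.from_pretrained(
        "gpt2", quantization_config=bnb, device_map="auto")
    print(f"{model.get_memory_footprint() / 1e6:.0f} MB")  # compare vs fp32 footprint
else:
    print("8-bit loading needs a CUDA GPU; skipping on CPU")
```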
---
## 🖥️ 15. Simulate Real-World Usage
- [ ] Pretend you have:
- [ ] Multiple users hitting your model
- [ ] Think through:
- [ ] How would you queue requests?
- [ ] When would you batch?
- [ ] When would you scale?
---
## ☁️ 16. Understand Infra Concepts
- [ ] Research:
- [ ] GPU provisioning
- [ ] Autoscaling
- [ ] Model warm starts
- [ ] Understand:
- [ ] Why loading time matters
    - [ ] Why GPUs shouldn't sit idle
---
## 🧬 17. (Bonus) DICOM Exploration
- [ ] Learn:
- [ ] What DICOM files are
- [ ] Think:
- [ ] How LLMs could be used with medical data
- [ ] Note:
- [ ] Privacy + domain challenges
---
## ✍️ 18. Write Your Blog
### Structure
- [ ] Introduction:
- [ ] What is an LLM really?
- [ ] Training:
- [ ] Tokenization
- [ ] Training loop
- [ ] Loss behaviour
- [ ] Fine-tuning:
- [ ] Full vs LoRA
- [ ] Challenges:
- [ ] What went wrong
- [ ] Infrastructure:
- [ ] Serving challenges
- [ ] Batching
- [ ] Quantization
- [ ] Key Learnings:
- [ ] What surprised you
- [ ] What actually matters
---
## ✅ Final Deliverables
- [ ] Working training script
- [ ] LoRA vs full fine-tune comparison
- [ ] Basic inference script
- [ ] Blog post (clear + honest)
- [ ] Notes showing your understanding
---
## ⚠️ Keep Yourself Honest
- [ ] Can you explain the training loop without looking?
- [ ] Do you understand why loss decreases?
- [ ] Can you explain batching vs latency tradeoffs?
- [ ] Do you know what would break at scale?