Created a basic model and ran it for 10 epochs. Used the GPT-2 tokenizer and the base GPT-2 model to tokenize the dataset and retrain the model

Signed-off-by: rodude123 <rodude123@gmail.com>
This commit is contained in:
Rohit Pai
2026-04-12 11:25:38 +01:00
committed by rodude123
parent 703bece135
commit 9315dc352b
4 changed files with 1789 additions and 6 deletions
# 🧠 LLM Mini Project — Step-by-Step Checklist
---
## 📦 0. Setup Environment
- [ ] Create a new project folder
- [ ] Set up a virtual environment
- [ ] Install core dependencies:
- [ ] torch
- [ ] transformers
- [ ] datasets
- [ ] accelerate
- [ ] peft (for LoRA later)
- [ ] bitsandbytes (for quantization later)
- [ ] Confirm GPU is available (`torch.cuda.is_available()`)
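The GPU check can be done in a couple of lines, falling back to CPU if no CUDA device is visible:

```python
import torch

# Report which device PyTorch will use; falls back to CPU without a CUDA GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(torch.cuda.get_device_name(0))
```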
---
## 🔍 1. Understand the Problem (don't skip this)
- [ ] Write down in your own words:
- [ ] What is a language model?
- [ ] What does “predict next token” actually mean?
- [ ] Manually inspect:
- [ ] A sample sentence
- [ ] Its tokenized form
- [ ] Verify:
- [ ] Input tokens vs target tokens (shifted by 1)
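The one-token shift is easiest to see on a hand-made list of token ids (the values here are purely illustrative):

```python
# A causal LM is trained so that each position predicts the NEXT token.
# Targets are therefore the input ids shifted left by one position.
token_ids = [464, 3290, 3332, 319, 262, 2603, 13]  # illustrative ids

inputs = token_ids[:-1]   # what the model sees
targets = token_ids[1:]   # what it must predict
for i, t in zip(inputs, targets):
    print(f"input {i} -> target {t}")
```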
---
## 📚 2. Load Dataset
- [ ] Choose dataset:
- [ ] Start with WikiText-2
- [ ] Load dataset using `datasets`
- [ ] Print:
- [ ] A few raw samples
- [ ] Check:
- [ ] Dataset size
- [ ] Train/validation split
---
## 🔢 3. Tokenization
- [ ] Load GPT-2 tokenizer
- [ ] Tokenize dataset:
- [ ] Apply truncation
- [ ] Apply padding
- [ ] Verify:
- [ ] Shape of tokenized output
- [ ] Decode tokens back to text (sanity check)
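A tokenization sanity check along these lines works; note GPT-2 ships without a pad token, so the EOS token is commonly reused for padding:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default

text = "The quick brown fox jumps over the lazy dog."
enc = tok(text, truncation=True, padding="max_length", max_length=16,
          return_tensors="pt")
print(enc["input_ids"].shape)  # expect (1, 16)
# Decode back to text as a round-trip sanity check.
print(tok.decode(enc["input_ids"][0], skip_special_tokens=True))
```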
---
## 🧱 4. Prepare Training Data
- [ ] Convert dataset to PyTorch format
- [ ] Create DataLoader:
    - [ ] Set batch size (start small, e.g. 8)
- [ ] Confirm:
- [ ] Batches load correctly
- [ ] Tensor shapes are consistent
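The DataLoader step can be verified with stand-in data before wiring up the real tokenized dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for tokenized data: 100 sequences of 32 ids (GPT-2 vocab size 50257).
input_ids = torch.randint(0, 50257, (100, 32))
loader = DataLoader(TensorDataset(input_ids), batch_size=8, shuffle=True)

(batch,) = next(iter(loader))
print(batch.shape)  # consistent (batch_size, seq_len) shapes per batch
```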
---
## 🤖 5. Load Model
- [ ] Load pretrained GPT-2 small
- [ ] Move model to GPU (if available)
- [ ] Print:
- [ ] Model size (parameters)
- [ ] Run a single forward pass to confirm:
- [ ] No errors
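Loading GPT-2 small, counting parameters, and running a single forward pass might look like this; passing `labels` makes the model compute its own LM loss:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
tok = AutoTokenizer.from_pretrained("gpt2")

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # GPT-2 small is ~124M

ids = tok("Hello world", return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    out = model(ids, labels=ids)  # labels=input_ids yields the LM loss directly
print(out.logits.shape, out.loss.item())
```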
---
## 🔁 6. Build Training Loop (core understanding)
- [ ] Write your own training loop (no Trainer API yet)
- [ ] Include:
- [ ] Forward pass
- [ ] Loss calculation
- [ ] Backpropagation
- [ ] Optimizer step
- [ ] Print:
- [ ] Loss every few steps
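The steps above can be sketched as one function. This assumes a Hugging-Face-style model that returns `.loss` when called with `labels`, and a loader yielding batches of input ids:

```python
import torch

def train(model, loader, optimizer, device, epochs=1, log_every=50):
    """Minimal hand-written training loop for a causal LM (no Trainer API)."""
    model.train()
    step = 0
    for _ in range(epochs):
        for (input_ids,) in loader:
            input_ids = input_ids.to(device)
            out = model(input_ids=input_ids, labels=input_ids)  # forward pass + loss
            out.loss.backward()       # backpropagation
            optimizer.step()          # optimizer step
            optimizer.zero_grad()
            if step % log_every == 0:
                print(f"step {step}: loss {out.loss.item():.4f}")
            step += 1
```

With GPT-2 this would be called as `train(model, loader, torch.optim.AdamW(model.parameters(), lr=5e-5), device)`.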
---
## 📉 7. Observe Training Behaviour
- [ ] Track:
- [ ] Training loss over time
- [ ] Answer:
- [ ] Is loss decreasing?
- [ ] Is it noisy or stable?
- [ ] (Optional)
- [ ] Plot loss curve
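The optional loss curve can be plotted with matplotlib; the values below are illustrative placeholders for losses logged during training:

```python
import matplotlib

matplotlib.use("Agg")  # render to file; no display needed
import matplotlib.pyplot as plt

losses = [4.1, 3.6, 3.3, 3.1, 3.0, 2.95, 2.92, 2.90]  # placeholder values
plt.plot(losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.title("Training loss")
plt.savefig("loss_curve.png")
```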
---
## 🧪 8. Evaluate Model
- [ ] Generate text from model:
- [ ] Before training
- [ ] After training
- [ ] Compare:
- [ ] Coherence
- [ ] Structure
- [ ] Note:
- [ ] Any overfitting signs (repetition, memorization)
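Generating a sample takes only a few lines; running it on the untouched pretrained model and again on the fine-tuned one gives the before/after comparison:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # swap in the fine-tuned model after training

prompt = "The history of science"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```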
---
## ⚖️ 9. Try LoRA Fine-Tuning
- [ ] Add LoRA using `peft`
- [ ] Freeze base model weights
- [ ] Train only adapter layers
- [ ] Compare vs full fine-tuning:
- [ ] Speed
- [ ] Memory usage
- [ ] Output quality
---
## 🧠 10. Understand Convergence
- [ ] Identify:
- [ ] When loss plateaus
- [ ] Check validation loss:
- [ ] Does it increase? (overfitting)
- [ ] Write down:
- [ ] What “good training” looks like
---
## ⚙️ 11. Model Saving & Loading
- [ ] Save:
- [ ] Model weights
- [ ] Tokenizer
- [ ] Reload model
- [ ] Confirm:
- [ ] Outputs remain consistent
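Save, reload, and consistency check in one pass (the output directory name here is hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

out_dir = "gpt2-finetuned"  # hypothetical output directory
model.save_pretrained(out_dir)
tok.save_pretrained(out_dir)

reloaded = AutoModelForCausalLM.from_pretrained(out_dir)
ids = tok("Sanity check", return_tensors="pt").input_ids
with torch.no_grad():
    same = torch.allclose(model(ids).logits, reloaded(ids).logits, atol=1e-5)
print("outputs consistent:", same)
```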
---
# 🚀 PART 2 — Infrastructure & Serving
---
## 🧠 12. Understand Inference Flow
- [ ] Write down:
- [ ] Steps from input → output
- [ ] Measure:
- [ ] Time taken for a single generation
---
## ⚡ 13. Optimize Inference
- [ ] Test batching:
- [ ] Multiple inputs at once
- [ ] Compare:
- [ ] Latency vs throughput
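One way to see the tradeoff is to time one prompt at a time versus the same prompts in a single batch (greedy decoding and left-padding keep the comparison simple; the prompts are arbitrary examples):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
tok.padding_side = "left"  # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The weather today", "Once upon a time", "In machine learning"]

# One at a time: this is per-request latency.
start = time.perf_counter()
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    model.generate(ids, max_new_tokens=20, do_sample=False,
                   pad_token_id=tok.eos_token_id)
sequential = time.perf_counter() - start

# All at once: fewer forward passes, higher throughput.
start = time.perf_counter()
enc = tok(prompts, return_tensors="pt", padding=True)
model.generate(**enc, max_new_tokens=20, do_sample=False,
               pad_token_id=tok.eos_token_id)
batched = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, batched: {batched:.2f}s")
```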
---
## 🧮 14. Apply Quantization
- [ ] Load model in:
- [ ] 8-bit
- [ ] (Optional) 4-bit
- [ ] Compare:
- [ ] Memory usage
- [ ] Speed
- [ ] Output quality
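Loading in 8-bit via `bitsandbytes` might look like this; the quantized load itself needs a CUDA GPU with `bitsandbytes` installed, so it is guarded here:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization config; use load_in_4bit=True instead for the optional 4-bit run.
bnb = BitsAndBytesConfig(load_in_8bit=True)

if torch.cuda.is_available():
    # Requires the bitsandbytes package.
    model = AutoModelForCausalLM.from_pretrained(
        "gpt2", quantization_config=bnb, device_map="auto")
    print(f"{model.get_memory_footprint() / 1e6:.0f} MB")  # compare vs fp32 footprint
else:
    print("8-bit loading needs a CUDA GPU; skipping on CPU")
```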
---
## 🖥️ 15. Simulate Real-World Usage
- [ ] Pretend you have:
- [ ] Multiple users hitting your model
- [ ] Think through:
- [ ] How would you queue requests?
- [ ] When would you batch?
- [ ] When would you scale?
---
## ☁️ 16. Understand Infra Concepts
- [ ] Research:
- [ ] GPU provisioning
- [ ] Autoscaling
- [ ] Model warm starts
- [ ] Understand:
- [ ] Why loading time matters
    - [ ] Why GPUs shouldn't sit idle
---
## 🧬 17. (Bonus) DICOM Exploration
- [ ] Learn:
- [ ] What DICOM files are
- [ ] Think:
- [ ] How LLMs could be used with medical data
- [ ] Note:
- [ ] Privacy + domain challenges
---
## ✍️ 18. Write Your Blog
### Structure
- [ ] Introduction:
- [ ] What is an LLM really?
- [ ] Training:
- [ ] Tokenization
- [ ] Training loop
- [ ] Loss behaviour
- [ ] Fine-tuning:
- [ ] Full vs LoRA
- [ ] Challenges:
- [ ] What went wrong
- [ ] Infrastructure:
- [ ] Serving challenges
- [ ] Batching
- [ ] Quantization
- [ ] Key Learnings:
- [ ] What surprised you
- [ ] What actually matters
---
## ✅ Final Deliverables
- [ ] Working training script
- [ ] LoRA vs full fine-tune comparison
- [ ] Basic inference script
- [ ] Blog post (clear + honest)
- [ ] Notes showing your understanding
---
## ⚠️ Keep Yourself Honest
- [ ] Can you explain the training loop without looking?
- [ ] Do you understand why loss decreases?
- [ ] Can you explain batching vs latency tradeoffs?
- [ ] Do you know what would break at scale?