
🧠 LLM Mini Project — Step-by-Step Checklist


📦 0. Setup Environment

  • Create a new project folder
  • Set up a virtual environment
  • Install core dependencies:
    • torch
    • transformers
    • datasets
    • accelerate
    • peft (for LoRA later)
    • bitsandbytes (for quantization later)
  • Confirm GPU is available (torch.cuda.is_available())
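
A quick sanity check for the environment (assumes torch is already installed):

```python
import torch

# Quick environment check: PyTorch version and GPU availability.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```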

🔍 1. Understand the Problem (don't skip this)

  • Write down in your own words:
    • What is a language model?
    • What does “predict next token” actually mean?
  • Manually inspect:
    • A sample sentence
    • Its tokenized form
  • Verify:
    • Input tokens vs target tokens (shifted by 1)
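
The input-vs-target shift can be checked by hand. A toy whitespace "tokenizer" with stand-in IDs (not real BPE) is enough to see it:

```python
# Toy illustration of next-token prediction targets: the targets are
# simply the input tokens shifted left by one position.
sentence = "the cat sat on the mat"
tokens = sentence.split()
token_ids = list(range(len(tokens)))  # stand-in IDs for illustration

input_ids = token_ids[:-1]   # the model sees these...
target_ids = token_ids[1:]   # ...and must predict these

for inp, tgt in zip(input_ids, target_ids):
    print(f"given {tokens[inp]!r:>7} -> predict {tokens[tgt]!r}")
```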

📚 2. Load Dataset

  • Choose dataset:
    • Start with WikiText-2
  • Load dataset using datasets
  • Print:
    • A few raw samples
  • Check:
    • Dataset size
    • Train/validation split

🔢 3. Tokenization

  • Load GPT-2 tokenizer
  • Tokenize dataset:
    • Apply truncation
    • Apply padding
  • Verify:
    • Shape of tokenized output
    • Decode tokens back to text (sanity check)
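
The mechanics of truncation, padding, and the decode-back sanity check can be illustrated with a toy character-level tokenizer (in the real project you would use `AutoTokenizer.from_pretrained("gpt2")` from transformers; `PAD_ID`, `encode`, and `decode` here are made-up names for the sketch):

```python
# Toy character-level tokenizer showing truncation, padding, and a
# decode round trip; transformers' GPT-2 tokenizer does the same job
# with BPE subwords instead of characters.
PAD_ID = 0
vocab = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
inv_vocab = {i: ch for ch, i in vocab.items()}

def encode(text, max_length=8):
    ids = [vocab[ch] for ch in text][:max_length]   # truncation
    ids += [PAD_ID] * (max_length - len(ids))       # padding
    return ids

def decode(ids):
    return "".join(inv_vocab[i] for i in ids if i != PAD_ID)

ids = encode("hello world")
print(ids)          # fixed-length ID sequence
print(decode(ids))  # "hello wo" — truncated at max_length
```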

🧱 4. Prepare Training Data

  • Convert dataset to PyTorch format
  • Create DataLoader:
    • Set batch size (start small, e.g. 8)
  • Confirm:
    • Batches load correctly
    • Tensor shapes are consistent
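
A minimal sketch of the DataLoader step, using fake tokenized data in place of the real dataset (shapes and batch size are illustrative):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Fake "tokenized" data: 100 sequences of length 32, standing in for
# the real tokenized WikiText-2 tensors.
input_ids = torch.randint(0, 50257, (100, 32))
dataset = TensorDataset(input_ids)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

batch, = next(iter(loader))
print(batch.shape)  # torch.Size([8, 32]) — consistent tensor shapes
```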

🤖 5. Load Model

  • Load pretrained GPT-2 small
  • Move model to GPU (if available)
  • Print:
    • Model size (parameters)
  • Run a single forward pass to confirm:
    • No errors
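
Parameter counting and a smoke-test forward pass work the same for any `nn.Module`; a tiny stand-in model is used below so the snippet runs without downloading weights (for the real thing, load `GPT2LMHeadModel.from_pretrained("gpt2")`):

```python
import torch
import torch.nn as nn

# Tiny stand-in model; the counting and forward-pass pattern is
# identical for GPT-2 small (~124M parameters).
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
x = torch.randint(0, 1000, (2, 16), device=device)
out = model(x)       # single forward pass: confirms no errors
print(out.shape)     # (batch, seq_len, vocab)
```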

🔁 6. Build Training Loop (core understanding)

  • Write your own training loop (no Trainer API yet)
  • Include:
    • Forward pass
    • Loss calculation
    • Backpropagation
    • Optimizer step
  • Print:
    • Loss every few steps
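
The loop above can be sketched end to end. The model and data here are tiny stand-ins (deterministic "count upward" sequences, so the loss genuinely falls), but the skeleton — forward, loss, backward, step — is exactly what you will use with GPT-2:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny next-token model: embedding -> linear over the vocabulary.
vocab_size, seq_len = 50, 16
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# "Count upward" sequences: the next token is always (current + 1) % vocab.
data = (torch.arange(seq_len + 1)[None, :] + torch.arange(64)[:, None]) % vocab_size

losses = []
for step in range(50):
    batch = data[torch.randint(0, len(data), (8,))]
    inputs, targets = batch[:, :-1], batch[:, 1:]   # targets shifted by one
    logits = model(inputs)                          # forward pass
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                 # backpropagation
    optimizer.step()                                # optimizer step
    losses.append(loss.item())
    if step % 10 == 0:
        print(f"step {step:3d}  loss {loss.item():.3f}")
```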

📉 7. Observe Training Behaviour

  • Track:
    • Training loss over time
  • Answer:
    • Is loss decreasing?
    • Is it noisy or stable?
  • (Optional)
    • Plot loss curve

🧪 8. Evaluate Model

  • Generate text from model:
    • Before training
    • After training
  • Compare:
    • Coherence
    • Structure
  • Note:
    • Any overfitting signs (repetition, memorization)

⚖️ 9. Try LoRA Fine-Tuning

  • Add LoRA using peft
  • Freeze base model weights
  • Train only adapter layers
  • Compare vs full fine-tuning:
    • Speed
    • Memory usage
    • Output quality
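
The core LoRA idea can be sketched by hand: the frozen base weight gets a trainable low-rank update, output = Wx + (α/r)·BAx. In practice, peft's `LoraConfig` / `get_peft_model` wraps this around GPT-2's attention layers for you; the `LoRALinear` class below is a hypothetical name for the sketch:

```python
import torch
import torch.nn as nn

# Hand-rolled LoRA sketch: freeze the base linear layer, train only the
# low-rank factors A (r x in) and B (out x r). B starts at zero, so the
# adapted layer initially matches the base layer exactly.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze base weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable} / total {total}")  # far fewer trainable params
```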

🧠 10. Understand Convergence

  • Identify:
    • When loss plateaus
  • Check validation loss:
    • Does it increase? (overfitting)
  • Write down:
    • What “good training” looks like
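
The overfitting signature can be checked mechanically on recorded loss histories (the numbers below are made up for illustration): training loss keeps falling while validation loss bottoms out and turns upward.

```python
# Toy convergence check on made-up per-epoch loss histories.
train_loss = [3.9, 3.1, 2.6, 2.2, 1.9, 1.7]
val_loss   = [3.9, 3.2, 2.9, 2.8, 2.9, 3.1]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
overfitting = val_loss[-1] > min(val_loss)
print(f"best epoch: {best_epoch}, overfitting: {overfitting}")
```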

⚙️ 11. Model Saving & Loading

  • Save:
    • Model weights
    • Tokenizer
  • Reload model
  • Confirm:
    • Outputs remain consistent
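
The save/reload round trip can be verified on a tiny model (with transformers you would call `model.save_pretrained(...)` and `tokenizer.save_pretrained(...)` instead of raw `state_dict` files):

```python
import os
import tempfile
import torch
import torch.nn as nn

# Save weights, reload them into a fresh model, and confirm the outputs
# are identical for the same input.
torch.manual_seed(0)
model = nn.Linear(8, 8)
x = torch.randn(3, 8)
before = model(x)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.pt")
    torch.save(model.state_dict(), path)
    reloaded = nn.Linear(8, 8)
    reloaded.load_state_dict(torch.load(path))

after = reloaded(x)
print(torch.allclose(before, after))  # True: outputs remain consistent
```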

🚀 PART 2 — Infrastructure & Serving


🧠 12. Understand Inference Flow

  • Write down:
    • Steps from input → output
  • Measure:
    • Time taken for a single generation
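
The input → output flow is a loop: forward pass, take the most likely next token, append it, repeat. A greedy-decoding sketch on a tiny stand-in model (GPT-2's `generate()` follows the same loop, plus KV caching and sampling):

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 50
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))

ids = torch.tensor([[1, 2, 3]])          # "prompt" token IDs
start = time.perf_counter()
with torch.no_grad():
    for _ in range(10):                   # generate 10 new tokens
        logits = model(ids)               # forward pass over the sequence
        next_id = logits[:, -1].argmax(-1)  # greedy: most likely next token
        ids = torch.cat([ids, next_id[:, None]], dim=1)
elapsed = time.perf_counter() - start
print(ids.shape, f"{elapsed * 1000:.1f} ms")
```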

⚡ 13. Optimize Inference

  • Test batching:
    • Multiple inputs at once
  • Compare:
    • Latency vs throughput
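
The tradeoff shows up even on a tiny stand-in model: 32 one-at-a-time forward passes versus a single batched pass over the same 32 inputs. Batching raises throughput, at the cost of each request waiting for its batch:

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Embedding(50, 64), nn.Linear(64, 50))
inputs = torch.randint(0, 50, (32, 16))

with torch.no_grad():
    t0 = time.perf_counter()
    for row in inputs:              # 32 separate "requests"
        model(row[None, :])
    sequential = time.perf_counter() - t0

    t0 = time.perf_counter()
    model(inputs)                   # one batched request
    batched = time.perf_counter() - t0

print(f"sequential {sequential * 1000:.2f} ms, batched {batched * 1000:.2f} ms")
```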

🧮 14. Apply Quantization

  • Load model in:
    • 8-bit
    • (Optional) 4-bit
  • Compare:
    • Memory usage
    • Speed
    • Output quality
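
The core idea behind 8-bit loading can be sketched as naive symmetric int8 quantization of a single weight tensor: store int8 values plus one scale, dequantize on use. (bitsandbytes does this per-block with extra outlier handling; this is only the basic mechanism.)

```python
import torch

# Quantize a float32 weight tensor to int8 + scale, then dequantize
# and measure the round-trip error and memory saving.
torch.manual_seed(0)
w = torch.randn(256, 256)
scale = w.abs().max() / 127.0
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
w_deq = w_int8.float() * scale

error = (w - w_deq).abs().max().item()
memory_ratio = w_int8.element_size() / w.element_size()
print(f"max abs error {error:.4f}, memory {memory_ratio:.0%} of fp32")
```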

🖥️ 15. Simulate Real-World Usage

  • Pretend you have:
    • Multiple users hitting your model
  • Think through:
    • How would you queue requests?
    • When would you batch?
    • When would you scale?
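
A toy request queue makes the thought experiment concrete: requests accumulate, and each serving step drains up to a maximum batch size. Real serving stacks add timeouts (flush a partial batch after N ms) and multiple workers; `submit` and `serve_step` are hypothetical names for the sketch:

```python
from collections import deque

queue = deque()
MAX_BATCH_SIZE = 4

def submit(prompt):
    queue.append(prompt)

def serve_step():
    # Drain up to MAX_BATCH_SIZE pending requests into one batch.
    batch = [queue.popleft() for _ in range(min(MAX_BATCH_SIZE, len(queue)))]
    return batch  # in a real server: run one batched model call here

for i in range(6):
    submit(f"request-{i}")

print(serve_step())  # first 4 requests batched together
print(serve_step())  # remaining 2
```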

☁️ 16. Understand Infra Concepts

  • Research:
    • GPU provisioning
    • Autoscaling
    • Model warm starts
  • Understand:
    • Why loading time matters
    • Why GPUs shouldn't sit idle

🧬 17. (Bonus) DICOM Exploration

  • Learn:
    • What DICOM files are
  • Think:
    • How LLMs could be used with medical data
  • Note:
    • Privacy + domain challenges

✍️ 18. Write Your Blog

Structure

  • Introduction:
    • What is an LLM really?
  • Training:
    • Tokenization
    • Training loop
    • Loss behaviour
  • Fine-tuning:
    • Full vs LoRA
  • Challenges:
    • What went wrong
  • Infrastructure:
    • Serving challenges
    • Batching
    • Quantization
  • Key Learnings:
    • What surprised you
    • What actually matters

Final Deliverables

  • Working training script
  • LoRA vs full fine-tune comparison
  • Basic inference script
  • Blog post (clear + honest)
  • Notes showing your understanding

⚠️ Keep Yourself Honest

  • Can you explain the training loop without looking?
  • Do you understand why loss decreases?
  • Can you explain batching vs latency tradeoffs?
  • Do you know what would break at scale?