Created a basic model and ran it for 10 epochs. Used the GPT-2 tokenizer and the base GPT-2 model to tokenize the data and retrain the model

Signed-off-by: rodude123 <rodude123@gmail.com>
This commit is contained in:
Rohit Pai 2026-04-12 11:25:38 +01:00 committed by rodude123
parent 703bece135
commit 9315dc352b
4 changed files with 1789 additions and 6 deletions

.gitignore vendored (91 changed lines)

@@ -167,10 +167,89 @@ dmypy.json
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
# Covers JetBrains IDEs: IntelliJ, GoLand, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf
# AWS User-specific
.idea/**/aws.xml
# Generated files
.idea/**/contentModel.xml
# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml
# Gradle
.idea/**/gradle.xml
.idea/**/libraries
# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/artifacts
# .idea/compiler.xml
# .idea/jarRepositories.xml
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr
# CMake
cmake-build-*/
# Mongo Explorer plugin
.idea/**/mongoSettings.xml
# File-based project format
*.iws
# IntelliJ
out/
# mpeltonen/sbt-idea plugin
.idea_modules/
# JIRA plugin
atlassian-ide-plugin.xml
# Cursive Clojure plugin
.idea/replstate.xml
# SonarLint plugin
.idea/sonarlint/
.idea/sonarlint.xml # see https://community.sonarsource.com/t/is-the-file-idea-idea-idea-sonarlint-xml-intended-to-be-under-source-control/121119
# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
# Editor-based HTTP Client
.idea/httpRequests
http-client.private.env.json
# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser
# Apifox Helper cache
.idea/.cache/.Apifox_Helper
.idea/ApifoxUploaderProjectSetting.xml
# Github Copilot persisted session migrations, see: https://github.com/microsoft/copilot-intellij-feedback/issues/712#issuecomment-3322062215
.idea/**/copilot.data.migration.*.xml

Building an LLM.md (new file, 259 lines)

@@ -0,0 +1,259 @@
# 🧠 LLM Mini Project — Step-by-Step Checklist
---
## 📦 0. Setup Environment
- [ ] Create a new project folder
- [ ] Set up a virtual environment
- [ ] Install core dependencies:
- [ ] torch
- [ ] transformers
- [ ] datasets
- [ ] accelerate
- [ ] peft (for LoRA later)
- [ ] bitsandbytes (for quantization later)
- [ ] Confirm GPU is available (`torch.cuda.is_available()`)
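
The GPU check at the end of setup can be a one-liner wrapped in a small helper (a sketch; `describe_device` is a hypothetical name, not part of any library):

```python
import torch

def describe_device() -> str:
    # Report whether CUDA is available and, if so, which GPU we'd train on.
    if torch.cuda.is_available():
        return f"cuda ({torch.cuda.get_device_name(0)})"
    return "cpu"

print(describe_device())
```

If this prints `cpu` on a machine that has a GPU, the usual culprit is a CPU-only torch wheel; reinstall torch with the matching CUDA build.
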
---
## 🔍 1. Understand the Problem (don't skip this)
- [ ] Write down in your own words:
- [ ] What is a language model?
- [ ] What does “predict next token” actually mean?
- [ ] Manually inspect:
- [ ] A sample sentence
- [ ] Its tokenized form
- [ ] Verify:
- [ ] Input tokens vs target tokens (shifted by 1)
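
The shift-by-1 relationship can be verified by hand in plain Python (the token IDs below are made up for illustration, not real GPT-2 IDs):

```python
# A hypothetical tokenized sentence. The model reads tokens[:-1] and,
# at each position, is trained to predict the *next* token, tokens[1:].
tokens = [464, 3290, 3332, 319, 262, 2603]

input_ids = tokens[:-1]   # what the model sees
targets   = tokens[1:]    # what it must predict (shifted by 1)

for pos, (inp, tgt) in enumerate(zip(input_ids, targets)):
    print(f"position {pos}: sees {inp}, predicts {tgt}")
```

Every training pair in causal language modelling is just this shift applied batch-wide.
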
---
## 📚 2. Load Dataset
- [ ] Choose dataset:
- [ ] Start with WikiText-2
- [ ] Load dataset using `datasets`
- [ ] Print:
- [ ] A few raw samples
- [ ] Check:
- [ ] Dataset size
- [ ] Train/validation split
---
## 🔢 3. Tokenization
- [ ] Load GPT-2 tokenizer
- [ ] Tokenize dataset:
- [ ] Apply truncation
- [ ] Apply padding
- [ ] Verify:
- [ ] Shape of tokenized output
- [ ] Decode tokens back to text (sanity check)
---
## 🧱 4. Prepare Training Data
- [ ] Convert dataset to PyTorch format
- [ ] Create DataLoader:
- [ ] Set batch size (start small, e.g. 8)
- [ ] Confirm:
- [ ] Batches load correctly
- [ ] Tensor shapes are consistent
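
The DataLoader shape check can be rehearsed with random stand-in data before touching the real corpus (a sketch; 50257 is GPT-2's vocabulary size):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the tokenized corpus: 256 samples of 128 random token IDs.
input_ids = torch.randint(0, 50257, (256, 128))
dataset = TensorDataset(input_ids)

loader = DataLoader(dataset, batch_size=8, shuffle=True)

(batch,) = next(iter(loader))
print(batch.shape)  # torch.Size([8, 128])
```

Every batch should come out `(batch_size, seq_len)`; a ragged or transposed shape here will surface as a confusing error inside the model instead.
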
---
## 🤖 5. Load Model
- [ ] Load pretrained GPT-2 small
- [ ] Move model to GPU (if available)
- [ ] Print:
- [ ] Model size (parameters)
- [ ] Run a single forward pass to confirm:
- [ ] No errors
---
## 🔁 6. Build Training Loop (core understanding)
- [ ] Write your own training loop (no Trainer API yet)
- [ ] Include:
- [ ] Forward pass
- [ ] Loss calculation
- [ ] Backpropagation
- [ ] Optimizer step
- [ ] Print:
- [ ] Loss every few steps
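
The four ingredients of the loop (forward, loss, backward, step) can be sketched on a toy stand-in model before wiring in GPT-2 (assumptions: a tiny embedding→linear "LM" and one fixed random batch, purely for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d = 100, 32

# Tiny stand-in LM: token embedding followed by a projection over the vocab.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab, d),
    torch.nn.Linear(d, vocab),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

data = torch.randint(0, vocab, (4, 17))      # (batch, seq_len + 1)
inputs, targets = data[:, :-1], data[:, 1:]  # next-token shift

losses = []
for step in range(50):
    logits = model(inputs)                               # forward pass
    loss = F.cross_entropy(logits.reshape(-1, vocab),    # loss calculation
                           targets.reshape(-1))
    opt.zero_grad()
    loss.backward()                                      # backpropagation
    opt.step()                                           # optimizer step
    losses.append(loss.item())
    if step % 10 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```

On this fixed toy batch the loss should fall steadily as the model memorizes it; the real loop swaps in GPT-2 and iterates over the DataLoader, but the four steps are identical.
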
---
## 📉 7. Observe Training Behaviour
- [ ] Track:
- [ ] Training loss over time
- [ ] Answer:
- [ ] Is loss decreasing?
- [ ] Is it noisy or stable?
- [ ] (Optional)
- [ ] Plot loss curve
---
## 🧪 8. Evaluate Model
- [ ] Generate text from model:
- [ ] Before training
- [ ] After training
- [ ] Compare:
- [ ] Coherence
- [ ] Structure
- [ ] Note:
- [ ] Any overfitting signs (repetition, memorization)
---
## ⚖️ 9. Try LoRA Fine-Tuning
- [ ] Add LoRA using `peft`
- [ ] Freeze base model weights
- [ ] Train only adapter layers
- [ ] Compare vs full fine-tuning:
- [ ] Speed
- [ ] Memory usage
- [ ] Output quality
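
Before reaching for `peft`, the core LoRA idea can be sketched in a few lines of plain torch: freeze the pretrained weight `W` and train only a low-rank update `B @ A` (this toy `LoRALinear` is an illustration of the idea, not the `peft` implementation):

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank = 64, 64, 4

class LoRALinear(torch.nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (BA)x."""
    def __init__(self):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # freeze "pretrained" weights
        self.base.bias.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, rank))  # zero init: update starts at 0

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable} / total {total}")
```

Here only 512 of ~4.7k parameters are trainable; at GPT-2 scale the same ratio is what makes LoRA cheap in memory and optimizer state.
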
---
## 🧠 10. Understand Convergence
- [ ] Identify:
- [ ] When loss plateaus
- [ ] Check validation loss:
- [ ] Does it increase? (overfitting)
- [ ] Write down:
- [ ] What “good training” looks like
---
## ⚙️ 11. Model Saving & Loading
- [ ] Save:
- [ ] Model weights
- [ ] Tokenizer
- [ ] Reload model
- [ ] Confirm:
- [ ] Outputs remain consistent
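
The save→reload→compare check can be rehearsed with a toy module and `state_dict` (a sketch; for GPT-2 itself the Hugging Face `save_pretrained`/`from_pretrained` pair plays the same role):

```python
import os
import tempfile
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)
before = model(x)

# Save weights, rebuild the model from scratch, reload, and compare outputs.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.pt")
    torch.save(model.state_dict(), path)
    reloaded = torch.nn.Linear(8, 8)
    reloaded.load_state_dict(torch.load(path))
    after = reloaded(x)

print("outputs match:", torch.allclose(before, after))
```

If the outputs differ after reload, the usual suspects are a mismatched architecture, a missing tokenizer save, or dropout left in training mode.
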
---
# 🚀 PART 2 — Infrastructure & Serving
---
## 🧠 12. Understand Inference Flow
- [ ] Write down:
- [ ] Steps from input → output
- [ ] Measure:
- [ ] Time taken for a single generation
---
## ⚡ 13. Optimize Inference
- [ ] Test batching:
- [ ] Multiple inputs at once
- [ ] Compare:
- [ ] Latency vs throughput
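
The latency/throughput trade-off can be measured with a toy model and `time.perf_counter` (a sketch; on real hardware per-item time typically falls as the batch grows, while total time per request rises):

```python
import time
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256),
)
model.eval()

def timed_forward(batch_size: int) -> float:
    # Time one forward pass over a random batch of the given size.
    x = torch.randn(batch_size, 256)
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    return time.perf_counter() - start

for bs in (1, 8, 64):
    t = timed_forward(bs)
    print(f"batch {bs:3d}: {t * 1e3:.2f} ms total, {t / bs * 1e3:.3f} ms per item")
```

Note that on a GPU, CUDA kernels launch asynchronously, so an honest version of this benchmark calls `torch.cuda.synchronize()` before reading the clock.
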
---
## 🧮 14. Apply Quantization
- [ ] Load model in:
- [ ] 8-bit
- [ ] (Optional) 4-bit
- [ ] Compare:
- [ ] Memory usage
- [ ] Speed
- [ ] Output quality
---
## 🖥️ 15. Simulate Real-World Usage
- [ ] Pretend you have:
- [ ] Multiple users hitting your model
- [ ] Think through:
- [ ] How would you queue requests?
- [ ] When would you batch?
- [ ] When would you scale?
---
## ☁️ 16. Understand Infra Concepts
- [ ] Research:
- [ ] GPU provisioning
- [ ] Autoscaling
- [ ] Model warm starts
- [ ] Understand:
- [ ] Why loading time matters
- [ ] Why GPUs shouldn't sit idle
---
## 🧬 17. (Bonus) DICOM Exploration
- [ ] Learn:
- [ ] What DICOM files are
- [ ] Think:
- [ ] How LLMs could be used with medical data
- [ ] Note:
- [ ] Privacy + domain challenges
---
## ✍️ 18. Write Your Blog
### Structure
- [ ] Introduction:
- [ ] What is an LLM really?
- [ ] Training:
- [ ] Tokenization
- [ ] Training loop
- [ ] Loss behaviour
- [ ] Fine-tuning:
- [ ] Full vs LoRA
- [ ] Challenges:
- [ ] What went wrong
- [ ] Infrastructure:
- [ ] Serving challenges
- [ ] Batching
- [ ] Quantization
- [ ] Key Learnings:
- [ ] What surprised you
- [ ] What actually matters
---
## ✅ Final Deliverables
- [ ] Working training script
- [ ] LoRA vs full fine-tune comparison
- [ ] Basic inference script
- [ ] Blog post (clear + honest)
- [ ] Notes showing your understanding
---
## ⚠️ Keep Yourself Honest
- [ ] Can you explain the training loop without looking?
- [ ] Do you understand why loss decreases?
- [ ] Can you explain batching vs latency tradeoffs?
- [ ] Do you know what would break at scale?

LLM-gpt.ipynb (new file, 1445 lines)

File diff suppressed because it is too large.

model_output_10_epochs.zip (new binary file, not shown)