Created a basic model and ran it for 10 epochs. Used the GPT-2 tokenizer and the base GPT-2 model to tokenize the data and retrain the model

Signed-off-by: rodude123 <rodude123@gmail.com>
This commit is contained in:
Rohit Pai 2026-04-12 11:25:38 +01:00 committed by rodude123
parent 703bece135
commit 9315dc352b
4 changed files with 1789 additions and 6 deletions

.gitignore vendored (91 changed lines)

@@ -167,10 +167,89 @@ dmypy.json
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
# Covers JetBrains IDEs: IntelliJ, GoLand, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf
# AWS User-specific
.idea/**/aws.xml
# Generated files
.idea/**/contentModel.xml
# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml
# Gradle
.idea/**/gradle.xml
.idea/**/libraries
# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/artifacts
# .idea/compiler.xml
# .idea/jarRepositories.xml
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr
# CMake
cmake-build-*/
# Mongo Explorer plugin
.idea/**/mongoSettings.xml
# File-based project format
*.iws
# IntelliJ
out/
# mpeltonen/sbt-idea plugin
.idea_modules/
# JIRA plugin
atlassian-ide-plugin.xml
# Cursive Clojure plugin
.idea/replstate.xml
# SonarLint plugin
.idea/sonarlint/
.idea/sonarlint.xml # see https://community.sonarsource.com/t/is-the-file-idea-idea-idea-sonarlint-xml-intended-to-be-under-source-control/121119
# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
# Editor-based HTTP Client
.idea/httpRequests
http-client.private.env.json
# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser
# Apifox Helper cache
.idea/.cache/.Apifox_Helper
.idea/ApifoxUploaderProjectSetting.xml
# Github Copilot persisted session migrations, see: https://github.com/microsoft/copilot-intellij-feedback/issues/712#issuecomment-3322062215
.idea/**/copilot.data.migration.*.xml

Building an LLM.md (new file, 259 lines)

@@ -0,0 +1,259 @@
# 🧠 LLM Mini Project — Step-by-Step Checklist
---
## 📦 0. Setup Environment
- [ ] Create a new project folder
- [ ] Set up a virtual environment
- [ ] Install core dependencies:
- [ ] torch
- [ ] transformers
- [ ] datasets
- [ ] accelerate
- [ ] peft (for LoRA later)
- [ ] bitsandbytes (for quantization later)
- [ ] Confirm GPU is available (`torch.cuda.is_available()`)
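
The GPU check at the end of setup can be a one-liner wrapped in a small helper (a sketch; `describe_device` is a hypothetical name, not part of any library):

```python
import torch

def describe_device() -> str:
    # Report whether CUDA is available and, if so, which GPU we'd train on.
    if torch.cuda.is_available():
        return f"cuda ({torch.cuda.get_device_name(0)})"
    return "cpu"

print(describe_device())
```

If this prints `cpu` on a machine that has a GPU, the usual culprit is a CPU-only torch wheel; reinstall torch with the matching CUDA build.
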
---
## 🔍 1. Understand the Problem (don't skip this)
- [ ] Write down in your own words:
- [ ] What is a language model?
- [ ] What does “predict next token” actually mean?
- [ ] Manually inspect:
- [ ] A sample sentence
- [ ] Its tokenized form
- [ ] Verify:
- [ ] Input tokens vs target tokens (shifted by 1)
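
The shift-by-1 relationship can be verified by hand in plain Python (the token IDs below are made up for illustration, not real GPT-2 IDs):

```python
# A hypothetical tokenized sentence. The model reads tokens[:-1] and,
# at each position, is trained to predict the *next* token, tokens[1:].
tokens = [464, 3290, 3332, 319, 262, 2603]

input_ids = tokens[:-1]   # what the model sees
targets   = tokens[1:]    # what it must predict (shifted by 1)

for pos, (inp, tgt) in enumerate(zip(input_ids, targets)):
    print(f"position {pos}: sees {inp}, predicts {tgt}")
```

Every training pair in causal language modelling is just this shift applied batch-wide.
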
---
## 📚 2. Load Dataset
- [ ] Choose dataset:
- [ ] Start with WikiText-2
- [ ] Load dataset using `datasets`
- [ ] Print:
- [ ] A few raw samples
- [ ] Check:
- [ ] Dataset size
- [ ] Train/validation split
---
## 🔢 3. Tokenization
- [ ] Load GPT-2 tokenizer
- [ ] Tokenize dataset:
- [ ] Apply truncation
- [ ] Apply padding
- [ ] Verify:
- [ ] Shape of tokenized output
- [ ] Decode tokens back to text (sanity check)
---
## 🧱 4. Prepare Training Data
- [ ] Convert dataset to PyTorch format
- [ ] Create DataLoader:
- [ ] Set batch size (start small, e.g. 8)
- [ ] Confirm:
- [ ] Batches load correctly
- [ ] Tensor shapes are consistent
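
The DataLoader shape check can be rehearsed with random stand-in data before touching the real corpus (a sketch; 50257 is GPT-2's vocabulary size):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the tokenized corpus: 256 samples of 128 random token IDs.
input_ids = torch.randint(0, 50257, (256, 128))
dataset = TensorDataset(input_ids)

loader = DataLoader(dataset, batch_size=8, shuffle=True)

(batch,) = next(iter(loader))
print(batch.shape)  # torch.Size([8, 128])
```

Every batch should come out `(batch_size, seq_len)`; a ragged or transposed shape here will surface as a confusing error inside the model instead.
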
---
## 🤖 5. Load Model
- [ ] Load pretrained GPT-2 small
- [ ] Move model to GPU (if available)
- [ ] Print:
- [ ] Model size (parameters)
- [ ] Run a single forward pass to confirm:
- [ ] No errors
---
## 🔁 6. Build Training Loop (core understanding)
- [ ] Write your own training loop (no Trainer API yet)
- [ ] Include:
- [ ] Forward pass
- [ ] Loss calculation
- [ ] Backpropagation
- [ ] Optimizer step
- [ ] Print:
- [ ] Loss every few steps
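
The four ingredients of the loop (forward, loss, backward, step) can be sketched on a toy stand-in model before wiring in GPT-2 (assumptions: a tiny embedding→linear "LM" and one fixed random batch, purely for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d = 100, 32

# Tiny stand-in LM: token embedding followed by a projection over the vocab.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab, d),
    torch.nn.Linear(d, vocab),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

data = torch.randint(0, vocab, (4, 17))      # (batch, seq_len + 1)
inputs, targets = data[:, :-1], data[:, 1:]  # next-token shift

losses = []
for step in range(50):
    logits = model(inputs)                               # forward pass
    loss = F.cross_entropy(logits.reshape(-1, vocab),    # loss calculation
                           targets.reshape(-1))
    opt.zero_grad()
    loss.backward()                                      # backpropagation
    opt.step()                                           # optimizer step
    losses.append(loss.item())
    if step % 10 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```

On this fixed toy batch the loss should fall steadily as the model memorizes it; the real loop swaps in GPT-2 and iterates over the DataLoader, but the four steps are identical.
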
---
## 📉 7. Observe Training Behaviour
- [ ] Track:
- [ ] Training loss over time
- [ ] Answer:
- [ ] Is loss decreasing?
- [ ] Is it noisy or stable?
- [ ] (Optional)
- [ ] Plot loss curve
---
## 🧪 8. Evaluate Model
- [ ] Generate text from model:
- [ ] Before training
- [ ] After training
- [ ] Compare:
- [ ] Coherence
- [ ] Structure
- [ ] Note:
- [ ] Any overfitting signs (repetition, memorization)
---
## ⚖️ 9. Try LoRA Fine-Tuning
- [ ] Add LoRA using `peft`
- [ ] Freeze base model weights
- [ ] Train only adapter layers
- [ ] Compare vs full fine-tuning:
- [ ] Speed
- [ ] Memory usage
- [ ] Output quality
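
Before reaching for `peft`, the core LoRA idea can be sketched in a few lines of plain torch: freeze the pretrained weight `W` and train only a low-rank update `B @ A` (this toy `LoRALinear` is an illustration of the idea, not the `peft` implementation):

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank = 64, 64, 4

class LoRALinear(torch.nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (BA)x."""
    def __init__(self):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # freeze "pretrained" weights
        self.base.bias.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, rank))  # zero init: update starts at 0

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable} / total {total}")
```

Here only 512 of ~4.7k parameters are trainable; at GPT-2 scale the same ratio is what makes LoRA cheap in memory and optimizer state.
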
---
## 🧠 10. Understand Convergence
- [ ] Identify:
- [ ] When loss plateaus
- [ ] Check validation loss:
- [ ] Does it increase? (overfitting)
- [ ] Write down:
- [ ] What “good training” looks like
---
## ⚙️ 11. Model Saving & Loading
- [ ] Save:
- [ ] Model weights
- [ ] Tokenizer
- [ ] Reload model
- [ ] Confirm:
- [ ] Outputs remain consistent
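
The save→reload→compare check can be rehearsed with a toy module and `state_dict` (a sketch; for GPT-2 itself the Hugging Face `save_pretrained`/`from_pretrained` pair plays the same role):

```python
import os
import tempfile
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)
before = model(x)

# Save weights, rebuild the model from scratch, reload, and compare outputs.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.pt")
    torch.save(model.state_dict(), path)
    reloaded = torch.nn.Linear(8, 8)
    reloaded.load_state_dict(torch.load(path))
    after = reloaded(x)

print("outputs match:", torch.allclose(before, after))
```

If the outputs differ after reload, the usual suspects are a mismatched architecture, a missing tokenizer save, or dropout left in training mode.
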
---
# 🚀 PART 2 — Infrastructure & Serving
---
## 🧠 12. Understand Inference Flow
- [ ] Write down:
- [ ] Steps from input → output
- [ ] Measure:
- [ ] Time taken for a single generation
---
## ⚡ 13. Optimize Inference
- [ ] Test batching:
- [ ] Multiple inputs at once
- [ ] Compare:
- [ ] Latency vs throughput
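
The latency/throughput trade-off can be measured with a toy model and `time.perf_counter` (a sketch; on real hardware per-item time typically falls as the batch grows, while total time per request rises):

```python
import time
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256),
)
model.eval()

def timed_forward(batch_size: int) -> float:
    # Time one forward pass over a random batch of the given size.
    x = torch.randn(batch_size, 256)
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    return time.perf_counter() - start

for bs in (1, 8, 64):
    t = timed_forward(bs)
    print(f"batch {bs:3d}: {t * 1e3:.2f} ms total, {t / bs * 1e3:.3f} ms per item")
```

Note that on a GPU, CUDA kernels launch asynchronously, so an honest version of this benchmark calls `torch.cuda.synchronize()` before reading the clock.
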
---
## 🧮 14. Apply Quantization
- [ ] Load model in:
- [ ] 8-bit
- [ ] (Optional) 4-bit
- [ ] Compare:
- [ ] Memory usage
- [ ] Speed
- [ ] Output quality
---
## 🖥️ 15. Simulate Real-World Usage
- [ ] Pretend you have:
- [ ] Multiple users hitting your model
- [ ] Think through:
- [ ] How would you queue requests?
- [ ] When would you batch?
- [ ] When would you scale?
---
## ☁️ 16. Understand Infra Concepts
- [ ] Research:
- [ ] GPU provisioning
- [ ] Autoscaling
- [ ] Model warm starts
- [ ] Understand:
- [ ] Why loading time matters
- [ ] Why GPUs shouldn't sit idle
---
## 🧬 17. (Bonus) DICOM Exploration
- [ ] Learn:
- [ ] What DICOM files are
- [ ] Think:
- [ ] How LLMs could be used with medical data
- [ ] Note:
- [ ] Privacy + domain challenges
---
## ✍️ 18. Write Your Blog
### Structure
- [ ] Introduction:
- [ ] What is an LLM really?
- [ ] Training:
- [ ] Tokenization
- [ ] Training loop
- [ ] Loss behaviour
- [ ] Fine-tuning:
- [ ] Full vs LoRA
- [ ] Challenges:
- [ ] What went wrong
- [ ] Infrastructure:
- [ ] Serving challenges
- [ ] Batching
- [ ] Quantization
- [ ] Key Learnings:
- [ ] What surprised you
- [ ] What actually matters
---
## ✅ Final Deliverables
- [ ] Working training script
- [ ] LoRA vs full fine-tune comparison
- [ ] Basic inference script
- [ ] Blog post (clear + honest)
- [ ] Notes showing your understanding
---
## ⚠️ Keep Yourself Honest
- [ ] Can you explain the training loop without looking?
- [ ] Do you understand why loss decreases?
- [ ] Can you explain batching vs latency tradeoffs?
- [ ] Do you know what would break at scale?

LLM-gpt.ipynb (new file, 1445 lines)

File diff suppressed because it is too large.

model_output_10_epochs.zip (new binary file, not shown)