Created a basic model and trained it for 10 epochs. Used the GPT-2 tokenizer and the base GPT-2 model to tokenize the data and retrain the model.
Signed-off-by: rodude123 <rodude123@gmail.com>
This commit is contained in:
parent 703bece135
commit 9315dc352b

.gitignore (vendored): 91 lines changed
```diff
@@ -167,10 +167,89 @@ dmypy.json
 # Cython debug symbols
 cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+# Covers JetBrains IDEs: IntelliJ, GoLand, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
+# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
+
+# User-specific stuff
+.idea/**/workspace.xml
+.idea/**/tasks.xml
+.idea/**/usage.statistics.xml
+.idea/**/dictionaries
+.idea/**/shelf
+
+# AWS User-specific
+.idea/**/aws.xml
+
+# Generated files
+.idea/**/contentModel.xml
+
+# Sensitive or high-churn files
+.idea/**/dataSources/
+.idea/**/dataSources.ids
+.idea/**/dataSources.local.xml
+.idea/**/sqlDataSources.xml
+.idea/**/dynamic.xml
+.idea/**/uiDesigner.xml
+.idea/**/dbnavigator.xml
+
+# Gradle
+.idea/**/gradle.xml
+.idea/**/libraries
+
+# Gradle and Maven with auto-import
+# When using Gradle or Maven with auto-import, you should exclude module files,
+# since they will be recreated, and may cause churn. Uncomment if using
+# auto-import.
+# .idea/artifacts
+# .idea/compiler.xml
+# .idea/jarRepositories.xml
+# .idea/modules.xml
+# .idea/*.iml
+# .idea/modules
+# *.iml
+# *.ipr
+
+# CMake
+cmake-build-*/
+
+# Mongo Explorer plugin
+.idea/**/mongoSettings.xml
+
+# File-based project format
+*.iws
+
+# IntelliJ
+out/
+
+# mpeltonen/sbt-idea plugin
+.idea_modules/
+
+# JIRA plugin
+atlassian-ide-plugin.xml
+
+# Cursive Clojure plugin
+.idea/replstate.xml
+
+# SonarLint plugin
+.idea/sonarlint/
+.idea/sonarlint.xml # see https://community.sonarsource.com/t/is-the-file-idea-idea-idea-sonarlint-xml-intended-to-be-under-source-control/121119
+
+# Crashlytics plugin (for Android Studio and IntelliJ)
+com_crashlytics_export_strings.xml
+crashlytics.properties
+crashlytics-build.properties
+fabric.properties
+
+# Editor-based HTTP Client
+.idea/httpRequests
+http-client.private.env.json
+
+# Android studio 3.1+ serialized cache file
+.idea/caches/build_file_checksums.ser
+
+# Apifox Helper cache
+.idea/.cache/.Apifox_Helper
+.idea/ApifoxUploaderProjectSetting.xml
+
+# Github Copilot persisted session migrations, see: https://github.com/microsoft/copilot-intellij-feedback/issues/712#issuecomment-3322062215
+.idea/**/copilot.data.migration.*.xml
```
Building an LLM.md (new file): 259 lines

@@ -0,0 +1,259 @@
# 🧠 LLM Mini Project — Step-by-Step Checklist

---
## 📦 0. Setup Environment

- [ ] Create a new project folder
- [ ] Set up a virtual environment
- [ ] Install core dependencies:
  - [ ] torch
  - [ ] transformers
  - [ ] datasets
  - [ ] accelerate
  - [ ] peft (for LoRA later)
  - [ ] bitsandbytes (for quantization later)
- [ ] Confirm GPU is available (`torch.cuda.is_available()`)

---
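The environment checks above can be rolled into one short script. A minimal sketch (package names taken from the dependency list; it degrades gracefully if something is missing):

```python
import importlib.util

# Dependencies from the checklist; report which are importable here.
required = ["torch", "transformers", "datasets", "accelerate", "peft", "bitsandbytes"]
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("missing packages:", missing or "none")

# The GPU check only makes sense once torch is installed.
if "torch" not in missing:
    import torch
    print("CUDA available:", torch.cuda.is_available())
```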
## 🔍 1. Understand the Problem (don’t skip this)

- [ ] Write down in your own words:
  - [ ] What is a language model?
  - [ ] What does “predict next token” actually mean?
- [ ] Manually inspect:
  - [ ] A sample sentence
  - [ ] Its tokenized form
- [ ] Verify:
  - [ ] Input tokens vs target tokens (shifted by 1)

---
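The “shifted by 1” relationship can be verified with nothing but a list of token ids. The ids below are made up for illustration, not real GPT-2 ids:

```python
# Hypothetical token ids for a short sentence (illustrative only).
token_ids = [464, 3797, 3332, 319, 262, 2603]

# For next-token prediction, position i of the input is paired with
# position i+1 of the same sequence as its target.
inputs = token_ids[:-1]
targets = token_ids[1:]

for given, predict in zip(inputs, targets):
    print(f"given {given:>5} -> predict {predict:>5}")
```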
## 📚 2. Load Dataset

- [ ] Choose dataset:
  - [ ] Start with WikiText-2
- [ ] Load dataset using `datasets`
- [ ] Print:
  - [ ] A few raw samples
- [ ] Check:
  - [ ] Dataset size
  - [ ] Train/validation split

---
## 🔢 3. Tokenization

- [ ] Load GPT-2 tokenizer
- [ ] Tokenize dataset:
  - [ ] Apply truncation
  - [ ] Apply padding
- [ ] Verify:
  - [ ] Shape of tokenized output
  - [ ] Decode tokens back to text (sanity check)

---
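A sketch of the tokenize-then-decode sanity check. GPT-2 ships without a pad token, so reusing the EOS token for padding is a common workaround (an assumption here, not the only option):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

enc = tokenizer(
    ["The cat sat on the mat.", "Short one."],
    truncation=True,
    max_length=16,
    padding="max_length",
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # (2, 16) after truncation + padding
print(tokenizer.decode(enc["input_ids"][0], skip_special_tokens=True))
```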
## 🧱 4. Prepare Training Data

- [ ] Convert dataset to PyTorch format
- [ ] Create DataLoader:
  - [ ] Set batch size (start small: 2–8)
- [ ] Confirm:
  - [ ] Batches load correctly
  - [ ] Tensor shapes are consistent

---
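The DataLoader step can be rehearsed before any real dataset exists. A minimal sketch with random stand-in token ids (the shapes and batch size are arbitrary choices):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 32 fake "tokenized" sequences of length 16, ids drawn from GPT-2's vocab range.
input_ids = torch.randint(0, 50257, (32, 16))
loader = DataLoader(TensorDataset(input_ids), batch_size=4, shuffle=True)

for (batch,) in loader:
    # Every batch should have a consistent (batch_size, seq_len) shape.
    assert batch.shape == (4, 16)
print("all batches consistent:", tuple(batch.shape))
```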
## 🤖 5. Load Model

- [ ] Load pretrained GPT-2 small
- [ ] Move model to GPU (if available)
- [ ] Print:
  - [ ] Model size (parameters)
- [ ] Run a single forward pass to confirm:
  - [ ] No errors

---
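Parameter counting and the smoke-test forward pass are one-liners in PyTorch. The sketch below uses a deliberately tiny stand-in module so it runs anywhere; the same `sum(p.numel() ...)` line works unchanged on a loaded GPT-2:

```python
import torch
from torch import nn

# Stand-in for the real model; in the project this would be
# AutoModelForCausalLM.from_pretrained("gpt2") instead.
model = nn.Sequential(nn.Embedding(50257, 32), nn.Linear(32, 50257))

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params:,}")

# Single forward pass to confirm nothing errors and shapes line up.
dummy_ids = torch.randint(0, 50257, (1, 8))
logits = model(dummy_ids)
print(tuple(logits.shape))  # (1, 8, 50257): one logit vector per position
```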
## 🔁 6. Build Training Loop (core understanding)

- [ ] Write your own training loop (no Trainer API yet)
- [ ] Include:
  - [ ] Forward pass
  - [ ] Loss calculation
  - [ ] Backpropagation
  - [ ] Optimizer step
- [ ] Print:
  - [ ] Loss every few steps

---
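The skeleton of a hand-written training loop is the same whatever the model. A runnable sketch on a tiny regression model so it executes anywhere; for GPT-2 the loss would instead come from `model(input_ids, labels=input_ids).loss`:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 1)  # toy stand-in for the LLM
opt = torch.optim.AdamW(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

x = torch.randn(64, 8)
y = x.sum(dim=1, keepdim=True)  # simple learnable target

losses = []
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)  # forward pass + loss calculation
    loss.backward()              # backpropagation
    opt.step()                   # optimizer step
    losses.append(loss.item())
    if step % 20 == 0:
        print(f"step {step:3d}  loss {losses[-1]:.4f}")

print("loss fell:", losses[-1] < losses[0])
```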
## 📉 7. Observe Training Behaviour

- [ ] Track:
  - [ ] Training loss over time
- [ ] Answer:
  - [ ] Is loss decreasing?
  - [ ] Is it noisy or stable?
- [ ] (Optional)
  - [ ] Plot loss curve

---
## 🧪 8. Evaluate Model

- [ ] Generate text from model:
  - [ ] Before training
  - [ ] After training
- [ ] Compare:
  - [ ] Coherence
  - [ ] Structure
- [ ] Note:
  - [ ] Any overfitting signs (repetition, memorization)

---
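Before/after generations can be produced with `model.generate`. A sketch, assuming the pretrained `gpt2` checkpoint; the fine-tuned model would be loaded the same way from its save directory:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,                        # sampling makes coherence differences visible
    pad_token_id=tokenizer.eos_token_id,   # silences the missing-pad-token warning
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```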
## ⚖️ 9. Try LoRA Fine-Tuning

- [ ] Add LoRA using `peft`
- [ ] Freeze base model weights
- [ ] Train only adapter layers
- [ ] Compare vs full fine-tuning:
  - [ ] Speed
  - [ ] Memory usage
  - [ ] Output quality

---
## 🧠 10. Understand Convergence

- [ ] Identify:
  - [ ] When loss plateaus
- [ ] Check validation loss:
  - [ ] Does it increase? (overfitting)
- [ ] Write down:
  - [ ] What “good training” looks like

---
## ⚙️ 11. Model Saving & Loading

- [ ] Save:
  - [ ] Model weights
  - [ ] Tokenizer
- [ ] Reload model
- [ ] Confirm:
  - [ ] Outputs remain consistent

---
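The consistency check after reloading can be rehearsed with plain `state_dict` round-tripping; for Hugging Face models the equivalent calls are `save_pretrained` / `from_pretrained`. A minimal sketch using a temporary directory:

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)
x = torch.randn(3, 4)
before = model(x)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pt")
    torch.save(model.state_dict(), path)        # save weights

    reloaded = nn.Linear(4, 2)                  # fresh, randomly initialised copy
    reloaded.load_state_dict(torch.load(path))  # restore saved weights
    after = reloaded(x)

print("outputs identical:", torch.allclose(before, after))
```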
# 🚀 PART 2 — Infrastructure & Serving

---
## 🧠 12. Understand Inference Flow

- [ ] Write down:
  - [ ] Steps from input → output
- [ ] Measure:
  - [ ] Time taken for a single generation

---
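A small timing helper makes the measurement repeatable. The workload below is a stand-in; a single `model.generate(...)` call would be dropped in its place:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the project this would be one generation call.
result, seconds = timed(sum, range(1_000_000))
print(f"took {seconds * 1e3:.2f} ms, result={result}")
```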
## ⚡ 13. Optimize Inference

- [ ] Test batching:
  - [ ] Multiple inputs at once
- [ ] Compare:
  - [ ] Latency vs throughput

---
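The latency-vs-throughput tradeoff can be seen in a toy cost model before touching a GPU. The numbers below are invented for illustration: a fixed per-forward-pass overhead plus a per-sequence cost.

```python
# Invented costs for illustration only.
overhead_ms = 30.0  # fixed cost of one forward pass (kernel launches, etc.)
per_seq_ms = 5.0    # additional cost per sequence in the batch

for batch_size in (1, 4, 16):
    latency_ms = overhead_ms + per_seq_ms * batch_size
    throughput = batch_size / (latency_ms / 1000.0)  # sequences per second
    print(f"batch={batch_size:2d}  latency={latency_ms:6.1f} ms  "
          f"throughput={throughput:6.1f} seq/s")
```

Larger batches raise per-request latency but also raise total throughput, which is exactly the tradeoff this step asks about.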
## 🧮 14. Apply Quantization

- [ ] Load model in:
  - [ ] 8-bit
  - [ ] (Optional) 4-bit
- [ ] Compare:
  - [ ] Memory usage
  - [ ] Speed
  - [ ] Output quality

---
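A sketch of the 8-bit load using `transformers`' `BitsAndBytesConfig`. This is a config fragment rather than something to run on CPU: it requires a CUDA GPU plus the `bitsandbytes` package from the setup step.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
# For the optional 4-bit variant: BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=quant_cfg,
    device_map="auto",  # places layers on the available GPU(s)
)
```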
## 🖥️ 15. Simulate Real-World Usage

- [ ] Pretend you have:
  - [ ] Multiple users hitting your model
- [ ] Think through:
  - [ ] How would you queue requests?
  - [ ] When would you batch?
  - [ ] When would you scale?

---
## ☁️ 16. Understand Infra Concepts

- [ ] Research:
  - [ ] GPU provisioning
  - [ ] Autoscaling
  - [ ] Model warm starts
- [ ] Understand:
  - [ ] Why loading time matters
  - [ ] Why GPUs shouldn’t sit idle

---
## 🧬 17. (Bonus) DICOM Exploration

- [ ] Learn:
  - [ ] What DICOM files are
- [ ] Think:
  - [ ] How LLMs could be used with medical data
- [ ] Note:
  - [ ] Privacy + domain challenges

---
## ✍️ 18. Write Your Blog

### Structure

- [ ] Introduction:
  - [ ] What is an LLM really?
- [ ] Training:
  - [ ] Tokenization
  - [ ] Training loop
  - [ ] Loss behaviour
- [ ] Fine-tuning:
  - [ ] Full vs LoRA
- [ ] Challenges:
  - [ ] What went wrong
- [ ] Infrastructure:
  - [ ] Serving challenges
  - [ ] Batching
  - [ ] Quantization
- [ ] Key Learnings:
  - [ ] What surprised you
  - [ ] What actually matters

---
## ✅ Final Deliverables

- [ ] Working training script
- [ ] LoRA vs full fine-tune comparison
- [ ] Basic inference script
- [ ] Blog post (clear + honest)
- [ ] Notes showing your understanding

---
## ⚠️ Keep Yourself Honest

- [ ] Can you explain the training loop without looking?
- [ ] Do you understand why loss decreases?
- [ ] Can you explain batching vs latency tradeoffs?
- [ ] Do you know what would break at scale?
LLM-gpt.ipynb (new file): 1445 lines
File diff suppressed because it is too large
model_output_10_epochs.zip (new binary file)
Binary file not shown.