Created a basic model and trained it for 10 epochs. Used the GPT-2 tokenizer and the base GPT-2 model to tokenize the data and retrain the model.
Signed-off-by: rodude123 <rodude123@gmail.com>
This commit is contained in:
parent 703bece135
commit 9315dc352b

.gitignore (vendored): 91 lines changed
```diff
@@ -167,10 +167,89 @@ dmypy.json
 # Cython debug symbols
 cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+# Covers JetBrains IDEs: IntelliJ, GoLand, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
+# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
+
+# User-specific stuff
+.idea/**/workspace.xml
+.idea/**/tasks.xml
+.idea/**/usage.statistics.xml
+.idea/**/dictionaries
+.idea/**/shelf
+
+# AWS User-specific
+.idea/**/aws.xml
+
+# Generated files
+.idea/**/contentModel.xml
+
+# Sensitive or high-churn files
+.idea/**/dataSources/
+.idea/**/dataSources.ids
+.idea/**/dataSources.local.xml
+.idea/**/sqlDataSources.xml
+.idea/**/dynamic.xml
+.idea/**/uiDesigner.xml
+.idea/**/dbnavigator.xml
+
+# Gradle
+.idea/**/gradle.xml
+.idea/**/libraries
+
+# Gradle and Maven with auto-import
+# When using Gradle or Maven with auto-import, you should exclude module files,
+# since they will be recreated, and may cause churn. Uncomment if using
+# auto-import.
+# .idea/artifacts
+# .idea/compiler.xml
+# .idea/jarRepositories.xml
+# .idea/modules.xml
+# .idea/*.iml
+# .idea/modules
+# *.iml
+# *.ipr
+
+# CMake
+cmake-build-*/
+
+# Mongo Explorer plugin
+.idea/**/mongoSettings.xml
+
+# File-based project format
+*.iws
+
+# IntelliJ
+out/
+
+# mpeltonen/sbt-idea plugin
+.idea_modules/
+
+# JIRA plugin
+atlassian-ide-plugin.xml
+
+# Cursive Clojure plugin
+.idea/replstate.xml
+
+# SonarLint plugin
+.idea/sonarlint/
+.idea/sonarlint.xml # see https://community.sonarsource.com/t/is-the-file-idea-idea-idea-sonarlint-xml-intended-to-be-under-source-control/121119
+
+# Crashlytics plugin (for Android Studio and IntelliJ)
+com_crashlytics_export_strings.xml
+crashlytics.properties
+crashlytics-build.properties
+fabric.properties
+
+# Editor-based HTTP Client
+.idea/httpRequests
+http-client.private.env.json
+
+# Android studio 3.1+ serialized cache file
+.idea/caches/build_file_checksums.ser
+
+# Apifox Helper cache
+.idea/.cache/.Apifox_Helper
+.idea/ApifoxUploaderProjectSetting.xml
+
+# Github Copilot persisted session migrations, see: https://github.com/microsoft/copilot-intellij-feedback/issues/712#issuecomment-3322062215
+.idea/**/copilot.data.migration.*.xml
```
Building an LLM.md (new file): 259 lines

@@ -0,0 +1,259 @@
# 🧠 LLM Mini Project — Step-by-Step Checklist

---
## 📦 0. Setup Environment

- [ ] Create a new project folder
- [ ] Set up a virtual environment
- [ ] Install core dependencies:
  - [ ] torch
  - [ ] transformers
  - [ ] datasets
  - [ ] accelerate
  - [ ] peft (for LoRA later)
  - [ ] bitsandbytes (for quantization later)
- [ ] Confirm GPU is available (`torch.cuda.is_available()`)

---
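The environment checks above can be rolled into one short script. A minimal sketch (package names taken from the dependency list; it degrades gracefully if something is missing):

```python
import importlib.util

# Dependencies from the checklist; report which are importable here.
required = ["torch", "transformers", "datasets", "accelerate", "peft", "bitsandbytes"]
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("missing packages:", missing or "none")

# The GPU check only makes sense once torch is installed.
if "torch" not in missing:
    import torch
    print("CUDA available:", torch.cuda.is_available())
```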
## 🔍 1. Understand the Problem (don’t skip this)

- [ ] Write down in your own words:
  - [ ] What is a language model?
  - [ ] What does “predict next token” actually mean?
- [ ] Manually inspect:
  - [ ] A sample sentence
  - [ ] Its tokenized form
- [ ] Verify:
  - [ ] Input tokens vs target tokens (shifted by 1)

---
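The “shifted by 1” relationship can be verified with nothing but a list of token ids. The ids below are made up for illustration, not real GPT-2 ids:

```python
# Hypothetical token ids for a short sentence (illustrative only).
token_ids = [464, 3797, 3332, 319, 262, 2603]

# For next-token prediction, position i of the input is paired with
# position i+1 of the same sequence as its target.
inputs = token_ids[:-1]
targets = token_ids[1:]

for given, predict in zip(inputs, targets):
    print(f"given {given:>5} -> predict {predict:>5}")
```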
## 📚 2. Load Dataset

- [ ] Choose dataset:
  - [ ] Start with WikiText-2
- [ ] Load dataset using `datasets`
- [ ] Print:
  - [ ] A few raw samples
- [ ] Check:
  - [ ] Dataset size
  - [ ] Train/validation split

---
## 🔢 3. Tokenization

- [ ] Load GPT-2 tokenizer
- [ ] Tokenize dataset:
  - [ ] Apply truncation
  - [ ] Apply padding
- [ ] Verify:
  - [ ] Shape of tokenized output
  - [ ] Decode tokens back to text (sanity check)

---
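A sketch of the tokenize-then-decode sanity check. GPT-2 ships without a pad token, so reusing the EOS token for padding is a common workaround (an assumption here, not the only option):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

enc = tokenizer(
    ["The cat sat on the mat.", "Short one."],
    truncation=True,
    max_length=16,
    padding="max_length",
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # (2, 16) after truncation + padding
print(tokenizer.decode(enc["input_ids"][0], skip_special_tokens=True))
```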
## 🧱 4. Prepare Training Data

- [ ] Convert dataset to PyTorch format
- [ ] Create DataLoader:
  - [ ] Set batch size (start small: 2–8)
- [ ] Confirm:
  - [ ] Batches load correctly
  - [ ] Tensor shapes are consistent

---
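The DataLoader step can be rehearsed before any real dataset exists. A minimal sketch with random stand-in token ids (the shapes and batch size are arbitrary choices):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 32 fake "tokenized" sequences of length 16, ids drawn from GPT-2's vocab range.
input_ids = torch.randint(0, 50257, (32, 16))
loader = DataLoader(TensorDataset(input_ids), batch_size=4, shuffle=True)

for (batch,) in loader:
    # Every batch should have a consistent (batch_size, seq_len) shape.
    assert batch.shape == (4, 16)
print("all batches consistent:", tuple(batch.shape))
```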
## 🤖 5. Load Model

- [ ] Load pretrained GPT-2 small
- [ ] Move model to GPU (if available)
- [ ] Print:
  - [ ] Model size (parameters)
- [ ] Run a single forward pass to confirm:
  - [ ] No errors

---
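Parameter counting and the smoke-test forward pass are one-liners in PyTorch. The sketch below uses a deliberately tiny stand-in module so it runs anywhere; the same `sum(p.numel() ...)` line works unchanged on a loaded GPT-2:

```python
import torch
from torch import nn

# Stand-in for the real model; in the project this would be
# AutoModelForCausalLM.from_pretrained("gpt2") instead.
model = nn.Sequential(nn.Embedding(50257, 32), nn.Linear(32, 50257))

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params:,}")

# Single forward pass to confirm nothing errors and shapes line up.
dummy_ids = torch.randint(0, 50257, (1, 8))
logits = model(dummy_ids)
print(tuple(logits.shape))  # (1, 8, 50257): one logit vector per position
```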
## 🔁 6. Build Training Loop (core understanding)

- [ ] Write your own training loop (no Trainer API yet)
- [ ] Include:
  - [ ] Forward pass
  - [ ] Loss calculation
  - [ ] Backpropagation
  - [ ] Optimizer step
- [ ] Print:
  - [ ] Loss every few steps

---
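The skeleton of a hand-written training loop is the same whatever the model. A runnable sketch on a tiny regression model so it executes anywhere; for GPT-2 the loss would instead come from `model(input_ids, labels=input_ids).loss`:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 1)  # toy stand-in for the LLM
opt = torch.optim.AdamW(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

x = torch.randn(64, 8)
y = x.sum(dim=1, keepdim=True)  # simple learnable target

losses = []
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)  # forward pass + loss calculation
    loss.backward()              # backpropagation
    opt.step()                   # optimizer step
    losses.append(loss.item())
    if step % 20 == 0:
        print(f"step {step:3d}  loss {losses[-1]:.4f}")

print("loss fell:", losses[-1] < losses[0])
```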
## 📉 7. Observe Training Behaviour

- [ ] Track:
  - [ ] Training loss over time
- [ ] Answer:
  - [ ] Is loss decreasing?
  - [ ] Is it noisy or stable?
- [ ] (Optional)
  - [ ] Plot loss curve

---
## 🧪 8. Evaluate Model

- [ ] Generate text from model:
  - [ ] Before training
  - [ ] After training
- [ ] Compare:
  - [ ] Coherence
  - [ ] Structure
- [ ] Note:
  - [ ] Any overfitting signs (repetition, memorization)

---
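Before/after generations can be produced with `model.generate`. A sketch, assuming the pretrained `gpt2` checkpoint; the fine-tuned model would be loaded the same way from its save directory:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,                        # sampling makes coherence differences visible
    pad_token_id=tokenizer.eos_token_id,   # silences the missing-pad-token warning
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```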
## ⚖️ 9. Try LoRA Fine-Tuning

- [ ] Add LoRA using `peft`
- [ ] Freeze base model weights
- [ ] Train only adapter layers
- [ ] Compare vs full fine-tuning:
  - [ ] Speed
  - [ ] Memory usage
  - [ ] Output quality

---
## 🧠 10. Understand Convergence

- [ ] Identify:
  - [ ] When loss plateaus
- [ ] Check validation loss:
  - [ ] Does it increase? (overfitting)
- [ ] Write down:
  - [ ] What “good training” looks like

---
## ⚙️ 11. Model Saving & Loading

- [ ] Save:
  - [ ] Model weights
  - [ ] Tokenizer
- [ ] Reload model
- [ ] Confirm:
  - [ ] Outputs remain consistent

---
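The consistency check after reloading can be rehearsed with plain `state_dict` round-tripping; for Hugging Face models the equivalent calls are `save_pretrained` / `from_pretrained`. A minimal sketch using a temporary directory:

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)
x = torch.randn(3, 4)
before = model(x)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pt")
    torch.save(model.state_dict(), path)        # save weights

    reloaded = nn.Linear(4, 2)                  # fresh, randomly initialised copy
    reloaded.load_state_dict(torch.load(path))  # restore saved weights
    after = reloaded(x)

print("outputs identical:", torch.allclose(before, after))
```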
# 🚀 PART 2 — Infrastructure & Serving

---
## 🧠 12. Understand Inference Flow

- [ ] Write down:
  - [ ] Steps from input → output
- [ ] Measure:
  - [ ] Time taken for a single generation

---
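A small timing helper makes the measurement repeatable. The workload below is a stand-in; a single `model.generate(...)` call would be dropped in its place:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the project this would be one generation call.
result, seconds = timed(sum, range(1_000_000))
print(f"took {seconds * 1e3:.2f} ms, result={result}")
```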
## ⚡ 13. Optimize Inference

- [ ] Test batching:
  - [ ] Multiple inputs at once
- [ ] Compare:
  - [ ] Latency vs throughput

---
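The latency-vs-throughput tradeoff can be seen in a toy cost model before touching a GPU. The numbers below are invented for illustration: a fixed per-forward-pass overhead plus a per-sequence cost.

```python
# Invented costs for illustration only.
overhead_ms = 30.0  # fixed cost of one forward pass (kernel launches, etc.)
per_seq_ms = 5.0    # additional cost per sequence in the batch

for batch_size in (1, 4, 16):
    latency_ms = overhead_ms + per_seq_ms * batch_size
    throughput = batch_size / (latency_ms / 1000.0)  # sequences per second
    print(f"batch={batch_size:2d}  latency={latency_ms:6.1f} ms  "
          f"throughput={throughput:6.1f} seq/s")
```

Larger batches raise per-request latency but also raise total throughput, which is exactly the tradeoff this step asks about.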
## 🧮 14. Apply Quantization

- [ ] Load model in:
  - [ ] 8-bit
  - [ ] (Optional) 4-bit
- [ ] Compare:
  - [ ] Memory usage
  - [ ] Speed
  - [ ] Output quality

---
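A sketch of the 8-bit load using `transformers`' `BitsAndBytesConfig`. This is a config fragment rather than something to run on CPU: it requires a CUDA GPU plus the `bitsandbytes` package from the setup step.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
# For the optional 4-bit variant: BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=quant_cfg,
    device_map="auto",  # places layers on the available GPU(s)
)
```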
## 🖥️ 15. Simulate Real-World Usage

- [ ] Pretend you have:
  - [ ] Multiple users hitting your model
- [ ] Think through:
  - [ ] How would you queue requests?
  - [ ] When would you batch?
  - [ ] When would you scale?

---
## ☁️ 16. Understand Infra Concepts

- [ ] Research:
  - [ ] GPU provisioning
  - [ ] Autoscaling
  - [ ] Model warm starts
- [ ] Understand:
  - [ ] Why loading time matters
  - [ ] Why GPUs shouldn’t sit idle

---
## 🧬 17. (Bonus) DICOM Exploration

- [ ] Learn:
  - [ ] What DICOM files are
- [ ] Think:
  - [ ] How LLMs could be used with medical data
- [ ] Note:
  - [ ] Privacy + domain challenges

---
## ✍️ 18. Write Your Blog

### Structure

- [ ] Introduction:
  - [ ] What is an LLM really?
- [ ] Training:
  - [ ] Tokenization
  - [ ] Training loop
  - [ ] Loss behaviour
- [ ] Fine-tuning:
  - [ ] Full vs LoRA
- [ ] Challenges:
  - [ ] What went wrong
- [ ] Infrastructure:
  - [ ] Serving challenges
  - [ ] Batching
  - [ ] Quantization
- [ ] Key Learnings:
  - [ ] What surprised you
  - [ ] What actually matters

---
## ✅ Final Deliverables

- [ ] Working training script
- [ ] LoRA vs full fine-tune comparison
- [ ] Basic inference script
- [ ] Blog post (clear + honest)
- [ ] Notes showing your understanding

---
## ⚠️ Keep Yourself Honest

- [ ] Can you explain the training loop without looking?
- [ ] Do you understand why loss decreases?
- [ ] Can you explain batching vs latency tradeoffs?
- [ ] Do you know what would break at scale?
LLM-gpt.ipynb (new file): 1445 lines
File diff suppressed because it is too large
model_output_10_epochs.zip (new binary file)
Binary file not shown.