Mastering LLM Fine-Tuning: Adapt, Optimise, and Deploy with Python
Agenda

- Introduction to LLMs
- Limitations of LLMs
- LLM fine-tuning process
- LLM fine-tuning vs RAG
- Live Demo
- The potential of RAG
- Technical Deep Dive
  - Dataset preparation
  - Choosing the right model
  - Fine-tuning with Python
  - Model optimisation
  - Deployment strategies
My journey into Large Language Models (LLMs) began with curiosity: How could I adapt and optimise LLMs for real-world use?
This blog transforms my recent PyMUG presentation into a digestible guide on fine-tuning LLMs with Python.
What surprised me was how accessible fine-tuning can be with the right techniques and tools. This article aims to demystify the process of fine-tuning LLMs using Python—with practical code examples and key takeaways.
Large Language Models (LLMs)
LLMs are deep neural networks designed to generate human-like text by learning from vast datasets. They have revolutionised how we interact with AI, powering applications like chatbots, translators and code assistants.
How Do They Work?
- Built on transformer architectures with self-attention mechanisms.
- Understand sequential context and know how words relate to each other in a sentence.
- Examples of popular LLMs:
  - OpenAI o1
  - Llama 3.2
  - Mistral
  - DeepSeek-R1
  - Qwen 2.5
The Transformer Foundation
Before we dive into fine-tuning, it is crucial to understand the transformer architecture, which is the foundation of every LLM.
Key Components:
- Embeddings: Words are converted into numerical vectors.
- Positional Encoding: Injects information about word order.
- Self-Attention Mechanism: Allows the model to focus on different words in the sequence when generating each word.
🔎 Self-Attention in Action: Imagine predicting the next word in “The cat sat on the ___.” The model evaluates all previous words to determine that “mat” is the best fit.
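To make the embedding step concrete, here is a minimal sketch (my illustration, not part of the original talk) that turns a sentence into the numerical vectors a model actually sees, using Hugging Face Transformers and the DistilGPT2 checkpoint that appears later in this post:

```python
# A minimal sketch of tokenisation + embedding lookup with Hugging Face
# Transformers, using the distilgpt2 checkpoint used later in this post.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Tokenise the sentence into integer IDs...
ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids

# ...then look each ID up in the model's embedding matrix.
vectors = model.get_input_embeddings()(ids)
print(ids.shape, vectors.shape)  # e.g. torch.Size([1, 6]) torch.Size([1, 6, 768])
```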
🔍 Under the Hood: Transformer Architecture
This architecture powers every modern LLM, so let's look at how its pieces fit together before we get to fine-tuning.
🏗️ Basic Transformer Architecture
The transformer consists of two key components:
- Encoder: Reads and understands the input text (e.g., for translation).
- Decoder: Generates the output text (e.g., in the target language).
For LLMs like GPT, we mostly use the decoder-only transformer (focused on text generation).
🔄 Core Components:
| Component | Purpose |
|---|---|
| Input Embedding | Converts words into numerical vectors |
| Positional Encoding | Adds information about word order |
| Self-Attention Layers | Allow each word to focus on relevant words in the sequence |
| Feed-Forward Network | Processes attention outputs through non-linear transformations |
| Layer Normalisation | Stabilises training |
| Residual Connections | Help preserve the original input alongside newly learned information |
🧩 Core Problems
Transformers revolutionised how models process sequences by solving three core problems that older models such as RNNs and LSTMs struggled with:
- How do we understand the position of words in sequences?
- How do we handle the context of words in sequences?
- How do we make the training process faster and scalable?
Let’s break down how transformers solve these.
1️⃣ Word Positioning: Positional Encoding
Unlike RNNs, transformers don’t process sequences in order. They see all words simultaneously (parallel processing). But this raises a problem:
How does the model know the order of the words?
The solution: Positional Encoding
Transformers inject positional information into word embeddings using sine and cosine functions. This lets the model distinguish between words at different positions.
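Here is a short, self-contained sketch of the sinusoidal encoding, following the formulation from the original "Attention Is All You Need" paper:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding from the original Transformer paper."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions get cosine
    return pe

# Each row is added to the corresponding word embedding, giving every
# position a unique, smoothly varying signature.
print(positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```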
2️⃣ Context Handling: Self-Attention Mechanism
How do we capture the relationships between words?
Solution: Self-Attention 🔥
- Every word looks at every other word in the sentence.
- It calculates attention scores to decide which words to focus on.
- Words with higher scores contribute more.
- This gives the model context awareness.
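Here is what those attention scores look like numerically: a minimal single-head sketch in NumPy (real transformers use multiple heads, learned weight matrices, and causal masking):

```python
import numpy as np

def self_attention(x: np.ndarray, w_q, w_k, w_v) -> np.ndarray:
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # every word scores every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: higher scores contribute more
    return weights @ v                               # context-aware mix of value vectors

d_model = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))                    # 5 "words" with toy embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (5, 8)
```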
3️⃣ Training Speed: Parallel Processing
Older models like RNNs process text sequentially, making them slow and hard to scale.
Transformers:
- Process entire sequences in parallel.
- Use GPU acceleration effectively.
- This allows training on massive datasets much faster.
⚠️ Limitations of LLMs
Despite their capabilities, LLMs have inherent limitations:
- Hallucinations: Generate plausible but incorrect outputs.
- Knowledge Cut-off: Know nothing beyond their training cut-off date.
- Resource Intensive: Require significant GPU power for training, and sometimes even for inference.
🛠️ Why Fine-Tuning?
Fine-tuning addresses these limitations by adapting general-purpose LLMs to specific tasks or industries.
Fine-tuning:
- Improves accuracy and relevance
- Builds trust in AI-driven tools
- Mitigates hallucinations and knowledge gaps
Why fine-tune?

| General Models | Fine-Tuned Models |
|---|---|
| Broad knowledge but lacking specialisation | Tailored to specific domains/tasks |
| Prone to hallucinations in niche areas | Improved accuracy, reduced hallucinations |
| Static knowledge post-training | Updated with domain-specific knowledge |
Fine-Tuning Approaches
There are multiple approaches, depending on resources, task complexity, and desired performance.
1️⃣ Task-Specific Fine-Tuning
- Fine-tunes the model on a specific dataset for a particular task (e.g., customer support chat, medical diagnostics).
- Full control over the data and objectives.
- Suitable when you have domain-specific data.
Example: Training a model to answer legal questions by fine-tuning it on legal documents.
2️⃣ Full Fine-Tuning
- Updates all the parameters of the model.
- Resource-heavy as it requires powerful GPUs and large datasets.
- Provides maximum flexibility but isn’t always practical for large models.
Use case: When you have enough data and compute resources to customise the model completely.
3️⃣ Parameter-Efficient Fine-Tuning (PEFT)
- Instead of updating the whole model, only a small subset of parameters is trained.
- Methods include LoRA (Low-Rank Adaptation), prefix tuning, adapter layers, etc.
- Much lighter on resources while achieving strong performance.
Why I used PEFT:
I leveraged LoRA for my demo to fine-tune DistilGPT2 efficiently on consumer-grade GPUs.
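Here is roughly what that setup looks like with the Hugging Face `peft` library. This is a sketch: `lora_alpha` and the dropout value are illustrative choices, but rank 8 on GPT-2's `c_attn` projection is what yields the 147,456 trainable parameters reported later in this post.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor (illustrative value)
    lora_dropout=0.05,          # illustrative value
    target_modules=["c_attn"],  # GPT-2's fused query/key/value projection
    fan_in_fan_out=True,        # needed because GPT-2 uses Conv1D layers
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 147,456 — two rank-8 matrices per c_attn layer across
# DistilGPT2's 6 layers, a tiny fraction of its ~82M total parameters
```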
4️⃣ Instruction Fine-Tuning
- Trains the model to follow human instructions better.
- Involves reframing datasets as instructions and responses.
Example:
Rather than feeding Q&A pairs, the model is trained on instructions like “Summarise this article” or “Translate this sentence”.
Why it matters:
Instruction-tuned models (e.g., GPT-Instruct) generalise better for unseen tasks.
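A hypothetical example of that reframing (the field names follow the popular Alpaca-style format; the content is made up for illustration):

```python
# Reframing a raw Q&A pair as an instruction record (Alpaca-style fields).
raw_pair = {
    "question": "What is the capital of France?",
    "answer": "Paris.",
}

instruction_record = {
    "instruction": "Answer the following geography question concisely.",
    "input": raw_pair["question"],
    "output": raw_pair["answer"],
}

# During training, the fields are typically joined into a single prompt:
prompt = (
    f"### Instruction:\n{instruction_record['instruction']}\n\n"
    f"### Input:\n{instruction_record['input']}\n\n"
    f"### Response:\n{instruction_record['output']}"
)
print(prompt)
```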
5️⃣ Transfer Learning
- Takes a pretrained model and adapts it to a related but different task.
Example:
Starting with a model trained on general English text, then fine-tuning it on scientific articles.
Benefit:
You leverage the model’s existing knowledge instead of starting from scratch.
6️⃣ Multi-Task Learning
- Trains the model on multiple tasks simultaneously.
- The model learns shared representations that improve generalisation across tasks.
Example:
Fine-tuning on summarisation, translation and Q&A all at once.
7️⃣ Sequential Fine-Tuning
- Fine-tunes the model in stages—first on a general dataset, then on specific datasets.
Example:
- Fine-tune on Wikipedia articles (general).
- Fine-tune again on legal documents (specific domain).
📝 Choosing the Right Approach
| Approach | Resource Use | Flexibility | When to Use |
|---|---|---|---|
| Full Fine-Tuning | High | Maximum | When resources & large datasets are available |
| PEFT (e.g., LoRA) | Low | Moderate | When compute is limited |
| Instruction Fine-Tuning | Moderate | High | For general-purpose models that follow instructions |
| Transfer Learning | Low | Moderate | For adapting pretrained models to related tasks |
| Multi-Task Learning | High | High | When handling diverse tasks |
| Sequential Fine-Tuning | Moderate | High | When refining models step by step |
RAG vs Fine-tuning
Both fine-tuning and Retrieval-Augmented Generation (RAG) adapt LLMs to specific tasks, but they take different approaches.
🛠️ Fine-Tuning
Train (or re-train) the model on task-specific data—embedding knowledge directly into the model.
- Best for:
- Domain specialisation (e.g., legal, medical).
- Scenarios where the knowledge doesn't change often.
- Pros: High accuracy, offline-ready.
- Cons: Requires compute, static knowledge.
🔎 RAG (Retrieval-Augmented Generation)
Combine the model with a retrieval system (like a vector database). The model fetches relevant documents at runtime for context.
- Best for:
- Dynamic information (e.g., FAQs, latest updates).
- Low compute environments.
- Pros: Always up-to-date and lightweight.
- Cons: Needs a retrieval system, and answer quality depends on what gets retrieved.
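A minimal sketch of the retrieval step, assuming the `sentence-transformers` library for embeddings; a real system would typically use a vector database rather than the in-memory list and made-up documents shown here:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy knowledge base (illustrative content).
documents = [
    "Our support desk is open 9am-5pm on weekdays.",
    "Refunds are processed within 14 days of purchase.",
    "The latest firmware version is 2.3.1.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = encoder.encode([query], normalize_embeddings=True)
    scores = doc_vectors @ q[0]
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved context is prepended to the user's question and passed
# to the (unmodified) LLM at runtime.
context = retrieve("How long do refunds take?")[0]
prompt = f"Context: {context}\n\nQuestion: How long do refunds take?"
print(prompt)
```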
👨💻 Technical Deep Dive: Fine-Tuning GPT2 with Python
| Component | Details |
|---|---|
| Model | GPT2 / DistilGPT2 |
| Language | Python |
| Framework | PyTorch + Hugging Face Transformers |
| Dataset | WikiText-2 |
| Hardware | Apple MPS (16GB) + RTX 4060 Ti (8GB) |
| Fine-tuning | LoRA (Low-Rank Adaptation) |
I selected DistilGPT2 (a smaller version of GPT2) to make the fine-tuning process more resource-friendly.
Full Fine-Tuning Performance

*Use Case 1: full LLM fine-tuning, run on both Apple MPS and CUDA.*
❌ Full Fine-Tuning: Bad Results 😭
I initially attempted full fine-tuning on DistilGPT2 using both Apple MPS and CUDA GPUs.
- Result? The model struggled heavily.
- It couldn't even complete a sentence; instead, it repeated the input question or generated garbled text.
Why?
Full fine-tuning is resource-heavy and requires longer training cycles with larger datasets. In my setup (limited hardware, a single epoch), that simply wasn't enough: updating every weight on a small dataset caused the model to lose its original capabilities, a failure mode known as catastrophic forgetting.
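For reference, a full fine-tuning run looks roughly like this (a minimal sketch using the Hugging Face `Trainer` on WikiText-2; the batch size and learning rate are illustrative, not my exact demo values):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# WikiText-2, filtered and tokenised for causal LM training.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilgpt2-wikitext2-full",
        num_train_epochs=1,              # a single epoch, as in the demo
        per_device_train_batch_size=8,   # illustrative value
        learning_rate=5e-5,              # illustrative value
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates ALL of the model's parameters
```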
This failure led me to explore parameter-efficient fine-tuning (PEFT) instead.
Use Case 2: Parameter-Efficient Fine-Tuning
✅ Did It Really Work? 😮
After switching to parameter-efficient fine-tuning (PEFT) with LoRA on DistilGPT2, the results were surprisingly good!
- Base model: Completely missed the mark. Either generated irrelevant text or repeated phrases. 😂
- Fine-tuned model: Got it spot on for the task-specific queries I trained it on!
Example:
When prompted with a question from the WikiText domain:
- Base model: “The capital of France is… France is…”
- Fine-tuned model: “The capital of France is Paris.” 🎯
Even with just one epoch and 147,456 trainable parameters, LoRA fine-tuning delivered reliable, targeted outputs!
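To reproduce that comparison, the saved LoRA adapter can be loaded on top of a frozen base model (a sketch; `./lora-distilgpt2` is a hypothetical path standing in for wherever the adapter was saved):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
base = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Load the trained LoRA weights on top of a fresh copy of the base model.
# "./lora-distilgpt2" is a hypothetical path to the saved adapter.
tuned = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained("distilgpt2"), "./lora-distilgpt2"
)

def generate(model, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

prompt = "The capital of France is"
print("Base:      ", generate(base, prompt))
print("Fine-tuned:", generate(tuned, prompt))
```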
Conclusion
Check out the full code and setup instructions on my GitHub:
🔗 GitHub: https://github.com/nirmal-k-r/pymug_llm_fine_tuning.git