Mastering LLM Fine-Tuning: Adapt, Optimise, and Deploy with Python
Agenda

- Introduction to LLMs
- Limitations of LLMs
- LLM fine-tuning process
- LLM fine-tuning vs RAG
- Live Demo
- The potential of RAG
- Technical Deep Dive
  - Dataset preparation
  - Choosing the right model
  - Fine-tuning with Python
  - Model optimisation
  - Deployment strategies
My journey into Large Language Models (LLMs) began with curiosity: How could I adapt and optimise LLMs for real-world use?
This blog transforms my recent PyMUG presentation into a digestible guide on fine-tuning LLMs with Python.
What surprised me was how accessible fine-tuning can be with the right techniques and tools. This article aims to demystify the process of fine-tuning LLMs using Python—with practical code examples and key takeaways.
Large Language Models (LLMs)
LLMs are deep neural networks designed to generate human-like text by learning from vast datasets. They have revolutionised how we interact with AI, powering applications like chatbots, translators and code assistants.
How Do They Work?
- Built on transformer architectures with self-attention mechanisms.
- Understand sequential context and know how words relate to each other in a sentence.
- Examples of popular LLMs:
  - OpenAI o1
  - Llama 3.2
  - Mistral
  - DeepSeek-R1
  - Qwen 2.5
The Transformer Foundation
Before we dive into fine-tuning, it is crucial to understand the transformer architecture, which is the foundation of every LLM.
Key Components:
- Embeddings: Words are converted into numerical vectors.
- Positional Encoding: Injects information about word order.
- Self-Attention Mechanism: Allows the model to focus on different words in the sequence when generating each word.
🔎 Self-Attention in Action: Imagine predicting the next word in “The cat sat on the ___.” The model evaluates all previous words to determine that “mat” is the best fit.
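To make the embedding step concrete, here is a minimal sketch (my illustration, not part of the original talk) that turns a sentence into the numerical vectors a model actually sees, using Hugging Face Transformers and the DistilGPT2 checkpoint that appears later in this post:

```python
# A minimal sketch of tokenisation + embedding lookup with Hugging Face
# Transformers, using the distilgpt2 checkpoint used later in this post.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Tokenise the sentence into integer IDs...
ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids

# ...then look each ID up in the model's embedding matrix.
vectors = model.get_input_embeddings()(ids)
print(ids.shape, vectors.shape)  # e.g. torch.Size([1, 6]) torch.Size([1, 6, 768])
```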
🔍 Under the Hood: Transformer Architecture
This architecture powers every modern LLM, so let's look at how its pieces fit together before we get to fine-tuning.
🏗️ Basic Transformer Architecture
The transformer consists of two key components:
- Encoder: Reads and understands the input text (e.g., for translation).
- Decoder: Generates the output text (e.g., in the target language).
For LLMs like GPT, we mostly use the decoder-only transformer (focused on text generation).
🔄 Core Components:
| Component | Purpose |
|---|---|
| Input Embedding | Converts words into numerical vectors |
| Positional Encoding | Adds information about word order |
| Self-Attention Layers | Allow each word to focus on relevant words in the sequence |
| Feed-Forward Network | Processes attention outputs through non-linear transformations |
| Layer Normalisation | Stabilises training |
| Residual Connections | Help preserve the original input alongside newly learned information |
🧩 Core Problems
Transformers revolutionised how models process sequences by solving three core problems that older models such as RNNs and LSTMs struggled with:
- How do we understand the position of words in sequences?
- How do we handle the context of words in sequences?
- How do we make the training process faster and scalable?
Let’s break down how transformers solve these.
1️⃣ Word Positioning: Positional Encoding
Unlike RNNs, transformers don’t process sequences in order. They see all words simultaneously (parallel processing). But this raises a problem:
How does the model know the order of the words?
The solution: Positional Encoding
Transformers inject positional information into word embeddings using sine and cosine functions. This lets the model distinguish between words at different positions.
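Here is a short, self-contained sketch of the sinusoidal encoding, following the formulation from the original "Attention Is All You Need" paper:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding from the original Transformer paper."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions get cosine
    return pe

# Each row is added to the corresponding word embedding, giving every
# position a unique, smoothly varying signature.
print(positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```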
2️⃣ Context Handling: Self-Attention Mechanism
How do we capture the relationships between words?
Solution: Self-Attention 🔥
- Every word looks at every other word in the sentence.
- It calculates attention scores to decide which words to focus on.
- Words with higher scores contribute more.
- This gives the model context awareness.
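Here is what those attention scores look like numerically: a minimal single-head sketch in NumPy (real transformers use multiple heads, learned weight matrices, and causal masking):

```python
import numpy as np

def self_attention(x: np.ndarray, w_q, w_k, w_v) -> np.ndarray:
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # every word scores every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: higher scores contribute more
    return weights @ v                               # context-aware mix of value vectors

d_model = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))                    # 5 "words" with toy embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (5, 8)
```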
3️⃣ Training Speed: Parallel Processing
Older models like RNNs process text sequentially, making them slow and hard to scale.
Transformers:
- Process entire sequences in parallel.
- Use GPU acceleration effectively.
- This allows training on massive datasets much faster.
⚠️ Limitations of LLMs
Despite their capabilities, LLMs have inherent limitations:
- Hallucinations: Generate plausible but incorrect outputs.
- Knowledge Cut-off: Know nothing beyond their training cut-off date.
- Resource Intensive: Require significant GPU power for training, and sometimes even for inference.
🛠️ Why Fine-Tuning?
Fine-tuning addresses these limitations by adapting general-purpose LLMs to specific tasks or industries.
Fine-tuning:
- Improves accuracy and relevance
- Builds trust in AI-driven tools
- Mitigates hallucinations and knowledge gaps
Why fine-tune?

| General Models | Fine-Tuned Models |
|---|---|
| Broad knowledge but lacking specialisation | Tailored to specific domains/tasks |
| Prone to hallucinations in niche areas | Improved accuracy, reduced hallucinations |
| Static knowledge post-training | Updated with domain-specific knowledge |
Fine-Tuning Approaches
There are multiple approaches, depending on resources, task complexity, and desired performance.
1️⃣ Task-Specific Fine-Tuning
- Fine-tunes the model on a specific dataset for a particular task (e.g., customer support chat, medical diagnostics).
- Full control over the data and objectives.
- Suitable when you have domain-specific data.
Example: Training a model to answer legal questions by fine-tuning it on legal documents.
2️⃣ Full Fine-Tuning
- Updates all the parameters of the model.
- Resource-heavy as it requires powerful GPUs and large datasets.
- Provides maximum flexibility but isn’t always practical for large models.
Use case: When you have enough data and compute resources to customise the model completely.
3️⃣ Parameter-Efficient Fine-Tuning (PEFT)
- Instead of updating the whole model, only a small subset of parameters is trained.
- Methods include LoRA (Low-Rank Adaptation), prefix tuning, adapter layers, etc.
- Much lighter on resources while achieving strong performance.
Why I used PEFT:
I leveraged LoRA for my demo to fine-tune DistilGPT2 efficiently on consumer-grade GPUs.
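Here is roughly what that setup looks like with the Hugging Face `peft` library. This is a sketch: `lora_alpha` and the dropout value are illustrative choices, but rank 8 on GPT-2's `c_attn` projection is what yields the 147,456 trainable parameters reported later in this post.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor (illustrative value)
    lora_dropout=0.05,          # illustrative value
    target_modules=["c_attn"],  # GPT-2's fused query/key/value projection
    fan_in_fan_out=True,        # needed because GPT-2 uses Conv1D layers
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 147,456 — two rank-8 matrices per c_attn layer across
# DistilGPT2's 6 layers, a tiny fraction of its ~82M total parameters
```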
4️⃣ Instruction Fine-Tuning
- Trains the model to follow human instructions better.
- Involves reframing datasets as instructions and responses.
Example:
Rather than feeding Q&A pairs, the model is trained on instructions like “Summarise this article” or “Translate this sentence”.
Why it matters:
Instruction-tuned models (e.g., GPT-Instruct) generalise better for unseen tasks.
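A hypothetical example of that reframing (the field names follow the popular Alpaca-style format; the content is made up for illustration):

```python
# Reframing a raw Q&A pair as an instruction record (Alpaca-style fields).
raw_pair = {
    "question": "What is the capital of France?",
    "answer": "Paris.",
}

instruction_record = {
    "instruction": "Answer the following geography question concisely.",
    "input": raw_pair["question"],
    "output": raw_pair["answer"],
}

# During training, the fields are typically joined into a single prompt:
prompt = (
    f"### Instruction:\n{instruction_record['instruction']}\n\n"
    f"### Input:\n{instruction_record['input']}\n\n"
    f"### Response:\n{instruction_record['output']}"
)
print(prompt)
```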
5️⃣ Transfer Learning
- Takes a pretrained model and adapts it to a related but different task.
Example:
Starting with a model trained on general English text, then fine-tuning it on scientific articles.
Benefit:
You leverage the model’s existing knowledge instead of starting from scratch.
6️⃣ Multi-Task Learning
- Trains the model on multiple tasks simultaneously.
- The model learns shared representations that improve generalisation across tasks.
Example:
Fine-tuning on summarisation, translation and Q&A all at once.
7️⃣ Sequential Fine-Tuning
- Fine-tunes the model in stages—first on a general dataset, then on specific datasets.
Example:
- Fine-tune on Wikipedia articles (general).
- Fine-tune again on legal documents (specific domain).
📝 Choosing the Right Approach
| Approach | Resource Use | Flexibility | When to Use |
|---|---|---|---|
| Full Fine-Tuning | High | Maximum | When resources & large datasets are available |
| PEFT (e.g., LoRA) | Low | Moderate | When compute is limited |
| Instruction Fine-Tuning | Moderate | High | For general-purpose models that follow instructions |
| Transfer Learning | Low | Moderate | For adapting pretrained models to related tasks |
| Multi-Task Learning | High | High | When handling diverse tasks |
| Sequential Fine-Tuning | Moderate | High | When refining models step by step |
RAG vs Fine-tuning
Both fine-tuning and Retrieval-Augmented Generation (RAG) adapt LLMs to specific tasks, but they take different approaches.
🛠️ Fine-Tuning
Train (or re-train) the model on task-specific data—embedding knowledge directly into the model.
- Best for:
- Domain specialisation (e.g., legal, medical).
- Scenarios where the knowledge doesn't change often.
- Pros: High accuracy, offline-ready.
- Cons: Requires compute, static knowledge.
🔎 RAG (Retrieval-Augmented Generation)
Combine the model with a retrieval system (like a vector database). The model fetches relevant documents at runtime for context.
- Best for:
- Dynamic information (e.g., FAQs, latest updates).
- Low compute environments.
- Pros: Always up-to-date and lightweight.
- Cons: Needs a retrieval system, and answer quality depends on what gets retrieved.
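A minimal sketch of the retrieval step, assuming the `sentence-transformers` library for embeddings; a real system would typically use a vector database rather than the in-memory list and made-up documents shown here:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy knowledge base (illustrative content).
documents = [
    "Our support desk is open 9am-5pm on weekdays.",
    "Refunds are processed within 14 days of purchase.",
    "The latest firmware version is 2.3.1.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = encoder.encode([query], normalize_embeddings=True)
    scores = doc_vectors @ q[0]
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved context is prepended to the user's question and passed
# to the (unmodified) LLM at runtime.
context = retrieve("How long do refunds take?")[0]
prompt = f"Context: {context}\n\nQuestion: How long do refunds take?"
print(prompt)
```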
👨💻 Technical Deep Dive: Fine-Tuning GPT2 with Python
| Component | Details |
|---|---|
| Model | GPT2 / DistilGPT2 |
| Language | Python |
| Framework | PyTorch + Hugging Face Transformers |
| Dataset | WikiText-2 |
| Hardware | Apple MPS (16GB) + RTX 4060 Ti (8GB) |
| Fine-tuning | LoRA (Low-Rank Adaptation) |
I selected DistilGPT2 (a smaller version of GPT2) to make the fine-tuning process more resource-friendly.
Full Fine-Tuning Performance

*Use Case 1: full LLM fine-tuning, run on both Apple MPS and CUDA.*
❌ Full Fine-Tuning: Bad Results 😭
I initially attempted full fine-tuning on DistilGPT2 using both Apple MPS and CUDA GPUs.
- Result? The model struggled heavily.
- It couldn't even complete a sentence; instead, it repeated the input question or generated garbled text.
Why?
Full fine-tuning is resource-heavy and requires longer training cycles with larger datasets. In my setup (limited hardware, a single epoch), that simply wasn't enough: updating every weight on a small dataset caused the model to lose its original capabilities, a failure mode known as catastrophic forgetting.
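For reference, a full fine-tuning run looks roughly like this (a minimal sketch using the Hugging Face `Trainer` on WikiText-2; the batch size and learning rate are illustrative, not my exact demo values):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# WikiText-2, filtered and tokenised for causal LM training.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilgpt2-wikitext2-full",
        num_train_epochs=1,              # a single epoch, as in the demo
        per_device_train_batch_size=8,   # illustrative value
        learning_rate=5e-5,              # illustrative value
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates ALL of the model's parameters
```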
This failure led me to explore parameter-efficient fine-tuning (PEFT) instead.
Use Case 2: Parameter-Efficient Fine-Tuning
✅ Did It Really Work? 😮
After switching to parameter-efficient fine-tuning (PEFT) with LoRA on DistilGPT2, the results were surprisingly good!
- Base model: Completely missed the mark. Either generated irrelevant text or repeated phrases. 😂
- Fine-tuned model: Got it spot on for the task-specific queries I trained it on!
Example:
When prompted with a question from the WikiText domain:
- Base model: “The capital of France is… France is…”
- Fine-tuned model: “The capital of France is Paris.” 🎯
Even with just one epoch and 147,456 trainable parameters, LoRA fine-tuning delivered reliable, targeted outputs!
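To reproduce that comparison, the saved LoRA adapter can be loaded on top of a frozen base model (a sketch; `./lora-distilgpt2` is a hypothetical path standing in for wherever the adapter was saved):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
base = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Load the trained LoRA weights on top of a fresh copy of the base model.
# "./lora-distilgpt2" is a hypothetical path to the saved adapter.
tuned = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained("distilgpt2"), "./lora-distilgpt2"
)

def generate(model, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

prompt = "The capital of France is"
print("Base:      ", generate(base, prompt))
print("Fine-tuned:", generate(tuned, prompt))
```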
Conclusion
Check out the full code and setup instructions on my GitHub:
🔗 GitHub: https://github.com/nirmal-k-r/pymug_llm_fine_tuning.git