Step-by-Step Guide to Using Parameter Efficient Fine Tuning (PEFT) in LLMs

Written by Aionlinecourse

Large Language Models (LLMs) such as GPT-4, BERT and Flan-T5 have revolutionized the field of Artificial Intelligence (AI), enabling applications in chatbots, content generation, machine translation and text summarization. Thanks to open-source platforms like Hugging Face, developers can now easily access and integrate these pre-trained LLMs into their workflows.

However, pre-trained models are not always ideal for specialized tasks. They are trained on broad datasets, meaning they may lack accuracy and relevance in domain-specific applications.

For example:

  • Scenario: Suppose you need an LLM for medical text summarization. A general-purpose model trained on diverse internet text may not perform well in summarizing complex medical terminology and reports.
  • Solution: Fine-Tuning!

The catch is that conventional full fine-tuning consumes large amounts of compute and GPU memory. It also suffers from catastrophic forgetting, where the model loses much of its original general-purpose understanding while being retrained for a new task.

LoRA (Low-Rank Adaptation) and, more broadly, PEFT (Parameter-Efficient Fine-Tuning) tackle these problems by restricting fine-tuning to a small set of parameters. This cuts training time, cost and memory usage while preserving accuracy.

What is Fine-Tuning?

Simply put, fine-tuning is the process of further training a pre-trained model to improve its accuracy on a targeted application. Training a model from scratch is prohibitively expensive, so pre-trained LLMs are adapted to custom use cases instead.

How Does Fine-Tuning Work?

Fine-tuning involves adjusting the model's weights by training it on a domain-specific dataset. This process helps the model learn industry-specific terminology, patterns and structures.

Example:

Imagine you have a chatbot trained on general conversations. If you need it to handle legal documents, you can fine-tune it with legal datasets to enhance its ability to understand and respond accurately in a legal context.

Why Fine-Tune a Model?

  • Domain-Specific Knowledge - LLMs are generalists by default. Fine-tuning allows them to specialize in medicine, law, finance, etc.
  • Improved Accuracy - A fine-tuned model outperforms a general model when dealing with specialized tasks.
  • Cost Efficiency - Training an LLM from scratch is expensive. Fine-tuning an existing model saves time, computational power and financial resources.

Fine-Tuning Challenges

Despite its benefits, fine-tuning comes with several challenges:

High Memory Usage

Most LLMs contain billions of parameters, making full fine-tuning extremely memory-intensive.

Example:

  • A model with 1B parameters needs roughly 4GB of GPU memory just to store its weights in 32-bit precision (a rough estimate follows below).
  • A 100B-parameter model would need 400GB+ of GPU memory for the weights alone, far beyond the capabilities of most machines.

Only large organizations with high-end GPUs can afford full fine-tuning.
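
For a rough sense of where these numbers come from, here is a back-of-the-envelope estimate in Python. It assumes 32-bit weights and the Adam optimizer, which keeps roughly two extra states per parameter; the exact figures depend on precision, optimizer choice and activation memory.

params = 1_000_000_000                # a 1B-parameter model
bytes_per_param = 4                   # 32-bit (fp32) weights
weights_gb = params * bytes_per_param / 1e9
print(f'Weights alone: {weights_gb:.0f} GB')                            # ~4 GB
# Full fine-tuning also stores gradients plus two Adam optimizer states per parameter
full_finetune_gb = params * bytes_per_param * (1 + 1 + 2) / 1e9
print(f'Full fine-tuning, rough estimate: {full_finetune_gb:.0f} GB')   # ~16 GB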

Catastrophic Forgetting

Fine-tuning on a new dataset may cause the model to forget previous knowledge.

Example:

  • If a chatbot is fine-tuned for medical text summarization, it might lose its ability to answer general questions about daily conversations.

The model becomes too specialized, reducing its flexibility.

High Training Costs

Fine-tuning requires powerful GPUs and long training times, which increase costs significantly.

  • Fine-tuning an entire model is impractical for small businesses and researchers with limited computing resources.

Use PEFT & LoRA!

What is PEFT (Parameter-Efficient Fine-Tuning)?

Parameter-Efficient Fine-Tuning (PEFT) is a family of training methods in which, instead of retraining the entire model, only a small selection of parameters is trained. This saves substantial amounts of memory, time and compute.

Why is PEFT Important?

  • Reduces Memory Requirements - Fine-tuning only a small portion of the model means you don't need a supercomputer.
  • Speeds Up Training - With fewer trainable parameters, fine-tuning is significantly faster.
  • Prevents Catastrophic Forgetting - Since the base model remains mostly unchanged, it retains previous knowledge while learning new tasks.

Types of PEFT Techniques

  • Selective Fine-Tuning - Only updates specific layers instead of the whole model (see the sketch after this list).
  • Reparameterization - Uses low-rank matrix representations to update model weights efficiently.
  • Additive Fine-Tuning - Instead of modifying existing weights, it adds new trainable layers.
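
As a quick illustration of selective fine-tuning, the sketch below freezes every parameter of a Hugging Face model and then unfreezes only the last decoder block. The model name and the choice of which layers to unfreeze are arbitrary examples, not a recommended recipe.

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
# Freeze everything first
for param in model.parameters():
    param.requires_grad = False
# Unfreeze only the last decoder block (an arbitrary choice for illustration)
for name, param in model.named_parameters():
    if 'decoder.block.11' in name:
        param.requires_grad = True
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f'Trainable: {trainable} / {total} ({100 * trainable / total:.2f} %)')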

Among these, the most widely used technique in practice is LoRA, a reparameterization method!

What is LoRA (Low-Rank Adaptation)?

LoRA is an advanced fine-tuning method that:

  • Freezes most of the model parameters to reduce memory usage.
  • Introduces trainable low-rank matrices to capture new knowledge.
  • Achieves accuracy close to full fine-tuning while training roughly 99% fewer parameters (in the example below, only about 1.4% of the parameters are trainable).

How Does LoRA Work?

LoRA does not modify the original model's weights. Instead, it does the following (a minimal sketch of the idea follows the list):

  • Adds low-rank adapter matrices to certain layers (e.g., attention layers).
  • Fine-tunes only the adapter matrices, leaving the base model unchanged.
  • Combines adapter matrices with the original model during inference.
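
To make the mechanism concrete, here is a minimal sketch of the low-rank update itself. This is an illustration of the math, not the peft library's internal code: a frozen weight matrix W is augmented with two small trainable matrices A and B, and the layer output becomes the original projection plus a correction scaled by alpha / r. Because B starts at zero, training begins exactly at the original model's behavior.

import torch

d_model, r, alpha = 768, 32, 32                  # hypothetical layer size, LoRA rank and scaling
W = torch.randn(d_model, d_model)                # pre-trained weight, kept frozen
W.requires_grad_(False)
A = torch.randn(r, d_model) * 0.01               # trainable low-rank factor
A.requires_grad_(True)
B = torch.zeros(d_model, r, requires_grad=True)  # starts at zero, so B @ A contributes nothing at first

def lora_linear(x):
    # Frozen projection plus the scaled low-rank correction (B @ A)
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = torch.randn(4, d_model)                      # a dummy batch of activations
print(lora_linear(x).shape)                      # torch.Size([4, 768])
print(f'Trainable parameters: {A.numel() + B.numel()} vs frozen: {W.numel()}')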

Why Use LoRA?

LoRA is a strong fit for organizations and researchers who:

  • Need efficient fine-tuning without expensive hardware.
  • Want to fine-tune massive LLMs without modifying the base model.
  • Require flexibility: fine-tune for multiple tasks without retraining from scratch.

Example Use Cases:

  • Chatbots: Fine-tune an LLM to understand customer queries more effectively.
  • Medical AI: Adapt a general NLP model to analyze clinical reports.
  • Financial Text Processing: Train a model to summarize market reports accurately.

PEFT in Action: Fine-Tuning Flan-T5 with LoRA

Let's implement LoRA fine-tuning on Flan-T5 for dialogue summarization using Hugging Face.

Note: The code is available as a Google Colab notebook. Run the notebook on a T4 GPU.

Step 1: Install Required Libraries

!pip install evaluate
!pip install rouge_score
%pip install --upgrade pip
%pip install --disable-pip-version-check torch==1.13.1 torchdata==0.5.1 --quiet
%pip install transformers datasets peft --quiet  # required for the imports below (may already be present on Colab)

Import Libraries

This code imports everything needed for LoRA fine-tuning of the Flan-T5 model with PEFT (Parameter-Efficient Fine-Tuning). It brings in essential libraries (Torch, Evaluate, Pandas, NumPy) for deep learning and performance evaluation. The Hugging Face datasets module loads the dataset, while AutoModelForSeq2SeqLM and AutoTokenizer initialize the pre-trained model and tokenizer. LoraConfig configures fine-tuning so that only a small subset of parameters is updated, reducing memory usage and training costs. Finally, Trainer and TrainingArguments handle the training loop.

import torch
import time
import evaluate  ## for calculating rouge score
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer

Step 2: Load Dataset (DialogSum)

This code loads the DialogSum dataset from Hugging Face's datasets library, which is a dialogue summarization dataset. The load_dataset("knkarthick/dialogsum") function automatically fetches and prepares all dataset splits (train, validation and test). This dataset contains dialogues and their corresponding summaries, making it ideal for training models on text summarization tasks.

# prompt: load dataset in dialogsum all file
from datasets import load_dataset
dataset = load_dataset("knkarthick/dialogsum")

Step 3: Load Pre-Trained LLM (Flan-T5)

This code loads the Flan-T5 base model for text summarization and NLP tasks, using bfloat16 to reduce memory usage while maintaining precision. The tokenizer converts text into tokens for model processing, making it ready for fine-tuning or inference.

model_name = 'google/flan-t5-base'
# torch_dtype=torch.bfloat16 loads the weights in 16-bit brain-float precision, roughly halving memory usage
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Counting Trainable Parameters in a Model

This function calculates the total and trainable parameters in a model, helping analyze fine-tuning efficiency. It iterates through all model parameters, summing up both trainable (requires_grad=True) and frozen parameters. The function then returns a formatted output displaying the number of trainable parameters, total parameters and the percentage of trainable parameters relative to the entire model. This is particularly useful in LoRA and PEFT fine-tuning, where only a subset of parameters is updated to reduce memory usage and training costs.

def print_number_of_trainable_model_parameters(model):
   trainable_model_params = 0
   all_model_params = 0
   for _, param in model.named_parameters():
       all_model_params += param.numel()
       if param.requires_grad:
           trainable_model_params += param.numel()
   return f'trainable model parameters: {trainable_model_params}\n \
           all model parameters: {all_model_params} \n \
           percentage of trainable model parameters: {(trainable_model_params / all_model_params) * 100} %'
print(print_number_of_trainable_model_parameters(original_model))

Output:

trainable model parameters: 247577856
             all model parameters: 247577856 
             percentage of trainable model parameters: 100.0 %

Step 4: Preprocess Dataset

This code prepares the dataset for fine-tuning by tokenizing dialogues into structured prompts with "Summarize the following conversation.". It converts text into input IDs, applies padding and truncation and processes data in batches using dataset.map(). Finally, it removes unnecessary columns, keeping only the essential tokenized inputs for training Flan-T5 with LoRA fine-tuning.

def tokenize_function(example):
   start_prompt = 'Summarize the following conversation. \n\n'
   end_prompt = '\n\nSummary: '
   prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
   example['input_ids'] = tokenizer(prompt, padding='max_length', truncation=True,
                                    return_tensors='pt').input_ids
   example['labels'] = tokenizer(example['summary'], padding='max_length', truncation=True,
                                return_tensors='pt').input_ids
   return example
# Apply tokenization function
tokenize_datasets = dataset.map(tokenize_function, batched=True)
# Remove unnecessary columns
tokenize_datasets = tokenize_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])

Subsampling the Dataset

This code reduces the dataset size by keeping only every 100th example (index % 100 == 0), making training faster and more efficient. The with_indices=True ensures filtering based on index position, which is useful for quick testing or low-resource training.

tokenize_datasets = tokenize_datasets.filter(lambda example, index: index % 100 == 0,
                                           with_indices=True)

Checking Dataset Structure and Size

This code prints the shape of the training, validation and test datasets to verify the dataset size after preprocessing. It ensures that the data is correctly structured before training by displaying the number of samples in each split. The final print(tokenize_datasets) outputs the dataset structure, confirming that tokenization and filtering were applied successfully.

print(f'Training: {tokenize_datasets["train"].shape}')
print(f'Validation: {tokenize_datasets["validation"].shape}')
print(f'Test: {tokenize_datasets["test"].shape}')
print(tokenize_datasets)

Output:

Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})

Setting Output Directory

This code creates a unique folder for saving the trained model by appending the current timestamp to "dialogue-summary-training", preventing overwrites and helping track different training runs.

output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

Step 5: Configuring LoRA for Fine-Tuning

This code sets up LoRA (Low-Rank Adaptation) for fine-tuning the Flan-T5 model efficiently. It applies LoRA to the query (q) and value (v) layers in the attention mechanism, reducing memory usage with low-rank matrices (r=32) and scaling updates using LoRA alpha (32). The dropout (0.05) prevents overfitting and TaskType.SEQ_2_SEQ_LM ensures the setup is optimized for sequence-to-sequence learning. This makes fine-tuning faster, cost-efficient and effective while keeping most model parameters frozen.

lora_config = LoraConfig(r=32,
                        lora_alpha=32,
                        target_modules=['q', 'v'],
                        lora_dropout = 0.05,
                        bias='none',
                        task_type=TaskType.SEQ_2_SEQ_LM
)

Applying LoRA and Checking Trainable Parameters

This code integrates LoRA into the Flan-T5 model, modifying only selected layers for efficient fine-tuning. It then prints the total and trainable parameters, confirming that only a small subset is updated, reducing memory usage and training costs.

peft_model = get_peft_model(original_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

Output:

trainable model parameters: 3538944
             all model parameters: 251116800 
             percentage of trainable model parameters: 1.4092820552029972 %

Configuring Training with Hugging Face Trainer

This code sets up and initializes the training process for the LoRA fine-tuned model using Hugging Face's Trainer. It automatically finds a batch size that fits in memory, sets the learning rate (1e-3) and logs progress at every step. Note that although num_train_epochs=1 is specified, max_steps=1 caps the run at a single optimization step so the demo finishes quickly; for real training, remove max_steps or raise it substantially. The output_dir ensures results are saved to a unique folder, while report_to='none' disables external logging. Finally, Trainer() wires together the PEFT model and the tokenized dataset; the actual fine-tuning run is launched in the next step.

output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'
## back to the standard Hugging Face Trainer workflow, now with the PEFT model
peft_training_args = TrainingArguments(output_dir=output_dir,
                                      auto_find_batch_size=True,
                                      learning_rate=1e-3,
                                      num_train_epochs=1,
                                      logging_steps=1,
                                      max_steps=1,
                                       report_to='none' ## could be 'wandb'; 'none' disables external logging
               )
## same setup as a regular Trainer, but passing the PEFT-wrapped model
peft_trainer = Trainer(model=peft_model,
                     args=peft_training_args,
                     train_dataset=tokenize_datasets['train']
                )

Starting LoRA Fine-Tuning

This command starts the training process for the LoRA fine-tuned model using Hugging Face's Trainer. It updates only a small subset of parameters, ensuring efficient and memory-optimized fine-tuning.

peft_trainer.train()

Output:

Step   Training Loss

1           49.000000

TrainOutput(global_step=1, training_loss=49.0, metrics={'train_runtime': 2.6968, 'train_samples_per_second': 2.966, 'train_steps_per_second': 0.371, 'total_flos': 5565031907328.0, 'train_loss': 49.0, 'epoch': 0.0625})

Saving the Fine-Tuned LoRA Model

This code saves the fine-tuned LoRA model and tokenizer to the specified directory (peft-dialogue-summary-checkpoint-local). The save_pretrained() method stores the fine-tuned model weights and configuration, ensuring it can be reloaded later for inference or further training. Saving the tokenizer alongside the model ensures consistent text processing when using the trained model.

peft_model_path = './peft-dialogue-summary-checkpoint-local'
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

Loading the Fine-Tuned LoRA Model for Inference

This code reloads the fine-tuned LoRA model for inference. It loads the Flan-T5 base model with bfloat16 precision to reduce memory usage, along with its tokenizer for text processing. The PeftModel.from_pretrained() function loads the fine-tuned model from './peft-dialogue-summary-checkpoint-local', ensuring it retains learned parameters. Setting is_trainable=False makes the model ready for inference, preventing further modifications.

from peft import LoraConfig, get_peft_model, TaskType, PeftModel # Import PeftModel here
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')
# Now PeftModel should be accessible
peft_model = PeftModel.from_pretrained(peft_model_base,
                                     './peft-dialogue-summary-checkpoint-local',
                                     torch_dtype=torch.bfloat16,
                                     is_trainable=False)
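
Optionally, if you want a single standalone model for deployment, recent versions of the peft library can fold the adapter weights back into the base model so that inference runs without any adapter indirection. This step is not needed for the comparison below; if you do run it, use merged_model instead of peft_model afterwards.

# Optional: merge the LoRA adapters into the base weights (supported in recent peft releases)
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained('./peft-dialogue-summary-merged')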

Step 6: Generating and Comparing Summaries

This code extracts a test dialogue (index 200) and generates summaries using both the original Flan-T5 model and the fine-tuned LoRA model. It formats the dialogue into a prompt, moves inputs and models to GPU (if available) and generates summaries with generate(). The outputs are decoded and printed, comparing the human-written summary, original model's summary and LoRA fine-tuned model's summary. This helps evaluate how well fine-tuning improves text summarization.

index = 200 ## randomly pick index
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']
prompt = f"""
Summarize the following conversation.
{dialogue}
Summary:
"""
# Get the device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
input_ids = tokenizer(prompt, return_tensors='pt').input_ids
# Move input_ids and original_model to the device
input_ids = input_ids.to(device)
original_model = original_model.to(device)
# Now generate outputs
original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
# Ensure peft_model is also on the correct device
peft_model = peft_model.to(device)
peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
print(f'Human Baseline summary: \n{human_baseline_summary}\n')
print(f'Original Model Output \n{original_model_text_output}\n')
print(f'Peft Model Output \n{peft_model_text_output}\n')

Output:

Human Baseline Summary:

#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

Original Model Output

#Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning #Porning

Peft Model Output

#Person1#: I'm thinking of upgrading my computer.

Batch Summarization and Model Comparison

This code generates summaries for 10 test dialogues using both the original Flan-T5 model and the fine-tuned LoRA model. It formats and tokenizes dialogues, generates summaries using generate() and decodes the outputs. The results are stored in a DataFrame (df), comparing the human-written, original model and LoRA fine-tuned model summaries. This helps evaluate how much LoRA improves summarization quality.

dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']
original_model_summaries = []
peft_model_summaries = []
for dialogue in dialogues:
   prompt = f"""
   Summarize the following conversations.
   {dialogue}
   Summary: """
   input_ids = tokenizer(prompt, return_tensors='pt').input_ids
   # Move input_ids to the device
   input_ids = input_ids.to(device)
   original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
   original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
   original_model_summaries.append(original_model_text_output)

   peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
   peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
   peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries,
                          peft_model_summaries))
df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

Evaluating Summarization Performance with ROUGE Score

This code assesses the summarization quality of both the original Flan-T5 model and the fine-tuned LoRA model using the ROUGE metric. It loads rouge from the evaluate library and computes scores by comparing the generated summaries to human-written references. The use_aggregator=True ensures an overall ROUGE score, while use_stemmer=True improves word matching accuracy. Finally, the results are printed, providing a quantifiable comparison to measure how LoRA fine-tuning improves text summarization.

rouge = evaluate.load('rouge')
original_model_results = rouge.compute(predictions=original_model_summaries,
                                      references=human_baseline_summaries[0: len(original_model_summaries)],
                                     use_aggregator=True,
                                     use_stemmer=True)
peft_model_results = rouge.compute(predictions=peft_model_summaries,
                                   references=human_baseline_summaries[0: len(peft_model_summaries)],
                                   use_aggregator=True,
                                   use_stemmer=True)
print(f'Original Model: \n{original_model_results}\n')
print(f'PEFT Model: \n{peft_model_results}\n')

Output:

Original Model:

{'rouge1': 0.26641226265175844, 'rouge2': 0.08712093580178687, 'rougeL': 0.21242966203575447, 'rougeLsum': 0.21253214961198152}

PEFT Model:

{'rouge1': 0.26109650997150996, 'rouge2': 0.11055072463768116, 'rougeL': 0.2302777777777778, 'rougeLsum': 0.2339245014245014}
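
To read the comparison at a glance, you can compute the absolute improvement of the PEFT model over the original model for each metric, reusing the two result dictionaries computed above.

print('Absolute percentage-point improvement of PEFT model over original model:')
for key in original_model_results:
    delta = peft_model_results[key] - original_model_results[key]
    print(f'{key}: {delta * 100:.2f}')

With the numbers above, ROUGE-2, ROUGE-L and ROUGE-Lsum improve by roughly 2 percentage points each, while ROUGE-1 dips slightly, which is not surprising after only a single training step.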

Conclusion

Fine-tuning Large Language Models (LLMs) like Flan-T5 is often computationally expensive, but Parameter-Efficient Fine-Tuning (PEFT) with LoRA offers a cost-effective alternative for projects such as chatbots, text summarization, conversational AI and domain-specific NLP. By updating only a small subset of parameters, LoRA enables faster training and far lower GPU memory consumption. This guide covered loading and preprocessing the DialogSum dataset, configuring and applying LoRA, training efficiently with Hugging Face's Trainer and evaluating performance with ROUGE scores. Even after a deliberately short training run, the LoRA model matches or improves on the base model for most ROUGE metrics, and a longer run would be expected to widen the gap. With LoRA and PEFT, fine-tuning LLMs for specialized applications such as medical, financial or customer-support text becomes far more accessible and scalable, delivering better task performance at a fraction of the cost.
