Customer Service Chatbot Using LLMs
Every modern business strives to provide the best customer service to keep its customers. What if a business could go above and beyond the ordinary, providing support around the clock while answering thousands of questions without compromising quality? This is where the LLM Customer Service Chatbot becomes useful. This is more than just another chatbot; it is a revolution.
The bot employs advanced natural language processing (NLP) techniques to engage users conversationally, giving customers a swift and seamless experience. Whether it is altering an order or choosing a product variant, the bot eliminates unnecessary waiting time. Powered by the Mistral 7B Instruct model, it resolves customer requests satisfactorily, saving businesses both time and money.
Project Overview
This project aims to build a customer support chatbot using the Mistral 7B Instruct model, one of the latest Large Language Models (LLMs).
The chatbot is fine-tuned on real-world customer support conversations, so it handles queries as naturally and proficiently as a human agent. What makes this project stand out is its use of Parameter-Efficient Fine-Tuning (PEFT), which enables faster and more efficient training with fewer computational resources. It is trained on a custom dataset of customer service interactions, which keeps its responses relevant and context-aware.
Fine-tuning is carried out with the SFTTrainer method to specialize the model for customer service tasks. Advanced techniques such as gradient checkpointing and model quantization make it efficient enough for real-world deployment without sacrificing speed or accuracy. The project also equips the chatbot to provide a consistent experience across different communication channels. Its purpose is to offer a scalable and cost-effective solution to customer service challenges: solving problems and answering questions immediately, for smooth user interactions.
Prerequisites
Before we dive into this project, you need to be aware of certain key concepts and tools. Here are the prerequisites you should be familiar with:
- Comfortable writing and running Python code, and familiar with libraries such as torch and transformers.
- Knowledge of neural networks, training, and optimization.
- Knowledge of NLP tasks such as tokenization, classification, and generation.
- Experience using pre-trained models, tokenizers, and datasets from Hugging Face.
- Ability to run Python code in Google Colab or a local GPU environment with CUDA.
- Understanding of PEFT for efficiently training very large models.
- Knowledge of memory optimization techniques such as gradient checkpointing and model quantization.
Approach:
This is the structure and order in which we developed the customer service chatbot. It starts with the initial environment setup and the installation of the torch, transformers, and PEFT packages, which are required for training and deploying the model. A dataset of real-world customer service interactions is then loaded and preprocessed so the chatbot can be trained well. SFTTrainer fine-tunes the Mistral 7B Instruct model for customer support tasks. We also use Parameter-Efficient Fine-Tuning (PEFT), which reduces the computational load and speeds up training with no loss in accuracy. Techniques such as gradient checkpointing and quantization are applied on top of the model to improve memory efficiency and speed. The chatbot is designed so it can be deployed on various communication platforms, giving end users a consistent experience across all of them. Finally, we perform inference testing to make sure the chatbot generates contextually accurate responses, after which it is ready for real-world deployment. Throughout the project, the aim is to provide a solution that is scalable, cost-effective, and easy to deploy.
Workflow and Methodologies:
Below is a breakdown of the workflow and methodology:
Workflow
- Set up the development environment and install all the required packages, including torch, transformers, and PEFT.
- Load the custom dataset of real customer service conversations and prepare it for model training.
- Fine-tune the Mistral 7B Instruct model with SFTTrainer so it can address customer queries.
- Test the chatbot's responses to ensure they are accurate and relevant before integrating it into the system.
- Launch the chatbot and let it interact with users across different communication channels.
- Make the chatbot accessible so it can support customers with their inquiries at all times.
Methodology
- Used the Mistral 7B Instruct model.
- Loaded a customer service dataset from Hugging Face.
- Transformed the dataset into a DataFrame and organized it into a question-and-answer format.
- Set up the tokenizer and prepared the model for training with caching disabled and gradient checkpointing enabled.
- Prepared the model for KBit training and defined the PEFT configuration.
- Trained the model with SFTTrainer and set up inference for response generation.
- Evaluated and improved the chatbot's correctness and relevance of responses.
Data Collection and Preparation:
Data Collection Workflow
- Collect real customer service conversation datasets from sources such as support logs or public datasets.
- Make sure the dataset captures a wide variety of customer questions and responses across different situations.
Data Preparation Workflow
- Clean the data by removing irrelevant or redundant information, such as duplicates or missing values.
- Label the dataset appropriately, pairing queries with their responses so the data is easy to work with.
- Convert the data into a format the model can use, such as a pandas DataFrame or a Hugging Face Dataset.
- Divide the dataset into train and test sets so the model can be fine-tuned and then evaluated (a minimal sketch of these steps follows this list).
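The sketch below illustrates these preparation steps using the same Hugging Face dataset loaded later in the project. It is a minimal example: the "text" column name and the 10% test split are illustrative choices, and the main walkthrough later uses a column whose name is the empty string.
# Minimal sketch of the preparation steps above
from datasets import load_dataset, Dataset
raw = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset", split="train")
df = raw.to_pandas()
# Clean: drop rows with missing fields and remove duplicate query/response pairs
df = df.dropna(subset=["instruction", "response"]).drop_duplicates(subset=["instruction", "response"])
# Label: pair each query with its response in a single training text field
df["text"] = "###Question: " + df["instruction"] + " ###Answer: " + df["response"]
# Convert back to a Hugging Face Dataset and split into train/test sets
dataset = Dataset.from_pandas(df[["text"]], preserve_index=False)
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_data, test_data = splits["train"], splits["test"]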
Code Explanation:
STEP 1:
Install Required Packages
This command installs the fundamental libraries for working with large language models. It includes accelerate, peft, and bitsandbytes for training efficiency, along with the model and dataset libraries transformers and trl. It also installs auto-gptq for quantization and optimum for model optimization.
! pip install accelerate peft bitsandbytes git+https://github.com/huggingface/transformers trl py7zr auto-gptq optimum
Hugging Face Hub Login
The given code snippet imports the notebook_login function from the huggingface_hub library and then invokes it so the user can sign in to their Hugging Face account. Signing in enables access to private models and datasets hosted on the Hugging Face Hub from Jupyter Notebook or Google Colab. Logging in also lets users train and share their models and datasets in a more organized way, since they can be sure of accessing their own assets.
from huggingface_hub import notebook_login
notebook_login()
Import Required Libraries
This code imports all the libraries needed to create and fine-tune a language model. It starts with the deep-learning library torch, then adds datasets for fetching training data and peft for parameter-efficient fine-tuning. It also imports transformers for working with pre-trained models and trl, whose SFTTrainer optimizes model training.
import torch
from datasets import load_dataset, Dataset
from peft import LoraConfig, AutoPeftModelForCausalLM, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, TrainingArguments
from trl import SFTTrainer
import os
STEP 2:
Loading and Preparing the Dataset
This code loads the "bitext/Bitext-customer-support-llm-chatbot-training-dataset" dataset from Hugging Face Datasets and converts it to a pandas DataFrame. It keeps the first 5,000 rows and combines the "instruction" and "response" columns into a single text column formatted as "###Question: ... ###Answer: ...", which is the format the model will be fine-tuned on. Finally, the modified DataFrame is converted back into a Hugging Face Dataset for training.
# Load the "bitext/Bitext-customer-support-llm-chatbot-training-dataset" dataset from Hugging Face Datasets
data = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset", split="train")
# Convert the dataset to a pandas DataFrame
data_df = data.to_pandas()
# Select the first 5000 rows of the DataFrame
data_df = data_df[:5000]
# Combine the "instruction" and "response" columns into a new column whose name is the empty string ("")
data_df[""] = data_df[["instruction", "category", "intent", "response"]].apply(
lambda x: "###Question: " + x["instruction"] + " ###Answer: " + x["response"],
axis=1
)
# Create a new dataset from the modified pandas DataFrame
data = Dataset.from_pandas(data_df)
STEP 3:
Loading a Tokenizer for a pre-trained Model
This code loads the tokenizer for the specified pretrained model, TheBloke/Mistral-7B-Instruct-v0.1-GPTQ, from Hugging Face's model hub. The tokenizer is fetched with the AutoTokenizer.from_pretrained() method, which prepares input text for this specific model.
It also sets the padding token to the end-of-sequence (EOS) token, so both tokens are handled consistently while processing and generating text. This is useful when the model handles text of different lengths, because it keeps the input format consistent.
# Load a tokenizer for the specified pretrained model name from the Hugging Face model hub
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-GPTQ")
# Set the padding token of the tokenizer to be the same as the end-of-sequence (EOS) token
tokenizer.pad_token = tokenizer.eos_token
Show Data
data_df
Defining Quantization Configuration and Loading Model
The following code defines a quantization configuration for the "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ" model, specifying 4-bit precision and disabling EXLLAMA, and then loads the model for causal language modeling. Automatic device mapping is used so the available hardware is utilized effectively.
# Define a quantization configuration for the model, specifying 4 bits, disabling EXLLAMA, and providing the tokenizer
quantization_config_loading = GPTQConfig(bits=4, disable_exllama=True, tokenizer=tokenizer)
# Load a model for causal language modeling from the specified pretrained model name
# Apply the quantization configuration to the model
# Use automatic device mapping for the model
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-Instruct-v0.1-GPTQ",
quantization_config=quantization_config_loading,
device_map="auto"
)
Step-4: PEFT (Parameter-Efficient Fine-Tuning) Configuration
Disabling Caching and Configuring the Model for KBit Training
This piece of code disables caching in the model configuration, sets pretraining_tp to 1, and switches on gradient checkpointing to save memory during training. The model is then prepared for KBit training, a cost-effective way to fine-tune quantized large models.
Configuring and Applying PEFT with LoRA
The code sets model.config.pretraining_tp = 1 and enables gradient checkpointing, which manages memory consumption by recomputing intermediate results as needed. A PEFT configuration is then created with LoraConfig, which specifies the LoRA rank, scaling factor, dropout rate, and which modules are adapted. The PEFT configuration is applied to the model with get_peft_model, and the updated model is printed to verify the changes.
# Disable caching in the model configuration
model.config.use_cache = False
# Set pretraining task probability to 1 in the model configuration
model.config.pretraining_tp = 1
# Enable gradient checkpointing in the model
model.gradient_checkpointing_enable()
# Prepare the model for KBit training
model = prepare_model_for_kbit_training(model)
# Define a LoRA configuration for PEFT (Parameter-Efficient Fine-Tuning)
peft_config = LoraConfig(
r=16, # LoRA rank (dimension of the low-rank update matrices)
lora_alpha=16, # LoRA scaling factor (alpha)
lora_dropout=0.05, # Dropout rate applied to the LoRA layers
bias="none", # Do not train additional bias parameters
task_type="CAUSAL_LM", # Type of task (causal language modeling)
target_modules=["q_proj", "v_proj"] # Attention projection modules adapted by LoRA
)
# Apply the PEFT configuration to the model
model = get_peft_model(model, peft_config)
# Print the modified model object
print(model)
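As a quick optional check (a small addition, assuming a recent peft version), you can also print how few parameters the LoRA adapters actually train compared to the full model:
# Optional: summarize trainable vs. total parameters after applying LoRA
model.print_trainable_parameters()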
Step-5:
Setting Training Arguments for Chatbot Model Optimization
This code sets the training parameters for the chatbot model. It specifies the output folder, the per-device batch size, and the number of gradient accumulation steps. The optimizer is paged_adamw_32bit with a learning rate of 2e-4 and a cosine learning rate schedule. Checkpoints are saved after every epoch and metrics are logged every 100 steps. Training is configured for 5 epochs but capped at a maximum of 250 steps, with fp16 mixed precision enabled. The final model is uploaded to the Hugging Face Hub after training.
# Define training arguments for model training and optimization
training_arguments = TrainingArguments(
# Output directory where model checkpoints and logs will be saved
output_dir="customer_service_chatbot",
# Batch size per GPU/CPU for training
per_device_train_batch_size=8,
# Number of steps for gradient accumulation before performing optimization
gradient_accumulation_steps=1,
# Optimizer used for training (paged_adamw_32bit in this case)
optim="paged_adamw_32bit",
# Learning rate used for optimization
learning_rate=2e-4,
# Type of learning rate scheduler (cosine annealing in this case)
lr_scheduler_type="cosine",
# Strategy for saving model checkpoints (saving after each epoch)
save_strategy="epoch",
# Interval for logging training metrics (every 100 steps)
logging_steps=100,
# Total number of training epochs
num_train_epochs=5,
# Maximum number of training steps
max_steps=250,
# Enable mixed precision training (float16)
fp16=True,
# Upload the model to the Hugging Face Hub after training
push_to_hub=True
)
# Print the initialized training arguments
print(training_arguments)
Step-6:
Model Training
This code builds an SFTTrainer object for model training. It takes the dataset, the PEFT configuration, and the training arguments. The maximum sequence length is 512 tokens, and the tokenizer is used for text preprocessing. Packing is disabled, so examples are tokenized individually rather than concatenated into fixed-length blocks. Finally, trainer.train() runs the fine-tuning.
# Initialize a SFTTrainer instance for training
trainer = SFTTrainer(
model=model, # The model to be trained
train_dataset=data, # The training dataset
peft_config=peft_config, # PEFT configuration
dataset_text_field="", # Field in the dataset containing the training text (the column named "" created during data preparation)
args=training_arguments, # Training arguments (e.g., batch size, learning rate)
tokenizer=tokenizer, # Tokenizer for preprocessing text data
packing=False, # Whether to use packing for GPU memory optimization
max_seq_length=512 # Maximum sequence length for tokenization
)
# Perform training
trainer.train()
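Since push_to_hub=True is set, the trainer uploads the model to the Hugging Face Hub after training. As an optional extra step (a small sketch, not part of the original walkthrough), you can also save the fine-tuned adapter and tokenizer locally so the Google Drive backup in the next step has a complete copy:
# Optional: save the fine-tuned adapter and tokenizer to the output directory
trainer.save_model("customer_service_chatbot")
tokenizer.save_pretrained("customer_service_chatbot")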
Step-7:
Mounting Drive and Installing Required Libraries
Mounting Drive:
This code connects your Google Drive account to the Colab workspace. It makes the files in your Google Drive accessible under a specific folder ('/content/drive').
from google.colab import drive
drive.mount('/content/drive')
Model Backup to Google Drive and Device Configuration
This command copies the customer_service_chatbot directory from the Colab environment to Google Drive, so the trained model and related files are stored for easy access at any time. The following line then checks for a CUDA-enabled GPU with torch.cuda.is_available(): if one is available, DEVICE is set to "cuda:0" so the model runs on the GPU; otherwise it is set to "cpu".
! cp -r /content/customer_service_chatbot /content/drive/
#This line of code assigns the variable DEVICE based on the availability of CUDA
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
Step-8:
Inference
This code imports the packages needed to load and run the fine-tuned language model. Specifically, it imports AutoPeftModelForCausalLM from the peft library for loading models that have undergone Parameter-Efficient Fine-Tuning (PEFT). The GenerationConfig class from transformers is included for defining generation-related parameters such as response length and randomness, and AutoTokenizer is imported to tokenize the input text. Finally, the code constructs a tokenizer by loading it from the path of the fine-tuned customer service chatbot.
# Importing AutoPeftModelForCausalLM from the peft module
from peft import AutoPeftModelForCausalLM
# Importing GenerationConfig from the transformers module
from transformers import GenerationConfig
# Importing AutoTokenizer from the transformers module
from transformers import AutoTokenizer
# Importing torch module for various PyTorch functionalities
import torch
# Initializing a tokenizer by loading from a pre-trained model
tokenizer = AutoTokenizer.from_pretrained("/content/drive/aionlinecourse/Project/Customer Service Chatbot Using LLMs model/customer_service_chatbot")
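Note that the inference cells below reuse the model object still in memory from training. If you are starting from a fresh session instead, one option (a hedged sketch, assuming the checkpoint directory above contains the saved LoRA adapter) is to reload it with AutoPeftModelForCausalLM:
# Reload the fine-tuned adapter from the saved checkpoint directory
# (only needed if the trained model is no longer in memory)
model = AutoPeftModelForCausalLM.from_pretrained(
"/content/drive/aionlinecourse/Project/Customer Service Chatbot Using LLMs model/customer_service_chatbot",
device_map="auto"
)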
Defining a Question and Constructing a Prompt
A variable named question holds the user's query, in this case "i want help cancelling purchase?". The prompt is then built with an f-string that places the question into the template, leaving the Answer: part blank for the model to fill in. The .strip() method removes any leading or trailing whitespace so the prompt is clean before tokenization and feeding it to the model.
# Define the question
question = "i want help cancelling purchase?"
# Construct the prompt string
prompt = f"""
Question: {question}
Answer:
""".strip()
Tokenizing and Generating a Response from the Model
In this code, the prompt is tokenized and converted into tensor format, and .to(DEVICE) moves the tensor to the correct device (GPU or CPU). The code then enters torch.inference_mode(), which disables gradient calculations to speed up computation and reduce memory usage. Inside this mode, model.generate() produces the chatbot's output, with temperature set to 0.7 to control randomness and max_new_tokens set to 512 to limit the length of the response.
# Tokenize the prompt and convert it to tensor format, then move it to the specified device (GPU or CPU)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(DEVICE)
# Enter the inference mode (i.e., disable autograd and training-specific behavior)
with torch.inference_mode():
# Generate the output (response) based on the input prompt
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
Decoding and Printing the Model's Output
The code takes the model's generated output (a tensor) and decodes it back into text with the tokenizer. output[0] denotes the first generated sequence in the model's response, and tokenizer.decode() reverses the tokenization, turning the numerical output back into readable text. The result is printed as the model's response to the user's question.
# Decode the generated output tensor and print the decoded text
print(tokenizer.decode(output[0]))
# Import necessary modules
from peft import AutoPeftModelForCausalLM
from transformers import GenerationConfig, AutoTokenizer
import torch
# Initialize tokenizer from a pretrained model
tokenizer = AutoTokenizer.from_pretrained("/content/drive/aionlinecourse/Project/Customer Service Chatbot Using LLMs model/customer_service_chatbot")
# Define the question
question = "I'm trying to cancel order"
# Construct the prompt string with the question
prompt = f"""
Question: {question}
Answer:
""".strip()
# Tokenize the prompt and convert it to tensor format
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(DEVICE)
# Enter the inference mode
with torch.inference_mode():
# Generate the model's response based on the input prompt
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
# Decode the generated output tensor and print the decoded text
print(tokenizer.decode(output[0]))
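The decoded text includes the original prompt and any special tokens. As an optional refinement (a small sketch, not in the original code), you can decode only the newly generated tokens and skip special tokens so that just the answer is printed:
# Optional: decode only the tokens generated after the prompt and drop special tokens
answer = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(answer)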
Conclusion
In our fast-paced digital society, the days of waiting for hours on the phone and being passed through layer after layer of customer service are over. You can finally reach the help you need right away, and this project delivers just that. By building a customer service chatbot with the Mistral 7B Instruct model, you can rewrite the way your business handles support issues in a short period of time. The chatbot is fast because it combines advanced natural language processing with optimizations such as PEFT and model quantization, and it keeps customers happy with human-like responses.
As an AI-powered solution, it works 24/7 and scales effortlessly, offering customer service to any user at any time. No matter the language. No matter the device.
Bearing this in mind, if you are ready to take your customer service a notch higher, this chatbot is the solution you seek. Quick, courteous, and always professional. That is the future of customer service that we will deliver!
Challenges and Solutions
Challenge: Large models tend to be slow and resource-heavy.
Solution: We employ model quantization and PEFT to reduce the computational load and speed up the model.
Challenge: Fine-tuning large models is hard.
Solution: Use SFTTrainer with PEFT to fine-tune efficiently with fewer resources.
Challenge: The dataset may contain irrelevant or redundant data.
Solution: Clean and filter out the extra information from the dataset before feeding it to the model.
Challenge: Integrating the chatbot across different platforms is difficult.
Solution: Don't reinvent the wheel; build around well-documented APIs and tools such as Hugging Face.
Challenge: Training and inference times are too high.
Solution: Gradient checkpointing and GPU acceleration optimize performance.
FAQ:
Question 1: Why is the Mistral 7B Instruct model used for customer service chatbots?
Answer: Mistral 7B Instruct is a large language model that generates human-like responses to user queries. It is ideal for building a customer service chatbot because it can handle complex language tasks and give context-aware answers.
Question 2: Why is PEFT (Parameter-Efficient Fine-Tuning) important for this project?
Answer: PEFT makes the training process faster and more efficient by letting the chatbot be fine-tuned with far fewer trainable parameters and resources. This is especially useful for large models, which would otherwise require a massive amount of computational power.
Question 3: Can model quantization improve performance?
Answer: Model quantization reduces the size of the model by converting its weights to lower precision. This decreases memory usage and speeds up inference without much loss in accuracy, making the chatbot more efficient.
Question 4: Which dataset is used to train it and how is it prepared?
Answer: We train the chatbot on a customer support dataset from Hugging Face. We format and clean the dataset so that the model can understand and give the correct response.
Question 5: How can I deploy this chatbot for multiple platforms?
Answer: With the APIs and tools in Hugging Face's ecosystem, you can deploy the chatbot across the communication channels available on websites, apps, or customer service platforms (one illustrative option is sketched below).
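As one illustrative option (not part of the original project code), a lightweight Gradio interface can expose the chatbot as a web widget that other channels can embed. The sketch below is hypothetical: it assumes the model, tokenizer, and DEVICE from the earlier steps are available and that gradio is installed.
# Hypothetical example: wrap the chatbot in a simple Gradio web interface
import gradio as gr
import torch
def answer_question(question):
    # Build the same prompt format used during fine-tuning
    prompt = f"###Question: {question} ###Answer:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(DEVICE)
    with torch.inference_mode():
        output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=256)
    # Return only the newly generated tokens as the answer text
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
gr.Interface(fn=answer_question, inputs="text", outputs="text", title="Customer Service Chatbot").launch()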
This approach provides a competitive advantage in customer service and a quality of interaction that surpasses traditional processes.