Question Answering System Training With DistilBERT Base Uncased
Have you ever wished someone could read through a long passage for you and get straight to the point, answering exactly the question you had? Well, you have come to the right place! In this fun project we are going to create a Question and Answer system using a great model known as DistilBERT. Do not worry; the code stays basic and easy to understand throughout. By the end of the project, you will see how easy it is to create something that can understand questions and find answers as if by magic.
Project Overview
In this project, we show you how to construct a question-answering system using the DistilBERT model fine-tuned on the SQuAD dataset. Imagine building your own little robot that can read a passage and pick out the best answer to a question.
We’ll take you through all the necessary steps, from setting up the tools to training the model and finally using it to answer questions. And the best part is that we will be reusing pre-trained models from the Hugging Face model repository.
This project is for you if you have ever wanted to see how an AI system is made, and by the end of it, you will have your own question and answer bot. It’s time to dive in. Let us begin!
Prerequisites
Before diving into this project, you will need to have a few things ready. Do not worry, nothing too complicated.
- An intermediate understanding of Python.
- Familiarity with Jupyter Notebooks or Google Colab for running the project.
- Some knowledge of Hugging Face’s Transformers library.
- A basic understanding of machine learning, particularly model training.
- A Google account for accessing Google Colab, where the project will be run online.
- The SQuAD dataset, which is used for training and fine-tuning the model.
Approach
In this project, we will show you step by step how to build a question answering system with DistilBERT and the SQuAD dataset. The process starts with setting up the environment by installing the prerequisite packages, such as Hugging Face’s transformers and datasets. Then we load SQuAD, a popular dataset widely used to train question answering models. Once the dataset is ready, we move on to tokenization, where we preprocess the questions and text passages into a form the model can handle.
For the fine-tuning stage, we train the distilbert-base-uncased model to answer questions with reference to a given context. Training is handled by Hugging Face’s Trainer API, with hyperparameters such as learning rate and batch size set up front. When training is complete, we test the model’s capabilities by providing new questions and contexts to see how accurately it answers. Last but not least, we publish the model to the Hugging Face Model Hub, where anyone can use it for their own NLP projects.
Workflow and Methodologies
- Install Dependencies: Prepare the environment by installing the transformers, datasets, and evaluate libraries.
- Load the SQuAD Dataset: Load the SQuAD dataset and divide it into training and testing sections.
- Data Tokenization: Prepare questions and contexts for model input using AutoTokenizer from Hugging Face.
- Fine-Tune the Model: Take the pre-trained distilbert-base-uncased model and fine-tune it on the SQuAD dataset for the question answering task.
- Model Training and Evaluation: Use the Trainer to fit the model and evaluate its performance with metrics such as exact match and F1 score.
- Testing with New Questions: Probe the model with new questions and contexts to see how it performs.
- Deploy the Model: Use the push_to_hub() function to publish the trained model to the Hugging Face Model Hub for public sharing and access.
Data Collection and Preparation
Data Collection Workflow
- Collect the Dataset: Load the SQuAD dataset by leveraging the datasets library provided by Hugging Face.
- Explore the Dataset: Look at a few examples to get a sense of how the passages, questions, and answers are formatted.
- Split the Dataset: Split the dataset into 80% for training and 20% for validation.
Data Preparation Workflow
- Tokenize the Data: Use Hugging Face’s AutoTokenizer to prepare questions and contexts for processing.
- Map the Answer Location: Identify and mark where the answer is located within the corresponding tokenized context.
- Batch the Data: Use DefaultDataCollator to stack the already padded, tokenized sequences into uniform batches, improving processing speed.
- Prepare Inputs for Training: Combine the tokenized text with the answer position labels so everything is ready for training.
Code Explanation
STEP 1:
Installing required libraries
The first command installs the transformers library (which provides the pre-trained models), the datasets library (used to load the SQuAD dataset), and the evaluate library (for metrics). The second command installs a tool known as accelerate, which speeds up model training and helps run models on different devices.
!pip install transformers datasets evaluate
!pip install accelerate
Logging into Hugging Face Hub
This code enables access to the Hugging Face Hub straight from the notebook. The notebook_login() function prompts for your Hugging Face credentials so that you can use the pre-trained models, datasets, and other resources the platform provides, and push your own model to the Hub later.
from huggingface_hub import notebook_login
notebook_login()
STEP 2:
Loading the SQuAD Dataset
This code calls the load_dataset function, which is used to download and load datasets for training. Rather than the whole dataset, only the first 5,000 samples from the training split are retrieved; this keeps training the question answering model efficient.
# Importing the load_dataset function from the datasets library
from datasets import load_dataset
# Loading the SQuAD dataset with only the first 5000 examples from the training split
squad = load_dataset("squad", split="train[:5000]")
Splitting the Dataset
In this code, the SQuAD dataset is split for training and testing, with 80% allocated for training and 20% for testing. The code also displays the first example from the training split, providing insight into the dataset's structure before any modeling begins.
squad = squad.train_test_split(test_size=0.2)
squad["train"][0]
Initializing the Tokenizer
This code imports the AutoTokenizer class from the transformers library and prepares a tokenizer for the DistilBERT model. The tokenizer transforms input text into a sequence of token IDs so the model can process it. The model used here is distilbert-base-uncased, meaning its tokenizer ignores letter case, which makes working with natural language a little easier.
# Importing the AutoTokenizer class from the transformers library
from transformers import AutoTokenizer
# Instantiating a tokenizer for the DistilBERT model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
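If you want to see what the tokenizer actually produces, a quick sanity check like the following (an illustrative snippet, not part of the training pipeline) shows the token IDs and the lowercased text wrapped in special tokens:
# Quick sanity check (illustrative): tokenize one question/context pair
sample = tokenizer("What is DistilBERT?", "DistilBERT is a smaller, faster version of BERT.")
print(sample["input_ids"][:10])               # the first few token IDs
print(tokenizer.decode(sample["input_ids"]))  # [CLS] what is distilbert? [SEP] distilbert is ... [SEP]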
STEP 3:
Preprocessing Function
The preprocess_function handles cleaning, tokenization, and answer position mapping. First, it strips extra whitespace from the questions and tokenizes each question together with its context, truncating the context so the combined sequence fits within 384 tokens. The offset mapping tracks where each token sits in the original text.
After tokenizing, the function determines the start and end token positions of each answer within the context. If the answer is not fully contained in the (possibly truncated) context, the positions are labeled (0, 0). Finally, the function returns the tokenized inputs together with these answer positions, ready for training.
def preprocess_function(examples):
    # Strips leading and trailing whitespace from each question in the examples dictionary
    questions = [q.strip() for q in examples["question"]]
    # Tokenizes the questions and contexts, ensuring a maximum length of 384 tokens,
    # truncating the context if necessary, and padding sequences to the maximum length
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    # Retrieves and removes the offset_mapping from the tokenizer inputs
    offset_mapping = inputs.pop("offset_mapping")
    # Retrieves the answers from the examples dictionary
    answers = examples["answers"]
    # Initializes lists to store start and end positions of the answers in tokenized sequences
    start_positions = []
    end_positions = []
    # Iterates over the offset_mapping for each example
    for i, offset in enumerate(offset_mapping):
        # Retrieves the answer information for the current example
        answer = answers[i]
        start_char = answer["answer_start"][0]  # Start character index of the answer
        end_char = answer["answer_start"][0] + len(answer["text"][0])  # End character index of the answer
        sequence_ids = inputs.sequence_ids(i)  # Retrieves sequence ids for the current example
        # Finds the start and end of the context sequence in the tokenized sequence
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # Labels the answer (0, 0) if it is not fully inside the context
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise, assigns the start and end token positions of the answer
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    # Adds start and end positions of answers to the inputs dictionary
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
In this code, the preprocess function is applied to the entire dataset via the map function, which processes the data in chunks for efficiency. The batched=True option means the function receives several examples at a time, which speeds things up.
The remove_columns argument drops the original columns, retaining only the tokenized data and answer positions. With that, every part of the SQuAD dataset has been tokenized and prepared for use.
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
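As a quick, optional check (illustrative), you can confirm that only the model-ready columns remain after mapping:
# The processed dataset should now contain only model inputs and answer positions
print(tokenized_squad["train"].column_names)
# Expected: ['input_ids', 'attention_mask', 'start_positions', 'end_positions']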
Data Collator and Model Initialization for Question Answering
This code sets up the data collator and initializes the model for the question-answering task. First, it imports the DefaultDataCollator class from the transformers library and creates an instance of it. Because the tokenizer has already padded every sequence to the same maximum length, the collator simply stacks the examples into uniform batches, which keeps training efficient.
Next, the code imports the classes needed for model initialization. The AutoModelForQuestionAnswering class loads a pre-trained DistilBERT model with a question-answering head. Calling from_pretrained("distilbert-base-uncased") loads the general-purpose DistilBERT weights; the new question-answering head starts out randomly initialized (Transformers prints a warning about this, which is expected) and will be learned during fine-tuning on our dataset.
# Importing the DefaultDataCollator class from the transformers library
from transformers import DefaultDataCollator
# Instantiating a DefaultDataCollator object
data_collator = DefaultDataCollator()
# Importing necessary classes from the transformers library
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
# Instantiating a pretrained DistilBERT-based model for question answering
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
STEP 4:
Model Training Setup
This code sets up the training procedure using the TrainingArguments and Trainer classes. The training configuration includes parameters such as the number of epochs, batch size, learning rate, and the evaluation strategy.
The Trainer manages the training loop, bringing together the pre-trained model, the tokenized datasets, and the data collator for batching. Finally, training starts with trainer.train(). Afterward, the model checkpoints are saved and can be uploaded directly to the Hugging Face Hub for sharing or easy deployment.
training_args = TrainingArguments(
    # Directory where model checkpoints and outputs will be saved
    output_dir="my_awesome_qa_model",
    # Strategy for evaluating model performance during training (evaluates once per epoch)
    evaluation_strategy="epoch",
    # Learning rate for training the model
    learning_rate=2e-5,
    # Batch size per GPU/CPU for training
    per_device_train_batch_size=16,
    # Batch size per GPU/CPU for evaluation
    per_device_eval_batch_size=16,
    # Total number of training epochs
    num_train_epochs=1,
    # Weight decay coefficient for regularization
    weight_decay=0.01,
    # Flag indicating whether to push the trained model to the Hugging Face Hub
    push_to_hub=True,
)

trainer = Trainer(
    # Pre-trained model for question answering
    model=model,
    # Training arguments defined above
    args=training_args,
    # Training dataset
    train_dataset=tokenized_squad["train"],
    # Evaluation dataset
    eval_dataset=tokenized_squad["test"],
    # Tokenizer used for tokenizing inputs
    tokenizer=tokenizer,
    # Data collator used for batching and data processing
    data_collator=data_collator,
)
# Initiates training of the model using the specified training arguments and datasets
trainer.train()
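The Trainer reports the evaluation loss, but SQuAD-style systems are usually judged by exact match and F1 on the text answers. Here is a minimal sketch of how you could score the freshly trained model on a handful of held-out examples, assuming the squad, model, and tokenizer objects from the steps above are still in memory; the 50-example sample size is an arbitrary choice for a quick check:
# Minimal evaluation sketch (assumes the objects above are still in memory)
import evaluate
from transformers import pipeline

squad_metric = evaluate.load("squad")
qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

# Run the pipeline on a small sample of the held-out split
sample = squad["test"].select(range(50))
predictions = [
    {"id": ex["id"], "prediction_text": qa(question=ex["question"], context=ex["context"])["answer"]}
    for ex in sample
]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in sample]

# Prints exact match and F1 over the sample
print(squad_metric.compute(predictions=predictions, references=references))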
STEP 5:
Pushing the Model to Hugging Face Hub
The trainer.push_to_hub() command uploads the trained model, along with its configuration and training details, to the Hugging Face Model Hub. This lets others download, use, or further fine-tune the model in their own work, and makes it easy to deploy in applications straight from the Hugging Face platform.
trainer.push_to_hub()
STEP 6:
Setting Up Contextual Question-Answering Pipeline with Pre-Trained Model
This code defines a question and its context to try out the fine-tuned question-answering model. The question posed is, “How many programming languages does BLOOM support?” The context provided is: “BLOOM has 176 billion parameters and can generate text in 46 natural languages and 13 programming languages.” The model relies on this context to answer the question accurately.
Next, the code imports the pipeline function from transformers to create a question-answering pipeline, using the model we pushed to the Hub at "aionlinecourse/my_awesome_qa_model". The pipeline wraps tokenization, inference, and decoding, so it can take the question and context directly and return an answer, streamlining the process of querying text-based data.
# Define the question to ask
question = "How many programming languages does BLOOM support?"
# Define the context in which the question is being asked
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."
# Importing the pipeline function from the transformers library
from transformers import pipeline
# Instantiating a question-answering pipeline with the specified model
question_answerer = pipeline("question-answering", model="aionlinecourse/my_awesome_qa_model")
# Using the question-answering pipeline to generate an answer to the given question based on the provided context
question_answerer(question=question, context=context)
# Define a new question to ask
question = "Who are you?"
# Define the context in which the question is being asked
context = "I am Tareq, I read in class Five, I live in Dhaka."
# Reusing the same question-answering pipeline for the new question and context
question_answerer(question=question, context=context)
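For reference, the pipeline returns a dictionary containing the predicted answer span, its character offsets within the context, and a confidence score. For the second example you would expect output of this shape (the values shown are a made-up illustration; yours will differ):
# Typical shape of the pipeline's output (values are illustrative)
# {'score': 0.95, 'start': 5, 'end': 10, 'answer': 'Tareq'}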
Conclusion
Well, we have come to the end of our exciting project on building a question answering system with DistilBERT and the SQuAD dataset! We covered every stage, from environment setup and data loading to fine-tuning the model and deploying it to the Hugging Face Hub.
Through this project, you are now equipped to create your own machine learning models for a range of natural language processing (NLP) tasks. Your model can produce correct answers to the queries you pose, and you can host it on the Hugging Face platform so other people can use it too. Projects like this help anyone build smarter interactive applications, whether you are a data science enthusiast, a developer, or simply curious about AI.
What’s next? With this strong groundwork you can keep improving your model, try other NLP tasks, or simply continue exploring AI.
Challenges and Solutions
Challenge: Trouble setting up the environment.
Solution: Google Colab comes with many of these libraries pre-installed; for anything missing, simply use pip install.
Challenge: Slow training on a local machine.
Solution: Take advantage of Google Colab’s free GPU, which greatly speeds up model training without requiring costly hardware.
Challenge: The dataset is too large to handle.
Solution: Begin by training on just a part of the dataset, such as the first 5,000 examples of SQuAD.
Challenge: Tokenized inputs are longer than the model’s allowable input size.
Solution: Use the tokenizer’s truncation and padding options to make the inputs conform to the model’s token limit, such as the maximum of 384 tokens used here.
Challenge: The model is overfitting on the training data.
Solution: Use techniques such as weight decay (weight_decay=0.01) and restrict the number of training epochs.
Challenge: Low accuracy on evaluation.
Solution: Try adjusting hyperparameters (for instance, the learning rate and batch size) or increase the training dataset size.
FAQ
Question 1: What is DistilBERT and why did we use it for creating a question answer system?
Answer: DistilBERT is a smaller, distilled version of BERT; it is much faster while still providing very good performance. That makes it well suited to tasks such as question answering, where it can process large amounts of text and still give accurate answers.
Question 2: Can you explain the SQuAD dataset and its significance in creating a question answer system?
Answer: SQuAD, the Stanford Question Answering Dataset, is one of the most widely used datasets for natural language understanding models. It consists of text passages with questions and answers, which makes it ideal for teaching models to answer questions based on context.
Question 3: What measures should I take after training my model?
Answer: In this project we published the model to the Hugging Face Hub using the push_to_hub() function. Afterward, your model is available on the Hub to be downloaded, tested, and even further fine-tuned by other users.
Question 4: Is it possible for me to use another model apart from DistilBERT in creating a question answer system?
Answer: Absolutely! DistilBERT can be swapped out for other pre-trained models such as BERT, RoBERTa, or even GPT if needed. Hugging Face hosts a wide variety of models that can be fine-tuned in a similar way.
Question 5: In which ways can I enhance the quality of the question-answering model that I have built?
Answer: To enhance performance, you can train the model with different hyperparameters (e.g., learning rate, batch size), run more training epochs, and/or retrain the model with more data.
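As a concrete starting point, a retraining run along these lines (illustrative values, not tuned for this dataset) changes only the TrainingArguments from STEP 4:
# One possible retraining configuration (illustrative values, not tuned)
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=3e-5,              # try a few values around 2e-5
    per_device_train_batch_size=32,  # larger batches if GPU memory allows
    per_device_eval_batch_size=32,
    num_train_epochs=3,              # more epochs; watch the eval loss for overfitting
    weight_decay=0.01,
)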