Question Answering System Training With DistilBERT Base Uncased
Have you ever wished someone could read through a long passage for you and get straight to the point, answering exactly the question you had? Well, you have come to the right place! In this fun project we are going to create a Question and Answer system using a great model known as DistilBERT. Do not worry; the code stays basic and easy to understand throughout. By the end of the project, you will see how easy it is to create something that can understand questions and find answers as if by magic.
Project Overview
In this project, we show you how to construct a question-answering system using the DistilBERT model fine-tuned on the SQuAD dataset. Imagine building your own little robot that can read a passage and pick out the best answer to a question.
We’ll take you through all the necessary steps, from setting up the tools to training the model and finally using it to answer questions. And the best part is that we will be reusing pre-trained models from the Hugging Face model repository.
This project is for you if you have ever wanted to see how an AI system is made, and by the end of it, you will have your own question and answer bot. It’s time to dive in. Let us begin!
Prerequisites
Before diving into this project, you will need to have a few things ready. Do not worry, nothing too complicated.
- An intermediate understanding of Python.
- Familiarity with Jupyter Notebooks or Google Colab for running the project.
- Some knowledge of Hugging Face’s Transformers library.
- A basic understanding of machine learning, particularly model training.
- A Google account for accessing Google Colab, where the project will be run online.
- The SQuAD dataset, which is used for training and fine-tuning the model.
Approach
In this project, we will show you step by step how to build a question answering system with DistilBERT and the SQuAD dataset. The process starts with setting up the environment by installing the prerequisite packages, such as Hugging Face’s transformers and datasets. Then we load SQuAD, a popular dataset widely used to train question answering models. Once the dataset is ready, we move on to tokenization, where we preprocess the questions and text passages into a form the model can handle.
For the fine-tuning stage, we train the distilbert-base-uncased model to answer questions with reference to a given context. Training is handled by Hugging Face’s Trainer API, with hyperparameters such as learning rate and batch size set up front. When training is complete, we test the model’s capabilities by providing new questions and contexts to see how accurately it answers. Last but not least, we publish the model to the Hugging Face Model Hub, where anyone can use it for their own NLP projects.
Workflow and Methodologies
- Install Dependencies: Prepare the environment by installing the transformers, datasets, and evaluate libraries.
- Load the SQuAD Dataset: Load the SQuAD dataset and divide it into training and testing sections.
- Data Tokenization: Prepare questions and contexts for model input using AutoTokenizer from Hugging Face.
- Fine-Tune the Model: Take the pre-trained distilbert-base-uncased model and fine-tune it on the SQuAD dataset for the question answering task.
- Model Training and Evaluation: Use the Trainer to fit the model and evaluate its performance with metrics such as exact match and F1 score.
- Testing with New Questions: Probe the model with new questions and contexts to see how it performs.
- Deploy the Model: Use the push_to_hub() function to publish the trained model to the Hugging Face Model Hub for public sharing and access.
Data Collection and Preparation
Data Collection Workflow
- Collect the Dataset: Load the SQuAD dataset by leveraging the datasets library provided by Hugging Face.
- Explore the Dataset: Look at a few examples to get a sense of how the passages, questions, and answers are formatted.
- Split the Dataset: Split the dataset into 80% for training and 20% for validation.
Data Preparation Workflow
- Tokenize the Data: Use Hugging Face’s AutoTokenizer to prepare questions and contexts for processing.
- Map the Answer Location: Identify and mark where the answer is located within the corresponding tokenized context.
- Batch the Data: Use DefaultDataCollator to stack the already padded, tokenized sequences into uniform batches, improving processing speed.
- Prepare Inputs for Training: Combine the tokenized text with the answer position labels so everything is ready for training.
Code Explanation
STEP 1:
Installing required libraries
The first command installs the transformers library (which provides the pre-trained models), the datasets library (used to load the SQuAD dataset), and the evaluate library (for metrics). The second command installs a tool known as accelerate, which speeds up model training and helps run models on different devices.
!pip install transformers datasets evaluate
!pip install accelerate
Logging into Hugging Face Hub
This code enables access to the Hugging Face Hub straight from the notebook. The notebook_login() function prompts for your Hugging Face credentials so that you can use the pre-trained models, datasets, and other resources the platform provides, and push your own model to the Hub later.
from huggingface_hub import notebook_login
notebook_login()
STEP 2:
Loading the SQuAD Dataset
This code calls the load_dataset function, which is used to download and load datasets for training. Rather than the whole dataset, only the first 5,000 samples from the training split are retrieved; this keeps training the question answering model efficient.
# Importing the load_dataset function from the datasets library
from datasets import load_dataset
# Loading the SQuAD dataset with only the first 5000 examples from the training split
squad = load_dataset("squad", split="train[:5000]")
Splitting the Dataset
In this code, the SQuAD dataset is split for training and testing, with 80% allocated for training and 20% for testing. The code also displays the first example from the training split, providing insight into the dataset's structure before any modeling begins.
squad = squad.train_test_split(test_size=0.2)
squad["train"][0]
Initializing the Tokenizer
This code imports the AutoTokenizer class from the transformers library and prepares a tokenizer for the DistilBERT model. The tokenizer transforms input text into a sequence of token IDs so the model can process it. The model used here is distilbert-base-uncased, meaning its tokenizer ignores letter case, which makes working with natural language a little easier.
# Importing the AutoTokenizer class from the transformers library
from transformers import AutoTokenizer
# Instantiating a tokenizer for the DistilBERT model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
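If you want to see what the tokenizer actually produces, a quick sanity check like the following (an illustrative snippet, not part of the training pipeline) shows the token IDs and the lowercased text wrapped in special tokens:
# Quick sanity check (illustrative): tokenize one question/context pair
sample = tokenizer("What is DistilBERT?", "DistilBERT is a smaller, faster version of BERT.")
print(sample["input_ids"][:10])               # the first few token IDs
print(tokenizer.decode(sample["input_ids"]))  # [CLS] what is distilbert? [SEP] distilbert is ... [SEP]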
STEP 3:
Preprocessing Function
The preprocess_function handles cleaning, tokenization, and answer position mapping. First, it strips extra whitespace from the questions and tokenizes each question together with its context, truncating the context so the combined sequence fits within 384 tokens. The offset mapping tracks where each token sits in the original text.
After tokenizing, the function determines the start and end token positions of each answer within the context. If the answer is not fully contained in the (possibly truncated) context, the positions are labeled (0, 0). Finally, the function returns the tokenized inputs together with these answer positions, ready for training.
def preprocess_function(examples):
    # Strips leading and trailing whitespace from each question in the examples dictionary
    questions = [q.strip() for q in examples["question"]]
    # Tokenizes the questions and contexts, ensuring a maximum length of 384 tokens,
    # truncating the context if necessary, and padding sequences to the maximum length
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    # Retrieves and removes the offset_mapping from the tokenizer inputs
    offset_mapping = inputs.pop("offset_mapping")
    # Retrieves the answers from the examples dictionary
    answers = examples["answers"]
    # Initializes lists to store start and end positions of the answers in tokenized sequences
    start_positions = []
    end_positions = []
    # Iterates over the offset_mapping for each example
    for i, offset in enumerate(offset_mapping):
        # Retrieves the answer information for the current example
        answer = answers[i]
        start_char = answer["answer_start"][0]  # Start character index of the answer
        end_char = answer["answer_start"][0] + len(answer["text"][0])  # End character index of the answer
        sequence_ids = inputs.sequence_ids(i)  # Retrieves sequence ids for the current example
        # Finds the start and end of the context sequence in the tokenized sequence
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # Labels the answer (0, 0) if it is not fully inside the context
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise, assigns the start and end token positions of the answer
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    # Adds start and end positions of answers to the inputs dictionary
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
In this code, the preprocess function is applied to the entire dataset via the map function, which processes the data in chunks for efficiency. The batched=True option means the function receives several examples at a time, which speeds things up.
The remove_columns argument drops the original columns, retaining only the tokenized data and answer positions. With that, every part of the SQuAD dataset has been tokenized and prepared for use.
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
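As a quick, optional check (illustrative), you can confirm that only the model-ready columns remain after mapping:
# The processed dataset should now contain only model inputs and answer positions
print(tokenized_squad["train"].column_names)
# Expected: ['input_ids', 'attention_mask', 'start_positions', 'end_positions']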
Data Collator and Model Initialization for Question Answering
This code sets up the data collator and initializes the model for the question-answering task. First, it imports the DefaultDataCollator class from the transformers library and creates an instance of it. Because the tokenizer has already padded every sequence to the same maximum length, the collator simply stacks the examples into uniform batches, which keeps training efficient.
Next, the code imports the classes needed for model initialization. The AutoModelForQuestionAnswering class loads a pre-trained DistilBERT model with a question-answering head. Calling from_pretrained("distilbert-base-uncased") loads the general-purpose DistilBERT weights; the new question-answering head starts out randomly initialized (Transformers prints a warning about this, which is expected) and will be learned during fine-tuning on our dataset.
# Importing the DefaultDataCollator class from the transformers library
from transformers import DefaultDataCollator
# Instantiating a DefaultDataCollator object
data_collator = DefaultDataCollator()
# Importing necessary classes from the transformers library
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
# Instantiating a pretrained DistilBERT-based model for question answering
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
STEP 4:
Model Training Setup
This code sets up the training procedure using the TrainingArguments and Trainer classes. The training configuration includes parameters such as the number of epochs, batch size, learning rate, and the evaluation strategy.
The Trainer manages the training loop, bringing together the pre-trained model, the tokenized datasets, and the data collator for batching. Finally, training starts with trainer.train(). Afterward, the model checkpoints are saved and can be uploaded directly to the Hugging Face Hub for sharing or easy deployment.
training_args = TrainingArguments(
    # Directory where model checkpoints and outputs will be saved
    output_dir="my_awesome_qa_model",
    # Strategy for evaluating model performance during training (evaluates once per epoch)
    evaluation_strategy="epoch",
    # Learning rate for training the model
    learning_rate=2e-5,
    # Batch size per GPU/CPU for training
    per_device_train_batch_size=16,
    # Batch size per GPU/CPU for evaluation
    per_device_eval_batch_size=16,
    # Total number of training epochs
    num_train_epochs=1,
    # Weight decay coefficient for regularization
    weight_decay=0.01,
    # Flag indicating whether to push the trained model to the Hugging Face Hub
    push_to_hub=True,
)

trainer = Trainer(
    # Pre-trained model for question answering
    model=model,
    # Training arguments defined above
    args=training_args,
    # Training dataset
    train_dataset=tokenized_squad["train"],
    # Evaluation dataset
    eval_dataset=tokenized_squad["test"],
    # Tokenizer used for tokenizing inputs
    tokenizer=tokenizer,
    # Data collator used for batching and data processing
    data_collator=data_collator,
)
# Initiates training of the model using the specified training arguments and datasets
trainer.train()
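The Trainer reports the evaluation loss, but SQuAD-style systems are usually judged by exact match and F1 on the text answers. Here is a minimal sketch of how you could score the freshly trained model on a handful of held-out examples, assuming the squad, model, and tokenizer objects from the steps above are still in memory; the 50-example sample size is an arbitrary choice for a quick check:
# Minimal evaluation sketch (assumes the objects above are still in memory)
import evaluate
from transformers import pipeline

squad_metric = evaluate.load("squad")
qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

# Run the pipeline on a small sample of the held-out split
sample = squad["test"].select(range(50))
predictions = [
    {"id": ex["id"], "prediction_text": qa(question=ex["question"], context=ex["context"])["answer"]}
    for ex in sample
]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in sample]

# Prints exact match and F1 over the sample
print(squad_metric.compute(predictions=predictions, references=references))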
STEP 5:
Pushing the Model to Hugging Face Hub
The trainer.push_to_hub() command uploads the trained model, along with its configuration and training details, to the Hugging Face Model Hub. This lets others download, use, or further fine-tune the model in their own work, and makes it easy to deploy in applications straight from the Hugging Face platform.
trainer.push_to_hub()
STEP 6:
Setting Up Contextual Question-Answering Pipeline with Pre-Trained Model
This code defines a question and its context to try out the fine-tuned question-answering model. The question posed is, “How many programming languages does BLOOM support?” The context provided is: “BLOOM has 176 billion parameters and can generate text in 46 natural languages and 13 programming languages.” The model relies on this context to answer the question accurately.
Next, the code imports the pipeline function from transformers to create a question-answering pipeline, using the model we pushed to the Hub at "aionlinecourse/my_awesome_qa_model". The pipeline wraps tokenization, inference, and decoding, so it can take the question and context directly and return an answer, streamlining the process of querying text-based data.
# Define the question to ask
question = "How many programming languages does BLOOM support?"
# Define the context in which the question is being asked
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."
# Importing the pipeline function from the transformers library
from transformers import pipeline
# Instantiating a question-answering pipeline with the specified model
question_answerer = pipeline("question-answering", model="aionlinecourse/my_awesome_qa_model")
# Using the question-answering pipeline to generate an answer to the given question based on the provided context
question_answerer(question=question, context=context)
# Define a new question to ask
question = "Who are you?"
# Define the context in which the question is being asked
context = "I am Tareq, I read in class Five, I live in Dhaka."
# Reusing the same question-answering pipeline for the new question and context
question_answerer(question=question, context=context)
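For reference, the pipeline returns a dictionary containing the predicted answer span, its character offsets within the context, and a confidence score. For the second example you would expect output of this shape (the values shown are a made-up illustration; yours will differ):
# Typical shape of the pipeline's output (values are illustrative)
# {'score': 0.95, 'start': 5, 'end': 10, 'answer': 'Tareq'}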
Conclusion
Well, we have come to the end of our exciting project on building a question answering system with DistilBERT and the SQuAD dataset! We covered every stage, from environment setup and data loading to fine-tuning the model and deploying it to the Hugging Face Hub.
Through this project, you are now equipped to create your own machine learning models for a range of natural language processing (NLP) tasks. Your model can produce correct answers to the queries you pose, and you can host it on the Hugging Face platform so other people can use it too. Projects like this help anyone build smarter interactive applications, whether you are a data science enthusiast, a developer, or simply curious about AI.
What’s next? With this strong groundwork you can keep improving your model, try other NLP tasks, or simply continue exploring AI.
Challenges and Solutions
Challenge: Trouble setting up the environment.
Solution: Google Colab comes with many of these libraries pre-installed; for anything missing, simply use pip install.
Challenge: Slow training on a local machine.
Solution: Take advantage of Google Colab’s free GPU, which greatly speeds up model training without requiring costly hardware.
Challenge: The dataset is too large to handle.
Solution: Begin by training on just a part of the dataset, such as the first 5,000 examples of SQuAD.
Challenge: Tokenized inputs are longer than the model’s allowable input size.
Solution: Use the tokenizer’s truncation and padding options to make the inputs conform to the model’s token limit, such as the maximum of 384 tokens used here.
Challenge: The model is overfitting on the training data.
Solution: Use techniques such as weight decay (weight_decay=0.01) and restrict the number of training epochs.
Challenge: Low accuracy on evaluation.
Solution: Try adjusting hyperparameters (for instance, the learning rate and batch size) or increase the training dataset size.
FAQ
Question 1: What is DistilBERT and why did we use it for creating a question answer system?
Answer: DistilBERT is a smaller, distilled version of BERT; it is much faster while still providing very good performance. That makes it well suited to tasks such as question answering, where it can process large amounts of text and still give accurate answers.
Question 2: Can you explain the SQuAD dataset and its significance in creating a question answer system?
Answer: SQuAD, the Stanford Question Answering Dataset, is one of the most widely used datasets for natural language understanding models. It consists of text passages with questions and answers, which makes it ideal for teaching models to answer questions based on context.
Question 3: What measures should I take after training my model?
Answer: In this project we published the model to the Hugging Face Hub using the push_to_hub() function. Afterward, your model is available on the Hub to be downloaded, tested, and even further fine-tuned by other users.
Question 4: Is it possible for me to use another model apart from DistilBERT in creating a question answer system?
Answer: Absolutely! DistilBERT can be swapped out for other pre-trained models such as BERT, RoBERTa, or even GPT if needed. Hugging Face hosts a wide variety of models that can be fine-tuned in a similar way.
Question 5: In which ways can I enhance the quality of the question-answering model that I have built?
Answer: To enhance performance, you can train the model with different hyperparameters (e.g., learning rate, batch size), run more training epochs, and/or retrain the model with more data.
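As a concrete starting point, a retraining run along these lines (illustrative values, not tuned for this dataset) changes only the TrainingArguments from STEP 4:
# One possible retraining configuration (illustrative values, not tuned)
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=3e-5,              # try a few values around 2e-5
    per_device_train_batch_size=32,  # larger batches if GPU memory allows
    per_device_eval_batch_size=32,
    num_train_epochs=3,              # more epochs; watch the eval loss for overfitting
    weight_decay=0.01,
)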