Document Summarization Using SentencePiece and Transformers
Have you ever wished for a quick summary instead of a long document? This project has you covered! We dive into the world of document summarization, equipping ourselves with cutting-edge AI tools such as SentencePiece and Transformers.
Sounds fun, right? Let's see how we can make that happen!
Overview
The objective of this project is to summarize a given document using SentencePiece and Transformers. We use PEGASUS, a pre-trained deep learning model, and fine-tune it to summarize texts from the SAMSum dataset. Here is what we will do:
- Download the SAMSum dataset, which contains dialogue-based texts.
- Fine-tune PEGASUS, a cutting-edge model built for summarization tasks.
- Generate summaries that keep the most important information while staying short and readable.
- Apply evaluation metrics such as ROUGE to ensure that the model is accurate.
Prerequisites
Before jumping into this exciting project, there are a few things you need to know. Let's keep it simple and get you ready:
- Basic understanding of Python programming.
- A Google Colab account to run the project.
- Knowledge of how to use Google Drive for storing data.
- Familiarity with Transformer models; having heard of BERT, GPT, or PEGASUS is enough.
- A Hugging Face account, because we're using pre-trained models from Hugging Face.
- Knowledge of the basics of PyTorch. It’ll help you run and tweak the models.
- A CUDA-enabled GPU.
Approach
In this project, we first load the SAMSum dataset, which is full of conversations. Then we fine-tune the PEGASUS model using Hugging Face Transformers to handle these dialogues and produce short summaries. We tokenize the input text with SentencePiece, breaking it into subword pieces so the model can understand it. The model is then trained to generate summaries while keeping the key information intact. Once the model is fine-tuned, we evaluate it with ROUGE metrics to ensure the summaries are concise yet accurate. Finally, we put it to the test on some real-world examples and save the trained model for future use!
Workflow and Methodology
Install Necessary Packages: Before starting the project, install the required dependencies with pip. Transformers, SentencePiece, and datasets must be installed for this project to run well.
Load the Dataset: Load the SAMSum dataset using the Hugging Face datasets library.
Tokenization: Feed the text to SentencePiece and transform it into its encoded form. At this stage, entire conversations are broken into smaller pieces so that the model can comprehend and structure the information without obstacles.
Fine-tune the Model: Next, fine-tune the PEGASUS model so that it produces summaries of the provided texts.
Evaluate the Model: After training, we measure the trained model's performance using the ROUGE metric.
Test and Inference: Once trained, the model is loaded and its ability to summarize previously unseen dialogues is assessed on example inputs.
Save the Model: Finally, we store both the model and the tokenizer, so the trained model can be reused for a similar task in the future.
Methodology
Data Preparation: Download and load the SAMSum dataset, then prepare it for training by tokenizing it with SentencePiece.
Model Setup: We will be employing the PEGASUS model available in the Hugging Face Library.
Fine-Tuning: The model is trained on the tokenized SAMSum dataset.
Evaluation: When training is done, we measure performance using ROUGE metrics.
Testing and Deployment: After evaluation, we test the model with new dialogues. Finally, we save the fine-tuned model and tokenizer.
Data Collection and Preparation Workflow
Data Collection Workflow
- Collect the Dataset: First, we collect the SAMSum dataset.
- Load the Dataset: We load the SAMSum dataset into our environment using the Hugging Face datasets library
- Explore the Data: We take a closer look at the dataset, which includes two main columns: dialogue and summary.
Data Preparation Workflow
- Tokenization: We tokenize the data using SentencePiece, which breaks the text into smaller subword tokens so the model can work with it properly.
- Truncation and Padding: We truncate long texts and pad short ones so that every dialogue has a fixed input size.
- Batch the Data: After tokenizing, we split the data into smaller batches (see the short sketch after this list).
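To get a concrete feel for this preparation, here is a minimal sketch using the same SentencePiece-based PEGASUS tokenizer that the project loads later in STEP 2 (the variable name sp_tokenizer, the sample line, and the max_length value are illustrative):
from transformers import AutoTokenizer
# Load the SentencePiece-based PEGASUS tokenizer
sp_tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")
sample = "Amanda: I baked cookies. Do you want some?"
# Truncate anything longer than max_length and pad anything shorter to it
encoded = sp_tokenizer(sample, max_length=32, truncation=True, padding="max_length")
# Inspect the first few subword pieces produced by SentencePiece
print(sp_tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:10])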
Code Explanation
STEP 1:
This code installs two important Python libraries. The first command installs the accelerate library, which helps manage and speed up model training, especially on GPUs. The second command installs the transformers library, which offers pre-trained models for tasks like translation, summarization, and more.
!pip install -U accelerate
!pip install -U transformers
This command installs several important libraries. The transformers[sentencepiece] part installs the Transformers library along with SentencePiece for tokenizing text. datasets is for loading and processing datasets easily. sacrebleu and rouge_score are tools for evaluating model performance, especially on text summarization tasks. Finally, py7zr is a library for handling 7zip-compressed files, which the SAMSum dataset is distributed in.
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q
# Check which GPU is available in the Colab runtime
!nvidia-smi
It imports pipeline and set_seed from Transformers (for model inference and reproducibility), matplotlib for plotting graphs, load_dataset from datasets for loading data, pandas for data manipulation, and nltk for text tokenization. It also includes tqdm for progress bars and torch for deep-learning capabilities.
# Import necessary libraries
# Import pipeline for easy model inference, and set_seed for reproducibility
from transformers import pipeline, set_seed
# Import matplotlib for visualization
import matplotlib.pyplot as plt
# Import load_dataset to load datasets
from datasets import load_dataset
# Import pandas for data manipulation
import pandas as pd
# Import AutoModelForSeq2SeqLM and AutoTokenizer for sequence-to-sequence tasks
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Import NLTK (Natural Language Toolkit) for text processing
import nltk
# Import sent_tokenize for sentence tokenization
from nltk.tokenize import sent_tokenize
# Import tqdm for progress bars
from tqdm import tqdm
# Import PyTorch for deep learning capabilities
import torch
# Download NLTK tokenizer models
nltk.download("punkt")
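Although set_seed is imported above, it is never called later in the notebook. If you want reproducible results, you can invoke it once right after the imports; a minimal example (the seed value 42 is arbitrary):
# Fix the random seeds (Python, NumPy, PyTorch) for reproducible runs
set_seed(42)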
STEP 2:
Model setup
This code initializes and configures the PEGASUS model for summarization. It begins by checking whether a GPU is available and sets the device to CUDA or CPU accordingly. It then uses the checkpoint "google/pegasus-cnn_dailymail", available on Hugging Face's model hub, to download the PEGASUS model and its tokenizer. Finally, the model is moved to the selected device (GPU or CPU) for fast processing.
# Import necessary transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Check for GPU availability and set the device accordingly
device = "cuda" if torch.cuda.is_available() else "cpu"
# Define the pre-trained Pegasus model checkpoint
model_ckpt = "google/pegasus-cnn_dailymail"
# Initialize the tokenizer using the specified pre-trained model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# Load the pre-trained Pegasus model and move it to the specified device (CPU or GPU)
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
STEP 3:
Splitting the dataset into batches
The function generate_batch_sized_chunks divides a large list into smaller batches. It takes two inputs: a list of elements and a batch size. The function steps through the list in increments of the batch size and yields each chunk one by one, making it efficient to process large datasets in smaller pieces.
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Split the dataset into smaller batches that we can process simultaneously.
    Yield successive batch-sized chunks from list_of_elements."""
    # Iterate through the list_of_elements in increments of batch_size
    for i in range(0, len(list_of_elements), batch_size):
        # Yield a batch-sized chunk of elements from the list_of_elements
        yield list_of_elements[i : i + batch_size]
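As a quick, hypothetical usage example, chunking ten numbers into batches of four yields batches of sizes 4, 4, and 2:
# Split ten numbers into batches of four: [0..3], [4..7], [8, 9]
print(list(generate_batch_sized_chunks(list(range(10)), 4)))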
Evaluating Summarization Accuracy
The function calculate_metric_on_test_ds evaluates how well the model summarizes the texts in the test data. It accepts a dataset, a metric, a model, and a tokenizer as inputs. generate_batch_sized_chunks first splits the dataset into batches. Then, for each batch, it tokenizes the text and uses the model to generate summaries. The generated summaries are decoded, cleaned up, and compared to the target summaries. Finally, the function computes and returns the ROUGE score as a measure of the quality of the generated summaries.
def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                                batch_size=16, device=device,
                                column_text="article",
                                column_summary="highlights"):
    # Split the input documents and reference summaries into batch-sized chunks
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))
    # Iterate through each batch of input documents and reference summaries
    for article_batch, target_batch in tqdm(
            zip(article_batches, target_batches), total=len(article_batches)):
        # Tokenize the input documents
        inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                           padding="max_length", return_tensors="pt")
        # Generate summaries; the length_penalty parameter ensures that the
        # model does not generate sequences that are too long
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                   attention_mask=inputs["attention_mask"].to(device),
                                   length_penalty=0.8, num_beams=8, max_length=128)
        # Decode the generated texts, replace the "<n>" token, and add the
        # decoded texts with the references to the metric
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                              clean_up_tokenization_spaces=True)
                             for s in summaries]
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        metric.add_batch(predictions=decoded_summaries, references=target_batch)
    # Finally, compute and return the ROUGE scores
    score = metric.compute()
    return score
STEP 4:
Load the dataset
The code uses the load_dataset function to import the SAMSum dataset. It prints the size of each split in the dataset and then describes its features, especially the dialogue and summary columns. For better understanding, a single dialogue and its summary from the test set are displayed.
# Load the "samsum" dataset
dataset_samsum = load_dataset("samsum")
# Get the lengths of each split in the dataset
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]
# Print the lengths of dataset splits
print(f"Split lengths: {split_lengths}")
# Print the column names of the training split
print(f"Features: {dataset_samsum['train'].column_names}")
# Print a specific dialogue from the test split
print("\nDialogue:")
print(dataset_samsum["test"][1]["dialogue"])
# Print the corresponding summary of the dialogue
print("\nSummary:")
print(dataset_samsum["test"][1]["summary"])
Accessing test dialogue example
This code selects the first dialogue from the test set. This allows us to view the conversation for that particular test sample.
dataset_samsum['test'][0]['dialogue']
Generating summary with model pipeline
This code creates a summarization pipeline using the specified model. It then runs the pipeline on the first dialogue in the test set. The generated summary is stored in the pipe_out variable. Then it prints the summary.
# Create a text summarization pipeline using the specified model checkpoint
pipe = pipeline('summarization', model=model_ckpt)
# Generate a summary for the dialogue from the first sample in the test split of the "samsum" dataset
pipe_out = pipe(dataset_samsum['test'][0]['dialogue'])
# Print the generated summary
print(pipe_out)
Cleaning and formatting the summary
This code formats the generated summary for readability. It retrieves the summary text from the first element of pipe_out ([0]) and replaces the <n> token with a newline so each sentence starts on its own line. The printed summary is cleaner and easier to read.
# Print the generated summary text with proper formatting
print(pipe_out[0]['summary_text'].replace(" .<n>", ".\n"))
Calculating ROUGE Scores for Model Evaluation
In this code, the ROUGE metric is loaded using load_metric('rouge'); note that load_metric comes from the datasets library and must be imported first. The ROUGE metric measures the quality of the generated summaries against the provided reference summaries. Next, the function calculate_metric_on_test_ds is called to compute ROUGE scores on the test set of the SAMSum dataset, comparing each generated summary with the reference summary.
# Import load_metric (not included in the earlier imports) and load the ROUGE metric
from datasets import load_metric
rouge_metric = load_metric('rouge')
# Calculate the ROUGE score on the test split of the "samsum" dataset
score = calculate_metric_on_test_ds(dataset_samsum['test'], rouge_metric, model_pegasus, tokenizer, column_text='dialogue', column_summary='summary', batch_size=8)
Organizing ROUGE Scores into a DataFrame
This code extracts the ROUGE scores for four types of metrics: rouge1, rouge2, rougeL, and rougeLsum. It then creates a dictionary where each metric name maps to its respective F-measure score. Finally, the scores are organized into a pandas DataFrame, with the model name ('pegasus') as the index, for easy viewing and analysis of the model’s summarization performance.
# Define the names of ROUGE metrics
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
# Create a dictionary to store ROUGE scores for each metric
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
# Convert the dictionary into a DataFrame with 'pegasus' as the index
rouge_df = pd.DataFrame(rouge_dict, index=['pegasus'])
# Print the DataFrame containing ROUGE scores
print(rouge_df)
STEP 5:
Visualizing Token Lengths for Dialogues and Summaries
This code computes the token lengths of the dialogues and summaries in the SAMSum training set. It encodes each text with the tokenizer and measures the length of every tokenized sequence, then plots histograms of the dialogue and summary token lengths.
# Calculate token lengths for dialogues and summaries in the training dataset
dialogue_token_len = [len(tokenizer.encode(s)) for s in dataset_samsum['train']['dialogue']]
summary_token_len = [len(tokenizer.encode(s)) for s in dataset_samsum['train']['summary']]
# Create subplots for dialogue and summary token lengths
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Plot histogram for dialogue token lengths
axes[0].hist(dialogue_token_len, bins=20, color='C0', edgecolor='C0')
axes[0].set_title("Dialogue Token Length")
axes[0].set_xlabel("Length")
axes[0].set_ylabel("Count")
# Plot histogram for summary token lengths
axes[1].hist(summary_token_len, bins=20, color='C0', edgecolor='C0')
axes[1].set_title("Summary Token Length")
axes[1].set_xlabel("Length")
# Adjust layout and display the plots
plt.tight_layout()
plt.show()
STEP 6:
Tokenizing and Preparing Dataset for Training
The function convert_examples_to_features prepares the dataset for model training. It tokenizes the dialogues and the summaries, ensuring that neither exceeds its specified max length. The function returns the input IDs, attention masks, and labels. Finally, this transformation is applied to the entire dataset using map with batching enabled.
def convert_examples_to_features(example_batch):
    # Encode dialogue texts using the tokenizer with truncation
    input_encodings = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)
    # Use the target tokenizer for summary texts
    with tokenizer.as_target_tokenizer():
        # Encode summary texts with truncation
        target_encodings = tokenizer(example_batch['summary'], max_length=128, truncation=True)
    # Return features containing input_ids, attention_mask, and labels
    return {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }
# Map the conversion function to the dataset in batched mode
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched=True)
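As an optional sanity check, you can confirm that the mapped dataset now carries the new feature columns alongside the original ones:
# The mapped dataset should now include input_ids, attention_mask, and labels
print(dataset_samsum_pt["train"].column_names)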
Setting Up Data Collator for Sequence-to-Sequence Tasks
This code imports DataCollatorForSeq2Seq from the transformers library, which helps efficiently prepare data for summarization. It creates a seq2seq_data_collator using the tokenizer and the PEGASUS model. The collator handles the padding and formatting of the tokenized inputs and labels so that they are aligned properly for training.
from transformers import DataCollatorForSeq2Seq
# Initialize the data collator for sequence-to-sequence tasks
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)
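If you are curious what the collator produces, here is a small, hypothetical check that pads two examples into a single batch (the index range and the key selection are illustrative):
# Keep only the tokenized fields; the leftover string columns would confuse the collator
features = [{k: dataset_samsum_pt["train"][i][k]
             for k in ("input_ids", "attention_mask", "labels")} for i in range(2)]
batch = seq2seq_data_collator(features)
# Every tensor in the batch is padded to a common length
print({k: v.shape for k, v in batch.items()})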
This code connects your Google Drive account to the Colab workspace. It makes the files in your Google Drive accessible by mounting the drive at a particular folder ('/content/drive').
from google.colab import drive
drive.mount('/content/drive')
This command changes the current working directory in your Colab environment to a specified path on Google Drive.
%cd /content/drive/MyDrive/Aionlinecourse_uday/Project/Untitled folder/aionline
STEP 7:
Setting Training Arguments for Model Fine-Tuning
To fine-tune the model, this code uses the TrainingArguments class from Transformers. Key parameters include the location where checkpoints are stored, the number of training epochs, and the training and evaluation batch sizes. Other fields configure the number of warmup steps, weight-decay regularization, and the logging/evaluation intervals. Memory usage is controlled with gradient_accumulation_steps, which updates the weights only after gradients from 16 steps have accumulated, giving an effective batch size of 16 even though per_device_train_batch_size is 1.
from transformers import TrainingArguments, Trainer
# Define training arguments
trainer_args = TrainingArguments(
    # Directory to save model checkpoints and results
    output_dir='/content/drive/MyDrive/Aionlinecourse',
    # Number of training epochs
    num_train_epochs=1,
    # Number of warmup steps for learning rate scheduling
    warmup_steps=500,
    # Batch size per GPU for training
    per_device_train_batch_size=1,
    # Batch size per GPU for evaluation
    per_device_eval_batch_size=1,
    # Weight decay coefficient for regularization
    weight_decay=0.01,
    # Log training metrics every specified number of steps
    logging_steps=10,
    # Evaluation strategy during training
    evaluation_strategy='steps',
    # Evaluate every specified number of steps
    eval_steps=500,
    # Save model checkpoints every specified number of steps
    save_steps=1e6,
    # Number of steps for gradient accumulation
    gradient_accumulation_steps=16
)
Initializing the Trainer for Model Fine-Tuning
The following code sets up the Trainer to fine-tune the PEGASUS model. It takes several arguments: the model to fine-tune, the training arguments, the tokenizer, and the data collator (seq2seq_data_collator). It also specifies the training and validation datasets derived from the processed SAMSum dataset. The Trainer handles the complete training process, along with evaluation and model checkpointing.
from transformers import Trainer
# Initialize the Trainer object
trainer = Trainer(
    # The sequence-to-sequence model to be trained
    model=model_pegasus,
    # Training arguments defined earlier
    args=trainer_args,
    # Tokenizer associated with the model
    tokenizer=tokenizer,
    # Data collator for batch processing
    data_collator=seq2seq_data_collator,
    # Training dataset
    train_dataset=dataset_samsum_pt["train"],
    # Evaluation dataset
    eval_dataset=dataset_samsum_pt["validation"]
)
Starting the Model Training Process
This command starts the training process using the Trainer you just set up.
trainer.train()
Evaluating the Model's Summarization Performance
This code evaluates the fine-tuned PEGASUS model on the test set. It calculates ROUGE scores for the model's summaries using calculate_metric_on_test_ds, storing the results in score. The F-measure for each ROUGE type is then extracted into a dictionary (rouge_dict), and the scores are presented in a pandas DataFrame for easy comparison.
# Calculate ROUGE scores on the test dataset
score = calculate_metric_on_test_ds(
dataset_samsum['test'], rouge_metric, trainer.model, tokenizer,
batch_size=2, column_text='dialogue', column_summary='summary'
)
# Extract ROUGE scores for different metrics
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
# Create a DataFrame to display ROUGE scores
rouge_df = pd.DataFrame(rouge_dict, index=['pegasus'])
Saving the model and tokenizer
These commands save the fine-tuned PEGASUS model and the tokenizer to directories called "summarizing-model" and "p_tokenizer" respectively, so they can be reused in the future.
## Save model
model_pegasus.save_pretrained("summarizing-model")
Save Tokenizer
## Save tokenizer
tokenizer.save_pretrained("p_tokenizer")
STEP 8:
Loading the Dataset and Pre-trained Tokenizer
This code begins by loading the SAMSum dataset using the load_dataset function and then loads the pre-trained tokenizer, ensuring text processing consistency with the model's training phase. Next, it retrieves the first sample of the test set from the SAMSum dataset, storing the dialogue in sample_text and the reference summary in reference, which aids in both generating and evaluating summaries. The summarization pipeline is then set up by defining generation parameters such as length_penalty, num_beams, and max_length. The code uses the pipeline function to load a fine-tuned model with the pre-trained tokenizer, generating concise summaries from the dialogue. Finally, it displays the original dialogue, reference summary, and model-generated summary for a comprehensive comparison.
# Load the "samsum" dataset using the load_dataset function
dataset_samsum = load_dataset("samsum")
# Load the tokenizer from a pretrained model named "p_tokenizer"
tokenizer = AutoTokenizer.from_pretrained("p_tokenizer")
# Retrieve a sample dialogue and its corresponding reference summary from the test split of the "samsum" dataset
sample_text = dataset_samsum["test"][0]["dialogue"]
reference = dataset_samsum["test"][0]["summary"]
# Define generation parameters for the summarization pipeline
gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 128}
# Initialize a summarization pipeline with the specified model checkpoint and tokenizer
pipe = pipeline("summarization", model="summarizing-model", tokenizer=tokenizer)
# Print the sample dialogue, reference summary, and the model-generated summary
print("Dialogue:")
print(sample_text)
print("\nReference Summary:")
print(reference)
print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])
Interactive Dialogue Summarization Loop
This code sets up an interactive loop where users can input dialogues, and the model generates and prints summaries. The loop continues until the user types "exit" to quit.
from transformers import pipeline, AutoTokenizer
# Load tokenizer and summarization pipeline
tokenizer = AutoTokenizer.from_pretrained("p_tokenizer")
pipe = pipeline("summarization", model="summarizing-model", tokenizer=tokenizer)
# Define generation parameters
gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 128}
while True:
    # Prompt user for input
    user_input = input("Enter a dialogue (or 'exit' to quit): ")
    if user_input.lower() == 'exit':
        print("Exiting...")
        break
    # Generate summary
    summary = pipe(user_input, **gen_kwargs)[0]["summary_text"]
    # Print the generated summary
    print("\nGenerated Summary:")
    print(summary)
Conclusion
This project shows how to summarize text in practice with AI-based summarization models such as PEGASUS. Using Hugging Face's Transformers, SentencePiece, and a dataset like SAMSum, we showed how to fine-tune a model to generate summaries of dialogues. The key steps are described: preparing the data, tokenizing it, training the model, and validating it with ROUGE scores. This method provides an efficient way to compress large volumes of text into concise, clear summaries.
This AI-powered tool can be leveraged for applications such as call centers, chat summarization, and meeting notes summarization.
Challenges and Solutions
Challenge: Long Training Times
Solution: Use Colab's free GPU to significantly speed up the training process.
Challenge: Handling a large dataset.
Solution: Adjust the batch size or use a pre-trained model like PEGASUS to handle large datasets.
Challenge: Tokenization issues with long texts.
Solution: Use SentencePiece for tokenization and set appropriate max_length values. This ensures that longer texts are truncated and tokenized efficiently.
Challenge: Low ROUGE Scores for Summarization
Solution: Experiment with generation parameters like num_beams, max_length, and length_penalty to improve summary quality (see the sketch below).
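A small, hypothetical sketch of such tuning, reusing the pipe and sample_text variables from the inference step (the parameter values are starting points, not recommendations):
# Compare two generation settings on the same dialogue
for kwargs in [{"length_penalty": 0.8, "num_beams": 8, "max_length": 128},
               {"length_penalty": 1.2, "num_beams": 4, "max_length": 64}]:
    print(kwargs)
    print(pipe(sample_text, **kwargs)[0]["summary_text"], "\n")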
FAQ
Question 1: Why is the SAMSum dataset used for AI text summarization?
Answer: The SAMSum dataset is widely used for training AI models to summarize dialogues. It contains real-life conversations paired with human-written summaries, which makes it well suited for training the model.
Question 2: How can I improve the accuracy of AI-generated text summaries?
Answer: You can fine-tune your model and adjust hyperparameters like num_beams and length_penalty, or train on a larger, more diverse dataset.
Question 3: What tools are required to implement AI-based text summarization using transformers?
Answer: Implementing AI text summarization with Transformers requires tools like Python, the Hugging Face library, and a GPU. SentencePiece is also essential for efficient tokenization.
Question 4: How do I fix out-of-memory errors when training large models like PEGASUS?
Answer: Reduce the batch size, apply gradient accumulation, enable mixed precision, or switch to a more memory-efficient model to handle out-of-memory errors during training (see the sketch below).
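A sketch of those memory-saving settings with TrainingArguments (the values are illustrative, the variable name memory_friendly_args is hypothetical, and fp16 assumes a CUDA GPU):
# Trade batch size for gradient accumulation and use mixed precision
memory_friendly_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # smallest possible per-device batch
    gradient_accumulation_steps=16,  # still an effective batch size of 16
    fp16=True,                       # halve activation memory on CUDA GPUs
)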
Question 5: Can AI models be used to summarize non-conversational texts?
Answer: Yes! While models like PEGASUS are trained on conversational datasets, they can be fine-tuned to summarize various text types, from articles to technical documents.