Build Multi-Class Text Classification Models with RNN and LSTM
How often do you come across large volumes of text data and pause to think of an easier way to decipher it? This is where this project comes in handy: it employs RNN and LSTM techniques to intelligently classify real-life customer complaints. Be it identity theft or credit card issues, the project takes raw complaint text and maps it to the right category.
Project Overview
This project investigates text classification end to end using Python, PyTorch, and Natural Language Processing. It begins with the dataset: cleaning and formatting the raw complaint text and converting it into GloVe embeddings. RNN and LSTM neural networks are then built on the tokenized data and trained to predict the category of each complaint. The resulting models are checked for performance using accuracy and confusion matrices, among other metrics.
Step by step, with working code, the project demonstrates how to normalize text, create embeddings, and build classification models. It is a practical way to understand natural language processing, deep learning, and the use of AI for problem-solving. Ideal for programmers, data analysts, and machine learning enthusiasts!
Prerequisites
Before commencing this project, ensure that you have the following skills and tools:
- Python: You need to know how to program in Python and use libraries such as NumPy and Pandas.
- Natural Language Processing Basics: You should be familiar with how tokenization works, what embeddings are, and basic text preprocessing techniques.
- Knowledge of PyTorch: Knowing how to use PyTorch in the creation and training of neural networks is a must.
- Machine Learning: Knowledge of classification tasks, loss functions, and accuracy metrics.
- GloVe Embeddings: Be able to explain how and why word embeddings are useful in representing and manipulating text data.
- Tools Installed: Confirm that NLTK, Scikit-learn, Matplotlib, and Seaborn libraries are present in your Python environment.
Provided all these prerequisites are in place, you are good to go! Let’s begin with the art of text classification!
Approach
This project carries out text classification with deep learning models, namely RNN and LSTM, in an organized manner. It commences with data preprocessing: the raw text is cleaned of unwanted characters, such as digits and other special characters, as well as excess whitespace. Each complaint is then tokenized into a fixed-length sequence, and the tokens are converted into GloVe embedding vectors, which preserve the semantic relationships among words.
The prepared data is split into three sets for training, validation, and model testing. To feed the data to the models efficiently, custom datasets and data loaders based on the PyTorch framework are created. Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models are then established and trained with cross-entropy loss and the Adam optimizer, while the validation set is monitored to guard against overfitting.
Once training is complete, the models are evaluated against test data that was never presented to them. The performance evaluation uses metrics such as accuracy, confusion matrices, and classification reports. The methodology is straightforward and clear: simple but powerful text processing techniques combined with deep learning approaches produce an effective text classification strategy.
Workflow and Methodology
Workflow
- First load and collect the dataset.
- Preprocess it by removing noise, handling empty or missing values, and consolidating duplicate labels.
- Perform tokenization of the textual information and encode it into a vectorized form by employing GloVe.
- Organize the data into training, validation, and test sets to facilitate the evaluation of the model.
- Train RNN and LSTM networks with the PyTorch framework, using cross-entropy loss and the Adam optimizer.
- Measure how well the trained models classify the data using accuracy, the confusion matrix, and the classification report.
Methodology
- Make use of the GloVe embeddings to represent text data in an appropriate form.
- Utilize RNN and LSTM networks to model the sequential patterns of complaint narratives.
- Improve network performance by minimizing the loss function while monitoring the validation loss.
- Keep the network that has the best performance on the validation data set so that the performance on the test set is as good as possible.
- Evaluate the test predictions with the help of confusion matrices and extensive classification reports.
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.
Data Preparation Workflow:
- The dataset should be imported and preprocessed including noise removal and treatment of empty or missing data.
- Perform text tokenization, padding or truncating, and mapping of tokens into GloVe embedding vectors.
- For model assessment, the data should be split into training, validation, and test sets.
Code explanation
Here’s what is happening under the hood. Let’s go through it step by step:
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the dataset that is stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Installing Required Libraries
This code installs a few important libraries: NLTK for natural language processing, NumPy for working with arrays, Pandas for handling tabular data, PyTorch for deep learning tasks, tqdm for easy-to-use progress bars, and scikit-learn for machine learning utilities.
# import the required packages
!pip install nltk
!pip install numpy
!pip install pandas
!pip install torch
!pip install tqdm
!pip install scikit-learn
Importing Libraries and Handling Warnings
This piece of code imports fundamental libraries required for natural language processing, building and implementing machine learning models, and visualizing the information generated with the help of NLTK, PyTorch, and Scikit-learn. Moreover, certain warnings are turned off to keep the workspace clean and organized.
import nltk
nltk.download('punkt')

import re
import torch
import pickle
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.exceptions import UndefinedMetricWarning
# Suppress specific warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
Defining Configuration and File Paths
This code snippet initializes important configurations such as the learning rate, input dimensions, and number of epochs used for training and testing. It also defines file paths for the data, models, tokens, embeddings, and a product map for uniform label encoding.
# define configuration file paths
lr = 0.0001
input_size = 50
num_epochs = 50
hidden_size = 50
label_col = "Product"
tokens_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/tokens.pkl"
labels_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/labels.pkl"
data_path = "/content/drive/MyDrive/New 90 Projects/Project_11/Data/complaints.csv"
rnn_model_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/model_rnn.pth"
lstm_model_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/model_lstm.pth"
vocabulary_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/vocabulary.pkl"
embeddings_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/embeddings.pkl"
glove_vector_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/glove.6B.50d.txt"
text_col_name = "Consumer complaint narrative"
label_encoder_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/label_encoder.pkl"
product_map = {'Vehicle loan or lease': 'vehicle_loan',
'Credit reporting, credit repair services, or other personal consumer reports': 'credit_report',
'Credit card or prepaid card': 'card',
'Money transfer, virtual currency, or money service': 'money_transfer',
'virtual currency': 'money_transfer',
'Mortgage': 'mortgage',
'Payday loan, title loan, or personal loan': 'loan',
'Debt collection': 'debt_collection',
'Checking or savings account': 'savings_account',
'Credit card': 'card',
'Bank account or service': 'savings_account',
'Credit reporting': 'credit_report',
'Prepaid card': 'card',
'Payday loan': 'loan',
'Other financial service': 'others',
'Virtual currency': 'money_transfer',
'Student loan': 'loan',
'Consumer Loan': 'loan',
'Money transfers': 'money_transfer'}
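As a quick illustration (not part of the project pipeline), the snippet below shows how the product map collapses several raw label variants into a single class; the toy DataFrame is made up purely for demonstration.
# Illustrative only: a made-up DataFrame to show how product_map merges label variants
import pandas as pd
demo = pd.DataFrame({"Product": ["Credit card", "Prepaid card", "Student loan", "Money transfers"]})
demo.replace({"Product": product_map}, inplace=True)
print(demo["Product"].tolist())  # expected: ['card', 'card', 'loan', 'money_transfer']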
Functions to Save and Load a File
This code introduces two helper functions: save_file stores an object as a pickle file, and load_file reads a pickled object back into memory.
# define function for saving a file
def save_file(name, obj):
    """
    Function to save an object as pickle file
    """
    with open(name, 'wb') as f:
        pickle.dump(obj, f)

# define function for loading a file
def load_file(name):
    """
    Function to load a pickle object
    """
    return pickle.load(open(name, "rb"))
STEP 2:
Reading GloVe Embeddings
This code reads the GloVe embeddings file and calculates the total number of unique words.
# open the glove embeddings file and read
with open(glove_vector_path, "rt") as f:
    emb = f.readlines()
# 400000 unique words are there in the embeddings (length of embeddings)
len(emb)
Retrieval of the First Entry in the GloVe Embedding Dataset
This piece of code fetches the first entry from the GloVe embeddings file and prints out the first word along with a vector assigned to it.
# check first record
emb[0]
Extracting the Initial Word from GloVe Embeddings
The following code separates the initial word present in the first entry of the GloVe embeddings from its respective vector values.
# split the first record and check for vocabulary
emb[0].split()[0]
Extraction of Embedding Values of the First Word
The given code snippet extracts the details or vector values (embeddings) corresponding to the first word in the first line of the GloVe embedding file.
# split the first record and check for embeddings
emb[0].split()[1:]
Forming a Vocabulary and Creating Embeddings Array
This code creates a vocabulary of words and associated embeddings matrix using GloVe embeddings. The embedding matrix is transformed into a float32 numpy array whose shape denotes the number of words and their respective vector sizes.
vocabulary, embeddings = [], []
for item in emb:
    vocabulary.append(item.split()[0])
    embeddings.append(item.split()[1:])
# Convert embeddings to numpy float array
embeddings = np.array(embeddings, dtype=np.float32)
embeddings.shape
Displaying the First 10 Words in the Vocabulary
This section of code fetches the first ten entries of the vocabulary constructed from the GloVe embeddings and prints them for inspection.
vocabulary[:10]
Updating Vocabulary and Embeddings
This code adds the special tokens <pad> and <unk> to the front of the vocabulary and attaches the corresponding vectors: a vector of all ones for <pad> and the mean of the existing embeddings for <unk>. It modifies the embeddings matrix correspondingly and displays the new vocabulary size and the shape of the embeddings.
vocabulary = ["\", "\"] + vocabulary
embeddings = np.vstack([np.ones(50, dtype=np.float32), np.mean(embeddings, axis=0),
embeddings])
print(len(vocabulary), embeddings.shape)
Saving Embeddings and Vocabulary
The code saves the revised embedding matrix and the vocabulary list to pickle files using the assigned directory paths for potential use in the future.
save_file(embeddings_path, embeddings)
save_file(vocabulary_path, vocabulary)
STEP 3:
Importing and Processing Data
The code loads the dataset from the CSV file, drops the rows that have null values in the text column, and normalizes the target column by merging duplicate labels with the previously defined product map.
#Read the data file
Aionlinecourse_data = pd.read_csv(data_path)
# Drop rows where the text column is empty
Aionlinecourse_data.dropna(subset=[text_col_name], inplace=True)
#Replace duplicate labels
Aionlinecourse_data.replace({label_col: product_map}, inplace=True)
Label Encoding
The provided code implements a LabelEncoder, applies it to the dataset’s target labels, and encodes them into integers. It then returns the numeric representation of the first label.
label_encoder = LabelEncoder()
label_encoder.fit(Aionlinecourse_data[label_col])
labels = label_encoder.transform(Aionlinecourse_data[label_col])
labels[0]
Retrieving Label Classes
This code displays all the unique label classes that the LabelEncoder has learned from the dataset.
label_encoder.classes_
Observing the Target Column
The following code obtains the values in the target column (label_col) from the dataset and presents it, as it appears before the label is converted into an encoded form.
Aionlinecourse_data[label_col]
Saving Labels and Label Encoder
The code saves the numeric labels and the trained LabelEncoder as pickle files for later use in the project.
save_file(labels_path, labels)
save_file(label_encoder_path, label_encoder)
Processing Text Input
This code converts all text in the complaint column to lowercase so that the entries are consistent during processing. A progress bar shows how far the conversion has progressed.
input_text = Aionlinecourse_data[text_col_name]
# Convert text to lower case
input_text = [i.lower() for i in tqdm(input_text)]
Eliminating Special Characters in Text
The script removes punctuation and special symbols from the text, keeping only word characters, digits, apostrophes, and whitespace. This guarantees cleaner input for the subsequent steps.
# Remove punctuations except apostrophe
input_text = [re.sub(r"[^\w\d'\s]+", " ", i) for i in tqdm(input_text)]
Removing Numbers From Text
This code removes any numeric digits from the text so that the analysis concentrates only on the words, which makes the data cleaner.
# Remove digits
input_text = [re.sub("\d+", "", i) for i in tqdm(input_text)]
Eliminating Consecutive 'x' Characters
This code erases all sequences of two or more consecutive 'x' characters (used in the dataset to mask sensitive details) from input_text with a regular expression, further cleaning up the text.
#Remove more than one consecutive instance of 'x'
input_text = [re.sub(r'[x]{2,}', "", i) for i in tqdm(input_text)]
Removing Extra Spaces
This code removes multiple consecutive spaces in the input_text and replaces them with a single space, ensuring cleaner text formatting.
# Replace multiple spaces with single space
input_text = [re.sub(' +', ' ', i) for i in tqdm(input_text)]
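To see the effect of these cleaning steps together, here is a small illustrative check on a made-up complaint snippet (not part of the project code); the exact output depends on the input string.
# Illustrative only: run the same cleaning steps on a single made-up string
sample = "On 03/15 I was charged $40.00 twice!! Account xxxx1234."
sample = sample.lower()
sample = re.sub(r"[^\w\d'\s]+", " ", sample)  # drop punctuation except apostrophes
sample = re.sub(r"\d+", "", sample)           # drop digits
sample = re.sub(r'[x]{2,}', "", sample)       # drop masked 'xx...' sequences
sample = re.sub(' +', ' ', sample)            # collapse repeated spaces
print(sample)  # roughly: "on i was charged twice account "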
Tokenizing Texts to Words
Here, NLTK's word_tokenize is applied to every entry in input_text. Each entry turns into a list of tokens, which are used in the following steps.
# Tokenize the text
tokens = [word_tokenize(t) for t in tqdm(input_text)]
Converting Text into Tokens of Equal Size
This code makes sure each text is represented by exactly 20 tokens. If a text exceeds 20 tokens, only the first 20 are retained; shorter texts are padded at the front with <pad> tokens to fill the gap.
# Take the first 20 tokens in each complaint text
tokens = [i[:20] if len(i) > 19 else ['<pad>'] * (20 - len(i)) + i for i in tqdm(tokens)]
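A quick sanity check (illustrative only) of how this rule behaves on a short and a long token list:
# Illustrative only: the 20-token rule pads short texts and truncates long ones
short = ['card', 'charge', 'dispute']              # 3 tokens -> padded to 20
long_ = ['word'] * 25                              # 25 tokens -> truncated to 20
fixed = [i[:20] if len(i) > 19 else ['<pad>'] * (20 - len(i)) + i for i in [short, long_]]
print(len(fixed[0]), len(fixed[1]))  # 20 20
print(fixed[0][:3])                  # ['<pad>', '<pad>', '<pad>']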
Converting Tokens to Indices
This function assigns a numerical index to each word token based on the vocabulary. Any token that does not exist in the vocabulary is replaced with the index of <unk>, so every word ends up with a numerical representation.
def token_index(tokens, vocabulary, missing='<unk>'):
    """
    :param tokens: List of word tokens
    :param vocabulary: All words in the embeddings
    :param missing: Token for words not present in the vocabulary
    :return: List of integers representing the word tokens
    """
    idx_token = []
    for text in tqdm(tokens):
        idx_text = []
        for token in text:
            if token in vocabulary:
                idx_text.append(vocabulary.index(token))
            else:
                idx_text.append(vocabulary.index(missing))
        idx_token.append(idx_text)
    return idx_token
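Note that vocabulary.index() scans the full vocabulary list for every token, which can be slow on large datasets. A common speed-up, sketched below as an optional alternative (the word2idx and token_index_fast names are ours, not from the project), is to build a word-to-index dictionary once and use constant-time lookups; it produces the same indices as token_index.
# Optional alternative (assumption, not in the original project): dictionary-based lookup
word2idx = {word: idx for idx, word in enumerate(vocabulary)}

def token_index_fast(tokens, word2idx, missing='<unk>'):
    # Same mapping as token_index, but with O(1) lookups instead of list scans
    unk_idx = word2idx[missing]
    return [[word2idx.get(token, unk_idx) for token in text] for text in tqdm(tokens)]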
Mapping Tokens to Indices
This code converts a list of tokens into their corresponding numerical indices using the vocabulary for further steps in machine learning models.
tokens = token_index(tokens, vocabulary)
Checking the Number of Tokenized Texts
This code returns the overall count of tokenized texts which is equivalent to the number of rows in the data.
len(tokens)
Observing the Indices of the Initial Tokenized Text
The code below presents the numeric indices of the first tokenized text and illustrates the relationship of each word with the vocabulary.
tokens[0]
Displaying the First Few Rows of the Dataset
This code shows the first five rows of the dataset, providing a quick overview of the data structure and content.
Aionlinecourse_data.head()
Mapping a Token Index Back to Its Word
The code looks up the word in the vocabulary that corresponds to the first token index of the first tokenized document.
vocabulary[tokens[0][0]]
Saving Tokenized Data
The code saves the tokenized text data as a pickle file for reuse in future processing or model training.
save_file(tokens_path, tokens)
STEP 4:
Creating a Custom Dataset Class
The TextDataset class prepares the text data for use with PyTorch models. It takes the tokenized data, the embeddings, and the labels and makes access to samples by index efficient. The __getitem__ method returns the label and the word embeddings of the tokens for a particular data point.
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, tokens, embeddings, labels):
        """
        :param tokens: List of word tokens
        :param embeddings: Word embeddings (from glove)
        :param labels: List of labels
        """
        self.tokens = tokens
        self.embeddings = embeddings
        self.labels = labels

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        return self.labels[idx], self.embeddings[self.tokens[idx], :]
Creating an RNN Model Class
Here, the RNNNetwork class defines an RNN model for text classification. It has an RNN layer for handling sequential data and a linear layer that maps the final hidden state to the output classes. The forward method accepts input data, runs it through the RNN, and produces class predictions.
class RNNNetwork(torch.nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        """
        :param input_size: Size of embedding
        :param hidden_size: Hidden vector size
        :param num_classes: Number of classes in the dataset
        """
        super(RNNNetwork, self).__init__()
        # RNN Layer
        self.rnn = torch.nn.RNN(input_size=input_size,
                                hidden_size=hidden_size,
                                batch_first=True)
        # Linear Layer
        self.linear = torch.nn.Linear(hidden_size, num_classes)

    def forward(self, input_data):
        _, hidden = self.rnn(input_data)
        output = self.linear(hidden)
        return output
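As a quick shape check (illustrative only; the class count of 10 below is an arbitrary placeholder), passing a dummy batch through the network shows why train() later squeezes the output: the hidden state carries an extra leading dimension.
# Illustrative only: sanity-check the RNN output shape with random data
_check = RNNNetwork(input_size=50, hidden_size=50, num_classes=10)
dummy = torch.randn(4, 20, 50)       # (batch, seq_len, embedding_dim)
print(_check(dummy).shape)           # torch.Size([1, 4, 10]); squeezed to (4, 10) during training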
Create an LSTMNetwork Class
The LSTMNetwork class defines a Long Short-Term Memory (LSTM) network for classifying text. It contains an LSTM layer to process the sequential input and a linear layer that turns the last hidden state into class predictions. The forward method runs the input through the LSTM and uses the final hidden state for classification.
class LSTMNetwork(torch.nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        """
        :param input_size: Size of embedding
        :param hidden_size: Hidden vector size
        :param num_classes: Number of classes in the dataset
        """
        super(LSTMNetwork, self).__init__()
        # LSTM Layer
        self.rnn = torch.nn.LSTM(input_size=input_size,
                                 hidden_size=hidden_size,
                                 batch_first=True)
        # Linear Layer
        self.linear = torch.nn.Linear(hidden_size, num_classes)

    def forward(self, input_data):
        _, (hidden, _) = self.rnn(input_data)
        output = self.linear(hidden[-1])
        return output
The Train Function of Neural Networks
This train function trains and validates the provided PyTorch model for several epochs. It tracks the training and validation losses, performs backpropagation to adjust the model weights, and saves the model whenever the validation loss improves.
def train(train_loader, valid_loader, model, criterion, optimizer, device, num_epochs, model_path):
    best_loss = float("inf")
    for i in range(num_epochs):
        print(f"Epoch {i+1} of {num_epochs}")
        valid_loss, train_loss = [], []
        model.train()
        # Train loop
        for batch_labels, batch_data in tqdm(train_loader):
            # Move data to the specified device
            batch_labels = batch_labels.to(device).long()  # Ensure it's on device and type is Long
            batch_data = batch_data.to(device)
            # Forward pass
            batch_output = model(batch_data)
            batch_output = torch.squeeze(batch_output)
            # Calculate loss
            loss = criterion(batch_output, batch_labels)
            train_loss.append(loss.item())
            optimizer.zero_grad()
            loss.backward()   # Backward pass
            optimizer.step()  # Gradient update step
        # Validation loop
        model.eval()
        with torch.no_grad():  # Disable gradients for validation
            for batch_labels, batch_data in tqdm(valid_loader):
                batch_labels = batch_labels.to(device).long()  # Ensure it's on device and type is Long
                batch_data = batch_data.to(device)
                # Forward pass
                batch_output = model(batch_data)
                batch_output = torch.squeeze(batch_output)
                # Calculate loss
                loss = criterion(batch_output, batch_labels)
                valid_loss.append(loss.item())
        # Calculate average losses
        t_loss = np.mean(train_loss)
        v_loss = np.mean(valid_loss)
        print(f"Train Loss: {t_loss}, Validation Loss: {v_loss}")
        # Save the model if validation loss improves
        if v_loss < best_loss:
            best_loss = v_loss
            torch.save(model.state_dict(), model_path)
            print(f"Best Validation Loss: {best_loss}")
STEP 5:
Model Assessment through Test Function
The test function assesses the performance of a trained model on the test data. It calculates the test loss, predicts labels, and computes the model's accuracy while keeping the model weights fixed.
def test(test_loader, model, criterion, device):
    model.eval()
    test_loss = []
    test_accu = []
    with torch.no_grad():  # Disable gradients for testing
        for batch_labels, batch_data in tqdm(test_loader):
            batch_labels = batch_labels.to(device).long()  # Move to device and type is Long
            batch_data = batch_data.to(device)
            # Forward pass
            batch_output = model(batch_data)
            batch_output = torch.squeeze(batch_output)
            # Calculate loss
            loss = criterion(batch_output, batch_labels)
            test_loss.append(loss.item())
            batch_preds = torch.argmax(batch_output, axis=1)
            batch_labels = batch_labels.cpu()  # Move to CPU for accuracy calculation
            batch_preds = batch_preds.cpu()
            # Compute accuracy
            test_accu.append(accuracy_score(batch_labels.numpy(), batch_preds.numpy()))
    test_loss = np.mean(test_loss)
    test_accu = np.mean(test_accu)
    print(f"Test Loss: {test_loss}, Test Accuracy: {test_accu}")
Loading Preprocessed Data and Model Parameters
This code loads previously saved tokens, labels, embeddings, and the label encoder and also calculates the total number of output classes for the model.
tokens = load_file(tokens_path)
labels = load_file(labels_path)
embeddings = load_file(embeddings_path)
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)
Dividing into Train, Validation, and Test Sets
This code divides the tokenized data and labels into training, validation, and test sets: 20% is reserved for the test set, and 25% of the remaining training data is reserved for validation, giving roughly a 60/20/20 split.
X_train, X_test, y_train, y_test = train_test_split(tokens, labels,
test_size=0.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,
test_size=0.25)
Creating Datasets for Training, Validation, and Testing
In this code, we create TextDataset objects for training, validation, and test datasets, and integrate tokenized inputs, their embeddings, and the relevant labels for each of the datasets.
train_dataset = TextDataset(X_train, embeddings, y_train)
valid_dataset = TextDataset(X_valid, embeddings, y_valid)
test_dataset = TextDataset(X_test, embeddings, y_test)
Constructing Data Loaders for Model Training and Evaluation Tasks
This piece of code creates PyTorch DataLoader objects for training, validation, and testing. Data is processed in batches of 16; the training loader shuffles the data and drops the last incomplete batch, while the validation and test loaders load the data in order.
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16,
shuffle=True, drop_last=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16)
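An optional sanity check (illustrative only) is to pull one batch from the training loader and confirm the shapes the models expect: labels of shape (batch,) and embedded text of shape (batch, seq_len, embedding_dim).
# Illustrative only: inspect one batch from the training loader
batch_labels, batch_data = next(iter(train_loader))
print(batch_labels.shape, batch_data.shape)  # e.g. torch.Size([16]) torch.Size([16, 20, 50])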
Setting up the RNN Model
This code creates an object of the RNNNetwork class using the provided input, hidden layer dimensions, and the count of output classes.
model = RNNNetwork(input_size, hidden_size, num_classes)
Configuring Your Device for Training the Model
This piece of code evaluates if a GPU (cuda) or a CPU will be used for model training and transfers the RNN model to the appropriate device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
Specifying Loss Function, Optimizer, and Device
The following code sets up a cross-entropy loss function for the classification task and an Adam optimizer to update the model parameters, and selects the GPU (cuda:0) when available, falling back to the CPU otherwise.
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Executing the RNN Model Training Process
The following code block trains the RNN model using the function train, making use of the training and validation data loaders, the loss function, the optimizer, and the number of epochs defined. The model with the optimal performance is saved at the location specified.
train(train_loader, valid_loader, model, criterion, optimizer,
device, num_epochs, rnn_model_path)
Assessing the Model with Comprehensive Reports
This section of the code tests the trained model on the test data and produces a classification report, a confusion matrix, and the overall accuracy. It begins by loading the best saved model and then compares its predictions with the true labels to give a performance overview of the model.
def test_with_report(test_loader, model, device):
    """
    Function to test the model and generate a classification report, confusion matrix, and test accuracy.
    """
    model.eval()
    y_true = []
    y_pred = []
    with torch.no_grad():
        for batch_labels, batch_data in tqdm(test_loader):
            batch_labels = batch_labels.to(device).long()
            batch_data = batch_data.to(device)
            batch_output = model(batch_data)
            batch_output = torch.squeeze(batch_output)
            batch_preds = torch.argmax(batch_output, axis=1)
            y_true.extend(batch_labels.cpu().numpy())
            y_pred.extend(batch_preds.cpu().numpy())
    # Generate the classification report and confusion matrix
    report = classification_report(y_true, y_pred, zero_division=0)
    cm = confusion_matrix(y_true, y_pred)
    # Calculate test accuracy
    accuracy = accuracy_score(y_true, y_pred)
    return report, cm, accuracy
# Load the best model
model.load_state_dict(torch.load(rnn_model_path, weights_only=True))
# Run the test and retrieve the classification report, confusion matrix, and test accuracy
report, cm, accuracy = test_with_report(test_loader, model, device)
Generating the Classification Report
The code prints a detailed classification report, which includes precision, recall, F1-score, and support values for all classes.
print("Classification Report:\n", report)
This code prints the RNN model's accuracy.
print("\nModel Accuracy:", accuracy)
Confusion Matrix Visualization
The following code plots the confusion matrix and emphasizes how well the model was able to classify the classes. It adds title and axis labels for better understanding.
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()
Initializing the LSTM Model
An instance of the LSTMNetwork model is created here by specifying the input size, the hidden size, and the number of output classes for classification.
model = LSTMNetwork(input_size, hidden_size, num_classes)
This piece of code evaluates if a GPU (cuda) or a CPU will be used for model training and transfers the model to the appropriate device.
if torch.cuda.is_available():
    model = model.cuda()
Specifying Loss Function, Optimizer, and Device
The following code sets up a cross-entropy loss function for the classification task and an Adam optimizer to update the model parameters, and selects the GPU (cuda:0) when available, falling back to the CPU otherwise.
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Executing the LSTM Model Training Process
The following code block trains the LSTM model using the function train, making use of the training and validation data loaders, the loss function, the optimizer, and the number of epochs defined. The model with the optimal performance is saved at the location specified.
train(train_loader, valid_loader, model, criterion, optimizer,
device, num_epochs, lstm_model_path)
Testing the Best LSTM Model
This code loads the saved best-performing LSTM model and evaluates it on the test data, producing a classification report, a confusion matrix, and the overall accuracy of the model.
# Load the best LSTM model
model.load_state_dict(torch.load(lstm_model_path, weights_only=True))
# Run the test and retrieve the classification report, confusion matrix, and test accuracy
report, cm, accuracy = test_with_report(test_loader, model, device)
The code prints a detailed classification report, which includes precision, recall, F1-score, and support values for all classes.
print("Classification Report:\n", report)
This code prints the LSTM model's accuracy.
print("\nModel Accuracy:", accuracy)
Confusion Matrix Visualization
The following code plots the confusion matrix and emphasizes how well the model was able to classify the classes. It adds title and axis labels for better understanding.
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()
Input Text for Test Purposes
This is a sample complaint about identity theft and trouble reaching Experian, used to evaluate how the models classify and analyze real-world input data.
input_text = '''I am a victim of Identity Theft & currently have an Experian account that
I can view my Experian Credit Report and getting notified when there is activity on
my Experian Credit Report. For the past 3 days I've spent a total of approximately 9
hours on the phone with Experian. Every time I call I get transferred repeatedly and
then my last transfer and automated message states to press 1 and leave a message and
someone would call me. Every time I press 1 I get an automatic message stating than you
before I even leave a message and get disconnected. I call Experian again, explain what
is happening and the process begins again with the same end result. I was trying to have
this issue attended and resolved informally but I give up after 9 hours. There are hard
hit inquiries on my Experian Credit Report that are fraud, I didn't authorize, or recall
and I respectfully request that Experian remove the hard hit inquiries immediately just
like they've done in the past when I was able to speak to a live Experian representative
in the United States. The following are the hard hit inquiries : BK OF XXXX XX/XX/XXXX
XXXX XXXX XXXX XX/XX/XXXX XXXX XXXX XXXX XX/XX/XXXX XXXX XX/XX/XXXX XXXX XXXX
XX/XX/XXXX'''
Preprocessing Text Input for Model Prediction
The following steps clean and tokenize the input text for prediction, convert the tokens into vocabulary indices, and look up their embeddings. The embeddings are then turned into a PyTorch tensor, moved to the GPU if one is available, and reshaped into a batch of one record ready for model inference.
input_text = input_text.lower()
input_text = re.sub(r"[^\w\d'\s]+", " ", input_text)
input_text = re.sub("\d+", "", input_text)
input_text = re.sub(r'[x]{2,}', "", input_text)
input_text = re.sub(' +', ' ', input_text)
tokens = word_tokenize(input_text)
# Add padding if the length of tokens is less than 20
tokens = ['<pad>'] * (20 - len(tokens)) + tokens
# Tokenize the input text
idx_token = []
for token in tokens:
    if token in vocabulary:
        idx_token.append(vocabulary.index(token))
    else:
        idx_token.append(vocabulary.index('<unk>'))
# Get embeddings for tokens
token_emb = embeddings[idx_token,:]
# Convert to torch tensor
inp = torch.from_numpy(token_emb)
# Move the tensor to GPU if available
inp = inp.to(device)
# Create a batch of one record
inp = torch.unsqueeze(inp, 0)
# Load label encoder
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Predicting the Class of Input Text using the RNN Model
The given code creates the trained RNN model object, loads its saved parameters, and moves the model to the GPU if one is available. It runs a forward pass on the preprocessed text and outputs the predicted class by selecting the label with the highest score. The predicted class is then displayed.
# Create model object
model = RNNNetwork(input_size, hidden_size, num_classes)
# Load trained weights
model.load_state_dict(torch.load(rnn_model_path))
# Move the model to GPU if available
if torch.cuda.is_available():
    model = model.cuda()
# Forward pass
out = torch.squeeze(model(inp))
# Find predicted class
prediction = label_encoder.classes_[torch.argmax(out)]
print(f"Predicted Class: {prediction}")
Predicting the Class of Input Text using the LSTM Model
The given code creates the trained LSTM model object, loads its saved parameters, and moves the model to the GPU if one is available. It runs a forward pass on the preprocessed text and outputs the predicted class by selecting the label with the highest score. The predicted class is then displayed.
# Create model object
model = LSTMNetwork(input_size, hidden_size, num_classes)
# Load trained weights
model.load_state_dict(torch.load(lstm_model_path))
# Move the model to GPU if available
if torch.cuda.is_available():
    model = model.cuda()
# Forward pass
out = torch.squeeze(model(inp))
# Find predicted class
prediction = label_encoder.classes_[torch.argmax(out)]
print(f"Predicted Class: {prediction}")
Conclusion
This project demonstrates how Recurrent Neural Networks and Long Short-Term Memory networks can turn raw text data into useful information. With careful data preprocessing, GloVe embeddings, and well-trained networks, text classification becomes efficient. Beyond easing the task of managing customer complaints, the methodology shows the relevance of NLP to practical problems. Whether you are growing your expertise in deep learning or looking for ways to automate customer service, this project is a great starting point for AI applications that involve text.
Challenges New Coders Might Face
Challenge: Handling noisy or unstructured text data.
Solution: Utilize text cleaning methods such as removing special symbols, digits, and extra spaces.
Challenge: Preprocessing large text data.
Solution: Streamline text cleaning by employing mature libraries such as NLTK and adopting batch processing for the data.
Challenge: Curse of dimensionality in high-dimensional text datasets affecting clustering and classification results.
Solution: Use TF-IDF vectorization and dimensionality reduction techniques such as PCA.
Challenge: Inaccessibility of a GPU.
Solution: For debugging or initial testing, use smaller datasets or GPU-based cloud platforms for quick turnaround times.
Challenge: Model overfitting on training data.
Solution: Use validation data during training, apply dropout layers, and tune hyperparameters for better generalization.
FAQs
Question 1: Why are RNN and LSTM used for text classification?
Answer: RNN and LSTM models are designed for sequential data, so they can capture context and detect patterns in text, which leads to better classification.
Question 2: Why is GloVe embedding used for this project?
Answer: GloVe embedding allows for deriving semantic relatedness between words, which makes it easier for the model to understand the data.
Question 3: What is the process of pre-processing text data for an NLP project?
Answer: This is where preprocessing comes in, as it’s about cleaning text, removing noise, and tokenizing them so that they can be standardized for use in machine learning models.
Question 4: What is natural language processing LSTM?
Answer: Long Short-Term Memory (LSTM) is a powerful natural language processing (NLP) technique. It can learn and understand sequential data, making it ideal for analyzing text and speech.
Question 5: Is LSTM good for text classification?
Answer: Yes, LSTM (Long Short-Term Memory) networks are commonly used for text classification tasks due to their ability to capture long-range dependencies in sequential data like text.