Build Multi-Class Text Classification Models with RNN and LSTM
How often do you come across large volumes of text data and pause to think of an easier way to decipher it? This is where this project comes in handy: it employs RNN and LSTM techniques to intelligently classify real-life customer complaints. Be it identity theft or credit card issues, the project takes raw complaint text and maps it to the right category.
Project Overview
This project investigates text classification end to end using Python, PyTorch, and Natural Language Processing. It begins with the dataset: cleaning and formatting the raw complaint text and converting it into GloVe embeddings. RNN and LSTM neural networks are then built on the tokenized data and trained to predict the category of each complaint. The resulting models are checked for performance using accuracy and confusion matrices, among other metrics.
Step by step, with working code, the project demonstrates how to normalize text, create embeddings, and build classification models. It is a practical way to understand natural language processing, deep learning, and the use of AI for problem-solving. Ideal for programmers, data analysts, and machine learning enthusiasts!
Prerequisites
Before commencing this project, ensure that you have the following skills and tools:
- Python: You need to know how to program in Python and use libraries such as NumPy and Pandas.
- Natural Language Processing Basics: You should be familiar with how tokenization works, what embeddings are, and basic text preprocessing techniques.
- Knowledge of PyTorch: Knowing how to use PyTorch in the creation and training of neural networks is a must.
- Machine Learning: Knowledge of classification tasks, loss functions, and accuracy metrics.
- GloVe Embeddings: Be able to explain how and why word embeddings are useful in representing and manipulating text data.
- Tools Installed: Confirm that NLTK, Scikit-learn, Matplotlib, and Seaborn libraries are present in your Python environment.
Provided all these prerequisites are in place, you are good to go! Let’s begin with the art of text classification!
Approach
This project carries out text classification with deep learning models, namely RNN and LSTM, in an organized manner. It commences with data preprocessing: the raw text is cleaned of unwanted characters, such as digits and other special characters, as well as excess whitespace. Each complaint is then tokenized into a fixed-length sequence, and the tokens are converted into GloVe embedding vectors, which preserve the semantic relationships among words.
The prepared data is split into three sets for training, validation, and model testing. To feed the data to the models efficiently, custom datasets and data loaders based on the PyTorch framework are created. Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models are then established and trained with cross-entropy loss and the Adam optimizer, while the validation set is monitored to guard against overfitting.
Once training is complete, the models are evaluated against test data that was never presented to them. The performance evaluation uses metrics such as accuracy, confusion matrices, and classification reports. The methodology is straightforward and clear: simple but powerful text processing techniques combined with deep learning approaches produce an effective text classification strategy.
Workflow and Methodology
Workflow
- First load and collect the dataset.
- Preprocess it by removing noise, handling empty or missing values, and consolidating duplicate labels.
- Perform tokenization of the textual information and encode it into a vectorized form by employing GloVe.
- Organize the data into training, validation, and test sets to facilitate the evaluation of the model.
- Train RNN and LSTM networks with the PyTorch framework, using cross-entropy loss and the Adam optimizer.
- Measure how well the trained models classify the data using accuracy, the confusion matrix, and the classification report.
Methodology
- Make use of the GloVe embeddings to represent text data in an appropriate form.
- Utilize RNN and LSTM networks to model the sequential patterns of complaint narratives.
- Improve network performance by minimizing the loss function while monitoring the validation loss.
- Keep the network that has the best performance on the validation data set so that the performance on the test set is as good as possible.
- Evaluate the test predictions with the help of confusion matrices and extensive classification reports.
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.
Data Preparation Workflow:
- The dataset should be imported and preprocessed including noise removal and treatment of empty or missing data.
- Perform text tokenization, padding or truncating, and mapping of tokens into GloVe embedding vectors.
- For model assessment, the data should be split into training, validation, and test sets.
Code explanation
Here’s what is happening under the hood. Let’s go through it step by step:
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the dataset that is stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Installing Required Libraries
This code installs a few important libraries: NLTK for natural language processing, NumPy for working with arrays, Pandas for handling tabular data, PyTorch for deep learning tasks, tqdm for easy-to-use progress bars, and scikit-learn for machine learning utilities.
# import the required packages
!pip install nltk
!pip install numpy
!pip install pandas
!pip install torch
!pip install tqdm
!pip install scikit-learn
Importing Libraries and Handling Warnings
This piece of code imports fundamental libraries required for natural language processing, building and implementing machine learning models, and visualizing the information generated with the help of NLTK, PyTorch, and Scikit-learn. Moreover, certain warnings are turned off to keep the workspace clean and organized.
import nltk
nltk.download('punkt')

import re
import torch
import pickle
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.exceptions import UndefinedMetricWarning
# Suppress specific warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
Defining Configuration and File Paths
This code snippet initializes important configurations such as the learning rate, input dimensions, and number of epochs used for training and testing. It also defines file paths for the data, models, tokens, embeddings, and a product map for uniform label encoding.
# define configuration file paths
lr = 0.0001
input_size = 50
num_epochs = 50
hidden_size = 50
label_col = "Product"
tokens_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/tokens.pkl"
labels_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/labels.pkl"
data_path = "/content/drive/MyDrive/New 90 Projects/Project_11/Data/complaints.csv"
rnn_model_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/model_rnn.pth"
lstm_model_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/model_lstm.pth"
vocabulary_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/vocabulary.pkl"
embeddings_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/embeddings.pkl"
glove_vector_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/glove.6B.50d.txt"
text_col_name = "Consumer complaint narrative"
label_encoder_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/label_encoder.pkl"
product_map = {'Vehicle loan or lease': 'vehicle_loan',
'Credit reporting, credit repair services, or other personal consumer reports': 'credit_report',
'Credit card or prepaid card': 'card',
'Money transfer, virtual currency, or money service': 'money_transfer',
'virtual currency': 'money_transfer',
'Mortgage': 'mortgage',
'Payday loan, title loan, or personal loan': 'loan',
'Debt collection': 'debt_collection',
'Checking or savings account': 'savings_account',
'Credit card': 'card',
'Bank account or service': 'savings_account',
'Credit reporting': 'credit_report',
'Prepaid card': 'card',
'Payday loan': 'loan',
'Other financial service': 'others',
'Virtual currency': 'money_transfer',
'Student loan': 'loan',
'Consumer Loan': 'loan',
'Money transfers': 'money_transfer'}
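As a quick illustration (not part of the project pipeline), the snippet below shows how the product map collapses several raw label variants into a single class; the toy DataFrame is made up purely for demonstration.
# Illustrative only: a made-up DataFrame to show how product_map merges label variants
import pandas as pd
demo = pd.DataFrame({"Product": ["Credit card", "Prepaid card", "Student loan", "Money transfers"]})
demo.replace({"Product": product_map}, inplace=True)
print(demo["Product"].tolist())  # expected: ['card', 'card', 'loan', 'money_transfer']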
Functions to Save and Load a File
This code introduces two helper functions: save_file stores an object as a pickle file, and load_file reads a pickled object back into memory.
# define function for saving a file
def save_file(name, obj):
    """
    Function to save an object as pickle file
    """
    with open(name, 'wb') as f:
        pickle.dump(obj, f)

# define function for loading a file
def load_file(name):
    """
    Function to load a pickle object
    """
    return pickle.load(open(name, "rb"))
STEP 2:
Reading GloVe Embeddings
This code reads the GloVe embeddings file and calculates the total number of unique words.
# open the glove embeddings file and read
with open(glove_vector_path, "rt") as f:
    emb = f.readlines()
# 400000 unique words are there in the embeddings (length of embeddings)
len(emb)
Retrieval of the First Entry in the GloVe Embedding Dataset
This piece of code fetches the first entry from the GloVe embeddings file and prints out the first word along with a vector assigned to it.
# check first record
emb[0]
Extracting the Initial Word from GloVe Embeddings
The following code separates the initial word present in the first entry of the GloVe embeddings from its respective vector values.
# split the first record and check for vocabulary
emb[0].split()[0]
Extraction of Embedding Values of the First Word
The given code snippet extracts the details or vector values (embeddings) corresponding to the first word in the first line of the GloVe embedding file.
# split the first record and check for embeddings
emb[0].split()[1:]
Forming a Vocabulary and Creating Embeddings Array
This code creates a vocabulary of words and associated embeddings matrix using GloVe embeddings. The embedding matrix is transformed into a float32 numpy array whose shape denotes the number of words and their respective vector sizes.
vocabulary, embeddings = [], []
for item in emb:
    vocabulary.append(item.split()[0])
    embeddings.append(item.split()[1:])
# Convert embeddings to numpy float array
embeddings = np.array(embeddings, dtype=np.float32)
embeddings.shape
Displaying the First 10 Words in the Vocabulary
This section of code fetches the first ten entries of the vocabulary constructed from the GloVe embeddings and prints them for inspection.
vocabulary[:10]
Updating Vocabulary and Embeddings
This code adds the special tokens <pad> and <unk> to the front of the vocabulary and attaches the corresponding vectors: a vector of all ones for <pad> and the mean of the existing embeddings for <unk>. It modifies the embeddings matrix correspondingly and displays the new vocabulary size and the shape of the embeddings.
vocabulary = ["\", "\"] + vocabulary
embeddings = np.vstack([np.ones(50, dtype=np.float32), np.mean(embeddings, axis=0),
embeddings])
print(len(vocabulary), embeddings.shape)
Saving Embeddings and Vocabulary
The code saves the revised embedding matrix and the vocabulary list to pickle files using the assigned directory paths for potential use in the future.
save_file(embeddings_path, embeddings)
save_file(vocabulary_path, vocabulary)
STEP 3:
Importing and Processing Data
The code loads the dataset from the CSV file, drops the rows that have null values in the text column, and normalizes the target column by merging duplicate labels with the previously defined product map.
#Read the data file
Aionlinecourse_data = pd.read_csv(data_path)
# Drop rows where the text column is empty
Aionlinecourse_data.dropna(subset=[text_col_name], inplace=True)
#Replace duplicate labels
Aionlinecourse_data.replace({label_col: product_map}, inplace=True)
Label Encoding
The provided code implements a LabelEncoder, applies it to the dataset’s target labels, and encodes them into integers. It then returns the numeric representation of the first label.
label_encoder = LabelEncoder()
label_encoder.fit(Aionlinecourse_data[label_col])
labels = label_encoder.transform(Aionlinecourse_data[label_col])
labels[0]
Retrieving Label Classes
This code displays all the unique label classes that the LabelEncoder has learned from the dataset.
label_encoder.classes_
Observing the Target Column
The following code obtains the values in the target column (label_col) from the dataset and presents it, as it appears before the label is converted into an encoded form.
Aionlinecourse_data[label_col]
Saving Labels and Label Encoder
The code saves the numeric labels and the trained LabelEncoder as pickle files for later use in the project.
save_file(labels_path, labels)
save_file(label_encoder_path, label_encoder)
Processing Text Input
This code converts all text in the complaint column to lowercase so that the entries are consistent during processing. A progress bar shows how far the conversion has progressed.
input_text = Aionlinecourse_data[text_col_name]
# Convert text to lower case
input_text = [i.lower() for i in tqdm(input_text)]
Eliminating Special Characters in Text
The script removes punctuation and special symbols from the text, keeping only word characters, digits, apostrophes, and whitespace. This guarantees cleaner input for the subsequent steps.
# Remove punctuations except apostrophe
input_text = [re.sub(r"[^\w\d'\s]+", " ", i) for i in tqdm(input_text)]
Removing Numbers From Text
This code removes any numeric digits from the text so that the analysis concentrates only on the words, which makes the data cleaner.
# Remove digits
input_text = [re.sub("\d+", "", i) for i in tqdm(input_text)]
Eliminating Consecutive 'x' Characters
This code erases all sequences of two or more consecutive 'x' characters (used in the dataset to mask sensitive details) from input_text with a regular expression, further cleaning up the text.
#Remove more than one consecutive instance of 'x'
input_text = [re.sub(r'[x]{2,}', "", i) for i in tqdm(input_text)]
Removing Extra Spaces
This code removes multiple consecutive spaces in the input_text and replaces them with a single space, ensuring cleaner text formatting.
# Replace multiple spaces with single space
input_text = [re.sub(' +', ' ', i) for i in tqdm(input_text)]
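To see the effect of these cleaning steps together, here is a small illustrative check on a made-up complaint snippet (not part of the project code); the exact output depends on the input string.
# Illustrative only: run the same cleaning steps on a single made-up string
sample = "On 03/15 I was charged $40.00 twice!! Account xxxx1234."
sample = sample.lower()
sample = re.sub(r"[^\w\d'\s]+", " ", sample)  # drop punctuation except apostrophes
sample = re.sub(r"\d+", "", sample)           # drop digits
sample = re.sub(r'[x]{2,}', "", sample)       # drop masked 'xx...' sequences
sample = re.sub(' +', ' ', sample)            # collapse repeated spaces
print(sample)  # roughly: "on i was charged twice account "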
Tokenizing Texts to Words
Here, NLTK's word_tokenize is applied to every entry in input_text. Each entry turns into a list of tokens, which are used in the following steps.
# Tokenize the text
tokens = [word_tokenize(t) for t in tqdm(input_text)]
Converting Text into Tokens of Equal Size
This code makes sure each text is represented by exactly 20 tokens. If a text exceeds 20 tokens, only the first 20 are retained; shorter texts are padded at the front with <pad> tokens to fill the gap.
# Take the first 20 tokens in each complaint text
tokens = [i[:20] if len(i) > 19 else ['<pad>'] * (20 - len(i)) + i for i in tqdm(tokens)]
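A quick sanity check (illustrative only) of how this rule behaves on a short and a long token list:
# Illustrative only: the 20-token rule pads short texts and truncates long ones
short = ['card', 'charge', 'dispute']              # 3 tokens -> padded to 20
long_ = ['word'] * 25                              # 25 tokens -> truncated to 20
fixed = [i[:20] if len(i) > 19 else ['<pad>'] * (20 - len(i)) + i for i in [short, long_]]
print(len(fixed[0]), len(fixed[1]))  # 20 20
print(fixed[0][:3])                  # ['<pad>', '<pad>', '<pad>']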
Converting Tokens to Indices
This function assigns a numerical index to each word token based on the vocabulary. Any token that does not exist in the vocabulary is replaced with the index of <unk>, so every word ends up with a numerical representation.
def token_index(tokens, vocabulary, missing='<unk>'):
    """
    :param tokens: List of word tokens
    :param vocabulary: All words in the embeddings
    :param missing: Token for words not present in the vocabulary
    :return: List of integers representing the word tokens
    """
    idx_token = []
    for text in tqdm(tokens):
        idx_text = []
        for token in text:
            if token in vocabulary:
                idx_text.append(vocabulary.index(token))
            else:
                idx_text.append(vocabulary.index(missing))
        idx_token.append(idx_text)
    return idx_token
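Note that vocabulary.index() scans the full vocabulary list for every token, which can be slow on large datasets. A common speed-up, sketched below as an optional alternative (the word2idx and token_index_fast names are ours, not from the project), is to build a word-to-index dictionary once and use constant-time lookups; it produces the same indices as token_index.
# Optional alternative (assumption, not in the original project): dictionary-based lookup
word2idx = {word: idx for idx, word in enumerate(vocabulary)}

def token_index_fast(tokens, word2idx, missing='<unk>'):
    # Same mapping as token_index, but with O(1) lookups instead of list scans
    unk_idx = word2idx[missing]
    return [[word2idx.get(token, unk_idx) for token in text] for text in tqdm(tokens)]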
Mapping Tokens to Indices
This code converts a list of tokens into their corresponding numerical indices using the vocabulary for further steps in machine learning models.
tokens = token_index(tokens, vocabulary)
Checking the Number of Tokenized Texts
This code returns the overall count of tokenized texts which is equivalent to the number of rows in the data.
len(tokens)
Observing the Indices of the Initial Tokenized Text
The code below presents the numeric indices of the first tokenized text and illustrates the relationship of each word with the vocabulary.
tokens[0]
Displaying the First Few Rows of the Dataset
This code shows the first five rows of the dataset, providing a quick overview of the data structure and content.
Aionlinecourse_data.head()
Mapping a Token Index Back to Its Word
The code looks up the word in the vocabulary that corresponds to the first token index of the first tokenized document.
vocabulary[tokens[0][0]]
Saving Tokenized Data
The code saves the tokenized text data as a pickle file for reuse in future processing or model training.
save_file(tokens_path, tokens)
STEP 4:
Creating a Custom Dataset Class
The TextDataset class prepares the text data for use with PyTorch models. It takes the tokenized data, the embeddings, and the labels and makes access to samples by index efficient. The __getitem__ method returns the label and the word embeddings of the tokens for a particular data point.
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, tokens, embeddings, labels):
        """
        :param tokens: List of word tokens
        :param embeddings: Word embeddings (from glove)
        :param labels: List of labels
        """
        self.tokens = tokens
        self.embeddings = embeddings
        self.labels = labels

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        return self.labels[idx], self.embeddings[self.tokens[idx], :]
Creating an RNN Model Class
Here, the RNNNetwork class defines an RNN model for text classification. It has an RNN layer for handling sequential data and a linear layer that maps the final hidden state to the output classes. The forward method accepts input data, runs it through the RNN, and produces class predictions.
class RNNNetwork(torch.nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        """
        :param input_size: Size of embedding
        :param hidden_size: Hidden vector size
        :param num_classes: Number of classes in the dataset
        """
        super(RNNNetwork, self).__init__()
        # RNN Layer
        self.rnn = torch.nn.RNN(input_size=input_size,
                                hidden_size=hidden_size,
                                batch_first=True)
        # Linear Layer
        self.linear = torch.nn.Linear(hidden_size, num_classes)

    def forward(self, input_data):
        _, hidden = self.rnn(input_data)
        output = self.linear(hidden)
        return output
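As a quick shape check (illustrative only; the class count of 10 below is an arbitrary placeholder), passing a dummy batch through the network shows why train() later squeezes the output: the hidden state carries an extra leading dimension.
# Illustrative only: sanity-check the RNN output shape with random data
_check = RNNNetwork(input_size=50, hidden_size=50, num_classes=10)
dummy = torch.randn(4, 20, 50)       # (batch, seq_len, embedding_dim)
print(_check(dummy).shape)           # torch.Size([1, 4, 10]); squeezed to (4, 10) during training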
Create an LSTMNetwork Class
The LSTMNetwork class defines a Long Short-Term Memory (LSTM) network for classifying text. It contains an LSTM layer to process the sequential input and a linear layer that turns the last hidden state into class predictions. The forward method runs the input through the LSTM and uses the final hidden state for classification.
class LSTMNetwork(torch.nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        """
        :param input_size: Size of embedding
        :param hidden_size: Hidden vector size
        :param num_classes: Number of classes in the dataset
        """
        super(LSTMNetwork, self).__init__()
        # LSTM Layer
        self.rnn = torch.nn.LSTM(input_size=input_size,
                                 hidden_size=hidden_size,
                                 batch_first=True)
        # Linear Layer
        self.linear = torch.nn.Linear(hidden_size, num_classes)

    def forward(self, input_data):
        _, (hidden, _) = self.rnn(input_data)
        output = self.linear(hidden[-1])
        return output
The Train Function of Neural Networks
This train function trains and validates the provided PyTorch model for several epochs. It tracks the training and validation losses, performs backpropagation to adjust the model weights, and saves the model whenever the validation loss improves.
def train(train_loader, valid_loader, model, criterion, optimizer, device, num_epochs, model_path):
    best_loss = float("inf")
    for i in range(num_epochs):
        print(f"Epoch {i+1} of {num_epochs}")
        valid_loss, train_loss = [], []
        model.train()
        # Train loop
        for batch_labels, batch_data in tqdm(train_loader):
            # Move data to the specified device
            batch_labels = batch_labels.to(device).long()  # Ensure it's on device and type is Long
            batch_data = batch_data.to(device)
            # Forward pass
            batch_output = model(batch_data)
            batch_output = torch.squeeze(batch_output)
            # Calculate loss
            loss = criterion(batch_output, batch_labels)
            train_loss.append(loss.item())
            optimizer.zero_grad()
            loss.backward()   # Backward pass
            optimizer.step()  # Gradient update step
        # Validation loop
        model.eval()
        with torch.no_grad():  # Disable gradients for validation
            for batch_labels, batch_data in tqdm(valid_loader):
                batch_labels = batch_labels.to(device).long()  # Ensure it's on device and type is Long
                batch_data = batch_data.to(device)
                # Forward pass
                batch_output = model(batch_data)
                batch_output = torch.squeeze(batch_output)
                # Calculate loss
                loss = criterion(batch_output, batch_labels)
                valid_loss.append(loss.item())
        # Calculate average losses
        t_loss = np.mean(train_loss)
        v_loss = np.mean(valid_loss)
        print(f"Train Loss: {t_loss}, Validation Loss: {v_loss}")
        # Save the model if validation loss improves
        if v_loss < best_loss:
            best_loss = v_loss
            torch.save(model.state_dict(), model_path)
            print(f"Best Validation Loss: {best_loss}")
STEP 5:
Model Assessment through Test Function
The test function assesses the performance of a trained model on the test data. It calculates the test loss, predicts labels, and computes the model's accuracy while keeping the model weights fixed.
def test(test_loader, model, criterion, device):
    model.eval()
    test_loss = []
    test_accu = []
    with torch.no_grad():  # Disable gradients for testing
        for batch_labels, batch_data in tqdm(test_loader):
            batch_labels = batch_labels.to(device).long()  # Move to device and type is Long
            batch_data = batch_data.to(device)
            # Forward pass
            batch_output = model(batch_data)
            batch_output = torch.squeeze(batch_output)
            # Calculate loss
            loss = criterion(batch_output, batch_labels)
            test_loss.append(loss.item())
            batch_preds = torch.argmax(batch_output, axis=1)
            batch_labels = batch_labels.cpu()  # Move to CPU for accuracy calculation
            batch_preds = batch_preds.cpu()
            # Compute accuracy
            test_accu.append(accuracy_score(batch_labels.numpy(), batch_preds.numpy()))
    test_loss = np.mean(test_loss)
    test_accu = np.mean(test_accu)
    print(f"Test Loss: {test_loss}, Test Accuracy: {test_accu}")
Loading Preprocessed Data and Model Parameters
This code loads previously saved tokens, labels, embeddings, and the label encoder and also calculates the total number of output classes for the model.
tokens = load_file(tokens_path)
labels = load_file(labels_path)
embeddings = load_file(embeddings_path)
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)
Dividing into Train, Validation, and Test Sets
This code divides the tokenized data and labels into training, validation, and test sets: 20% is reserved for the test set, and 25% of the remaining training data is reserved for validation, giving roughly a 60/20/20 split.
X_train, X_test, y_train, y_test = train_test_split(tokens, labels,
test_size=0.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,
test_size=0.25)
Creating Datasets for Training, Validation, and Testing
In this code, we create TextDataset objects for training, validation, and test datasets, and integrate tokenized inputs, their embeddings, and the relevant labels for each of the datasets.
train_dataset = TextDataset(X_train, embeddings, y_train)
valid_dataset = TextDataset(X_valid, embeddings, y_valid)
test_dataset = TextDataset(X_test, embeddings, y_test)
Constructing Data Loaders for Model Training and Evaluation Tasks
This piece of code creates PyTorch DataLoader objects for training, validation, and testing. Data is processed in batches of 16; the training loader shuffles the data and drops the last incomplete batch, while the validation and test loaders load the data in order.
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16,
shuffle=True, drop_last=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16)
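An optional sanity check (illustrative only) is to pull one batch from the training loader and confirm the shapes the models expect: labels of shape (batch,) and embedded text of shape (batch, seq_len, embedding_dim).
# Illustrative only: inspect one batch from the training loader
batch_labels, batch_data = next(iter(train_loader))
print(batch_labels.shape, batch_data.shape)  # e.g. torch.Size([16]) torch.Size([16, 20, 50])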
Setting up the RNN Model
This code creates an object of the RNNNetwork class using the provided input, hidden layer dimensions, and the count of output classes.
model = RNNNetwork(input_size, hidden_size, num_classes)
Configuring Your Device for Training the Model
This piece of code evaluates if a GPU (cuda) or a CPU will be used for model training and transfers the RNN model to the appropriate device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
Specifying Loss Function, Optimizer, and Device
The following code sets up a cross-entropy loss function for the classification task and an Adam optimizer to update the model parameters, and selects the GPU (cuda:0) when available, falling back to the CPU otherwise.
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Executing the RNN Model Training Process
The following code block trains the RNN model using the function train, making use of the training and validation data loaders, the loss function, the optimizer, and the number of epochs defined. The model with the optimal performance is saved at the location specified.
train(train_loader, valid_loader, model, criterion, optimizer,
device, num_epochs, rnn_model_path)
Assessing the Model with Comprehensive Reports
This section of the code tests the trained model on the test data and produces a classification report, a confusion matrix, and the overall accuracy. It begins by loading the best saved model and then compares its predictions with the true labels to give a performance overview of the model.
def test_with_report(test_loader, model, device):
    """
    Function to test the model and generate a classification report, confusion matrix, and test accuracy.
    """
    model.eval()
    y_true = []
    y_pred = []
    with torch.no_grad():
        for batch_labels, batch_data in tqdm(test_loader):
            batch_labels = batch_labels.to(device).long()
            batch_data = batch_data.to(device)
            batch_output = model(batch_data)
            batch_output = torch.squeeze(batch_output)
            batch_preds = torch.argmax(batch_output, axis=1)
            y_true.extend(batch_labels.cpu().numpy())
            y_pred.extend(batch_preds.cpu().numpy())
    # Generate the classification report and confusion matrix
    report = classification_report(y_true, y_pred, zero_division=0)
    cm = confusion_matrix(y_true, y_pred)
    # Calculate test accuracy
    accuracy = accuracy_score(y_true, y_pred)
    return report, cm, accuracy
# Load the best model
model.load_state_dict(torch.load(rnn_model_path, weights_only=True))
# Run the test and retrieve the classification report, confusion matrix, and test accuracy
report, cm, accuracy = test_with_report(test_loader, model, device)
Generating the Classification Report
The code prints a detailed classification report, which includes precision, recall, F1-score, and support values for all classes.
print("Classification Report:\n", report)
This code prints the RNN model's accuracy.
print("\nModel Accuracy:", accuracy)
Confusion Matrix Visualization
The following code plots the confusion matrix and emphasizes how well the model was able to classify the classes. It adds title and axis labels for better understanding.
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()
Initializing the LSTM Model
An instance of the LSTMNetwork model is created here by specifying the input size, the hidden size, and the number of output classes for classification.
model = LSTMNetwork(input_size, hidden_size, num_classes)
This piece of code evaluates if a GPU (cuda) or a CPU will be used for model training and transfers the model to the appropriate device.
if torch.cuda.is_available():
    model = model.cuda()
Specifying Loss Function, Optimizer, and Device
The following code sets up a cross-entropy loss function for the classification task and an Adam optimizer to update the model parameters, and selects the GPU (cuda:0) when available, falling back to the CPU otherwise.
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Executing the LSTM Model Training Process
The following code block trains the LSTM model using the function train, making use of the training and validation data loaders, the loss function, the optimizer, and the number of epochs defined. The model with the optimal performance is saved at the location specified.
train(train_loader, valid_loader, model, criterion, optimizer,
device, num_epochs, lstm_model_path)
Testing the Best LSTM Model
This code loads the saved best-performing LSTM model and evaluates it on the test data, producing a classification report, a confusion matrix, and the overall accuracy of the model.
# Load the best LSTM model
model.load_state_dict(torch.load(lstm_model_path, weights_only=True))
# Run the test and retrieve the classification report, confusion matrix, and test accuracy
report, cm, accuracy = test_with_report(test_loader, model, device)
The code prints a detailed classification report, which includes precision, recall, F1-score, and support values for all classes.
print("Classification Report:\n", report)
This code prints the LSTM model's accuracy.
print("\nModel Accuracy:", accuracy)
Confusion Matrix Visualization
The following code plots the confusion matrix and emphasizes how well the model was able to classify the classes. It adds title and axis labels for better understanding.
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()
Input Text for Test Purposes
This is a sample complaint about identity theft and trouble reaching Experian, used to evaluate how the models classify and analyze real-world input data.
input_text = '''I am a victim of Identity Theft & currently have an Experian account that
I can view my Experian Credit Report and getting notified when there is activity on
my Experian Credit Report. For the past 3 days I've spent a total of approximately 9
hours on the phone with Experian. Every time I call I get transferred repeatedly and
then my last transfer and automated message states to press 1 and leave a message and
someone would call me. Every time I press 1 I get an automatic message stating than you
before I even leave a message and get disconnected. I call Experian again, explain what
is happening and the process begins again with the same end result. I was trying to have
this issue attended and resolved informally but I give up after 9 hours. There are hard
hit inquiries on my Experian Credit Report that are fraud, I didn't authorize, or recall
and I respectfully request that Experian remove the hard hit inquiries immediately just
like they've done in the past when I was able to speak to a live Experian representative
in the United States. The following are the hard hit inquiries : BK OF XXXX XX/XX/XXXX
XXXX XXXX XXXX XX/XX/XXXX XXXX XXXX XXXX XX/XX/XXXX XXXX XX/XX/XXXX XXXX XXXX
XX/XX/XXXX'''
Preprocessing Text Input for Model Prediction
The following steps clean and tokenize the input text for prediction, convert the tokens into vocabulary indices, and look up their embeddings. The embeddings are then turned into a PyTorch tensor, moved to the GPU if one is available, and reshaped into a batch of one record ready for model inference.
input_text = input_text.lower()
input_text = re.sub(r"[^\w\d'\s]+", " ", input_text)
input_text = re.sub("\d+", "", input_text)
input_text = re.sub(r'[x]{2,}', "", input_text)
input_text = re.sub(' +', ' ', input_text)
tokens = word_tokenize(input_text)
# Add padding if the length of tokens is less than 20
tokens = ['<pad>'] * (20 - len(tokens)) + tokens
# Tokenize the input text
idx_token = []
for token in tokens:
    if token in vocabulary:
        idx_token.append(vocabulary.index(token))
    else:
        idx_token.append(vocabulary.index('<unk>'))
# Get embeddings for tokens
token_emb = embeddings[idx_token,:]
# Convert to torch tensor
inp = torch.from_numpy(token_emb)
# Move the tensor to GPU if available
inp = inp.to(device)
# Create a batch of one record
inp = torch.unsqueeze(inp, 0)
# Load label encoder
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Predicting the Class of Input Text using the RNN Model
The given code creates the trained RNN model object, loads its saved parameters, and moves the model to the GPU if one is available. It runs a forward pass on the preprocessed text and outputs the predicted class by selecting the label with the highest score. The predicted class is then displayed.
# Create model object
model = RNNNetwork(input_size, hidden_size, num_classes)
# Load trained weights
model.load_state_dict(torch.load(rnn_model_path))
# Move the model to GPU if available
if torch.cuda.is_available():
    model = model.cuda()
# Forward pass
out = torch.squeeze(model(inp))
# Find predicted class
prediction = label_encoder.classes_[torch.argmax(out)]
print(f"Predicted Class: {prediction}")
Predicting the Class of Input Text using the LSTM Model
The given code creates the trained LSTM model object, loads its saved parameters, and moves the model to the GPU if one is available. It runs a forward pass on the preprocessed text and outputs the predicted class by selecting the label with the highest score. The predicted class is then displayed.
# Create model object
model = LSTMNetwork(input_size, hidden_size, num_classes)
# Load trained weights
model.load_state_dict(torch.load(lstm_model_path))
# Move the model to GPU if available
if torch.cuda.is_available():
    model = model.cuda()
# Forward pass
out = torch.squeeze(model(inp))
# Find predicted class
prediction = label_encoder.classes_[torch.argmax(out)]
print(f"Predicted Class: {prediction}")
Conclusion
This project demonstrates how Recurrent Neural Networks and Long Short-Term Memory networks can turn raw text data into useful information. With careful data preprocessing, GloVe embeddings, and well-trained networks, text classification becomes efficient. Beyond easing the task of managing customer complaints, the methodology shows the relevance of NLP to practical problems. Whether you are growing your expertise in deep learning or looking for ways to automate customer service, this project is a great starting point for AI applications that involve text.
Challenges New Coders Might Face
Challenge: Handling noisy or unstructured text data.
Solution: Utilize text cleaning methods such as removing special symbols, digits, and extra spaces.
Challenge: Preprocessing large text data.
Solution: Streamline text cleaning by employing mature libraries such as NLTK and adopting batch processing for the data.
Challenge: Curse of dimensionality in high-dimensional text datasets affecting clustering and classification results.
Solution: Use TF-IDF vectorization and dimensionality reduction techniques such as PCA.
Challenge: Inaccessibility of a GPU.
Solution: For debugging or initial testing, use smaller datasets or GPU-based cloud platforms for quick turnaround times.
Challenge: Model overfitting on training data.
Solution: Use validation data during training, apply dropout layers, and tune hyperparameters for better generalization.
FAQs
Question 1: Why are RNN and LSTM used for text classification?
Answer: RNN and LSTM models are designed for sequential data, so they can capture context and detect patterns in text, which leads to better classification.
Question 2: Why is GloVe embedding used for this project?
Answer: GloVe embedding allows for deriving semantic relatedness between words, which makes it easier for the model to understand the data.
Question 3: What is the process of pre-processing text data for an NLP project?
Answer: This is where preprocessing comes in, as it’s about cleaning text, removing noise, and tokenizing them so that they can be standardized for use in machine learning models.
Question 4: What is natural language processing LSTM?
Answer: Long Short-Term Memory (LSTM) is a powerful natural language processing (NLP) technique. It can learn and understand sequential data, making it ideal for analyzing text and speech.
Question 5: Is LSTM good for text classification?
Answer: Yes, LSTM (Long Short-Term Memory) networks are commonly used for text classification tasks due to their ability to capture long-range dependencies in sequential data like text.