NLP Project for Beginners on Text Processing and Classification
Have you ever wondered how machines interpret or categorize text? This project provides insight into text processing and text classification using NLP. It is a beginner-focused project where you learn the theory and also get hands-on practice building a machine-learning model. You will work with NLTK, Scikit-learn, Pandas, and other libraries, and learn how to clean, tokenize, and organize text into different categories.
Project Overview
In this project, you will explore how a machine can read text, understand it, and classify it appropriately. You will learn what natural language processing (NLP) is and the steps involved in preparing raw text for further analysis. Libraries such as NLTK and Scikit-learn will be used to convert the text into numerical features that machine learning models can consume.
Using CountVectorizer and TfidfVectorizer, you will learn how to perform feature extraction, and you will use Logistic Regression to build a classifier. What’s the objective? To classify the sentiment of a given text as positive, negative, or neutral. Along the way, you will also check the performance of your model using classification accuracy and confusion matrices to make sure it is not performing poorly.
This project will leave you with a functional text classifier and basic skills in applying NLP techniques, including an understanding of how sentiment analyzers and review-based rating systems work. In short, this project is all about understanding text classification tasks!
Prerequisites
This project is beginner-friendly, but having some basic knowledge will make things smoother and more fun! Here’s what you need:
- Familiarity with Python programming and libraries like Pandas, Numpy, and Matplotlib.
- Prior knowledge of basic machine learning concepts, such as Logistic Regression.
- Familiarity with the NLTK and Scikit-learn libraries.
- Basic familiarity with NLP concepts and techniques for working with text data.
Approach
This project applies a systematic approach to text classification using Natural Language Processing (NLP). It begins with installing essential libraries like Scikit-learn, Numpy, Pandas, NLTK, and Seaborn. Text data is then preprocessed through tokenization, stopword removal, and stemming using NLTK. Features are extracted using CountVectorizer and TfidfVectorizer to convert text into numerical representations suitable for machine learning models. Logistic Regression is utilized for text classification, with the model’s performance evaluated using metrics like accuracy, confusion matrix, and classification reports. Results are visualized using Matplotlib and Seaborn for clear interpretation. This hands-on approach ensures a practical understanding of NLP fundamentals.
Workflow and Methodology
Workflow
- Install necessary libraries like Scikit-learn, NLTK, Pandas, and Matplotlib for text processing and analysis.
- Preprocess raw text data by tokenizing, removing stopwords, and applying stemming techniques using NLTK.
- Convert text into numerical features using CountVectorizer and TfidfVectorizer for model training.
- Split the dataset into training and testing sets using Scikit-learn's train_test_split function.
- Train a Logistic Regression model to classify text sentiment as positive, negative, or neutral.
- Evaluate the model’s performance with metrics like accuracy, confusion matrix, and classification reports.
- Visualize results and metrics using Matplotlib and Seaborn for better insights.
Methodology
- Data preprocessing ensures text is clean and standardized for effective analysis.
- Feature extraction transforms text data into numerical formats suitable for machine learning models.
- Logistic Regression is used for its simplicity and effectiveness in classification tasks.
- Model evaluation measures accuracy and provides insights into prediction quality.
- Visualization helps to interpret results and identify areas for improvement.
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.
Data Preparation Workflow
- Load the text data into a Pandas DataFrame so that it can be manipulated and analyzed with ease.
- Clean the text by removing unnecessary characters, punctuation, and noise.
- Tokenize the text into words/sentences using NLTK by implementing the tokenization methods.
- Eliminate any stop words such as the, is, etc. to retain only the important words in the text.
- Stemming or lemmatization is then done to ensure consistency in the usage of words by reducing them to their respective root forms.
- Vectors are formed out of the processed text using CountVectorizer or TfidfVectorizer.
- Split the dataset into training and testing sets to prepare for model building.
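Before walking through the full notebook, here is a compact, illustrative sketch of this preparation workflow on a toy example (the column name and sample text below are placeholders, not the project dataset):
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
nltk.download("punkt")      # tokenizer models (newer NLTK versions may also need "punkt_tab")
nltk.download("stopwords")  # stopword lists
# Toy data standing in for the real review dataset
df = pd.DataFrame({"review": ["This app is absolutely great!", "Worst update ever..."]})
sw = set(stopwords.words("english"))
porter = PorterStemmer()
def clean(text):
    tokens = word_tokenize(text.lower())                         # lowercase and tokenize
    tokens = [t for t in tokens if t.isalnum() and t not in sw]  # drop punctuation and stopwords
    return " ".join(porter.stem(t) for t in tokens)              # stem each word to its root
df["clean"] = df["review"].apply(clean)
X = CountVectorizer().fit_transform(df["clean"])                 # numeric features for a model
print(df["clean"].tolist(), X.shape)
The steps below implement this same pipeline on the real dataset, one stage at a time.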
Code Explanation
STEP 1
Mounting Google Drive
First, mount Google Drive to access the dataset that is stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Library Installation
The following code installs the required Python libraries such as scikit-learn, NumPy, Pandas, Seaborn, Matplotlib, and NLTK. They are used for data analysis, visualization, machine learning, and natural language processing. Note that collections and warnings are part of the Python standard library, so they do not need to be installed.
!pip install scikit-learn
!pip install numpy
!pip install pandas
!pip install seaborn
!pip install matplotlib
!pip install nltk
Import Library and Environment Configuration
The following code imports data manipulation (NumPy, Pandas), plotting (Matplotlib, Seaborn), and NLP (NLTK) libraries. It also initializes machine learning libraries (scikit-learn), downloads the necessary NLTK resources, turns off the warning messages, and allows for the plotting in the notebook.
import pickle
import warnings
from collections import Counter
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
# Download necessary NLTK resources
import nltk
nltk.download('stopwords')
nltk.download('punkt')
# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Enable inline plotting for Jupyter Notebooks
%matplotlib inline
STEP 2
Loading Data and Checking Dimensions:
This code loads the CSV file. After loading the dataset, it prints the dataset’s shape to check the number of rows and columns.
data = pd.read_csv("/content/drive/MyDrive/New 90 Projects/Project_8/Canva_ review_data.csv")
data.shape
Previewing Data
This block of code displays the first 3 rows of the dataset to have a quick overview of the structure of the dataset.
data.head(3)
STEP 3
Create a Subplot Grid for Data Visualization
This script creates a grid of 8 plots arranged in 2 rows and 4 columns to visualize several features of the dataset such as histograms, bar charts, heat maps, violin charts, scatter diagrams, and line graphs.
# Create a 2x4 subplot grid with a larger figure size to display the best 8 plots
fig, axs = plt.subplots(2, 4, figsize=(24, 12))
plt.subplots_adjust(hspace=0.6, wspace=0.6)
# Plot 1: Histogram of 'score'
data['score'].plot(kind='hist', bins=20, title='Score Histogram', ax=axs[0, 0])
axs[0, 0].spines[['top', 'right']].set_visible(False)
axs[0, 0].set_xlabel('Score')
axs[0, 0].set_ylabel('Frequency')
# Plot 2: Bar plot for Sentiment counts
data.groupby('Sentiment').size().plot(kind='barh', color=sns.color_palette('Dark2'), ax=axs[0, 1])
axs[0, 1].set_title('Sentiment Counts')
axs[0, 1].spines[['top', 'right']].set_visible(False)
# Plot 3: Bar plot for 'Sub Category' counts
data.groupby('Sub Category').size().plot(kind='barh', color=sns.color_palette('Dark2'), ax=axs[0, 2])
axs[0, 2].set_title('Sub Category Counts')
axs[0, 2].spines[['top', 'right']].set_visible(False)
# Plot 4: Heatmap of 'Sub Category' vs 'Sentiment'
df_2dhist = pd.DataFrame({
x_label: grp['Sub Category'].value_counts()
for x_label, grp in data.groupby('Sentiment')
}).fillna(0)
sns.heatmap(df_2dhist, cmap='viridis', ax=axs[0, 3], annot=True, fmt=".0f", cbar=False)
axs[0, 3].set_xlabel('Sentiment')
axs[0, 3].set_ylabel('Sub Category')
axs[0, 3].set_title('Sub Category vs Sentiment Heatmap')
# Plot 5: Heatmap of 'Sub Category' vs 'Sub Category_test'
df_2dhist_test = pd.DataFrame({
x_label: grp['Sub Category_test'].value_counts()
for x_label, grp in data.groupby('Sub Category')
}).fillna(0)
sns.heatmap(df_2dhist_test, cmap='viridis', ax=axs[1, 0], annot=True, fmt=".0f", cbar=False)
axs[1, 0].set_xlabel('Sub Category')
axs[1, 0].set_ylabel('Sub Category_test')
axs[1, 0].set_title('Sub Category vs Sub Category_test Heatmap')
# Plot 6: Violin plot of 'score' by 'Sentiment'
sns.violinplot(data=data, x='score', y='Sentiment', inner='box', palette='Dark2', ax=axs[1, 1])
sns.despine(ax=axs[1, 1], top=True, right=True, bottom=True, left=True)
axs[1, 1].set_title('Score by Sentiment')
# Plot 7: Scatter plot of 'score' vs 'thumbsUpCount'
data.plot(kind='scatter', x='score', y='thumbsUpCount', s=32, alpha=0.8, ax=axs[1, 2], color='coral')
axs[1, 2].spines[['top', 'right']].set_visible(False)
axs[1, 2].set_title('Score vs ThumbsUpCount')
# Plot 8: Line plot of 'thumbsUpCount'
data['thumbsUpCount'].plot(kind='line', ax=axs[1, 3], color='teal')
axs[1, 3].spines[['top', 'right']].set_visible(False)
axs[1, 3].set_title('ThumbsUpCount over Time')
axs[1, 3].set_xlabel('Index')
axs[1, 3].set_ylabel('ThumbsUpCount')
# Apply tight layout to prevent overlapping
plt.tight_layout()
plt.show()
This piece of code accesses the “review” column of the DataFrame for the row with the index number 1495.
data.loc[1495, "review"]
Fetching Specific Value
This piece of code accesses the “review” column of the data DataFrame for the row with the index number 13.
data.loc[13, "review"]
This piece of code accesses the “Sentiment” column of the DataFrame for the row with the index number 13.
data.loc[13, "Sentiment"]
This piece of code accesses the “Sentiment” column of the DataFrame for the row with the index number 1495.
data.loc[1495, "Sentiment"]
Count Plot for Sentiment Distribution
The following code generates a count plot of the ‘Sentiment’ values in the data DataFrame using the ‘Set2’ color palette. It shows how many reviews fall into each sentiment category.
sns.countplot(x="Sentiment", data=data, palette="Set2")
plt.show()
Sentiment Categories Count
The following code counts the occurrences of each distinct value in the ‘Sentiment’ column of the data DataFrame and shows the sentiment distribution.
data["Sentiment"].value_counts()
Positive Sentiment Proportion
The calculation represents the proportion of positive sentiment among the total of positive and negative sentiments.
468/(1032+468)
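A less error-prone way to get the same class proportions directly from the data, rather than hard-coding the counts, is pandas' normalized value counts:
# Proportion of each sentiment class, computed directly from the column
data["Sentiment"].value_counts(normalize=True)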
Count Plot for Score
The following code generates a count plot that illustrates the ‘score’ values in the data DataFrame with the use of a ‘Set3’ color palette.
sns.countplot(x="score", data=data, palette="Set3")
plt.show()
Count Plot for Scores with Sentiment
This code generates a count plot that illustrates the distribution of values of "score" present in the data DataFrame across various categories of "Sentiment," with each category represented in a different color. It helps in understanding the variation of scores across different sentiments.
sns.countplot(x="score", data=data, hue="Sentiment")
plt.show()
Calculating and Explaining the Duration of Reviews
This code calculates the length (in characters) of each review in the DataFrame, assigns it to a new column len, and gives summary statistics such as the mean, minimum, maximum, and quartile review lengths.
data["len"] = data["review"].apply(len)
data["len"].describe()
Histogram of Review Lengths
This code builds a histogram for analysis of the distribution of review lengths in the len column of data DataFrame. It allows us to find patterns in the lengths of the reviews or the distribution of the review lengths.
sns.displot(data["len"])
plt.show()
KDE Plot for Review Length by Sentiment
This code generates a Kernel Density Estimate (KDE) plot which helps to view the distribution of review lengths (len) by sentiment categories filled for better clarity. It demonstrates the tendency of review lengths by sentiment.
sns.displot(data=data, x="len", hue="Sentiment", kind="kde", fill=True)
plt.show()
STEP 4
Filtering Data and Obtaining Particular Review
This code filters the data DataFrame to keep only the “review” and “Sentiment” columns. It then retrieves the “review” string for the row with index 13.
data = data[["review", "Sentiment"]]
data.loc[13, "review"]
Sentence Tokenization of a Specific Review
The following piece of code uses the sent_tokenize function from NLTK to break the review at index 13 into individual sentences. It returns the list of sentences that make up the review.
sent_tokenize(data.loc[13,"review"])
Fetching Specific Value
This piece of code accesses the “review” column of the data DataFrame for the row with the index number 1495.
data.loc[1495, "review"]
This code splits the 'review' text from index 1495 in the data DataFrame into individual sentences in a list.
sent_tokenize(data.loc[1495, "review"])
Word Tokenization of a Specific Review
This piece of code tokenizes the 'review' text of data DataFrame indexed at 13 using the word_tokenize method from NLTK library. It outputs a list containing words and punctuation from the review.
word_tokenize(data.loc[13, "review"])
This piece of code tokenizes the 'review' text of data DataFrame indexed at 1495 using the word_tokenize method from NLTK library. It outputs a list containing words and punctuation from the review.
word_tokenize(data.loc[1495, "review"])
Creating a List of Reviews
This code stores all the values from the "review" column of the data DataFrame as a list in the variable reviews.
reviews = list(data["review"])
This code checks the total number of reviews in the reviews list.
len(reviews)
Accessing a Specific Review
This code retrieves the 14th review (index 13) from the reviews list. It returns the full text of that review.
reviews[13]
This code retrieves the 1496th review (index 1495) from the reviews list. It returns the full text of that review.
reviews[1495]
This code retrieves the 1496th review (index 1495) from the reviews list. It returns the full text in lowercase of that review.
reviews[1495].lower()
Changing the Case of Reviews to Lowercase
The following code creates a new list reviews_lower converting all the reviews present in the reviews list in lower case.
reviews_lower = [r.lower() for r in reviews]
This code retrieves the 14th review (index 13) from the reviews_lower list. It returns the lowercased text of that review.
reviews_lower[13]
Alternative Method for Converting Reviews to Lowercase
This code uses a for loop to iterate through the reviews list, converts each review to lowercase, and appends it to the reviews_lower list.
reviews_lower = []
for r in reviews:
reviews_lower.append(r.lower())
Tokenizing All Lowercase Reviews
The following implementation takes each review from the reviews_lower list and splits it into constituent words using word_tokenize. It creates a nested list, where each inner list contains the tokenized words of a single review.
tokens = [word_tokenize(r) for r in reviews_lower]
This code retrieves the tokens of the 14th review (index 13) from the tokenized list, showing the review split into individual word and punctuation tokens.
tokens[13]
This code retrieves the tokens of the 1496th review (index 1495) from the tokenized list, showing the review split into individual word and punctuation tokens.
tokens[1495]
This code checks the total number of tokenized reviews in the tokens list.
len(tokens)
Defining Stopwords for English
Using NLTK’s stopwords.words() function, this code loads a predefined list of common English stopwords. The variable sw stores this list, which is used to remove irrelevant words such as the, is, or and from the text during preprocessing.
sw = stopwords.words('english')
Displaying the First 10 Stopwords
This code retrieves and displays the first 10 stopwords from the sw list.
sw[:10]
Eliminating Stop Words from the Tokenized Reviews
This code removes all stopwords from each tokenized review in tokens, leaving only the meaningful words of each review for better text analysis.
tokens = [[word for word in t if word not in sw] for t in tokens]
This code retrieves the tokens of the 14th review (index 13) from the tokenized list, showing its tokens after stopword removal.
tokens[13]
This code retrieves the 14th review (index 13) from the reviews list. It returns the full text of that review.
reviews[13]
This code retrieves the tokens of the 1496th review (index 1495) from the tokenized list, showing its tokens after stopword removal.
tokens[1495]
This code retrieves the 1496th review (index 1495) from the reviews list. It returns the full text of that review.
reviews[1495]
Removing Punctuation
Setting up a Regular Expression Tokenizer
This code takes a RegexpTokenizer object that tokenizes the text by matching only word characters ( \w+ ). It does not include punctuation and symbols so that the tokens are only words.
tokenizer = RegexpTokenizer(r'\w+')
The tokenizer splits text into runs of word characters, so a contraction like "wasn't" is separated into the pieces 'wasn' and 't', with the apostrophe dropped.
tokenizer.tokenize("wasn't")
Tokenizing Text with Regular Expression Tokenizer
This code tokenizes the term 'wasn't' using the RegexpTokenizer and stores the result. The text is split into the pieces 'wasn' and 't', keeping word characters only and discarding the apostrophe.
t = tokenizer.tokenize("wasn't")
Combining Tokenized Words
This code combines the words in t in a single string with no spaces in it. For instance, ['wasn', 't'] is combined to form "wasnt".
"".join(t)
Tokenizing the Colon
This code uses RegexpTokenizer to tokenize the colon (:). Since the tokenizer only extracts word characters (\w+), the result will be an empty list.
tokenizer.tokenize(":")
Cleaning and Merging the Individual Words into a Sentence
This code cleans every tokenized word contained in the tokens list. In addition, it merges token parts that are not empty and removes all such empty results, thus producing a neat and orderly tokens list.
tokens = [["".join(tokenizer.tokenize(word)) for word in t
if len(tokenizer.tokenize(word))>0] for t in tokens]
This code retrieves the tokens of the 14th review (index 13) from the tokenized list, showing its tokens after punctuation has been stripped.
tokens[13]
This code retrieves the 14th review (index 13) from the reviews list. It returns the full text of that review.
reviews[13]
This code retrieves the tokens of the 1496th review (index 1495) from the tokenized list, showing its tokens after punctuation has been stripped.
tokens[1495]
This code retrieves the 1496th review (index 1495) from the reviews list. It returns the full text of that review.
reviews[1495]
STEP 5
This code initializes two stemmers:
- PorterStemmer: A widely used, relatively gentle stemmer.
- LancasterStemmer: A more aggressive stemmer that produces shorter, broader roots.
These tools are used for stemming words during text preprocessing.
porter = PorterStemmer()
lancaster = LancasterStemmer()
This code uses the PorterStemmer to stem the word "teachers", reducing it to its root form "teacher".
porter.stem("teachers")
Stemming a Word with LancasterStemmer
The LancasterStemmer is utilized in the following code to stem the word ‘teachers’. In this case, the word is reduced to ‘teach’ as a result of employing a more extreme approach to stemming.
lancaster.stem("teachers")
Stemming with PorterStemmer
This code uses PorterStemmer to stem "absolutely". The result is "absolut", reducing it to its root form.
porter.stem("absolutely")
Stemming a Word with LancasterStemmer
The LancasterStemmer is utilized in the following code to stem the word ‘absolutely’. In this case, the word is reduced to absolv as a result of employing a more extreme approach to stemming.
lancaster.stem("absolutely")
Applying the PorterStemmer to All the Tokens
This code applies the PorterStemmer to every word in every tokenized review in the tokens list. Each word is reduced to its base form and the tokens list is updated accordingly.
tokens = [[porter.stem(word) for word in t] for t in tokens]
This code retrieves the tokens of the 14th review (index 13) from the tokenized list, showing its stemmed tokens.
tokens[13]
This code retrieves the 14th review (index 13) from the reviews list. It returns the full text of that review.
reviews[13]
This code retrieves the tokens of the 1496th review (index 1495) from the tokenized list, showing its stemmed tokens.
tokens[1495]
This code retrieves the 1496th review (index 1495) from the reviews list. It returns the full text of that review
reviews[1495]
STEP 6
Flattening the Tokens List
The following code builds a flat list called flat_tokens containing every word from the nested lists in tokens, with no nesting. It combines all the tokenized words into one list for easier processing.
flat_tokens = [word for t in tokens for word in t]
This code checks the total number of words in the flat_tokens list.
len(flat_tokens)
This code retrieves and displays the first 10 words from the flat_tokens list.
flat_tokens[:10]
Counting Word Frequencies
This code employs the Counter class to evaluate every unique word located in the flat_tokens array and count their frequencies.
counts = Counter(flat_tokens)
This code checks the number of unique words in the counts Counter.
len(counts)
Showing the 10 Most Frequent Terms
The code obtains the ten most common words from the counts object. Each result is a pair consisting of a word and a number associated with its frequency.
counts.most_common(10)
STEP 7
This code retrieves the tokens of the 14th review (index 13) from the tokenized list, showing its fully processed tokens.
tokens[13]
Joining Tokens of a Specific Review.
This code joins the tokens of the 14th review (index 13) in the tokens list into a single string, with words separated by spaces. It reconstructs the cleaned review as text.
" ".join(tokens[13])
Reconstructing Cleaned Reviews
The given code joins each list of processed tokens in tokens back into a string and collects the results in the list clean_reviews. Each entry corresponds to the cleaned version of one review.
clean_reviews = [" ".join(t) for t in tokens]
Accessing a Cleaned Review
This code retrieves the 14th review (index 13) from the clean_reviews list. It shows the processed and reconstructed version of the review after cleaning and tokenization.
clean_reviews[13]
This code retrieves the 1496th review (index 1495) from the clean_reviews list. It shows the processed and reconstructed version of the review after cleaning and tokenization.
clean_reviews[1495]
Initializing CountVectorizer
The code below initializes a CountVectorizer that produces binary features (word present or absent) and ignores words that occur in fewer than 5 documents.
vect = CountVectorizer(binary=True, min_df=5)
Transforming Cleaned Reviews into Vectors
This script employs the CountVectorizer on a collection of clean_reviews. It is a text vectorization method that transforms the text into a sparse matrix X with binary values denoting the occurrence or non-occurrence of words.
X = vect.fit_transform(clean_reviews)
Verifying the Dimensions of the Vectorized Matrix
This code gives the shape of the sparse matrix X. The output shows the number of rows (reviews) and the number of columns (unique word features).
X.shape
STEP 8
Counting the Vocabulary Size
This code calculates the total number of unique words in the CountVectorizer vocabulary.
len(vect.vocabulary_)
This code checks the data types of X. It will return a sparse matrix.
type(X)
Converting Sparse Matrix to Dense Array
This segment of code takes the sparse matrix X and transforms it into a dense form of NumPy array X_a. This makes the data more convenient to check but takes up more space in the memory.
X_a = X.toarray()
This code gives the shape of the dense array. The output shows the number of rows (reviews) and the number of columns (unique word features) after conversion.
X_a.shape
Retrieving Specific Review's Vector
This code accesses the vector representation of the 14th review (index 13) in the two-dimensional array X_a. It shows the binary state (present/absent) of each word feature for that review.
X_a[13,:]
Calculating the Word Count of the Vector
This piece of code is used to sum up the values present in the vector for the 14th review, which is indexed at 13. The output states the number of different words appearing in that review after it has been converted into a vector form.
X_a[13,:].sum()
This code retrieves the 14th review (index 13) from the clean_reviews list.
clean_reviews[13]
Obtaining Feature Names.
This code retrieves the entire list of feature names (i.e., unique words) from the CountVectorizer and stores it in feature_names. It then uses np.where to find the column index of the stemmed word "unabl" in that array.
feature_names = vect.get_feature_names_out()
index = np.where(feature_names == "unabl")[0][0] # Use np.where to find the index
# np.where returns a tuple of arrays, we need the first element ([0]) of the first array ([0])
print(index)
Accessing a Specific Feature Value
The following code accesses row 13, column 370 of the dense array X_a. It shows whether the word corresponding to column 370 in the feature_names list appears in the 14th review (1 = present, 0 = absent).
X_a[13,370]
The following code returns the position of the word “work” in the feature_names array. np.where finds the locations of all matches and only the first one is kept; this is the column index of “work” in the vectorized matrix.
# Assuming 'vect' is your CountVectorizer object
# Replace 'vect.get_feature_names()' with 'vect.get_feature_names_out()'
feature_names = vect.get_feature_names_out() # Get the feature names as a NumPy array
index = np.where(feature_names == "work")[0][0] # Use np.where to find the index
# np.where returns a tuple of arrays, we need the first element ([0]) of the first array ([0])
print(index)
The following code accesses row 13, column 401 of the dense array X_a. It shows whether the word corresponding to column 401 in the feature_names list appears in the 14th review (1 = present, 0 = absent).
X_a[13,401]
This block of code displays the first few rows of the dataset (five by default) to give a quick overview of its current structure.
data.head()
Converting Sentiments into a Binary Form
The following code converts the Sentiment column of the DataFrame into numbers: 1 for ‘Positive’ sentiments and 0 for all other sentiments.
data["Sentiment"] = data["Sentiment"].apply(lambda x: 1 if x=="Positive" else 0)
Selecting the Target Variable
This piece of code extracts the Sentiment column from the data DataFrame and assigns it to y. It forms the target class labels of the machine learning model that is to be developed.
y = data["Sentiment"]
STEP 9
Dividing Dataset into Training and Testing Sets
This piece of code divides the values (X) and target values (y) into training and testing subsets. The testing data is 20% of the available data, and using stratify=y helps in maintaining the class balance, while consistent results are achieved through random_state=42.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=y, random_state=42)
Setting up the Regression Model
This code creates an instance of the LogisticRegression model.
model = LogisticRegression()
Training the Logistic Regression Model
This piece of code fits the LogisticRegression model on the training dataset X_train and y_train and makes the model learn the mapping of input features to target labels.
model.fit(X_train, y_train)
Using the Training Data for Prediction
This piece of code applies the LogisticRegression model which has been previously fitted to the training data X_train and predicts the labels for the training data. The predicted labels are stored in train_pred.
train_pred = model.predict(X_train)
Calculating Accuracy
The code below computes the model’s accuracy on the training data by comparing the training labels with the training predictions. In other words, it measures the proportion of correctly predicted labels.
accuracy_score(y_train, train_pred)
Assessing the Model’s performance on Test Dataset
This code uses the trained model to predict labels for the test data (X_test) and saves them in test_pred. It then compares the predictions with the actual labels y_test and computes the accuracy using accuracy_score.
test_pred = model.predict(X_test)
accuracy_score(y_test, test_pred)
Saving the Vectorizer and Model
This code saves the vectorizer and the trained model as pickle files for later use.
with open("/content/drive/MyDrive/New 90 Projects/Project_8/binary_count_vect.pkl", "wb") as f:
pickle.dump(vect, f)
with open("/content/drive/MyDrive/New 90 Projects/Project_8/binary_count_vect_lr.pkl", "wb") as f:
pickle.dump(model, f)
STEP 10
Create and Use CountVectorizer
The following piece of code creates a CountVectorizer instance with min_df=5, which ignores words occurring in fewer than 5 documents; this time the raw word counts are kept instead of binary flags. Then clean_reviews is converted into a sparse feature matrix X.
vect = CountVectorizer(min_df=5)
X = vect.fit_transform(clean_reviews)
Converting Sparse Matrix to Dense Array
This segment of code takes the sparse matrix X and transforms it into a dense form of NumPy array X_a. This makes the data more convenient to check but takes up more space in the memory.
X_a = X.toarray()
This code retrieves the 14th review (index 13) from the clean_reviews list.
clean_reviews[13]
Finding the Index of a Specific Feature
The following code returns the position of the given word “work” in the feature_names array. If the word is not found, it prints an error message.
# vect.get_feature_names().index("work")
feature_names = vect.get_feature_names_out()
try:
work_index = feature_names.tolist().index("work") # Convert to list for index()
print(f"Index of 'work': {work_index}")
except ValueError:
print("'work' is not found in the feature names.")
This code retrieves the entire feature vector for the 14th review (index 13) from the dense array X_a.
X_a[13,:]
Accessing Word Frequency in the Feature Array
This code accesses row 13 (the 14th review) and column 401 of X_a and gives the frequency of that word. The value is 2 because the word occurs twice in the review: this CountVectorizer counts word occurrences rather than just recording whether the word is present.
X_a[13,401]
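To make the difference from the earlier binary=True vectorizer concrete, here is a small illustrative example on a toy corpus (not the project data, and without the min_df=5 filter):
from sklearn.feature_extraction.text import CountVectorizer
toy_docs = ["good good app", "bad app"]
# Binary: only records whether a word appears at all
print(CountVectorizer(binary=True).fit_transform(toy_docs).toarray())
# Default: records how many times each word appears
print(CountVectorizer().fit_transform(toy_docs).toarray())
# Columns are sorted alphabetically ("app", "bad", "good"), so "good" shows 1 in the
# binary matrix but 2 in the count matrix for the first document.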
Dividing Dataset into Training and Testing Sets
This piece of code divides the values (X) and target values (y) into training and testing subsets. The testing data is 20% of the available data, and using stratify=y helps in maintaining the class balance, while consistent results are achieved through random_state=42.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=y, random_state=42)
Setting up and Training the Model
This code creates an instance of the LogisticRegression model, fits it to the training data X_train and y_train, and lets the model learn the mapping from input features to target labels.
model = LogisticRegression()
model.fit(X_train, y_train)
Evaluating Model Performance
This code evaluates the LogisticRegression model by predicting labels for the training set (train_pred) and the testing set (test_pred). accuracy_score computes the ratio of correctly classified instances to the total, and both the training and testing accuracy are printed, showing how well the model fits the training data and how well it generalizes to new data.
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
print(f"Train Accuracy:{accuracy_score(y_train, train_pred)}")
print(f"Test Accuracy:{accuracy_score(y_test, test_pred)}")
Saving the Vectorizer and Model
This code saves the vectorizer and the trained model as pickle files for later use.
with open("/content/drive/MyDrive/New 90 Projects/Project_8/count_vect.pkl", "wb") as f:
pickle.dump(vect, f)
with open("/content/drive/MyDrive/New 90 Projects/Project_8/count_vect_lr.pkl", "wb") as f:
pickle.dump(model, f)
STEP 11
Creating N-Gram Features with CountVectorizer
This piece of code sets up a CountVectorizer with a minimum document frequency of five (min_df=5) and an n-gram range of 1 to 3 (ngram_range=(1,3)). Fitting it on the clean_reviews list creates a sparse matrix X whose features are unigrams, bigrams, and trigrams, with their counts in each review.
vect = CountVectorizer(min_df=5, ngram_range=(1,3))
X = vect.fit_transform(clean_reviews)
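To see what kind of features an n-gram range produces, here is a small illustrative example on a toy corpus (not the project data, and without the min_df filter):
from sklearn.feature_extraction.text import CountVectorizer
toy_docs = ["not good at all", "very good app"]
ngram_vect = CountVectorizer(ngram_range=(1, 3))
ngram_vect.fit(toy_docs)
# The vocabulary now mixes unigrams ("good"), bigrams ("not good"), and trigrams ("not good at")
print(ngram_vect.get_feature_names_out())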
Verifying the Dimensions of the Sparse Matrix
This code gives the shape of the sparse matrix X. The output shows the number of rows (reviews) and the number of columns (unique n-gram features).
X.shape
Analyzing the Vocabulary Generated by CountVectorizer
This piece of code produces the vocabulary developed by the CountVectorizer. It depicts the terms as keys and the corresponding indices which represent their positions in the feature matrix as the values.
vect.vocabulary_
Dividing Dataset into Training and Testing Sets
This piece of code divides the values (X) and target values (y) into training and testing subsets. The testing data is 20% of the available data, and using stratify=y helps in maintaining the class balance, while consistent results are achieved through random_state=42.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=y, random_state=42)
Setting up and Training the Model
This code creates an instance of the LogisticRegression model, fits it to the training datasets X_train and y_train, and makes the model learn the mapping of input features to target labels.
model = LogisticRegression()
model.fit(X_train, y_train)
Evaluating Model Performance
This code evaluates the LogisticRegression model by predicting labels for the training set (train_pred) and the testing set (test_pred). accuracy_score computes the ratio of correctly classified instances to the total, and both the training and testing accuracy are printed, showing how well the model fits the training data and how well it generalizes to new data.
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
print(f"Train Accuracy:{accuracy_score(y_train, train_pred)}")
print(f"Test Accuracy:{accuracy_score(y_test, test_pred)}")
Saving the Vectorizer and Model
This code saves the vectorizer and the trained model as pickle files for later use.
with open("/content/drive/MyDrive/New 90 Projects/Project_8/n_gram.pkl", "wb") as f:
pickle.dump(vect, f)
with open("/content/drive/MyDrive/New 90 Projects/Project_8/n_gram_lr.pkl", "wb") as f:
pickle.dump(model, f)
STEP 12:
Initializing the TfidfVectorizer
This code snippet initializes the TfidfVectorizer with min_df=5, so words occurring in fewer than 5 documents are ignored. Rather than using raw counts, it derives and weights features using Term Frequency-Inverse Document Frequency (TF-IDF) values.
vect = TfidfVectorizer(min_df=5)
Transforming Reviews with TfidfVectorizer
This transforms the clean_reviews list into a sparse matrix X containing the TF-IDF value of each word in each review. Each value reflects how important that word is to the review relative to the whole corpus.
X = vect.fit_transform(clean_reviews)
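To build intuition for the TF-IDF weights, here is a small illustrative example on a toy corpus (not the project data, and without the min_df=5 filter):
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
toy_docs = ["great app", "great design", "terrible app"]
toy_tfidf = TfidfVectorizer()
toy_X = toy_tfidf.fit_transform(toy_docs)
# Words shared across documents ("great", "app") get lower weights than words unique to one document
print(pd.DataFrame(toy_X.toarray(), columns=toy_tfidf.get_feature_names_out()).round(2))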
Verifying the Dimensions of the Sparse Matrix
This code gives the shape of the sparse matrix X. The output shows the number of rows (reviews) and the number of columns (unique word features).
X.shape
Dividing Dataset into Training and Testing Sets
This piece of code divides the values (X) and target values (y) into training and testing subsets. The testing data is 20% of the available data, and using stratify=y helps in maintaining the class balance, while consistent results are achieved through random_state=42.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=y, random_state=42)
Setting up and Training the Model
This code creates an instance of the LogisticRegression model, fits it to the training data X_train and y_train, and lets the model learn the mapping from input features to target labels.
model = LogisticRegression()
model.fit(X_train, y_train)
Evaluating Model Performance
This code evaluates the LogisticRegression model by predicting labels for the training set (train_pred) and the testing set (test_pred). accuracy_score computes the ratio of correctly classified instances to the total, and both the training and testing accuracy are printed, showing how well the model fits the training data and how well it generalizes to new data.
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
print(f"Train Accuracy:{accuracy_score(y_train, train_pred)}")
print(f"Test Accuracy:{accuracy_score(y_test, test_pred)}")
Saving the Vectorizer and Model
This code saves the vectorizer and the trained model as pickle files for later use.
with open("/content/drive/MyDrive/New 90 Projects/Project_8/tf-idf.pkl", "wb") as f:
pickle.dump(vect, f)
with open("/content/drive/MyDrive/New 90 Projects/Project_8/tf-idf_lr.pkl", "wb") as f:
pickle.dump(model, f)
Evaluating and Comparing Multiple Models
This code assesses and compares several machine learning models built with distinct vectorization techniques: Binary CountVectorizer, CountVectorizer, N-gram, and TF-IDF. First, the trained models and their vectorizers are loaded from the pickled files. The evaluate_model function transforms the test data, predicts labels, and computes performance measures such as accuracy, the classification report, and the confusion matrix. The last part of the code loops through each model, evaluates it, and prints the results so the models can be compared based on the vectorization technique used.
from sklearn.metrics import classification_report
# Assuming you have already trained and saved your models
# Load the models and vectorizers for comparison
# Load binary_count_vect_lr
with open("/content/drive/MyDrive/New 90 Projects/Project_8/binary_count_vect_lr.pkl", "rb") as f:
binary_count_vect_lr = pickle.load(f)
with open("/content/drive/MyDrive/New 90 Projects/Project_8/binary_count_vect.pkl", "rb") as f:
binary_count_vect = pickle.load(f)
# Load count_vect_lr
with open("/content/drive/MyDrive/New 90 Projects/Project_8/count_vect_lr.pkl", "rb") as f:
count_vect_lr = pickle.load(f)
with open("/content/drive/MyDrive/New 90 Projects/Project_8/count_vect.pkl", "rb") as f:
count_vect = pickle.load(f)
# Load n_gram_lr
with open("/content/drive/MyDrive/New 90 Projects/Project_8/n_gram_lr.pkl", "rb") as f:
n_gram_lr = pickle.load(f)
with open("/content/drive/MyDrive/New 90 Projects/Project_8/n_gram.pkl", "rb") as f:
n_gram = pickle.load(f)
# Load tf-idf_lr
with open("/content/drive/MyDrive/New 90 Projects/Project_8/tf-idf_lr.pkl", "rb") as f:
tf_idf_lr = pickle.load(f)
with open("/content/drive/MyDrive/New 90 Projects/Project_8/tf-idf.pkl", "rb") as f:
tf_idf = pickle.load(f)
# Function to evaluate and print model performance
def evaluate_model(model, vectorizer, X_test, y_test):
X_test_vec = vectorizer.transform(X_test)
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
return accuracy, report, cm
# Prepare X_test: rebuild the same split on the cleaned review text so it lines up with y_test
_, X_test, _, y_test = train_test_split(clean_reviews, y, test_size=0.2,
                                         stratify=y, random_state=42)
models_data = [
(binary_count_vect_lr, binary_count_vect, "Binary CountVectorizer"),
(count_vect_lr, count_vect, "CountVectorizer"),
(n_gram_lr, n_gram, "N-gram"),
(tf_idf_lr, tf_idf, "TF-IDF")
]
for model, vectorizer, model_name in models_data:
accuracy, report, cm = evaluate_model(model, vectorizer, X_test, y_test)
print(f"Model: {model_name}")
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n{report}")
print("-" * 50)
print("\n")
Visualizing Confusion Matrices for Several Models
This piece of code creates a 2x2 grid of subplots to visualize the confusion matrices of the models built with the different vectorization techniques: Binary CountVectorizer, CountVectorizer, N-gram, and TF-IDF. For each model, the confusion matrix is plotted as a heatmap.
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
axs = axs.flatten()
for i, (model, vectorizer, model_name) in enumerate(models_data):
accuracy, report, cm = evaluate_model(model, vectorizer, X_test, y_test)
# Confusion Matrix Plot
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=axs[i], cbar=False)
axs[i].set_title(f"Confusion Matrix: {model_name}")
axs[i].set_xlabel("Predicted Label")
axs[i].set_ylabel("True Label")
plt.tight_layout()
plt.show()
STEP 13:
Predicting Sentiment for Reviews
This code demonstrates how to load the best model (n-gram) and its vectorizer, pre-process the input reviews, and classify the sentiment of the reviews as “Positive” or “Negative.”
# Load the best model (n-gram) and its vectorizer
vect = pickle.load(open("/content/drive/MyDrive/New 90 Projects/Project_8/n_gram.pkl", "rb"))
model = pickle.load(open("/content/drive/MyDrive/New 90 Projects/Project_8/n_gram_lr.pkl", "rb"))
def predict_sentiment(review):
"""Predicts the sentiment of a given review using the loaded model.
Args:
review: The review text.
Returns:
A string indicating the predicted sentiment ("Positive" or "Negative").
"""
# Preprocess the review
review = review.lower()
tokens = word_tokenize(review)
tokens = [word for word in tokens if word not in sw]
tokens = ["".join(tokenizer.tokenize(word)) for word in tokens if len(tokenizer.tokenize(word)) > 0]
tokens = [porter.stem(word) for word in tokens]
clean_review = " ".join(tokens)
# Transform the review using the vectorizer
X_test = vect.transform([clean_review])
# Make a prediction
prediction = model.predict(X_test)[0]
if prediction == 1:
return "Positive"
else:
return "Negative"
# Sample usage
test_review_1 = '''this is a truly amazing app , best for those who havw
content but don't know how to express it in a good and shareable manner.
Thanks Team Canva for such a great app.'''
test_review_2 = '''Its the worst app ever I save my design lts not save'''
print(f"Review 1 sentiment: {predict_sentiment(test_review_1)}")
print(f"Review 2 sentiment: {predict_sentiment(test_review_2)}")
Sample Test Reviews
This code defines two test reviews:
- A positive review.
- A negative review criticizing the app.
# Sample test reviews
test_review_1 = '''this is a truly amazing app , best for those who havw
content but don't know how to express it in a good and shareable manner.
Thanks Team Canva for such a great app.'''
test_review_2 = '''Its the worst app ever I save my design lts not save'''
This code loads the saved N-Gram CountVectorizer and the trained Logistic Regression model from their pickle files for sentiment prediction.
vect = pickle.load(open("/content/drive/MyDrive/New 90 Projects/Project_8/n_gram.pkl", "rb"))
model = pickle.load(open("/content/drive/MyDrive/New 90 Projects/Project_8/n_gram_lr.pkl", "rb"))
Wrapping Test Reviews in Lists
This code wraps the test reviews test_review_1 and test_review_2 in lists so they match the input the vectorizer expects, which is an iterable of documents.
test_review_1 = [test_review_1]
test_review_2 = [test_review_2]
Converting Test Reviews to Lowercase
This code converts test_review_1 and test_review_2 to lowercase so the text is consistent for preprocessing and sentiment prediction.
test_review_1 = [r.lower() for r in test_review_1]
test_review_2 = [r.lower() for r in test_review_2]
Tokenizing Test Reviews
This code tokenizes the text in test_review_1 and test_review_2 into lists of words.
tokens_1 = [word_tokenize(r) for r in test_review_1]
tokens_2 = [word_tokenize(r) for r in test_review_2]
Eliminating Stopwords from the Tokenized Text
This code removes stopwords from both tokens_1 and tokens_2, retaining only the significant words of each tokenized test review.
tokens_1 = [[word for word in t if word not in sw] for t in tokens_1]
tokens_2 = [[word for word in t if word not in sw] for t in tokens_2]
Cleaning Tokens Using RegexpTokenizer
This code works on tokens_1 and tokens_2 by doing the following actions:
- It eliminates the punctuations with the use of RegexpTokenizer.
- It also combines the cleaned tokens into a proper word.
- Any empty results are eliminated for both test reviews.
tokens_1 = [["".join(tokenizer.tokenize(word)) for word in t
if len(tokenizer.tokenize(word))>0] for t in tokens_1]
tokens_2 = [["".join(tokenizer.tokenize(word)) for word in t
if len(tokenizer.tokenize(word))>0] for t in tokens_2]
Stemming Tokens
This code applies PorterStemmer to the tokens in tokens_1 and tokens_2, reducing each word to its root for both test reviews.
tokens_1 = [[porter.stem(word) for word in t] for t in tokens_1]
tokens_2 = [[porter.stem(word) for word in t] for t in tokens_2]
Accessing the Processed Tokens for Review 1
This code outputs the stemmed and cleaned tokens from tokens_1
tokens_1
Accessing the Processed Test Review 1
This snippet of code shows test_review_1 at this stage: a list containing the lowercased review text (the cleaned, tokenized version lives in tokens_1).
test_review_1
Accessing the Processed Tokens for Review 2
This code outputs the stemmed and cleaned tokens from tokens_2
tokens_2
Accessing the Processed Test Review 2
This snippet of code shows test_review_2 at this stage: a list containing the lowercased review text (the cleaned, tokenized version lives in tokens_2).
test_review_2
Reconstructing Cleaned Reviews
This code joins the tokens in tokens_1 and tokens_2 back into sentences, producing clean_review_1 and clean_review_2 as the fully cleaned and processed test reviews.
clean_review_1 = [" ".join(review) for review in tokens_1]
clean_review_2 = [" ".join(review) for review in tokens_2]
Transforming Cleaned Review into Feature Vector
This code uses the loaded vect (N-Gram CountVectorizer) to transform clean_review_1 into a sparse matrix X_test. The matrix represents the review as numerical features based on the N-Gram model.
X_test = vect.transform(clean_review_1)
This code checks the dimension of X_test data.
X_test.shape
Predicting the Probabilities of Sentiment
This script employs the model that has been loaded to predict the probabilities for each category of sentiment (in this instance, only Positive and Negative) for the processed X_test. The output displays the score of confidence in each category.
model.predict_proba(X_test)
Sentiment classification using trained models
This script makes use of the model deployed to test the sentiment class of X_test data. It provides the corresponding sentiment label for each data point, such as 1 or 0 for “Positive” or “Negative” respectively.
model.predict(X_test)
This code uses the loaded vect (N-Gram CountVectorizer) to transform clean_review_2 into a sparse matrix X_test. The matrix represents the review as numerical features based on the N-Gram model.
X_test = vect.transform(clean_review_2)
This code checks the dimension of X_test data.
X_test.shape
This script employs the model that has been loaded to predict the probabilities for each category of sentiment (in this instance, only Positive and Negative) for the processed X_test for review 2. The output displays the score of confidence in each category.
model.predict_proba(X_test)
Sentiment classification using trained models
This script makes use of the model deployed to test the sentiment class of X_test data. It provides the corresponding sentiment label for each data point, such as 1 or 0 for “Positive” or “Negative” respectively.
model.predict(X_test)
Conclusion
This project provides an enriching experience in handling and categorizing text with NLP techniques. Step by step, you have covered every stage that raw text must go through before it can yield meaningful observations, including cleaning and tokenizing the text and fitting and testing a Logistic Regression model. By exploring the various vectorization approaches, the project illustrates how text can be organized efficiently by combining the capabilities of NLTK, Scikit-learn, TfidfVectorizer, and others. With this strong base, you can move on to more advanced aspects of NLP and develop text-analysis applications that make a difference.
Challenges and Solutions
Challenge: Dealing with noisy text data, especially during the preprocessing stage, is difficult.
Solution: Use regular expressions to clean the text by removing unnecessary characters, and use NLTK for tokenization and stopword removal.
Challenge: An imbalanced dataset biases the classifier's predictions.
Solution: Use oversampling techniques like SMOTE, or undersampling, to create a balanced class distribution.
Challenge: The model can overfit the training data, especially with small datasets.
Solution: Apply cross-validation and regularization (L1 or L2) to curb overfitting.
Challenge: Logistic Regression may not perform well on complex datasets.
Solution: Try stronger models such as Random Forest or Support Vector Machines to improve accuracy.
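For illustration, here is a hedged sketch of two of these mitigations, class weighting and cross-validation, applied to the X and y built earlier in this project (SMOTE itself lives in the separate imbalanced-learn package and is not shown here):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Class weighting: penalize mistakes on the minority class more heavily
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000)
# 5-fold cross-validation: a more robust accuracy estimate than a single train/test split
scores = cross_val_score(weighted_model, X, y, cv=5, scoring="accuracy")
print("Cross-validated accuracy:", scores.mean())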
FAQ
Question 1: Define text classification in the context of NLP.
Answer: Text classification is an NLP (natural language processing) task in which text is assigned to categories or labels such as positive, negative, or neutral.
Question 2: What are the libraries needed for this NLP task?
Answer: NLTK, Scikit-learn, Pandas, Numpy, Matplotlib, and Seaborn are the tools that are essential to carry out this project successfully.
Question 3: How is text data preparation done for machine learning?
Answer: Preprocessing means cleaning the text, removing stop words, tokenizing it, and applying stemming or lemmatization so the text is ready for feature extraction.
Question 4: Why use CountVectorizer and TfidfVectorizer?
Answer: CountVectorizer and TfidfVectorizer transform text into a machine-readable numerical format so that machine-learning algorithms can work with textual data efficiently.
Question 5: What machine learning model is applied in text classification?
Answer: In this project, Logistic Regression is used; it is a well-known, simple statistical model for binary and multiclass classification problems.