Natural Language Processing | Machine Learning


Natural Language Processing: Natural language processing (NLP) is a field of artificial intelligence concerned with the interactions between computers and human (natural) language.

In simple terms, natural language processing applies machine learning to text and speech to teach computers to understand what is said in spoken and written words. The main focus of NLP is to read, decipher, understand, and make sense of human language in a way that is useful.

Examples of NLP in Real Life: NLP has many applications in everyday life. Here are a few:

  • Translating one language to another, e.g. Google Translate

  • Checking grammatical errors, e.g. Microsoft Word and Grammarly apply NLP to check and correct grammatical errors

  • Sentiment analysis, that is, identifying the mood or subjective opinions expressed in a text

  • Summarizing a text or article

  • Predicting the genre of books

  • Speech recognition, which is used in virtual assistants such as Apple Siri, Google Assistant, and Amazon Alexa

  • Question answering

How Does NLP Work?

Many NLP tasks are framed as classification problems. Commonly used models include Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy, and other classification algorithms that predict the outcome.

The working procedure of NLP can be divided into three major steps:

Step 1: Preprocess the text. This includes cleaning the data, tokenizing, stemming, part-of-speech (POS) tagging, lemmatization, and named entity recognition (NER).

Step 2: Vectorize the data, that is, encode the text in numeric form to create feature vectors.

Step 3: The final step is to fit a suitable classification algorithm to the dataset and make the predictions.
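
As a rough illustration of Step 2, the sketch below turns two made-up sentences into count vectors using only the Python standard library. The sentences and the helper name to_vector are assumptions for this example and are not part of any NLP library.

```python
from collections import Counter

# Two toy documents (made up for illustration).
docs = ["good food good service", "bad food"]

# Build a sorted vocabulary from every distinct word in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['bad', 'food', 'good', 'service']
print(vectors)     # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Each document becomes a fixed-length list of integers, which is the kind of input a classifier in Step 3 expects.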

We can implement all these steps using NLP libraries. Some of the popular NLP libraries are:

  • Natural Language Toolkit (NLTK)

  • spaCy

  • Stanford NLP

  • OpenNLP

In this article, we are going to implement all these steps using the NLTK library and a classification algorithm. You will learn how to:

  • Clean texts to prepare them for the Machine Learning models,

  • Create a Bag of Words model,

  • Apply Machine Learning models to this Bag of Words model.

Natural Language Processing in Python: Now we will perform these steps in Python. For this task, we are going to use the Restaurant_Reviews.tsv dataset, which contains 1000 customer reviews. Each review is labeled 0 or 1 according to whether it is negative or positive: 0 means the review is negative and 1 means the review is positive. Let's have a glimpse of the dataset.

[Figure: preview of the Restaurant_Reviews.tsv dataset]

This dataset looks different from other datasets because it is in TSV (tab-separated values) format. A CSV (comma-separated values) file is a poor fit for this task because the review text itself often contains commas, which could be mistaken for column separators when the file is parsed.

You can download the whole dataset from here.

It contains two columns, Review and Liked, separated by a tab. Our task is to preprocess this data and then apply a classification algorithm to classify each review as positive or negative.

First of all, we will import some essential libraries.

# Importing the libraries 
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd

Now we will import the dataset.

# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

Because our dataset is in TSV format, we specify the tab character in the delimiter parameter. The reviews also contain double quotes, which the parser would otherwise treat as quoting characters. Setting the quoting parameter to 3 (the value of csv.QUOTE_NONE) tells pandas to treat double quotes as ordinary characters.
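
To make the effect of these two parameters concrete, here is a minimal sketch using Python's standard csv module instead of pandas. The two sample reviews are made up for illustration. The constant csv.QUOTE_NONE has the integer value 3, which is the value passed as quoting above.

```python
import csv
import io

# A made-up TSV sample: the first review contains a comma,
# the second contains double quotes.
sample = 'Review\tLiked\nGreat food, great prices.\t1\nThe "special" was bland.\t0\n'

# Split on tabs only and treat double quotes as ordinary characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(row)
```

The comma and the double quotes both stay inside the Review field, because only the tab is treated as a separator.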

Now we will clean the texts using the NLTK library. The texts contain many characters and words that contribute little to the sentiment of a review. First, we keep only the letters of the English alphabet, replacing digits and punctuation with spaces, and convert all words to lowercase. The texts also contain very common words such as was, that, this, it, is, etc., which are known as stopwords and carry little information, so we remove them using the stopwords corpus from NLTK. Finally, we perform stemming, that is, we reduce each word to its root, so that loved, loving, lovely, etc. are all replaced by the same word, love.
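
To see what the cleaning does to a single review, here is a minimal standard-library sketch. The sample sentence and the small STOPWORDS set are assumptions chosen for illustration (NLTK's English list is much longer), and stemming is left out because it needs NLTK.

```python
import re

# A tiny hand-picked stopword set for illustration only.
STOPWORDS = {"was", "that", "this", "it", "is", "the"}

def clean(text):
    """Keep letters only, lowercase, split into words, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return [w for w in words if w not in STOPWORDS]

print(clean("The food was not good, it is 2 stars!"))
# ['food', 'not', 'good', 'stars']
```

The NLTK-based code for the full dataset follows the same pattern and adds stemming.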

# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
english_stopwords = set(stopwords.words('english'))

corpus = []
for i in range(len(dataset)):
    # Keep letters only, then lowercase and split into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # Drop stopwords and reduce the remaining words to their stems
    review = [ps.stem(word) for word in review if word not in english_stopwords]
    review = ' '.join(review)
    corpus.append(review)

Next comes a key step: creating the bag of words model. A bag of words represents each review as a multiset of its words, ignoring word order, so that every review becomes a vector of word counts that a classifier can work with.

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
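
The max_features argument limits the vocabulary to the most frequent terms across the corpus. The following sketch mimics that idea in plain Python on a made-up corpus. The helper name top_k_vocabulary is hypothetical, and details such as tie-breaking may differ from what CountVectorizer does.

```python
from collections import Counter

# Toy cleaned corpus (made up); each string is one review.
corpus = ["love food", "love place", "bad food", "love love service"]

def top_k_vocabulary(corpus, k):
    """Return the k most frequent words across the corpus, sorted alphabetically."""
    totals = Counter(word for doc in corpus for word in doc.split())
    most_common = [word for word, _ in totals.most_common(k)]
    return sorted(most_common)

vocab = top_k_vocabulary(corpus, k=2)
# One row per review, one column per vocabulary word.
X = [[doc.split().count(word) for word in vocab] for doc in corpus]
print(vocab)  # ['food', 'love']
print(X)      # [[1, 1], [0, 1], [1, 0], [0, 2]]
```

Rare words such as place, bad, and service are dropped, which keeps the feature matrix small.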

After all these steps, each entry in the corpus is a lowercase string of stemmed words separated by spaces, with punctuation, digits, and stopwords removed.

                                                   


We have completed preprocessing the texts and creating the bag of words model. Now, we split this preprocessed dataset into training and test sets. 

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
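
With test_size = 0.20 and 1000 reviews, 800 reviews go to the training set and 200 to the test set. The sketch below shows the idea behind such a split using only the standard library. The helper split_indices is hypothetical, and its shuffling is not the same as scikit-learn's, so the selected rows would differ.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle the indices 0..n_samples-1 and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_samples=1000, test_size=0.20, seed=0)
print(len(train_idx), len(test_idx))  # 800 200
```

Fixing the seed, like random_state above, makes the split reproducible across runs.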


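Let's fit a classification algorithm to our training set. Here we will use Naive Bayes, a simple and widely used baseline for text classification. Specifically, we use scikit-learn's GaussianNB, which models each feature (here, each word count) with a normal distribution per class.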


# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
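
To give a feel for how Naive Bayes reaches a decision from word counts, here is a small from-scratch sketch. Note that it implements the multinomial variant with add-one smoothing, not the Gaussian variant used above, because the multinomial form is easier to follow on word counts. The function names and the toy reviews are assumptions for illustration.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log priors and add-one-smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        log_like[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Pick the class with the highest log score; unknown words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc.split() if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data (made up): 1 = positive, 0 = negative.
docs = ["love food", "great place", "bad food", "terribl servic"]
labels = [1, 1, 0, 0]
log_prior, log_like = train_nb(docs, labels)
print(predict_nb("love place", log_prior, log_like))  # 1
print(predict_nb("bad servic", log_prior, log_like))  # 0
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log likelihoods can simply be summed.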

Now, we will perform prediction on the test set.

# Predicting the Test set results
y_pred = classifier.predict(X_test)

Let's see how well our model performs using the confusion matrix.

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)


                                                                                      


The diagonal entries of the confusion matrix count the correct predictions. Dividing their sum by the 200 reviews in the test set gives the accuracy of our model, which is 73%.
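
The relationship between a confusion matrix and accuracy can be sketched in a few lines of plain Python. The labels below are made up for illustration and are not the article's actual test results. As in scikit-learn's confusion_matrix, rows correspond to true labels and columns to predicted labels.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 matrix: rows are true labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
cm = confusion_counts(y_true, y_pred)
print(cm)            # [[3, 1], [1, 3]]
print(accuracy(cm))  # 0.75
```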