
Word2Vec and FastText Word Embedding with Gensim in Python
This project applies Natural Language Processing techniques to the analysis and visualization of text data. We use some of the most popular word embedding models, Word2Vec (CBOW, Skip-Gram) and FastText, to process and explore text and to extract insights such as word similarities and analogies. The objective is to understand how machine learning models capture relationships between words.
Project Overview
This project uses Natural Language Processing (NLP) methods to analyze text documents with word embedding models. First, the text data is preprocessed: cleaned, tokenized, and stripped of stopwords. Then three popular models, CBOW, Skip-Gram, and FastText, are trained on the dataset, producing vector representations that capture how words relate to each other.
Afterward, the trained models are used to explore word similarity, analogical reasoning (for instance, 'doctor + medicine - hospital'), and outlier detection within groups of words. To better understand how the models perform, we apply the dimensionality reduction techniques PCA and t-SNE to illustrate how each model arranges words in a 2D space. The project is geared towards comparing how the different embeddings, CBOW, Skip-Gram, and FastText, capture word meanings and relations.
Prerequisites
- Python Programming: Basic knowledge of Python is required for this project.
- NLP Basics: Familiarity with concepts like tokenization, stopword removal, and word embeddings would be helpful.
- Machine Learning: Knowledge of training and evaluating models on text data will be useful.
- Libraries: Knowledge of Python libraries like NumPy, Pandas, NLTK, Gensim, and scikit-learn is necessary.
- Basic Linear Algebra: Understanding vectors, matrices, and cosine similarity is fundamental to making sense of word embeddings.
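Since cosine similarity underpins most of the comparisons in this project, here is a minimal NumPy illustration of how it is computed; the two 3-dimensional vectors are made up purely for demonstration, as real embeddings have far more dimensions.
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors (illustrative only)
v_doctor = np.array([0.2, 0.8, 0.5])
v_nurse = np.array([0.1, 0.9, 0.4])
print(cosine_similarity(v_doctor, v_nurse))  # values near 1.0 indicate similar words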
Approach
The approach begins with preprocessing of the text data: cleaning, tokenization, and stopword removal. Then three word embedding models, CBOW, Skip-Gram, and FastText, are trained on the processed text to capture relationships between words in a high-dimensional vector space. The models are compared through word similarity checks, analogical reasoning tests (for example, 'doctor + medicine - hospital'), and the detection of outliers within groups of words. Finally, dimensionality reduction techniques such as PCA and t-SNE are applied to project the high-dimensional word vectors onto a 2D space and visualize their relationships. Ultimately, this leads to a comparison of each model's ability to capture semantic relationships between words, giving insight into how different word embeddings represent language.
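As a rough sketch of the training step described above, the snippet below shows how the three models can be built with Gensim (4.x API); the variable tokenized_docs (a list of token lists produced by preprocessing) and the hyperparameter values are illustrative assumptions rather than the project's final settings.
from gensim.models import Word2Vec, FastText

# `tokenized_docs` is assumed to be a list of token lists, e.g.
# [['covid', 'vaccine', 'trial'], ['patient', 'recovery', 'rate'], ...]
cbow_model = Word2Vec(sentences=tokenized_docs, vector_size=100,
                      window=5, min_count=2, sg=0)      # sg=0 -> CBOW
skipgram_model = Word2Vec(sentences=tokenized_docs, vector_size=100,
                          window=5, min_count=2, sg=1)  # sg=1 -> Skip-Gram
fasttext_model = FastText(sentences=tokenized_docs, vector_size=100,
                          window=5, min_count=2, sg=1)  # subword-aware embeddings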
Workflow
- Data Collection: Collect and load the dataset that contains the text information (titles and abstracts).
- Preprocessing: Clean the text, tokenize it, and remove stopwords.
- Word Embedding Models: Train CBOW, Skip-Gram, and FastText word embedding models on the processed text.
- Evaluate Models: Evaluate the word embeddings through similarity checks, analogy tests, and anomaly detection.
- Dimensionality Reduction: Apply PCA and t-SNE to visualize the word embeddings in 2D (a sketch of this step follows the list).
- Visualization: Visualize the results with Matplotlib and Plotly for easier interpretation.
- Analysis: Compare and analyze each model's performance based on the embeddings it generates.
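The dimensionality reduction and visualization steps referenced in the list can be sketched as follows; this assumes a trained model named cbow_model (as in the training sketch above) and plots only the 100 most frequent words to keep the figure readable.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

words = cbow_model.wv.index_to_key[:100]   # 100 most frequent words
vectors = cbow_model.wv[words]             # shape: (100, vector_size)

# PCA projection to 2D; for t-SNE, use TSNE(n_components=2, perplexity=30) instead
coords = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=8)
plt.title('Word embeddings projected to 2D with PCA')
plt.show()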
Methodology
- Text Preprocessing: Use NLTK to clean and preprocess the text data (tokenization, stopword removal).
- Word2Vec and FastText: Train CBOW, Skip-Gram, and FastText models using Gensim to produce word embeddings.
- Similarity Metrics: Compute the similarity between words using the cosine similarity of their word vectors.
- Analogical Reasoning: Perform analogical reasoning tasks such as 'doctor + medicine - hospital' to assess the models.
- Anomaly Detection: Use Gensim's doesnt_match method to identify the outlier word in a group (see the evaluation sketch after this list).
- Dimensionality Reduction: Reduce the high-dimensional word embeddings with PCA and t-SNE for visualization.
- Visualization: Plot the word embeddings in 2D scatter plots to compare the models visually.
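Below is a brief sketch of these evaluation checks using Gensim's keyed-vector API, assuming a trained model named skipgram_model; the query words are illustrative and must appear in the model's vocabulary.
# Nearest neighbours by cosine similarity
print(skipgram_model.wv.most_similar('virus', topn=5))

# Cosine similarity between two specific words
print(skipgram_model.wv.similarity('doctor', 'nurse'))

# Analogical reasoning: vector('doctor') + vector('medicine') - vector('hospital')
print(skipgram_model.wv.most_similar(positive=['doctor', 'medicine'],
                                     negative=['hospital'], topn=3))

# Anomaly detection: which word does not belong to the group?
print(skipgram_model.wv.doesnt_match(['fever', 'cough', 'vaccine', 'computer']))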
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.
Data Preparation Workflow:
- Loading Dataset: Import the dataset, which contains text data in the form of titles and abstracts.
- Check Missing Values: Detect and handle any missing values in the dataset.
- Select Relevant Columns: The text columns "title" and "abstract" are extracted for analysis.
- Drop Missing Data: Remove all the rows where no values are present in the selected text columns.
- Merge the Text: Now merge these two columns "title" and "abstract" into a single column for analysis.
- Text Preprocessing: Clean and tokenize the merged text: lowercase it, remove punctuation, and filter out stopwords (a sketch of this pipeline follows the list).
- Tokenization: Split the text into separate words or tokens for subsequent processing.
- Lemmatization: Reduce each word to its base form; e.g., "running" becomes "run".
- Preparation for Modeling: Store the preprocessed text in the appropriate format for training the word embedding models.
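A minimal sketch of this preprocessing pipeline with NLTK is shown below; the helper function preprocess and the example sentence are hypothetical, and the exact cleaning rules (such as the minimum token length) are illustrative choices rather than the project's final settings.
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                               # lowercase
    text = re.sub(r'[^a-z\s]', ' ', text)             # drop punctuation and digits
    tokens = word_tokenize(text)                      # split into tokens
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return [lemmatizer.lemmatize(t) for t in tokens]  # reduce words to base form

print(preprocess('The patients were admitted to hospitals during the trials.'))
# -> ['patient', 'admitted', 'hospital', 'trial']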
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the data stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Library setup and installation
This installs the required libraries: textblob for text processing, streamlit for web app development, and wordcloud for generating word clouds. It also downloads the corpora that textblob needs to work properly.
!pip install textblob
!pip install streamlit
!pip install wordcloud
!python -m textblob.download_corpora
Importing Libraries for NLP and Visualization
This code imports the libraries for text preprocessing, natural language processing (NLP), and machine learning. The key tools are nltk for tokenization and lemmatization of the text corpus, gensim for training word embeddings, matplotlib and plotly for visualizing the data, and sklearn for machine learning tasks such as dimensionality reduction and feature extraction.
import re # used for preprocessing
import nltk
import gensim
import string # used for preprocessing
import numpy as np
import pandas as pd
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt_tab')
from textblob import TextBlob
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import plotly.graph_objects as go
from nltk.corpus import stopwords # used for preprocessing
from gensim.models import Word2Vec
from gensim.models import FastText
from sklearn.decomposition import PCA
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer # used for preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
STEP 2:
Loading Data
This code loads the CSV file from the specified path on Google Drive.
df=pd.read_csv('/content/drive/MyDrive/Aionlinecourse_badhon/Project/Word2Vec and FastText Word Embedding with Gensim in Python/Dimension-covid.csv') #for preprocessing
Previewing Data
This block of code displays the first 10 rows of the dataset to give a quick overview of its structure.
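A minimal version of that preview call, assuming the DataFrame df loaded above:
df.head(10)  # show the first 10 rows to inspect columns such as title and abstract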