Word2Vec and FastText Word Embedding with Gensim in Python
This project applies Natural Language Processing techniques to the analysis and visualization of text data. We use some of the most popular embedding models, Word2Vec (CBOW and Skip-Gram) and FastText, to process and explore text and to extract insights such as word similarities and analogies. The objective is to understand how these machine learning models capture relationships between words.
Project Overview
This study uses Natural Language Processing (NLP) methods to analyze text documents with word embedding models. First, the text data is preprocessed: cleaned, tokenized, and stripped of stopwords. Three popular models, CBOW, Skip-Gram, and FastText, are then trained on the dataset, producing vector representations that capture how words relate to one another.
Afterward, the project explores word similarity, analogical reasoning (for instance, 'doctor + medicine - hospital'), and outlier detection within groups of words. To better understand how the models perform, we apply the dimensionality reduction techniques PCA and t-SNE to illustrate how each model arranges words in a 2D space. The project ultimately compares how the different embeddings, CBOW, Skip-Gram, and FastText, capture word meanings and relations.
Prerequisites
- Python Programming: Basic knowledge of Python is required for this project.
- NLP Basics: Familiarity with concepts such as tokenization, stopword removal, and word embeddings will be helpful.
- Machine Learning: Experience training models and evaluating them on text data will be useful.
- Libraries: Knowledge of Python libraries like NumPy, Pandas, NLTK, Gensim, and scikit-learn is necessary.
- Basic Linear Algebra: Understanding vectors, matrices, and cosine similarity is fundamental to making sense of word embeddings (a short sketch follows this list).
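As a quick refresher, cosine similarity between two vectors is their dot product divided by the product of their norms. The sketch below computes it with NumPy on two made-up vectors (the values are purely illustrative):
import numpy as np

# Two toy "word vectors" (illustrative values only)
v_doctor = np.array([0.2, 0.7, 0.1, 0.4])
v_nurse = np.array([0.25, 0.65, 0.05, 0.5])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(v_doctor, v_nurse) / (np.linalg.norm(v_doctor) * np.linalg.norm(v_nurse))
print(f"Cosine similarity: {cos_sim:.4f}")  # values near 1 mean the vectors point in similar directions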
Approach
The project begins with preprocessing the text data: cleaning, tokenization, and stopword removal. Three word embedding models, CBOW, Skip-Gram, and FastText, are then trained on the processed text to capture relationships between words in a high-dimensional vector space. The models are compared on word similarity checks, an analogical reasoning test (for example, 'doctor + medicine - hospital'), and outlier detection within a group of words. Finally, dimensionality reduction techniques such as PCA and t-SNE are applied to project the high-dimensional word vectors into 2D space and visualize their relationships. Together, these steps let us compare how well each model captures semantic relationships between words, and thereby how different word embeddings represent language.
Workflow
- Data Collection: Collect and load the dataset that contains the text data (titles and abstracts).
- Preprocessing: Clean the text, tokenize it, and remove stopwords.
- Word Embedding Models: Train CBOW, Skip-Gram, and FastText word embedding models on the processed text.
- Evaluate Models: Evaluate the embeddings through word similarity, analogy, and outlier-detection checks.
- Dimensionality Reduction: Apply PCA and t-SNE to visualize the word embeddings in 2D.
- Visualization: Visualize the results with Matplotlib and Plotly for easier interpretation.
- Analysis: Compare and analyze each model's performance based on the embeddings it generates.
Methodology
- Text Preprocessing: Use NLTK to clean and preprocess the text data (tokenization, stopword removal).
- Word2Vec and FastText: Train CBOW, Skip-Gram, and FastText models using Gensim to produce word embeddings.
- Similarity Metrics: Compute similarity between words using cosine similarity of their word vectors.
- Analogical Reasoning: Perform analogical reasoning tasks such as 'doctor + medicine - hospital' to assess the models.
- Anomaly Detection: Use the doesnt_match method to identify the outlier word in a group.
- Dimensionality Reduction: Reduce the high-dimensional word embeddings with PCA and t-SNE for visualization.
- Visualization: Plot the word embeddings in 2D scatter plots to compare the models visually.
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.
Data Preparation Workflow:
- Loading Dataset: Import the dataset, which contains text data in the form of titles and abstracts.
- Checking Missing Values: Detect missing values in the dataset so they can be treated.
- Select Relevant Columns: The text columns "title" and "abstract" are extracted for analysis.
- Drop Missing Data: Remove all the rows where no values are present in the selected text columns.
- Merge the Text: Now merge these two columns "title" and "abstract" into a single column for analysis.
- Text Preprocessing: Cleaning and tokenizing the joint text: Lowercase, remove punctuation, and filter out stopwords.
- Tokenization: Split the text into separate words or tokens for subsequent processing.
- Lemmatization: Reduce each word to its base form; e.g., "running" becomes "run" (see the short example after this list).
- Preparation of Modeling: The preprocessed text is stored in the appropriate format to train word embedding models.
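For illustration, here is a minimal, self-contained sketch of NLTK stopword removal and lemmatization on a single made-up sentence (assuming the NLTK corpora have been downloaded, as shown later in Step 1):
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

sentence = "The patients were given new vaccines in several hospitals"
tokens = [lemmatizer.lemmatize(w) for w in word_tokenize(sentence.lower()) if w not in stop_words]
print(tokens)  # ['patient', 'given', 'new', 'vaccine', 'several', 'hospital']
# Note: WordNetLemmatizer defaults to noun POS, so verbs are only reduced when pos='v' is passed.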
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the data stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Library setup and installation
This installs the required libraries: textblob for text processing, streamlit for web app development, and wordcloud for making word clouds. It also downloads the corpora that textblob needs to work properly.
!pip install textblob
!pip install streamlit
!pip install wordcloud
!python -m textblob.download_corpora
Importing Libraries for NLP and Visualization
This code imports libraries for text preprocessing, natural language processing (NLP), and machine learning. Some of the key tools being used are nltk to perform tokenization and lemmatization activities on the text corpus, gensim to obtain word embeddings, matplotlib and plotly for visualizing the data, and sklearn for machine learning activities like dimensionality reduction and feature extraction.
import re # used for preprocessing
import nltk
import gensim
import string # used for preprocessing
import numpy as np
import pandas as pd
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt_tab')
from textblob import TextBlob
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import plotly.graph_objects as go
from nltk.corpus import stopwords # used for preprocessing
from gensim.models import Word2Vec
from gensim.models import FastText
from sklearn.decomposition import PCA
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer # used for preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
STEP 2:
Loading Data
This code loads the CSV file from the specified path
df=pd.read_csv('/content/drive/MyDrive/Aionlinecourse_badhon/Project/Word2Vec and FastText Word Embedding with Gensim in Python/Dimension-covid.csv') #for preprocessing
Previewing Data
This block of code displays the first 10 rows of the dataset to give a quick overview of its structure.
df.head(10)
Checking Null Values
This code checks whether any null values are present, since they may need to be treated before further analysis.
df.isna().sum()
Dataset Overview
This code summarizes the dataset df, including the number of entries, column names, data types, and memory usage. It is useful for quickly understanding the structure and composition of the dataset.
df.info()
Inspecting Text Columns and Missing Values
This piece of code selects Title and Abstract columns from the dataset, looking for possible missing values in these text-rich columns, and showing the initial few rows of them along with the missing counts to evaluate data quality.
# Inspect relevant text-rich columns: Title and Abstract
text_columns = df[['Title', 'Abstract']]
# Check for missing values in these columns
missing_values = text_columns.isnull().sum()
# Display the first few rows of the relevant columns and missing value counts
text_columns.head(), missing_values
Dropping Rows with Null Values
This code drops all rows from the text_columns DataFrame that are missing data in either the Title or Abstract column, producing a clean dataset for further analysis.
# Drop rows with missing Abstract values
text_columns_cleaned = text_columns.dropna()
Checking Null Values
This code checks whether the null data is present after removing rows from the dataset.
text_columns_cleaned.isnull().sum()
Checking the Data Shape after Cleaning
This code returns the shape (number of rows and columns) of the text_columns_cleaned DataFrame, showing how many entries remain after dropping rows with missing data.
text_columns_cleaned.shape
STEP 3:
Merged Title and Abstract
This piece of code creates a new column, Combined_Text, by concatenating the Title and Abstract columns of the text_columns_cleaned DataFrame.
text_columns_cleaned['Combined_Text'] = text_columns_cleaned['Title'] + " " + text_columns_cleaned['Abstract']
WordNetLemmatizer and Stopwords Initializer
The code initializes a WordNetLemmatizer for lemmatization and loads a set of English stopwords using nltk. These will be used for text preprocessing, such as reducing words to their root forms and eliminating common insignificant terms.
# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
Text Preprocessing Function
The function lowercases the input text, removes punctuation and special characters, tokenizes it into words, filters out stopwords, and lemmatizes each remaining token. It returns the cleaned, processed tokens for further analysis (a quick usage example follows the function).
# Function to preprocess text
def preprocess_text(text):
# Lowercase the text
text = text.lower()
# Remove punctuation and special characters
text = re.sub(r'\[^a-zA-Z\\s\]', '', text)
# Tokenize the text
tokens = word\_tokenize(text)
# Remove stopwords and lemmatize
tokens = \[lemmatizer.lemmatize(word) for word in tokens if word not in stop\_words\]
return tokens
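As a quick sanity check, the function can be applied to a single made-up sentence before running it over the whole corpus:
# Try the preprocessing function on a sample sentence (illustrative only)
sample = "The vaccines were tested in 3 hospitals!"
print(preprocess_text(sample))
# Expected output: ['vaccine', 'tested', 'hospital'] - digits, punctuation, and stopwords are stripped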
Preprocessing Combined Text
The following code calls the preprocess_text function for the Combined_Text column of the text_columns_cleaned DataFrame and creates a new column Processed_Text that captures the cleaned and tokenized version of the original text. It then displays the first few rows of the original versus the transformed text for inspection.
# Apply preprocessing to the Combined_Text column
text_columns_cleaned['Processed_Text'] = text_columns_cleaned['Combined_Text'].apply(preprocess_text)
# Display the first few rows of the processed text
text_columns_cleaned[['Combined_Text', 'Processed_Text']].head()
Word Cloud Generation and Display
This code joins all processed tokens into a single string and generates a word cloud from it using WordCloud. The image is then displayed, visualizing word frequency: larger words appear more frequently in the dataset.
# Combine all processed text into a single string
all_text = ' '.join([' '.join(tokens) for tokens in text_columns_cleaned['Processed_Text']])
# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)
# Display the generated image:
plt.figure(figsize=(10, 5), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
STEP 4:
Summary Statistics for Text Lengths
The code computes summary statistics on the word counts of the Title, Abstract, and Combined_Text columns. It uses describe() to show how the text lengths are distributed, for example the average, minimum, and maximum number of words.
# Summary statistics for text lengths
title_stats = text_columns_cleaned['Title'].str.split().apply(len).describe()
abstract_stats = text_columns_cleaned['Abstract'].str.split().apply(len).describe()
combined_stats = text_columns_cleaned['Combined_Text'].str.split().apply(len).describe()
Display text statistics by length
This code helps in analyzing the distribution of text lengths in each of those columns by showing parameters like mean, standard deviation, and range of word counts.
print("Title Length Statistics:")
print(title_stats)
print("\nAbstract Length Statistics:")
print(abstract_stats)
print("\nCombined Text Length Statistics:")
print(combined_stats)
Finding Most Common Words
This code flattens the processed token lists, counts the frequency of each token using Counter, and retrieves the twenty most common words. It helps identify the most common words remaining in the dataset after preprocessing.
all_tokens = [token for tokens in text_columns_cleaned['Processed_Text'] for token in tokens]
word_counts = Counter(all_tokens)
common_words = word_counts.most_common(20)
common_words
Sentiment Analysis
This code defines a function get_sentiment that returns the sentiment polarity score of an input text using TextBlob. The function is then applied to the Combined_Text column of the text_columns_cleaned DataFrame, producing a new Sentiment column: positive text receives a positive score and negative text a negative score.
# Function to compute sentiment polarity
def get_sentiment(text):
    return TextBlob(text).sentiment.polarity
# Apply sentiment analysis
text_columns_cleaned['Sentiment'] = text_columns_cleaned['Combined_Text'].apply(get_sentiment)
Plotting Distribution of Sentiments
This code plots a histogram of the sentiment polarity scores in the Sentiment column, showing the frequency distribution of scores and giving a sense of the overall sentiment (positive, negative, or neutral) in the dataset.
# Plot sentiment distribution
plt.hist(text_columns_cleaned['Sentiment'], bins=30, color='blue', alpha=0.7)
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')
plt.show()
STEP 5:
Extracting n-grams
This function extracts n-grams from a text corpus and uses CountVectorizer to count their frequencies. It returns the top_n most common n-grams, which helps identify frequent word combinations across the dataset.
# Function to extract n-grams
def get_ngrams(corpus, n=2, top_n=20):
    vectorizer = CountVectorizer(ngram_range=(n, n))
    ngram_matrix = vectorizer.fit_transform(corpus)
    ngram_counts = ngram_matrix.sum(axis=0)
    ngram_freq = [(ngram, ngram_counts[0, idx]) for ngram, idx in vectorizer.vocabulary_.items()]
    return sorted(ngram_freq, key=lambda x: x[1], reverse=True)[:top_n]
Extracting Top Bigrams and Trigrams
This code extracts the top 20 bigrams (2-word combinations) and trigrams (3-word combinations) from the Combined_Text column using the get_ngrams function, revealing word pairs and triplets that frequently co-occur in the dataset.
# Extract top bigrams and trigrams
top_bigrams = get_ngrams(text_columns_cleaned['Combined_Text'], n=2)
top_trigrams = get_ngrams(text_columns_cleaned['Combined_Text'], n=3)
It will display the top 20 most frequent bigrams (2-word combinations) from the Combined_Text column.
top_bigrams
It will display the top 20 most frequent trigrams (3-word combinations) from the Combined_Text column.
top_trigrams
List of sentences from Processed Text
This code converts the column Processed_Text of the DataFrame text_columns_cleaned into a list of sentences, each being a tokenized and processed version of the text, which can be used further in analysis or modeling.
# Prepare tokenized sentences
sentences = text_columns_cleaned['Processed_Text'].tolist()
STEP 6:
Train a CBOW Word2Vec Model
This code trains a Continuous Bag of Words (CBOW) Word2Vec model on the processed sentences. The model is configured with a vector size of 100, a window size of 5, and a minimum word count of 5, and it trains over 4 worker threads for 10 epochs. The resulting model is saved as word2vec_cbow.model (a short reloading example follows the training code).
# Train CBOW Word2Vec model
word2vec_cbow = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, sg=0, workers=4, epochs=10)
word2vec_cbow.save("word2vec_cbow.model")
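If the trained model is needed in a later session, it can be reloaded from disk before querying it. This is a minimal sketch; the filename matches the save call above, and the commented lookup uses a hypothetical word that may or may not be in your vocabulary:
# Reload the saved CBOW model and inspect its vocabulary size
loaded_cbow = Word2Vec.load("word2vec_cbow.model")
print("Vocabulary size:", len(loaded_cbow.wv.index_to_key))
# Looking up a word raises KeyError if it was filtered out by min_count, so check first:
# if 'patient' in loaded_cbow.wv.key_to_index:
#     print(loaded_cbow.wv['patient'][:5])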
Skip-Gram Word2Vec Model Training
This code trains a Skip-Gram Word2Vec model on the same processed sentences. Like the CBOW model, it uses a vector size of 100, a window size of 5, and a minimum word count of 5, and it runs for 10 epochs with 4 worker threads. The model is saved as word2vec_skipgram.model for future use. The only difference from the CBOW configuration is sg=1, which selects the Skip-Gram architecture.
# Train Skip-Gram Word2Vec model
word2vec_skipgram = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, sg=1, workers=4, epochs=10)
word2vec_skipgram.save("word2vec_skipgram.model")
Training FastText Model
This code trains a FastText model on the input sentences. The model is configured with a vector size of 100, a window size of 5, and a minimum word count of 5, and it runs for 10 epochs with 4 worker threads. The resulting model is stored in a file named fasttext.model. In contrast to Word2Vec, FastText can create vectors for out-of-vocabulary words using their subword information (a short demonstration follows the training code).
# Train FastText model
fasttext_model = FastText(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4, epochs=10)
fasttext_model.save("fasttext.model")
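To see the subword behavior in action, the sketch below queries the trained FastText model with a deliberately misspelled token that is very unlikely to appear in the training vocabulary (the word is made up for illustration):
# FastText composes vectors from character n-grams, so unseen tokens still get a vector
oov_word = "coronaviruss"  # intentionally misspelled; almost certainly not a vocabulary entry
print(oov_word in fasttext_model.wv.key_to_index)  # likely False: not a stored vocabulary entry
print(fasttext_model.wv[oov_word][:5])             # still returns a vector built from subwords
# The same lookup on the Word2Vec models would raise a KeyError for an unseen word.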
Selecting Common Vocabulary and Extracting Vectors
The code identifies the vocabulary shared by the three models (CBOW, Skip-Gram, and FastText) by intersecting their vocabularies. It then selects the first hundred words from this common vocabulary and extracts their word vectors from each model, storing them in cbow_vectors, skipgram_vectors, and fasttext_vectors. Finally, it prints the number of words selected for visualization.
# Define a common vocabulary shared by all models
common_vocab = set(word2vec_cbow.wv.index_to_key).intersection(
    set(word2vec_skipgram.wv.index_to_key),
    set(fasttext_model.wv.index_to_key)
)
# Select 100 words for visualization
selected_words = list(common_vocab)[:100]
# Extract vectors for the selected words from each model
cbow_vectors = np.array([word2vec_cbow.wv[word] for word in selected_words])
skipgram_vectors = np.array([word2vec_skipgram.wv[word] for word in selected_words])
fasttext_vectors = np.array([fasttext_model.wv[word] for word in selected_words])
# Print confirmation
print(f"Number of words selected for visualization: {len(selected_words)}")
Using PCA for Dimensionality Reduction
This code applies PCA (Principal Component Analysis) with 2 components to project the CBOW, Skip-Gram, and FastText word vectors into 2D space and prints the explained variance ratio, which indicates how much of the total variance the first two principal components capture. Note that because the same PCA object is refit for each model, the printed ratio reflects only the last fit (FastText); a per-model variant is sketched after the code.
# Initialize PCA
pca = PCA(n_components=2)
# Apply PCA to each model's word embeddings
cbow_pca = pca.fit_transform(cbow_vectors)
skipgram_pca = pca.fit_transform(skipgram_vectors)
fasttext_pca = pca.fit_transform(fasttext_vectors)
# Print explained variance ratio to assess PCA effectiveness
print("Explained Variance by PCA:", pca.explained_variance_ratio_.sum())
Function to Plot Word Embeddings
This function visualizes word embeddings in 2D space by plotting the reduced word vectors. It accepts the 2D embeddings, the corresponding words, and a title, and generates a scatter plot where each point represents one word. Each point is labeled with its word, and axis labels and grid lines are added for clarity.
# Function to plot word embeddings
def plot_embeddings(embeddings, words, title):
    plt.figure(figsize=(12, 8))
    plt.scatter(embeddings[:, 0], embeddings[:, 1], edgecolor='k', c='b', alpha=0.7)
    for i, word in enumerate(words):
        plt.text(embeddings[i, 0] + 0.01, embeddings[i, 1] + 0.01, word, fontsize=9)
    plt.title(title)
    plt.xlabel("PCA Component 1")
    plt.ylabel("PCA Component 2")
    plt.grid(True)
    plt.show()
Visualization of Word Embeddings
The code uses the plot_embeddings function to visualize the word embeddings generated by the CBOW, Skip-Gram, and FastText models after PCA dimensionality reduction. Each plot shows how a model arranges the selected words and their relationships in 2D space.
# Plot CBOW embeddings
plot_embeddings(cbow_pca, selected_words, "CBOW Word Embeddings")
# Plot Skip-Gram embeddings
plot_embeddings(skipgram_pca, selected_words, "Skip-Gram Word Embeddings")
# Plot FastText embeddings
plot_embeddings(fasttext_pca, selected_words, "FastText Word Embeddings")
Using t-SNE for Dimensionality Reduction
This code uses t-SNE (t-distributed Stochastic Neighbor Embedding) with 2 components to reduce the CBOW, Skip-Gram, and FastText word vectors to two dimensions. t-SNE is a technique for visualizing high-dimensional data in which similar items are clustered together, which makes it well suited to inspecting word embeddings (a note on the perplexity setting follows the code).
# Initialize t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
# Apply t-SNE to each model's embeddings
cbow_tsne = tsne.fit_transform(cbow_vectors)
skipgram_tsne = tsne.fit_transform(skipgram_vectors)
fasttext_tsne = tsne.fit_transform(fasttext_vectors)
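One practical caveat: recent versions of scikit-learn require the t-SNE perplexity to be smaller than the number of samples, so the call above would fail if fewer words than that were selected. A defensive variant might cap the value, for example:
# Cap perplexity below the number of selected words to satisfy scikit-learn's constraint
safe_perplexity = min(30, len(selected_words) - 1)
tsne = TSNE(n_components=2, perplexity=safe_perplexity, random_state=42)
cbow_tsne = tsne.fit_transform(cbow_vectors)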
t-SNE Visualization of Word Embeddings
This code plots the 2D t-SNE projections of the word embeddings from the CBOW, Skip-Gram, and FastText models. The plots show how the selected words cluster together, making it possible to compare how each model separates words by context.
# Plot t-SNE embeddings
plot_embeddings(cbow_tsne, selected_words, "CBOW Word Embeddings (t-SNE)")
plot_embeddings(skipgram_tsne, selected_words, "Skip-Gram Word Embeddings (t-SNE)")
plot_embeddings(fasttext_tsne, selected_words, "FastText Word Embeddings (t-SNE)")
Count Word Frequencies
This code flattens the tokenized, processed text from the Processed_Text column into a single list of words and counts each token with Counter. The result, word_freq, maps each word to its frequency in the dataset.
# Flatten all processed tokens into a single list and count word frequencies
all_words = text_columns_cleaned['Processed_Text'].sum()
word_freq = Counter(all_words)
word_freq
STEP 7:
Selecting Evaluation Words
This code selects up to 50 domain-specific words from the most frequent words in the dataset, keeping only those present in the CBOW Word2Vec model's vocabulary. It then randomly samples another 50 words from the top 1,000 words in the CBOW model. The two sets are combined into a single list, evaluation_words, which mixes domain-specific and random words and is used for the evaluations that follow.
import random # Import the random module
# Select domain-specific words (top 50 most frequent)
domain_words = [word for word, freq in word_freq.most_common(50) if word in word2vec_cbow.wv]
# Add random words from the vocabulary
vocab = word2vec_cbow.wv.index_to_key[:1000] # Top 1000 words in the model
random_words = random.sample(vocab, 50)
# Combine domain-specific and random words
evaluation_words = list(set(domain_words + random_words))
Generating Word Pairs and Word Groups
This code draws five random word pairs from the evaluation_words list, forming each pair from adjacent words. It also picks one word (test_word) from the list and randomly samples 4 words to form a word_group. These are used in the similarity, analogy, and outlier tests below.
# Generate word pairs and groups
word_pairs = random.sample([(evaluation_words[i], evaluation_words[i+1]) for i in range(len(evaluation_words) - 1)], 5)
test_word = random.choice(evaluation_words)
word_group = random.sample(evaluation_words, 4)
Calculate Word Pairs' Pairwise Similarity
This code calculates and prints the similarity between the two words in each pair from word_pairs using the CBOW, Skip-Gram, and FastText models. Each model's similarity method computes the cosine similarity between the two words' vectors, and the output shows how similar each model considers the pair to be.
print("\nPairwise Word Similarities:")
for word1, word2 in word_pairs:
    cbow_sim = word2vec_cbow.wv.similarity(word1, word2)
    skipgram_sim = word2vec_skipgram.wv.similarity(word1, word2)
    fasttext_sim = fasttext_model.wv.similarity(word1, word2)
    print(f"{word1}-{word2}: CBOW={cbow_sim:.4f}, Skip-Gram={skipgram_sim:.4f}, FastText={fasttext_sim:.4f}")
Finding words most related to a specific word
This code finds and prints the 5 words closest to the randomly chosen test_word according to the CBOW, Skip-Gram, and FastText models. Each model's most_similar method returns the most similar words based on their vectors, so the output shows how each model relates the test_word to the rest of the vocabulary.
print(f"\nMost Similar Words to '{test_word}':")
print("CBOW:", word2vec_cbow.wv.most_similar(test_word, topn=5))
print("Skip-Gram:", word2vec_skipgram.wv.most_similar(test_word, topn=5))
print("FastText:", fasttext_model.wv.most_similar(test_word, topn=5))
Analogical Reasoning: Relationship of Words
This code performs analogical reasoning to find the word that best completes the analogy 'doctor + medicine - hospital'. It uses the most_similar method with positive and negative parameters for each model (CBOW, Skip-Gram, and FastText) and prints the word each model predicts as the answer (a guarded variant is sketched after the code).
print("\nAnalogical Reasoning (doctor + medicine - hospital):")
positive = ["doctor", "medicine"]
negative = ["hospital"]
cbow_analogy = word2vec_cbow.wv.most_similar(positive=positive, negative=negative, topn=1)
skipgram_analogy = word2vec_skipgram.wv.most_similar(positive=positive, negative=negative, topn=1)
fasttext_analogy = fasttext_model.wv.most_similar(positive=positive, negative=negative, topn=1)
print(f"CBOW: {cbow_analogy}")
print(f"Skip-Gram: {skipgram_analogy}")
print(f"FastText: {fasttext_analogy}")
Outlier Detection
This code performs outlier detection by determining which word does not belong in word_group, using the doesnt_match method of each model (CBOW, Skip-Gram, and FastText). Each model returns the word that fits least well with the others in the group, and the output shows which word each model flags as the odd one out.
print("\nOutlier Detection:")
cbow_outlier = word2vec_cbow.wv.doesnt_match(word_group)
skipgram_outlier = word2vec_skipgram.wv.doesnt_match(word_group)
fasttext_outlier = fasttext_model.wv.doesnt_match(word_group)
print(f"CBOW: {cbow_outlier}")
print(f"Skip-Gram: {skipgram_outlier}")
print(f"FastText: {fasttext_outlier}")
Conclusion
This project showcases how word embedding models such as CBOW, Skip-Gram, and FastText capture semantic relationships between words, through preprocessing, training, and evaluation tasks like word similarity, analogical reasoning, and outlier detection. Along the way, we gain insight into how the different models represent word meanings. Visualizing the embeddings with PCA and t-SNE further clarifies the spatial relationships and clustering behavior of words in each model. Finally, the project emphasizes the importance of choosing an embedding model appropriate to the task at hand and gives the reader a deeper understanding of NLP techniques and their applications.
Challenges New Coders Might Face
Challenge: Handling noisy or unstructured text data.
Solution: Apply text cleaning methods such as removing special symbols, numbers, and extra spaces.
Challenge: Preprocessing large text data.
Solution: Streamline text cleaning with efficient libraries such as NLTK and adopt batch processing for the data.
Challenge: The curse of dimensionality in high-dimensional text datasets, which affects clustering and classification results.
Solution: Use TF-IDF vectorization and dimensionality reduction techniques such as PCA to control dimensionality.
Challenge: Lack of access to a GPU.
Solution: For debugging or initial testing, use smaller datasets, or use GPU-backed cloud platforms for quicker turnaround times.
Challenge: Long training times due to a large vocabulary.
Solution: Use subsampling techniques or restrict the vocabulary to the N most frequent words for efficiency.
Frequently Asked Questions (FAQs)
Question 1: What are Word Embeddings?
Answer: Word embeddings are dense vector representations of words in a continuous vector space. They capture the semantic relationships between words, enabling models to understand meaning based on context rather than treating words as isolated symbols.
Question 2: What is the difference between CBOW and Skip-Gram in Word2Vec?
Answer:
- CBOW (Continuous Bag of Words): Predicts a target word from a context (surrounding words).
- Skip-Gram: Predicts surrounding words from a target word. Skip-Gram typically works better with smaller datasets.
Question 3: What is FastText, and how does it differ from Word2Vec?
Answer: FastText represents each word as a bag of character n-grams, which allows it to generate embeddings for out-of-vocabulary words. As a result, FastText handles rare or unseen words better than Word2Vec.
Question 4: What are the preprocessing stages of Skip-Gram projects?
Answer: These include text cleaning and normalization, tokenization, and other processes to remove noise, such as special characters and numbers, for uniformity of the input data.
Question 5: What libraries are needed to implement the Word embedding project?
Answer: Gensim is used to build and train the embedding models, while NLTK handles preprocessing and NumPy, Pandas, Matplotlib, and scikit-learn support numerical work, dimensionality reduction, and visualization.