Semantic Search Using Msmarco Distilbert Base & Faiss Vector Database
Imagine you are looking for the right movie. Instead of guessing keywords for a regular search engine, you simply describe what you want, and the system works out what you mean. That is semantic search. In this project, we implement it with transformer models from natural language processing (NLP).
Combined with Faiss, a fast vector similarity search library, these models let us build an engine that is not limited to keyword matching. The system captures the meaning, context, and intent of a query and returns relevant results quickly. The same approach applies whether users are looking for products on an e-commerce site, searching for health information, or browsing for entertainment.
Project Overview:
This project, “Semantic Search System Using Transformers and Vector Database”, applies transformer models to improve information retrieval. The models represent text as vectors, and Faiss, a library designed for efficient similarity search, finds the stored vectors closest to a query. What matters is matching meaning, not just matching words.
The project covers the full pipeline: encoding the text with a transformer model, building the Faiss index, and fine-tuning the model. The text material in this case is a set of movie plot descriptions. The result is a fast and precise semantic search system. Its uses are broad, from better search on e-commerce sites to customized recommendations in healthcare systems.
Prerequisites
The following skills and resources will help you work through the project and build an effective semantic search system:
Working knowledge of Python, including the use of libraries and functions.
Basic knowledge of machine learning, especially natural language processing, which helps in understanding the transformer architecture.
A general idea of how transformers are applied to different NLP problems.
Familiarity with Google Colab for running Jupyter notebooks in the cloud.
An understanding of how high-dimensional vectors are stored and retrieved, especially with libraries like Faiss.
Familiarity with Pandas for building data frames and carrying out simple data cleaning and processing.
Some knowledge of Faiss indexing and similarity search over large datasets.
Access to a GPU-supported environment for faster processing.
Approach:
In this project, we first gather the text data and preprocess it by removing missing values and duplicates, which improves the quality of the input. We then pick a transformer model from the SentenceTransformer family, which takes text and outputs high-dimensional vectors. These vectors represent the semantic meaning of the text, so queries can be matched on context instead of keywords. Once the text has been converted to vectors, we index them with Faiss, which supports fast similarity search: when you enter a query, the system quickly compares vectors and finds the closest matches.
At query time, the system converts the search query into a vector, searches the Faiss index for the top k most similar vectors, and returns the corresponding results. We test the whole system for speed and accuracy to check that the results make sense. The model can also be fine-tuned if needed to improve performance in specific domains such as movies, e-commerce, or healthcare. This approach lets us build a scalable, fast semantic search system that returns meaningful results across industries.
Workflow and methodologies:
The workflow and methodology for building the "Semantic Search System Using Transformers and Vector Database" are as follows:
Data Collection and Preprocessing:
Step 1: Collect a text dataset such as movie plot descriptions in a structured format (CSV).
Step 2: Clean the dataset by removing rows with missing data, duplicates, and rows that contain no useful data.
Step 3: Analyze the text length and structure of the data to prepare it for efficient processing in the later steps.
Transformers and Text Embedding:
Step 1: Select a pre-trained transformer model from the SentenceTransformer library, such as msmarco-distilbert-base-dot-prod-v3.
Step 2: Convert the cleaned text data into high-dimensional vectors.
Step 3: Each movie plot is encoded as a fixed-size vector that captures its underlying meaning, which improves search quality.
Indexing with Faiss:
Step 1: Create an index with the Faiss library to store the vectors, because this project requires fast similarity search.
Step 2: Add the vectors to the Faiss index with IDs that map each vector to its corresponding text entry.
Step 3: Save the index so it can be reloaded later to speed up searches over a large dataset.
Semantic Search:
Step 1: When a user enters a search query, convert it into a vector with the same transformer model.
Step 2: Run a similarity search on the Faiss index to find the stored vectors closest to the query vector.
Step 3: Retrieve the top k most similar results by vector proximity, giving the user results relevant to the query's semantics.
Evaluation and Optimization:
Step 1: Assess the precision of the system through the execution of different queries.
Step 2: Adjust and retrain the transformer model if needed.
Step 3: Optimize the system for scalability.
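One simple way to assess precision over a set of test queries is precision at k: the fraction of the top k results that a human judged relevant. The sketch below is only an illustration; the function name `precision_at_k`, the query strings, the IDs, and the relevance judgments are all hypothetical and not part of the project code.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the first k retrieved IDs that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = retrieved_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranked results and relevance judgments for two test queries.
test_queries = {
    "robot uprising": {"retrieved": [12, 7, 3, 40, 5], "relevant": {12, 3, 5, 99}},
    "heist gone wrong": {"retrieved": [8, 21, 2, 9, 30], "relevant": {21}},
}

# Precision at 5 per query, then the mean over all queries.
scores = {
    query: precision_at_k(data["retrieved"], data["relevant"], k=5)
    for query, data in test_queries.items()
}
mean_precision = sum(scores.values()) / len(scores)
print(scores, mean_precision)
```

Tracking this number before and after fine-tuning gives a rough signal of whether retraining actually helped.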
Methodology:
- Semantic Embedding: Represent a piece of text in the form of a meaningful vector using transformer models.
- Efficient Indexing: Store and retrieve high-dimensional vector embeddings using Faiss.
- Similarity Search: Use the inner product or cosine similarity between the semantic vectors to retrieve the nearest neighbors.
- Customizable and Scalable: Fine-tune the model for specific domains and scale it to larger datasets to improve its overall effectiveness.
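To make the similarity step concrete, here is a minimal NumPy sketch of exhaustive inner-product search, which is conceptually what a flat inner-product index computes. The helper names `top_k_inner_product` and `l2_normalize` and the toy vectors are assumptions for illustration; real embeddings would come from the transformer model.

```python
import numpy as np


def top_k_inner_product(query, vectors, k):
    """Exhaustive search: score every stored vector against the query."""
    scores = vectors @ query           # one inner product per stored vector
    order = np.argsort(-scores)[:k]    # positions of the k largest scores
    return order, scores[order]


def l2_normalize(x):
    """Scale vectors to unit length so inner product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
vectors = np.array([[1.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0],
                    [0.6, 0.8, 0.0],
                    [5.0, 5.0, 0.0]], dtype="float32")
query = np.array([1.0, 1.0, 0.0], dtype="float32")

# Raw inner product rewards long vectors as well as well-aligned ones.
ip_ids, ip_scores = top_k_inner_product(query, vectors, k=2)

# After normalization only the direction matters (cosine similarity).
cos_ids, cos_scores = top_k_inner_product(l2_normalize(query), l2_normalize(vectors), k=2)
print(ip_ids, cos_ids)
```

The two rankings differ on this toy data because the raw inner product also rewards vector length. Whether to normalize depends on how the embedding model was trained; a model trained for dot-product scoring is typically used without normalization.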
Data Collection and Preparation:
Data collection workflow:
Collect a reliable dataset related to the project, for example movie plots from Wikipedia or a structured database.
Get the data in a structured format such as CSV or JSON, with fields like Title, Plot, Release Year, and Genre.
Make sure every entry follows the same format, is complete, and is relevant to the task.
Load the dataset and look at the number of rows, the columns, and the data types to get an overview.
Save it locally or in cloud storage (Google Drive) for quick access.
Data Preparation workflow:
- Load the dataset into a data frame to work with more easily.
- Eliminate rows with missing data to avoid issues during model processing, and remove duplicate rows so that every entry is unique.
- Optionally apply further text cleaning, such as removing punctuation or unwanted characters.
- Calculate the word count of each plot to choose a suitable sequence length for the model.
- Convert the cleaned text into high-dimensional vectors with a pre-trained transformer model.
- Make sure the vectorized data is in the right format (a float32 NumPy array) for indexing with Faiss.
- Store the cleaned dataset and vectors so the search system can retrieve them later.
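The preparation steps above can be sketched on a tiny, made-up table. The rows, the `raw` and `clean` names, and the random stand-in for model output are assumptions for illustration only; in the project the vectors come from the transformer model.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the movie dataset (hypothetical rows).
raw = pd.DataFrame({
    "Title": ["A", "B", "B again", "C", "D"],
    "Plot": ["A robot learns to paint.", "Two thieves plan a heist.",
             "Two thieves plan a heist.", None, "A chef opens a diner on Mars."],
})

clean = (
    raw.dropna(subset=["Plot"])            # drop rows with a missing plot
       .drop_duplicates(subset=["Plot"])   # keep the first copy of each plot
       .reset_index(drop=True)             # make labels match positions 0..n-1
)
clean["doc_len"] = clean["Plot"].str.split().str.len()  # word count per plot

# Stand-in for model output: a contiguous float32 matrix, one row per plot.
fake_vectors = np.random.default_rng(0).random((len(clean), 4))
vectors = np.ascontiguousarray(fake_vectors, dtype="float32")
print(clean, vectors.dtype, vectors.shape)
```

Resetting the index is optional, but it keeps row labels and row positions identical, which makes it harder to mix them up when mapping search hits back to titles.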
Code Explanation
STEP 1:
Mounting Drive and Installing Required Libraries
Mounting Drive
This code connects your Google Drive account to a Colab workspace. It makes the files in your Google Drive available under a mount folder (here ‘/content/drive’).
from google.colab import drive
drive.mount('/content/drive')
Install Transformer
This command installs the sentence-transformers library with pip. The library provides transformer models for text embedding, semantic similarity, and sentence classification tasks.
!pip install sentence-transformers
This command provides information about the NVIDIA GPU, including its current state, usage stats, and memory details.
!nvidia-smi
STEP 2:
Importing Libraries and Initializing the Model
This block imports libraries for data manipulation, visualization, and NLP tasks. It then initializes the pre-trained SentenceTransformer model msmarco-distilbert-base-dot-prod-v3, which converts text into semantic vectors.
# Importing the pandas library and aliasing it as pd
import pandas as pd
# Importing the time module
import time
# Importing tqdm for progress bars
from tqdm import tqdm
# Importing seaborn for data visualization
import seaborn as sns
# Importing numpy and aliasing it as np
import numpy as np
# Importing TextBlob for text processing tasks
from textblob import TextBlob
# Importing matplotlib.pyplot for plotting
import matplotlib.pyplot as plt
# Importing SentenceTransformer from the sentence_transformers library
from sentence_transformers import SentenceTransformer
# Initializing the SentenceTransformer model with the 'msmarco-distilbert-base-dot-prod-v3' pre-trained model
model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')
STEP 3:
Optimized Data Loading and Structural Overview of Dataset
The code first reads a CSV file from Google Drive into a Pandas DataFrame. Passing memory_map=True maps the file directly into memory, which can speed up reading large files by avoiding I/O overhead. Following this, it provides a summary overview of the dataset, displaying the number of entries, column names, data types, and the count of non-null values for each column. This overview helps in understanding the dataset's structure and the completeness of its data.
data = pd.read_csv('/content/drive/MyDrive/aionlinecourse/wiki_movie_plots_deduped.csv', memory_map=True)
data.info()
Memory Management and Data Selection
This code snippet imports the garbage collection (gc) module, selects only the ‘Title’ and ‘Plot’ columns, and stores them in a new DataFrame df. It then deletes the original DataFrame and invokes garbage collection to free memory.
# Importing the gc module for garbage collection
import gc
# Selecting only the 'Title' and 'Plot' columns from the DataFrame 'data' and assigning a copy to 'df'.
# The explicit copy makes 'df' independent of 'data', so later in-place edits do not raise SettingWithCopyWarning.
df = data[['Title', 'Plot']].copy()
# Deleting the 'data' DataFrame from memory
del data
# Performing garbage collection to free up memory
gc.collect()
STEP 4:
Data Cleaning and Plot Length Distribution Analysis
This code cleans the DataFrame df by removing any rows containing NaN values and eliminating duplicate entries based on the ‘Plot’ column, ensuring a refined dataset for further processing. Additionally, it analyzes the length of each movie plot by calculating the word count and adds this as a new column, doc_len, in the DataFrame. Using the mean and standard deviation of plot lengths, it determines the maximum sequence length. The code then visualizes plot lengths with a distribution plot, marking the calculated maximum sequence length with a vertical dashed line. The plot includes legends and labels, effectively illustrating the text length distribution across the dataset.
# Drops rows with missing values in any column and modifies 'df' in place
df.dropna(inplace=True)
# Drops duplicate rows based on the 'Plot' column and modifies 'df' in place
df.drop_duplicates(subset=['Plot'], inplace=True)
# Calculates the length of each plot in terms of the number of words and assigns the result to a new column 'doc_len' in the DataFrame 'df'.
df['doc_len'] = df['Plot'].apply(lambda words: len(words.split()))
# Calculates the maximum sequence length by rounding the mean of 'doc_len' plus its standard deviation, and converts it to an integer.
max_seq_len = np.round(df['doc_len'].mean() + df['doc_len'].std()).astype(int)
# Plots a histogram with a kernel density estimate of the 'doc_len' column from the DataFrame 'df', in blue and labeled 'doc len'.
# sns.histplot replaces sns.distplot, which is deprecated in recent seaborn releases.
sns.histplot(df['doc_len'], kde=True, color='b', label='doc len')
# Adds a vertical dashed line at the position of 'max_seq_len' on the plot, indicating the maximum sequence length, with a black color and labeled as 'max len'.
plt.axvline(x=max_seq_len, color='k', linestyle='--', label='max len')
# Sets the title of the plot as 'plot length'.
plt.title('plot length')
# Displays the legend on the plot.
plt.legend()
# Displays the plot.
plt.show()
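The cutoff rule used here (mean plus one standard deviation) can be checked on a handful of numbers. The sketch below uses only the standard library; the function name `length_cutoff` and the word counts are hypothetical. Note that the count is in words, whereas transformer models limit input length in tokens, so the cutoff is best read as a rough guide.

```python
import statistics


def length_cutoff(doc_lengths):
    """Mean plus one sample standard deviation, rounded to an integer."""
    mean = statistics.mean(doc_lengths)
    std = statistics.stdev(doc_lengths)  # sample std (n - 1), like pandas' default
    return round(mean + std)


# Hypothetical word counts for a handful of plots.
doc_lengths = [120, 200, 250, 310, 400, 950]

cutoff = length_cutoff(doc_lengths)
# Share of documents that are longer than the cutoff and would be truncated.
share_longer = sum(n > cutoff for n in doc_lengths) / len(doc_lengths)
print(cutoff, share_longer)
```

A single very long plot pulls both the mean and the standard deviation up, so on skewed data it can be worth comparing this cutoff with a percentile-based one.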
STEP 5:
Installing and Setting Up Faiss
Installing Faiss Library
This command installs Faiss with GPU support. This is a library designed for efficient similarity search and clustering of dense vectors, thereby speeding up vector operations.
!pip install faiss-gpu
Encoding Data and Indexing with Faiss
This snippet uses the SentenceTransformer model to encode the movie plot descriptions into high-dimensional vectors. The vectors are then converted to a NumPy array of 32-bit floats. A Faiss index with an inner product metric and dimension 768 is constructed, and the encoded vectors are inserted with their respective IDs. The index is then written to disk in binary format as movie_plot.index for later retrieval when performing similarity searches.
# Importing the Faiss library for efficient similarity search and clustering in large-scale datasets.
import faiss
# Encoding the plot descriptions from the DataFrame 'df' into fixed-size vectors using the pre-trained SentenceTransformer model 'model'.
encoded_data = model.encode(df.Plot.tolist())
# Converting the encoded data into a numpy array of 32-bit floating-point numbers.
encoded_data = np.asarray(encoded_data.astype('float32'))
# Creating an index using Faiss, specifying a flat index with inner product (IP) distance metric and dimensionality of 768 for the encoded vectors.
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
# Adding the encoded data to the Faiss index along with corresponding IDs (0 to len(df) - 1, as 64-bit integers).
index.add_with_ids(encoded_data, np.arange(len(df), dtype='int64'))
# Writing the Faiss index to disk as a binary file named 'movie_plot.index'.
faiss.write_index(index, 'movie_plot.index')
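Because the IDs stored in the index are positions (0 to len(df) - 1), hits have to be mapped back to rows by position, not by row label. After rows are dropped, the labels have gaps and no longer match positions. A small sketch with a made-up DataFrame (`df_demo` and its contents are hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({"Title": ["A", "B", "C", "D"],
                        "Plot": ["p1", "p2", "p2", "p3"]})
df_demo = df_demo.drop_duplicates(subset=["Plot"])  # removes the row labeled 2

# IDs as they would be assigned to the index: positions 0 .. len(df) - 1.
ids = list(range(len(df_demo)))

labels = list(df_demo.index)                   # original labels, now with a gap
title_by_position = df_demo.iloc[2]["Title"]   # position 2 is the third remaining row
label_2_exists = 2 in df_demo.index            # the label 2 was dropped
print(ids, labels, title_by_position, label_2_exists)
```

This is why a positional lookup such as `iloc` is the safe choice when the IDs are a simple range; a label-based lookup would fail or return the wrong row unless the index is reset first.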
STEP 6:
Functions for Fetching Movie Info and Performing Semantic Search
The fetch_movie_info function returns information about a movie, specifically its title, from a given position in the DataFrame. The search function receives a query, converts it into a vector with the SentenceTransformer model, and searches the Faiss index for the top-k most relevant movie plots. It then prints the time taken for the search and returns the movie titles of the retrieved results.
# Define a function named fetch_movie_info that takes a parameter dataframe_idx.
def fetch_movie_info(dataframe_idx):
    # Retrieve the row of the DataFrame 'df' at the position 'dataframe_idx' and assign it to the variable 'info'.
    info = df.iloc[dataframe_idx]
    # Initialize an empty dictionary named 'meta_dict'.
    meta_dict = {}
    # Assign the value of the 'Title' column of the 'info' row to the 'Title' key in 'meta_dict'.
    meta_dict['Title'] = info['Title']
    # Return the populated dictionary 'meta_dict'.
    return meta_dict

# Define a function named search that takes four parameters: query, top_k, index, and model.
def search(query, top_k, index, model):
    # Record the current time and store it in the variable 't'.
    t = time.time()
    # Encode the input query using the SentenceTransformer model 'model' and store the resulting vector in 'query_vector'.
    query_vector = model.encode([query])
    # Perform a similarity search on the Faiss index; the result is a (scores, ids) pair.
    hits = index.search(query_vector, top_k)
    # Print the total time taken for the search operation.
    print('>>>> Results in Total Time: {}'.format(time.time() - t))
    # Extract the IDs of the top_k most similar vectors from the search results.
    top_k_ids = hits[1].tolist()[0]
    # Remove duplicate IDs while keeping the ranking order (np.unique would sort the IDs and lose the ranking).
    top_k_ids = list(dict.fromkeys(top_k_ids))
    # Retrieve movie information for each ID using the fetch_movie_info function and store the results in a list named 'results'.
    results = [fetch_movie_info(idx) for idx in top_k_ids]
    # Return the list of movie information.
    return results
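One detail worth checking when removing duplicate IDs from search hits is whether the ranking survives. `np.unique` returns its values sorted in ascending order, so the best match is no longer guaranteed to come first, while `dict.fromkeys` keeps the first occurrence of each ID in its original position. A minimal sketch with hypothetical IDs:

```python
import numpy as np

# Hypothetical search hits, best match first, with one repeated ID.
ranked_ids = [42, 7, 42, 19, 3]

# np.unique removes duplicates but also sorts, which discards the ranking.
sorted_unique = np.unique(ranked_ids).tolist()

# dict.fromkeys removes duplicates and keeps first-seen order.
order_preserving = list(dict.fromkeys(ranked_ids))

print(sorted_unique)
print(order_preserving)
```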
STEP 7:
Query Execution and Displaying Results
This piece of code runs a semantic query with the phrase ‘Artificial Intelligence based action movie’. The search function returns the 5 most relevant movie titles using the Faiss index and the SentenceTransformer model. The pprint function is imported for pretty-printing, although the loop here prints each result with a plain print call and a leading tab.
# Importing the pprint function from the pprint module for pretty-printing of results.
from pprint import pprint
# Assigning a query string to the variable 'query'.
query = "Artificial Intelligence based action movie"
# Performing a search using the 'search' function defined earlier, with the query string, top_k value of 5, Faiss index 'index', and the SentenceTransformer model 'model'.
results = search(query, top_k=5, index=index, model=model)
# Printing a newline character to create space before printing the results.
print("\n")
# Iterating over each result in the 'results' list.
for result in results:
    # Printing each result with a tab indentation.
    print('\t', result)
Converting DataFrame Column to List of Plot Descriptions
This code converts the Plot column from the DataFrame df into a list of plot descriptions and assigns it to the variable paragraphs. Each element in paragraphs is a string representing a movie plot, making it easy to iterate over or perform operations on the plots as a list rather than as a DataFrame column. This format is often used for tasks such as text processing, encoding, or performing similarity searches.
paragraphs=df.Plot.tolist()
STEP 8:
Initializing and Configuring T5 Model for Inference
This code initializes the T5 model and tokenizer by importing the T5Tokenizer and T5ForConditionalGeneration classes, which allow for text tokenization and conditional text generation. Specifically, it loads the pre-trained model BeIR/query-gen-msmarco-t5-large-v1, which is designed for text-to-text generation tasks such as query generation in NLP applications. The code also sets the model to evaluation mode, which disables training-only behavior such as dropout so that outputs are consistent during inference. Note that eval() does not itself turn off gradient computation; that is done by running inference inside torch.no_grad().
# Importing the T5Tokenizer and T5ForConditionalGeneration classes from the transformers library for tokenization and model loading.
from transformers import T5Tokenizer, T5ForConditionalGeneration
# Importing the torch library for tensor operations.
import torch
# Initializing a T5 tokenizer using the pretrained 'BeIR/query-gen-msmarco-t5-large-v1' model.
tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
# Initializing a T5 model for conditional generation using the pretrained 'BeIR/query-gen-msmarco-t5-large-v1' model.
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model.eval()
Selecting Device for Computation
The code selects the GPU ('cuda') as the device when one is available, falling back to the CPU otherwise, and moves the T5 model to that device. Running on the GPU makes inference considerably faster.
#Select the device
# Uses the GPU ('cuda') when available; otherwise falls back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Moves the T5 model to the selected device.
model.to(device)
Setting Generation Parameters
This section sets the basic text generation parameters. The batch size is 16, five queries are generated for each paragraph, and input paragraphs and output queries are limited to 512 and 64 tokens respectively.
# Parameters for generation
#Batch size
batch_size = 16
#Number of queries to generate for every paragraph
num_queries = 5
#Max length for paragraph
max_length_paragraph = 512
#Max length for output query
max_length_query = 64
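To see what these parameters imply for the workload, the sketch below splits a list of paragraphs into batches and counts the queries that would be generated. The helper `batched` and the list `paragraphs_demo` are hypothetical; the actual generation call is not shown here.

```python
import math


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


batch_size = 16
num_queries = 5

# Hypothetical stand-in for the list of plot descriptions.
paragraphs_demo = [f"plot {i}" for i in range(50)]

batches = list(batched(paragraphs_demo, batch_size))
num_batches = math.ceil(len(paragraphs_demo) / batch_size)
total_queries = len(paragraphs_demo) * num_queries
print(num_batches, [len(b) for b in batches], total_queries)
```

The last batch is usually smaller than the others, and the number of generated queries grows linearly with both the dataset size and num_queries, which is worth keeping in mind for large datasets.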
Removing Non-ASCII Characters
The function _removeNonAscii accepts a string and returns a new string that contains only ASCII characters. Any character whose code point is 128 or greater is discarded, so the result consists of standard ASCII characters only.
# Define a function named _removeNonAscii that takes a string s as input.
def _removeNonAscii(s):
    # Return a new string created by joining the characters of s whose code point is less than 128.
    return "".join(i for i in s if ord(i) < 128)
STEP 9:
Model Training for Semantic Search
This code fine-tunes a SentenceTransformer model for semantic search. It first reads the training file, which is expected to hold one tab-separated query and paragraph per line, into query-paragraph pairs and shuffles their order. The model is assembled from two modules: the pre-trained msmarco-distilbert-base-dot-prod-v3 transformer, which produces contextual token embeddings, and a pooling layer that aggregates them into a single sentence embedding. Although a code comment says the model is created "from scratch", the weights start from the pre-trained checkpoint. Training uses MultipleNegativesRankingLoss, which is well suited to contrastive learning for semantic search, and runs for 3 epochs with a progress bar. Finally, the trained model is saved to the directory ‘search/search-model’ for later use.
from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets
from torch import nn
import os
import random
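train_examples = []
# Each line is expected to hold one tab-separated pair: query<TAB>paragraph
with open('/content/drive/MyDrive/new /wiki_movie_plots_deduped.csv') as fIn:
    for line in fIn:
        try:
            query, paragraph = line.strip().split('\t', maxsplit=1)
            train_examples.append(InputExample(texts=[query, paragraph]))
        except ValueError:
            # Skip lines that do not contain a tab-separated pair
            pass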
random.shuffle(train_examples)
train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=8)
# Now we assemble a SentenceTransformer model from the pre-trained transformer and a pooling layer
word_emb = models.Transformer('sentence-transformers/msmarco-distilbert-base-dot-prod-v3')
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])
# MultipleNegativesRankingLoss requires input pairs (query, relevant_passage)
# and trains the model so that it is suitable for semantic search
train_loss = losses.MultipleNegativesRankingLoss(model)
#Tune the model
num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=num_epochs, warmup_steps=warmup_steps, show_progress_bar=True)
os.makedirs('search', exist_ok=True)
model.save('search/search-model')
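The idea behind MultipleNegativesRankingLoss can be sketched in NumPy: for each query in a batch, its own passage is the positive and every other passage in the batch acts as a negative, and the loss is a cross-entropy over the similarity scores. The sketch below is a simplified illustration, not the library's implementation; the function name `in_batch_ranking_loss`, the cosine similarity, and the scale of 20 are assumptions (to our knowledge they match the library defaults), and the random embeddings are made up.

```python
import numpy as np


def in_batch_ranking_loss(query_emb, passage_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    scores = scale * (q @ p.T)                            # scores[i, j]: query i vs passage j
    scores = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The matching passage for query i sits on the diagonal (row i, column i).
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
passages = rng.normal(size=(4, 8))                            # made-up passage embeddings
aligned_queries = passages + 0.05 * rng.normal(size=(4, 8))   # each close to its own passage
shuffled_queries = aligned_queries[[1, 2, 3, 0]]              # each paired with the wrong passage

loss_aligned = in_batch_ranking_loss(aligned_queries, passages)
loss_shuffled = in_batch_ranking_loss(shuffled_queries, passages)
print(loss_aligned, loss_shuffled)
```

Because every other passage in a batch is treated as a negative, two identical passages in the same batch would act as false negatives, which is presumably why the training code uses NoDuplicatesDataLoader.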
Loading Pre-Trained SentenceTransformer Model
This snippet imports the libraries needed for sentence embeddings and file handling. It then loads the previously fine-tuned SentenceTransformer model from the path where it was saved in Google Drive. The model is now ready for generating sentence embeddings and running similarity searches.
# Importing the SentenceTransformer library and util module for sentence embeddings and utility functions.
from sentence_transformers import SentenceTransformer, util
# Importing gzip and json modules for handling compressed files and JSON data, respectively.
# Also importing the os module for file operations.
import gzip
import json
import os
# Initializing a SentenceTransformer model by loading it from the specified path on disk.
# The path points to a previously trained and saved model in the 'search' directory.
model = SentenceTransformer('/content/drive/MyDrive/Untitled Folder/search/search-model')
Zipping the Search Model Directory
This command compresses the entire search directory into a single ZIP file named search_model.zip. The -r flag ensures that all files and subdirectories within search are recursively included in the ZIP archive. This process creates a portable file for easy sharing or backup of the search model and related files.
!zip -r search_model.zip "./search"
Building and Saving Faiss Index for Plot Similarity Search
This code prepares the plot descriptions in the DataFrame df for similarity search with the Faiss library. First, it encodes the plot descriptions into fixed-size vectors using the SentenceTransformer model loaded above, generating embeddings that capture semantic information about each plot. The encoded data is then converted into a 32-bit floating-point NumPy array, which Faiss requires for indexing. A Faiss index is created with a flat inner product (IP) metric and a vector dimensionality of 768, matching the output size of the encoder model. The encoded vectors are added to this index with unique IDs corresponding to each plot, allowing for efficient similarity retrieval later. Finally, the index is saved as a binary file named movie_plot.index, enabling quick loading and retrieval in future sessions.
# Importing the Faiss library for similarity search and clustering.
import faiss
# Encoding the plot descriptions from the DataFrame 'df' into fixed-size vectors using the pre-trained SentenceTransformer model.
encoded_data = model.encode(df.Plot.tolist())
# Converting the encoded data into a numpy array of 32-bit floating-point numbers.
encoded_data = np.asarray(encoded_data.astype('float32'))
# Creating an index using Faiss, specifying a flat index with inner product (IP) distance metric and dimensionality of 768 for the encoded vectors.
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
# Adding the encoded data to the Faiss index along with corresponding IDs (0 to len(df) - 1, as 64-bit integers).
index.add_with_ids(encoded_data, np.arange(len(df), dtype='int64'))
# Writing the Faiss index to disk as a binary file named 'movie_plot.index'.
faiss.write_index(index, 'movie_plot.index')
Movie Information Retrieval Using Similarity Search
This code defines two functions, fetch_movie_info and search, to retrieve and display information about movies based on a similarity search query. The fetch_movie_info function takes an index (dataframe_idx) as input and retrieves the row from the DataFrame df at that index, storing the movie’s title in a dictionary (meta_dict). The search function performs a similarity search by encoding an input query with the SentenceTransformer model and then finding the top_k most similar vectors in the Faiss index (index). It records the time taken for the search, extracts and deduplicates the IDs of the top matching vectors, and retrieves movie information for each result by calling fetch_movie_info. The function returns a list of dictionaries containing the movie information for the top matches.
# Define a function named fetch_movie_info that takes a parameter dataframe_idx.
def fetch_movie_info(dataframe_idx):
    # Retrieve the row of the DataFrame 'df' at the position 'dataframe_idx' and assign it to the variable 'info'.
    info = df.iloc[dataframe_idx]
    # Initialize an empty dictionary named 'meta_dict'.
    meta_dict = {}
    # Assign the value of the 'Title' column of the 'info' row to the 'Title' key in 'meta_dict'.
    meta_dict['Title'] = info['Title']
    # Return the populated dictionary 'meta_dict'.
    return meta_dict

# Define a function named search that takes four parameters: query, top_k, index, and model.
def search(query, top_k, index, model):
    # Record the current time and store it in the variable 't'.
    t = time.time()
    # Encode the input query using the SentenceTransformer model 'model' and store the resulting vector in 'query_vector'.
    query_vector = model.encode([query])
    # Perform a similarity search on the Faiss index; the result is a (scores, ids) pair.
    hits = index.search(query_vector, top_k)
    # Print the total time taken for the search operation.
    print('>>>> Results in Total Time: {}'.format(time.time() - t))
    # Extract the IDs of the top_k most similar vectors from the search results.
    top_k_ids = hits[1].tolist()[0]
    # Remove duplicate IDs while keeping the ranking order (np.unique would sort the IDs and lose the ranking).
    top_k_ids = list(dict.fromkeys(top_k_ids))
    # Retrieve movie information for each ID using the fetch_movie_info function and store the results in a list named 'results'.
    results = [fetch_movie_info(idx) for idx in top_k_ids]
    # Return the list of movie information.
    return results
Query-Based Movie Search and Results Display
This code performs a movie similarity search based on a specific query and displays the results in a readable format. The pprint function is imported for neat output formatting. A query string, "Thriller based action movie," is assigned to query. The search function, defined earlier, is called with the query, a top result count of 5, the Faiss index, and the SentenceTransformer model, returning the top 5 most similar movies in the results list. After printing a newline for spacing, it iterates over each result in results, printing each with a tab for clear alignment.
# Importing the pprint function from the pprint module for pretty-printing of results.
from pprint import pprint
# Assigning a query string to the variable 'query'.
query = "Thriller based action movie"
# Performing a search using the 'search' function defined earlier, with the query string, top_k value of 5, Faiss index 'index', and the SentenceTransformer model 'model'.
results = search(query, top_k=5, index=index, model=model)
# Printing a newline character to create space before printing the results.
print("\n")
# Iterating over each result in the 'results' list.
for result in results:
    # Printing each result with a tab indentation.
    print('\t', result)
Conclusion:
In this project, we built a semantic search system that combines Transformer architectures with the Faiss vector database. Instead of retrieving data by keyword match alone, it analyzes the meaning of each search query and returns results that are relevant to what users are actually searching for.
We applied the system to a dataset of movie plots, where it worked well. Faiss lets the system process large datasets efficiently, which makes it scalable for industries with large volumes of data. With semantic search, businesses can offer a more intuitive and user-friendly experience, in line with the growing need for smart search solutions. The system can be improved further by fine-tuning the transformer models for domain-specific queries and by incorporating user feedback on the relevance of the results.
Overall, we demonstrated that modern natural language processing approaches can be combined with vector indexing to build an efficient search solution for a data-centric age. Whether the data is films, merchandise, or medical records, this system can serve as the basis for fast, adaptive search interfaces.
Challenges and Solutions:
Building a semantic search index is interesting, but it comes with several challenges. Here are some problems you may encounter while carrying out this project, along with practical solutions.
- Dealing with Large Datasets: Process large datasets in smaller chunks and use GPUs, such as those in Google Colab Pro, for faster processing.
- Faiss Compatibility: Make sure you install the right version of Faiss, and use an environment such as Google Colab to avoid setup problems.
- Model Fine-Tuning: Adapt the model to the needs of a given sector, for instance healthcare or e-commerce, to improve relevance in that field.
- Understanding MultipleNegativesRankingLoss: Practice with small datasets at first to understand the loss function, and refer to the official documentation for help.
- Scaling the Search System: Use distributed indexing with Faiss or an approximate nearest neighbor (ANN) strategy so that the system scales to massive datasets.
FAQ:
In this section, we answer some frequently asked questions (FAQs) about this project and about building a semantic search system with Transformer models and Faiss.
Question 1: Why are we using Transformer models here?
Answer: We use transformer models like the one from SentenceTransformer to turn text data (like movie plots) into high-dimensional vectors that represent the semantic meaning of that text. This lets the search system understand the context of the query rather than only matching keywords.
Question 2: Why is Faiss used for similarity search?
Answer: Faiss is a powerful and efficient library for large-scale similarity search. It makes searching high-dimensional vector data fast and precise, which matters when speed and scale are important, as in semantic search.
Question 3: What do I do about large datasets?
Answer: For large datasets, batch processing can be used during text encoding. The computation can be sped up using GPUs. Google Colab Pro offers GPUs that accelerate the processing of large amounts of data.
Question 4: What if Faiss doesn’t work as expected in my environment?
Answer: Check that you are using the right version of Faiss, especially if you need GPU support. You can run the project in a cloud environment such as Google Colab; if you run it locally, make sure your machine supports the needed libraries.
Question 5: Can we deploy this system to domains other than movies?
Answer: Absolutely! It extends to many domains including e-commerce and healthcare. The Transformer model can be fine-tuned with domain-specific data to improve its performance in any field.