
HyDE-Powered Document Retrieval Using DeepSeek
In this project, we combine FAISS, DeepSeek, LangChain and HuggingFace to develop an intelligent information retrieval system. The aim is a system that can efficiently load, process and store PDF documents, making it easy to search for and find relevant information. Whether you're posing a specific question or seeking context, the system quickly generates a response and pulls up the most pertinent documents.
Project Overview
Imagine having a stack of PDF documents and needing to pull out the exact answer to a specific question. LangChain loads and splits the documents, and a HuggingFace model transforms the chunks into embeddings. Then comes DeepSeek, which generates a detailed hypothetical answer to the question.
Once split and embedded, the documents are stored in FAISS, a fast vector store capable of efficiently searching for the most pertinent information. DeepSeek generates the answer to your query, and FAISS finds the most relevant supporting documents alongside it. The result is a smart, efficient system for document analysis and query answering.
In short, the system finds accurate answers to a query by digging into the documents for you, so you don't have to comb through pages of text yourself.
Prerequisites
- Python (Version 3.7 or higher)
- Google Colab (for easy access to GPU resources)
- Libraries:
  - LangChain: For document processing and interaction with language models
  - HuggingFace Transformers: For model handling and text embeddings
  - FAISS: For efficient vector storage and similarity search
  - PyMuPDF: For PDF loading and content extraction
  - Sentence-Transformers: For text embedding generation
  - Torch: For model inference and handling deep learning tasks
- Google Drive: To store and load PDF files
- Pre-trained Models (like DeepSeek or similar) for generating hypothetical answers and text generation
These tools and libraries will help you set up the system for loading documents, embedding them, generating answers and performing efficient document retrieval.
Approach
The approach of this project revolves around processing and embedding PDF documents into a FAISS vector store for fast and efficient similarity search. First, the PDF documents are loaded using LangChain's PyPDFLoader and split into smaller, manageable chunks using the RecursiveCharacterTextSplitter. These chunks are then embedded with a HuggingFace embeddings model, which converts text into vector representations, and stored in a FAISS vector store for quick retrieval based on similarity to a given query. When a user submits a question, the system generates a hypothetical answer using the DeepSeek language model and uses that answer, rather than the raw query, to search the FAISS vector store; this is the Hypothetical Document Embeddings (HyDE) technique referenced in the title. The system returns both the generated hypothetical document and the retrieved documents, providing users with detailed, contextually relevant answers. This combination of deep learning models and efficient vector search ensures a seamless, powerful solution for document analysis and query answering.
Workflow and Methodology
Workflow
- Step 1: Use LangChain's PyPDFLoader to load the PDF document.
- Step 2: Break the document into smaller sections with RecursiveCharacterTextSplitter to handle large texts more effectively.
- Step 3: Clean the text by removing unwanted characters, such as tabs, with a custom function.
- Step 4: Create embeddings for each text section using HuggingFace Embeddings.
- Step 5: Save the embeddings in a FAISS vector store for quick similarity searches.
- Step 6: For any given query, generate a hypothetical response using DeepSeek.
- Step 7: Conduct a similarity search in the FAISS vector store based on the generated response.
- Step 8: Retrieve the most relevant documents and present them alongside the hypothetical response.
Methodology
- Document Preprocessing: Utilize PyPDFLoader to extract text from PDF files and apply RecursiveCharacterTextSplitter to divide the text into smaller, meaningful segments.
- Text Embedding: Transform the document segments into vector embeddings using a pre-trained HuggingFace model that captures the semantic meaning of the text.
- Vector Store: Save the embeddings in a FAISS vector store, enabling efficient retrieval of similar documents based on query relevance.
- Query Answering: When a user submits a query, employ DeepSeek (or another suitable LLM) to create a hypothetical answer, which is then used to find the most relevant documents in the vector store.
- Similarity Search: Leverage FAISS to conduct a similarity search and obtain the top-k most relevant documents that correspond to the hypothetical answer or query.
- Result Presentation: Present the generated hypothetical answer alongside the retrieved documents for comprehensive context.
Data Collection and Preparation
Data Collection
Gather all PDF files that contain the relevant content for processing. Store them in an accessible location, such as Google Drive, so the code can access them easily.
Data Preparation Workflow
- Use LangChain's PyPDFLoader to extract raw text from PDFs.
- Split the text into smaller chunks using RecursiveCharacterTextSplitter (e.g., 1000 characters).
- Clean the text by removing unwanted characters (e.g., tabs).
- Use HuggingFace embeddings (e.g., all-MiniLM-L6-v2) to convert text into vector representations.
- Store the embeddings in a FAISS vector store for fast search.
- The system is ready to return relevant documents based on user queries using FAISS.
Code Explanation
STEP 1:
Installation of Required Libraries
This code installs essential libraries for natural language processing and document processing. It includes LangChain for managing language models, sentence-transformers for embeddings, PyMuPDF for working with PDFs and FAISS for similarity search. Additionally, OpenAI and Cohere APIs are installed for language model integration.
!pip install langchain langchain-community langchain-openai langchain-cohere sentence-transformers faiss-cpu PyMuPDF rank-bm25 openai transformers torch accelerate
!pip install pypdf
Importing Necessary Libraries
This code imports libraries for document processing, text splitting and embeddings. It uses PyPDFLoader from LangChain to load PDF files, RecursiveCharacterTextSplitter to split text and HuggingFaceEmbeddings for embeddings. Additionally, it imports FAISS for vector storage, Hugging Face's AutoModelForCausalLM for language models and pipeline for task-specific pipelines. Torch is used for model handling.
import os
import textwrap
import numpy as np
from typing import List
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings # Using Hugging Face instead of OpenAI
from langchain.vectorstores import FAISS
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
STEP 2:
Mounting Google Drive
This code mounts Google Drive to the Colab environment, allowing access to files stored on the drive. The drive is mounted at /content/drive, enabling file interactions within the Colab notebook.
from google.colab import drive
drive.mount('/content/drive')
Setting the PDF File Path
This code assigns the path of the PDF file (tesla.pdf) stored on Google Drive to the pdf_path variable. The path can be updated if a different file is needed.
pdf_path = "/content/drive/MyDrive/tesla.pdf" # Change this if needed
Loading the Pre-trained Model
This code sets the model name and loads the tokenizer and model for inference. It uses the AutoTokenizer and AutoModelForCausalLM from Hugging Face, loading the model DeepSeek-R1-Distill-Qwen-1.5B. It creates a text generation pipeline using the loaded model, optimized for inference with torch.float16.
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # Swap in a different DeepSeek model if needed
# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto", torch_dtype=torch.float16
)
# Create inference pipeline
llm_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)
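As a quick optional check (not part of the original notebook), you can run a short generation to confirm the pipeline loaded correctly; the prompt and the max_new_tokens value below are purely illustrative:
# Optional sanity check: generate a short continuation to confirm the model loaded (illustrative prompt)
sample = llm_pipeline("Tesla's largest factory by output is", max_new_tokens=30)[0]["generated_text"]
print(sample)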
STEP 3:
Replacing Tabs with Spaces in Documents
This function takes a list of documents and replaces all tab characters (\t) with spaces in the content of each document. It iterates through the documents, modifies their page_content and returns the updated list.
def replace_t_with_space(list_of_documents):
    """Replaces all tab characters ('\t') with spaces in document content."""
    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')
    return list_of_documents
Encoding a PDF into a FAISS Vector Store
This function converts a PDF into a FAISS vector store using Hugging Face embeddings. It first loads the PDF, splits its text into chunks with specified sizes and overlaps and cleans the chunks by removing tab characters. Then, it uses Hugging Face embeddings (all-MiniLM-L6-v2) to convert the text into embeddings and creates a FAISS vector store for efficient similarity search.
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """Encodes a PDF into a FAISS vector store using Hugging Face embeddings."""
    # Load PDF
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Use Hugging Face Embeddings instead of OpenAI
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Create FAISS vector store
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)
    return vectorstore
Retrieving Context for a Question
This function retrieves the most relevant context for a given question using vector search. It performs a similarity search on the provided vector store (vectorstore) and returns the page_content of the top k most relevant documents. The default value for k is 3.
def retrieve_context_per_question(question, vectorstore, k=3):
    """Retrieves relevant context for a given question using vector search."""
    docs = vectorstore.similarity_search(question, k=k)
    return [doc.page_content for doc in docs]
Generating a Hypothetical Document
This function generates a hypothetical document based on a given query using the Qwen or DeepSeek LLM. It constructs a prompt with the query and requests a detailed answer with a specified character length (chunk_size). The answer is generated using the language model pipeline and the resulting text is returned.
def generate_hypothetical_document(query, chunk_size=500):
    """Generate a hypothetical document using Qwen or DeepSeek LLM."""
    prompt = f"Given the question '{query}', generate a detailed answer. The response should be exactly {chunk_size} characters."
    # Note: max_length is counted in tokens, not characters, and the returned generated_text includes the prompt.
    response = llm_pipeline(prompt, max_length=chunk_size, truncation=True)[0]["generated_text"]
    return response
Retrieving Similar Documents and Generating a Hypothetical Answer
This function retrieves similar documents from the FAISS vector store based on a generated hypothetical document. It first generates the hypothetical document using the provided query, then retrieves the most relevant documents from the vector store using the hypothetical document. It returns both the similar documents and the hypothetical document.
def retrieve(query, vectorstore, k=3):
    """Retrieve similar documents using FAISS and generate a hypothetical answer."""
    hypothetical_doc = generate_hypothetical_document(query)
    similar_docs = retrieve_context_per_question(hypothetical_doc, vectorstore, k=k)
    return similar_docs, hypothetical_doc
Encoding the PDF into a FAISS Vector Store
This code calls the encode_pdf function, passing the path of the PDF (pdf_path), to convert the document into a FAISS vector store. The resulting vector_store will be used for similarity search and document retrieval.
# Encode PDF into FAISS vector store
vector_store = encode_pdf(pdf_path)
STEP 4:
Retrieving Documents and Generating Hypothetical Answer
This code queries the system with a test question about Tesla factories. It uses the retrieve function to generate a hypothetical document based on the query and retrieves the most relevant documents from the FAISS vector store.
test_query = "Which Tesla factory has the highest vehicle production capacity?"
retrieved_docs, hypothetical_doc = retrieve(test_query, vector_store)
Displaying the Hypothetical Document
This code prints the generated hypothetical document by wrapping the text to a width of 100 characters. The output will display the document, providing a clear and formatted answer based on the query.
# Display Results
print("Hypothetical Document:\n")
print(textwrap.fill(hypothetical_doc, width=100) + "\n")
Displaying Retrieved Documents
This code prints the retrieved documents that are most relevant to the hypothetical document. Each document is displayed with a context number and the text is wrapped to 100 characters for better readability.
print("Retrieved Documents:\n")
for i, doc in enumerate(retrieved_docs):
print(f"Context {i + 1}:")
print(textwrap.fill(doc, width=100))
print("\n")
Retrieving Documents and Generating Hypothetical Answers for Tesla's Q3 2024 Revenue
This code queries the system with a new test question about Tesla's revenue in Q3 2024. It uses the retrieve function to generate a hypothetical document based on the query and retrieves the most relevant documents from the FAISS vector store.
test_query = "What was Tesla's total revenue in Q3 2024?"
retrieved_docs, hypothetical_doc = retrieve(test_query, vector_store)
Displaying the Hypothetical Document for Tesla's Q3 2024 Revenue
This code will print the generated hypothetical document that answers the query about Tesla's total revenue in Q3 2024. The text is wrapped to a width of 100 characters for improved readability.
# Display Results
print("Hypothetical Document:\n")
print(textwrap.fill(hypothetical_doc, width=100) + "\n")
Displaying Retrieved Documents for Tesla's Q3 2024 Revenue
This code prints the retrieved documents relevant to Tesla's Q3 2024 revenue query. Each document is displayed with its context number and the text is wrapped to 100 characters for better readability.
print("Retrieved Documents:\n")
for i, doc in enumerate(retrieved_docs):
print(f"Context {i + 1}:")
print(textwrap.fill(doc, width=100))
print("\n")
Conclusion
In this project, we successfully built an efficient document retrieval system by leveraging powerful tools like FAISS, DeepSeek, LangChain and HuggingFace. We enabled fast and accurate similarity searches by embedding PDF documents into a FAISS vector store, while DeepSeek helped generate hypothetical answers to queries. The process of loading, splitting, cleaning and embedding the text ensures that the system can handle large volumes of data effectively. With this setup, users can easily retrieve relevant documents and answers, making it a robust solution for information extraction and analysis. By combining these advanced technologies, we’ve created a flexible and powerful system ready for real-world applications.
Challenges New Coders Might Face
Challenge: Large Document Processing
Solution: Use text splitting (e.g., RecursiveCharacterTextSplitter) to divide the document into smaller chunks. Additionally, storing the embeddings in FAISS ensures that only the most relevant chunks are searched, reducing the load on the system.
Challenge: Slow Similarity Search
Solution: Use FAISS's indexing and quantization techniques, such as IVF (Inverted File Index) and HNSW (Hierarchical Navigable Small World graphs), which allow for faster and more efficient retrieval even with large datasets; a short index-building sketch follows after this list.
Challenge: Model Incompatibility or Version Mismatch
Solution: Ensure that the required versions of the libraries and models are properly installed using version management tools like pip or conda. It's also helpful to document the versions of libraries used for consistency across environments.
Challenge: Resource Limitations (RAM/Storage)
Solution: Use FAISS's memory-efficient index options, such as IVFPQ (Inverted File with Product Quantization), which compresses vectors so they can be stored and searched without overloading memory.
Challenge: Generating Accurate Hypothetical Answers
Solution: Regularly fine-tune the language model based on domain-specific data to improve the accuracy and relevance of generated responses. Additionally, leveraging user feedback can help continuously improve the answer-generation process.
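The IVF and IVFPQ indexes mentioned above can be built directly with the faiss library. The snippet below is a minimal standalone sketch, not wired into the LangChain pipeline; the dimension, nlist, nprobe and PQ settings are illustrative, and random vectors stand in for real embeddings.
import numpy as np
import faiss

d = 384                                   # all-MiniLM-L6-v2 produces 384-dimensional embeddings
xb = np.random.rand(10000, d).astype("float32")  # placeholder vectors standing in for real embeddings

nlist = 100                               # number of IVF clusters (illustrative)
quantizer = faiss.IndexFlatL2(d)          # coarse quantizer used to assign vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                           # IVF indexes must be trained before vectors are added
index.add(xb)
index.nprobe = 10                         # clusters searched per query: higher = better recall, slower
distances, ids = index.search(xb[:1], 3)  # 3 nearest neighbors of the first vector
print(ids)

# Memory-constrained variant: IVF with product quantization (IVFPQ)
pq_quantizer = faiss.IndexFlatL2(d)
pq_index = faiss.IndexIVFPQ(pq_quantizer, d, nlist, 8, 8)  # 8 sub-quantizers, 8 bits each
pq_index.train(xb)
pq_index.add(xb)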
FAQ
Question 1. What is the purpose of using FAISS in this project?
Answer: FAISS is used to store document embeddings and perform fast similarity searches. By converting document content into vector representations, FAISS allows the system to quickly retrieve the most relevant documents based on a query, making it an essential tool for efficient information retrieval.
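For illustration, LangChain's FAISS wrapper also exposes similarity_search_with_score, which pairs each retrieved chunk with its raw distance (lower is closer with the default L2 index); the query string below is hypothetical.
# Illustrative only: retrieve chunks together with their raw FAISS distances
results = vector_store.similarity_search_with_score("gigafactory production capacity", k=3)
for doc, score in results:
    print(round(float(score), 3), doc.page_content[:80])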
Question 2. Why did you choose DeepSeek for generating hypothetical answers?
Answer: We chose DeepSeek because it is a powerful language model capable of generating contextually relevant and detailed hypothetical answers to specific queries. It helps bridge the gap between raw document data and user queries by providing intelligent responses based on the content.
Question 3. What role does LangChain play in this project?
Answer: LangChain is responsible for loading and processing PDFs, splitting the text into manageable chunks and interacting with language models for document analysis. It simplifies handling the document flow, allowing the system to process large amounts of text data efficiently.
Question 4. How does the text-splitting process work?
Answer: The RecursiveCharacterTextSplitter splits large text into smaller chunks of a defined size (e.g., 1000 characters), with overlapping sections to maintain context. This ensures that even long documents are handled efficiently while preserving meaning across chunks.
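To see how the overlap preserves context across chunk boundaries, here is a small standalone sketch; the sizes are shrunk from the 1000/200 values used in encode_pdf so the effect is visible on a toy string.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Toy example: split a short repeated sentence into 100-character chunks with a 20-character overlap
sample_text = "Tesla reported strong vehicle deliveries this quarter. " * 8
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20, length_function=len)
for chunk in splitter.split_text(sample_text):
    print(len(chunk), repr(chunk[:40]))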
Question 5. How accurate is the document retrieval process?
Answer: The accuracy of document retrieval depends on the quality of the embeddings and the similarity search algorithm. With HuggingFace embeddings and FAISS, the system offers high accuracy in retrieving documents that closely match the context of the query. However, fine-tuning the model and embeddings can further improve accuracy.