
Enhancing Document Retrieval with Contextual Overlapping Windows
This project demonstrates a method to enhance document retrieval using contextually overlapping windows in a vector database. Adding surrounding context to retrieved text chunks improves the coherence and completeness of the information. The approach uses PDF processing, text chunking, and FAISS with OpenAI embeddings to create a vector store. A custom retrieval function fetches relevant chunks with added context, offering a better alternative to traditional vector search methods that often return isolated, context-lacking information.
Project Overview
This project focuses on enhancing document retrieval by incorporating contextually overlapping windows in a vector database. Traditional vector search methods often return isolated chunks of text that may lack sufficient context, making it harder to understand the information. This technique addresses this issue by adding surrounding context to the retrieved chunks, improving the coherence and completeness of the results.
The project involves PDF processing to extract text, which is then divided into manageable chunks. These chunks are stored in a vector store built with FAISS and OpenAI embeddings to enable fast retrieval. A custom retrieval function then fetches relevant chunks along with their surrounding context. The effectiveness of this approach is compared with standard retrieval methods, offering a more comprehensive and accurate search experience.
Prerequisites
- Familiarity with Python
- Knowledge of text chunking and contextual information retrieval
- Experience with Colab Notebooks for project development
- Basic understanding of document retrieval and vector databases
- Libraries: FAISS (for vector search and indexing), OpenAI embeddings (for text embeddings), NumPy, Pandas, PyPDF2, and LangChain.
- Basic knowledge of embedding generation and usage with FAISS.
Approach
The approach improves document retrieval by incorporating contextually overlapping windows. First, documents are processed with a PDF extraction library such as PyPDF2 and broken down into manageable text chunks. These chunks are embedded with OpenAI embeddings to capture their semantics and stored in a FAISS vector database for efficient search and retrieval. When a query is made, a custom retrieval function fetches the relevant text along with its surrounding context, producing a more comprehensive and coherent response. This method is compared against traditional retrieval techniques, highlighting improvements in context and in the completeness of results.
Workflow and Methodology
Workflow
- Extract text from PDFs using PDF processing libraries (e.g., PyPDF2)
- Divide the extracted text into smaller chunks for easier processing.
- Generate embeddings for each text chunk using OpenAI embeddings.
- Store the embeddings in a FAISS vector database for efficient searching.
- Create a custom retrieval function that retrieves text chunks along with their surrounding context.
- Compare results from standard retrieval and contextual retrieval to evaluate improvements in coherence and completeness.
Methodology
- Text Processing: Use PDF extraction to parse documents and convert them into text chunks.
- Embedding Generation: Apply OpenAI embeddings to generate vector representations of the text chunks.
- Vector Search: Store these embeddings in a FAISS database for fast retrieval based on similarity.
- Contextual Retrieval: Implement a custom retrieval function that fetches relevant chunks and includes their surrounding context to provide more complete answers.
- Evaluation: Compare the new method with traditional retrieval to assess improvements in contextual understanding and search accuracy (a condensed sketch of the full pipeline follows this list).
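Taken together, these steps compose into a short pipeline. Below is a condensed sketch, assuming the helper functions (read_pdf_to_string, split_text_to_chunks_with_indices, retrieve_with_context_overlap) defined in the Code Explanation section later in this guide, with a simplified file path; it illustrates the flow rather than replacing the full walkthrough.

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

# Condensed pipeline sketch (helper functions are defined later in this guide)
text = read_pdf_to_string("Climate_Change.pdf")            # 1. extract text from the PDF
docs = split_text_to_chunks_with_indices(text, 400, 200)   # 2. chunk with overlap
store = FAISS.from_documents(docs, OpenAIEmbeddings())     # 3. embed and index
retriever = store.as_retriever(search_kwargs={"k": 1})     # 4. retrieve the top chunk
answers = retrieve_with_context_overlap(                   # 5. enrich with neighboring chunks
    store, retriever, "What drives climate change?",
    num_neighbors=1, chunk_size=400, chunk_overlap=200)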
Data Collection and Preparation
Data Collection
The data used in this project consists of a PDF document, Climate_Change.pdf, stored in a specific directory. The file contains textual information that is extracted and processed using libraries like PyPDF2.
Data Preparation Workflow:
- Extract text from PDFs using PyPDF2 or pdfplumber.
- Clean the text by removing unnecessary characters and formatting (a minimal cleaning sketch follows this list).
- Split the text into smaller chunks.
- Generate embeddings for each chunk using OpenAI embeddings.
- Store the embeddings in a FAISS vector database.
- Group chunks with surrounding context for more coherent retrieval.
- Validate the data for accuracy and correct embedding storage.
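For the cleaning step, here is a minimal sketch; the specific rules are assumptions for illustration and should be adapted to your documents.

import re

def clean_text(raw: str) -> str:
    # Illustrative cleaning rules (assumptions, not part of the original notebook):
    text = re.sub(r"[^\x20-\x7E\n]", " ", raw)  # replace non-printable characters
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # cap consecutive blank lines
    return text.strip()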
Code Explanation
Mounting Google Drive
This code mounts Google Drive to Colab, allowing access to files stored in Drive. The mounted directory is /content/drive, enabling seamless file handling.
from google.colab import drive
drive.mount('/content/drive')
Installing Necessary Packages
These commands install the required Python packages: langchain-openai connects your code to OpenAI's language models, langchain-community adds community-maintained tools, sentence_transformers converts text into numerical vectors, duckduckgo-search lets you query the web via DuckDuckGo, PyPDF2 helps you work with PDF files, tiktoken tokenizes text for processing, and faiss-cpu enables fast similarity search over large collections of vectors.
!pip install langchain-openai
!pip install langchain-community
!pip install sentence_transformers
!pip install -U duckduckgo-search
!pip install PyPDF2
!pip install tiktoken
!pip install faiss-cpu
Library Import and Key Setup
This code imports the libraries needed for handling PDFs, JSON data, and system paths, along with the OpenAI and LangChain integrations. It then looks for an API key in Colab Secrets or the system environment, sets it as an environment variable, appends the parent directory to the system path, and prints a confirmation message.
import os
import sys
import json
import PyPDF2
from typing import List, Tuple
from langchain.schema import Document
from langchain.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.tools import DuckDuckGoSearchResults
from langchain.embeddings import OpenAIEmbeddings
import warnings

warnings.filterwarnings("ignore")

try:
    # Prefer Colab Secrets when running in Colab
    from google.colab import userdata
    api_key = userdata.get("OPENAI_API_KEY")
except ImportError:
    api_key = None  # Not running in Colab

if not api_key:
    # Fall back to the system environment (e.g., a .env file loaded beforehand)
    api_key = os.getenv("OPENAI_API_KEY")

if api_key:
    os.environ["OPENAI_API_KEY"] = api_key
else:
    raise ValueError("OpenAI API Key is missing! Add it to Colab Secrets or a .env file.")

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
print("OPENAI_API_KEY setup completed successfully!")
File Path Setup
This code sets the variable path to the specific location of the "Climate_Change.pdf" file in Google Drive, indicating where the file is stored for later access and processing in the project.
path = "/content/drive/MyDrive/Context Enrichment Window Around Chunks Using LlamaInde/Climate_Change.pdf"
Read PDF Function
This function opens a PDF file in binary mode, reads text from each page using PyPDF2, concatenates all the extracted text into one single string, and returns that string for further use.
def read_pdf_to_string(pdf_path):
    """Reads a PDF file and returns its content as a string.

    Args:
        pdf_path (str): The path to the PDF file.

    Returns:
        str: The content of the PDF file as a string.
    """
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        num_pages = len(pdf_reader.pages)
        text = ""
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
    return text
PDF Content Extraction
This code calls the read_pdf_to_string function using the specified file path and stores the extracted text from the PDF into the variable content, making the PDF’s text available for further processing.
content = read_pdf_to_string(path)
Text Chunking Function
This function splits a long text into smaller overlapping chunks. It iterates over the text, extracts segments of a specified size, and appends each segment as a Document object whose metadata records the chunk's chronological index along with the original text. It returns a list of these Document objects.
def split_text_to_chunks_with_indices(text: str, chunk_size: int, chunk_overlap: int) -> List[Document]:
    """Splits text into chunks with metadata of the chunk chronological index.

    Args:
        text (str): The text to be split.
        chunk_size (int): The size of each chunk.
        chunk_overlap (int): The overlap between chunks.

    Returns:
        List[Document]: A list of Document objects, each containing a chunk and its metadata.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(Document(page_content=chunk, metadata={"index": len(chunks), "text": text}))
        start += chunk_size - chunk_overlap
    return chunks
Splitting PDF Text into Chunks
This code sets a chunk size of 400 characters with a 200-character overlap, then splits the PDF text stored in content into smaller Document objects using the function above and saves the resulting chunks in the docs variable.
chunks_size = 400
chunk_overlap = 200
docs = split_text_to_chunks_with_indices(content, chunks_size, chunk_overlap)
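As a quick sanity check (an illustrative addition, not part of the original notebook), consecutive chunks should share exactly chunk_overlap characters, since each new chunk starts chunks_size - chunk_overlap characters after the previous one:

# The tail of one chunk is the head of the next, by construction.
assert docs[0].page_content[-chunk_overlap:] == docs[1].page_content[:chunk_overlap]
print(f"Created {len(docs)} chunks of up to {chunks_size} characters each.")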
Embeddings and Vector Store Setup
This code converts text chunks into numerical embeddings using OpenAIEmbeddings, stores them in a FAISS vector store, and then creates a retriever that finds the single most relevant chunk for a given query.
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
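As an optional aside (not part of the original notebook), the LangChain FAISS wrapper also exposes similarity_search_with_score, which is useful for inspecting how close a match is; the query here is hypothetical, and FAISS reports L2 distance by default, so lower means closer:

# Optional: inspect match quality (L2 distance; lower = closer).
docs_and_scores = vectorstore.similarity_search_with_score("greenhouse gases", k=1)
for doc, score in docs_and_scores:
    print(f"distance={score:.3f} | {doc.page_content[:80]}...")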
Function to Retrieve a Chunk by Index
The get_chunk_by_index function retrieves a chunk from the vectorstore by comparing each stored chunk's metadata index with the given target_index. It returns the chunk as a Document if found, or None otherwise.
def get_chunk_by_index(vectorstore, target_index: int) -> Document:
    """
    Retrieve a chunk from the vectorstore based on its index in the metadata.

    Args:
        vectorstore (VectorStore): The vectorstore containing the chunks.
        target_index (int): The index of the chunk to retrieve.

    Returns:
        Optional[Document]: The retrieved chunk as a Document object, or None if not found.
    """
    # This is a simplified version. In practice, you might need a more efficient method
    # to retrieve chunks by index, depending on your vectorstore implementation.
    all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)
    for doc in all_docs:
        if doc.metadata.get('index') == target_index:
            return doc
    return None
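Since the comment in the function notes that a more efficient lookup may be needed, one possible alternative (a sketch, not part of the original notebook) walks LangChain's FAISS docstore directly via its index_to_docstore_id mapping, avoiding the embedding call that an empty similarity search triggers:

def get_chunk_by_index_via_docstore(vectorstore, target_index: int):
    # Walk the in-memory docstore directly instead of running a similarity search.
    for doc_id in vectorstore.index_to_docstore_id.values():
        doc = vectorstore.docstore.search(doc_id)
        if isinstance(doc, Document) and doc.metadata.get("index") == target_index:
            return doc
    return None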
Fetching and Displaying a Chunk from the Vectorstore
This code fetches a chunk from the vectorstore using index 0. The function get_chunk_by_index(vectorstore, 0) grabs the chunk at that specific position. Then, it prints the text content of the chunk using chunk.page_content.
chunk = get_chunk_by_index(vectorstore, 0)
print(chunk.page_content)
Contextual Overlap Retrieval
The retrieve_with_context_overlap function retrieves relevant chunks from a vectorstore based on a query, calculates the neighboring chunks within a specified index range, and sorts them by index. It then concatenates the chunks while handling overlap to avoid repetition. This process results in a list of concatenated chunks that provide a more coherent and contextually enriched retrieval.
def retrieve_with_context_overlap(vectorstore, retriever, query: str, num_neighbors: int = 1, chunk_size: int = 200, chunk_overlap: int = 20) -> List[str]:
    """
    Retrieve chunks based on a query, then fetch neighboring chunks and concatenate them,
    accounting for overlap and correct indexing.

    Args:
        vectorstore (VectorStore): The vectorstore containing the chunks.
        retriever: The retriever object to get relevant documents.
        query (str): The query to search for relevant chunks.
        num_neighbors (int): The number of chunks to retrieve before and after each relevant chunk.
        chunk_size (int): The size of each chunk when originally split.
        chunk_overlap (int): The overlap between chunks when originally split.

    Returns:
        List[str]: List of concatenated chunk sequences, each centered on a relevant chunk.
    """
    relevant_chunks = retriever.get_relevant_documents(query)
    result_sequences = []
    for chunk in relevant_chunks:
        current_index = chunk.metadata.get('index')
        if current_index is None:
            continue

        # Determine the range of chunks to retrieve
        start_index = max(0, current_index - num_neighbors)
        end_index = current_index + num_neighbors + 1  # +1 because range is exclusive at the end

        # Retrieve all chunks in the range
        neighbor_chunks = []
        for i in range(start_index, end_index):
            neighbor_chunk = get_chunk_by_index(vectorstore, i)
            if neighbor_chunk:
                neighbor_chunks.append(neighbor_chunk)

        # Sort chunks by their index to ensure correct order
        neighbor_chunks.sort(key=lambda x: x.metadata.get('index', 0))

        # Concatenate chunks, accounting for overlap
        concatenated_text = neighbor_chunks[0].page_content
        for i in range(1, len(neighbor_chunks)):
            current_chunk = neighbor_chunks[i].page_content
            overlap_start = max(0, len(concatenated_text) - chunk_overlap)
            concatenated_text = concatenated_text[:overlap_start] + current_chunk

        result_sequences.append(concatenated_text)
    return result_sequences
Baseline vs. Focused Context Enrichment Approach
This code compares the baseline approach, which retrieves a single relevant chunk, and the enriched approach, which fetches relevant chunks with neighboring context for more detail. It prints both results to highlight the difference in context and completeness.
# Baseline approach
query = "Explain the role of deforestation and fossil fuels in climate change."
baseline_chunk = chunks_query_retriever.get_relevant_documents(query)
# Focused context enrichment approach
enriched_chunks = retrieve_with_context_overlap(
vectorstore,
chunks_query_retriever,
query,
num_neighbors=1,
chunk_size=400,
chunk_overlap=200
)
print("Baseline Chunk:")
print(baseline_chunk[0].page_content)
print("\nEnriched Chunks:")
print(enriched_chunks[0])
Helper Function to Display Context
The show_context function prints each item from the context list, followed by a line of dashes as a separator. This makes it easier to view the content of each item in context.
def show_context(context):
    """
    Helper function to print the context.
    """
    for item in context:
        print(item)
        print("-" * 50)  # Separator for better readability
Document Retrieval with Context Enrichment
This code compares regular retrieval and context-enriched retrieval for a query about deep learning in AI. The document is split into chunks, stored in a FAISS vectorstore, and queried through a retriever. Regular retrieval fetches only the most relevant chunk, while context-enriched retrieval also pulls in neighboring chunks for fuller context. The results are displayed with the show_context function, highlighting the difference in the level of context between the two methods.
document_content = """
Artificial Intelligence (AI) has a rich history dating back to the mid-20th century. The term "Artificial Intelligence" was coined in 1956 at the Dartmouth Conference, marking the field's official beginning.
In the 1950s and 1960s, AI research focused on symbolic methods and problem-solving. The Logic Theorist, created in 1955 by Allen Newell and Herbert A. Simon, is often considered the first AI program.
The 1960s saw the development of expert systems, which used predefined rules to solve complex problems. DENDRAL, created in 1965, was one of the first expert systems, designed to analyze chemical compounds.
However, the 1970s brought the first "AI Winter," a period of reduced funding and interest in AI research, largely due to overpromised capabilities and underdelivered results.
The 1980s saw a resurgence with the popularization of expert systems in corporations. The Japanese government's Fifth Generation Computer Project also spurred increased investment in AI research globally.
Neural networks gained prominence in the 1980s and 1990s. The backpropagation algorithm, although discovered earlier, became widely used for training multi-layer networks during this time.
The late 1990s and 2000s marked the rise of machine learning approaches. Support Vector Machines (SVMs) and Random Forests became popular for various classification and regression tasks.
Deep Learning, a subset of machine learning using neural networks with many layers, began to show promising results in the early 2010s. The breakthrough came in 2012 when a deep neural network significantly outperformed other machine learning methods in the ImageNet competition.
Since then, deep learning has revolutionized many AI applications, including image and speech recognition, natural language processing, and game playing. In 2016, Google's AlphaGo defeated a world champion Go player, a landmark achievement in AI.
The current era of AI is characterized by the integration of deep learning with other AI techniques, the development of more efficient and powerful hardware, and the ethical considerations surrounding AI deployment.
Transformers, introduced in 2017, have become a dominant architecture in natural language processing, enabling models like GPT (Generative Pre-trained Transformer) to generate human-like text.
As AI continues to evolve, new challenges and opportunities arise. Explainable AI, robust and fair machine learning, and artificial general intelligence (AGI) are among the key areas of current and future research in the field.
"""
chunks_size = 250
chunk_overlap = 20
document_chunks = split_text_to_chunks_with_indices(document_content, chunks_size, chunk_overlap)
document_vectorstore = FAISS.from_documents(document_chunks, embeddings)
document_retriever = document_vectorstore.as_retriever(search_kwargs={"k": 1})
query = "When did deep learning become prominent in AI?"
context = document_retriever.get_relevant_documents(query)
context_pages_content = [doc.page_content for doc in context]
print("Regular retrieval:\n")
show_context(context_pages_content)
sequences = retrieve_with_context_overlap(document_vectorstore, document_retriever, query, num_neighbors=1)
print("\nRetrieval with context enrichment:\n")
show_context(sequences)
Conclusion
The context enrichment window technique presented in this project significantly enhances document retrieval in vector-based search systems. By incorporating contextually overlapping windows, this method mitigates the issue of isolated text chunks, which are often returned by traditional vector search methods. The addition of surrounding context improves the coherence, completeness, and relevance of the retrieved information, ensuring a more accurate and comprehensive search experience. This approach leverages the power of FAISS and OpenAI embeddings for efficient vector store creation and retrieval. The comparison with standard retrieval methods highlights the benefits of this context-enriched approach, making it highly suitable for applications like question answering and content summarization, where understanding the full context is crucial.
Challenges New Coders Might Face
Challenge: Large Document Processing
Solution: Text chunking is used to split large documents into smaller, more manageable pieces. By processing documents in chunks with a defined overlap, the system can handle large documents without exhausting memory or compute resources.
Challenge: Maintaining Context in Retrieval
Solution: The contextual enrichment approach solves this problem by retrieving not only the relevant chunk but also its neighboring chunks, thus preserving the context. The overlap between chunks ensures smooth transitions and avoids information loss.
Challenge: Efficient Vector Search
Solution: FAISS (Facebook AI Similarity Search) is used to efficiently store and search vectorized data. FAISS enables fast retrieval by indexing the embeddings, optimizing the search for chunks similar to the query.
Challenge: Ensuring Quality of Retrieval Results
Solution: Fine-tuning the retriever settings, such as adjusting k to retrieve more of the top results and refining the chunk size and overlap parameters, yields better-quality results; a one-line tuning sketch follows.
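A minimal tuning sketch (an illustrative addition; the value of k is an assumption to adjust for your data):

# Widen the retriever from the single best chunk to the top 3;
# a larger k trades precision for recall.
tuned_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})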
Challenge: Lack of Real-Time Adaptability
Solution: The retriever and vectorstore can be updated periodically to reflect new documents, ensuring that the system remains up-to-date. Implementing incremental updates allows for continuous improvements without requiring full reprocessing; a sketch of such an update follows.
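A minimal sketch of an incremental update (illustrative; the new PDF path is hypothetical, and the metadata indices are offset so neighbor lookups do not collide with chunks already in the store):

# Hypothetical new document; the path is an example, not from the original project.
new_text = read_pdf_to_string("/content/drive/MyDrive/new_report.pdf")
new_chunks = split_text_to_chunks_with_indices(new_text, chunk_size=400, chunk_overlap=200)

# Offset indices so they continue after the chunks already indexed.
# Note: with a single flat index space, context windows near document
# boundaries may mix text from different documents.
offset = vectorstore.index.ntotal
for doc in new_chunks:
    doc.metadata["index"] += offset

vectorstore.add_documents(new_chunks)  # appends embeddings without rebuilding the index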
FAQ
Question 1. What is document retrieval with contextual enrichment?
Answer: Document retrieval with contextual enrichment is a method where relevant chunks of text are retrieved from a database, along with their neighboring context. The process ensures that the retrieved information is more coherent and complete, making it easier for users to understand the broader context of the content.
Question 2. How does FAISS improve document retrieval performance?
Answer: FAISS (Facebook AI Similarity Search) is used to store and search large datasets efficiently. It speeds up document retrieval by indexing embeddings of text chunks, enabling fast and accurate searches based on similarity to the query, even in large datasets.
Question 3. What is the role of OpenAI embeddings in document retrieval?
Answer: OpenAI embeddings convert text into vector representations, allowing the retrieval system to measure the semantic similarity between the query and the document chunks. These embeddings help improve the accuracy of retrieval, ensuring that the most relevant chunks are returned.
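As a small illustration (not part of the original project), the semantic similarity that embeddings enable can be measured with cosine similarity; the two example strings are hypothetical, and OpenAIEmbeddings is assumed to be imported as earlier in this guide:

import numpy as np

emb = OpenAIEmbeddings()
q_vec = emb.embed_query("causes of climate change")
d_vec = emb.embed_query("greenhouse gas emissions from fossil fuels")

# Cosine similarity: values closer to 1.0 mean greater semantic similarity.
cos = np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec))
print(f"Cosine similarity: {cos:.3f}")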
Question 4. Why is chunking important in document retrieval?
Answer: Chunking is essential for breaking down large documents into smaller, manageable sections. This approach helps in processing large documents efficiently while also maintaining the context between chunks.
Question 5. What is the difference between regular retrieval and context-enriched retrieval?
Answer: Regular retrieval fetches only the most relevant chunk based on a query, which may lack full context. In contrast, context-enriched retrieval fetches relevant chunks along with their neighboring chunks to provide additional context, resulting in more detailed and comprehensive answers.