Fusion Retrieval: Combining Vector Search and BM25 for Enhanced Document Retrieval

This project enhances document retrieval by combining semantic search (FAISS) and keyword-based ranking (BM25). It enables efficient search across PDF documents, using vector embeddings and language model-driven content generation for improved accuracy.

Project Overview

The system processes PDFs using PyMuPDF, extracts text and splits it into manageable chunks with LangChain’s RecursiveCharacterTextSplitter. Text chunks are then embedded using Hugging Face’s MiniLM model and stored in FAISS for fast similarity searches. Additionally, BM25 scoring enhances retrieval by ranking documents based on keyword relevance. An LLM-powered hypothetical document is generated using DeepSeek-R1-Distill-Qwen-1.5B to improve responses, refining search results with context-aware insights. Retrieved documents are displayed with citations, making this system ideal for academic research, legal analysis, enterprise knowledge retrieval, and AI-driven Q&A solutions.

Prerequisites

  • Google Colab or a local Python environment to run the code.
  • Python 3.8+ installed.
  • Libraries & Dependencies:
    • langchain, langchain-community, langchain-openai, langchain-cohere
    • sentence-transformers, faiss-cpu, PyMuPDF, rank-bm25
    • openai, transformers, torch, accelerate, pypdf
  • Google Drive access (if running on Colab) for storing and retrieving PDFs.
  • A pre-trained language model (e.g., DeepSeek-R1-Distill-Qwen-1.5B) for LLM-based generation.
  • Basic understanding of FAISS, BM25 and text embeddings for fine-tuning retrieval.

Approach

The system starts by processing PDF documents with PyMuPDF (fitz) to extract the text. This text is then divided into manageable chunks using LangChain’s RecursiveCharacterTextSplitter. To maintain clean formatting, tab characters (\t) are replaced with spaces. The text chunks are transformed into vector embeddings using Hugging Face’s MiniLM model and stored in FAISS, which enables quick semantic similarity searches. Additionally, BM25 ranking is utilized to improve keyword-based retrieval, ensuring that documents are ranked according to both contextual meaning and exact keyword matches. A hybrid retrieval approach merges FAISS-based semantic search with BM25 scoring for better accuracy. To further enhance the results, an LLM-powered hypothetical document is generated using DeepSeek-R1-Distill-Qwen-1.5B, offering context-aware search improvements. Finally, the retrieved documents are presented with citations, ensuring transparency and traceability for research, enterprise search and AI-driven Q&A applications.

Workflow and Methodology

Workflow:

Step 1: Data Ingestion

  • Manually upload PDF files or load them from Google Drive in Colab.

Step 2: Text Extraction & Cleaning

  • Utilize PyMuPDF (fitz) to extract text from PDFs.
  • Replace tab characters (\t) with spaces for better formatting.

Step 3: Text Splitting & Chunking

  • Apply LangChain’s RecursiveCharacterTextSplitter to divide text into manageable chunks while maintaining context.

Step 4: Embedding Generation & Vector Storage

  • Convert text chunks into vector embeddings using Hugging Face’s MiniLM model.
  • Store these embeddings in FAISS, enabling fast semantic similarity searches.

Step 5: Keyword-Based Retrieval with BM25

  • Compute BM25 scores to rank documents based on keyword similarity.

Step 6: Hybrid Retrieval (FAISS + BM25)

  • Perform FAISS-based similarity search to retrieve semantically relevant documents.
  • Refine search results using BM25 keyword ranking to ensure both semantic and keyword relevance.

Step 7: LLM-Powered Hypothetical Document Generation

  • Use DeepSeek-R1-Distill-Qwen-1.5B to generate a hypothetical document based on the user’s query.
  • Use this document to enhance retrieval by adding context-aware results.

Step 8: Retrieval with Citations

  • Attach source citations to retrieved documents, including page numbers (if available).
  • Display results in a structured format for easy interpretation and traceability.

Methodology

  • PDF Processing and Text Cleaning: Extract text from PDFs and remove unnecessary characters before splitting it into smaller chunks.
  • Vector Embeddings: Encode the text chunks as dense vectors using the MiniLM embedding model.
  • FAISS-Based Semantic Search: Store the embeddings in FAISS and retrieve them through similarity-based search.
  • BM25 Keyword Matching: Apply BM25 ranking to score documents by keyword relevance.
  • Hybrid Retrieval: Combine FAISS similarity search with BM25 scoring to improve accuracy.
  • Hypothetical Answer Generation: Use DeepSeek-R1-Distill-Qwen-1.5B to produce a context-rich hypothetical document that enhances retrieval quality.
  • Citations of Results: Attach source references to retrieved documents for traceability and validation.

Data Collection and Preparation

Data Collection

  1. Users manually upload PDF files through Google Colab using files.upload().
  2. Alternatively, files can be loaded from Google Drive by mounting them in Colab.

Data Preparation Workflow

  1. Upload PDFs via Google Colab or Google Drive.
  2. Extract text using PyMuPDF (fitz).
  3. Clean text by removing tab characters (\t).
  4. Split text into chunks using LangChain’s RecursiveCharacterTextSplitter.
  5. Generate embeddings with Hugging Face’s MiniLM model.
  6. Store embeddings in FAISS for fast semantic search.

This process ensures clean, structured and retrievable data for the system.
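
The sketch below shows how these preparation steps line up in code. It is only an at-a-glance view under assumptions: "example.pdf" is a placeholder file name and the chunk sizes are simply the defaults used later; the full, reusable implementation appears in the Code Explanation section.

# Data preparation at a glance (full implementation in the Code Explanation section)
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

docs = PyPDFLoader("example.pdf").load()                      # steps 1-2: load PDF and extract text
for d in docs:
    d.page_content = d.page_content.replace('\t', ' ')        # step 3: clean tab characters
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)                        # step 4: split into chunks
emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_documents(chunks, emb)                      # steps 5-6: embed and store in FAISS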

Code Explanation

Step 1:

Installing Required Libraries

This code installs essential libraries for building an AI-powered retrieval system. It includes LangChain, FAISS, PyMuPDF, Sentence Transformers and Rank-BM25 for text embeddings, vector search and document processing. It also installs PyPDF for handling PDF files.

!pip install langchain langchain-community langchain-openai langchain-cohere sentence-transformers faiss-cpu PyMuPDF rank-bm25 openai transformers torch accelerate
!pip install pypdf

Importing Libraries for Document Processing & AI Retrieval

This code imports essential libraries for PDF reading, text processing and AI-powered retrieval. It uses PyMuPDF (fitz) for PDFs, LangChain for text splitting and embeddings, FAISS for vector storage and BM25 for keyword-based search. It also includes HuggingFace Transformers for LLM-based text generation and retrieval.

import os
import fitz  # PyMuPDF for PDF reading
import textwrap
import numpy as np
from typing import List
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from rank_bm25 import BM25Okapi
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

Step 2:

Mounting Google Drive

This code mounts Google Drive to Colab, allowing access to files stored in Drive. The mounted directory is /content/drive, enabling seamless file handling.

from google.colab import drive
drive.mount('/content/drive')

Uploading PDF Files in Colab

This code allows manual PDF file uploads in Google Colab. It uses files.upload() to let users select files, then extracts their filenames into pdf_paths for further processing.

from google.colab import files
uploaded = files.upload()  # Upload PDF files manually
pdf_paths = list(uploaded.keys())  # Get the uploaded file names
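
If the PDFs are already stored in Google Drive (mounted in the previous step), they can be collected with a simple glob instead of a manual upload. This is a minimal sketch; the folder path below is a placeholder and should be changed to wherever your PDFs actually live.

import glob, os

# Hypothetical Drive folder containing the PDFs -- adjust this path to your own setup
drive_pdf_folder = "/content/drive/MyDrive/pdfs"
pdf_paths = glob.glob(os.path.join(drive_pdf_folder, "*.pdf"))
print(f"Found {len(pdf_paths)} PDF(s):", pdf_paths)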

Step 3:

Loading a Pretrained LLM for Text Generation

This code loads the DeepSeek-R1-Distill-Qwen-1.5B model for text generation using Hugging Face. It initializes the tokenizer, loads the model with optimized settings (float16 precision and automatic device mapping) and creates an inference pipeline for generating text.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # You can change this
# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto", torch_dtype=torch.float16
)
# Create inference pipeline (streaming disabled)
llm_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)
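
Before wiring the model into retrieval, a quick sanity check confirms that the pipeline generates text. This is an optional sketch; the prompt and max_new_tokens value are only illustrative.

# Optional sanity check: generate a short completion to confirm the model loaded correctly
sample = llm_pipeline("Briefly explain what retrieval-augmented generation is.", max_new_tokens=60, truncation=True)
print(sample[0]["generated_text"])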

Step 4:

Processing PDFs and Creating a FAISS Vector Store

This code processes multiple PDFs into a FAISS vector store for efficient similarity search. It loads PDFs using PyPDFLoader, splits the text into chunks with overlap for better retrieval and cleans the text by replacing tab characters (\t) with spaces. The cleaned text is then converted into embeddings using HuggingFace's MiniLM model and stored in FAISS for fast retrieval. Finally, it confirms the successful creation of the vector store.

def replace_t_with_space(list_of_documents):
    """Replaces all tab characters ('\t') with spaces in document content."""
    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')
    return list_of_documents

def encode_multiple_pdfs(paths, chunk_size=1000, chunk_overlap=200):
    """Encodes multiple PDFs into a single FAISS vector store."""
    all_documents = []
    for path in paths:
        loader = PyPDFLoader(path)
        documents = loader.load()
        all_documents.extend(documents)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(all_documents)
    cleaned_texts = replace_t_with_space(texts)
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)
    return vectorstore

vector_store = encode_multiple_pdfs(pdf_paths)
print("✅ FAISS vector store created successfully!")

Step 5:

Hybrid Retrieval Using FAISS and BM25

This code combines FAISS (semantic search) and BM25 (keyword-based search) for better document retrieval. The bm25_retrieval function scores chunks against a BM25 index built over the corpus and returns the top k matches. The hybrid_retrieval function retrieves semantically similar documents with FAISS, pulls the top keyword matches from the same corpus with BM25, and merges the two result sets (deduplicated by chunk content) for a balanced mix of semantic and keyword-based relevance. This approach improves search accuracy by leveraging both methods.

def bm25_retrieval(bm25, cleaned_texts, query, k=5):
    """Perform BM25 retrieval and return the top k cleaned text chunks.

    `bm25` must be a BM25Okapi index built over `cleaned_texts` in the same order.
    """
    query_tokens = query.split()
    bm25_scores = bm25.get_scores(query_tokens)
    top_k_indices = np.argsort(bm25_scores)[::-1][:k]
    return [cleaned_texts[i] for i in top_k_indices]

def hybrid_retrieval(query, vectorstore, bm25, cleaned_texts, k=3):
    """Combines FAISS semantic search and BM25 keyword-based search over the same corpus."""
    faiss_docs = vectorstore.similarity_search(query, k=k)
    bm25_docs = bm25_retrieval(bm25, cleaned_texts, query, k=k)
    # Merge the two result sets, deduplicating by chunk content
    merged = {doc.page_content: doc for doc in faiss_docs + bm25_docs}
    return list(merged.values())
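
The functions above assume a BM25 index that was built over the same cleaned chunks that went into FAISS. Below is a minimal sketch of one way to construct it with rank_bm25's BM25Okapi and run a hybrid query; it re-creates the chunk list the same way encode_multiple_pdfs prepares it (in practice you could simply return cleaned_texts from that function), and the query string is only an example.

# Re-create the cleaned chunk list exactly as encode_multiple_pdfs prepares it
# (alternatively, modify encode_multiple_pdfs to also return cleaned_texts).
all_docs = []
for path in pdf_paths:
    all_docs.extend(PyPDFLoader(path).load())
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, length_function=len)
cleaned_texts = replace_t_with_space(splitter.split_documents(all_docs))

# Build the BM25 index over whitespace-tokenized chunk texts (same order as cleaned_texts)
bm25 = BM25Okapi([doc.page_content.split() for doc in cleaned_texts])

# Example hybrid query combining FAISS and BM25 results
hybrid_docs = hybrid_retrieval("What is the main cause of climate change?", vector_store, bm25, cleaned_texts, k=3)
for doc in hybrid_docs:
    print(doc.metadata.get("page", "?"), "-", doc.page_content[:120])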

Generating Hypothetical Documents Using LLM

This function generates a hypothetical document using the DeepSeek/Qwen model. It constructs a prompt based on the given query and requests a response with a specified character limit (chunk_size). The model generates text in response to the query, ensuring a detailed and relevant answer. This is useful for enhancing retrieval systems by creating additional context.

def generate_hypothetical_document(query, chunk_size=500):
    """Generate a hypothetical document using DeepSeek/Qwen."""
    prompt = f"Given the question '{query}', generate a detailed answer of {chunk_size} characters."
    response = ""
    # The pipeline returns a list of generation dicts; note that max_length is counted in tokens, not characters.
    for output in llm_pipeline(prompt, max_length=chunk_size, truncation=True):
        response += output["generated_text"]
    return response
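
For example, the function can be called directly on a sample query to inspect the raw generated text; the question below is purely illustrative, and the real pipeline invokes this function inside retrieve_with_citations.

# Illustrative direct call to inspect the hypothetical document
hypo = generate_hypothetical_document("How do greenhouse gases trap heat?")
print(hypo[:300])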

Step 6:

Retrieving Documents with Citations

This function retrieves relevant documents from a FAISS vector store and attaches source citations. It first generates a hypothetical document using an LLM based on the query, then finds similar documents using semantic search. The results are formatted with citations, including page numbers when available. This approach enhances document retrieval by providing contextual answers with references.

def retrieve_with_citations(query, vectorstore, k=3):
    """Retrieve documents and attach source citations."""
    hypothetical_doc = generate_hypothetical_document(query)
    similar_docs = vectorstore.similarity_search(hypothetical_doc, k=k)
    citations = []
    for i, doc in enumerate(similar_docs):
        citations.append(f"[{i+1}] (Page {doc.metadata.get('page', 'Unknown')})")
    return similar_docs, hypothetical_doc, citations

Testing the Retrieval System

This code tests the retrieval system by querying it with "What is the main cause of climate change?". It first generates a hypothetical document using an LLM, then retrieves similar documents from the FAISS vector store based on semantic similarity. Finally, it attaches citations to the retrieved documents, ensuring accurate and contextually relevant results with proper references.

test_query = "What is the main cause of climate change?"
retrieved_docs, hypothetical_doc, citations = retrieve_with_citations(test_query, vector_store)

Displaying the Hypothetical Answer

This code prints the hypothetical answer generated by the LLM in a readable format. It uses textwrap.fill() to wrap the text at 100 characters per line, ensuring better readability. This helps present the AI-generated response in a structured way.

# Display Results
print("🔹 Hypothetical Answer:")
print(textwrap.fill(hypothetical_doc, width=100))

Displaying Retrieved Documents with Citations

This code prints the retrieved documents along with their citations for better reference. It loops through the results, displaying the source identifier (e.g., [1] (Page X)) and neatly formatting the text using textwrap.fill() to maintain readability. This ensures the retrieved content is structured and easy to review.

print("\n🔹 hypothetical Documents:")
for i, doc in enumerate(retrieved_docs):
print(f"📌 {citations[i]}")
print(textwrap.fill(doc.page_content, width=100))
print("\n")

Conclusion

This project efficiently combines FAISS-based semantic search, BM25 keyword ranking and LLM-powered text generation to enhance document retrieval from PDFs. By leveraging text embeddings, hybrid retrieval and citation-based results, it ensures accurate, context-aware and transparent search outcomes. The integration of DeepSeek-R1-Distill-Qwen-1.5B further improves retrieval by generating hypothetical responses for better context matching. This system is highly useful for academic research, legal document retrieval, enterprise knowledge search and AI-powered Q&A applications, making it a scalable and intelligent retrieval solution.

Challenges New Coders Might Face

  • Challenge: Handling Large PDF Files
    Solution: Use batch processing, split documents into smaller chunks and utilize Colab's high-RAM runtime for better performance.

  • Challenge: Slow Embedding Generation
    Solution: Use optimized models like all-MiniLM-L6-v2, process text in parallel and enable GPU acceleration in Colab.

  • Challenge: FAISS Search Accuracy Issues
    Solution: Adjust chunk size & overlap, fine-tune embedding models and combine FAISS with BM25 for hybrid retrieval.

  • Challenge: Query Mismatch in Retrieval
    Solution: Use LLM-generated hypothetical documents to refine search queries and improve retrieval accuracy.

  • Challenge: Dependency Installation Issues
    Solution: Ensure Python 3.8+, install dependencies with !pip install --upgrade and use virtual environments for package management.

FAQ

Question 1: How does this project combine FAISS and BM25 for document retrieval?
Answer: This project uses FAISS for semantic similarity search and BM25 for keyword-based ranking. The hybrid retrieval approach ensures both contextual meaning and exact keyword matches, improving search accuracy.

Question 2: What preprocessing steps are applied to PDFs before retrieval?
Answer: The PDF text is extracted using PyMuPDF (fitz), cleaned by removing tab characters (\t) and then split into manageable chunks using LangChain’s RecursiveCharacterTextSplitter for efficient processing.

Question 3: How are embeddings generated for document retrieval?
Answer: Text chunks are converted into vector embeddings using Hugging Face’s MiniLM model, which enables fast and accurate semantic search using FAISS.

Question 4: How does the LLM-generated hypothetical document improve retrieval?
Answer: The LLM (DeepSeek-R1-Distill-Qwen-1.5B) generates a context-aware response based on the query. This hypothetical document helps retrieve more relevant documents by providing a better context match for FAISS.

Question 5: Why are some retrieved documents not relevant to my query?
Answer: FAISS relies on vector similarity, which may not always align with keyword intent. Combining it with BM25 ranking and using hypothetical document generation refines the search for better results.
