
Document Augmentation through Question Generation for Enhanced Retrieval
This project focuses on document retrieval enhancement through text augmentation via question generation. The method aims to improve document search systems by generating additional questions from text content, which increases the chance of retrieving the most relevant text fragments. These fragments then serve as the context for generative question-answering tasks, using OpenAI's language models to produce answers from documents.
Project Overview
The implementation demonstrates a document augmentation technique integrating question generation to enhance document retrieval in a vector database. Generating questions from text fragments improves the accuracy of finding relevant document sections. The pipeline incorporates PDF processing, question augmentation, FAISS vector store creation and retrieval of documents for answer generation. The approach significantly enriches the retrieval process, ensuring better comprehension and more precise answers, leveraging OpenAI's models for improved question generation and semantic search.
Prerequisites
- Python 3.8+ (for compatibility with LangChain, OpenAI API and FAISS)
- Google Colab or Local Machine (for execution environment)
- OpenAI API Key (for generating embeddings and using the GPT-4o-mini model)
- LangChain (for document processing and retrieval logic)
- FAISS (for storing and retrieving document embeddings)
- PyPDF2 (for PDF document reading and conversion to text)
- Pydantic (for data modeling and validation)
- langchain-openai (for OpenAI model integration with LangChain)
Approach
The approach of this project revolves around using OpenAI's language models to automatically generate questions that enhance document retrieval. Initially, the content of a document, typically in PDF format, is extracted and split into smaller, manageable chunks based on token size and overlap. Each chunk is processed to generate relevant questions, either at the fragment level or the document level, depending on the configuration. The generated questions are then used to augment the document fragments. FAISS is employed to create a vector store where these augmented fragments and questions are embedded for efficient similarity search. Once the documents are processed and indexed, a retriever is created to fetch the most relevant fragments in response to a user query. The retriever embeds the query to identify similar fragments in the document store, and the context of the most relevant fragment is then used to generate an accurate, concise answer. This approach improves search relevance and ensures that answers are grounded in the document content.
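Condensed, this flow amounts to a few calls to the functions defined later in this document (a high-level sketch; the PDF path and query are placeholders):
content = read_pdf_to_string("Climate_Change.pdf")           # extract the PDF text
retriever = process_documents(content, OpenAIEmbeddings())   # split, augment with questions, index in FAISS
docs = retriever.invoke("What is climate change?")           # fetch the most relevant fragment
answer = generate_answer(docs[0].metadata["text"], "What is climate change?")  # answer from the fragment's parent context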
Workflow and Methodology
Workflow
- Document Input: A PDF document is provided for processing.
- Document Extraction: The document content is extracted into text using PyPDF2.
- Text Splitting: The extracted text is split into smaller fragments based on specified token limits.
- Question Generation: Questions are automatically generated from the document or fragments using OpenAI’s GPT-4o-mini model.
- Vectorization: The document fragments and generated questions are embedded using OpenAI's embeddings model.
- Indexing: FAISS is used to create a vector store that indexes the embedded fragments and questions for efficient retrieval.
- Query Handling: A user query is provided and the retriever searches for the most relevant fragments based on the query.
- Answer Generation: The context of the most relevant fragment is used to generate a precise answer using the language model.
Methodology
- Document Processing: Split the document into smaller chunks to handle large content efficiently.
- Question Generation: Use OpenAI's GPT-4o-mini model to generate questions that are contextually relevant and answerable from the document.
- FAISS Vector Store: Embed the document fragments and questions, storing them in a FAISS vector store for fast retrieval.
- Query Embedding: The user query is embedded to identify the most relevant documents from the vector store.
- Retrieval and Answering: Retrieve the most relevant fragments from the store and generate an answer using the context of those fragments. This ensures the answer is directly tied to the content of the document.
Data Collection and Preparation
Data Collection
The PDF document used in the example is named "Climate_Change.pdf". It is located at the path:
/content/drive/MyDrive/Document Augmentation through Question Generation for Enhanced Retrieval/Climate_Change.pdf
Data Preparation Workflow
- Collect PDFs: Gather documents.
- Extract Text: Use PyPDF2 to extract text.
- Split Documents: Break text into chunks.
- Generate Questions: Use GPT-4o-mini for question generation.
- Clean Questions: Filter and validate questions.
- Generate Embeddings: Convert to embeddings.
- Create FAISS Store: Store embeddings for search.
- Index Data: Prepare for query retrieval.
Code Explanation
Installing Required Libraries
This command installs several Python libraries. LangChain and OpenAI help work with language models, FAISS-CPU is for efficient similarity search, PyPDF2 is used for reading and manipulating PDFs and Pydantic is for data validation and settings management.
!pip install langchain openai faiss-cpu PyPDF2 pydantic
Upgrading langchain-community
This command upgrades the langchain-community library to the latest version. It ensures you have the most recent features and updates for building language model applications with community enhancements.
!pip install -U langchain-community
Installing langchain-openai
This command installs the langchain-openai library, which integrates OpenAI's models with the LangChain framework. It allows you to use OpenAI's language models for various tasks like natural language processing and conversational AI.
!pip install langchain-openai
Mounting Google Drive in Colab
This code mounts your Google Drive to the Colab environment, allowing you to access files stored in your drive. After running it, you'll be able to interact with your drive's contents directly within Colab under the /content/drive directory.
from google.colab import drive
drive.mount('/content/drive')
Setting Up OpenAI API Key and Libraries
This code sets up the necessary libraries and configurations to use OpenAI's models in a project. It imports essential modules like langchain for language processing, FAISS for vector storage and OpenAIEmbeddings for embedding generation. The script loads the OpenAI API key either from the Colab secrets or a .env file, ensuring secure access to the API. If the API key is missing, it raises an error to prompt the user to add it.
from typing import Any, Dict, List, Tuple
from langchain.docstore.document import Document
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from enum import Enum
import re
import os
import sys
from dotenv import load_dotenv

# Import necessary modules from pydantic
from pydantic import BaseModel, Field

# Import PromptTemplate
from langchain.prompts import PromptTemplate

try:
    from google.colab import userdata
    api_key = userdata.get("OPENAI_API_KEY")
except ImportError:
    api_key = None  # Not running in Colab

if not api_key:
    load_dotenv()
    api_key = os.getenv("OPENAI_API_KEY")

if api_key:
    os.environ["OPENAI_API_KEY"] = api_key  # Make the key available to the OpenAI/LangChain clients
else:
    raise ValueError("❌ OpenAI API Key is missing! Add it to Colab Secrets or .env file.")

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
print("OPENAI_API_KEY setup completed successfully!")
Configuring Question Generation and Token Limits
This code sets the level of question generation (document or fragment level) and defines token limits for documents and fragments. It also specifies the number of questions to generate per document or fragment.
class QuestionGeneration(Enum):
    """
    Enum class to specify the level of question generation for document processing.

    Attributes:
        DOCUMENT_LEVEL (int): Represents question generation at the entire document level.
        FRAGMENT_LEVEL (int): Represents question generation at the individual text fragment level.
    """
    DOCUMENT_LEVEL = 1
    FRAGMENT_LEVEL = 2

# Depending on the model: for Mistral 7B the context can be at most 8k tokens, for Llama 3.1 8B up to 128k
DOCUMENT_MAX_TOKENS = 4000
DOCUMENT_OVERLAP_TOKENS = 100

# Embeddings and text similarity are calculated on shorter texts
FRAGMENT_MAX_TOKENS = 128
FRAGMENT_OVERLAP_TOKENS = 16

# Questions are generated at the document or fragment level
QUESTION_GENERATION = QuestionGeneration.DOCUMENT_LEVEL

# How many questions will be generated for a specific document or fragment
QUESTIONS_PER_DOCUMENT = 40
Creating Question List Model and OpenAI Embeddings Wrapper
The QuestionList class is a Pydantic model that holds a list of generated questions, which could be used for document or fragment processing. The “OpenAIEmbeddingsWrapper” is a wrapper around the OpenAIEmbeddings class that allows an instance to be used as a callable. It generates embeddings for a query string using the embed_query method and returns the result as a list of floats. This wrapper provides a similar interface to another embedding class (OllamaEmbeddings).
class QuestionList(BaseModel):
    question_list: List[str] = Field(..., title="List of questions generated for the document or fragment")

class OpenAIEmbeddingsWrapper(OpenAIEmbeddings):
    """
    A wrapper class for OpenAI embeddings, providing a similar interface to the original OllamaEmbeddings.
    """
    def __call__(self, query: str) -> List[float]:
        """
        Allows the instance to be used as a callable to generate an embedding for a query.

        Args:
            query (str): The query string to be embedded.

        Returns:
            List[float]: The embedding for the query as a list of floats.
        """
        return self.embed_query(query)
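For illustration, the wrapper instance can be called directly like a function (a minimal usage sketch; it assumes OPENAI_API_KEY is already set):
embeddings = OpenAIEmbeddingsWrapper()
vector = embeddings("What causes climate change?")  # delegates to embed_query, returns List[float]
print(len(vector))  # embedding dimensionality, e.g. 1536 for text-embedding-ada-002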
Cleaning and Filtering Questions
This function cleans a list of questions by stripping leading number prefixes with a regular expression and keeps only those entries that end with a question mark.
def clean_and_filter_questions(questions: List[str]) -> List[str]:
    """
    Cleans and filters a list of questions.

    Args:
        questions (List[str]): A list of questions to be cleaned and filtered.

    Returns:
        List[str]: A list of cleaned and filtered questions that end with a question mark.
    """
    cleaned_questions = []
    for question in questions:
        # Strip leading numbering such as "1. " before checking for a question mark
        cleaned_question = re.sub(r'^\d+\.\s*', '', question.strip())
        if cleaned_question.endswith('?'):
            cleaned_questions.append(cleaned_question)
    return cleaned_questions
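A quick illustration with hypothetical input shows how numbered prefixes are stripped and non-questions are dropped:
raw = ["1. What is climate change?", "2. Summary of the document", "3. How do oceans absorb CO2?"]
print(clean_and_filter_questions(raw))
# ['What is climate change?', 'How do oceans absorb CO2?']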
Generating and Filtering Questions
This function uses OpenAI’s GPT-4o-mini model to generate a list of questions from the provided text, ensuring the questions are answerable from the context. It filters out unwanted questions and returns a unique list of valid ones.
def generate_questions(text: str) -> List[str]:
    """
    Generates a list of questions based on the provided text using OpenAI.

    Args:
        text (str): The context data from which questions are generated.

    Returns:
        List[str]: A list of unique, filtered questions.
    """
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = PromptTemplate(
        input_variables=["context", "num_questions"],
        template="Using the context data: {context}\n\nGenerate a list of at least {num_questions} "
                 "possible questions that can be asked about this context. Ensure the questions are "
                 "directly answerable within the context and do not include any answers or headers. "
                 "Separate the questions with a new line character."
    )
    chain = prompt | llm.with_structured_output(QuestionList)
    input_data = {"context": text, "num_questions": QUESTIONS_PER_DOCUMENT}
    result = chain.invoke(input_data)

    # Extract the list of questions from the QuestionList object
    questions = result.question_list
    filtered_questions = clean_and_filter_questions(questions)
    return list(set(filtered_questions))
Generating Answers Based on Context
This function uses OpenAI’s GPT-4o-mini model to generate a precise answer to a given question, using the provided context. It formats the input with a prompt template and returns the answer text based on the context.
def generate_answer(content: str, question: str) -> str:
    """
    Generates an answer to a given question based on the provided context using OpenAI.

    Args:
        content (str): The context data used to generate the answer.
        question (str): The question for which the answer is generated.

    Returns:
        str: The precise answer to the question based on the provided context.
    """
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="Using the context data: {context}\n\nProvide a brief and precise answer to the question: {question}"
    )
    chain = prompt | llm
    input_data = {"context": content, "question": question}
    # chain.invoke returns an AIMessage; return its text content to match the declared str return type
    return chain.invoke(input_data).content
Splitting a Document into Chunks
This function splits a document into smaller chunks of text based on the specified chunk size and overlap. It breaks the document into tokens, ensuring each chunk overlaps with the next and returns a list of text chunks.
def split_document(document: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """
    Splits a document into smaller chunks of text.

    Args:
        document (str): The text of the document to be split.
        chunk_size (int): The size of each chunk in terms of the number of tokens.
        chunk_overlap (int): The number of overlapping tokens between consecutive chunks.

    Returns:
        List[str]: A list of text chunks, where each chunk is a string of the document content.
    """
    # Tokenize on word boundaries; each chunk starts chunk_size - chunk_overlap tokens after the previous one
    tokens = re.findall(r'\b\w+\b', document)
    chunks = []
    for i in range(0, len(tokens), chunk_size - chunk_overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(chunk_tokens)
        if i + chunk_size >= len(tokens):
            break
    return [" ".join(chunk) for chunk in chunks]
Printing Document with Comment
This function prints a comment followed by the document's content. It includes metadata such as the document's type and index, along with the actual content.
def print_document(comment: str, document: Any) -> None:
    """
    Prints a comment followed by the content of a document.

    Args:
        comment (str): The comment or description to print before the document details.
        document (Any): The document whose content is to be printed.

    Returns:
        None
    """
    print(f'{comment} (type: {document.metadata["type"]}, index: {document.metadata["index"]}): {document.page_content}')
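A quick sanity check with a hypothetical document illustrates the metadata keys the function expects:
doc = Document(
    page_content="Rising temperatures affect rainfall patterns.",
    metadata={"type": "ORIGINAL", "index": 0, "text": "full parent document text"}
)
print_document("Sample", doc)
# Sample (type: ORIGINAL, index: 0): Rising temperatures affect rainfall patterns.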
Running the Document Processing Pipeline
This code demonstrates a complete pipeline for processing documents using OpenAI's embeddings and language models. It generates questions from a sample document, provides an answer to one of those questions, splits the document into smaller chunks and generates embeddings for both the document and a sample query. It prints the generated questions, the answers and the document chunks for further analysis.
# Initialize OpenAIEmbeddings
embeddings = OpenAIEmbeddingsWrapper()

# Example document
example_text = "This is an example document. It contains information about various topics."

# Generate questions
questions = generate_questions(example_text)
print("Generated Questions:")
for q in questions:
    print(f"- {q}")

# Generate an answer
sample_question = questions[0] if questions else "What is this document about?"
answer = generate_answer(example_text, sample_question)
print(f"\nQuestion: {sample_question}")
print(f"Answer: {answer}")

# Split document
chunks = split_document(example_text, chunk_size=10, chunk_overlap=2)
print("\nDocument Chunks:")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: {chunk}")

# Example of using OpenAIEmbeddings
doc_embedding = embeddings.embed_documents([example_text])
query_embedding = embeddings.embed_query("What is the main topic?")
print("\nDocument Embedding (first 5 elements):", doc_embedding[0][:5])
print("Query Embedding (first 5 elements):", query_embedding[:5])
Processing Documents and Creating a Retriever
This function processes the document content by splitting it into smaller fragments, generating questions at the document or fragment level, and wrapping both the fragments and the questions in Document objects with metadata. It then calculates embeddings for these documents, stores them in a FAISS vector store for efficient similarity search, and returns a retriever that fetches the most relevant document from the store.
def process_documents(content: str, embedding_model: OpenAIEmbeddings):
    """
    Process the document content, split it into fragments, generate questions,
    create a FAISS vector store and return a retriever.

    Args:
        content (str): The content of the document to process.
        embedding_model (OpenAIEmbeddings): The embedding model to use for vectorization.

    Returns:
        VectorStoreRetriever: A retriever for the most relevant FAISS document.
    """
    # Split the whole text content into text documents
    text_documents = split_document(content, DOCUMENT_MAX_TOKENS, DOCUMENT_OVERLAP_TOKENS)
    print(f'Text content split into: {len(text_documents)} documents')

    documents = []
    counter = 0
    for i, text_document in enumerate(text_documents):
        text_fragments = split_document(text_document, FRAGMENT_MAX_TOKENS, FRAGMENT_OVERLAP_TOKENS)
        print(f'Text document {i} - split into: {len(text_fragments)} fragments')
        for j, text_fragment in enumerate(text_fragments):
            documents.append(Document(
                page_content=text_fragment,
                metadata={"type": "ORIGINAL", "index": counter, "text": text_document}
            ))
            counter += 1
            if QUESTION_GENERATION == QuestionGeneration.FRAGMENT_LEVEL:
                questions = generate_questions(text_fragment)
                documents.extend([
                    Document(page_content=question, metadata={"type": "AUGMENTED", "index": counter + idx, "text": text_document})
                    for idx, question in enumerate(questions)
                ])
                counter += len(questions)
                print(f'Text document {i} Text fragment {j} - generated: {len(questions)} questions')
        if QUESTION_GENERATION == QuestionGeneration.DOCUMENT_LEVEL:
            questions = generate_questions(text_document)
            documents.extend([
                Document(page_content=question, metadata={"type": "AUGMENTED", "index": counter + idx, "text": text_document})
                for idx, question in enumerate(questions)
            ])
            counter += len(questions)
            print(f'Text document {i} - generated: {len(questions)} questions')

    for document in documents:
        print_document("Dataset", document)

    print(f'Creating store, calculating embeddings for {len(documents)} FAISS documents')
    vectorstore = FAISS.from_documents(documents, embedding_model)

    print("Creating retriever returning the most relevant FAISS document")
    return vectorstore.as_retriever(search_kwargs={"k": 1})
Reading a PDF, Processing Documents and Using a Retriever
This code reads a PDF file, extracts its content, processes the content by generating questions and splitting it into fragments and then creates a retriever using FAISS for document retrieval. It uses the OpenAIEmbeddings model to calculate embeddings for the document and later retrieves the most relevant document based on a query. The result is printed with the query and the retrieved document content.
from PyPDF2 import PdfReader

def read_pdf_to_string(path):
    """Reads a PDF file and returns its content as a string.

    Args:
        path (str): The path to the PDF file.

    Returns:
        str: The content of the PDF file as a string.
    """
    with open(path, 'rb') as file:
        reader = PdfReader(file)
        num_pages = len(reader.pages)
        text = ""
        for page_num in range(num_pages):
            page = reader.pages[page_num]
            text += page.extract_text()
    return text

# Load sample PDF document to string variable
path = "/content/drive/MyDrive/Document Augmentation through Question Generation for Enhanced Retrieval/Climate_Change.pdf"
content = read_pdf_to_string(path)

# Instantiate OpenAI Embeddings class that will be used by FAISS
embedding_model = OpenAIEmbeddings()

# Process documents and create retriever
document_query_retriever = process_documents(content, embedding_model)

# Example usage of the retriever
query = "What is climate change?"
retrieved_docs = document_query_retriever.get_relevant_documents(query)
print(f"\nQuery: {query}")
print(f"Retrieved document: {retrieved_docs[0].page_content}")
Retrieving Relevant Documents Based on Query
This code takes a query about how freshwater ecosystems are affected by climatic changes, retrieves relevant document fragments using the previously created retriever and prints them. It utilizes the document_query_retriever created earlier.
query = "How do freshwater ecosystems change due to alterations in climatic factors?"
print (f'Question:{os.linesep}{query}{os.linesep}')
retrieved_documents = document_query_retriever.invoke(query)
for doc in retrieved_documents:
print_document("Relevant fragment retrieved", doc)
Generating and Printing an Answer Based on Context
This code retrieves the context of a document fragment and uses it to generate an answer to the query. It first prints the context and then calls the generate_answer function to respond, displaying both the context and the generated answer.
context = doc.metadata['text']
print(f'{os.linesep}Context:{os.linesep}{context}')

answer = generate_answer(context, query)
print(f'{os.linesep}Answer:{os.linesep}{answer}')
Conclusion
This project successfully demonstrates how document processing, question generation and document retrieval can enhance search systems. By leveraging OpenAI's GPT-4 for question generation and FAISS for fast similarity search, it ensures that relevant information is retrieved efficiently and accurately. The system’s ability to process large documents, generate contextual questions and provide precise answers based on user queries showcases its potential for improving knowledge extraction, document accessibility and information retrieval in various domains like research, business intelligence and content management.
Challenges New Coders Might Face
Challenge: Handling Large Documents
Solution: To tackle this, split documents into smaller fragments using a token-based approach. This ensures that the system can process and generate relevant questions for manageable sections of text.
Challenge: Inaccurate Text Extraction from PDFs
Solution: Use specialized tools like OCR (Optical Character Recognition) for image-based PDFs, or consider cleaning the extracted text to improve accuracy before further processing.
Challenge: Generating Contextually Relevant Questions
Solution: Fine-tune the question generation model by providing better prompt templates or adjusting parameters like temperature to control creativity and specificity in the generated questions.
Challenge: Missing or Incorrect API Key
Solution: Ensure that the OpenAI API key is correctly stored and loaded, either through Colab secrets or a .env file. Implement error handling to check for the API key before processing begins, providing clear instructions if it's missing or invalid.
Challenge: Dependency Installation Issues
Solution: Ensure Python 3.8+, install dependencies with !pip install --upgrade, and use virtual environments for package management.
FAQ
Question 1: What is document augmentation through question generation?
Answer: Document augmentation through question generation involves creating questions from document content to improve document retrieval and enhance information extraction. This method uses AI models like OpenAI's GPT-4 to generate relevant questions, which can be used to retrieve more precise information from the document.
Question 2: How does FAISS improve search efficiency?
Answer: FAISS (Facebook AI Similarity Search) is an optimized vector search library that enables fast and scalable similarity search by storing vector embeddings and retrieving the most relevant matches efficiently.
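For intuition, FAISS can also be used directly on NumPy vectors, independently of LangChain (a minimal sketch; it assumes faiss-cpu and numpy are installed):
import numpy as np
import faiss

d = 8                                               # toy embedding dimensionality
vectors = np.random.rand(100, d).astype("float32")  # 100 stored embeddings
index = faiss.IndexFlatL2(d)                        # exact L2 similarity search
index.add(vectors)                                  # store the embeddings
distances, indices = index.search(vectors[:1], 3)   # 3 nearest neighbours of the first vector
print(indices)                                      # the first hit is the vector itself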
Question 3: Why is my OpenAI API key not working?
Answer: If you see an API authentication error, ensure that:
- You have a valid OpenAI API key.
- The key is stored correctly in Colab Secrets or a .env file.
- You are not exceeding OpenAI’s rate limits or usage quotas.
Question 4: How do I deploy this document retrieval system?
Answer: You can deploy it using Flask, FastAPI, or Streamlit and integrate it with LLMs like GPT-4 for real-time Q&A systems.
Question 5: What are the best alternatives to FAISS for vector search?
Answer: If FAISS is not suitable, you can use alternatives like the following (a sketch of swapping in Chroma appears after this list):
- ChromaDB (for local, scalable vector search)
- Weaviate (for cloud-based semantic search)
- Pinecone (for large-scale AI-powered retrieval)
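As an illustration of swapping the vector store, the FAISS call inside process_documents could be replaced with Chroma's equivalent (a hedged sketch; it assumes pip install chromadb langchain-chroma and is not part of the original pipeline):
from langchain_chroma import Chroma

# Hypothetical drop-in replacement for FAISS.from_documents in process_documents
vectorstore = Chroma.from_documents(documents, embedding_model)
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})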