
Corrective Retrieval-Augmented Generation (RAG) with Dynamic Adjustments

In the rapidly evolving field of artificial intelligence, the ability to retrieve accurate information and generate informed responses is paramount, especially for specialized topics like image recognition using deep neural networks. This project implements a Corrective Retrieval-Augmented Generation (CRAG) system that combines document retrieval, relevance evaluation, and web search to answer queries intelligently. By integrating a pre-loaded PDF document with real-time web data, the system delivers robust, contextually rich responses that adapt to varying levels of document relevance.

Project Overview

The project builds a query-processing pipeline in Python using LangChain, OpenAI's GPT-4o, Sentence Transformers, and DuckDuckGo search. It begins by loading and vectorizing a PDF file ("Image Recognition Using Deep Neural Network") into a Chroma vector store, retrieves candidate chunks from it with similarity search, and scores their relevance with a custom GPT-4o-driven evaluator. Depending on the relevance score, the system either uses the retrieved document, fetches refined knowledge from the web, or combines both, generating a final response with sourced citations—demonstrated through example queries about neural network-based image recognition and object detection training.

Prerequisites

  • Python (version 3.8+) is required to run the project scripts and manage dependencies.
  • Libraries like langchain, openai, chromadb, tiktoken, pypdf, langchain-openai, langchain-community, sentence_transformers, and duckduckgo-search must be installed via pip.
  • An OpenAI API key needs to be configured in Google Colab secrets or a .env file for GPT-4o access.
  • The PDF file "Image Recognition Using Deep Neural Network.pdf" must be accessible at the specified path (e.g., Google Drive).
  • Familiarity with a notebook environment such as Google Colab is helpful for writing, running, and debugging the code.
  • Basic understanding of NLP concepts such as embeddings and vector stores is needed to follow the workflow.
  • Sufficient system resources (CPU/GPU, RAM) are required for efficient document processing and model inference.

Approach

The project takes a systematic approach to implementing the Corrective Retrieval-Augmented Generation (CRAG) system. It first sets up a Python environment with the necessary libraries and an OpenAI API key, then loads and vectorizes the PDF document ("Image Recognition Using Deep Neural Network.pdf") into a Chroma vector store using PyPDFLoader and SentenceTransformerEmbeddings for efficient retrieval. Similarity search over the vector store fetches candidate chunks, followed by a relevance evaluation step powered by GPT-4o, which scores each document on a 0-1 scale to determine the next action. The system refines retrieved or web-sourced knowledge into key points, parses search results into title-link pairs, and generates a final response with citations using a structured prompt and the language model, ensuring adaptability and accuracy. The workflow is built from modular functions, culminating in crag_process, which dynamically adjusts based on query needs, as demonstrated with example queries about image recognition and object detection training.

Workflow and Methodology

Workflow

  • Set up the Python environment by importing required libraries and configuring the OpenAI API key.
  • Load and vectorize the PDF file into a Chroma vector store using encode_pdf.
  • Initialize the GPT-4o language model (llm) and DuckDuckGo search tool (search) for response generation and web queries.
  • Define a query to process through the system.
  • Retrieve relevant documents from the vector store using FAISS similarity search with retrieve_documents.
  • Evaluate the retrieved documents’ relevance to the query using evaluate_documents and GPT-4o scoring.
  • Decide the action based on the highest relevance score.
  • Perform a web search with perform_web_search if needed, rewriting the query and refining results into key points.
  • Combine or select the final knowledge and sources, then generate a response with generate_response.
  • Print the query and final answer with citations for review and validation.

Methodology

  • Environment Setup: Configure Python with libraries like LangChain and OpenAI, securing an API key for GPT-4o access.
  • Document Processing: Use PyPDFLoader to load the PDF, split it into chunks with RecursiveCharacterTextSplitter, and vectorize it with SentenceTransformerEmbeddings into a Chroma store.
  • Retrieval Mechanism: Employ a FAISS index for a fast similarity search to fetch top-k relevant document chunks based on the query.
  • Relevance Evaluation: Implement retrieval_evaluator with GPT-4o to score document relevance on a 0-1 scale, guiding the system’s decision-making.
  • Dynamic Adjustment: Design the corrective logic in crag_process to choose between using the document, performing a web search, or combining both, based on score thresholds (0.7 and 0.3).
  • Web Search Integration: Rewrite queries with rewrite_query, fetch results via DuckDuckGo, and parse them into title-link pairs with parse_search_results.
  • Knowledge Refinement: Extract key points from documents or web results using knowledge_refinement for concise, usable information.
  • Response Generation: Format knowledge and sources into a prompt, leveraging GPT-4o in generate_response to produce a coherent answer with citations.
  • Execution and Output: Run the full pipeline with crag_process, logging steps and displaying the query and response for transparency.

Data Collection and Preparation

Data Collection

  • Obtain the PDF file "Image Recognition Using Deep Neural Network.pdf" as the primary data source.
  • Store the PDF in a specified directory (e.g., "/content/drive/MyDrive/...") accessible to the script.

Data Preparation Workflow

  • Load the PDF using PyPDFLoader from the defined path.
  • Split the document into chunks of 1000 characters with RecursiveCharacterTextSplitter.
  • Generate embeddings for chunks using SentenceTransformerEmbeddings (model: "all-mpnet-base-v2").
  • Create a Chroma vector store with encode_pdf to store the vectorized document.
  • Assign the vector store to vectorstore for retrieval in the CRAG process.

Code Explanation

Mounting Google Drive

This code mounts Google Drive to Colab, allowing access to files stored in Drive. The mounted directory is /content/drive, enabling seamless file handling.

from google.colab import drive
drive.mount('/content/drive')

Installing Core Libraries

This command installs five Python libraries with pip: langchain for building language-model applications, openai for accessing OpenAI's API, chromadb for working with a vector database, tiktoken for tokenizing text efficiently, and pypdf for handling PDF files. These tools are commonly used together for tasks such as natural language processing, document retrieval, and AI-powered search systems.

!pip install langchain openai chromadb tiktoken pypdf

Installing Necessary Packages

These commands install langchain-openai, langchain-community, sentence_transformers, and an updated duckduckgo-search, enabling OpenAI integration with LangChain, community tools, sentence embeddings, and web search functionality for advanced NLP and data retrieval tasks.

!pip install langchain-openai
!pip install langchain-community
!pip install sentence_transformers
!pip install -U duckduckgo-search

Setting Up OpenAI API and Environment

This code sets up a Python environment by importing essential libraries and modules like os, json, and LangChain tools, then retrieves an OpenAI API key from Google Colab secrets or an environment variable. It configures the API key in the environment for use with ChatOpenAI, raising an error if the key is missing, and appends a parent directory to the system path for additional module access. The script ensures proper setup for an NLP project reliant on OpenAI's API, suppressing warnings for cleaner output.

import os
import sys
import json
from typing import List, Tuple
from langchain.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.tools import DuckDuckGoSearchResults
import warnings
warnings.filterwarnings("ignore")
# Read the API key from Colab secrets first, then fall back to an environment variable.
try:
    from google.colab import userdata
    api_key = userdata.get("OPENAI_API_KEY")
except ImportError:
    api_key = None  # Not running in Colab
if not api_key:
    api_key = os.getenv("OPENAI_API_KEY")
if api_key:
    os.environ["OPENAI_API_KEY"] = api_key
else:
    raise ValueError("❌ OpenAI API Key is missing! Add it to Colab Secrets or a .env file.")
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
print("OPENAI_API_KEY setup completed successfully!")

Defining File Path for Project Resource

This code assigns a string to the variable path, specifying the location of a PDF file titled "Image Recognition Using Deep Neural Network" within a nested directory structure under Google Drive. The path points to a resource likely used in a generative AI project focused on Corrective Retrieval-Augmented Generation (RAG) with dynamic adjustments, indicating the file’s relevance to the project’s context.

path = "/content/drive/MyDrive/New 90 Projects/generative_ai_project/Corrective Retrieval-Augmented Generation (RAG) with Dynamic Adjustments/Image Recognition Using Deep Neural Network .pdf"

Processing and Vectorizing PDF Content

This code defines a function encode_pdf that takes a PDF file path, loads the document using PyPDFLoader, and splits it into chunks of 1000 characters with no overlap using RecursiveCharacterTextSplitter. It then generates embeddings for these chunks using the SentenceTransformerEmbeddings model "all-mpnet-base-v2" and stores them in a Chroma vector database, enabling efficient retrieval and analysis of the PDF content for tasks like Retrieval-Augmented Generation (RAG).

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import SentenceTransformerEmbeddings

def encode_pdf(filepath: str):
    """Loads a PDF file, splits it into chunks, and creates a Chroma vector store."""
    loader = PyPDFLoader(filepath)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)
    # Create an instance of the SentenceTransformerEmbeddings class
    embeddings = SentenceTransformerEmbeddings(model_name="all-mpnet-base-v2")
    # Create and return a Chroma vector store using the embeddings object
    return Chroma.from_documents(texts, embedding=embeddings)

Creating Vector Store from PDF

This code calls the encode_pdf function with the previously defined path variable, processing the specified PDF file and returning a Chroma vector store containing the document’s text chunks and their embeddings. The result is assigned to vectorstore, making the PDF’s content searchable and usable for downstream NLP tasks like retrieval or generation.

vectorstore = encode_pdf(path)

Initializing OpenAI Chat Model

This code creates an instance of ChatOpenAI named llm, configured to use the "gpt-4o" model from OpenAI with a maximum output of 1000 tokens and a temperature of 0 for deterministic responses. It sets up a language model for generating precise, controlled text, likely intended for use in tasks like answering queries or processing data from the vector store.

llm = ChatOpenAI(model="gpt-4o", max_tokens=1000, temperature=0)

Setting Up Web Search Tool

This code initializes a DuckDuckGoSearchResults object named search, enabling web searches via the DuckDuckGo engine. It provides a tool to retrieve online information, which can complement tasks like Retrieval-Augmented Generation by fetching additional context or data as needed.

search = DuckDuckGoSearchResults()

Evaluating Document Relevance to Query

This code defines a retrieval_evaluator function that assesses how relevant a document is to a given query, returning a score between 0 and 1. It uses a PromptTemplate to format the query and document for the llm (a ChatOpenAI instance), which outputs a structured RetrievalEvaluatorInput object containing the relevance score, leveraging the language model’s judgment for retrieval accuracy in tasks like RAG.

# Retrieval Evaluator
class RetrievalEvaluatorInput(BaseModel):
    relevance_score: float = Field(..., description="The relevance score of the document to the query. The score should be between 0 and 1.")

def retrieval_evaluator(query: str, document: str) -> float:
    prompt = PromptTemplate(
        input_variables=["query", "document"],
        template="On a scale from 0 to 1, how relevant is the following document to the query? Query: {query}\nDocument: {document}\nRelevance score:"
    )
    chain = prompt | llm.with_structured_output(RetrievalEvaluatorInput)
    input_variables = {"query": query, "document": document}
    result = chain.invoke(input_variables).relevance_score
    return result

Extracting Key Information from Document

This code defines a knowledge_refinement function that processes a document to extract its key points as a list of strings, using a PromptTemplate to instruct the llm (ChatOpenAI) to summarize the content in bullet points. The function structures the output with KnowledgeRefinementInput, invokes the language model with the document, and returns a cleaned list of key points by splitting and stripping the resulting text, enhancing document comprehension for applications like RAG.

# Knowledge Refinement
class KnowledgeRefinementInput(BaseModel):
    key_points: str = Field(..., description="The key points extracted from the document.")

def knowledge_refinement(document: str) -> List[str]:
    prompt = PromptTemplate(
        input_variables=["document"],
        template="Extract the key information from the following document in bullet points:\n{document}\nKey points:"
    )
    chain = prompt | llm.with_structured_output(KnowledgeRefinementInput)
    input_variables = {"document": document}
    result = chain.invoke(input_variables).key_points
    return [point.strip() for point in result.split('\n') if point.strip()]

Optimizing Query for Web Search

This code defines a rewrite_query function that takes a user query and rephrases it using a PromptTemplate and the LLM (ChatOpenAI) to make it more effective for web searching. The function structures the output with QueryRewriterInput, processes the query through the language model, and returns a stripped, rewritten version tailored for better search engine results, useful for enhancing retrieval in tools.

# Web Search Query Rewriter
class QueryRewriterInput(BaseModel):
    query: str = Field(..., description="The query to rewrite.")

def rewrite_query(query: str) -> str:
    prompt = PromptTemplate(
        input_variables=["query"],
        template="Rewrite the following query to make it more suitable for a web search:\n{query}\nRewritten query:"
    )
    chain = prompt | llm.with_structured_output(QueryRewriterInput)
    input_variables = {"query": query}
    return chain.invoke(input_variables).query.strip()

Parsing JSON Search Results into Title-Link Pairs

This code defines a parse_search_results function that takes a JSON-formatted string of search results, parses it, and returns a list of tuples containing each result’s title and link. It uses a try-except block to handle potential json.JSONDecodeError exceptions, returning an empty list if parsing fails, and defaults missing titles to "Untitled" or links to an empty string, making it robust for processing web search outputs.

def parse_search_results(results_string: str) -> List[Tuple[str, str]]:
    """
    Parse a JSON string of search results into a list of title-link tuples.
    Args:
        results_string (str): A JSON-formatted string containing search results.
    Returns:
        List[Tuple[str, str]]: A list of tuples, where each tuple contains the title and link of a search result.
        If parsing fails, an empty list is returned.
    """
    try:
        # Attempt to parse the JSON string
        results = json.loads(results_string)
        # Extract and return the title and link from each result
        return [(result.get('title', 'Untitled'), result.get('link', '')) for result in results]
    except json.JSONDecodeError:
        # Handle JSON decoding errors by returning an empty list
        print("Error parsing search results. Returning empty list.")
        return []

Retrieving Relevant Documents with FAISS

This code defines a retrieve_documents function that uses similarity search to fetch the top k (defaulting to 3) most relevant documents for a given query. It returns a list of the documents’ content extracted from their page_content attributes, enabling efficient vector-based retrieval for Retrieval-Augmented Generation (RAG). Although the parameter is annotated as FAISS, the Chroma vector store created earlier works here as well, since it exposes the same similarity_search method.

def retrieve_documents(query: str, faiss_index: FAISS, k: int = 3) -> List[str]:
    """
    Retrieve documents based on a query using a FAISS index.
    Args:
        query (str): The query string to search for.
        faiss_index (FAISS): The FAISS index used for similarity search.
        k (int): The number of top documents to retrieve. Defaults to 3.
    Returns:
        List[str]: A list of the retrieved document contents.
    """
    docs = faiss_index.similarity_search(query, k=k)
    return [doc.page_content for doc in docs]

Assessing Document Relevance Scores

This code defines an evaluate_documents function that takes a query and a list of documents, then uses the retrieval_evaluator function to compute a relevance score (between 0 and 1) for each document relative to the query. It returns a list of these scores, providing a quantitative measure of how well each document matches the query, which is useful for ranking or filtering in retrieval systems.

def evaluate_documents(query: str, documents: List[str]) -> List[float]:
    """
    Evaluate the relevance of documents based on a query.
    Args:
        query (str): The query string.
        documents (List[str]): A list of document contents to evaluate.
    Returns:
        List[float]: A list of relevance scores for each document.
    """
    return [retrieval_evaluator(query, doc) for doc in documents]

Executing and Refining Web Search Results

This code defines a perform_web_search function that takes a query, rewrites it for better web searchability using rewrite_query, and retrieves results using the search tool. It refines the raw search results into key points with knowledge_refinement and parses them into title-link pairs with parse_search_results, returning both the refined knowledge and source references as a tuple for use in applications like RAG.

def perform_web_search(query: str) -> Tuple[List[str], List[Tuple[str, str]]]:
    """
    Perform a web search based on a query.
    Args:
        query (str): The query string to search for.
    Returns:
        Tuple[List[str], List[Tuple[str, str]]]:
            - A list of refined knowledge obtained from the web search.
            - A list of tuples containing titles and links of the sources.
    """
    rewritten_query = rewrite_query(query)
    web_results = search.run(rewritten_query)
    web_knowledge = knowledge_refinement(web_results)
    sources = parse_search_results(web_results)
    return web_knowledge, sources

Crafting Query Response with Knowledge and Sources

This code defines a generate_response function that uses a PromptTemplate to create a response to a query by combining refined knowledge and a list of sources (title-link pairs). It formats the sources as a readable string, feeds the query, knowledge, and sources into the llm (ChatOpenAI), and returns the generated answer, complete with citations, suitable for delivering informed and sourced responses in a Retrieval-Augmented Generation system.

def generate_response(query: str, knowledge: str, sources: List[Tuple[str, str]]) -> str:
    """
    Generate a response to a query using knowledge and sources.
    Args:
        query (str): The query string.
        knowledge (str): The refined knowledge to use in the response.
        sources (List[Tuple[str, str]]): A list of tuples containing titles and links of the sources.
    Returns:
        str: The generated response.
    """
    response_prompt = PromptTemplate(
        input_variables=["query", "knowledge", "sources"],
        template="Based on the following knowledge, answer the query. Include the sources with their links (if available) at the end of your answer:\nQuery: {query}\nKnowledge: {knowledge}\nSources: {sources}\nAnswer:"
    )
    input_variables = {
        "query": query,
        "knowledge": knowledge,
        "sources": "\n".join([f"{title}: {link}" if link else title for title, link in sources])
    }
    response_chain = response_prompt | llm
    return response_chain.invoke(input_variables).content

Implementing Corrective Retrieval-Augmented Generation (CRAG)

This code defines a crag_process function that processes a query by retrieving documents from the vector store, evaluating their relevance with scores, and dynamically deciding whether to use the best document (if score > 0.7), perform a web search (if score < 0.3), or combine both (if the score is between 0.3 and 0.7). It refines knowledge accordingly, tracks sources, and generates a final response using the generate_response function, printing intermediate steps for transparency in this adaptive Retrieval-Augmented Generation workflow.

def crag_process(query: str, faiss_index: FAISS) -> str:
    """
    Process a query by retrieving, evaluating, and using documents or performing a web search to generate a response.
    Args:
        query (str): The query string to process.
        faiss_index (FAISS): The FAISS index used for document retrieval.
    Returns:
        str: The generated response based on the query.
    """
    print(f"\nProcessing query: {query}")
    # Retrieve and evaluate documents
    retrieved_docs = retrieve_documents(query, faiss_index)
    eval_scores = evaluate_documents(query, retrieved_docs)
    print(f"\nRetrieved {len(retrieved_docs)} documents")
    print(f"Evaluation scores: {eval_scores}")
    # Determine action based on evaluation scores
    max_score = max(eval_scores)
    sources = []
    if max_score > 0.7:
        print("\nAction: Correct - Using retrieved document")
        best_doc = retrieved_docs[eval_scores.index(max_score)]
        final_knowledge = best_doc
        sources.append(("Retrieved document", ""))
    elif max_score < 0.3:
        print("\nAction: Incorrect - Performing web search")
        final_knowledge, sources = perform_web_search(query)
    else:
        print("\nAction: Ambiguous - Combining retrieved document and web search")
        best_doc = retrieved_docs[eval_scores.index(max_score)]
        # Refine the retrieved knowledge
        retrieved_knowledge = knowledge_refinement(best_doc)
        web_knowledge, web_sources = perform_web_search(query)
        final_knowledge = "\n".join(retrieved_knowledge + web_knowledge)
        sources = [("Retrieved document", "")] + web_sources
    print("\nFinal knowledge:")
    print(final_knowledge)
    print("\nSources:")
    for title, link in sources:
        print(f"{title}: {link}" if link else title)
    # Generate response
    print("\nGenerating response...")
    response = generate_response(query, final_knowledge, sources)
    print("\nResponse generated")
    return response

Executing CRAG Process for Image Recognition Query

This code sets a query string asking about "Image Recognition Using Deep Neural Network," then calls the crag_process function with this query and a previously created vectorstore (a Chroma vector store from a PDF). It processes the query through retrieval, evaluation, and response generation, storing the result in result, and finally prints both the query and the generated answer for review.

query = "What are the Image Recognition Using Deep Neural Network?"
result = crag_process(query, vectorstore)
print(f"Query: {query}")
print(f"Answer: {result}")

Running CRAG Process for Object Detector Training Query

This code defines a query asking about building an object detector with multi-step training, then invokes the crag_process function using this query and the existing vectorstore (from a PDF). It executes the full CRAG workflow—retrieval, evaluation, and response generation—storing the outcome in result, and prints the query and answer for inspection.

query = "How did you build an object detector using multi-step training?"
result = crag_process(query, vectorstore)
print(f"Query: {query}")
print(f"Answer: {result}")

Conclusion

This project successfully implements a Corrective Retrieval-Augmented Generation (CRAG) system, integrating deep neural network insights from a PDF with dynamic web search capabilities to deliver accurate, sourced responses. By leveraging GPT-4o, FAISS, and Chroma vector stores, it demonstrates robust document retrieval, relevance evaluation, and knowledge refinement for queries like image recognition and object detection. The adaptive workflow, powered by SentenceTransformerEmbeddings and DuckDuckGo, ensures flexibility and precision, making it a valuable tool for NLP and AI-driven research.

Challenges New Coders Might Face

  • Challenge: Large Document Processing
    Solution: Use text splitting (e.g., RecursiveCharacterTextSplitter) to divide the document into smaller chunks. Storing the chunk embeddings in chromadb then ensures that only the most relevant chunks are retrieved and passed to the model, reducing the load on the system.

  • Challenge: API Key Configuration
    Solution: Double-check the API key setup in Colab secrets or .env and test with a simple OpenAI call to confirm functionality.

  • Challenge: Model Incompatibility or Version Mismatch
    Solution: Ensure that the required versions of the libraries and models are properly installed using version management tools like pip or conda. It's also helpful to document the versions of libraries used for consistency across environments.

  • Challenge: Resource Limitations (RAM/Storage)
    Solution: Use memory-efficient FAISS index types, such as IVFPQ (Inverted File with Product Quantization), which compress vectors so that large collections can be searched without overloading memory.

  • Challenge: Relevance Scoring Variability
    Solution: Adjust thresholds (e.g., 0.7 to 0.8) or average multiple GPT-4o evaluations to stabilize relevance scoring.
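
For the last challenge above, here is a minimal sketch of averaging several evaluator calls. It reuses the retrieval_evaluator function defined earlier; the helper name averaged_relevance and the number of runs are illustrative choices, not part of the original pipeline.

from statistics import mean

def averaged_relevance(query: str, document: str, n_runs: int = 3) -> float:
    # Score the same query-document pair several times and average to smooth out variability.
    scores = [retrieval_evaluator(query, document) for _ in range(n_runs)]
    return mean(scores)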

FAQ

Question 1. Is this project suitable for real-time AI-driven research applications?
Answer: Yes, with optimization for speed (e.g., caching) and robust NLP tools, it supports AI research on topics like image recognition.
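
As one illustration of the caching idea, a minimal sketch that memoizes relevance scores with functools.lru_cache so repeated query-document pairs skip the GPT-4o call; the helper name cached_relevance is hypothetical, and retrieval_evaluator is the function defined earlier.

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_relevance(query: str, document: str) -> float:
    # Results are cached per (query, document) pair, avoiding repeated model calls.
    return retrieval_evaluator(query, document)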

Question 2. How can I improve the relevance scoring for retrieved documents?
Answer: Fine-tune the thresholds (e.g., 0.7 to 0.8) in crag_process or average multiple GPT-4o evaluations for consistency.

Question 3. What role does LangChain play in this project?
Answer: LangChain orchestrates the CRAG system by handling PDF loading and splitting, managing Chroma and FAISS vector stores for retrieval, and structuring prompts and chains for GPT-4o to evaluate the relevance, refine knowledge, and generate responses efficiently.

Question 4. How does the text-splitting process work?
Answer: The RecursiveCharacterTextSplitter splits long text into chunks of a defined size (e.g., 1000 characters), optionally with overlapping sections to maintain context across chunk boundaries (this project uses chunk_overlap=0). This keeps even long documents manageable while preserving meaning.
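
A minimal sketch of a splitter configured with overlap; the 200-character overlap is an illustrative value (encode_pdf in this project uses chunk_overlap=0), and path is the PDF path defined earlier.

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the same PDF and split it so adjacent chunks share 200 characters of context.
documents = PyPDFLoader(path).load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)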

Question 5. Can I use a different model instead of GPT-4o for relevance evaluation?
Answer: Yes, change the model name in the ChatOpenAI instance assigned to llm (or substitute another LangChain chat model that supports structured output), adjusting parameters like max_tokens as needed.
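
For example, a sketch of pointing llm at a different OpenAI chat model; the model name below is an illustrative choice rather than the project's configuration, and any chat model that supports structured output will work with the evaluator chains.

from langchain_openai import ChatOpenAI

# Swap in a smaller model for relevance evaluation and generation (illustrative choice).
llm = ChatOpenAI(model="gpt-4o-mini", max_tokens=1000, temperature=0)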
