Multi-Modal Retrieval-Augmented Generation (RAG) with Text and Image Processing
Analyzing academic papers, research documents, and PDFs by hand takes significant time. This AI-powered research assistant streamlines text extraction, performs image analysis, and generates intelligent document summaries using natural language processing (NLP), vector search, and large language models (LLMs). The tool integrates OpenAI's GPT-4o with LangChain, ChromaDB, and Hugging Face embeddings to build an automated academic paper analysis system that supports semantic search, AI-generated summaries, and image-based content explanations.
Project Overview
The research assistant runs on an AI pipeline that combines natural language processing (NLP), optical character recognition (OCR), and vector search to analyze and summarize research documents efficiently. Papers pass through a system that extracts text, images, and tables, performs semantic search with Hugging Face embeddings and ChromaDB, and then generates AI summaries of text and explanations of figures with GPT-4o. Users can submit research questions and receive the specific, relevant content sections, cutting down the time needed for manual reading. By integrating PyMuPDF, pdfplumber, Tesseract OCR, OpenCV, LangChain, and OpenAI, the project speeds up academic research, giving students, researchers, and analysts automated content retrieval and easy access to literature reviews, academic insights, and research paper analysis.
Prerequisites
- Python 3.8+ with Google Colab or Jupyter Notebook for execution.
- An OpenAI API Key for text and image analysis with GPT-4o.
- Tesseract OCR & Poppler-utils for extracting text from PDF files and image documents.
- LangChain, ChromaDB & Hugging Face Embeddings for semantic search and AI-powered retrieval.
- PyMuPDF, pdfplumber & pdf2image for extracting text, images, and tables from PDF documents.
- Pandas, NumPy, Matplotlib & IPython Display for data processing and visualization.
Approach
This project uses an AI-based approach that combines multimodal RAG with OCR, NLP, and vector-based retrieval to extract and analyze academic content efficiently. PDF processing starts with PyMuPDF, pdfplumber, and pdf2image for text, table, and image extraction, with Tesseract OCR handling scanned documents. Extracted text and tables are embedded with Hugging Face sentence transformers and stored in ChromaDB for semantic search.
The system passes base64-encoded images to GPT-4o for structured explanations of figures, tables, and other visual data. The RAG layer combines text embeddings with AI-generated insights to provide context-aware document comprehension. GPT-4o's querying capability automatically retrieves the relevant sections of a research paper and summarizes them in response to the user's search request, improving both the speed and precision of research results.
A Pandas DataFrame serves as structured storage for the organized extracted content, supporting further analysis and visualization. This RAG system removes the need for manual document review and delivers improved summaries that cover images as well as text.
Workflow and Methodology
Workflow
- Handles PDF upload and processing so documents can be read and analyzed.
- Extracts structured text and tables from PDFs with PyMuPDF and pdfplumber.
- Applies Tesseract OCR to scanned or image-based PDFs to recover their text.
- Converts academic figures to base64 and generates explanations with GPT-4o.
- Stores extracted content as Hugging Face embeddings in ChromaDB for semantic search.
- Lets users run queries that locate the sections answering their research questions.
- Produces structured summaries for every content type with GPT-4o during automated summarization.
- Transforms the raw extracted content into Pandas DataFrames for analysis and visualization.
Methodology
- Loads the PDF, determines its structure, and extracts text, tables, and figures.
- Uses Tesseract OCR and OpenCV to convert scanned PDFs and figures to text.
- Converts extracted information into numerical embeddings using Hugging Face sentence transformers.
- Implements semantic search by storing and retrieving contextually relevant document portions using ChromaDB.
- Performs multimodal analysis with GPT-4o, analyzing both the textual and visual components of the research paper.
- Uses the user's input to dynamically retrieve and summarize relevant portions.
- Creates organized, context-aware, AI-driven summaries of the research findings.
- Provides the extracted and summarized material in a structured format for review and analysis.
Data Collection and Preparation
Data Collection
- Research Paper PDFs are collected using Google Scholar for publicly available academic papers and Sci-Hub for accessing paywalled research.
- Academic Journals, Theses, and Conference Papers are sourced to ensure a diverse dataset.
Data Preparation Workflow:
- PDFs are processed to extract text, tables, and images using PyMuPDF, pdfplumber, and Pandas.
- Tesseract OCR retrieves text from scanned documents for better accessibility.
- OpenCV enhances figures and converts them to base64 for AI-based analysis.
- Extracted content is labeled with sections, page numbers, and metadata for structured retrieval (an illustrative element layout is sketched after this list).
- Hugging Face embeddings are generated and stored in ChromaDB for semantic search.
- Processed data is structured into a Pandas DataFrame for AI-driven querying and analysis.
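To make that labeling concrete, here is a small illustrative sketch (not taken from the project code, but mirroring the element structure built later in process_research_paper) of what a single prepared element might look like:
# Illustrative only: the shape of one labeled element after preparation
sample_element = {
    "element_id": "page_0",                                    # unique ID used for retrieval
    "type": "Text",                                            # "Text", "Table", or "Image"
    "content": "Abstract. This paper studies clustering ...",  # extracted text (or an image path)
    "metadata": {"section": "Abstract", "page_number": 1},     # labels used for structured retrieval
}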
Code Explanation
STEP 1:
Mounting Google Drive
This code mounts your Google Drive into the Colab environment so that you can access files stored in your Drive. The drive becomes accessible under the /content/drive path.
from google.colab import drive
drive.mount('/content/drive')
This code installs various tools for working with PDFs, enabling text extraction through OCR, and utilizing AI models. It configures Poppler and Tesseract OCR for recognizing text, along with LangChain, OpenAI, and ChromaDB for AI-driven document processing. Essentially, it readies your system to manage PDFs, images, and AI-enhanced text extraction.
!apt-get update && apt-get install -y poppler-utils tesseract-ocr libtesseract-dev
!pip install --no-cache-dir pdf2image pdfplumber opencv-python-headless unstructured[all-docs] chromadb langchain openai python-dotenv pillow tiktoken pymupdf pytesseract
!pip install langchain_openai
!pip install -U langchain-community --quiet
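Before moving on, you can optionally confirm that the system packages installed correctly; these two commands only print version information:
!tesseract --version
!pdftoppm -v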
This code imports various libraries to manage PDFs, images, AI models, and data processing tasks. It utilizes utilities such as os and dotenv for managing the environment, along with numpy and pandas for handling data. For PDF extraction, it incorporates pdfplumber and PyMuPDF. Additionally, it configures pytesseract for optical character recognition (OCR) and employs OpenAI and LangChain for AI-driven text analysis, while also using matplotlib for data visualization.
# Basic Libraries
import os
import uuid
import base64
import io
from pathlib import Path
from dotenv import load_dotenv
from google.colab import userdata
# Data Processing & Utilities
import numpy as np
import pandas as pd
# PDF Processing
import fitz # PyMuPDF
import pdfplumber
from pdf2image import convert_from_path
from unstructured.partition.pdf import partition_pdf
# Image Processing
import cv2
import pytesseract
from PIL import Image
# OpenAI & LangChain (LLM & Embeddings)
import openai
from openai import OpenAI
from langchain_openai import ChatOpenAI  # Updated OpenAI API wrapper (replaces the deprecated langchain.chat_models.ChatOpenAI)
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import SystemMessage, HumanMessage, AIMessage
# LangChain Document Handling
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
# Visualization
import matplotlib.pyplot as plt
from IPython.display import display, HTML, Markdown
Setting Up OpenAI API Key
This code fetches the OpenAI API key from the userdata in Google Colab and saves it in the environment variable OPENAI_API_KEY. After that, it sets up the OpenAI client with this key, enabling the program to communicate with OpenAI’s API for various AI-related tasks.
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
OPENAI_API_KEY=os.environ["OPENAI_API_KEY"]
client = OpenAI(api_key=OPENAI_API_KEY)
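If you are running this outside Google Colab, the same key can come from a local .env file instead; this is an optional sketch that assumes a .env file containing OPENAI_API_KEY=... and reuses the load_dotenv import from the setup above:
# Alternative for local (non-Colab) environments, assuming a .env file with OPENAI_API_KEY=...
load_dotenv()  # loads variables from .env into os.environ
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])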
Initializing Hugging Face Embeddings
This code initializes Hugging Face embeddings with the "sentence-transformers/all-MiniLM-L6-v2" model. It transforms text into numerical vectors, which are useful for tasks such as semantic search, text similarity, and AI-driven retrieval.
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
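As a quick sanity check, you can embed a short string and inspect the vector size; the all-MiniLM-L6-v2 model produces 384-dimensional vectors:
# Quick check: embed a sample sentence and confirm the vector dimensionality
sample_vector = embeddings.embed_query("Multimodal retrieval-augmented generation")
print(len(sample_vector))  # 384 for all-MiniLM-L6-v2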
Setting Up Output Directory
This code establishes the output directory path using Path from the pathlib module and creates the folder if it does not already exist. It designates /content/drive/MyDrive/Badhon/output_imagess as the location for saving extracted images and other files.
from pathlib import Path
output_dir = Path("/content/drive/MyDrive/Badhon/output_imagess")
output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it does not exist yet
Extracting and Processing PDF Content
The extract_text_from_image function uses OpenCV to read an image, convert it to grayscale, and apply adaptive thresholding, then extracts text with Tesseract OCR, handling errors gracefully. The process_research_paper function extracts text, tables, and images from an academic PDF. It begins by reading the PDF with PyMuPDF (fitz) and labels sections such as Abstract, Introduction, Methodology, and Results based on keyword detection. Next, it employs pdfplumber to extract tables and uses PyMuPDF to gather images, saving them in the specified output directory. Each extracted image is processed with extract_text_from_image to capture any embedded text. Finally, the function returns a structured list of the extracted elements, covering text, tables, and images along with their metadata.
def extract_text_from_image(image_path):
    """Extracts text from an image using OpenCV + Tesseract OCR."""
    try:
        img = cv2.imread(str(image_path))
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        processed = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
        return pytesseract.image_to_string(processed)
    except Exception as e:
        print(f"Error extracting text from image {image_path}: {e}")
        return "OCR Failed"

### --- PDF EXTRACTION FUNCTION --- ###
def process_research_paper(pdf_path: str):
    """Extracts text, tables, and images from an academic PDF."""
    processed_data = []
    doc = fitz.open(pdf_path)

    # --- Extract Text Content ---
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text("text")

        # Detect sections by keyword
        section = "General Content"
        if "abstract" in text.lower():
            section = "Abstract"
        elif "introduction" in text.lower():
            section = "Introduction"
        elif "methodology" in text.lower():
            section = "Methodology"
        elif "results" in text.lower() or "findings" in text.lower():
            section = "Results"

        processed_data.append({
            "element_id": f"page_{page_num}",
            "type": "Text",
            "content": text,
            "metadata": {"section": section, "page_number": page_num + 1}
        })

    # --- Extract Tables using pdfplumber ---
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for idx, table in enumerate(tables):
                if table:
                    table_text = "\n".join([" | ".join(str(cell) if cell else "" for cell in row) for row in table if any(row)])
                    processed_data.append({
                        "element_id": f"table_{page_num}_{idx}",
                        "type": "Table",
                        "content": table_text,
                        "metadata": {"table_number": f"Table {idx+1}", "page_number": page_num + 1}
                    })

    # --- Extract Images using PyMuPDF ---
    for page_num in range(len(doc)):
        page = doc[page_num]
        img_list = page.get_images(full=True)  # Get all images on the page
        for img_index, img in enumerate(img_list):
            xref = img[0]  # Reference for extracting the image
            base_image = doc.extract_image(xref)  # Extract image data
            img_bytes = base_image["image"]  # Raw image data
            img_ext = base_image["ext"]  # Image format (png/jpeg)

            # Save extracted image
            img_path = output_dir / f"figure_{page_num+1}_{img_index+1}.{img_ext}"
            with open(img_path, "wb") as f:
                f.write(img_bytes)

            # Run OCR on the image (for image-based tables)
            extracted_text = extract_text_from_image(img_path)

            processed_data.append({
                "element_id": f"image_{page_num+1}_{img_index+1}",
                "type": "Image",
                "content": str(img_path),
                "metadata": {"figure_number": f"Figure {img_index+1}", "page_number": page_num + 1},
                "ocr_text": extracted_text
            })

    return processed_data
Running the PDF Processing Pipeline
This code executes the process_research_paper function on a PDF file named 2021clustur.pdf to extract text, tables, and images. The extracted components are saved in the elements list. To check the results, it prints a sample of the first three extracted elements, displaying their type (Text, Table, or Image) along with a preview of their content.
#Run Pipeline
pdf_path = "/content/drive/MyDrive/2021clustur.pdf"
# Process the research paper
elements = process_research_paper(pdf_path)
# Debug: Check extracted content
print("\n=== Extracted Elements Sample ===")
for elem in elements[:3]:
    print(f"{elem['type']}: {elem.get('content', '')[:600]}...")
Displaying Extracted Tables and Images
This code goes through the first 20 elements extracted from the PDF, counting and showing tables and images. When it encounters a table, it transforms the text-based table into a Pandas DataFrame for improved readability and displays it. For images, it loads and shows the image using PIL (Pillow). Additionally, it prints out metadata such as the page number and figure/table number to help identify the source of the extracted content.
import pandas as pd
import IPython.display as display
from PIL import Image

table_count = 0
image_count = 0

for elem in elements[:20]:  # Check more elements for debugging
    if elem["type"] == "Table":
        table_count += 1
        print(f"\n🔹 Table Extracted (Page {elem['metadata']['page_number']} - {elem['metadata']['table_number']}):\n")
        # Convert text-based tables into a DataFrame
        table_rows = [row.split(" | ") for row in elem["content"].split("\n") if row]
        df = pd.DataFrame(table_rows)
        # Display the table as a proper table
        display.display(df)
    elif elem["type"] == "Image":
        image_count += 1
        img_path = elem["content"]
        print(f"\nImage Extracted (Page {elem['metadata']['page_number']} - {elem['metadata']['figure_number']}): {img_path}\n")
        # Display the extracted image
        img = Image.open(img_path)
        display.display(img)

print(f"\nDisplayed {table_count} tables and {image_count} images from the first 20 elements.")
Generating Academic Summaries for Research Paper Elements
This function analyzes text, tables and images from a research paper to create summaries using AI technology. For images, it transforms them into a base64 string, offers a structured analysis prompt and requests the model to highlight key patterns. For tables, it pulls out important variables, trends and implications. For text sections, it pinpoints hypotheses, methodology and findings. The summaries generated by AI are saved within each element and any errors are managed effectively by marking summaries as unavailable if problems occur.
def generate_academic_summaries(elements):
    """Generate context-aware summaries for research paper elements."""
    for elem in elements:
        try:
            if elem["type"] == "Image":
                # Convert the image file to a base64-encoded string
                with open(elem["content"], "rb") as img_file:
                    base64_image = base64.b64encode(img_file.read()).decode("utf-8")

                # Define the text portion of the prompt for the image
                prompt_text = (
                    "Analyze this academic figure:\n"
                    "- Describe key elements and labels\n"
                    "- Identify statistical representations\n"
                    "- Explain significance in the paper context\n"
                    "- Note any patterns or anomalies"
                )

                # Construct a multimodal message: text + image_url (using a data URI)
                messages = [
                    {"role": "user", "content": [
                        {"type": "text", "text": prompt_text},
                        {"type": "image_url", "image_url": {"url": "data:image/png;base64," + base64_image}}
                    ]}
                ]
                response = client.chat.completions.create(
                    model="gpt-4o",
                    messages=messages,
                    temperature=0.2,
                    max_tokens=400
                )
                elem["summary"] = response.choices[0].message.content

            elif elem["type"] == "Table":
                # Create a prompt that includes the table content
                prompt = (
                    f"Analyze this research table:\n{elem['content']}\n\n"
                    "1. Identify key variables and metrics\n"
                    "2. Summarize main relationships\n"
                    "3. Note significant values/trends\n"
                    "4. Explain potential implications"
                )
                messages = [{"role": "user", "content": prompt}]
                response = client.chat.completions.create(
                    model="gpt-4o",
                    messages=messages,
                    temperature=0.3,
                    max_tokens=500
                )
                elem["summary"] = response.choices[0].message.content

            else:  # For text content
                section = elem.get("metadata", {}).get("section", "General Content")
                prompt = (
                    f"Summarize this {section} section from a research paper:\n{elem['content']}\n\n"
                    "Focus on:\n"
                    "- Key hypotheses/research questions\n"
                    "- Methodology components\n"
                    "- Significant findings\n"
                    "- Theoretical contributions"
                )
                messages = [{"role": "user", "content": prompt}]
                response = client.chat.completions.create(
                    model="gpt-4o",
                    messages=messages,
                    temperature=0.4,
                    max_tokens=400
                )
                elem["summary"] = response.choices[0].message.content

        except Exception as e:
            print(f"Error processing element {elem.get('element_id', 'unknown')}: {str(e)}")
            elem["summary"] = "Summary unavailable"

    return elements

# Generate academic summaries
summarized_elements = generate_academic_summaries(elements)
Debugging and Checking Summaries
This code outputs a sample of the first three summaries created for various elements of research papers. It takes the first 1000 characters from each summary and presents them, making sure that the AI-generated summaries are well-structured and informative. This process aids in checking the accuracy and thoroughness of the extracted content.
# Debug: Check summaries
print("\n=== Summary Sample ===")
for elem in summarized_elements[:3]:
    print(f"{elem['type']} Summary: {elem['summary'][:1000]}...")
Displaying Summaries for Extracted Images
This code iterates through all summarized elements and outputs summaries specifically for images. It shows the figure number and page number alongside the AI-generated summary. A separator ("-" * 20) is included for better readability, facilitating the review of multiple image summaries simultaneously.
for elem in summarized_elements:
    if elem["type"] == "Image":
        # metadata["figure_number"] already contains the "Figure N" label
        print(f"Image Summary ({elem['metadata']['figure_number']}, Page {elem['metadata']['page_number']}):")
        print(elem["summary"])
        print("-" * 20)
Creating a DataFrame from Summarized Elements
This function transforms the components of the research paper into a Pandas DataFrame for organized analysis. It pulls out essential information such as element ID, type, content, summary and OCR text, as well as metadata like page number and section. Each component is recorded as a row, simplifying the process of filtering, analyzing and visualizing the extracted research material. Ultimately, the DataFrame is generated and shown using df.head().
def create_dataframe_from_elements(elements):
    """Creates a Pandas DataFrame from the summarized elements."""
    data = []
    for elem in elements:
        # Build a row from required keys and any metadata available.
        row = {
            "element_id": elem.get("element_id", ""),
            "type": elem.get("type", ""),
            "content": elem.get("content", ""),  # Fallback to empty string if missing
            "summary": elem.get("summary", ""),
            "ocr_text": elem.get("ocr_text", "")
        }
        # Add metadata fields as individual columns (if any)
        row.update(elem.get("metadata", {}))
        data.append(row)
    df = pd.DataFrame(data)
    return df

# Create DataFrame from your processed elements (summarized_elements)
df = create_dataframe_from_elements(summarized_elements)
print("DataFrame head:")
print(df.head())
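Because the content now lives in a regular DataFrame, ordinary Pandas filtering applies; for example, this optional check (not part of the original pipeline) lists only the image rows and their AI-generated summaries:
# Optional: inspect just the image elements and their summaries
image_rows = df[df["type"] == "Image"][["element_id", "page_number", "summary"]]
print(image_rows.head())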
Creating a Vector Database for Research Elements
This code sets up text embeddings and saves them in a Chroma vector database to enable efficient retrieval. It starts by merging the summary, content and OCR text into one string (text_to_embed) for embedding purposes. Next, it generates LangChain Document objects, ensuring that metadata is preserved for context. Finally, it initializes a Chroma vector database, transforming text into embeddings with the Hugging Face model, which allows for semantic retrieval and AI-driven document analysis.
df['text_to_embed'] = df['summary'] + " " + df['content'] + " " + df['ocr_text']

# Create Document objects from each row.
documents = [
    Document(page_content=text, metadata=metadata)
    for text, metadata in zip(
        df['text_to_embed'].tolist(),
        df.drop(columns=["text_to_embed"]).to_dict(orient="records")
    )
]

# Create the vector database using Chroma.
vector_db = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
)
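Before wiring the database into the query workflow, a quick similarity_search call (the query string below is only an example) confirms that the embeddings were stored and can be retrieved:
# Sanity check: retrieve the three chunks most similar to a sample query
for doc in vector_db.similarity_search("clustering methods", k=3):
    print(doc.metadata.get("element_id", ""), "->", doc.page_content[:120])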
Explaining and Displaying Academic Figures with GPT-4o
This function processes an image, encodes it in base64 and submits a multimodal request to GPT-4o for an in-depth analysis. The explanation generated by the AI covers essential components, statistical representations, significance and noteworthy patterns. The image and its corresponding explanation are presented together in a well-structured HTML format using IPython.display. This approach improves research analysis by offering organized insights into academic figures.
import base64
import openai
import imghdr
from IPython.display import display, HTML

def explain_and_show_image(image_path):
    """
    Given the path to an image, this function:
    - Reads and encodes the image as a base64 data URI.
    - Sends a multimodal request to GPT-4o for image analysis.
    - Displays the image alongside its explanation.
    - Returns the explanation text.
    """
    # Open and encode the image file in base64.
    with open(image_path, "rb") as img_file:
        image_bytes = img_file.read()

    # Detect MIME type dynamically
    mime_type = imghdr.what(image_path) or "png"  # Default to PNG if detection fails
    base64_image = base64.b64encode(image_bytes).decode("utf-8")
    data_uri = f"data:image/{mime_type};base64," + base64_image

    # Define the detailed text prompt for GPT-4o
    prompt_text = (
        "Analyze this academic figure and provide a structured explanation:\n"
        "1️⃣ **Description of Key Elements and Labels**\n"
        "2️⃣ **Statistical Representations** (e.g., bar chart, scatter plot, trend analysis)\n"
        "3️⃣ **Significance in the Paper Context** (Why is this figure important?)\n"
        "4️⃣ **Patterns & Anomalies** (Notable trends, spikes, outliers)\n"
        "Please format the explanation neatly using numbered sections."
    )

    # Construct a multimodal request with an image
    messages = [
        {"role": "user", "content": [
            {"type": "text", "text": prompt_text},
            {"type": "image_url", "image_url": {"url": data_uri}}
        ]}
    ]

    # Call GPT-4o to generate the explanation
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
        max_tokens=800  # Increased to allow for more detailed responses
    )

    # Extract the generated explanation
    explanation = response.choices[0].message.content

    # Generate HTML that shows the figure next to its explanation
    html_content = f"""
    <div style="display: flex; align-items: flex-start; gap: 20px;">
        <img src="{data_uri}" style="max-width: 45%; border: 1px solid #ccc;">
        <div>
            <h3>📊 Image Explanation</h3>
            <p style="white-space: pre-wrap;">{explanation}</p>
        </div>
    </div>
    """

    # Display the formatted HTML content
    display(HTML(html_content))
    return explanation

# Example usage: provide your actual image path (a figure saved to the output directory used earlier)
explanation_text = explain_and_show_image("/content/drive/MyDrive/Badhon/output_imagess/figure_3_1.png")
Querying an Academic Paper with AI-Powered Search
This feature enables users to pose research-related inquiries and obtain pertinent information from a Chroma vector database through semantic search. It effectively identifies and merges the most relevant sections of documents, subsequently prompting GPT-4o to create a well-organized, easy-to-understand response. The AI-generated reply is presented in Markdown format, enhancing visual clarity with bullet points and key highlights for improved understanding.
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage
from IPython.display import display, Markdown

def query_system(query: str, vector_db, k: int = 10) -> str:
    """
    Retrieves an answer for ANY query related to an academic paper.
    - Dynamically retrieves the most relevant content based on query intent.
    - Uses AI to analyze the query and find the best response.
    - Formats the response for easy readability.

    Parameters:
    - query (str): The user question (e.g., "Which method performed best?", "What is the dataset used?")
    - vector_db: The Chroma (or any compatible) vector store containing document embeddings.
    - k (int): Number of top matching documents to retrieve.

    Returns:
    - str: The AI-generated answer (also displayed as formatted Markdown).
    """
    # Retrieve documents using similarity search
    retrieved_docs = vector_db.similarity_search(query, k=k)

    # Combine retrieved content
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Define the AI prompt (no static sections - fully open-ended)
    prompt = (
        "You are an expert research assistant analyzing academic papers.\n"
        "Answer the user's question as clearly as possible using the extracted context from the research paper.\n"
        "Provide well-structured, easy-to-read answers using bullet points and key highlights.\n\n"
        " **User Query:** {query}\n\n"
        " **Extracted Context:**\n{context}\n\n"
        " **Your Answer:**"
    )

    # Initialize the AI model
    llm = ChatOpenAI(model_name="gpt-4o", temperature=0.3)

    # Generate AI response
    messages = [
        SystemMessage(content="You are an AI assistant specializing in research paper analysis."),
        HumanMessage(content=prompt.format(query=query, context=context))
    ]
    response = llm.invoke(messages)

    # Display the structured response in Markdown format
    display(Markdown(response.content))
    return response.content
Example Research Paper Queries
These queries are designed to extract key insights from the academic paper using AI-powered search.
# Example Queries (Ask Anything!)
query_1 = "Which method performed best in this study?"
query_2 = "What dataset was used in this research?"
query_3 = "Summarize the main conclusions."
query_4 = "How does this paper compare different clustering techniques?"
query_5 = "What are the limitations of this study?"
Executing AI-Powered Query for Best-Performing Method
This command executes the query_system function with the question: "Which method performed best in this study?". The Chroma vector database retrieves the top 10 most relevant sections from the research paper and GPT-4o analyzes the extracted content to create a well-organized answer. The response is presented in a clear bullet-point format, allowing researchers to quickly pinpoint the best-performing method in the study.
query_system(query_1, vector_db, k=10) # Finds best-performing method
query_system(query_5, vector_db, k=10) # Lists study limitations
Conclusion
This AI-supported research assistant analyzes academic papers, PDFs, and research documents by combining NLP, vector search, LLMs, and multimodal RAG. OCR and LangChain integrate with OpenAI's GPT-4o, ChromaDB, and Hugging Face embeddings to form a system that performs automatic text extraction, semantic search, AI-driven summarization, and intelligent image understanding.
Multimodal RAG lets the system retrieve structured text and images together, so it can interpret textual content and visual elements side by side. This integration improves context-sensitive document understanding, which raises the accuracy of both queries and summaries. With textual and visual information combined, researchers get automated retrieval of essential findings and automated analysis of figures and papers.
The solution transforms academic research, document processing, and AI-based content retrieval, serving as an effective tool for students, scholars, and data analysts who want to automate their workflows with AI.
Challenges New Coders Might Face
Problem: OpenAI API Key Not Found
Solution: Ensure you have a valid OpenAI API Key, store it in the environment (os.environ["OPENAI_API_KEY"]), and verify it before running the project.
Problem: Tesseract OCR Not Installed
Solution: Install Tesseract manually using !apt-get install -y tesseract-ocr libtesseract-dev (for Linux/Colab) and ensure the path is set in your environment (pytesseract.pytesseract.tesseract_cmd).
Problem: Poppler-Utils Missing
Solution: Install Poppler with !apt-get install -y poppler-utils (Linux) or download it from the official site for Windows and set the path correctly.
Problem: ChromaDB Not Storing Data
Solution: Ensure the embeddings are correctly generated and stored using Chroma.from_documents(). If the database is empty, re-run the data embedding process.
Problem: Large PDFs Causing Memory Issues
Solution: Use chunking techniques with CharacterTextSplitter to process documents in smaller parts, reducing memory usage (see the sketch after this list).
Problem: Image Analysis Not Working with GPT-4o
Solution: Ensure images are correctly converted to Base64 format before sending them in API requests and check if GPT-4o supports multimodal inputs in your region.
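As a minimal sketch of that chunking workaround (not part of the original pipeline; the chunk sizes are illustrative), the CharacterTextSplitter imported earlier can break long extracted text into smaller Documents before embedding:
# Minimal sketch: split long documents into smaller chunks before embedding
splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=100)
chunked_documents = splitter.split_documents(documents)
vector_db = Chroma.from_documents(documents=chunked_documents, embedding=embeddings)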
Frequently Asked Questions (FAQ)
Question 1. How can I extract text from academic PDFs using AI?
Answer: You can use PyMuPDF, pdfplumber and Tesseract OCR to extract text from PDFs, including scanned documents. This project automates the process using AI-powered text analysis and summarization with GPT-4o.
Question 2. How do I install and use Tesseract OCR for scanned PDFs?
Answer: For Colab, install it with !apt-get install -y tesseract-ocr libtesseract-dev. For Windows, download Tesseract from its official website and set the path in pytesseract.pytesseract.tesseract_cmd (see the snippet below).
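On Windows, pointing pytesseract at the executable looks roughly like this; the path below is the typical default install location and may differ on your machine:
import pytesseract
# Typical default install location on Windows; adjust if Tesseract was installed elsewhere
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"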
Question 3. How does this AI research assistant use GPT-4o for research papers?
Answer: This system integrates GPT-4o with LangChain and ChromaDB, enabling semantic search, AI-driven summarization, and image analysis of research documents. You can ask questions like “What is the main conclusion of this paper?” and get AI-generated responses.
Question 4. Can I perform semantic search on research papers?
Answer: Yes! Extracted text is converted into Hugging Face embeddings and stored in ChromaDB, allowing AI-powered vector search to retrieve the most relevant sections of a paper.
Question 5. How do I analyze tables and figures from a research paper?
Answer: Tables are extracted using pdfplumber and structured into Pandas DataFrames, while figures are processed using OpenCV and GPT-4o to generate AI-powered explanations.