Multi-Document RAG System

A production-ready Retrieval-Augmented Generation (RAG) system for intelligent question-answering over multiple PDF documents. Features hybrid retrieval (vector + keyword search), cross-encoder re-ranking, semantic chunking, and a Gradio web interface.


Model Description

This system implements an advanced RAG pipeline that combines multiple state-of-the-art techniques for document retrieval and question answering:

Core Models Used

| Component | Model | Purpose |
|-----------|-------|---------|
| Embeddings | BAAI/bge-large-en-v1.5 | 1024-dim normalized embeddings for semantic search |
| Re-ranker | BAAI/bge-reranker-v2-m3 | Cross-encoder neural re-ranking for precision |
| Chunker | sentence-transformers/all-MiniLM-L6-v2 | Semantic similarity for intelligent chunking |
| LLM | Llama 3.3 70B (via Groq API) | Generation with inline citations |
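
To make the table concrete, here is a minimal sketch of loading the embedding model with sentence-transformers; the sample query is illustrative.

```python
from sentence_transformers import SentenceTransformer

# Load the embedding model from the table above.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

# BGE embeddings are L2-normalized, so cosine similarity reduces to a dot product.
vecs = embedder.encode(["What is retrieval-augmented generation?"],
                       normalize_embeddings=True)
print(vecs.shape)  # (1, 1024) -- the 1024-dim embeddings named above
```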

Architecture

User Query
    │
    ├── Query Classification (factoid/summary/comparison/extraction/reasoning)
    ├── Multi-Query Expansion (3 alternative phrasings)
    └── HyDE Generation (hypothetical answer document)
           │
           ▼
    ┌──────────────────────────────────────┐
    │         Hybrid Retrieval             │
    │  ┌─────────────┐  ┌─────────────┐    │
    │  │ ChromaDB    │  │ BM25        │    │
    │  │ (Vector)    │  │ (Keyword)   │    │
    │  └─────────────┘  └─────────────┘    │
    │           │              │           │
    │           └──────┬───────┘           │
    │                  ▼                   │
    │         RRF Fusion + Deduplication   │
    └──────────────────────────────────────┘
                       │
                       ▼
              Cross-Encoder Re-ranking
              (BAAI/bge-reranker-v2-m3)
                       │
                       ▼
              LLM Generation (Llama 3.3 70B)
              with inline source citations
                       │
                       ▼
              Answer Verification (for complex queries)
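
The re-ranking stage in the diagram can be sketched with sentence-transformers' CrossEncoder wrapper; the candidate passages and top_k value here are illustrative, not the notebook's actual settings.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query, candidates, top_k=5):
    """Score each (query, passage) pair jointly and keep the best top_k."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```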

Key Features

Hybrid Retrieval

  • Vector Search (MMR): Semantic similarity with diversity-aware selection (Maximal Marginal Relevance) via ChromaDB
  • Keyword Search (BM25): Exact term matching that catches rare words and identifiers embeddings can miss
  • Reciprocal Rank Fusion: Merges the two ranked lists by reciprocal-rank scores, so chunks that rank highly in either list surface (see the sketch below)
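
A minimal sketch of the fusion step, assuming both retrievers return chunk ids best-first; k=60 is the smoothing constant from the original RRF paper, not necessarily what this system uses.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Merge ranked lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Deduplication falls out of the fusion: each id gets exactly one score.
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = rrf_fuse([vector_ids, bm25_ids])  # best-first chunk ids
```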

Semantic Chunking

Documents are split based on sentence embedding similarity rather than fixed character counts, preserving coherent ideas within chunks.
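
One common variant of this idea, sketched below: compare each sentence to its predecessor with the MiniLM encoder and start a new chunk when similarity drops. The 0.5 threshold and 1000-character cap mirror the defaults under Configuration Parameters; the rest is illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_chunks(sentences, threshold=0.5, max_chunk_size=1000):
    if not sentences:
        return []
    vecs = encoder.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vecs, vecs[1:], sentences[1:]):
        # Normalized vectors: the dot product is the cosine similarity.
        same_topic = float(np.dot(prev_vec, vec)) >= threshold
        if same_topic and len(" ".join(current)) + len(sentence) < max_chunk_size:
            current.append(sentence)
        else:
            chunks.append(" ".join(current))
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks
```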

Intelligent Query Classification

Automatically classifies queries into 5 types with adaptive retrieval:

| Query Type | Retrieval Depth (k) | Answer Style |
|------------|---------------------|--------------|
| Factoid | 6 | Direct |
| Summary | 10 | Bullets |
| Comparison | 12 | Bullets |
| Extraction | 8 | Direct |
| Reasoning | 10 | Steps |
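
These defaults fit in a small lookup table; a minimal sketch, assuming the classification label comes from the LLM step shown in the architecture diagram:

```python
QUERY_PROFILES = {
    "factoid":    {"k": 6,  "style": "direct"},
    "summary":    {"k": 10, "style": "bullets"},
    "comparison": {"k": 12, "style": "bullets"},
    "extraction": {"k": 8,  "style": "direct"},
    "reasoning":  {"k": 10, "style": "steps"},
}

def retrieval_profile(query_type):
    # Fall back to the summary profile for unrecognized labels.
    return QUERY_PROFILES.get(query_type, QUERY_PROFILES["summary"])
```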

Multi-Document Support

  • Upload multiple PDFs to build a combined knowledge base
  • Automatic PDF diversity enforcement for cross-document queries (see the sketch after this list)
  • Clear source attribution with document name and page number
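
One way the diversity enforcement could work, sketched under two assumptions: chunks carry LangChain's metadata["source"] key from the PDF loader, and a simple round-robin over source files is an acceptable policy (the notebook's actual policy may differ).

```python
from collections import defaultdict
from itertools import chain, zip_longest

def enforce_diversity(chunks):
    """Interleave re-ranked chunks by source PDF so no document dominates."""
    by_source = defaultdict(list)
    for chunk in chunks:  # chunks arrive best-first from the re-ranker
        by_source[chunk.metadata["source"]].append(chunk)
    interleaved = chain.from_iterable(zip_longest(*by_source.values()))
    return [c for c in interleaved if c is not None]
```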

Query Enhancement

  • HyDE: Generates hypothetical answer documents for better retrieval
  • Multi-Query Expansion: Creates 3 alternative phrasings for broader coverage
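
Both steps are single LLM calls; a minimal sketch with langchain-groq, where the Groq model id and the prompt wording are assumptions rather than the notebook's actual values:

```python
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0.3)  # model id assumed

def hyde_document(query):
    """HyDE: retrieve with a hypothetical answer instead of the raw query."""
    prompt = f"Write a short passage that plausibly answers: {query}"
    return llm.invoke(prompt).content

def expand_queries(query, n=3):
    """Multi-query expansion: n alternative phrasings, one per line."""
    prompt = f"Rewrite this question in {n} different ways, one per line:\n{query}"
    return [q.strip() for q in llm.invoke(prompt).content.splitlines() if q.strip()]
```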

Answer Verification

Self-verification step for complex queries ensures answers are direct, structured, and grounded in sources.
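
A minimal sketch of what that pass might look like; the checklist wording and the Groq model id are assumptions about how "direct, structured, and grounded" is checked.

```python
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)  # model id assumed

def verify_answer(question, draft, sources):
    """Ask the LLM to audit its own draft against the retrieved sources."""
    prompt = (
        "Review the draft answer against the sources below. If every claim "
        "is supported and the answer is direct and well structured, return "
        "it unchanged; otherwise return a corrected version.\n\n"
        f"Question: {question}\n\nSources:\n{sources}\n\nDraft answer:\n{draft}"
    )
    return llm.invoke(prompt).content
```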

Intended Uses

Primary Use Cases

  • Academic Research: Analyze and compare research papers
  • Document Q&A: Answer questions over technical documentation
  • Literature Review: Synthesize information across multiple sources
  • Knowledge Extraction: Extract specific facts, methodologies, or findings

Out-of-Scope Uses

  • Real-time streaming applications (latency-sensitive)
  • Non-English documents (optimized for English)
  • Image/table-heavy PDFs (text extraction only)

How to Use

Requirements

  • Python 3.10+
  • Groq API key (free at console.groq.com)
  • GPU recommended but not required

Installation

```bash
pip install numpy==1.26.4 pandas==2.2.2 scipy==1.13.1
pip install langchain-core==0.2.40 langchain-community==0.2.16 langchain==0.2.16
pip install langchain-groq==0.1.9 langchain-text-splitters==0.2.4
pip install chromadb==0.5.5 sentence-transformers==3.0.1
pip install pypdf==4.3.1 rank-bm25==0.2.2 gradio torch
```

Quick Start

  1. Open rag.ipynb in Jupyter Notebook or Google Colab
  2. Run all cells sequentially
  3. Enter your Groq API key in the Setup tab
  4. Upload PDF documents
  5. Ask questions in the Chat tab

Example Queries

```
# Single Document Analysis
"What is the main contribution of this paper?"
"Explain the methodology in detail"
"What are the limitations mentioned by the authors?"

# Multi-Document Comparison
"Compare the approaches discussed in these papers"
"What are the key differences between the methodologies?"
```

Technical Specifications

Performance Benchmarks

| Operation | Typical Duration |
|-----------|------------------|
| Model initialization | 30-60 seconds |
| PDF ingestion (per document) | 10-30 seconds |
| Simple queries | 5-8 seconds |
| Complex queries | 10-15 seconds |
| Full document summary | 30-90 seconds |

Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| max_chunk_size | 1000 | Maximum characters per semantic chunk |
| similarity_threshold | 0.5 | Cosine similarity for chunk grouping |
| chunk_size | 800 | Fallback text splitter chunk size |
| chunk_overlap | 150 | Character overlap between chunks |
| fetch_factor | 2 | Multiplier for initial retrieval pool |
| lambda_mult | 0.6 | MMR diversity parameter |
| cache_max_size | 100 | Maximum cached query responses |
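
For reference, these defaults gathered into one structure; the dataclass itself is illustrative, not the notebook's actual layout.

```python
from dataclasses import dataclass

@dataclass
class RAGConfig:
    max_chunk_size: int = 1000         # max characters per semantic chunk
    similarity_threshold: float = 0.5  # cosine similarity for chunk grouping
    chunk_size: int = 800              # fallback splitter chunk size
    chunk_overlap: int = 150           # character overlap between chunks
    fetch_factor: int = 2              # initial pool = k * fetch_factor
    lambda_mult: float = 0.6           # MMR relevance/diversity trade-off
    cache_max_size: int = 100          # max cached query responses
```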

Limitations

  • Requires active internet connection for Groq API calls
  • PDF quality affects text extraction accuracy
  • Large documents may take longer to process
  • Query cache does not persist between sessions
  • Optimized for English language documents

Training Details

This is a retrieval system, not a trained model. It orchestrates pre-trained models:

  • Embeddings: Uses pre-trained BAAI/bge-large-en-v1.5 without fine-tuning
  • Re-ranker: Uses pre-trained BAAI/bge-reranker-v2-m3 without fine-tuning
  • LLM: Uses Llama 3.3 70B via Groq API with zero-shot prompting

Evaluation

The system was evaluated qualitatively on academic papers and technical documents for:

  • Answer relevance and accuracy
  • Source attribution correctness
  • Cross-document comparison quality
  • Response structure and readability

Environmental Impact

  • Hardware: Developed and tested on Google Colab (NVIDIA T4 GPU)
  • Inference: Primary compute via Groq API (cloud-hosted)
  • Local model loading: ~2GB VRAM for embeddings + re-ranker

Citation

```bibtex
@software{multi_doc_rag_system,
  title = {Multi-Document RAG System},
  year = {2024},
  note = {Production-ready RAG system with hybrid retrieval and cross-encoder re-ranking},
  url = {https://huggingface.co/goutam-dev/rag-chatbot}
}
```

Acknowledgements

This project builds upon:

  • BAAI BGE models: bge-large-en-v1.5 (embeddings) and bge-reranker-v2-m3 (re-ranking)
  • sentence-transformers and all-MiniLM-L6-v2 (semantic chunking)
  • LangChain, langchain-groq, ChromaDB, and rank-bm25 (orchestration and retrieval)
  • Llama 3.3 70B served via the Groq API (generation)
  • Gradio (web interface) and pypdf (PDF text extraction)

Contact

For questions or feedback, please open an issue on the repository.
