Digital Persona - Personal Knowledge Base

A RAG application that creates a queryable digital persona based on Facebook and LinkedIn data exports.


πŸ“ Project Structure

dkisselev-zz/
├── persona_rag/                    # 🚀 Main Application
│   ├── persona_app.py             # Gradio chat interface
│   ├── ingest.py                  # Data ingestion + vector DB creation
│   ├── answer.py                  # RAG retrieval with configurable techniques
│   ├── evaluate.py                # Evaluation, tuning, RAG comparison
│   ├── tests.jsonl                # 25 test questions (LLM-generated) (.gitignored)
│   ├── pyproject.toml             # Dependencies (uv)
│   ├── README.md                  # 📘 Complete documentation
│   └── data/
│       ├── process_data.py        # Data processing (Facebook + LinkedIn)
│       ├── processed_facebook_data.json  # (.gitignored)
│       ├── processed_linkedin_data.json  # (.gitignored)
│       └── vector_db/             # Chroma database (.gitignored)
├── data_raw/
│   ├── facebook/                  # Raw Facebook export
│   └── linkedin/                  # Raw LinkedIn export
└── README.md                      # This file

📊 Data Overview

Facebook and LinkedIn data is collected through the data-export functionality of each service, then processed with process_data.py into JSON files that are loaded into ChromaDB for RAG.
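As an illustration of that processing step, the sketch below turns a single raw-export entry into a first-person JSON record. The input field names (`timestamp`, `data[0].post`) are assumptions about the export format, not the actual schema process_data.py consumes:

```python
import json
from datetime import datetime, timezone

# Hypothetical raw Facebook post entry; real export schemas differ.
raw_post = {"timestamp": 1577836800, "data": [{"post": "Started a new project today!"}]}

def to_first_person(post: dict) -> dict:
    """Turn one raw export entry into a first-person text record
    ready for chunking and embedding."""
    when = datetime.fromtimestamp(post["timestamp"], tz=timezone.utc).strftime("%Y-%m-%d")
    return {
        "category": "posts",
        "date": when,
        "text": f"On {when} I posted: {post['data'][0]['post']}",
    }

record = to_first_person(raw_post)
print(json.dumps(record))
```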

Facebook Data (Personal Life)

  • Profile information (name, location, family, education)
  • Years of posts and status updates
  • Comments and social interactions
  • Messages (privacy-preserving: only sent)
  • Pages liked
  • Events attended
  • Group memberships
  • Saved content
  • Books read and app activities

LinkedIn Data (Professional Career)

  • Professional profile and headline
  • Career history
  • Technical skills
  • Professional certifications
  • Education
  • Colleague recommendations
  • Projects
  • Publications and thought leadership

Test Question Generation

The evaluation framework uses tests.jsonl - a collection of 25 test questions generated by an LLM based on your processed data.

Example test question:

{
  "question": "What is your current position?",
  "keywords": ["Tensor Lab", "Research Fellow", "current"],
  "reference_answer": "Research Fellow at The Tensor Lab, UCSF",
  "category": "career"
}

Generating tests:

  1. Load your processed data into an LLM (e.g., GPT-4)
  2. Prompt it to generate diverse test questions based on the content
  3. Save the output in JSONL format with the required fields: question, keywords, reference_answer, category

Note: Test questions can also be written manually, without involving an LLM.
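A minimal loader that enforces the four required fields could look like this (a sketch; evaluate.py's actual loading code may differ):

```python
import json

REQUIRED_FIELDS = {"question", "keywords", "reference_answer", "category"}

def load_tests(lines):
    """Parse JSONL test cases and reject any missing a required field."""
    tests = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        case = json.loads(line)
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"test case missing fields: {sorted(missing)}")
        tests.append(case)
    return tests

sample = [
    '{"question": "What is your current position?",'
    ' "keywords": ["Tensor Lab", "Research Fellow", "current"],'
    ' "reference_answer": "Research Fellow at The Tensor Lab, UCSF",'
    ' "category": "career"}'
]
tests = load_tests(sample)
```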


🎯 Architecture Highlights

Data Pipeline

Raw Data (Facebook + LinkedIn)
    ↓
Processing (first-person natural language)
    ↓
Grouping (semantic units by category & time)
    ↓
Chunking (1250 chars, 250 overlap)
    ↓
Embeddings (GTE-small, 384 dims)
    ↓
Vector DB (Chroma)
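The chunking step above can be sketched in a few lines with the same 1250-character size and 250-character overlap. This is a simplified character-based splitter; ingest.py may split on semantic boundaries instead:

```python
def chunk_text(text: str, size: int = 1250, overlap: int = 250) -> list[str]:
    """Split text into fixed-size chunks where each chunk shares its
    last `overlap` characters with the start of the next chunk."""
    step = size - overlap
    return [text[start:start + size]
            for start in range(0, max(len(text) - overlap, 1), step)]

sample = "".join(str(i % 10) for i in range(3000))
chunks = chunk_text(sample)  # 3 chunks: 0-1250, 1000-2250, 2000-3000
```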

RAG Query Pipeline

User Query 
    ↓
(Query Expansion) → Sub-queries →
Semantic Search → (Hybrid BM25) →
Reranking → Context Retrieved
    ↓
LLM with RAG Context
    ↓
Agentic Tool Calls (as needed):
  • record_user_details (email capture)
  • record_unknown_question (improvement tracking)
  • push (Pushover notifications)
    ↓
Final Response
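The query-expansion step at the top of the pipeline typically works by asking the LLM for alternative phrasings of the user's question, then searching with each. A sketch of such a prompt builder (the prompt wording is an assumption, not the one used in answer.py):

```python
EXPANSION_PROMPT = (
    "Rewrite the user's question as {n} alternative phrasings, one per line, "
    "preserving its meaning.\n\nQuestion: {question}"
)

def build_expansion_prompt(question: str, n: int = 3) -> str:
    """Build the prompt sent to the LLM; each returned line is then
    searched against the vector DB alongside the original query."""
    return EXPANSION_PROMPT.format(n=n, question=question)

prompt = build_expansion_prompt("What is your current position?")
```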

🎓 Key Features

Advanced RAG Techniques

  • Query Expansion - Alternative phrasings for better coverage
  • Hybrid Search - BM25 keyword + semantic search
  • Sub-query Generation - Break complex questions into parts
  • Cross-Encoder Reranking - Precision-focused ranking

Evaluation & Optimization

  • Hyperparameter Tuning - Optimized chunk size
  • RAG Comparison - Test all 4 configurations to find best approach
  • Comprehensive Metrics - MRR, nDCG, Coverage, Accuracy, Completeness, Relevance
  • LLM-as-Judge - Answer quality evaluation

Application Features

  • Gradio Interface - Clean, interactive chat UI
  • Agentic Architecture - Uses OpenAI function calling (tools) for intelligent actions
    • record_user_details - Captures email addresses and user information
    • record_unknown_question - Logs questions that cannot be answered for future improvement
    • push - Sends Pushover notifications for important interactions
  • Smart Email Collection - Collects contact info once via tool calling, doesn't re-ask
  • Conversation History - Multi-turn context management

🚀 Quick Start

# Navigate to application
cd persona_rag

# Install dependencies
uv sync

# Create .env file
echo "OPENAI_API_KEY=sk-proj-..." > .env

# Ingest data (if not already done)
uv run python ingest.py

# Launch application
uv run python persona_app.py

Open browser: http://127.0.0.1:7860


πŸ› οΈ Development Commands

# Data Processing
cd persona_rag/data
python process_data.py facebook          # Process Facebook only
python process_data.py linkedin          # Process LinkedIn only
python process_data.py both              # Process both

# Application
cd ..
uv run python persona_app.py             # Launch app

# Evaluation & Optimization
uv run python evaluate.py --compare-rag  # Compare RAG techniques
uv run python evaluate.py --tune         # Hyperparameter tuning
uv run python evaluate.py --eval         # Run evaluation
uv run python evaluate.py --all          # Everything

# RAG Technique Testing
uv run python evaluate.py --eval --query-expansion
uv run python evaluate.py --eval --hybrid-search