Digital Persona - Personal Knowledge Base

A RAG application that creates a queryable digital persona based on Facebook and LinkedIn data exports.


πŸ“ Project Structure

dkisselev-zz/
├── persona_rag/                    # 🚀 Main Application
│   ├── persona_app.py             # Gradio chat interface
│   ├── ingest.py                  # Data ingestion + vector DB creation
│   ├── answer.py                  # RAG retrieval with configurable techniques
│   ├── evaluate.py                # Evaluation, tuning, RAG comparison
│   ├── tests.jsonl                # 25 test questions (LLM-generated) (.gitignored)
│   ├── pyproject.toml             # Dependencies (uv)
│   ├── README.md                  # 📘 Complete documentation
│   └── data/
│       ├── process_data.py        # Data processing (Facebook + LinkedIn)
│       ├── processed_facebook_data.json  # (.gitignored)
│       ├── processed_linkedin_data.json  # (.gitignored)
│       └── vector_db/             # Chroma database (.gitignored)
├── data_raw/
│   ├── facebook/                  # Raw Facebook export
│   └── linkedin/                  # Raw LinkedIn export
└── README.md                      # This file

📊 Data Overview

Facebook and LinkedIn data is collected through the data-export functionality of each service, then processed with process_data.py into JSON files that are loaded into ChromaDB for RAG.
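As an illustration of that processing step, the sketch below turns a single raw-export entry into a first-person JSON record. The input field names (`timestamp`, `data[0].post`) are assumptions about the export format, not the actual schema process_data.py consumes:

```python
import json
from datetime import datetime, timezone

# Hypothetical raw Facebook post entry; real export schemas differ.
raw_post = {"timestamp": 1577836800, "data": [{"post": "Started a new project today!"}]}

def to_first_person(post: dict) -> dict:
    """Turn one raw export entry into a first-person text record
    ready for chunking and embedding."""
    when = datetime.fromtimestamp(post["timestamp"], tz=timezone.utc).strftime("%Y-%m-%d")
    return {
        "category": "posts",
        "date": when,
        "text": f"On {when} I posted: {post['data'][0]['post']}",
    }

record = to_first_person(raw_post)
print(json.dumps(record))
```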

Facebook Data (Personal Life)

  • Profile information (name, location, family, education)
  • Years of posts and status updates
  • Comments and social interactions
  • Messages (privacy-preserving: only sent)
  • Pages liked
  • Events attended
  • Group memberships
  • Saved content
  • Books read and app activities

LinkedIn Data (Professional Career)

  • Professional profile and headline
  • Career history
  • Technical skills
  • Professional certifications
  • Education
  • Colleague recommendations
  • Projects
  • Publications and thought leadership

Test Question Generation

The evaluation framework uses tests.jsonl - a collection of 25 test questions generated by an LLM based on your processed data.

Example test question:

{
  "question": "What is your current position?",
  "keywords": ["Tensor Lab", "Research Fellow", "current"],
  "reference_answer": "Research Fellow at The Tensor Lab, UCSF",
  "category": "career"
}

Generating tests:

  1. Load your processed data into an LLM (e.g., GPT-4)
  2. Prompt it to generate diverse test questions based on the content
  3. Save the output in JSONL format with the required fields: question, keywords, reference_answer, category

Note: Test questions can also be written manually, without involving an LLM.
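A minimal loader that enforces the four required fields could look like this (a sketch; evaluate.py's actual loading code may differ):

```python
import json

REQUIRED_FIELDS = {"question", "keywords", "reference_answer", "category"}

def load_tests(lines):
    """Parse JSONL test cases and reject any missing a required field."""
    tests = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        case = json.loads(line)
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"test case missing fields: {sorted(missing)}")
        tests.append(case)
    return tests

sample = [
    '{"question": "What is your current position?",'
    ' "keywords": ["Tensor Lab", "Research Fellow", "current"],'
    ' "reference_answer": "Research Fellow at The Tensor Lab, UCSF",'
    ' "category": "career"}'
]
tests = load_tests(sample)
```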


🎯 Architecture Highlights

Data Pipeline

Raw Data (Facebook + LinkedIn)
    ↓
Processing (first-person natural language)
    ↓
Grouping (semantic units by category & time)
    ↓
Chunking (1250 chars, 250 overlap)
    ↓
Embeddings (GTE-small, 384 dims)
    ↓
Vector DB (Chroma)
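The chunking step above can be sketched in a few lines with the same 1250-character size and 250-character overlap. This is a simplified character-based splitter; ingest.py may split on semantic boundaries instead:

```python
def chunk_text(text: str, size: int = 1250, overlap: int = 250) -> list[str]:
    """Split text into fixed-size chunks where each chunk shares its
    last `overlap` characters with the start of the next chunk."""
    step = size - overlap
    return [text[start:start + size]
            for start in range(0, max(len(text) - overlap, 1), step)]

sample = "".join(str(i % 10) for i in range(3000))
chunks = chunk_text(sample)  # 3 chunks: 0-1250, 1000-2250, 2000-3000
```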

RAG Query Pipeline

User Query 
    ↓
(Query Expansion) → Sub-queries →
Semantic Search → (Hybrid BM25) →
Reranking → Context Retrieved
    ↓
LLM with RAG Context
    ↓
Agentic Tool Calls (as needed):
  • record_user_details (email capture)
  • record_unknown_question (improvement tracking)
  • push (Pushover notifications)
    ↓
Final Response
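The query-expansion step at the top of the pipeline typically works by asking the LLM for alternative phrasings of the user's question, then searching with each. A sketch of such a prompt builder (the prompt wording is an assumption, not the one used in answer.py):

```python
EXPANSION_PROMPT = (
    "Rewrite the user's question as {n} alternative phrasings, one per line, "
    "preserving its meaning.\n\nQuestion: {question}"
)

def build_expansion_prompt(question: str, n: int = 3) -> str:
    """Build the prompt sent to the LLM; each returned line is then
    searched against the vector DB alongside the original query."""
    return EXPANSION_PROMPT.format(n=n, question=question)

prompt = build_expansion_prompt("What is your current position?")
```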

🎓 Key Features

Advanced RAG Techniques

  • Query Expansion - Alternative phrasings for better coverage
  • Hybrid Search - BM25 keyword + semantic search
  • Sub-query Generation - Break complex questions into parts
  • Cross-Encoder Reranking - Precision-focused ranking

Evaluation & Optimization

  • Hyperparameter Tuning - Optimized chunk size
  • RAG Comparison - Test all 4 configurations to find best approach
  • Comprehensive Metrics - MRR, nDCG, Coverage, Accuracy, Completeness, Relevance
  • LLM-as-Judge - Answer quality evaluation

Application Features

  • Gradio Interface - Clean, interactive chat UI
  • Agentic Architecture - Uses OpenAI function calling (tools) for intelligent actions
    • record_user_details - Captures email addresses and user information
    • record_unknown_question - Logs questions that cannot be answered for future improvement
    • push - Sends Pushover notifications for important interactions
  • Smart Email Collection - Collects contact info once via tool calling, doesn't re-ask
  • Conversation History - Multi-turn context management

🚀 Quick Start

# Navigate to application
cd persona_rag

# Install dependencies
uv sync

# Create .env file
echo "OPENAI_API_KEY=sk-proj-..." > .env

# Ingest data (if not already done)
uv run python ingest.py

# Launch application
uv run python persona_app.py

Open browser: http://127.0.0.1:7860


πŸ› οΈ Development Commands

# Data Processing
cd persona_rag/data
python process_data.py facebook          # Process Facebook only
python process_data.py linkedin          # Process LinkedIn only
python process_data.py both              # Process both

# Application
cd ..
uv run python persona_app.py             # Launch app

# Evaluation & Optimization
uv run python evaluate.py --compare-rag  # Compare RAG techniques
uv run python evaluate.py --tune         # Hyperparameter tuning
uv run python evaluate.py --eval         # Run evaluation
uv run python evaluate.py --all          # Everything

# RAG Technique Testing
uv run python evaluate.py --eval --query-expansion
uv run python evaluate.py --eval --hybrid-search