# Digital Persona - Personal Knowledge Base

A RAG application that creates a queryable digital persona based on Facebook and LinkedIn data exports.
## 📁 Project Structure

```
dkisselev-zz/
├── persona_rag/                      # Main Application
│   ├── persona_app.py                # Gradio chat interface
│   ├── ingest.py                     # Data ingestion + vector DB creation
│   ├── answer.py                     # RAG retrieval with configurable techniques
│   ├── evaluate.py                   # Evaluation, tuning, RAG comparison
│   ├── tests.jsonl                   # 25 test questions (LLM-generated) (.gitignored)
│   ├── pyproject.toml                # Dependencies (uv)
│   ├── README.md                     # Complete documentation
│   └── data/
│       ├── process_data.py           # Data processing (Facebook + LinkedIn)
│       ├── processed_facebook_data.json  # (.gitignored)
│       ├── processed_linkedin_data.json  # (.gitignored)
│       └── vector_db/                # Chroma database (.gitignored)
├── data_raw/
│   ├── facebook/                     # Raw Facebook export
│   └── linkedin/                     # Raw LinkedIn export
└── README.md                         # This file
```
## 📊 Data Overview

Facebook and LinkedIn data is collected through each service's data-export functionality, then processed with `process_data.py` into JSON files that are loaded into ChromaDB for RAG.
### Facebook Data (Personal Life)
- Profile information (name, location, family, education)
- Years of posts and status updates
- Comments and social interactions
- Messages (privacy-preserving: only sent)
- Pages liked
- Events attended
- Group memberships
- Saved content
- Books read and app activities
### LinkedIn Data (Professional Career)
- Professional profile and headline
- Career history
- Technical skills
- Professional certifications
- Education
- Colleague recommendations
- Projects
- Publications and thought leadership
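The processing step rewrites structured export records as first-person natural language before indexing. A minimal sketch of that idea; the field names (`title`, `company`, `dates`) are illustrative assumptions, not the actual export schema or `process_data.py`'s logic:

```python
# Hypothetical first-person rewriting step, as done in process_data.py.
# Field names are illustrative, not the real LinkedIn export schema.
def position_to_first_person(position: dict) -> str:
    """Render one career-history entry as a first-person sentence."""
    sentence = f"I worked as {position['title']} at {position['company']}"
    if position.get("dates"):
        sentence += f" ({position['dates']})"
    return sentence + "."

print(position_to_first_person(
    {"title": "Research Fellow", "company": "The Tensor Lab",
     "dates": "2023-present"}
))
# prints: I worked as Research Fellow at The Tensor Lab (2023-present).
```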
## Test Question Generation

The evaluation framework uses `tests.jsonl` - a collection of 25 test questions generated by an LLM based on your processed data.

Example test question:

```json
{
  "question": "What is your current position?",
  "keywords": ["Tensor Lab", "Research Fellow", "current"],
  "reference_answer": "Research Fellow at The Tensor Lab, UCSF",
  "category": "career"
}
```
Generating tests:

1. Load your processed data into an LLM (e.g., GPT-4)
2. Prompt it to generate diverse test questions based on the content
3. Save the output in JSONL format with the required fields: `question`, `keywords`, `reference_answer`, `category`

Note: questions can also be written manually, without an LLM.
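Given a list of question dicts (from an LLM or written by hand), writing the JSONL file is straightforward. A sketch with a hypothetical `save_tests` helper that validates the required fields before writing:

```python
import json

# Hypothetical helper (not part of the repo) that writes test questions
# to tests.jsonl, one JSON object per line, validating required fields.
REQUIRED = {"question", "keywords", "reference_answer", "category"}

def save_tests(questions: list, path: str = "tests.jsonl") -> None:
    with open(path, "w") as f:
        for q in questions:
            missing = REQUIRED - q.keys()
            if missing:
                raise ValueError(f"missing fields: {missing}")
            f.write(json.dumps(q) + "\n")
```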
## 🎯 Architecture Highlights

### Data Pipeline

```
Raw Data (Facebook + LinkedIn)
        ↓
Processing (first-person natural language)
        ↓
Grouping (semantic units by category & time)
        ↓
Chunking (1250 chars, 250 overlap)
        ↓
Embeddings (GTE-small, 384 dims)
        ↓
Vector DB (Chroma)
```
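The chunking stage above can be sketched as a sliding character window (1250 chars with 250-char overlap, per the pipeline); the real `ingest.py` may also respect semantic boundaries:

```python
# Sketch of the chunking stage: fixed-size character windows with overlap.
def chunk_text(text: str, size: int = 1250, overlap: int = 250) -> list:
    """Split text into overlapping character windows."""
    step = size - overlap  # advance 1000 chars per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last window already covers the tail
    return chunks
```

With the defaults, consecutive chunks share their trailing/leading 250 characters, so sentences cut at a boundary still appear intact in at least one chunk.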
### RAG Query Pipeline

```
User Query
    ↓
(Query Expansion) → Sub-queries
    ↓
Semantic Search + (Hybrid BM25)
    ↓
Reranking → Context Retrieved
    ↓
LLM with RAG Context
    ↓
Agentic Tool Calls (as needed):
  • record_user_details (email capture)
  • record_unknown_question (improvement tracking)
  • push (Pushover notifications)
    ↓
Final Response
```
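The "LLM with RAG Context" step amounts to prompt assembly: retrieved chunks are formatted into a context block ahead of the user's question. A minimal sketch; the persona instructions and numbering format are illustrative, not the app's actual prompt:

```python
# Hypothetical prompt-assembly step for the RAG query pipeline.
def build_rag_prompt(query: str, chunks: list) -> str:
    """Format retrieved chunks as numbered context ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer as the persona, using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```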
## ✨ Key Features

### Advanced RAG Techniques

- **Query Expansion** - Alternative phrasings for better coverage
- **Hybrid Search** - BM25 keyword + semantic search
- **Sub-query Generation** - Breaks complex questions into parts
- **Cross-Encoder Reranking** - Precision-focused ranking
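One standard way to combine BM25 and semantic rankings for hybrid search is reciprocal rank fusion (RRF); whether `answer.py` uses RRF or weighted score interpolation is an assumption here:

```python
# Reciprocal rank fusion: merge several ranked lists of document IDs.
# A document scores 1/(k + rank) in each list it appears in; k=60 is the
# conventional damping constant from the RRF literature.
def rrf_fuse(rankings: list, k: int = 60) -> list:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by both retrievers rise to the top, while a document seen by only one retriever still gets partial credit.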
### Evaluation & Optimization

- **Hyperparameter Tuning** - Optimized chunk size
- **RAG Comparison** - Tests all 4 configurations to find the best approach
- **Comprehensive Metrics** - MRR, nDCG, Coverage, Accuracy, Completeness, Relevance
- **LLM-as-Judge** - Answer quality evaluation
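As a concrete example of the retrieval metrics, MRR rewards ranking the first relevant chunk near the top. A minimal sketch; the actual `evaluate.py` interface may differ:

```python
# Mean Reciprocal Rank over a set of queries. Each inner list marks the
# retrieved chunks, in rank order, as relevant (True) or not (False).
def mean_reciprocal_rank(results: list) -> float:
    total = 0.0
    for hits in results:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank  # credit the first relevant hit only
                break
    return total / len(results)

# First query: relevant chunk at rank 1; second: at rank 2
print(mean_reciprocal_rank([[True, False], [False, True]]))  # prints 0.75
```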
### Application Features

- **Gradio Interface** - Clean, interactive chat UI
- **Agentic Architecture** - Uses OpenAI function calling (tools) for intelligent actions:
  - `record_user_details` - Captures email addresses and user information
  - `record_unknown_question` - Logs questions that cannot be answered, for future improvement
  - `push` - Sends Pushover notifications for important interactions
- **Smart Email Collection** - Collects contact info once via tool calling; doesn't re-ask
- **Conversation History** - Multi-turn context management
## 🚀 Quick Start

```bash
# Navigate to application
cd persona_rag

# Install dependencies
uv sync

# Create .env file
echo "OPENAI_API_KEY=sk-proj-..." > .env

# Ingest data (if not already done)
uv run python ingest.py

# Launch application
uv run python persona_app.py
```

Open browser: http://127.0.0.1:7860
## 🛠️ Development Commands

```bash
# Data Processing
cd persona_rag/data
python process_data.py facebook   # Process Facebook only
python process_data.py linkedin   # Process LinkedIn only
python process_data.py both       # Process both

# Application
cd ..
uv run python persona_app.py      # Launch app

# Evaluation & Optimization
uv run python evaluate.py --compare-rag   # Compare RAG techniques
uv run python evaluate.py --tune          # Hyperparameter tuning
uv run python evaluate.py --eval          # Run evaluation
uv run python evaluate.py --all           # Everything

# RAG Technique Testing
uv run python evaluate.py --eval --query-expansion
uv run python evaluate.py --eval --hybrid-search
```