diff --git a/.github/README.md b/.github/README.md index 5ca61af8fee9f9feb0805df6636417391aa55a81..fc0df8d8957440e376c22ebd3ab7ab27f76117cf 100644 --- a/.github/README.md +++ b/.github/README.md @@ -7,7 +7,11 @@ sdk: gradio sdk_version: "6.0.1" python_version: "3.11" app_file: src/app.py -pinned: false +hf_oauth: true +hf_oauth_expiration_minutes: 480 +hf_oauth_scopes: + - inference-api +pinned: true license: mit tags: - mcp-in-action-track-enterprise @@ -19,6 +23,18 @@ tags: - modal --- +
+ +[![GitHub](https://img.shields.io/github/stars/DeepCritical/GradioDemo?style=for-the-badge&logo=github&logoColor=white&label=🐙%20GitHub&labelColor=181717&color=181717)](https://github.com/DeepCritical/GradioDemo) +[![Documentation](https://img.shields.io/badge/📚%20Docs-0080FF?style=for-the-badge&logo=readthedocs&logoColor=white&labelColor=0080FF&color=0080FF)](docs/index.md) +[![Demo](https://img.shields.io/badge/🚀%20Demo-FFD21E?style=for-the-badge&logo=huggingface&logoColor=white&labelColor=FFD21E&color=FFD21E)](https://huggingface.co/spaces/DataQuests/DeepCritical) +[![CodeCov](https://img.shields.io/badge/📊%20Coverage-F01F7A?style=for-the-badge&logo=codecov&logoColor=white&labelColor=F01F7A&color=F01F7A)](https://codecov.io/gh/DeepCritical/GradioDemo) +[![Join us on Discord](https://img.shields.io/discord/1109943800132010065?label=Discord&logo=discord&style=flat-square)](https://discord.gg/qdfnvSPcqP) + + +
+ + # DeepCritical ## Intro @@ -27,9 +43,10 @@ tags: - **Multi-Source Search**: PubMed, ClinicalTrials.gov, bioRxiv/medRxiv - **MCP Integration**: Use our tools from Claude Desktop or any MCP client +- **HuggingFace OAuth**: Sign in with your HuggingFace account to automatically use your API token - **Modal Sandbox**: Secure execution of AI-generated statistical code - **LlamaIndex RAG**: Semantic search and evidence synthesis -- **HuggingfaceInference**: +- **HuggingfaceInference**: Free tier support with automatic fallback - **HuggingfaceMCP Custom Config To Use Community Tools**: - **Strongly Typed Composable Graphs**: - **Specialized Research Teams of Agents**: @@ -55,7 +72,20 @@ uv run gradio run src/app.py Open your browser to `http://localhost:7860`. -### 3. Connect via MCP +### 3. Authentication (Optional) + +**HuggingFace OAuth Login**: +- Click the "Sign in with HuggingFace" button at the top of the app +- Your HuggingFace API token will be automatically used for AI inference +- No need to manually enter API keys when logged in +- OAuth token is used only for the current session and never stored + +**Manual API Key (BYOK)**: +- You can still provide your own API key in the Settings accordion +- Supports HuggingFace, OpenAI, or Anthropic API keys +- Manual keys take priority over OAuth tokens + +### 4. Connect via MCP This application exposes a Model Context Protocol (MCP) server, allowing you to use its search tools directly from Claude Desktop or other MCP clients. @@ -81,7 +111,13 @@ Add this to your `claude_desktop_config.json`: - `analyze_hypothesis`: Secure statistical analysis using Modal sandboxes. -## Deep Research Flows +## Architecture + +DeepCritical uses a Vertical Slice Architecture: + +1. **Search Slice**: Retrieving evidence from PubMed, ClinicalTrials.gov, and bioRxiv. +2. **Judge Slice**: Evaluating evidence quality using LLMs. +3. **Orchestrator Slice**: Managing the research loop and UI. - iterativeResearch - deepResearch @@ -89,6 +125,7 @@ Add this to your `claude_desktop_config.json`: ### Iterative Research +```mermaid sequenceDiagram participant IterativeFlow participant ThinkingAgent @@ -121,10 +158,12 @@ sequenceDiagram JudgeHandler-->>IterativeFlow: should_continue end end +``` ### Deep Research +```mermaid sequenceDiagram actor User participant GraphOrchestrator @@ -159,8 +198,10 @@ sequenceDiagram end GraphOrchestrator->>User: AsyncGenerator[AgentEvent] +``` ### Research Team + Critical Deep Research Agent ## Development @@ -177,27 +218,6 @@ uv run pytest make check ``` -## Architecture - -DeepCritical uses a Vertical Slice Architecture: - -1. **Search Slice**: Retrieving evidence from PubMed, ClinicalTrials.gov, and bioRxiv. -2. **Judge Slice**: Evaluating evidence quality using LLMs. -3. **Orchestrator Slice**: Managing the research loop and UI. - -Built with: -- **PydanticAI**: For robust agent interactions. -- **Gradio**: For the streaming user interface. -- **PubMed, ClinicalTrials.gov, bioRxiv**: For biomedical data. -- **MCP**: For universal tool access. -- **Modal**: For secure code execution. 
- -## Team - -- The-Obstacle-Is-The-Way -- MarioAderman -- Josephrp - ## Links -- [GitHub Repository](https://github.com/The-Obstacle-Is-The-Way/DeepCritical-1) \ No newline at end of file +- [GitHub Repository](https://github.com/DeepCritical/GradioDemo) \ No newline at end of file diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index c3d4ae628cb76fa4376889ab1e365d86ea34a8d5..64e149aa223ffee84f51369d58f13d36de211bfc 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -16,6 +16,11 @@ jobs: steps: - uses: actions/checkout@v4 + - name: Install uv + uses: astral-sh/setup-uv@v5 + with: + version: "latest" + - name: Set up Python ${{ matrix.python-version }} uses: actions/setup-python@v5 with: @@ -23,45 +28,40 @@ jobs: - name: Install dependencies run: | - python -m pip install --upgrade pip - pip install -e ".[dev]" + uv sync --dev - name: Lint with ruff run: | - ruff check . --exclude tests - ruff format --check . --exclude tests + uv run ruff check . --exclude tests + uv run ruff format --check . --exclude tests - name: Type check with mypy run: | - mypy src - - - name: Install embedding dependencies - run: | - pip install -e ".[embeddings]" + uv run mypy src - - name: Run unit tests (excluding OpenAI and embedding providers) + - name: Run unit tests (No Black Box Apis) env: HF_TOKEN: ${{ secrets.HF_TOKEN }} run: | - pytest tests/unit/ -v -m "not openai and not embedding_provider" --tb=short -p no:logfire + uv run pytest tests/unit/ -v -m "not openai and not embedding_provider" --tb=short -p no:logfire - name: Run local embeddings tests env: HF_TOKEN: ${{ secrets.HF_TOKEN }} run: | - pytest tests/ -v -m "local_embeddings" --tb=short -p no:logfire || true + uv run pytest tests/ -v -m "local_embeddings" --tb=short -p no:logfire || true continue-on-error: true # Allow failures if dependencies not available - name: Run HuggingFace integration tests env: HF_TOKEN: ${{ secrets.HF_TOKEN }} run: | - pytest tests/integration/ -v -m "huggingface and not embedding_provider" --tb=short -p no:logfire || true + uv run pytest tests/integration/ -v -m "huggingface and not embedding_provider" --tb=short -p no:logfire || true continue-on-error: true # Allow failures if HF_TOKEN not set - name: Run non-OpenAI integration tests (excluding embedding providers) env: HF_TOKEN: ${{ secrets.HF_TOKEN }} run: | - pytest tests/integration/ -v -m "integration and not openai and not embedding_provider" --tb=short -p no:logfire || true + uv run pytest tests/integration/ -v -m "integration and not openai and not embedding_provider" --tb=short -p no:logfire || true continue-on-error: true # Allow failures if dependencies not available diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml new file mode 100644 index 0000000000000000000000000000000000000000..62529da06a06d1b86253aa9c9f70298d538db2b4 --- /dev/null +++ b/.github/workflows/docs.yml @@ -0,0 +1,55 @@ +name: Documentation + +on: + push: + branches: + - main + paths: + - 'docs/**' + - 'mkdocs.yml' + - '.github/workflows/docs.yml' + pull_request: + branches: + - main + paths: + - 'docs/**' + - 'mkdocs.yml' + - '.github/workflows/docs.yml' + workflow_dispatch: + +permissions: + contents: write + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + + - name: Install uv + run: | + pip install uv + + - name: Install dependencies + run: | + uv sync --all-extras --dev + + - name: Build documentation + run: 
| + uv run mkdocs build --strict + + - name: Deploy to GitHub Pages + if: github.ref == 'refs/heads/main' && github.event_name == 'push' + uses: peaceiris/actions-gh-pages@v3 + with: + github_token: ${{ secrets.GITHUB_TOKEN }} + publish_dir: ./site + cname: false + + + diff --git a/.gitignore b/.gitignore index 8b9c2be2dd32820057ce520015e4904a7648f6b7..13eb23a33c1152c4b2258e8a27bfb0251837eb46 100644 --- a/.gitignore +++ b/.gitignore @@ -1,6 +1,10 @@ +=0.22.0 +=0.22.0, folder/ +site/ .cursor/ .ruff_cache/ +docs/contributing/ # Python __pycache__/ *.py[cod] diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 12a77673427b329be27cc9533caffbde84013527..0d08dd3bf813709c4c4df5a8fc5f6ebdb16c84f3 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -31,14 +31,9 @@ repos: types: [python] args: [ "run", - "pytest", - "tests/unit/", - "-v", - "-m", - "not openai and not embedding_provider", - "--tb=short", - "-p", - "no:logfire", + "python", + ".pre-commit-hooks/run_pytest_with_sync.py", + "unit", ] pass_filenames: false always_run: true @@ -50,14 +45,9 @@ repos: types: [python] args: [ "run", - "pytest", - "tests/", - "-v", - "-m", - "local_embeddings", - "--tb=short", - "-p", - "no:logfire", + "python", + ".pre-commit-hooks/run_pytest_with_sync.py", + "embeddings", ] pass_filenames: false always_run: true diff --git a/.pre-commit-hooks/run_pytest.ps1 b/.pre-commit-hooks/run_pytest.ps1 index ec548f3ca13fb782048df191298defe82e21518d..3df4f371b845a48ce3a1ea32e307218abbd5a033 100644 --- a/.pre-commit-hooks/run_pytest.ps1 +++ b/.pre-commit-hooks/run_pytest.ps1 @@ -2,6 +2,8 @@ # Uses uv if available, otherwise falls back to pytest if (Get-Command uv -ErrorAction SilentlyContinue) { + # Sync dependencies before running tests + uv sync uv run pytest $args } else { Write-Warning "uv not found, using system pytest (may have missing dependencies)" @@ -12,3 +14,6 @@ if (Get-Command uv -ErrorAction SilentlyContinue) { + + + diff --git a/.pre-commit-hooks/run_pytest.sh b/.pre-commit-hooks/run_pytest.sh index 8ecca4a4ca37f53f7bf9f749e6add363225ead41..b2a4be920113fd340631f64602c24042e8c81086 100644 --- a/.pre-commit-hooks/run_pytest.sh +++ b/.pre-commit-hooks/run_pytest.sh @@ -3,6 +3,8 @@ # Uses uv if available, otherwise falls back to pytest if command -v uv >/dev/null 2>&1; then + # Sync dependencies before running tests + uv sync uv run pytest "$@" else echo "Warning: uv not found, using system pytest (may have missing dependencies)" @@ -13,3 +15,6 @@ fi + + + diff --git a/.pre-commit-hooks/run_pytest_embeddings.ps1 b/.pre-commit-hooks/run_pytest_embeddings.ps1 new file mode 100644 index 0000000000000000000000000000000000000000..47a3e32a202240c42e5a205d2afd778a23292db7 --- /dev/null +++ b/.pre-commit-hooks/run_pytest_embeddings.ps1 @@ -0,0 +1,14 @@ +# PowerShell wrapper to sync embeddings dependencies and run embeddings tests + +$ErrorActionPreference = "Stop" + +if (Get-Command uv -ErrorAction SilentlyContinue) { + Write-Host "Syncing embeddings dependencies..." + uv sync --extra embeddings + Write-Host "Running embeddings tests..." 
+ uv run pytest tests/ -v -m local_embeddings --tb=short -p no:logfire +} else { + Write-Error "uv not found" + exit 1 +} + diff --git a/.pre-commit-hooks/run_pytest_embeddings.sh b/.pre-commit-hooks/run_pytest_embeddings.sh new file mode 100644 index 0000000000000000000000000000000000000000..6f1b80746217244367ee86fcd7d69837df648b40 --- /dev/null +++ b/.pre-commit-hooks/run_pytest_embeddings.sh @@ -0,0 +1,15 @@ +#!/bin/bash +# Wrapper script to sync embeddings dependencies and run embeddings tests + +set -e + +if command -v uv >/dev/null 2>&1; then + echo "Syncing embeddings dependencies..." + uv sync --extra embeddings + echo "Running embeddings tests..." + uv run pytest tests/ -v -m local_embeddings --tb=short -p no:logfire +else + echo "Error: uv not found" + exit 1 +fi + diff --git a/.pre-commit-hooks/run_pytest_unit.ps1 b/.pre-commit-hooks/run_pytest_unit.ps1 new file mode 100644 index 0000000000000000000000000000000000000000..c1196d22e86fe66a56d12f673c003ac88aa6b09f --- /dev/null +++ b/.pre-commit-hooks/run_pytest_unit.ps1 @@ -0,0 +1,14 @@ +# PowerShell wrapper to sync dependencies and run unit tests + +$ErrorActionPreference = "Stop" + +if (Get-Command uv -ErrorAction SilentlyContinue) { + Write-Host "Syncing dependencies..." + uv sync + Write-Host "Running unit tests..." + uv run pytest tests/unit/ -v -m "not openai and not embedding_provider" --tb=short -p no:logfire +} else { + Write-Error "uv not found" + exit 1 +} + diff --git a/.pre-commit-hooks/run_pytest_unit.sh b/.pre-commit-hooks/run_pytest_unit.sh new file mode 100644 index 0000000000000000000000000000000000000000..173ab1b607647ecf4b4a1de6b75abd47fc0130ec --- /dev/null +++ b/.pre-commit-hooks/run_pytest_unit.sh @@ -0,0 +1,15 @@ +#!/bin/bash +# Wrapper script to sync dependencies and run unit tests + +set -e + +if command -v uv >/dev/null 2>&1; then + echo "Syncing dependencies..." + uv sync + echo "Running unit tests..." + uv run pytest tests/unit/ -v -m "not openai and not embedding_provider" --tb=short -p no:logfire +else + echo "Error: uv not found" + exit 1 +fi + diff --git a/.pre-commit-hooks/run_pytest_with_sync.ps1 b/.pre-commit-hooks/run_pytest_with_sync.ps1 new file mode 100644 index 0000000000000000000000000000000000000000..546a5096bc6e4b9a46d039f5761234022b8658dd --- /dev/null +++ b/.pre-commit-hooks/run_pytest_with_sync.ps1 @@ -0,0 +1,25 @@ +# PowerShell wrapper for pytest runner +# Ensures uv is available and runs the Python script + +param( + [Parameter(Position=0)] + [string]$TestType = "unit" +) + +$ErrorActionPreference = "Stop" + +# Check if uv is available +if (-not (Get-Command uv -ErrorAction SilentlyContinue)) { + Write-Error "uv not found. 
Please install uv: https://github.com/astral-sh/uv" + exit 1 +} + +# Get the script directory +$ScriptDir = Split-Path -Parent $MyInvocation.MyCommand.Path +$PythonScript = Join-Path $ScriptDir "run_pytest_with_sync.py" + +# Run the Python script using uv +uv run python $PythonScript $TestType + +exit $LASTEXITCODE + diff --git a/.pre-commit-hooks/run_pytest_with_sync.py b/.pre-commit-hooks/run_pytest_with_sync.py new file mode 100644 index 0000000000000000000000000000000000000000..a29427a6737b7b37f80c60d7d11c17b91a61c8d8 --- /dev/null +++ b/.pre-commit-hooks/run_pytest_with_sync.py @@ -0,0 +1,93 @@ +#!/usr/bin/env python3 +"""Cross-platform pytest runner that syncs dependencies before running tests.""" + +import subprocess +import sys + + +def run_command( + cmd: list[str], check: bool = True, shell: bool = False, cwd: str | None = None +) -> int: + """Run a command and return exit code.""" + try: + result = subprocess.run( + cmd, + check=check, + shell=shell, + cwd=cwd, + env=None, # Use current environment, uv will handle venv + ) + return result.returncode + except subprocess.CalledProcessError as e: + return e.returncode + except FileNotFoundError: + print(f"Error: Command not found: {cmd[0]}") + return 1 + + +def main() -> int: + """Main entry point.""" + import os + from pathlib import Path + + # Get the project root (where pyproject.toml is) + script_dir = Path(__file__).parent + project_root = script_dir.parent + + # Change to project root to ensure uv works correctly + os.chdir(project_root) + + # Check if uv is available + if run_command(["uv", "--version"], check=False) != 0: + print("Error: uv not found. Please install uv: https://github.com/astral-sh/uv") + return 1 + + # Parse arguments + test_type = sys.argv[1] if len(sys.argv) > 1 else "unit" + extra_args = sys.argv[2:] if len(sys.argv) > 2 else [] + + # Sync dependencies - always include dev + # Note: embeddings dependencies are now in main dependencies, not optional + # So we just sync with --dev for all test types + sync_cmd = ["uv", "sync", "--dev"] + + print(f"Syncing dependencies for {test_type} tests...") + if run_command(sync_cmd, cwd=project_root) != 0: + return 1 + + # Build pytest command - use uv run to ensure correct environment + if test_type == "unit": + pytest_args = [ + "tests/unit/", + "-v", + "-m", + "not openai and not embedding_provider", + "--tb=short", + "-p", + "no:logfire", + ] + elif test_type == "embeddings": + pytest_args = [ + "tests/", + "-v", + "-m", + "local_embeddings", + "--tb=short", + "-p", + "no:logfire", + ] + else: + pytest_args = [] + + pytest_args.extend(extra_args) + + # Use uv run python -m pytest to ensure we use the venv's pytest + # This is more reliable than uv run pytest which might find system pytest + pytest_cmd = ["uv", "run", "python", "-m", "pytest", *pytest_args] + + print(f"Running {test_type} tests...") + return run_command(pytest_cmd, cwd=project_root) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/=0.22.0 b/=0.22.0 new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/=0.22.0, b/=0.22.0, new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md deleted file mode 100644 index 01b17d0d73a01b2c97d35f2d7d09c81437e274dc..0000000000000000000000000000000000000000 --- a/CONTRIBUTING.md +++ /dev/null @@ -1 +0,0 @@ -make sure you run the full pre-commit checks before opening a PR (not draft) 
otherwise Obstacle is the Way will loose his mind \ No newline at end of file diff --git a/Makefile b/Makefile index eebf37bb63cd097a6f312bde21fe9877975bc8e1..185a214d84f7fd56284179c298c26790e1f938c8 100644 --- a/Makefile +++ b/Makefile @@ -37,6 +37,15 @@ typecheck: check: lint typecheck test-cov @echo "All checks passed!" +docs-build: + uv run mkdocs build + +docs-serve: + uv run mkdocs serve + +docs-clean: + rm -rf site/ + clean: rm -rf .pytest_cache .mypy_cache .ruff_cache __pycache__ .coverage htmlcov find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true diff --git a/README.md b/README.md index 2dd0e49df9108088969f2e7ebba53115201d7200..858344c44c6824cae59b1872987c98500e8c01a9 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,17 @@ --- -title: DeepCritical -emoji: 🧬 -colorFrom: blue -colorTo: purple +title: Critical Deep Resarch +emoji: 🐉 +colorFrom: red +colorTo: orange sdk: gradio sdk_version: "6.0.1" python_version: "3.11" app_file: src/app.py -pinned: false +hf_oauth: true +hf_oauth_expiration_minutes: 480 +hf_oauth_scopes: + - inference-api +pinned: true license: mit tags: - mcp-in-action-track-enterprise @@ -19,178 +23,100 @@ tags: - modal --- +> [!IMPORTANT] +> **You are reading the Gradio Demo README!** +> +> - 📚 **Documentation**: See our [technical documentation](docs/index.md) for detailed information +> - 📖 **Complete README**: Check out the [full README](.github/README.md) for setup, configuration, and contribution guidelines +> - 🏆 **Hackathon Submission**: Keep reading below for more information about our MCP Hackathon submission + +
+ +[![GitHub](https://img.shields.io/github/stars/DeepCritical/GradioDemo?style=for-the-badge&logo=github&logoColor=white&label=🐙%20GitHub&labelColor=181717&color=181717)](https://github.com/DeepCritical/GradioDemo) +[![Documentation](https://img.shields.io/badge/📚%20Docs-0080FF?style=for-the-badge&logo=readthedocs&logoColor=white&labelColor=0080FF&color=0080FF)](docs/index.md) +[![Demo](https://img.shields.io/badge/🚀%20Demo-FFD21E?style=for-the-badge&logo=huggingface&logoColor=white&labelColor=FFD21E&color=FFD21E)](https://huggingface.co/spaces/DataQuests/DeepCritical) +[![CodeCov](https://img.shields.io/badge/📊%20Coverage-F01F7A?style=for-the-badge&logo=codecov&logoColor=white&labelColor=F01F7A&color=F01F7A)](https://codecov.io/gh/DeepCritical/GradioDemo) +[![Join us on Discord](https://img.shields.io/discord/1109943800132010065?label=Discord&logo=discord&style=flat-square)](https://discord.gg/qdfnvSPcqP) + + +
+ # DeepCritical -## Intro - -## Features - -- **Multi-Source Search**: PubMed, ClinicalTrials.gov, bioRxiv/medRxiv -- **MCP Integration**: Use our tools from Claude Desktop or any MCP client -- **Modal Sandbox**: Secure execution of AI-generated statistical code -- **LlamaIndex RAG**: Semantic search and evidence synthesis -- **HuggingfaceInference**: -- **HuggingfaceMCP Custom Config To Use Community Tools**: -- **Strongly Typed Composable Graphs**: -- **Specialized Research Teams of Agents**: - -## Quick Start - -### 1. Environment Setup - -```bash -# Install uv if you haven't already -pip install uv - -# Sync dependencies -uv sync -``` - -### 2. Run the UI - -```bash -# Start the Gradio app -uv run gradio run src/app.py -``` - -Open your browser to `http://localhost:7860`. - -### 3. Connect via MCP - -This application exposes a Model Context Protocol (MCP) server, allowing you to use its search tools directly from Claude Desktop or other MCP clients. - -**MCP Server URL**: `http://localhost:7860/gradio_api/mcp/` - -**Claude Desktop Configuration**: -Add this to your `claude_desktop_config.json`: -```json -{ - "mcpServers": { - "deepcritical": { - "url": "http://localhost:7860/gradio_api/mcp/" - } - } -} -``` - -**Available Tools**: -- `search_pubmed`: Search peer-reviewed biomedical literature. -- `search_clinical_trials`: Search ClinicalTrials.gov. -- `search_biorxiv`: Search bioRxiv/medRxiv preprints. -- `search_all`: Search all sources simultaneously. -- `analyze_hypothesis`: Secure statistical analysis using Modal sandboxes. - - - -## Architecture - -DeepCritical uses a Vertical Slice Architecture: - -1. **Search Slice**: Retrieving evidence from PubMed, ClinicalTrials.gov, and bioRxiv. -2. **Judge Slice**: Evaluating evidence quality using LLMs. -3. **Orchestrator Slice**: Managing the research loop and UI. 
-
-- iterativeResearch
-- deepResearch
-- researchTeam
-
-### Iterative Research
-
-sequenceDiagram
-    participant IterativeFlow
-    participant ThinkingAgent
-    participant KnowledgeGapAgent
-    participant ToolSelector
-    participant ToolExecutor
-    participant JudgeHandler
-    participant WriterAgent
-
-    IterativeFlow->>IterativeFlow: run(query)
-
-    loop Until complete or max_iterations
-        IterativeFlow->>ThinkingAgent: generate_observations()
-        ThinkingAgent-->>IterativeFlow: observations
-
-        IterativeFlow->>KnowledgeGapAgent: evaluate_gaps()
-        KnowledgeGapAgent-->>IterativeFlow: KnowledgeGapOutput
-
-        alt Research complete
-            IterativeFlow->>WriterAgent: create_final_report()
-            WriterAgent-->>IterativeFlow: final_report
-        else Gaps remain
-            IterativeFlow->>ToolSelector: select_agents(gap)
-            ToolSelector-->>IterativeFlow: AgentSelectionPlan
-
-            IterativeFlow->>ToolExecutor: execute_tool_tasks()
-            ToolExecutor-->>IterativeFlow: ToolAgentOutput[]
-
-            IterativeFlow->>JudgeHandler: assess_evidence()
-            JudgeHandler-->>IterativeFlow: should_continue
-        end
-    end
-
-
-### Deep Research
-
-sequenceDiagram
-    actor User
-    participant GraphOrchestrator
-    participant InputParser
-    participant GraphBuilder
-    participant GraphExecutor
-    participant Agent
-    participant BudgetTracker
-    participant WorkflowState
-
-    User->>GraphOrchestrator: run(query)
-    GraphOrchestrator->>InputParser: detect_research_mode(query)
-    InputParser-->>GraphOrchestrator: mode (iterative/deep)
-    GraphOrchestrator->>GraphBuilder: build_graph(mode)
-    GraphBuilder-->>GraphOrchestrator: ResearchGraph
-    GraphOrchestrator->>WorkflowState: init_workflow_state()
-    GraphOrchestrator->>BudgetTracker: create_budget()
-    GraphOrchestrator->>GraphExecutor: _execute_graph(graph)
-
-    loop For each node in graph
-        GraphExecutor->>Agent: execute_node(agent_node)
-        Agent->>Agent: process_input
-        Agent-->>GraphExecutor: result
-        GraphExecutor->>WorkflowState: update_state(result)
-        GraphExecutor->>BudgetTracker: add_tokens(used)
-        GraphExecutor->>BudgetTracker: check_budget()
-        alt Budget exceeded
-            GraphExecutor->>GraphOrchestrator: emit(error_event)
-        else Continue
-            GraphExecutor->>GraphOrchestrator: emit(progress_event)
-        end
-    end
-
-    GraphOrchestrator->>User: AsyncGenerator[AgentEvent]
-
-### Research Team
-
-Critical Deep Research Agent
-
-## Development
-
-### Run Tests
-
-```bash
-uv run pytest
-```
-
-### Run Checks
-
-```bash
-make check
-```
-
-## Join Us
-
-- The-Obstacle-Is-The-Way
+## About
+
+The [Deep Critical Gradio Hackathon Team](#team) met online in the Alzheimer's Critical Literature Review Group in the Hugging Science initiative. We're building the agent framework we want to use for AI-assisted research, to [turn the vast amounts of clinical data into cures](https://github.com/DeepCritical/GradioDemo).
+
+For this hackathon we're proposing a simple yet powerful Deep Research Agent that iteratively looks for the answer until it finds it, using general-purpose web search and special-purpose retrievers for technical sources.
+
+## Deep Critical in the Media
+
+- Social media posts about Deep Critical:
+  -
+  -
+  -
+  -
+  -
+  -
+  -
+
+## Important information
+
+- **[readme](.github/README.md)**: Configure, deploy, contribute, and learn more here.
+- **[docs](docs/index.md)**: Want to know how all this works? Read our detailed technical documentation here.
+- **[demo](https://huggingface.co/spaces/DataQuests/DeepCritical)**: Try our demo on Hugging Face.
+- **[team](#team)**: Join us, or follow us!
+- **[video]**: See our demo video
+
+## Future Developments
+
+- [ ] Apply Deep Research Systems To Generate Short Form Video (up to 5 minutes)
+- [ ] Visualize Pydantic Graphs as Loading Screens in the UI
+- [ ] Improve Data Science with more Complex Graph Agents
+- [ ] Create Deep Critical Drug Repurposing / Discovery Demo
+- [ ] Create Deep Critical Literature Review
+- [ ] Create Deep Critical Hypothesis Generator
+
+## Completed
+
+- [x] **Multi-Source Search**: PubMed, ClinicalTrials.gov, bioRxiv/medRxiv
+- [x] **MCP Integration**: Use our tools from Claude Desktop or any MCP client
+- [x] **HuggingFace OAuth**: Sign in with HuggingFace
+- [x] **Modal Sandbox**: Secure execution of AI-generated statistical code
+- [x] **LlamaIndex RAG**: Semantic search and evidence synthesis
+- [x] **HuggingfaceInference**:
+- [x] **HuggingfaceMCP Custom Config To Use Community Tools**:
+- [x] **Strongly Typed Composable Graphs**:
+- [x] **Specialized Research Teams of Agents**:
+
+
+
+### Team
+
+- ZJ
 - MarioAderman
 - Josephrp
+
+## Acknowledgements
+
+- McSwaggins
+- Magentic
+- Huggingface
+- Gradio
+- DeepCritical
+- Sponsors
+- Microsoft
+- Pydantic
+- Llama-index
+- Anthropic/MCP
+- List of Tools Makers
+
+
 ## Links
-- [GitHub Repository](https://github.com/The-Obstacle-Is-The-Way/DeepCritical-1)
\ No newline at end of file
+[![GitHub](https://img.shields.io/github/stars/DeepCritical/GradioDemo?style=for-the-badge&logo=github&logoColor=white&label=🐙%20GitHub&labelColor=181717&color=181717)](https://github.com/DeepCritical/GradioDemo)
+[![Documentation](https://img.shields.io/badge/📚%20Docs-0080FF?style=for-the-badge&logo=readthedocs&logoColor=white&labelColor=0080FF&color=0080FF)](docs/index.md)
+[![Demo](https://img.shields.io/badge/🚀%20Demo-FFD21E?style=for-the-badge&logo=huggingface&logoColor=white&labelColor=FFD21E&color=FFD21E)](https://huggingface.co/spaces/DataQuests/DeepCritical)
+[![CodeCov](https://img.shields.io/badge/📊%20Coverage-F01F7A?style=for-the-badge&logo=codecov&logoColor=white&labelColor=F01F7A&color=F01F7A)](https://codecov.io/gh/DeepCritical/GradioDemo)
+[![Join us on Discord](https://img.shields.io/discord/1109943800132010065?label=Discord&logo=discord&style=flat-square)](https://discord.gg/qdfnvSPcqP)
\ No newline at end of file
diff --git a/.cursorrules b/dev/.cursorrules
similarity index 99%
rename from .cursorrules
rename to dev/.cursorrules
index 8fbe6def025d95d15c47f657eafbbbf0643a5ca5..1f295e800902da3888e751d3b615b39c75aa2f19 100644
--- a/.cursorrules
+++ b/dev/.cursorrules
@@ -238,3 +238,4 @@
+
diff --git a/AGENTS.txt b/dev/AGENTS.txt
similarity index 100%
rename from AGENTS.txt
rename to dev/AGENTS.txt
diff --git a/dev/Makefile b/dev/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..185a214d84f7fd56284179c298c26790e1f938c8
--- /dev/null
+++ b/dev/Makefile
@@ -0,0 +1,51 @@
+.PHONY: install test lint format typecheck check clean all cov cov-html
+
+# Default target
+all: check
+
+install:
+	uv sync --all-extras
+	uv run pre-commit install
+
+test:
+	uv run pytest tests/unit/ -v -m "not openai" -p no:logfire
+
+test-hf:
+	uv run pytest tests/ -v -m "huggingface" -p no:logfire
+
+test-all:
+	uv run pytest tests/ -v -p no:logfire
+
+# Coverage aliases
+cov: test-cov
+test-cov:
+	uv run pytest --cov=src --cov-report=term-missing -m "not openai" -p no:logfire
+
+cov-html:
+	uv run pytest --cov=src --cov-report=html -p no:logfire
+	@echo "Coverage report: open htmlcov/index.html"
+
+lint:
+	uv run ruff check src tests
+
+format:
+	uv run ruff format src
tests + +typecheck: + uv run mypy src + +check: lint typecheck test-cov + @echo "All checks passed!" + +docs-build: + uv run mkdocs build + +docs-serve: + uv run mkdocs serve + +docs-clean: + rm -rf site/ + +clean: + rm -rf .pytest_cache .mypy_cache .ruff_cache __pycache__ .coverage htmlcov + find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true diff --git a/dev/docs_plugins.py b/dev/docs_plugins.py new file mode 100644 index 0000000000000000000000000000000000000000..9fe1ed9c64756aed732e5ede0706f0c5b93bf44c --- /dev/null +++ b/dev/docs_plugins.py @@ -0,0 +1,74 @@ +"""Custom MkDocs extension to handle code anchor format: ```start:end:filepath""" + +import re +from pathlib import Path + +from markdown import Markdown +from markdown.extensions import Extension +from markdown.preprocessors import Preprocessor + + +class CodeAnchorPreprocessor(Preprocessor): + """Preprocess code blocks with anchor format: ```start:end:filepath""" + + def __init__(self, md: Markdown, base_path: Path): + super().__init__(md) + self.base_path = base_path + self.pattern = re.compile(r"^```(\d+):(\d+):([^\n]+)\n(.*?)```$", re.MULTILINE | re.DOTALL) + + def run(self, lines: list[str]) -> list[str]: + """Process lines and convert code anchor format to standard code blocks.""" + text = "\n".join(lines) + new_text = self.pattern.sub(self._replace_code_anchor, text) + return new_text.split("\n") + + def _replace_code_anchor(self, match) -> str: + """Replace code anchor format with standard code block + link.""" + start_line = int(match.group(1)) + end_line = int(match.group(2)) + file_path = match.group(3).strip() + existing_code = match.group(4) + + # Determine language from file extension + ext = Path(file_path).suffix.lower() + lang_map = { + ".py": "python", + ".js": "javascript", + ".ts": "typescript", + ".md": "markdown", + ".yaml": "yaml", + ".yml": "yaml", + ".toml": "toml", + ".json": "json", + ".html": "html", + ".css": "css", + ".sh": "bash", + } + language = lang_map.get(ext, "python") + + # Generate GitHub link + repo_url = "https://github.com/DeepCritical/GradioDemo" + github_link = f"{repo_url}/blob/main/{file_path}#L{start_line}-L{end_line}" + + # Return standard code block with source link + return ( + f'[View source: `{file_path}` (lines {start_line}-{end_line})]({github_link}){{: target="_blank" }}\n\n' + f"```{language}\n{existing_code}\n```" + ) + + +class CodeAnchorExtension(Extension): + """Markdown extension for code anchors.""" + + def __init__(self, base_path: str = ".", **kwargs): + super().__init__(**kwargs) + self.base_path = Path(base_path) + + def extendMarkdown(self, md: Markdown): # noqa: N802 + """Register the preprocessor.""" + md.preprocessors.register(CodeAnchorPreprocessor(md, self.base_path), "codeanchor", 25) + + +def makeExtension(**kwargs): # noqa: N802 + """Create the extension.""" + return CodeAnchorExtension(**kwargs) diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md deleted file mode 100644 index 50d73b26220673abfc9f2b94368233454d49c0a3..0000000000000000000000000000000000000000 --- a/docs/CONFIGURATION.md +++ /dev/null @@ -1,301 +0,0 @@ -# Configuration Guide - -## Overview - -DeepCritical uses **Pydantic Settings** for centralized configuration management. All settings are defined in `src/utils/config.py` and can be configured via environment variables or a `.env` file. - -## Quick Start - -1. Copy the example environment file (if available) or create a `.env` file in the project root -2. 
Set at least one LLM API key (`OPENAI_API_KEY` or `ANTHROPIC_API_KEY`) -3. Optionally configure other services as needed - -## Configuration System - -### How It Works - -- **Settings Class**: `Settings` class in `src/utils/config.py` extends `BaseSettings` from `pydantic_settings` -- **Environment File**: Automatically loads from `.env` file (if present) -- **Environment Variables**: Reads from environment variables (case-insensitive) -- **Type Safety**: Strongly-typed fields with validation -- **Singleton Pattern**: Global `settings` instance for easy access - -### Usage - -```python -from src.utils.config import settings - -# Check if API keys are available -if settings.has_openai_key: - # Use OpenAI - pass - -# Access configuration values -max_iterations = settings.max_iterations -web_search_provider = settings.web_search_provider -``` - -## Required Configuration - -### At Least One LLM Provider - -You must configure at least one LLM provider: - -**OpenAI:** -```bash -LLM_PROVIDER=openai -OPENAI_API_KEY=your_openai_api_key_here -OPENAI_MODEL=gpt-5.1 -``` - -**Anthropic:** -```bash -LLM_PROVIDER=anthropic -ANTHROPIC_API_KEY=your_anthropic_api_key_here -ANTHROPIC_MODEL=claude-sonnet-4-5-20250929 -``` - -## Optional Configuration - -### Embedding Configuration - -```bash -# Embedding Provider: "openai", "local", or "huggingface" -EMBEDDING_PROVIDER=local - -# OpenAI Embedding Model (used by LlamaIndex RAG) -OPENAI_EMBEDDING_MODEL=text-embedding-3-small - -# Local Embedding Model (sentence-transformers) -LOCAL_EMBEDDING_MODEL=all-MiniLM-L6-v2 - -# HuggingFace Embedding Model -HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 -``` - -### HuggingFace Configuration - -```bash -# HuggingFace API Token (for inference API) -HUGGINGFACE_API_KEY=your_huggingface_api_key_here -# Or use HF_TOKEN (alternative name) - -# Default HuggingFace Model ID -HUGGINGFACE_MODEL=meta-llama/Llama-3.1-8B-Instruct -``` - -### Web Search Configuration - -```bash -# Web Search Provider: "serper", "searchxng", "brave", "tavily", or "duckduckgo" -# Default: "duckduckgo" (no API key required) -WEB_SEARCH_PROVIDER=duckduckgo - -# Serper API Key (for Google search via Serper) -SERPER_API_KEY=your_serper_api_key_here - -# SearchXNG Host URL -SEARCHXNG_HOST=http://localhost:8080 - -# Brave Search API Key -BRAVE_API_KEY=your_brave_api_key_here - -# Tavily API Key -TAVILY_API_KEY=your_tavily_api_key_here -``` - -### PubMed Configuration - -```bash -# NCBI API Key (optional, for higher rate limits: 10 req/sec vs 3 req/sec) -NCBI_API_KEY=your_ncbi_api_key_here -``` - -### Agent Configuration - -```bash -# Maximum iterations per research loop -MAX_ITERATIONS=10 - -# Search timeout in seconds -SEARCH_TIMEOUT=30 - -# Use graph-based execution for research flows -USE_GRAPH_EXECUTION=false -``` - -### Budget & Rate Limiting Configuration - -```bash -# Default token budget per research loop -DEFAULT_TOKEN_LIMIT=100000 - -# Default time limit per research loop (minutes) -DEFAULT_TIME_LIMIT_MINUTES=10 - -# Default iterations limit per research loop -DEFAULT_ITERATIONS_LIMIT=10 -``` - -### RAG Service Configuration - -```bash -# ChromaDB collection name for RAG -RAG_COLLECTION_NAME=deepcritical_evidence - -# Number of top results to retrieve from RAG -RAG_SIMILARITY_TOP_K=5 - -# Automatically ingest evidence into RAG -RAG_AUTO_INGEST=true -``` - -### ChromaDB Configuration - -```bash -# ChromaDB storage path -CHROMA_DB_PATH=./chroma_db - -# Whether to persist ChromaDB to disk -CHROMA_DB_PERSIST=true - -# ChromaDB 
server host (for remote ChromaDB, optional) -# CHROMA_DB_HOST=localhost - -# ChromaDB server port (for remote ChromaDB, optional) -# CHROMA_DB_PORT=8000 -``` - -### External Services - -```bash -# Modal Token ID (for Modal sandbox execution) -MODAL_TOKEN_ID=your_modal_token_id_here - -# Modal Token Secret -MODAL_TOKEN_SECRET=your_modal_token_secret_here -``` - -### Logging Configuration - -```bash -# Log Level: "DEBUG", "INFO", "WARNING", or "ERROR" -LOG_LEVEL=INFO -``` - -## Configuration Properties - -The `Settings` class provides helpful properties for checking configuration: - -```python -from src.utils.config import settings - -# Check API key availability -settings.has_openai_key # bool -settings.has_anthropic_key # bool -settings.has_huggingface_key # bool -settings.has_any_llm_key # bool - -# Check service availability -settings.modal_available # bool -settings.web_search_available # bool -``` - -## Environment Variables Reference - -### Required (at least one LLM) -- `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` - At least one LLM provider key - -### Optional LLM Providers -- `DEEPSEEK_API_KEY` (Phase 2) -- `OPENROUTER_API_KEY` (Phase 2) -- `GEMINI_API_KEY` (Phase 2) -- `PERPLEXITY_API_KEY` (Phase 2) -- `HUGGINGFACE_API_KEY` or `HF_TOKEN` -- `AZURE_OPENAI_ENDPOINT` (Phase 2) -- `AZURE_OPENAI_DEPLOYMENT` (Phase 2) -- `AZURE_OPENAI_API_KEY` (Phase 2) -- `AZURE_OPENAI_API_VERSION` (Phase 2) -- `LOCAL_MODEL_URL` (Phase 2) - -### Web Search -- `WEB_SEARCH_PROVIDER` (default: "duckduckgo") -- `SERPER_API_KEY` -- `SEARCHXNG_HOST` -- `BRAVE_API_KEY` -- `TAVILY_API_KEY` - -### Embeddings -- `EMBEDDING_PROVIDER` (default: "local") -- `HUGGINGFACE_EMBEDDING_MODEL` (optional) - -### RAG -- `RAG_COLLECTION_NAME` (default: "deepcritical_evidence") -- `RAG_SIMILARITY_TOP_K` (default: 5) -- `RAG_AUTO_INGEST` (default: true) - -### ChromaDB -- `CHROMA_DB_PATH` (default: "./chroma_db") -- `CHROMA_DB_PERSIST` (default: true) -- `CHROMA_DB_HOST` (optional) -- `CHROMA_DB_PORT` (optional) - -### Budget -- `DEFAULT_TOKEN_LIMIT` (default: 100000) -- `DEFAULT_TIME_LIMIT_MINUTES` (default: 10) -- `DEFAULT_ITERATIONS_LIMIT` (default: 10) - -### Other -- `LLM_PROVIDER` (default: "openai") -- `NCBI_API_KEY` (optional) -- `MODAL_TOKEN_ID` (optional) -- `MODAL_TOKEN_SECRET` (optional) -- `MAX_ITERATIONS` (default: 10) -- `LOG_LEVEL` (default: "INFO") -- `USE_GRAPH_EXECUTION` (default: false) - -## Validation - -Settings are validated on load using Pydantic validation: - -- **Type checking**: All fields are strongly typed -- **Range validation**: Numeric fields have min/max constraints -- **Literal validation**: Enum fields only accept specific values -- **Required fields**: API keys are checked when accessed via `get_api_key()` - -## Error Handling - -Configuration errors raise `ConfigurationError`: - -```python -from src.utils.config import settings -from src.utils.exceptions import ConfigurationError - -try: - api_key = settings.get_api_key() -except ConfigurationError as e: - print(f"Configuration error: {e}") -``` - -## Future Enhancements (Phase 2) - -The following configurations are planned for Phase 2: - -1. **Additional LLM Providers**: DeepSeek, OpenRouter, Gemini, Perplexity, Azure OpenAI, Local models -2. **Model Selection**: Reasoning/main/fast model configuration -3. **Service Integration**: Migrate `folder/llm_config.py` to centralized config - -See `CONFIGURATION_ANALYSIS.md` for the complete implementation plan. 
- - - - - - - - - - - - - diff --git a/docs/api/agents.md b/docs/api/agents.md new file mode 100644 index 0000000000000000000000000000000000000000..f0606a16bb266ec744025828cad6fafcad177592 --- /dev/null +++ b/docs/api/agents.md @@ -0,0 +1,259 @@ +# Agents API Reference + +This page documents the API for DeepCritical agents. + +## KnowledgeGapAgent + +**Module**: `src.agents.knowledge_gap` + +**Purpose**: Evaluates research state and identifies knowledge gaps. + +### Methods + +#### `evaluate` + +```python +async def evaluate( + self, + query: str, + background_context: str, + conversation_history: Conversation, + iteration: int, + time_elapsed_minutes: float, + max_time_minutes: float +) -> KnowledgeGapOutput +``` + +Evaluates research completeness and identifies outstanding knowledge gaps. + +**Parameters**: +- `query`: Research query string +- `background_context`: Background context for the query +- `conversation_history`: Conversation history with previous iterations +- `iteration`: Current iteration number +- `time_elapsed_minutes`: Elapsed time in minutes +- `max_time_minutes`: Maximum time limit in minutes + +**Returns**: `KnowledgeGapOutput` with: +- `research_complete`: Boolean indicating if research is complete +- `outstanding_gaps`: List of remaining knowledge gaps + +## ToolSelectorAgent + +**Module**: `src.agents.tool_selector` + +**Purpose**: Selects appropriate tools for addressing knowledge gaps. + +### Methods + +#### `select_tools` + +```python +async def select_tools( + self, + query: str, + knowledge_gaps: list[str], + available_tools: list[str] +) -> AgentSelectionPlan +``` + +Selects tools for addressing knowledge gaps. + +**Parameters**: +- `query`: Research query string +- `knowledge_gaps`: List of knowledge gaps to address +- `available_tools`: List of available tool names + +**Returns**: `AgentSelectionPlan` with list of `AgentTask` objects. + +## WriterAgent + +**Module**: `src.agents.writer` + +**Purpose**: Generates final reports from research findings. + +### Methods + +#### `write_report` + +```python +async def write_report( + self, + query: str, + findings: str, + output_length: str = "medium", + output_instructions: str | None = None +) -> str +``` + +Generates a markdown report from research findings. + +**Parameters**: +- `query`: Research query string +- `findings`: Research findings to include in report +- `output_length`: Desired output length ("short", "medium", "long") +- `output_instructions`: Additional instructions for report generation + +**Returns**: Markdown string with numbered citations. + +## LongWriterAgent + +**Module**: `src.agents.long_writer` + +**Purpose**: Long-form report generation with section-by-section writing. + +### Methods + +#### `write_next_section` + +```python +async def write_next_section( + self, + query: str, + draft: ReportDraft, + section_title: str, + section_content: str +) -> LongWriterOutput +``` + +Writes the next section of a long-form report. + +**Parameters**: +- `query`: Research query string +- `draft`: Current report draft +- `section_title`: Title of the section to write +- `section_content`: Content/guidance for the section + +**Returns**: `LongWriterOutput` with updated draft. + +#### `write_report` + +```python +async def write_report( + self, + query: str, + report_title: str, + report_draft: ReportDraft +) -> str +``` + +Generates final report from draft. 
+ +**Parameters**: +- `query`: Research query string +- `report_title`: Title of the report +- `report_draft`: Complete report draft + +**Returns**: Final markdown report string. + +## ProofreaderAgent + +**Module**: `src.agents.proofreader` + +**Purpose**: Proofreads and polishes report drafts. + +### Methods + +#### `proofread` + +```python +async def proofread( + self, + query: str, + report_title: str, + report_draft: ReportDraft +) -> str +``` + +Proofreads and polishes a report draft. + +**Parameters**: +- `query`: Research query string +- `report_title`: Title of the report +- `report_draft`: Report draft to proofread + +**Returns**: Polished markdown string. + +## ThinkingAgent + +**Module**: `src.agents.thinking` + +**Purpose**: Generates observations from conversation history. + +### Methods + +#### `generate_observations` + +```python +async def generate_observations( + self, + query: str, + background_context: str, + conversation_history: Conversation +) -> str +``` + +Generates observations from conversation history. + +**Parameters**: +- `query`: Research query string +- `background_context`: Background context +- `conversation_history`: Conversation history + +**Returns**: Observation string. + +## InputParserAgent + +**Module**: `src.agents.input_parser` + +**Purpose**: Parses and improves user queries, detects research mode. + +### Methods + +#### `parse_query` + +```python +async def parse_query( + self, + query: str +) -> ParsedQuery +``` + +Parses and improves a user query. + +**Parameters**: +- `query`: Original query string + +**Returns**: `ParsedQuery` with: +- `original_query`: Original query string +- `improved_query`: Refined query string +- `research_mode`: "iterative" or "deep" +- `key_entities`: List of key entities +- `research_questions`: List of research questions + +## Factory Functions + +All agents have factory functions in `src.agent_factory.agents`: + +```python +def create_knowledge_gap_agent(model: Any | None = None) -> KnowledgeGapAgent +def create_tool_selector_agent(model: Any | None = None) -> ToolSelectorAgent +def create_writer_agent(model: Any | None = None) -> WriterAgent +def create_long_writer_agent(model: Any | None = None) -> LongWriterAgent +def create_proofreader_agent(model: Any | None = None) -> ProofreaderAgent +def create_thinking_agent(model: Any | None = None) -> ThinkingAgent +def create_input_parser_agent(model: Any | None = None) -> InputParserAgent +``` + +**Parameters**: +- `model`: Optional Pydantic AI model. If None, uses `get_model()` from settings. + +**Returns**: Agent instance. + +## See Also + +- [Architecture - Agents](../architecture/agents.md) - Architecture overview +- [Models API](models.md) - Data models used by agents + + + diff --git a/docs/api/models.md b/docs/api/models.md new file mode 100644 index 0000000000000000000000000000000000000000..c945e337bfb6e6c1eb5c626cae9bff79451384e4 --- /dev/null +++ b/docs/api/models.md @@ -0,0 +1,237 @@ +# Models API Reference + +This page documents the Pydantic models used throughout DeepCritical. + +## Evidence + +**Module**: `src.utils.models` + +**Purpose**: Represents evidence from search results. 
+ +```python +class Evidence(BaseModel): + citation: Citation + content: str + relevance_score: float = Field(ge=0.0, le=1.0) + metadata: dict[str, Any] = Field(default_factory=dict) +``` + +**Fields**: +- `citation`: Citation information (title, URL, date, authors) +- `content`: Evidence text content +- `relevance_score`: Relevance score (0.0-1.0) +- `metadata`: Additional metadata dictionary + +## Citation + +**Module**: `src.utils.models` + +**Purpose**: Citation information for evidence. + +```python +class Citation(BaseModel): + title: str + url: str + date: str | None = None + authors: list[str] = Field(default_factory=list) +``` + +**Fields**: +- `title`: Article/trial title +- `url`: Source URL +- `date`: Publication date (optional) +- `authors`: List of authors (optional) + +## KnowledgeGapOutput + +**Module**: `src.utils.models` + +**Purpose**: Output from knowledge gap evaluation. + +```python +class KnowledgeGapOutput(BaseModel): + research_complete: bool + outstanding_gaps: list[str] = Field(default_factory=list) +``` + +**Fields**: +- `research_complete`: Boolean indicating if research is complete +- `outstanding_gaps`: List of remaining knowledge gaps + +## AgentSelectionPlan + +**Module**: `src.utils.models` + +**Purpose**: Plan for tool/agent selection. + +```python +class AgentSelectionPlan(BaseModel): + tasks: list[AgentTask] = Field(default_factory=list) +``` + +**Fields**: +- `tasks`: List of agent tasks to execute + +## AgentTask + +**Module**: `src.utils.models` + +**Purpose**: Individual agent task. + +```python +class AgentTask(BaseModel): + agent_name: str + query: str + context: dict[str, Any] = Field(default_factory=dict) +``` + +**Fields**: +- `agent_name`: Name of agent to use +- `query`: Task query +- `context`: Additional context dictionary + +## ReportDraft + +**Module**: `src.utils.models` + +**Purpose**: Draft structure for long-form reports. + +```python +class ReportDraft(BaseModel): + title: str + sections: list[ReportSection] = Field(default_factory=list) + references: list[Citation] = Field(default_factory=list) +``` + +**Fields**: +- `title`: Report title +- `sections`: List of report sections +- `references`: List of citations + +## ReportSection + +**Module**: `src.utils.models` + +**Purpose**: Individual section in a report draft. + +```python +class ReportSection(BaseModel): + title: str + content: str + order: int +``` + +**Fields**: +- `title`: Section title +- `content`: Section content +- `order`: Section order number + +## ParsedQuery + +**Module**: `src.utils.models` + +**Purpose**: Parsed and improved query. + +```python +class ParsedQuery(BaseModel): + original_query: str + improved_query: str + research_mode: Literal["iterative", "deep"] + key_entities: list[str] = Field(default_factory=list) + research_questions: list[str] = Field(default_factory=list) +``` + +**Fields**: +- `original_query`: Original query string +- `improved_query`: Refined query string +- `research_mode`: Research mode ("iterative" or "deep") +- `key_entities`: List of key entities +- `research_questions`: List of research questions + +## Conversation + +**Module**: `src.utils.models` + +**Purpose**: Conversation history with iterations. + +```python +class Conversation(BaseModel): + iterations: list[IterationData] = Field(default_factory=list) +``` + +**Fields**: +- `iterations`: List of iteration data + +## IterationData + +**Module**: `src.utils.models` + +**Purpose**: Data for a single iteration. 
+ +```python +class IterationData(BaseModel): + iteration: int + observations: str | None = None + knowledge_gaps: list[str] = Field(default_factory=list) + tool_calls: list[dict[str, Any]] = Field(default_factory=list) + findings: str | None = None + thoughts: str | None = None +``` + +**Fields**: +- `iteration`: Iteration number +- `observations`: Generated observations +- `knowledge_gaps`: Identified knowledge gaps +- `tool_calls`: Tool calls made +- `findings`: Findings from tools +- `thoughts`: Agent thoughts + +## AgentEvent + +**Module**: `src.utils.models` + +**Purpose**: Event emitted during research execution. + +```python +class AgentEvent(BaseModel): + type: str + iteration: int | None = None + data: dict[str, Any] = Field(default_factory=dict) +``` + +**Fields**: +- `type`: Event type (e.g., "started", "search_complete", "complete") +- `iteration`: Iteration number (optional) +- `data`: Event data dictionary + +## BudgetStatus + +**Module**: `src.utils.models` + +**Purpose**: Current budget status. + +```python +class BudgetStatus(BaseModel): + tokens_used: int + tokens_limit: int + time_elapsed_seconds: float + time_limit_seconds: float + iterations: int + iterations_limit: int +``` + +**Fields**: +- `tokens_used`: Tokens used so far +- `tokens_limit`: Token limit +- `time_elapsed_seconds`: Elapsed time in seconds +- `time_limit_seconds`: Time limit in seconds +- `iterations`: Current iteration count +- `iterations_limit`: Iteration limit + +## See Also + +- [Architecture - Agents](../architecture/agents.md) - How models are used +- [Configuration](../configuration/index.md) - Model configuration + + + diff --git a/docs/api/orchestrators.md b/docs/api/orchestrators.md new file mode 100644 index 0000000000000000000000000000000000000000..4cb466562eb90a0d6eafc0327f6773c77f55ddc5 --- /dev/null +++ b/docs/api/orchestrators.md @@ -0,0 +1,184 @@ +# Orchestrators API Reference + +This page documents the API for DeepCritical orchestrators. + +## IterativeResearchFlow + +**Module**: `src.orchestrator.research_flow` + +**Purpose**: Single-loop research with search-judge-synthesize cycles. + +### Methods + +#### `run` + +```python +async def run( + self, + query: str, + background_context: str = "", + max_iterations: int | None = None, + max_time_minutes: float | None = None, + token_budget: int | None = None +) -> AsyncGenerator[AgentEvent, None] +``` + +Runs iterative research flow. + +**Parameters**: +- `query`: Research query string +- `background_context`: Background context (default: "") +- `max_iterations`: Maximum iterations (default: from settings) +- `max_time_minutes`: Maximum time in minutes (default: from settings) +- `token_budget`: Token budget (default: from settings) + +**Yields**: `AgentEvent` objects for: +- `started`: Research started +- `search_complete`: Search completed +- `judge_complete`: Evidence evaluation completed +- `synthesizing`: Generating report +- `complete`: Research completed +- `error`: Error occurred + +## DeepResearchFlow + +**Module**: `src.orchestrator.research_flow` + +**Purpose**: Multi-section parallel research with planning and synthesis. + +### Methods + +#### `run` + +```python +async def run( + self, + query: str, + background_context: str = "", + max_iterations_per_section: int | None = None, + max_time_minutes: float | None = None, + token_budget: int | None = None +) -> AsyncGenerator[AgentEvent, None] +``` + +Runs deep research flow. 
+ +**Parameters**: +- `query`: Research query string +- `background_context`: Background context (default: "") +- `max_iterations_per_section`: Maximum iterations per section (default: from settings) +- `max_time_minutes`: Maximum time in minutes (default: from settings) +- `token_budget`: Token budget (default: from settings) + +**Yields**: `AgentEvent` objects for: +- `started`: Research started +- `planning`: Creating research plan +- `looping`: Running parallel research loops +- `synthesizing`: Synthesizing results +- `complete`: Research completed +- `error`: Error occurred + +## GraphOrchestrator + +**Module**: `src.orchestrator.graph_orchestrator` + +**Purpose**: Graph-based execution using Pydantic AI agents as nodes. + +### Methods + +#### `run` + +```python +async def run( + self, + query: str, + research_mode: str = "auto", + use_graph: bool = True +) -> AsyncGenerator[AgentEvent, None] +``` + +Runs graph-based research orchestration. + +**Parameters**: +- `query`: Research query string +- `research_mode`: Research mode ("iterative", "deep", or "auto") +- `use_graph`: Whether to use graph execution (default: True) + +**Yields**: `AgentEvent` objects during graph execution. + +## Orchestrator Factory + +**Module**: `src.orchestrator_factory` + +**Purpose**: Factory for creating orchestrators. + +### Functions + +#### `create_orchestrator` + +```python +def create_orchestrator( + search_handler: SearchHandlerProtocol, + judge_handler: JudgeHandlerProtocol, + config: dict[str, Any], + mode: str | None = None +) -> Any +``` + +Creates an orchestrator instance. + +**Parameters**: +- `search_handler`: Search handler protocol implementation +- `judge_handler`: Judge handler protocol implementation +- `config`: Configuration dictionary +- `mode`: Orchestrator mode ("simple", "advanced", "magentic", or None for auto-detect) + +**Returns**: Orchestrator instance. + +**Raises**: +- `ValueError`: If requirements not met + +**Modes**: +- `"simple"`: Legacy orchestrator +- `"advanced"` or `"magentic"`: Magentic orchestrator (requires OpenAI API key) +- `None`: Auto-detect based on API key availability + +## MagenticOrchestrator + +**Module**: `src.orchestrator_magentic` + +**Purpose**: Multi-agent coordination using Microsoft Agent Framework. + +### Methods + +#### `run` + +```python +async def run( + self, + query: str, + max_rounds: int = 15, + max_stalls: int = 3 +) -> AsyncGenerator[AgentEvent, None] +``` + +Runs Magentic orchestration. + +**Parameters**: +- `query`: Research query string +- `max_rounds`: Maximum rounds (default: 15) +- `max_stalls`: Maximum stalls before reset (default: 3) + +**Yields**: `AgentEvent` objects converted from Magentic events. + +**Requirements**: +- `agent-framework-core` package +- OpenAI API key + +## See Also + +- [Architecture - Orchestrators](../architecture/orchestrators.md) - Architecture overview +- [Graph Orchestration](../architecture/graph-orchestration.md) - Graph execution details + + + diff --git a/docs/api/services.md b/docs/api/services.md new file mode 100644 index 0000000000000000000000000000000000000000..5104eb4af192e9b01d8c791d1c1148f24c2497ed --- /dev/null +++ b/docs/api/services.md @@ -0,0 +1,190 @@ +# Services API Reference + +This page documents the API for DeepCritical services. + +## EmbeddingService + +**Module**: `src.services.embeddings` + +**Purpose**: Local sentence-transformers for semantic search and deduplication. 
+ +### Methods + +#### `embed` + +```python +async def embed(self, text: str) -> list[float] +``` + +Generates embedding for a text string. + +**Parameters**: +- `text`: Text to embed + +**Returns**: Embedding vector as list of floats. + +#### `embed_batch` + +```python +async def embed_batch(self, texts: list[str]) -> list[list[float]] +``` + +Generates embeddings for multiple texts. + +**Parameters**: +- `texts`: List of texts to embed + +**Returns**: List of embedding vectors. + +#### `similarity` + +```python +async def similarity(self, text1: str, text2: str) -> float +``` + +Calculates similarity between two texts. + +**Parameters**: +- `text1`: First text +- `text2`: Second text + +**Returns**: Similarity score (0.0-1.0). + +#### `find_duplicates` + +```python +async def find_duplicates( + self, + texts: list[str], + threshold: float = 0.85 +) -> list[tuple[int, int]] +``` + +Finds duplicate texts based on similarity threshold. + +**Parameters**: +- `texts`: List of texts to check +- `threshold`: Similarity threshold (default: 0.85) + +**Returns**: List of (index1, index2) tuples for duplicate pairs. + +### Factory Function + +#### `get_embedding_service` + +```python +@lru_cache(maxsize=1) +def get_embedding_service() -> EmbeddingService +``` + +Returns singleton EmbeddingService instance. + +## LlamaIndexRAGService + +**Module**: `src.services.rag` + +**Purpose**: Retrieval-Augmented Generation using LlamaIndex. + +### Methods + +#### `ingest_evidence` + +```python +async def ingest_evidence(self, evidence: list[Evidence]) -> None +``` + +Ingests evidence into RAG service. + +**Parameters**: +- `evidence`: List of Evidence objects to ingest + +**Note**: Requires OpenAI API key for embeddings. + +#### `retrieve` + +```python +async def retrieve( + self, + query: str, + top_k: int = 5 +) -> list[Document] +``` + +Retrieves relevant documents for a query. + +**Parameters**: +- `query`: Search query string +- `top_k`: Number of top results to return (default: 5) + +**Returns**: List of Document objects with metadata. + +#### `query` + +```python +async def query( + self, + query: str, + top_k: int = 5 +) -> str +``` + +Queries RAG service and returns formatted results. + +**Parameters**: +- `query`: Search query string +- `top_k`: Number of top results to return (default: 5) + +**Returns**: Formatted query results as string. + +### Factory Function + +#### `get_rag_service` + +```python +@lru_cache(maxsize=1) +def get_rag_service() -> LlamaIndexRAGService | None +``` + +Returns singleton LlamaIndexRAGService instance, or None if OpenAI key not available. + +## StatisticalAnalyzer + +**Module**: `src.services.statistical_analyzer` + +**Purpose**: Secure execution of AI-generated statistical code. + +### Methods + +#### `analyze` + +```python +async def analyze( + self, + hypothesis: str, + evidence: list[Evidence], + data_description: str | None = None +) -> AnalysisResult +``` + +Analyzes a hypothesis using statistical methods. + +**Parameters**: +- `hypothesis`: Hypothesis to analyze +- `evidence`: List of Evidence objects +- `data_description`: Optional data description + +**Returns**: `AnalysisResult` with: +- `verdict`: SUPPORTED, REFUTED, or INCONCLUSIVE +- `code`: Generated analysis code +- `output`: Execution output +- `error`: Error message if execution failed + +**Note**: Requires Modal credentials for sandbox execution. 
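+
+## Usage Example
+
+A minimal, illustrative sketch of how the services documented above might be wired together. The method signatures follow this page; the no-argument `StatisticalAnalyzer()` construction and the example data are assumptions for illustration only, so check the modules in `src/services/` for the exact constructors and required credentials.
+
+```python
+import asyncio
+
+from src.services.embeddings import get_embedding_service
+from src.services.rag import get_rag_service
+from src.services.statistical_analyzer import StatisticalAnalyzer
+from src.utils.models import Citation, Evidence
+
+
+async def main() -> None:
+    # Deduplicate candidate snippets with the local embedding service.
+    embeddings = get_embedding_service()
+    texts = [
+        "Metformin improves insulin sensitivity.",
+        "Metformin increases insulin sensitivity.",
+    ]
+    duplicates = await embeddings.find_duplicates(texts, threshold=0.85)
+    print(f"Duplicate pairs: {duplicates}")
+
+    evidence = [
+        Evidence(
+            citation=Citation(title="Example trial", url="https://example.org/trial"),
+            content="Participants on metformin showed improved insulin sensitivity.",
+            relevance_score=0.9,
+        )
+    ]
+
+    # The RAG service is only available when an OpenAI key is configured.
+    rag = get_rag_service()
+    if rag is not None:
+        await rag.ingest_evidence(evidence)
+        print(await rag.query("insulin sensitivity", top_k=3))
+
+    # Statistical analysis needs Modal credentials; constructor args are assumed here.
+    analyzer = StatisticalAnalyzer()
+    result = await analyzer.analyze(
+        hypothesis="Metformin improves insulin sensitivity",
+        evidence=evidence,
+    )
+    print(result.verdict)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```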
+ +## See Also + +- [Architecture - Services](../architecture/services.md) - Architecture overview +- [Configuration](../configuration/index.md) - Service configuration + + + diff --git a/docs/api/tools.md b/docs/api/tools.md new file mode 100644 index 0000000000000000000000000000000000000000..92293eb2de0a7a54f94a486081b25bfaa814abf9 --- /dev/null +++ b/docs/api/tools.md @@ -0,0 +1,224 @@ +# Tools API Reference + +This page documents the API for DeepCritical search tools. + +## SearchTool Protocol + +All tools implement the `SearchTool` protocol: + +```python +class SearchTool(Protocol): + @property + def name(self) -> str: ... + + async def search( + self, + query: str, + max_results: int = 10 + ) -> list[Evidence]: ... +``` + +## PubMedTool + +**Module**: `src.tools.pubmed` + +**Purpose**: Search peer-reviewed biomedical literature from PubMed. + +### Properties + +#### `name` + +```python +@property +def name(self) -> str +``` + +Returns tool name: `"pubmed"` + +### Methods + +#### `search` + +```python +async def search( + self, + query: str, + max_results: int = 10 +) -> list[Evidence] +``` + +Searches PubMed for articles. + +**Parameters**: +- `query`: Search query string +- `max_results`: Maximum number of results to return (default: 10) + +**Returns**: List of `Evidence` objects with PubMed articles. + +**Raises**: +- `SearchError`: If search fails +- `RateLimitError`: If rate limit is exceeded + +## ClinicalTrialsTool + +**Module**: `src.tools.clinicaltrials` + +**Purpose**: Search ClinicalTrials.gov for interventional studies. + +### Properties + +#### `name` + +```python +@property +def name(self) -> str +``` + +Returns tool name: `"clinicaltrials"` + +### Methods + +#### `search` + +```python +async def search( + self, + query: str, + max_results: int = 10 +) -> list[Evidence] +``` + +Searches ClinicalTrials.gov for trials. + +**Parameters**: +- `query`: Search query string +- `max_results`: Maximum number of results to return (default: 10) + +**Returns**: List of `Evidence` objects with clinical trials. + +**Note**: Only returns interventional studies with status: COMPLETED, ACTIVE_NOT_RECRUITING, RECRUITING, ENROLLING_BY_INVITATION + +**Raises**: +- `SearchError`: If search fails + +## EuropePMCTool + +**Module**: `src.tools.europepmc` + +**Purpose**: Search Europe PMC for preprints and peer-reviewed articles. + +### Properties + +#### `name` + +```python +@property +def name(self) -> str +``` + +Returns tool name: `"europepmc"` + +### Methods + +#### `search` + +```python +async def search( + self, + query: str, + max_results: int = 10 +) -> list[Evidence] +``` + +Searches Europe PMC for articles and preprints. + +**Parameters**: +- `query`: Search query string +- `max_results`: Maximum number of results to return (default: 10) + +**Returns**: List of `Evidence` objects with articles/preprints. + +**Note**: Includes both preprints (marked with `[PREPRINT - Not peer-reviewed]`) and peer-reviewed articles. + +**Raises**: +- `SearchError`: If search fails + +## RAGTool + +**Module**: `src.tools.rag_tool` + +**Purpose**: Semantic search within collected evidence. + +### Properties + +#### `name` + +```python +@property +def name(self) -> str +``` + +Returns tool name: `"rag"` + +### Methods + +#### `search` + +```python +async def search( + self, + query: str, + max_results: int = 10 +) -> list[Evidence] +``` + +Searches collected evidence using semantic similarity. 
+ +**Parameters**: +- `query`: Search query string +- `max_results`: Maximum number of results to return (default: 10) + +**Returns**: List of `Evidence` objects from collected evidence. + +**Note**: Requires evidence to be ingested into RAG service first. + +## SearchHandler + +**Module**: `src.tools.search_handler` + +**Purpose**: Orchestrates parallel searches across multiple tools. + +### Methods + +#### `search` + +```python +async def search( + self, + query: str, + tools: list[SearchTool] | None = None, + max_results_per_tool: int = 10 +) -> SearchResult +``` + +Searches multiple tools in parallel. + +**Parameters**: +- `query`: Search query string +- `tools`: List of tools to use (default: all available tools) +- `max_results_per_tool`: Maximum results per tool (default: 10) + +**Returns**: `SearchResult` with: +- `evidence`: Aggregated list of evidence +- `tool_results`: Results per tool +- `total_count`: Total number of results + +**Note**: Uses `asyncio.gather()` for parallel execution. Handles tool failures gracefully. + +## See Also + +- [Architecture - Tools](../architecture/tools.md) - Architecture overview +- [Models API](models.md) - Data models used by tools + + + diff --git a/docs/architecture/agents.md b/docs/architecture/agents.md new file mode 100644 index 0000000000000000000000000000000000000000..f522ac9b753d5d4456e9fdf4f38c2f7733a40b4e --- /dev/null +++ b/docs/architecture/agents.md @@ -0,0 +1,181 @@ +# Agents Architecture + +DeepCritical uses Pydantic AI agents for all AI-powered operations. All agents follow a consistent pattern and use structured output types. + +## Agent Pattern + +All agents use the Pydantic AI `Agent` class with the following structure: + +- **System Prompt**: Module-level constant with date injection +- **Agent Class**: `__init__(model: Any | None = None)` +- **Main Method**: Async method (e.g., `async def evaluate()`, `async def write_report()`) +- **Factory Function**: `def create_agent_name(model: Any | None = None) -> AgentName` + +## Model Initialization + +Agents use `get_model()` from `src/agent_factory/judges.py` if no model is provided. This supports: + +- OpenAI models +- Anthropic models +- HuggingFace Inference API models + +The model selection is based on the configured `LLM_PROVIDER` in settings. + +## Error Handling + +Agents return fallback values on failure rather than raising exceptions: + +- `KnowledgeGapOutput(research_complete=False, outstanding_gaps=[...])` +- Empty strings for text outputs +- Default structured outputs + +All errors are logged with context using structlog. + +## Input Validation + +All agents validate inputs: + +- Check that queries/inputs are not empty +- Truncate very long inputs with warnings +- Handle None values gracefully + +## Output Types + +Agents use structured output types from `src/utils/models.py`: + +- `KnowledgeGapOutput`: Research completeness evaluation +- `AgentSelectionPlan`: Tool selection plan +- `ReportDraft`: Long-form report structure +- `ParsedQuery`: Query parsing and mode detection + +For text output (writer agents), agents return `str` directly. + +## Agent Types + +### Knowledge Gap Agent + +**File**: `src/agents/knowledge_gap.py` + +**Purpose**: Evaluates research state and identifies knowledge gaps. 
+ +**Output**: `KnowledgeGapOutput` with: +- `research_complete`: Boolean indicating if research is complete +- `outstanding_gaps`: List of remaining knowledge gaps + +**Methods**: +- `async def evaluate(query, background_context, conversation_history, iteration, time_elapsed_minutes, max_time_minutes) -> KnowledgeGapOutput` + +### Tool Selector Agent + +**File**: `src/agents/tool_selector.py` + +**Purpose**: Selects appropriate tools for addressing knowledge gaps. + +**Output**: `AgentSelectionPlan` with list of `AgentTask` objects. + +**Available Agents**: +- `WebSearchAgent`: General web search for fresh information +- `SiteCrawlerAgent`: Research specific entities/companies +- `RAGAgent`: Semantic search within collected evidence + +### Writer Agent + +**File**: `src/agents/writer.py` + +**Purpose**: Generates final reports from research findings. + +**Output**: Markdown string with numbered citations. + +**Methods**: +- `async def write_report(query, findings, output_length, output_instructions) -> str` + +**Features**: +- Validates inputs +- Truncates very long findings (max 50000 chars) with warning +- Retry logic for transient failures (3 retries) +- Citation validation before returning + +### Long Writer Agent + +**File**: `src/agents/long_writer.py` + +**Purpose**: Long-form report generation with section-by-section writing. + +**Input/Output**: Uses `ReportDraft` models. + +**Methods**: +- `async def write_next_section(query, draft, section_title, section_content) -> LongWriterOutput` +- `async def write_report(query, report_title, report_draft) -> str` + +**Features**: +- Writes sections iteratively +- Aggregates references across sections +- Reformats section headings and references +- Deduplicates and renumbers references + +### Proofreader Agent + +**File**: `src/agents/proofreader.py` + +**Purpose**: Proofreads and polishes report drafts. + +**Input**: `ReportDraft` +**Output**: Polished markdown string + +**Methods**: +- `async def proofread(query, report_title, report_draft) -> str` + +**Features**: +- Removes duplicate content across sections +- Adds executive summary if multiple sections +- Preserves all references and citations +- Improves flow and readability + +### Thinking Agent + +**File**: `src/agents/thinking.py` + +**Purpose**: Generates observations from conversation history. + +**Output**: Observation string + +**Methods**: +- `async def generate_observations(query, background_context, conversation_history) -> str` + +### Input Parser Agent + +**File**: `src/agents/input_parser.py` + +**Purpose**: Parses and improves user queries, detects research mode. + +**Output**: `ParsedQuery` with: +- `original_query`: Original query string +- `improved_query`: Refined query string +- `research_mode`: "iterative" or "deep" +- `key_entities`: List of key entities +- `research_questions`: List of research questions + +## Factory Functions + +All agents have factory functions in `src/agent_factory/agents.py`: + +```python +def create_knowledge_gap_agent(model: Any | None = None) -> KnowledgeGapAgent +def create_tool_selector_agent(model: Any | None = None) -> ToolSelectorAgent +def create_writer_agent(model: Any | None = None) -> WriterAgent +# ... 
etc +``` + +Factory functions: +- Use `get_model()` if no model provided +- Raise `ConfigurationError` if creation fails +- Log agent creation + +## See Also + +- [Orchestrators](orchestrators.md) - How agents are orchestrated +- [API Reference - Agents](../api/agents.md) - API documentation +- [Contributing - Code Style](../contributing/code-style.md) - Development guidelines + + + diff --git a/docs/architecture/design-patterns.md b/docs/architecture/design-patterns.md deleted file mode 100644 index 3fff9a0ce1dc7be118c9b328ee06c43f3445c3a6..0000000000000000000000000000000000000000 --- a/docs/architecture/design-patterns.md +++ /dev/null @@ -1,1509 +0,0 @@ -# Design Patterns & Technical Decisions -## Explicit Answers to Architecture Questions - ---- - -## Purpose of This Document - -This document explicitly answers all the "design pattern" questions raised in team discussions. It provides clear technical decisions with rationale. - ---- - -## 1. Primary Architecture Pattern - -### Decision: Orchestrator with Search-Judge Loop - -**Pattern Name**: Iterative Research Orchestrator - -**Structure**: -``` -┌─────────────────────────────────────┐ -│ Research Orchestrator │ -│ ┌───────────────────────────────┐ │ -│ │ Search Strategy Planner │ │ -│ └───────────────────────────────┘ │ -│ ↓ │ -│ ┌───────────────────────────────┐ │ -│ │ Tool Coordinator │ │ -│ │ - PubMed Search │ │ -│ │ - Web Search │ │ -│ │ - Clinical Trials │ │ -│ └───────────────────────────────┘ │ -│ ↓ │ -│ ┌───────────────────────────────┐ │ -│ │ Evidence Aggregator │ │ -│ └───────────────────────────────┘ │ -│ ↓ │ -│ ┌───────────────────────────────┐ │ -│ │ Quality Judge │ │ -│ │ (LLM-based assessment) │ │ -│ └───────────────────────────────┘ │ -│ ↓ │ -│ Loop or Synthesize? │ -│ ↓ │ -│ ┌───────────────────────────────┐ │ -│ │ Report Generator │ │ -│ └───────────────────────────────┘ │ -└─────────────────────────────────────┘ -``` - -**Why NOT single-agent?** -- Need coordinated multi-tool queries -- Need iterative refinement -- Need quality assessment between searches - -**Why NOT pure ReAct?** -- Medical research requires structured workflow -- Need explicit quality gates -- Want deterministic tool selection - -**Why THIS pattern?** -- Clear separation of concerns -- Testable components -- Easy to debug -- Proven in similar systems - ---- - -## 2. Tool Selection & Orchestration Pattern - -### Decision: Static Tool Registry with Dynamic Selection - -**Pattern**: -```python -class ToolRegistry: - """Central registry of available research tools""" - tools = { - 'pubmed': PubMedSearchTool(), - 'web': WebSearchTool(), - 'trials': ClinicalTrialsTool(), - 'drugs': DrugInfoTool(), - } - -class Orchestrator: - def select_tools(self, question: str, iteration: int) -> List[Tool]: - """Dynamically choose tools based on context""" - if iteration == 0: - # First pass: broad search - return [tools['pubmed'], tools['web']] - else: - # Refinement: targeted search - return self.judge.recommend_tools(question, context) -``` - -**Why NOT on-the-fly agent factories?** -- 6-day timeline (too complex) -- Tools are known upfront -- Simpler to test and debug - -**Why NOT single tool?** -- Need multiple evidence sources -- Different tools for different info types -- Better coverage - -**Why THIS pattern?** -- Balance flexibility vs simplicity -- Tools can be added easily -- Selection logic is transparent - ---- - -## 3. 
Judge Pattern - -### Decision: Dual-Judge System (Quality + Budget) - -**Pattern**: -```python -class QualityJudge: - """LLM-based evidence quality assessment""" - - def is_sufficient(self, question: str, evidence: List[Evidence]) -> bool: - """Main decision: do we have enough?""" - return ( - self.has_mechanism_explanation(evidence) and - self.has_drug_candidates(evidence) and - self.has_clinical_evidence(evidence) and - self.confidence_score(evidence) > threshold - ) - - def identify_gaps(self, question: str, evidence: List[Evidence]) -> List[str]: - """What's missing?""" - gaps = [] - if not self.has_mechanism_explanation(evidence): - gaps.append("disease mechanism") - if not self.has_drug_candidates(evidence): - gaps.append("potential drug candidates") - if not self.has_clinical_evidence(evidence): - gaps.append("clinical trial data") - return gaps - -class BudgetJudge: - """Resource constraint enforcement""" - - def should_stop(self, state: ResearchState) -> bool: - """Hard limits""" - return ( - state.tokens_used >= max_tokens or - state.iterations >= max_iterations or - state.time_elapsed >= max_time - ) -``` - -**Why NOT just LLM judge?** -- Cost control (prevent runaway queries) -- Time bounds (hackathon demo needs to be fast) -- Safety (prevent infinite loops) - -**Why NOT just token budget?** -- Want early exit when answer is good -- Quality matters, not just quantity -- Better user experience - -**Why THIS pattern?** -- Best of both worlds -- Clear separation (quality vs resources) -- Each judge has single responsibility - ---- - -## 4. Break/Stopping Pattern - -### Decision: Three-Tier Break Conditions - -**Pattern**: -```python -def should_continue(state: ResearchState) -> bool: - """Multi-tier stopping logic""" - - # Tier 1: Quality-based (ideal stop) - if quality_judge.is_sufficient(state.question, state.evidence): - state.stop_reason = "sufficient_evidence" - return False - - # Tier 2: Budget-based (cost control) - if state.tokens_used >= config.max_tokens: - state.stop_reason = "token_budget_exceeded" - return False - - # Tier 3: Iteration-based (safety) - if state.iterations >= config.max_iterations: - state.stop_reason = "max_iterations_reached" - return False - - # Tier 4: Time-based (demo friendly) - if state.time_elapsed >= config.max_time: - state.stop_reason = "timeout" - return False - - return True # Continue researching -``` - -**Configuration**: -```toml -[research.limits] -max_tokens = 50000 # ~$0.50 at Claude pricing -max_iterations = 5 # Reasonable depth -max_time_seconds = 120 # 2 minutes for demo -judge_threshold = 0.8 # Quality confidence score -``` - -**Why multiple conditions?** -- Defense in depth -- Different failure modes -- Graceful degradation - -**Why these specific limits?** -- Tokens: Balances cost vs quality -- Iterations: Enough for refinement, not too deep -- Time: Fast enough for live demo -- Judge: High bar for quality - ---- - -## 5. 
State Management Pattern - -### Decision: Pydantic State Machine with Checkpoints - -**Pattern**: -```python -class ResearchState(BaseModel): - """Immutable state snapshots""" - query_id: str - question: str - iteration: int = 0 - evidence: List[Evidence] = [] - tokens_used: int = 0 - search_history: List[SearchQuery] = [] - stop_reason: Optional[str] = None - created_at: datetime - updated_at: datetime - -class StateManager: - def save_checkpoint(self, state: ResearchState) -> None: - """Save state to disk""" - path = f".deepresearch/checkpoints/{state.query_id}_iter{state.iteration}.json" - path.write_text(state.model_dump_json(indent=2)) - - def load_checkpoint(self, query_id: str, iteration: int) -> ResearchState: - """Resume from checkpoint""" - path = f".deepresearch/checkpoints/{query_id}_iter{iteration}.json" - return ResearchState.model_validate_json(path.read_text()) -``` - -**Directory Structure**: -``` -.deepresearch/ -├── state/ -│ └── current_123.json # Active research state -├── checkpoints/ -│ ├── query_123_iter0.json # Checkpoint after iteration 0 -│ ├── query_123_iter1.json # Checkpoint after iteration 1 -│ └── query_123_iter2.json # Checkpoint after iteration 2 -└── workspace/ - └── query_123/ - ├── papers/ # Downloaded PDFs - ├── search_results/ # Raw search results - └── analysis/ # Intermediate analysis -``` - -**Why Pydantic?** -- Type safety -- Validation -- Easy serialization -- Integration with Pydantic AI - -**Why checkpoints?** -- Resume interrupted research -- Debugging (inspect state at each iteration) -- Cost savings (don't re-query) -- Demo resilience - ---- - -## 6. Tool Interface Pattern - -### Decision: Async Unified Tool Protocol - -**Pattern**: -```python -from typing import Protocol, Optional, List, Dict -import asyncio - -class ResearchTool(Protocol): - """Standard async interface all tools must implement""" - - async def search( - self, - query: str, - max_results: int = 10, - filters: Optional[Dict] = None - ) -> List[Evidence]: - """Execute search and return structured evidence""" - ... - - def get_metadata(self) -> ToolMetadata: - """Tool capabilities and requirements""" - ... - -class PubMedSearchTool: - """Concrete async implementation""" - - def __init__(self): - self._rate_limiter = asyncio.Semaphore(3) # 3 req/sec - self._cache: Dict[str, List[Evidence]] = {} - - async def search(self, query: str, max_results: int = 10, **kwargs) -> List[Evidence]: - # Check cache first - cache_key = f"{query}:{max_results}" - if cache_key in self._cache: - return self._cache[cache_key] - - async with self._rate_limiter: - # 1. Query PubMed E-utilities API (async httpx) - async with httpx.AsyncClient() as client: - response = await client.get( - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi", - params={"db": "pubmed", "term": query, "retmax": max_results} - ) - # 2. Parse XML response - # 3. Extract: title, abstract, authors, citations - # 4. 
Convert to Evidence objects - evidence_list = self._parse_response(response.text) - - # Cache results - self._cache[cache_key] = evidence_list - return evidence_list - - def get_metadata(self) -> ToolMetadata: - return ToolMetadata( - name="PubMed", - description="Biomedical literature search", - rate_limit="3 requests/second", - requires_api_key=False - ) -``` - -**Parallel Tool Execution**: -```python -async def search_all_tools(query: str, tools: List[ResearchTool]) -> List[Evidence]: - """Run all tool searches in parallel""" - tasks = [tool.search(query) for tool in tools] - results = await asyncio.gather(*tasks, return_exceptions=True) - - # Flatten and filter errors - evidence = [] - for result in results: - if isinstance(result, Exception): - logger.warning(f"Tool failed: {result}") - else: - evidence.extend(result) - return evidence -``` - -**Why Async?** -- Tools are I/O bound (network calls) -- Parallel execution = faster searches -- Better UX (streaming progress) -- Standard in 2025 Python - -**Why Protocol?** -- Loose coupling -- Easy to add new tools -- Testable with mocks -- Clear contract - -**Why NOT abstract base class?** -- More Pythonic (PEP 544) -- Duck typing friendly -- Runtime checking with isinstance - ---- - -## 7. Report Generation Pattern - -### Decision: Structured Output with Citations - -**Pattern**: -```python -class DrugCandidate(BaseModel): - name: str - mechanism: str - evidence_quality: Literal["strong", "moderate", "weak"] - clinical_status: str # "FDA approved", "Phase 2", etc. - citations: List[Citation] - -class ResearchReport(BaseModel): - query: str - disease_mechanism: str - candidates: List[DrugCandidate] - methodology: str # How we searched - confidence: float - sources_used: List[str] - generated_at: datetime - - def to_markdown(self) -> str: - """Human-readable format""" - ... - - def to_json(self) -> str: - """Machine-readable format""" - ... -``` - -**Output Example**: -```markdown -# Research Report: Long COVID Fatigue - -## Disease Mechanism -Long COVID fatigue is associated with mitochondrial dysfunction -and persistent inflammation [1, 2]. - -## Drug Candidates - -### 1. Coenzyme Q10 (CoQ10) - STRONG EVIDENCE -- **Mechanism**: Mitochondrial support, ATP production -- **Status**: FDA approved (supplement) -- **Evidence**: 2 randomized controlled trials showing fatigue reduction -- **Citations**: - - Smith et al. (2023) - PubMed: 12345678 - - Johnson et al. (2023) - PubMed: 87654321 - -### 2. Low-dose Naltrexone (LDN) - MODERATE EVIDENCE -- **Mechanism**: Anti-inflammatory, immune modulation -- **Status**: FDA approved (different indication) -- **Evidence**: 3 case studies, 1 ongoing Phase 2 trial -- **Citations**: ... - -## Methodology -- Searched PubMed: 45 papers reviewed -- Searched Web: 12 sources -- Clinical trials: 8 trials identified -- Total iterations: 3 -- Tokens used: 12,450 - -## Confidence: 85% - -## Sources -- PubMed E-utilities -- ClinicalTrials.gov -- OpenFDA Database -``` - -**Why structured?** -- Parseable by other systems -- Consistent format -- Easy to validate -- Good for datasets - -**Why markdown?** -- Human-readable -- Renders nicely in Gradio -- Easy to convert to PDF -- Standard format - ---- - -## 8. 
Error Handling Pattern - -### Decision: Graceful Degradation with Fallbacks - -**Pattern**: -```python -class ResearchAgent: - def research(self, question: str) -> ResearchReport: - try: - return self._research_with_retry(question) - except TokenBudgetExceeded: - # Return partial results - return self._synthesize_partial(state) - except ToolFailure as e: - # Try alternate tools - return self._research_with_fallback(question, failed_tool=e.tool) - except Exception as e: - # Log and return error report - logger.error(f"Research failed: {e}") - return self._error_report(question, error=e) -``` - -**Why NOT fail fast?** -- Hackathon demo must be robust -- Partial results better than nothing -- Good user experience - -**Why NOT silent failures?** -- Need visibility for debugging -- User should know limitations -- Honest about confidence - ---- - -## 9. Configuration Pattern - -### Decision: Hydra-inspired but Simpler - -**Pattern**: -```toml -# config.toml - -[research] -max_iterations = 5 -max_tokens = 50000 -max_time_seconds = 120 -judge_threshold = 0.85 - -[tools] -enabled = ["pubmed", "web", "trials"] - -[tools.pubmed] -max_results = 20 -rate_limit = 3 # per second - -[tools.web] -engine = "serpapi" -max_results = 10 - -[llm] -provider = "anthropic" -model = "claude-3-5-sonnet-20241022" -temperature = 0.1 - -[output] -format = "markdown" -include_citations = true -include_methodology = true -``` - -**Loading**: -```python -from pathlib import Path -import tomllib - -def load_config() -> dict: - config_path = Path("config.toml") - with open(config_path, "rb") as f: - return tomllib.load(f) -``` - -**Why NOT full Hydra?** -- Simpler for hackathon -- Easier to understand -- Faster to modify -- Can upgrade later - -**Why TOML?** -- Human-readable -- Standard (PEP 680) -- Better than YAML edge cases -- Native in Python 3.11+ - ---- - -## 10. Testing Pattern - -### Decision: Three-Level Testing Strategy - -**Pattern**: -```python -# Level 1: Unit tests (fast, isolated) -def test_pubmed_tool(): - tool = PubMedSearchTool() - results = tool.search("aspirin cardiovascular") - assert len(results) > 0 - assert all(isinstance(r, Evidence) for r in results) - -# Level 2: Integration tests (tools + agent) -def test_research_loop(): - agent = ResearchAgent(config=test_config) - report = agent.research("aspirin repurposing") - assert report.candidates - assert report.confidence > 0 - -# Level 3: End-to-end tests (full system) -def test_full_workflow(): - # Simulate user query through Gradio UI - response = gradio_app.predict("test query") - assert "Drug Candidates" in response -``` - -**Why three levels?** -- Fast feedback (unit tests) -- Confidence (integration tests) -- Reality check (e2e tests) - -**Test Data**: -```python -# tests/fixtures/ -- mock_pubmed_response.xml -- mock_web_results.json -- sample_research_query.txt -- expected_report.md -``` - ---- - -## 11. Judge Prompt Templates - -### Decision: Structured JSON Output with Domain-Specific Criteria - -**Quality Judge System Prompt**: -```python -QUALITY_JUDGE_SYSTEM = """You are a medical research quality assessor specializing in drug repurposing. -Your task is to evaluate if collected evidence is sufficient to answer a drug repurposing question. - -You assess evidence against four criteria specific to drug repurposing research: -1. MECHANISM: Understanding of the disease's molecular/cellular mechanisms -2. CANDIDATES: Identification of potential drug candidates with known mechanisms -3. 
EVIDENCE: Clinical or preclinical evidence supporting repurposing -4. SOURCES: Quality and credibility of sources (peer-reviewed > preprints > web) - -You MUST respond with valid JSON only. No other text.""" -``` - -**Quality Judge User Prompt**: -```python -QUALITY_JUDGE_USER = """ -## Research Question -{question} - -## Evidence Collected (Iteration {iteration} of {max_iterations}) -{evidence_summary} - -## Token Budget -Used: {tokens_used} / {max_tokens} - -## Your Assessment - -Evaluate the evidence and respond with this exact JSON structure: - -```json -{{ - "assessment": {{ - "mechanism_score": <0-10>, - "mechanism_reasoning": "", - "candidates_score": <0-10>, - "candidates_found": ["", "", ...], - "evidence_score": <0-10>, - "evidence_reasoning": "", - "sources_score": <0-10>, - "sources_breakdown": {{ - "peer_reviewed": , - "clinical_trials": , - "preprints": , - "other": - }} - }}, - "overall_confidence": <0.0-1.0>, - "sufficient": , - "gaps": ["", ""], - "recommended_searches": ["", ""], - "recommendation": "" -}} -``` - -Decision rules: -- sufficient=true if overall_confidence >= 0.8 AND mechanism_score >= 6 AND candidates_score >= 6 -- sufficient=true if remaining budget < 10% (must synthesize with what we have) -- Otherwise, provide recommended_searches to fill gaps -""" -``` - -**Report Synthesis Prompt**: -```python -SYNTHESIS_PROMPT = """You are a medical research synthesizer creating a drug repurposing report. - -## Research Question -{question} - -## Collected Evidence -{all_evidence} - -## Judge Assessment -{final_assessment} - -## Your Task -Create a comprehensive research report with this structure: - -1. **Executive Summary** (2-3 sentences) -2. **Disease Mechanism** - What we understand about the condition -3. **Drug Candidates** - For each candidate: - - Drug name and current FDA status - - Proposed mechanism for this condition - - Evidence quality (strong/moderate/weak) - - Key citations -4. **Methodology** - How we searched (tools used, queries, iterations) -5. **Limitations** - What we couldn't find or verify -6. **Confidence Score** - Overall confidence in findings - -Format as Markdown. Include PubMed IDs as citations [PMID: 12345678]. -Be scientifically accurate. Do not hallucinate drug names or mechanisms. -If evidence is weak, say so clearly.""" -``` - -**Why Structured JSON?** -- Parseable by code (not just LLM output) -- Consistent format for logging/debugging -- Can trigger specific actions (continue vs synthesize) -- Testable with expected outputs - -**Why Domain-Specific Criteria?** -- Generic "is this good?" prompts fail -- Drug repurposing has specific requirements -- Physician on team validated criteria -- Maps to real research workflow - ---- - -## 12. 
MCP Server Integration (Hackathon Track) - -### Decision: Tools as MCP Servers for Reusability - -**Why MCP?** -- Hackathon has dedicated MCP track -- Makes our tools reusable by others -- Standard protocol (Model Context Protocol) -- Future-proof (industry adoption growing) - -**Architecture**: -``` -┌─────────────────────────────────────────────────┐ -│ DeepCritical Agent │ -│ (uses tools directly OR via MCP) │ -└─────────────────────────────────────────────────┘ - │ - ┌────────────┼────────────┐ - ↓ ↓ ↓ -┌─────────────┐ ┌──────────┐ ┌───────────────┐ -│ PubMed MCP │ │ Web MCP │ │ Trials MCP │ -│ Server │ │ Server │ │ Server │ -└─────────────┘ └──────────┘ └───────────────┘ - │ │ │ - ↓ ↓ ↓ - PubMed API Brave/DDG ClinicalTrials.gov -``` - -**PubMed MCP Server Implementation**: -```python -# src/mcp_servers/pubmed_server.py -from fastmcp import FastMCP - -mcp = FastMCP("PubMed Research Tool") - -@mcp.tool() -async def search_pubmed( - query: str, - max_results: int = 10, - date_range: str = "5y" -) -> dict: - """ - Search PubMed for biomedical literature. - - Args: - query: Search terms (supports PubMed syntax like [MeSH]) - max_results: Maximum papers to return (default 10, max 100) - date_range: Time filter - "1y", "5y", "10y", or "all" - - Returns: - dict with papers list containing title, abstract, authors, pmid, date - """ - tool = PubMedSearchTool() - results = await tool.search(query, max_results) - return { - "query": query, - "count": len(results), - "papers": [r.model_dump() for r in results] - } - -@mcp.tool() -async def get_paper_details(pmid: str) -> dict: - """ - Get full details for a specific PubMed paper. - - Args: - pmid: PubMed ID (e.g., "12345678") - - Returns: - Full paper metadata including abstract, MeSH terms, references - """ - tool = PubMedSearchTool() - return await tool.get_details(pmid) - -if __name__ == "__main__": - mcp.run() -``` - -**Running the MCP Server**: -```bash -# Start the server -python -m src.mcp_servers.pubmed_server - -# Or with uvx (recommended) -uvx fastmcp run src/mcp_servers/pubmed_server.py - -# Note: fastmcp uses stdio transport by default, which is perfect -# for local integration with Claude Desktop or the main agent. -``` - -**Claude Desktop Integration** (for demo): -```json -// ~/Library/Application Support/Claude/claude_desktop_config.json -{ - "mcpServers": { - "pubmed": { - "command": "python", - "args": ["-m", "src.mcp_servers.pubmed_server"], - "cwd": "/path/to/deepcritical" - } - } -} -``` - -**Why FastMCP?** -- Simple decorator syntax -- Handles protocol complexity -- Good docs and examples -- Works with Claude Desktop and API - -**MCP Track Submission Requirements**: -- [ ] At least one tool as MCP server -- [ ] README with setup instructions -- [ ] Demo showing MCP usage -- [ ] Bonus: Multiple tools as MCP servers - ---- - -## 13. 
Gradio UI Pattern (Hackathon Track) - -### Decision: Streaming Progress with Modern UI - -**Pattern**: -```python -import gradio as gr -from typing import Generator - -def research_with_streaming(question: str) -> Generator[str, None, None]: - """Stream research progress to UI""" - yield "🔍 Starting research...\n\n" - - agent = ResearchAgent() - - async for event in agent.research_stream(question): - match event.type: - case "search_start": - yield f"📚 Searching {event.tool}...\n" - case "search_complete": - yield f"✅ Found {event.count} results from {event.tool}\n" - case "judge_thinking": - yield f"🤔 Evaluating evidence quality...\n" - case "judge_decision": - yield f"📊 Confidence: {event.confidence:.0%}\n" - case "iteration_complete": - yield f"🔄 Iteration {event.iteration} complete\n\n" - case "synthesis_start": - yield f"📝 Generating report...\n" - case "complete": - yield f"\n---\n\n{event.report}" - -# Gradio 5 UI -with gr.Blocks(theme=gr.themes.Soft()) as demo: - gr.Markdown("# 🔬 DeepCritical: Drug Repurposing Research Agent") - gr.Markdown("Ask a question about potential drug repurposing opportunities.") - - with gr.Row(): - with gr.Column(scale=2): - question = gr.Textbox( - label="Research Question", - placeholder="What existing drugs might help treat long COVID fatigue?", - lines=2 - ) - examples = gr.Examples( - examples=[ - "What existing drugs might help treat long COVID fatigue?", - "Find existing drugs that might slow Alzheimer's progression", - "Which diabetes drugs show promise for cancer treatment?" - ], - inputs=question - ) - submit = gr.Button("🚀 Start Research", variant="primary") - - with gr.Column(scale=3): - output = gr.Markdown(label="Research Progress & Report") - - submit.click( - fn=research_with_streaming, - inputs=question, - outputs=output, - ) - -demo.launch() -``` - -**Why Streaming?** -- User sees progress, not loading spinner -- Builds trust (system is working) -- Better UX for long operations -- Gradio 5 native support - -**Why gr.Markdown Output?** -- Research reports are markdown -- Renders citations nicely -- Code blocks for methodology -- Tables for drug comparisons - ---- - -## Summary: Design Decision Table - -| # | Question | Decision | Why | -|---|----------|----------|-----| -| 1 | **Architecture** | Orchestrator with search-judge loop | Clear, testable, proven | -| 2 | **Tools** | Static registry, dynamic selection | Balance flexibility vs simplicity | -| 3 | **Judge** | Dual (quality + budget) | Quality + cost control | -| 4 | **Stopping** | Four-tier conditions | Defense in depth | -| 5 | **State** | Pydantic + checkpoints | Type-safe, resumable | -| 6 | **Tool Interface** | Async Protocol + parallel execution | Fast I/O, modern Python | -| 7 | **Output** | Structured + Markdown | Human & machine readable | -| 8 | **Errors** | Graceful degradation + fallbacks | Robust for demo | -| 9 | **Config** | TOML (Hydra-inspired) | Simple, standard | -| 10 | **Testing** | Three levels | Fast feedback + confidence | -| 11 | **Judge Prompts** | Structured JSON + domain criteria | Parseable, medical-specific | -| 12 | **MCP** | Tools as MCP servers | Hackathon track, reusability | -| 13 | **UI** | Gradio 5 streaming | Progress visibility, modern UX | - ---- - -## Answers to Specific Questions - -### "What's the orchestrator pattern?" -**Answer**: See Section 1 - Iterative Research Orchestrator with search-judge loop - -### "LLM-as-judge or token budget?" 
-**Answer**: Both - See Section 3 (Dual-Judge System) and Section 4 (Three-Tier Break Conditions) - -### "What's the break pattern?" -**Answer**: See Section 4 - Three stopping conditions: quality threshold, token budget, max iterations - -### "Should we use agent factories?" -**Answer**: No - See Section 2. Static tool registry is simpler for 6-day timeline - -### "How do we handle state?" -**Answer**: See Section 5 - Pydantic state machine with checkpoints - ---- - -## Appendix: Complete Data Models - -```python -# src/deepresearch/models.py -from pydantic import BaseModel, Field -from typing import List, Optional, Literal -from datetime import datetime - -class Citation(BaseModel): - """Reference to a source""" - source_type: Literal["pubmed", "web", "trial", "fda"] - identifier: str # PMID, URL, NCT number, etc. - title: str - authors: Optional[List[str]] = None - date: Optional[str] = None - url: Optional[str] = None - -class Evidence(BaseModel): - """Single piece of evidence from search""" - content: str - source: Citation - relevance_score: float = Field(ge=0, le=1) - evidence_type: Literal["mechanism", "candidate", "clinical", "safety"] - -class DrugCandidate(BaseModel): - """Potential drug for repurposing""" - name: str - generic_name: Optional[str] = None - mechanism: str - current_indications: List[str] - proposed_mechanism: str - evidence_quality: Literal["strong", "moderate", "weak"] - fda_status: str - citations: List[Citation] - -class JudgeAssessment(BaseModel): - """Output from quality judge""" - mechanism_score: int = Field(ge=0, le=10) - candidates_score: int = Field(ge=0, le=10) - evidence_score: int = Field(ge=0, le=10) - sources_score: int = Field(ge=0, le=10) - overall_confidence: float = Field(ge=0, le=1) - sufficient: bool - gaps: List[str] - recommended_searches: List[str] - recommendation: Literal["continue", "synthesize"] - -class ResearchState(BaseModel): - """Complete state of a research session""" - query_id: str - question: str - iteration: int = 0 - evidence: List[Evidence] = [] - assessments: List[JudgeAssessment] = [] - tokens_used: int = 0 - search_history: List[str] = [] - stop_reason: Optional[str] = None - created_at: datetime = Field(default_factory=datetime.utcnow) - updated_at: datetime = Field(default_factory=datetime.utcnow) - -class ResearchReport(BaseModel): - """Final output report""" - query: str - executive_summary: str - disease_mechanism: str - candidates: List[DrugCandidate] - methodology: str - limitations: str - confidence: float - sources_used: int - tokens_used: int - iterations: int - generated_at: datetime = Field(default_factory=datetime.utcnow) - - def to_markdown(self) -> str: - """Render as markdown for Gradio""" - md = f"# Research Report: {self.query}\n\n" - md += f"## Executive Summary\n{self.executive_summary}\n\n" - md += f"## Disease Mechanism\n{self.disease_mechanism}\n\n" - md += "## Drug Candidates\n\n" - for i, drug in enumerate(self.candidates, 1): - md += f"### {i}. {drug.name} - {drug.evidence_quality.upper()} EVIDENCE\n" - md += f"- **Mechanism**: {drug.proposed_mechanism}\n" - md += f"- **FDA Status**: {drug.fda_status}\n" - md += f"- **Current Uses**: {', '.join(drug.current_indications)}\n" - md += f"- **Citations**: {len(drug.citations)} sources\n\n" - md += f"## Methodology\n{self.methodology}\n\n" - md += f"## Limitations\n{self.limitations}\n\n" - md += f"## Confidence: {self.confidence:.0%}\n" - return md -``` - ---- - -## 14. 
Alternative Frameworks Considered - -We researched major agent frameworks before settling on our stack. Here's why we chose what we chose, and what we'd steal if we're shipping like animals and have time for Gucci upgrades. - -### Frameworks Evaluated - -| Framework | Repo | What It Does | -|-----------|------|--------------| -| **Microsoft AutoGen** | [github.com/microsoft/autogen](https://github.com/microsoft/autogen) | Multi-agent orchestration, complex workflows | -| **Claude Agent SDK** | [github.com/anthropics/claude-agent-sdk-python](https://github.com/anthropics/claude-agent-sdk-python) | Anthropic's official agent framework | -| **Pydantic AI** | [github.com/pydantic/pydantic-ai](https://github.com/pydantic/pydantic-ai) | Type-safe agents, structured outputs | - -### Why NOT AutoGen (Microsoft)? - -**Pros:** -- Battle-tested multi-agent orchestration -- `reflect_on_tool_use` - model reviews its own tool results -- `max_tool_iterations` - built-in iteration limits -- Concurrent tool execution -- Rich ecosystem (AutoGen Studio, benchmarks) - -**Cons for MVP:** -- Heavy dependency tree (50+ packages) -- Complex configuration (YAML + Python) -- Overkill for single-agent search-judge loop -- Learning curve eats into 6-day timeline - -**Verdict:** Great for multi-agent systems. Overkill for our MVP. - -### Why NOT Claude Agent SDK (Anthropic)? - -**Pros:** -- Official Anthropic framework -- Clean `@tool` decorator pattern -- In-process MCP servers (no subprocess) -- Hooks for pre/post tool execution -- Direct Claude Code integration - -**Cons for MVP:** -- Requires Claude Code CLI bundled -- Node.js dependency for some features -- Designed for Claude Code ecosystem, not standalone agents -- Less flexible for custom LLM providers - -**Verdict:** Would be great if we were building ON Claude Code. We're building a standalone agent. - -### Why Pydantic AI + FastMCP (Our Choice) - -**Pros:** -- ✅ Simple, Pythonic API -- ✅ Native async/await -- ✅ Type-safe with Pydantic -- ✅ Works with any LLM provider -- ✅ FastMCP for clean MCP servers -- ✅ Minimal dependencies -- ✅ Can ship MVP in 6 days - -**Cons:** -- Newer framework (less battle-tested) -- Smaller ecosystem -- May need to build more from scratch - -**Verdict:** Right tool for the job. Ship fast, iterate later. - ---- - -## 15. Stretch Goals: Gucci Bangers (If We're Shipping Like Animals) - -If MVP ships early and we're crushing it, here's what we'd steal from other frameworks: - -### Tier 1: Quick Wins (2-4 hours each) - -#### From Claude Agent SDK: `@tool` Decorator Pattern -Replace our Protocol-based tools with cleaner decorators: - -```python -# CURRENT (Protocol-based) -class PubMedSearchTool: - async def search(self, query: str, max_results: int = 10) -> List[Evidence]: - ... - -# UPGRADE (Decorator-based, stolen from Claude SDK) -from claude_agent_sdk import tool - -@tool("search_pubmed", "Search PubMed for biomedical papers", { - "query": str, - "max_results": int -}) -async def search_pubmed(args): - results = await _do_pubmed_search(args["query"], args["max_results"]) - return {"content": [{"type": "text", "text": json.dumps(results)}]} -``` - -**Why it's Gucci:** Cleaner syntax, automatic schema generation, less boilerplate. 
- -#### From AutoGen: Reflect on Tool Use -Add a reflection step where the model reviews its own tool results: - -```python -# CURRENT: Judge evaluates evidence -assessment = await judge.assess(question, evidence) - -# UPGRADE: Add reflection step (stolen from AutoGen) -class ReflectiveJudge: - async def assess_with_reflection(self, question, evidence, tool_results): - # First pass: raw assessment - initial = await self._assess(question, evidence) - - # Reflection: "Did I use the tools correctly?" - reflection = await self._reflect_on_tool_use(tool_results) - - # Final: combine assessment + reflection - return self._combine(initial, reflection) -``` - -**Why it's Gucci:** Catches tool misuse, improves accuracy, more robust judge. - -### Tier 2: Medium Lifts (4-8 hours each) - -#### From AutoGen: Concurrent Tool Execution -Run multiple tools in parallel with proper error handling: - -```python -# CURRENT: Sequential with asyncio.gather -results = await asyncio.gather(*[tool.search(query) for tool in tools]) - -# UPGRADE: AutoGen-style with cancellation + timeout -from autogen_core import CancellationToken - -async def execute_tools_concurrent(tools, query, timeout=30): - token = CancellationToken() - - async def run_with_timeout(tool): - try: - return await asyncio.wait_for( - tool.search(query, cancellation_token=token), - timeout=timeout - ) - except asyncio.TimeoutError: - token.cancel() # Cancel other tools - return ToolError(f"{tool.name} timed out") - - return await asyncio.gather(*[run_with_timeout(t) for t in tools]) -``` - -**Why it's Gucci:** Proper timeout handling, cancellation propagation, production-ready. - -#### From Claude SDK: Hooks System -Add pre/post hooks for logging, validation, cost tracking: - -```python -# UPGRADE: Hook system (stolen from Claude SDK) -class HookManager: - async def pre_tool_use(self, tool_name, args): - """Called before every tool execution""" - logger.info(f"Calling {tool_name} with {args}") - self.cost_tracker.start_timer() - - async def post_tool_use(self, tool_name, result, duration): - """Called after every tool execution""" - self.cost_tracker.record(tool_name, duration) - if result.is_error: - self.error_tracker.record(tool_name, result.error) -``` - -**Why it's Gucci:** Observability, debugging, cost tracking, production-ready. - -### Tier 3: Big Lifts (Post-Hackathon) - -#### Full AutoGen Integration -If we want multi-agent capabilities later: - -```python -# POST-HACKATHON: Multi-agent drug repurposing -from autogen_agentchat import AssistantAgent, GroupChat - -literature_agent = AssistantAgent( - name="LiteratureReviewer", - tools=[pubmed_search, web_search], - system_message="You search and summarize medical literature." -) - -mechanism_agent = AssistantAgent( - name="MechanismAnalyzer", - tools=[pathway_db, protein_db], - system_message="You analyze disease mechanisms and drug targets." -) - -synthesis_agent = AssistantAgent( - name="ReportSynthesizer", - system_message="You synthesize findings into actionable reports." -) - -# Orchestrate multi-agent workflow -group_chat = GroupChat( - agents=[literature_agent, mechanism_agent, synthesis_agent], - max_round=10 -) -``` - -**Why it's Gucci:** True multi-agent collaboration, specialized roles, scalable. 
- ---- - -## Priority Order for Stretch Goals - -| Priority | Feature | Source | Effort | Impact | -|----------|---------|--------|--------|--------| -| 1 | `@tool` decorator | Claude SDK | 2 hrs | High - cleaner code | -| 2 | Reflect on tool use | AutoGen | 3 hrs | High - better accuracy | -| 3 | Hooks system | Claude SDK | 4 hrs | Medium - observability | -| 4 | Concurrent + cancellation | AutoGen | 4 hrs | Medium - robustness | -| 5 | Multi-agent | AutoGen | 8+ hrs | Post-hackathon | - ---- - -## The Bottom Line - -``` -┌─────────────────────────────────────────────────────────────┐ -│ MVP (Days 1-4): Pydantic AI + FastMCP │ -│ - Ship working drug repurposing agent │ -│ - Search-judge loop with PubMed + Web │ -│ - Gradio UI with streaming │ -│ - MCP server for hackathon track │ -├─────────────────────────────────────────────────────────────┤ -│ If Crushing It (Days 5-6): Steal the Gucci │ -│ - @tool decorators from Claude SDK │ -│ - Reflect on tool use from AutoGen │ -│ - Hooks for observability │ -├─────────────────────────────────────────────────────────────┤ -│ Post-Hackathon: Full AutoGen Integration │ -│ - Multi-agent workflows │ -│ - Specialized agent roles │ -│ - Production-grade orchestration │ -└─────────────────────────────────────────────────────────────┘ -``` - -**Ship MVP first. Steal bangers if time. Scale later.** - ---- - -## 16. Reference Implementation Resources - -We've cloned production-ready repos into `reference_repos/` that we can vendor, copy from, or just USE directly. This section documents what's available and how to leverage it. - -### Cloned Repositories - -| Repository | Location | What It Provides | -|------------|----------|------------------| -| **pydanticai-research-agent** | `reference_repos/pydanticai-research-agent/` | Complete PydanticAI agent with Brave Search | -| **pubmed-mcp-server** | `reference_repos/pubmed-mcp-server/` | Production-grade PubMed MCP server (TypeScript) | -| **autogen-microsoft** | `reference_repos/autogen-microsoft/` | Microsoft's multi-agent framework | -| **claude-agent-sdk** | `reference_repos/claude-agent-sdk/` | Anthropic's agent SDK with @tool decorator | - -### 🔥 CHEAT CODE: Production PubMed MCP Already Exists - -The `pubmed-mcp-server` is **production-grade** and has EVERYTHING we need: - -```bash -# Already available tools in pubmed-mcp-server: -pubmed_search_articles # Search PubMed with filters, date ranges -pubmed_fetch_contents # Get full article details by PMID -pubmed_article_connections # Find citations, related articles -pubmed_research_agent # Generate research plan outlines -pubmed_generate_chart # Create PNG charts from data -``` - -**Option 1: Use it directly via npx** -```json -{ - "mcpServers": { - "pubmed": { - "command": "npx", - "args": ["@cyanheads/pubmed-mcp-server"], - "env": { "NCBI_API_KEY": "your_key" } - } - } -} -``` - -**Option 2: Vendor the logic into Python** -The TypeScript code in `reference_repos/pubmed-mcp-server/src/` shows exactly how to: -- Construct PubMed E-utilities queries -- Handle rate limiting (3/sec without key, 10/sec with key) -- Parse XML responses -- Extract article metadata - -### PydanticAI Research Agent Patterns - -The `pydanticai-research-agent` repo provides copy-paste patterns: - -**Agent Definition** (`agents/research_agent.py`): -```python -from pydantic_ai import Agent, RunContext -from dataclasses import dataclass - -@dataclass -class ResearchAgentDependencies: - brave_api_key: str - session_id: Optional[str] = None - -research_agent = Agent( - 
get_llm_model(), - deps_type=ResearchAgentDependencies, - system_prompt=SYSTEM_PROMPT -) - -@research_agent.tool -async def search_web( - ctx: RunContext[ResearchAgentDependencies], - query: str, - max_results: int = 10 -) -> List[Dict[str, Any]]: - """Search with context access via ctx.deps""" - results = await search_web_tool(ctx.deps.brave_api_key, query, max_results) - return results -``` - -**Brave Search Tool** (`tools/brave_search.py`): -```python -async def search_web_tool(api_key: str, query: str, count: int = 10) -> List[Dict]: - headers = {"X-Subscription-Token": api_key, "Accept": "application/json"} - async with httpx.AsyncClient() as client: - response = await client.get( - "https://api.search.brave.com/res/v1/web/search", - headers=headers, - params={"q": query, "count": count}, - timeout=30.0 - ) - # Handle 429 rate limit, 401 auth errors - data = response.json() - return data.get("web", {}).get("results", []) -``` - -**Pydantic Models** (`models/research_models.py`): -```python -class BraveSearchResult(BaseModel): - title: str - url: str - description: str - score: float = Field(ge=0.0, le=1.0) -``` - -### Microsoft Agent Framework Orchestration Patterns - -From [deepwiki.com/microsoft/agent-framework](https://deepwiki.com/microsoft/agent-framework/3.4-workflows-and-orchestration): - -#### Sequential Orchestration -``` -Agent A → Agent B → Agent C (each receives prior outputs) -``` -**Use when:** Tasks have dependencies, results inform next steps. - -#### Concurrent (Fan-out/Fan-in) -``` - ┌→ Agent A ─┐ -Dispatcher ├→ Agent B ─┼→ Aggregator - └→ Agent C ─┘ -``` -**Use when:** Independent tasks can run in parallel, results need consolidation. -**Our use:** Parallel PubMed + Web search. - -#### Handoff Orchestration -``` -Coordinator → routes to → Specialist A, B, or C based on request -``` -**Use when:** Router decides which search strategy based on query type. -**Our use:** Route "mechanism" vs "clinical trial" vs "drug info" queries. - -#### HITL (Human-in-the-Loop) -``` -Agent → RequestInfoEvent → Human validates → Agent continues -``` -**Use when:** Critical judgment points need human validation. -**Our use:** Optional "approve drug candidates before synthesis" step. - -### Recommended Hybrid Pattern for Our Agent - -Based on all the research, here's our recommended implementation: - -``` -┌─────────────────────────────────────────────────────────┐ -│ 1. ROUTER (Handoff Pattern) │ -│ - Analyze query type │ -│ - Choose search strategy │ -├─────────────────────────────────────────────────────────┤ -│ 2. SEARCH (Concurrent Pattern) │ -│ - Fan-out to PubMed + Web in parallel │ -│ - Timeout handling per AutoGen patterns │ -│ - Aggregate results │ -├─────────────────────────────────────────────────────────┤ -│ 3. JUDGE (Sequential + Budget) │ -│ - Quality assessment │ -│ - Token/iteration budget check │ -│ - Recommend: continue or synthesize │ -├─────────────────────────────────────────────────────────┤ -│ 4. SYNTHESIZE (Final Agent) │ -│ - Generate research report │ -│ - Include citations │ -│ - Stream to Gradio UI │ -└─────────────────────────────────────────────────────────┘ -``` - -### Quick Start: Minimal Implementation Path - -**Day 1-2: Core Loop** -1. Copy `search_web_tool` from `pydanticai-research-agent/tools/brave_search.py` -2. Implement PubMed search (reference `pubmed-mcp-server/src/` for E-utilities patterns) -3. Wire up basic search-judge loop - -**Day 3: Judge + State** -1. Implement quality judge with JSON structured output -2. Add budget judge -3. 
Add Pydantic state management - -**Day 4: UI + MCP** -1. Gradio streaming UI -2. Wrap PubMed tool as FastMCP server - -**Day 5-6: Polish + Deploy** -1. HuggingFace Spaces deployment -2. Demo video -3. Stretch goals if time - ---- - -## 17. External Resources & MCP Servers - -### Available PubMed MCP Servers (Community) - -| Server | Author | Features | Link | -|--------|--------|----------|------| -| **pubmed-mcp-server** | cyanheads | Full E-utilities, research agent, charts | [GitHub](https://github.com/cyanheads/pubmed-mcp-server) | -| **BioMCP** | GenomOncology | PubMed + ClinicalTrials + MyVariant | [GitHub](https://github.com/genomoncology/biomcp) | -| **PubMed-MCP-Server** | JackKuo666 | Basic search, metadata access | [GitHub](https://github.com/JackKuo666/PubMed-MCP-Server) | - -### Web Search Options - -| Tool | Free Tier | API Key | Async Support | -|------|-----------|---------|---------------| -| **Brave Search** | 2000/month | Required | Yes (httpx) | -| **DuckDuckGo** | Unlimited | No | Yes (duckduckgo-search) | -| **SerpAPI** | None | Required | Yes | - -**Recommended:** Start with DuckDuckGo (free, no key), upgrade to Brave for production. - -```python -# DuckDuckGo async search (no API key needed!) -from duckduckgo_search import DDGS - -async def search_ddg(query: str, max_results: int = 10) -> List[Dict]: - with DDGS() as ddgs: - results = list(ddgs.text(query, max_results=max_results)) - return [{"title": r["title"], "url": r["href"], "description": r["body"]} for r in results] -``` - ---- - -**Document Status**: Official Architecture Spec -**Review Score**: 100/100 (Ironclad Gucci Banger Edition) -**Sections**: 17 design patterns + data models appendix + reference repos + stretch goals -**Last Updated**: November 2025 diff --git a/docs/architecture/graph-orchestration.md b/docs/architecture/graph-orchestration.md new file mode 100644 index 0000000000000000000000000000000000000000..249351123cd97c4945634767eda230e551e26da4 --- /dev/null +++ b/docs/architecture/graph-orchestration.md @@ -0,0 +1,152 @@ +# Graph Orchestration Architecture + +## Overview + +Phase 4 implements a graph-based orchestration system for research workflows using Pydantic AI agents as nodes. This enables better parallel execution, conditional routing, and state management compared to simple agent chains. + +## Graph Structure + +### Nodes + +Graph nodes represent different stages in the research workflow: + +1. **Agent Nodes**: Execute Pydantic AI agents + - Input: Prompt/query + - Output: Structured or unstructured response + - Examples: `KnowledgeGapAgent`, `ToolSelectorAgent`, `ThinkingAgent` + +2. **State Nodes**: Update or read workflow state + - Input: Current state + - Output: Updated state + - Examples: Update evidence, update conversation history + +3. **Decision Nodes**: Make routing decisions based on conditions + - Input: Current state/results + - Output: Next node ID + - Examples: Continue research vs. complete research + +4. **Parallel Nodes**: Execute multiple nodes concurrently + - Input: List of node IDs + - Output: Aggregated results + - Examples: Parallel iterative research loops + +### Edges + +Edges define transitions between nodes: + +1. **Sequential Edges**: Always traversed (no condition) + - From: Source node + - To: Target node + - Condition: None (always True) + +2. 
**Conditional Edges**: Traversed based on condition + - From: Source node + - To: Target node + - Condition: Callable that returns bool + - Example: If research complete → go to writer, else → continue loop + +3. **Parallel Edges**: Used for parallel execution branches + - From: Parallel node + - To: Multiple target nodes + - Execution: All targets run concurrently + +## Graph Patterns + +### Iterative Research Graph + +``` +[Input] → [Thinking] → [Knowledge Gap] → [Decision: Complete?] + ↓ No ↓ Yes + [Tool Selector] [Writer] + ↓ + [Execute Tools] → [Loop Back] +``` + +### Deep Research Graph + +``` +[Input] → [Planner] → [Parallel Iterative Loops] → [Synthesizer] + ↓ ↓ ↓ + [Loop1] [Loop2] [Loop3] +``` + +## State Management + +State is managed via `WorkflowState` using `ContextVar` for thread-safe isolation: + +- **Evidence**: Collected evidence from searches +- **Conversation**: Iteration history (gaps, tool calls, findings, thoughts) +- **Embedding Service**: For semantic search + +State transitions occur at state nodes, which update the global workflow state. + +## Execution Flow + +1. **Graph Construction**: Build graph from nodes and edges +2. **Graph Validation**: Ensure graph is valid (no cycles, all nodes reachable) +3. **Graph Execution**: Traverse graph from entry node +4. **Node Execution**: Execute each node based on type +5. **Edge Evaluation**: Determine next node(s) based on edges +6. **Parallel Execution**: Use `asyncio.gather()` for parallel nodes +7. **State Updates**: Update state at state nodes +8. **Event Streaming**: Yield events during execution for UI + +## Conditional Routing + +Decision nodes evaluate conditions and return next node IDs: + +- **Knowledge Gap Decision**: If `research_complete` → writer, else → tool selector +- **Budget Decision**: If budget exceeded → exit, else → continue +- **Iteration Decision**: If max iterations → exit, else → continue + +## Parallel Execution + +Parallel nodes execute multiple nodes concurrently: + +- Each parallel branch runs independently +- Results are aggregated after all branches complete +- State is synchronized after parallel execution +- Errors in one branch don't stop other branches + +## Budget Enforcement + +Budget constraints are enforced at decision nodes: + +- **Token Budget**: Track LLM token usage +- **Time Budget**: Track elapsed time +- **Iteration Budget**: Track iteration count + +If any budget is exceeded, execution routes to exit node. + +## Error Handling + +Errors are handled at multiple levels: + +1. **Node Level**: Catch errors in individual node execution +2. **Graph Level**: Handle errors during graph traversal +3. **State Level**: Rollback state changes on error + +Errors are logged and yield error events for UI. + +## Backward Compatibility + +Graph execution is optional via feature flag: + +- `USE_GRAPH_EXECUTION=true`: Use graph-based execution +- `USE_GRAPH_EXECUTION=false`: Use agent chain execution (existing) + +This allows gradual migration and fallback if needed. + + + + + + + + + + + + + + diff --git a/docs/architecture/graph_orchestration.md b/docs/architecture/graph_orchestration.md index 7bd82206e8cf305709158d9fdf1e2580ccb1dab9..ec5601fdc9ad0108706bb6a4c855fc2407cb6064 100644 --- a/docs/architecture/graph_orchestration.md +++ b/docs/architecture/graph_orchestration.md @@ -137,6 +137,14 @@ Graph execution is optional via feature flag: This allows gradual migration and fallback if needed. 
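+A sketch of one way the flag could be honored when constructing a flow is shown below. Reading the
+flag from `os.environ` and the `build_flow` helper are illustrative assumptions; the
+`IterativeResearchFlow` constructor arguments follow the Orchestrators architecture page.
+
+```python
+import os
+
+from src.orchestrator.research_flow import IterativeResearchFlow
+
+
+def build_flow(search_handler, judge_handler) -> IterativeResearchFlow:
+    # Fall back to agent-chain execution unless USE_GRAPH_EXECUTION is truthy.
+    use_graph = os.environ.get("USE_GRAPH_EXECUTION", "true").lower() in ("1", "true", "yes")
+    return IterativeResearchFlow(
+        search_handler=search_handler,
+        judge_handler=judge_handler,
+        use_graph=use_graph,
+    )
+```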
+## See Also + +- [Orchestrators](orchestrators.md) - Overview of all orchestrator patterns +- [Workflows](workflows.md) - Workflow diagrams and patterns +- [Workflow Diagrams](workflow-diagrams.md) - Detailed workflow diagrams +- [API Reference - Orchestrators](../api/orchestrators.md) - API documentation + + diff --git a/docs/architecture/middleware.md b/docs/architecture/middleware.md new file mode 100644 index 0000000000000000000000000000000000000000..c2c28cc3850703e70303309c0ca1b048f8edc39d --- /dev/null +++ b/docs/architecture/middleware.md @@ -0,0 +1,131 @@ +# Middleware Architecture + +DeepCritical uses middleware for state management, budget tracking, and workflow coordination. + +## State Management + +### WorkflowState + +**File**: `src/middleware/state_machine.py` + +**Purpose**: Thread-safe state management for research workflows + +**Implementation**: Uses `ContextVar` for thread-safe isolation + +**State Components**: +- `evidence: list[Evidence]`: Collected evidence from searches +- `conversation: Conversation`: Iteration history (gaps, tool calls, findings, thoughts) +- `embedding_service: Any`: Embedding service for semantic search + +**Methods**: +- `add_evidence(evidence: Evidence)`: Adds evidence with URL-based deduplication +- `async search_related(query: str, top_k: int = 5) -> list[Evidence]`: Semantic search + +**Initialization**: +```python +from src.middleware.state_machine import init_workflow_state + +init_workflow_state(embedding_service) +``` + +**Access**: +```python +from src.middleware.state_machine import get_workflow_state + +state = get_workflow_state() # Auto-initializes if missing +``` + +## Workflow Manager + +**File**: `src/middleware/workflow_manager.py` + +**Purpose**: Coordinates parallel research loops + +**Methods**: +- `add_loop(loop: ResearchLoop)`: Add a research loop to manage +- `async run_loops_parallel() -> list[ResearchLoop]`: Run all loops in parallel +- `update_loop_status(loop_id: str, status: str)`: Update loop status +- `sync_loop_evidence_to_state()`: Synchronize evidence from loops to global state + +**Features**: +- Uses `asyncio.gather()` for parallel execution +- Handles errors per loop (doesn't fail all if one fails) +- Tracks loop status: `pending`, `running`, `completed`, `failed`, `cancelled` +- Evidence deduplication across parallel loops + +**Usage**: +```python +from src.middleware.workflow_manager import WorkflowManager + +manager = WorkflowManager() +manager.add_loop(loop1) +manager.add_loop(loop2) +completed_loops = await manager.run_loops_parallel() +``` + +## Budget Tracker + +**File**: `src/middleware/budget_tracker.py` + +**Purpose**: Tracks and enforces resource limits + +**Budget Components**: +- **Tokens**: LLM token usage +- **Time**: Elapsed time in seconds +- **Iterations**: Number of iterations + +**Methods**: +- `create_budget(token_limit, time_limit_seconds, iterations_limit) -> BudgetStatus` +- `add_tokens(tokens: int)`: Add token usage +- `start_timer()`: Start time tracking +- `update_timer()`: Update elapsed time +- `increment_iteration()`: Increment iteration count +- `check_budget() -> BudgetStatus`: Check current budget status +- `can_continue() -> bool`: Check if research can continue + +**Token Estimation**: +- `estimate_tokens(text: str) -> int`: ~4 chars per token +- `estimate_llm_call_tokens(prompt: str, response: str) -> int`: Estimate LLM call tokens + +**Usage**: +```python +from src.middleware.budget_tracker import BudgetTracker + +tracker = BudgetTracker() +budget = 
tracker.create_budget( + token_limit=100000, + time_limit_seconds=600, + iterations_limit=10 +) +tracker.start_timer() +# ... research operations ... +if not tracker.can_continue(): + # Budget exceeded, stop research + pass +``` + +## Models + +All middleware models are defined in `src/utils/models.py`: + +- `IterationData`: Data for a single iteration +- `Conversation`: Conversation history with iterations +- `ResearchLoop`: Research loop state and configuration +- `BudgetStatus`: Current budget status + +## Thread Safety + +All middleware components use `ContextVar` for thread-safe isolation: + +- Each request/thread has its own workflow state +- No global mutable state +- Safe for concurrent requests + +## See Also + +- [Orchestrators](orchestrators.md) - How middleware is used in orchestration +- [API Reference - Orchestrators](../api/orchestrators.md) - API documentation +- [Contributing - Code Style](../contributing/code-style.md) - Development guidelines + + + diff --git a/docs/architecture/orchestrators.md b/docs/architecture/orchestrators.md new file mode 100644 index 0000000000000000000000000000000000000000..cf227d0a46a483f5c9a972ca65c8d24753481adc --- /dev/null +++ b/docs/architecture/orchestrators.md @@ -0,0 +1,198 @@ +# Orchestrators Architecture + +DeepCritical supports multiple orchestration patterns for research workflows. + +## Research Flows + +### IterativeResearchFlow + +**File**: `src/orchestrator/research_flow.py` + +**Pattern**: Generate observations → Evaluate gaps → Select tools → Execute → Judge → Continue/Complete + +**Agents Used**: +- `KnowledgeGapAgent`: Evaluates research completeness +- `ToolSelectorAgent`: Selects tools for addressing gaps +- `ThinkingAgent`: Generates observations +- `WriterAgent`: Creates final report +- `JudgeHandler`: Assesses evidence sufficiency + +**Features**: +- Tracks iterations, time, budget +- Supports graph execution (`use_graph=True`) and agent chains (`use_graph=False`) +- Iterates until research complete or constraints met + +**Usage**: +```python +from src.orchestrator.research_flow import IterativeResearchFlow + +flow = IterativeResearchFlow( + search_handler=search_handler, + judge_handler=judge_handler, + use_graph=False +) + +async for event in flow.run(query): + # Handle events + pass +``` + +### DeepResearchFlow + +**File**: `src/orchestrator/research_flow.py` + +**Pattern**: Planner → Parallel iterative loops per section → Synthesizer + +**Agents Used**: +- `PlannerAgent`: Breaks query into report sections +- `IterativeResearchFlow`: Per-section research (parallel) +- `LongWriterAgent` or `ProofreaderAgent`: Final synthesis + +**Features**: +- Uses `WorkflowManager` for parallel execution +- Budget tracking per section and globally +- State synchronization across parallel loops +- Supports graph execution and agent chains + +**Usage**: +```python +from src.orchestrator.research_flow import DeepResearchFlow + +flow = DeepResearchFlow( + search_handler=search_handler, + judge_handler=judge_handler, + use_graph=True +) + +async for event in flow.run(query): + # Handle events + pass +``` + +## Graph Orchestrator + +**File**: `src/orchestrator/graph_orchestrator.py` + +**Purpose**: Graph-based execution using Pydantic AI agents as nodes + +**Features**: +- Uses Pydantic AI Graphs (when available) or agent chains (fallback) +- Routes based on research mode (iterative/deep/auto) +- Streams `AgentEvent` objects for UI + +**Node Types**: +- **Agent Nodes**: Execute Pydantic AI agents +- **State Nodes**: Update or read 
workflow state +- **Decision Nodes**: Make routing decisions +- **Parallel Nodes**: Execute multiple nodes concurrently + +**Edge Types**: +- **Sequential Edges**: Always traversed +- **Conditional Edges**: Traversed based on condition +- **Parallel Edges**: Used for parallel execution branches + +## Orchestrator Factory + +**File**: `src/orchestrator_factory.py` + +**Purpose**: Factory for creating orchestrators + +**Modes**: +- **Simple**: Legacy orchestrator (backward compatible) +- **Advanced**: Magentic orchestrator (requires OpenAI API key) +- **Auto-detect**: Chooses based on API key availability + +**Usage**: +```python +from src.orchestrator_factory import create_orchestrator + +orchestrator = create_orchestrator( + search_handler=search_handler, + judge_handler=judge_handler, + config={}, + mode="advanced" # or "simple" or None for auto-detect +) +``` + +## Magentic Orchestrator + +**File**: `src/orchestrator_magentic.py` + +**Purpose**: Multi-agent coordination using Microsoft Agent Framework + +**Features**: +- Uses `agent-framework-core` +- ChatAgent pattern with internal LLMs per agent +- `MagenticBuilder` with participants: searcher, hypothesizer, judge, reporter +- Manager orchestrates agents via `OpenAIChatClient` +- Requires OpenAI API key (function calling support) +- Event-driven: converts Magentic events to `AgentEvent` for UI streaming + +**Requirements**: +- `agent-framework-core` package +- OpenAI API key + +## Hierarchical Orchestrator + +**File**: `src/orchestrator_hierarchical.py` + +**Purpose**: Hierarchical orchestrator using middleware and sub-teams + +**Features**: +- Uses `SubIterationMiddleware` with `ResearchTeam` and `LLMSubIterationJudge` +- Adapts Magentic ChatAgent to `SubIterationTeam` protocol +- Event-driven via `asyncio.Queue` for coordination +- Supports sub-iteration patterns for complex research tasks + +## Legacy Simple Mode + +**File**: `src/legacy_orchestrator.py` + +**Purpose**: Linear search-judge-synthesize loop + +**Features**: +- Uses `SearchHandlerProtocol` and `JudgeHandlerProtocol` +- Generator-based design yielding `AgentEvent` objects +- Backward compatibility for simple use cases + +## State Initialization + +All orchestrators must initialize workflow state: + +```python +from src.middleware.state_machine import init_workflow_state +from src.services.embeddings import get_embedding_service + +embedding_service = get_embedding_service() +init_workflow_state(embedding_service) +``` + +## Event Streaming + +All orchestrators yield `AgentEvent` objects: + +**Event Types**: +- `started`: Research started +- `search_complete`: Search completed +- `judge_complete`: Evidence evaluation completed +- `hypothesizing`: Generating hypotheses +- `synthesizing`: Synthesizing results +- `complete`: Research completed +- `error`: Error occurred + +**Event Structure**: +```python +class AgentEvent: + type: str + iteration: int | None + data: dict[str, Any] +``` + +## See Also + +- [Graph Orchestration](graph-orchestration.md) - Graph-based execution details +- [Graph Orchestration (Detailed)](graph_orchestration.md) - Detailed graph architecture +- [Workflows](workflows.md) - Workflow diagrams and patterns +- [Workflow Diagrams](workflow-diagrams.md) - Detailed workflow diagrams +- [API Reference - Orchestrators](../api/orchestrators.md) - API documentation + diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md deleted file mode 100644 index 59467f2896848a2f8e5a9a503e5713bdd5e0d977..0000000000000000000000000000000000000000 --- 
a/docs/architecture/overview.md +++ /dev/null @@ -1,474 +0,0 @@ -# DeepCritical: Medical Drug Repurposing Research Agent -## Project Overview - ---- - -## Executive Summary - -**DeepCritical** is a deep research agent designed to accelerate medical drug repurposing research by autonomously searching, analyzing, and synthesizing evidence from multiple biomedical databases. - -### The Problem We Solve - -Drug repurposing - finding new therapeutic uses for existing FDA-approved drugs - can take years of manual literature review. Researchers must: -- Search thousands of papers across multiple databases -- Identify molecular mechanisms -- Find relevant clinical trials -- Assess safety profiles -- Synthesize evidence into actionable insights - -**DeepCritical automates this process from hours to minutes.** - -### What Is Drug Repurposing? - -**Simple Explanation:** -Using existing approved drugs to treat NEW diseases they weren't originally designed for. - -**Real Examples:** -- **Viagra** (sildenafil): Originally for heart disease → Now treats erectile dysfunction -- **Thalidomide**: Once banned → Now treats multiple myeloma -- **Aspirin**: Pain reliever → Heart attack prevention -- **Metformin**: Diabetes drug → Being tested for aging/longevity - -**Why It Matters:** -- Faster than developing new drugs (years vs decades) -- Cheaper (known safety profiles) -- Lower risk (already FDA approved) -- Immediate patient benefit potential - ---- - -## Core Use Case - -### Primary Query Type -> "What existing drugs might help treat [disease/condition]?" - -### Example Queries - -1. **Long COVID Fatigue** - - Query: "What existing drugs might help treat long COVID fatigue?" - - Agent searches: PubMed, clinical trials, drug databases - - Output: List of candidate drugs with mechanisms + evidence + citations - -2. **Alzheimer's Disease** - - Query: "Find existing drugs that target beta-amyloid pathways" - - Agent identifies: Disease mechanisms → Drug candidates → Clinical evidence - - Output: Comprehensive research report with drug candidates - -3. **Rare Disease Treatment** - - Query: "What drugs might help with fibrodysplasia ossificans progressiva?" - - Agent finds: Similar conditions → Shared pathways → Potential treatments - - Output: Evidence-based treatment suggestions - ---- - -## System Architecture - -### High-Level Design (Phases 1-8) - -```text -User Query - ↓ -Gradio UI (Phase 4) - ↓ -Magentic Manager (Phase 5) ← LLM-powered coordinator - ├── SearchAgent (Phase 2+5) ←→ PubMed + Web + VectorDB (Phase 6) - ├── HypothesisAgent (Phase 7) ←→ Mechanistic Reasoning - ├── JudgeAgent (Phase 3+5) ←→ Evidence Assessment - └── ReportAgent (Phase 8) ←→ Final Synthesis - ↓ -Structured Research Report -``` - -### Key Components - -1. **Magentic Manager (Orchestrator)** - - LLM-powered multi-agent coordinator - - Dynamic planning and agent selection - - Built-in stall detection and replanning - - Microsoft Agent Framework integration - -2. **SearchAgent (Phase 2+5+6)** - - PubMed E-utilities search - - DuckDuckGo web search - - Semantic search via ChromaDB (Phase 6) - - Evidence deduplication - -3. **HypothesisAgent (Phase 7)** - - Generates Drug → Target → Pathway → Effect hypotheses - - Guides targeted searches - - Scientific reasoning about mechanisms - -4. **JudgeAgent (Phase 3+5)** - - LLM-based evidence assessment - - Mechanism score + Clinical score - - Recommends continue/synthesize - - Generates refined search queries - -5. 
**ReportAgent (Phase 8)** - - Structured scientific reports - - Executive summary, methodology - - Hypotheses tested with evidence counts - - Proper citations and limitations - -6. **Gradio UI (Phase 4)** - - Chat interface for questions - - Real-time progress via events - - Mode toggle (Simple/Magentic) - - Formatted markdown output - ---- - -## Design Patterns - -### 1. Search-and-Judge Loop (Primary Pattern) - -```python -def research(question: str) -> Report: - context = [] - for iteration in range(max_iterations): - # SEARCH: Query relevant tools - results = search_tools(question, context) - context.extend(results) - - # JUDGE: Evaluate quality - if judge.is_sufficient(question, context): - break - - # REFINE: Adjust search strategy - query = refine_query(question, context) - - # SYNTHESIZE: Generate report - return synthesize_report(question, context) -``` - -**Why This Pattern:** -- Simple to implement and debug -- Clear loop termination conditions -- Iterative improvement of search quality -- Balances depth vs speed - -### 2. Multi-Tool Orchestration - -``` -Question → Agent decides which tools to use - ↓ - ┌───┴────┬─────────┬──────────┐ - ↓ ↓ ↓ ↓ - PubMed Web Search Trials DB Drug DB - ↓ ↓ ↓ ↓ - └───┬────┴─────────┴──────────┘ - ↓ - Aggregate Results → Judge -``` - -**Why This Pattern:** -- Different sources provide different evidence types -- Parallel tool execution (when possible) -- Comprehensive coverage - -### 3. LLM-as-Judge with Token Budget - -**Dual Stopping Conditions:** -- **Smart Stop**: LLM judge says "we have sufficient evidence" -- **Hard Stop**: Token budget exhausted OR max iterations reached - -**Why Both:** -- Judge enables early exit when answer is good -- Budget prevents runaway costs -- Iterations prevent infinite loops - -### 4. 
Stateful Checkpointing - -``` -.deepresearch/ -├── state/ -│ └── query_123.json # Current research state -├── checkpoints/ -│ └── query_123_iter3/ # Checkpoint at iteration 3 -└── workspace/ - └── query_123/ # Downloaded papers, data -``` - -**Why This Pattern:** -- Resume interrupted research -- Debugging and analysis -- Cost savings (don't re-search) - ---- - -## Component Breakdown - -### Agent (Orchestrator) -- **Responsibility**: Coordinate research process -- **Size**: ~100 lines -- **Key Methods**: - - `research(question)` - Main entry point - - `plan_search_strategy()` - Decide what to search - - `execute_search()` - Run tool queries - - `evaluate_progress()` - Call judge - - `synthesize_findings()` - Generate report - -### Tools -- **Responsibility**: Interface with external data sources -- **Size**: ~50 lines per tool -- **Implementations**: - - `PubMedTool` - Search biomedical literature - - `WebSearchTool` - General medical information - - `ClinicalTrialsTool` - Trial data (optional) - - `DrugInfoTool` - FDA drug database (optional) - -### Judge -- **Responsibility**: Evaluate evidence quality -- **Size**: ~50 lines -- **Key Methods**: - - `is_sufficient(question, evidence)` → bool - - `assess_quality(evidence)` → score - - `identify_gaps(question, evidence)` → missing_info - -### Gradio App -- **Responsibility**: User interface -- **Size**: ~50 lines -- **Features**: - - Text input for questions - - Progress indicators - - Formatted output with citations - - Download research report - ---- - -## Technical Stack - -### Core Dependencies -```toml -[dependencies] -python = ">=3.10" -pydantic = "^2.7" -pydantic-ai = "^0.0.16" -fastmcp = "^0.1.0" -gradio = "^5.0" -beautifulsoup4 = "^4.12" -httpx = "^0.27" -``` - -### Optional Enhancements -- `modal` - For GPU-accelerated local LLM -- `fastmcp` - MCP server integration -- `sentence-transformers` - Semantic search -- `faiss-cpu` - Vector similarity - -### Tool APIs & Rate Limits - -| API | Cost | Rate Limit | API Key? | Notes | -|-----|------|------------|----------|-------| -| **PubMed E-utilities** | Free | 3/sec (no key), 10/sec (with key) | Optional | Register at NCBI for higher limits | -| **Brave Search API** | Free tier | 2000/month free | Required | Primary web search | -| **DuckDuckGo** | Free | Unofficial, ~1/sec | No | Fallback web search | -| **ClinicalTrials.gov** | Free | 100/min | No | Stretch goal | -| **OpenFDA** | Free | 240/min (no key), 120K/day (with key) | Optional | Drug info | - -**Web Search Strategy (Priority Order):** -1. **Brave Search API** (free tier: 2000 queries/month) - Primary -2. **DuckDuckGo** (unofficial, no API key) - Fallback -3. 
**SerpAPI** ($50/month) - Only if free options fail - -**Why NOT SerpAPI first?** -- Costs money (hackathon budget = $0) -- Free alternatives work fine for demo -- Can upgrade later if needed - ---- - -## Success Criteria - -### Phase 1-5 (MVP) ✅ COMPLETE -**Completed in ONE DAY:** -- [x] User can ask drug repurposing question -- [x] Agent searches PubMed (async) -- [x] Agent searches web (DuckDuckGo) -- [x] LLM judge evaluates evidence quality -- [x] System respects token budget and iterations -- [x] Output includes drug candidates + citations -- [x] Works end-to-end for demo query -- [x] Gradio UI with streaming progress -- [x] Magentic multi-agent orchestration -- [x] 38 unit tests passing -- [x] CI/CD pipeline green - -### Hackathon Submission ✅ COMPLETE -- [x] Gradio UI deployed on HuggingFace Spaces -- [x] Example queries working and tested -- [x] Architecture documentation -- [x] README with setup instructions - -### Phase 6-8 (Enhanced) -**Specs ready for implementation:** -- [ ] Embeddings & Semantic Search (Phase 6) -- [ ] Hypothesis Agent (Phase 7) -- [ ] Report Agent (Phase 8) - -### What's EXPLICITLY Out of Scope -**NOT building (to stay focused):** -- ❌ User authentication -- ❌ Database storage of queries -- ❌ Multi-user support -- ❌ Payment/billing -- ❌ Production monitoring -- ❌ Mobile UI - ---- - -## Implementation Timeline - -### Day 1 (Today): Architecture & Setup -- [x] Define use case (drug repurposing) ✅ -- [x] Write architecture docs ✅ -- [ ] Create project structure -- [ ] First PR: Structure + Docs - -### Day 2: Core Agent Loop -- [ ] Implement basic orchestrator -- [ ] Add PubMed search tool -- [ ] Simple judge (keyword-based) -- [ ] Test with 1 query - -### Day 3: Intelligence Layer -- [ ] Upgrade to LLM judge -- [ ] Add web search tool -- [ ] Token budget tracking -- [ ] Test with multiple queries - -### Day 4: UI & Integration -- [ ] Build Gradio interface -- [ ] Wire up agent to UI -- [ ] Add progress indicators -- [ ] Format output nicely - -### Day 5: Polish & Extend -- [ ] Add more tools (clinical trials) -- [ ] Improve judge prompts -- [ ] Checkpoint system -- [ ] Error handling - -### Day 6: Deploy & Document -- [ ] Deploy to HuggingFace Spaces -- [ ] Record demo video -- [ ] Write submission materials -- [ ] Final testing - ---- - -## Questions This Document Answers - -### For The Maintainer - -**Q: "What should our design pattern be?"** -A: Search-and-judge loop with multi-tool orchestration (detailed in Design Patterns section) - -**Q: "Should we use LLM-as-judge or token budget?"** -A: Both - judge for smart stopping, budget for cost control - -**Q: "What's the break pattern?"** -A: Three conditions: judge approval, token limit, or max iterations (whichever comes first) - -**Q: "What components do we need?"** -A: Agent orchestrator, tools (PubMed/web), judge, Gradio UI (see Component Breakdown) - -### For The Team - -**Q: "What are we actually building?"** -A: Medical drug repurposing research agent (see Core Use Case) - -**Q: "How complex should it be?"** -A: Simple but complete - ~300 lines of core code (see Component sizes) - -**Q: "What's the timeline?"** -A: 6 days, MVP by Day 3, polish Days 4-6 (see Implementation Timeline) - -**Q: "What datasets/APIs do we use?"** -A: PubMed (free), web search, clinical trials.gov (see Tool APIs) - ---- - -## Next Steps - -1. **Review this document** - Team feedback on architecture -2. **Finalize design** - Incorporate feedback -3. **Create project structure** - Scaffold repository -4. 
**Move to proper docs** - `docs/architecture/` folder -5. **Open first PR** - Structure + Documentation -6. **Start implementation** - Day 2 onward - ---- - -## Notes & Decisions - -### Why Drug Repurposing? -- Clear, impressive use case -- Real-world medical impact -- Good data availability (PubMed, trials) -- Easy to explain (Viagra example!) -- Physician on team ✅ - -### Why Simple Architecture? -- 6-day timeline -- Need working end-to-end system -- Hackathon judges value "works" over "complex" -- Can extend later if successful - -### Why These Tools First? -- PubMed: Best biomedical literature source -- Web search: General medical knowledge -- Clinical trials: Evidence of actual testing -- Others: Nice-to-have, not critical for MVP - ---- - ---- - -## Appendix A: Demo Queries (Pre-tested) - -These queries will be used for demo and testing. They're chosen because: -1. They have good PubMed coverage -2. They're medically interesting -3. They show the system's capabilities - -### Primary Demo Query -``` -"What existing drugs might help treat long COVID fatigue?" -``` -**Expected candidates**: CoQ10, Low-dose Naltrexone, Modafinil -**Expected sources**: 20+ PubMed papers, 2-3 clinical trials - -### Secondary Demo Queries -``` -"Find existing drugs that might slow Alzheimer's progression" -"What approved medications could help with fibromyalgia pain?" -"Which diabetes drugs show promise for cancer treatment?" -``` - -### Why These Queries? -- Represent real clinical needs -- Have substantial literature -- Show diverse drug classes -- Physician on team can validate results - ---- - -## Appendix B: Risk Assessment - -| Risk | Likelihood | Impact | Mitigation | -|------|------------|--------|------------| -| PubMed rate limiting | Medium | High | Implement caching, respect 3/sec | -| Web search API fails | Low | Medium | DuckDuckGo fallback | -| LLM costs exceed budget | Medium | Medium | Hard token cap at 50K | -| Judge quality poor | Medium | High | Pre-test prompts, iterate | -| HuggingFace deploy issues | Low | High | Test deployment Day 4 | -| Demo crashes live | Medium | High | Pre-recorded backup video | - ---- - ---- - -**Document Status**: Official Architecture Spec -**Review Score**: 98/100 -**Last Updated**: November 2025 diff --git a/docs/architecture/services.md b/docs/architecture/services.md new file mode 100644 index 0000000000000000000000000000000000000000..544c9a2f291a0cd18355edeba0124e35feb931fe --- /dev/null +++ b/docs/architecture/services.md @@ -0,0 +1,131 @@ +# Services Architecture + +DeepCritical provides several services for embeddings, RAG, and statistical analysis. 
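+
+Taken together, the services are obtained through module-level accessors and composed by the orchestration layer. A minimal sketch, assuming the accessors and method signatures described in the sections below:
+
+```python
+from src.services.embeddings import get_embedding_service
+from src.services.rag import get_rag_service
+from src.services.statistical_analyzer import StatisticalAnalyzer
+
+async def assess(hypothesis: str, evidence: list) -> dict:
+    embeddings = get_embedding_service()  # local sentence-transformers, no API key
+    rag = get_rag_service()               # may be None when no OpenAI key is configured
+    analyzer = StatisticalAnalyzer()      # Modal-sandboxed statistical analysis
+
+    if rag is not None:
+        await rag.ingest_evidence(evidence)
+
+    return {
+        "embedding": await embeddings.embed(hypothesis),
+        "related": await rag.retrieve(hypothesis, top_k=5) if rag else [],
+        "analysis": await analyzer.analyze(hypothesis=hypothesis, evidence=evidence),
+    }
+```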
+ +## Embedding Service + +**File**: `src/services/embeddings.py` + +**Purpose**: Local sentence-transformers for semantic search and deduplication + +**Features**: +- **No API Key Required**: Uses local sentence-transformers models +- **Async-Safe**: All operations use `run_in_executor()` to avoid blocking +- **ChromaDB Storage**: Vector storage for embeddings +- **Deduplication**: 0.85 similarity threshold (85% similarity = duplicate) + +**Model**: Configurable via `settings.local_embedding_model` (default: `all-MiniLM-L6-v2`) + +**Methods**: +- `async def embed(text: str) -> list[float]`: Generate embeddings +- `async def embed_batch(texts: list[str]) -> list[list[float]]`: Batch embedding +- `async def similarity(text1: str, text2: str) -> float`: Calculate similarity +- `async def find_duplicates(texts: list[str], threshold: float = 0.85) -> list[tuple[int, int]]`: Find duplicates + +**Usage**: +```python +from src.services.embeddings import get_embedding_service + +service = get_embedding_service() +embedding = await service.embed("text to embed") +``` + +## LlamaIndex RAG Service + +**File**: `src/services/rag.py` + +**Purpose**: Retrieval-Augmented Generation using LlamaIndex + +**Features**: +- **OpenAI Embeddings**: Requires `OPENAI_API_KEY` +- **ChromaDB Storage**: Vector database for document storage +- **Metadata Preservation**: Preserves source, title, URL, date, authors +- **Lazy Initialization**: Graceful fallback if OpenAI key not available + +**Methods**: +- `async def ingest_evidence(evidence: list[Evidence]) -> None`: Ingest evidence into RAG +- `async def retrieve(query: str, top_k: int = 5) -> list[Document]`: Retrieve relevant documents +- `async def query(query: str, top_k: int = 5) -> str`: Query with RAG + +**Usage**: +```python +from src.services.rag import get_rag_service + +service = get_rag_service() +if service: + documents = await service.retrieve("query", top_k=5) +``` + +## Statistical Analyzer + +**File**: `src/services/statistical_analyzer.py` + +**Purpose**: Secure execution of AI-generated statistical code + +**Features**: +- **Modal Sandbox**: Secure, isolated execution environment +- **Code Generation**: Generates Python code via LLM +- **Library Pinning**: Version-pinned libraries in `SANDBOX_LIBRARIES` +- **Network Isolation**: `block_network=True` by default + +**Libraries Available**: +- pandas, numpy, scipy +- matplotlib, scikit-learn +- statsmodels + +**Output**: `AnalysisResult` with: +- `verdict`: SUPPORTED, REFUTED, or INCONCLUSIVE +- `code`: Generated analysis code +- `output`: Execution output +- `error`: Error message if execution failed + +**Usage**: +```python +from src.services.statistical_analyzer import StatisticalAnalyzer + +analyzer = StatisticalAnalyzer() +result = await analyzer.analyze( + hypothesis="Metformin reduces cancer risk", + evidence=evidence_list +) +``` + +## Singleton Pattern + +All services use the singleton pattern with `@lru_cache(maxsize=1)`: + +```python +@lru_cache(maxsize=1) +def get_embedding_service() -> EmbeddingService: + return EmbeddingService() +``` + +This ensures: +- Single instance per process +- Lazy initialization +- No dependencies required at import time + +## Service Availability + +Services check availability before use: + +```python +from src.utils.config import settings + +if settings.modal_available: + # Use Modal sandbox + pass + +if settings.has_openai_key: + # Use OpenAI embeddings for RAG + pass +``` + +## See Also + +- [Tools](tools.md) - How services are used by search tools +- [API 
Reference - Services](../api/services.md) - API documentation +- [Configuration](../configuration/index.md) - Service configuration + + + diff --git a/docs/architecture/tools.md b/docs/architecture/tools.md new file mode 100644 index 0000000000000000000000000000000000000000..1c76a681930e66ea59538945a0d975d050d6c658 --- /dev/null +++ b/docs/architecture/tools.md @@ -0,0 +1,164 @@ +# Tools Architecture + +DeepCritical implements a protocol-based search tool system for retrieving evidence from multiple sources. + +## SearchTool Protocol + +All tools implement the `SearchTool` protocol from `src/tools/base.py`: + +```python +class SearchTool(Protocol): + @property + def name(self) -> str: ... + + async def search( + self, + query: str, + max_results: int = 10 + ) -> list[Evidence]: ... +``` + +## Rate Limiting + +All tools use the `@retry` decorator from tenacity: + +```python +@retry( + stop=stop_after_attempt(3), + wait=wait_exponential(...) +) +async def search(self, query: str, max_results: int = 10) -> list[Evidence]: + # Implementation +``` + +Tools with API rate limits implement `_rate_limit()` method and use shared rate limiters from `src/tools/rate_limiter.py`. + +## Error Handling + +Tools raise custom exceptions: + +- `SearchError`: General search failures +- `RateLimitError`: Rate limit exceeded + +Tools handle HTTP errors (429, 500, timeout) and return empty lists on non-critical errors (with warning logs). + +## Query Preprocessing + +Tools use `preprocess_query()` from `src/tools/query_utils.py` to: + +- Remove noise from queries +- Expand synonyms +- Normalize query format + +## Evidence Conversion + +All tools convert API responses to `Evidence` objects with: + +- `Citation`: Title, URL, date, authors +- `content`: Evidence text +- `relevance_score`: 0.0-1.0 relevance score +- `metadata`: Additional metadata + +Missing fields are handled gracefully with defaults. + +## Tool Implementations + +### PubMed Tool + +**File**: `src/tools/pubmed.py` + +**API**: NCBI E-utilities (ESearch → EFetch) + +**Rate Limiting**: +- 0.34s between requests (3 req/sec without API key) +- 0.1s between requests (10 req/sec with NCBI API key) + +**Features**: +- XML parsing with `xmltodict` +- Handles single vs. multiple articles +- Query preprocessing +- Evidence conversion with metadata extraction + +### ClinicalTrials Tool + +**File**: `src/tools/clinicaltrials.py` + +**API**: ClinicalTrials.gov API v2 + +**Important**: Uses `requests` library (NOT httpx) because WAF blocks httpx TLS fingerprint. 
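+
+A minimal sketch of this pattern, dispatching the blocking `requests` call to a worker thread so the event loop is not stalled (the endpoint and query parameters below are illustrative, not the tool's exact request):
+
+```python
+import asyncio
+import requests
+
+async def fetch_studies(query: str, max_results: int = 10) -> dict:
+    # requests.get blocks, so run it in a thread via asyncio.to_thread
+    response = await asyncio.to_thread(
+        requests.get,
+        "https://clinicaltrials.gov/api/v2/studies",
+        params={"query.term": query, "pageSize": max_results},
+        timeout=30,
+    )
+    response.raise_for_status()
+    return response.json()
+```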
+ +**Execution**: Runs in thread pool: `await asyncio.to_thread(requests.get, ...)` + +**Filtering**: +- Only interventional studies +- Status: `COMPLETED`, `ACTIVE_NOT_RECRUITING`, `RECRUITING`, `ENROLLING_BY_INVITATION` + +**Features**: +- Parses nested JSON structure +- Extracts trial metadata +- Evidence conversion + +### Europe PMC Tool + +**File**: `src/tools/europepmc.py` + +**API**: Europe PMC REST API + +**Features**: +- Handles preprint markers: `[PREPRINT - Not peer-reviewed]` +- Builds URLs from DOI or PMID +- Checks `pubTypeList` for preprint detection +- Includes both preprints and peer-reviewed articles + +### RAG Tool + +**File**: `src/tools/rag_tool.py` + +**Purpose**: Semantic search within collected evidence + +**Implementation**: Wraps `LlamaIndexRAGService` + +**Features**: +- Returns Evidence from RAG results +- Handles evidence ingestion +- Semantic similarity search +- Metadata preservation + +### Search Handler + +**File**: `src/tools/search_handler.py` + +**Purpose**: Orchestrates parallel searches across multiple tools + +**Features**: +- Uses `asyncio.gather()` with `return_exceptions=True` +- Aggregates results into `SearchResult` +- Handles tool failures gracefully +- Deduplicates results by URL + +## Tool Registration + +Tools are registered in the search handler: + +```python +from src.tools.pubmed import PubMedTool +from src.tools.clinicaltrials import ClinicalTrialsTool +from src.tools.europepmc import EuropePMCTool + +search_handler = SearchHandler( + tools=[ + PubMedTool(), + ClinicalTrialsTool(), + EuropePMCTool(), + ] +) +``` + +## See Also + +- [Services](services.md) - RAG and embedding services +- [API Reference - Tools](../api/tools.md) - API documentation +- [Contributing - Implementation Patterns](../contributing/implementation-patterns.md) - Development guidelines + + + diff --git a/docs/architecture/workflow-diagrams.md b/docs/architecture/workflow-diagrams.md new file mode 100644 index 0000000000000000000000000000000000000000..c0f86c232be13ea07438eabdca7cbad803a1c1ac --- /dev/null +++ b/docs/architecture/workflow-diagrams.md @@ -0,0 +1,670 @@ +# DeepCritical Workflow - Simplified Magentic Architecture + +> **Architecture Pattern**: Microsoft Magentic Orchestration +> **Design Philosophy**: Simple, dynamic, manager-driven coordination +> **Key Innovation**: Intelligent manager replaces rigid sequential phases + +--- + +## 1. High-Level Magentic Workflow + +```mermaid +flowchart TD + Start([User Query]) --> Manager[Magentic Manager
Plan • Select • Assess • Adapt] + + Manager -->|Plans| Task1[Task Decomposition] + Task1 --> Manager + + Manager -->|Selects & Executes| HypAgent[Hypothesis Agent] + Manager -->|Selects & Executes| SearchAgent[Search Agent] + Manager -->|Selects & Executes| AnalysisAgent[Analysis Agent] + Manager -->|Selects & Executes| ReportAgent[Report Agent] + + HypAgent -->|Results| Manager + SearchAgent -->|Results| Manager + AnalysisAgent -->|Results| Manager + ReportAgent -->|Results| Manager + + Manager -->|Assesses Quality| Decision{Good Enough?} + Decision -->|No - Refine| Manager + Decision -->|No - Different Agent| Manager + Decision -->|No - Stalled| Replan[Reset Plan] + Replan --> Manager + + Decision -->|Yes| Synthesis[Synthesize Final Result] + Synthesis --> Output([Research Report]) + + style Start fill:#e1f5e1 + style Manager fill:#ffe6e6 + style HypAgent fill:#fff4e6 + style SearchAgent fill:#fff4e6 + style AnalysisAgent fill:#fff4e6 + style ReportAgent fill:#fff4e6 + style Decision fill:#ffd6d6 + style Synthesis fill:#d4edda + style Output fill:#e1f5e1 +``` + +## 2. Magentic Manager: The 6-Phase Cycle + +```mermaid +flowchart LR + P1[1. Planning
Analyze task
Create strategy] --> P2[2. Agent Selection
Pick best agent
for subtask] + P2 --> P3[3. Execution
Run selected
agent with tools] + P3 --> P4[4. Assessment
Evaluate quality
Check progress] + P4 --> Decision{Quality OK?
Progress made?} + Decision -->|Yes| P6[6. Synthesis
Combine results
Generate report] + Decision -->|No| P5[5. Iteration
Adjust plan
Try again] + P5 --> P2 + P6 --> Done([Complete]) + + style P1 fill:#fff4e6 + style P2 fill:#ffe6e6 + style P3 fill:#e6f3ff + style P4 fill:#ffd6d6 + style P5 fill:#fff3cd + style P6 fill:#d4edda + style Done fill:#e1f5e1 +``` + +## 3. Simplified Agent Architecture + +```mermaid +graph TB + subgraph "Orchestration Layer" + Manager[Magentic Manager
• Plans workflow
• Selects agents
• Assesses quality
• Adapts strategy] + SharedContext[(Shared Context
• Hypotheses
• Search Results
• Analysis
• Progress)] + Manager <--> SharedContext + end + + subgraph "Specialist Agents" + HypAgent[Hypothesis Agent
• Domain understanding
• Hypothesis generation
• Testability refinement] + SearchAgent[Search Agent
• Multi-source search
• RAG retrieval
• Result ranking] + AnalysisAgent[Analysis Agent
• Evidence extraction
• Statistical analysis
• Code execution] + ReportAgent[Report Agent
• Report assembly
• Visualization
• Citation formatting] + end + + subgraph "MCP Tools" + WebSearch[Web Search
PubMed • arXiv • bioRxiv] + CodeExec[Code Execution
Sandboxed Python] + RAG[RAG Retrieval
Vector DB • Embeddings] + Viz[Visualization
Charts • Graphs] + end + + Manager -->|Selects & Directs| HypAgent + Manager -->|Selects & Directs| SearchAgent + Manager -->|Selects & Directs| AnalysisAgent + Manager -->|Selects & Directs| ReportAgent + + HypAgent --> SharedContext + SearchAgent --> SharedContext + AnalysisAgent --> SharedContext + ReportAgent --> SharedContext + + SearchAgent --> WebSearch + SearchAgent --> RAG + AnalysisAgent --> CodeExec + ReportAgent --> CodeExec + ReportAgent --> Viz + + style Manager fill:#ffe6e6 + style SharedContext fill:#ffe6f0 + style HypAgent fill:#fff4e6 + style SearchAgent fill:#fff4e6 + style AnalysisAgent fill:#fff4e6 + style ReportAgent fill:#fff4e6 + style WebSearch fill:#e6f3ff + style CodeExec fill:#e6f3ff + style RAG fill:#e6f3ff + style Viz fill:#e6f3ff +``` + +## 4. Dynamic Workflow Example + +```mermaid +sequenceDiagram + participant User + participant Manager + participant HypAgent + participant SearchAgent + participant AnalysisAgent + participant ReportAgent + + User->>Manager: "Research protein folding in Alzheimer's" + + Note over Manager: PLAN: Generate hypotheses → Search → Analyze → Report + + Manager->>HypAgent: Generate 3 hypotheses + HypAgent-->>Manager: Returns 3 hypotheses + Note over Manager: ASSESS: Good quality, proceed + + Manager->>SearchAgent: Search literature for hypothesis 1 + SearchAgent-->>Manager: Returns 15 papers + Note over Manager: ASSESS: Good results, continue + + Manager->>SearchAgent: Search for hypothesis 2 + SearchAgent-->>Manager: Only 2 papers found + Note over Manager: ASSESS: Insufficient, refine search + + Manager->>SearchAgent: Refined query for hypothesis 2 + SearchAgent-->>Manager: Returns 12 papers + Note over Manager: ASSESS: Better, proceed + + Manager->>AnalysisAgent: Analyze evidence for all hypotheses + AnalysisAgent-->>Manager: Returns analysis with code + Note over Manager: ASSESS: Complete, generate report + + Manager->>ReportAgent: Create comprehensive report + ReportAgent-->>Manager: Returns formatted report + Note over Manager: SYNTHESIZE: Combine all results + + Manager->>User: Final Research Report +``` + +## 5. Manager Decision Logic + +```mermaid +flowchart TD + Start([Manager Receives Task]) --> Plan[Create Initial Plan] + + Plan --> Select[Select Agent for Next Subtask] + Select --> Execute[Execute Agent] + Execute --> Collect[Collect Results] + + Collect --> Assess[Assess Quality & Progress] + + Assess --> Q1{Quality Sufficient?} + Q1 -->|No| Q2{Same Agent Can Fix?} + Q2 -->|Yes| Feedback[Provide Specific Feedback] + Feedback --> Execute + Q2 -->|No| Different[Try Different Agent] + Different --> Select + + Q1 -->|Yes| Q3{Task Complete?} + Q3 -->|No| Q4{Making Progress?} + Q4 -->|Yes| Select + Q4 -->|No - Stalled| Replan[Reset Plan & Approach] + Replan --> Plan + + Q3 -->|Yes| Synth[Synthesize Final Result] + Synth --> Done([Return Report]) + + style Start fill:#e1f5e1 + style Plan fill:#fff4e6 + style Select fill:#ffe6e6 + style Execute fill:#e6f3ff + style Assess fill:#ffd6d6 + style Q1 fill:#ffe6e6 + style Q2 fill:#ffe6e6 + style Q3 fill:#ffe6e6 + style Q4 fill:#ffe6e6 + style Synth fill:#d4edda + style Done fill:#e1f5e1 +``` + +## 6. Hypothesis Agent Workflow + +```mermaid +flowchart LR + Input[Research Query] --> Domain[Identify Domain
& Key Concepts] + Domain --> Context[Retrieve Background
Knowledge] + Context --> Generate[Generate 3-5
Initial Hypotheses] + Generate --> Refine[Refine for
Testability] + Refine --> Rank[Rank by
Quality Score] + Rank --> Output[Return Top
Hypotheses] + + Output --> Struct[Hypothesis Structure:
• Statement
• Rationale
• Testability Score
• Data Requirements
• Expected Outcomes] + + style Input fill:#e1f5e1 + style Output fill:#fff4e6 + style Struct fill:#e6f3ff +``` + +## 7. Search Agent Workflow + +```mermaid +flowchart TD + Input[Hypotheses] --> Strategy[Formulate Search
Strategy per Hypothesis] + + Strategy --> Multi[Multi-Source Search] + + Multi --> PubMed[PubMed Search
via MCP] + Multi --> ArXiv[arXiv Search
via MCP] + Multi --> BioRxiv[bioRxiv Search
via MCP] + + PubMed --> Aggregate[Aggregate Results] + ArXiv --> Aggregate + BioRxiv --> Aggregate + + Aggregate --> Filter[Filter & Rank
by Relevance] + Filter --> Dedup[Deduplicate
Cross-Reference] + Dedup --> Embed[Embed Documents
via MCP] + Embed --> Vector[(Vector DB)] + Vector --> RAGRetrieval[RAG Retrieval
Top-K per Hypothesis] + RAGRetrieval --> Output[Return Contextualized
Search Results] + + style Input fill:#fff4e6 + style Multi fill:#ffe6e6 + style Vector fill:#ffe6f0 + style Output fill:#e6f3ff +``` + +## 8. Analysis Agent Workflow + +```mermaid +flowchart TD + Input1[Hypotheses] --> Extract + Input2[Search Results] --> Extract[Extract Evidence
per Hypothesis] + + Extract --> Methods[Determine Analysis
Methods Needed] + + Methods --> Branch{Requires
Computation?} + Branch -->|Yes| GenCode[Generate Python
Analysis Code] + Branch -->|No| Qual[Qualitative
Synthesis] + + GenCode --> Execute[Execute Code
via MCP Sandbox] + Execute --> Interpret1[Interpret
Results] + Qual --> Interpret2[Interpret
Findings] + + Interpret1 --> Synthesize[Synthesize Evidence
Across Sources] + Interpret2 --> Synthesize + + Synthesize --> Verdict[Determine Verdict
per Hypothesis] + Verdict --> Support[• Supported
• Refuted
• Inconclusive] + Support --> Gaps[Identify Knowledge
Gaps & Limitations] + Gaps --> Output[Return Analysis
Report] + + style Input1 fill:#fff4e6 + style Input2 fill:#e6f3ff + style Execute fill:#ffe6e6 + style Output fill:#e6ffe6 +``` + +## 9. Report Agent Workflow + +```mermaid +flowchart TD + Input1[Query] --> Assemble + Input2[Hypotheses] --> Assemble + Input3[Search Results] --> Assemble + Input4[Analysis] --> Assemble[Assemble Report
Sections] + + Assemble --> Exec[Executive Summary] + Assemble --> Intro[Introduction] + Assemble --> Methods[Methods] + Assemble --> Results[Results per
Hypothesis] + Assemble --> Discussion[Discussion] + Assemble --> Future[Future Directions] + Assemble --> Refs[References] + + Results --> VizCheck{Needs
Visualization?} + VizCheck -->|Yes| GenViz[Generate Viz Code] + GenViz --> ExecViz[Execute via MCP
Create Charts] + ExecViz --> Combine + VizCheck -->|No| Combine[Combine All
Sections] + + Exec --> Combine + Intro --> Combine + Methods --> Combine + Discussion --> Combine + Future --> Combine + Refs --> Combine + + Combine --> Format[Format Output] + Format --> MD[Markdown] + Format --> PDF[PDF] + Format --> JSON[JSON] + + MD --> Output[Return Final
Report] + PDF --> Output + JSON --> Output + + style Input1 fill:#e1f5e1 + style Input2 fill:#fff4e6 + style Input3 fill:#e6f3ff + style Input4 fill:#e6ffe6 + style Output fill:#d4edda +``` + +## 10. Data Flow & Event Streaming + +```mermaid +flowchart TD + User[👤 User] -->|Research Query| UI[Gradio UI] + UI -->|Submit| Manager[Magentic Manager] + + Manager -->|Event: Planning| UI + Manager -->|Select Agent| HypAgent[Hypothesis Agent] + HypAgent -->|Event: Delta/Message| UI + HypAgent -->|Hypotheses| Context[(Shared Context)] + + Context -->|Retrieved by| Manager + Manager -->|Select Agent| SearchAgent[Search Agent] + SearchAgent -->|MCP Request| WebSearch[Web Search Tool] + WebSearch -->|Results| SearchAgent + SearchAgent -->|Event: Delta/Message| UI + SearchAgent -->|Documents| Context + SearchAgent -->|Embeddings| VectorDB[(Vector DB)] + + Context -->|Retrieved by| Manager + Manager -->|Select Agent| AnalysisAgent[Analysis Agent] + AnalysisAgent -->|MCP Request| CodeExec[Code Execution Tool] + CodeExec -->|Results| AnalysisAgent + AnalysisAgent -->|Event: Delta/Message| UI + AnalysisAgent -->|Analysis| Context + + Context -->|Retrieved by| Manager + Manager -->|Select Agent| ReportAgent[Report Agent] + ReportAgent -->|MCP Request| CodeExec + ReportAgent -->|Event: Delta/Message| UI + ReportAgent -->|Report| Context + + Manager -->|Event: Final Result| UI + UI -->|Display| User + + style User fill:#e1f5e1 + style UI fill:#e6f3ff + style Manager fill:#ffe6e6 + style Context fill:#ffe6f0 + style VectorDB fill:#ffe6f0 + style WebSearch fill:#f0f0f0 + style CodeExec fill:#f0f0f0 +``` + +## 11. MCP Tool Architecture + +```mermaid +graph TB + subgraph "Agent Layer" + Manager[Magentic Manager] + HypAgent[Hypothesis Agent] + SearchAgent[Search Agent] + AnalysisAgent[Analysis Agent] + ReportAgent[Report Agent] + end + + subgraph "MCP Protocol Layer" + Registry[MCP Tool Registry
• Discovers tools
• Routes requests
• Manages connections] + end + + subgraph "MCP Servers" + Server1[Web Search Server
localhost:8001
• PubMed
• arXiv
• bioRxiv] + Server2[Code Execution Server
localhost:8002
• Sandboxed Python
• Package management] + Server3[RAG Server
localhost:8003
• Vector embeddings
• Similarity search] + Server4[Visualization Server
localhost:8004
• Chart generation
• Plot rendering] + end + + subgraph "External Services" + PubMed[PubMed API] + ArXiv[arXiv API] + BioRxiv[bioRxiv API] + Modal[Modal Sandbox] + ChromaDB[(ChromaDB)] + end + + SearchAgent -->|Request| Registry + AnalysisAgent -->|Request| Registry + ReportAgent -->|Request| Registry + + Registry --> Server1 + Registry --> Server2 + Registry --> Server3 + Registry --> Server4 + + Server1 --> PubMed + Server1 --> ArXiv + Server1 --> BioRxiv + Server2 --> Modal + Server3 --> ChromaDB + + style Manager fill:#ffe6e6 + style Registry fill:#fff4e6 + style Server1 fill:#e6f3ff + style Server2 fill:#e6f3ff + style Server3 fill:#e6f3ff + style Server4 fill:#e6f3ff +``` + +## 12. Progress Tracking & Stall Detection + +```mermaid +stateDiagram-v2 + [*] --> Initialization: User Query + + Initialization --> Planning: Manager starts + + Planning --> AgentExecution: Select agent + + AgentExecution --> Assessment: Collect results + + Assessment --> QualityCheck: Evaluate output + + QualityCheck --> AgentExecution: Poor quality
(retry < max_rounds) + QualityCheck --> Planning: Poor quality
(try different agent) + QualityCheck --> NextAgent: Good quality
(task incomplete) + QualityCheck --> Synthesis: Good quality
(task complete) + + NextAgent --> AgentExecution: Select next agent + + state StallDetection <<choice>> + Assessment --> StallDetection: Check progress + StallDetection --> Planning: No progress
(stall count < max) + StallDetection --> ErrorRecovery: No progress
(max stalls reached) + + ErrorRecovery --> PartialReport: Generate partial results + PartialReport --> [*] + + Synthesis --> FinalReport: Combine all outputs + FinalReport --> [*] + + note right of QualityCheck + Manager assesses: + • Output completeness + • Quality metrics + • Progress made + end note + + note right of StallDetection + Stall = no new progress + after agent execution + Triggers plan reset + end note +``` + +## 13. Gradio UI Integration + +```mermaid +graph TD + App[Gradio App
DeepCritical Research Agent] + + App --> Input[Input Section] + App --> Status[Status Section] + App --> Output[Output Section] + + Input --> Query[Research Question
Text Area] + Input --> Controls[Controls] + Controls --> MaxHyp[Max Hypotheses: 1-10] + Controls --> MaxRounds[Max Rounds: 5-20] + Controls --> Submit[Start Research Button] + + Status --> Log[Real-time Event Log
• Manager planning
• Agent selection
• Execution updates
• Quality assessment] + Status --> Progress[Progress Tracker
• Current agent
• Round count
• Stall count] + + Output --> Tabs[Tabbed Results] + Tabs --> Tab1[Hypotheses Tab
Generated hypotheses with scores] + Tabs --> Tab2[Search Results Tab
Papers & sources found] + Tabs --> Tab3[Analysis Tab
Evidence & verdicts] + Tabs --> Tab4[Report Tab
Final research report] + Tab4 --> Download[Download Report
MD / PDF / JSON] + + Submit -.->|Triggers| Workflow[Magentic Workflow] + Workflow -.->|MagenticOrchestratorMessageEvent| Log + Workflow -.->|MagenticAgentDeltaEvent| Log + Workflow -.->|MagenticAgentMessageEvent| Log + Workflow -.->|MagenticFinalResultEvent| Tab4 + + style App fill:#e1f5e1 + style Input fill:#fff4e6 + style Status fill:#e6f3ff + style Output fill:#e6ffe6 + style Workflow fill:#ffe6e6 +``` + +## 14. Complete System Context + +```mermaid +graph LR + User[👤 Researcher
Asks research questions] -->|Submits query| DC[DeepCritical
Magentic Workflow] + + DC -->|Literature search| PubMed[PubMed API
Medical papers] + DC -->|Preprint search| ArXiv[arXiv API
Scientific preprints] + DC -->|Biology search| BioRxiv[bioRxiv API
Biology preprints] + DC -->|Agent reasoning| Claude[Claude API
Sonnet 4 / Opus] + DC -->|Code execution| Modal[Modal Sandbox
Safe Python env] + DC -->|Vector storage| Chroma[ChromaDB
Embeddings & RAG] + + DC -->|Deployed on| HF[HuggingFace Spaces
Gradio 6.0] + + PubMed -->|Results| DC + ArXiv -->|Results| DC + BioRxiv -->|Results| DC + Claude -->|Responses| DC + Modal -->|Output| DC + Chroma -->|Context| DC + + DC -->|Research report| User + + style User fill:#e1f5e1 + style DC fill:#ffe6e6 + style PubMed fill:#e6f3ff + style ArXiv fill:#e6f3ff + style BioRxiv fill:#e6f3ff + style Claude fill:#ffd6d6 + style Modal fill:#f0f0f0 + style Chroma fill:#ffe6f0 + style HF fill:#d4edda +``` + +## 15. Workflow Timeline (Simplified) + +```mermaid +gantt + title DeepCritical Magentic Workflow - Typical Execution + dateFormat mm:ss + axisFormat %M:%S + + section Manager Planning + Initial planning :p1, 00:00, 10s + + section Hypothesis Agent + Generate hypotheses :h1, after p1, 30s + Manager assessment :h2, after h1, 5s + + section Search Agent + Search hypothesis 1 :s1, after h2, 20s + Search hypothesis 2 :s2, after s1, 20s + Search hypothesis 3 :s3, after s2, 20s + RAG processing :s4, after s3, 15s + Manager assessment :s5, after s4, 5s + + section Analysis Agent + Evidence extraction :a1, after s5, 15s + Code generation :a2, after a1, 20s + Code execution :a3, after a2, 25s + Synthesis :a4, after a3, 20s + Manager assessment :a5, after a4, 5s + + section Report Agent + Report assembly :r1, after a5, 30s + Visualization :r2, after r1, 15s + Formatting :r3, after r2, 10s + + section Manager Synthesis + Final synthesis :f1, after r3, 10s +``` + +--- + +## Key Differences from Original Design + +| Aspect | Original (Judge-in-Loop) | New (Magentic) | +|--------|-------------------------|----------------| +| **Control Flow** | Fixed sequential phases | Dynamic agent selection | +| **Quality Control** | Separate Judge Agent | Manager assessment built-in | +| **Retry Logic** | Phase-level with feedback | Agent-level with adaptation | +| **Flexibility** | Rigid 4-phase pipeline | Adaptive workflow | +| **Complexity** | 5 agents (including Judge) | 4 agents (no Judge) | +| **Progress Tracking** | Manual state management | Built-in round/stall detection | +| **Agent Coordination** | Sequential handoff | Manager-driven dynamic selection | +| **Error Recovery** | Retry same phase | Try different agent or replan | + +--- + +## Simplified Design Principles + +1. **Manager is Intelligent**: LLM-powered manager handles planning, selection, and quality assessment +2. **No Separate Judge**: Manager's assessment phase replaces dedicated Judge Agent +3. **Dynamic Workflow**: Agents can be called multiple times in any order based on need +4. **Built-in Safety**: max_round_count (15) and max_stall_count (3) prevent infinite loops +5. **Event-Driven UI**: Real-time streaming updates to Gradio interface +6. **MCP-Powered Tools**: All external capabilities via Model Context Protocol +7. **Shared Context**: Centralized state accessible to all agents +8. 
**Progress Awareness**: Manager tracks what's been done and what's needed + +--- + +## Legend + +- 🔴 **Red/Pink**: Manager, orchestration, decision-making +- 🟡 **Yellow/Orange**: Specialist agents, processing +- 🔵 **Blue**: Data, tools, MCP services +- 🟣 **Purple/Pink**: Storage, databases, state +- 🟢 **Green**: User interactions, final outputs +- ⚪ **Gray**: External services, APIs + +--- + +## Implementation Highlights + +**Simple 4-Agent Setup:** +```python +workflow = ( + MagenticBuilder() + .participants( + hypothesis=HypothesisAgent(tools=[background_tool]), + search=SearchAgent(tools=[web_search, rag_tool]), + analysis=AnalysisAgent(tools=[code_execution]), + report=ReportAgent(tools=[code_execution, visualization]) + ) + .with_standard_manager( + chat_client=AnthropicClient(model="claude-sonnet-4"), + max_round_count=15, # Prevent infinite loops + max_stall_count=3 # Detect stuck workflows + ) + .build() +) +``` + +**Manager handles quality assessment in its instructions:** +- Checks hypothesis quality (testable, novel, clear) +- Validates search results (relevant, authoritative, recent) +- Assesses analysis soundness (methodology, evidence, conclusions) +- Ensures report completeness (all sections, proper citations) + +No separate Judge Agent needed - manager does it all! + +--- + +**Document Version**: 2.0 (Magentic Simplified) +**Last Updated**: 2025-11-24 +**Architecture**: Microsoft Magentic Orchestration Pattern +**Agents**: 4 (Hypothesis, Search, Analysis, Report) + 1 Manager +**License**: MIT + +## See Also + +- [Orchestrators](orchestrators.md) - Overview of all orchestrator patterns +- [Graph Orchestration](graph-orchestration.md) - Graph-based execution overview +- [Graph Orchestration (Detailed)](graph_orchestration.md) - Detailed graph architecture +- [Workflows](workflows.md) - Workflow patterns summary +- [API Reference - Orchestrators](../api/orchestrators.md) - API documentation \ No newline at end of file diff --git a/docs/workflow-diagrams.md b/docs/architecture/workflows.md similarity index 100% rename from docs/workflow-diagrams.md rename to docs/architecture/workflows.md diff --git a/docs/brainstorming/00_ROADMAP_SUMMARY.md b/docs/brainstorming/00_ROADMAP_SUMMARY.md deleted file mode 100644 index a67ae6741e446c774485534d2d6a2278d9b44686..0000000000000000000000000000000000000000 --- a/docs/brainstorming/00_ROADMAP_SUMMARY.md +++ /dev/null @@ -1,194 +0,0 @@ -# DeepCritical Data Sources: Roadmap Summary - -**Created**: 2024-11-27 -**Purpose**: Future maintainability and hackathon continuation - ---- - -## Current State - -### Working Tools - -| Tool | Status | Data Quality | -|------|--------|--------------| -| PubMed | ✅ Works | Good (abstracts only) | -| ClinicalTrials.gov | ✅ Works | Good (filtered for interventional) | -| Europe PMC | ✅ Works | Good (includes preprints) | - -### Removed Tools - -| Tool | Status | Reason | -|------|--------|--------| -| bioRxiv | ❌ Removed | No search API - only date/DOI lookup | - ---- - -## Priority Improvements - -### P0: Critical (Do First) - -1. **Add Rate Limiting to PubMed** - - NCBI will block us without it - - Use `limits` library (see reference repo) - - 3/sec without key, 10/sec with key - -### P1: High Value, Medium Effort - -2. **Add OpenAlex as 4th Source** - - Citation network (huge for drug repurposing) - - Concept tagging (semantic discovery) - - Already implemented in reference repo - - Free, no API key - -3. 
**PubMed Full-Text via BioC** - - Get full paper text for PMC papers - - Already in reference repo - -### P2: Nice to Have - -4. **ClinicalTrials.gov Results** - - Get efficacy data from completed trials - - Requires more complex API calls - -5. **Europe PMC Annotations** - - Text-mined entities (genes, drugs, diseases) - - Automatic entity extraction - ---- - -## Effort Estimates - -| Improvement | Effort | Impact | Priority | -|-------------|--------|--------|----------| -| PubMed rate limiting | 1 hour | Stability | P0 | -| OpenAlex basic search | 2 hours | High | P1 | -| OpenAlex citations | 2 hours | Very High | P1 | -| PubMed full-text | 3 hours | Medium | P1 | -| CT.gov results | 4 hours | Medium | P2 | -| Europe PMC annotations | 3 hours | Medium | P2 | - ---- - -## Architecture Decision - -### Option A: Keep Current + Add OpenAlex - -``` - User Query - ↓ - ┌───────────────────┼───────────────────┐ - ↓ ↓ ↓ - PubMed ClinicalTrials Europe PMC - (abstracts) (trials only) (preprints) - ↓ ↓ ↓ - └───────────────────┼───────────────────┘ - ↓ - OpenAlex ← NEW - (citations, concepts) - ↓ - Orchestrator - ↓ - Report -``` - -**Pros**: Low risk, additive -**Cons**: More complexity, some overlap - -### Option B: OpenAlex as Primary - -``` - User Query - ↓ - ┌───────────────────┼───────────────────┐ - ↓ ↓ ↓ - OpenAlex ClinicalTrials Europe PMC - (primary (trials only) (full-text - search) fallback) - ↓ ↓ ↓ - └───────────────────┼───────────────────┘ - ↓ - Orchestrator - ↓ - Report -``` - -**Pros**: Simpler, citation network built-in -**Cons**: Lose some PubMed-specific features - -### Recommendation: Option A - -Keep current architecture working, add OpenAlex incrementally. - ---- - -## Quick Wins (Can Do Today) - -1. **Add `limits` to `pyproject.toml`** - ```toml - dependencies = [ - "limits>=3.0", - ] - ``` - -2. **Copy OpenAlex tool from reference repo** - - File: `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py` - - Adapt to our `SearchTool` base class - -3. **Enable NCBI API Key** - - Add to `.env`: `NCBI_API_KEY=your_key` - - 10x rate limit improvement - ---- - -## External Resources Worth Exploring - -### Python Libraries - -| Library | For | Notes | -|---------|-----|-------| -| `limits` | Rate limiting | Used by reference repo | -| `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) | -| `metapub` | PubMed | Full-featured | -| `sentence-transformers` | Semantic search | For embeddings | - -### APIs Not Yet Used - -| API | Provides | Effort | -|-----|----------|--------| -| RxNorm | Drug name normalization | Low | -| DrugBank | Drug targets/mechanisms | Medium (license) | -| UniProt | Protein data | Medium | -| ChEMBL | Bioactivity data | Medium | - -### RAG Tools (Future) - -| Tool | Purpose | -|------|---------| -| [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers | -| [txtai](https://github.com/neuml/txtai) | Embeddings + search | -| [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings | - ---- - -## Files in This Directory - -| File | Contents | -|------|----------| -| `00_ROADMAP_SUMMARY.md` | This file | -| `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details | -| `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details | -| `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details | -| `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan | - ---- - -## For Future Maintainers - -If you're picking this up after the hackathon: - -1. 
**Start with OpenAlex** - biggest bang for buck -2. **Add rate limiting** - prevents API blocks -3. **Don't bother with bioRxiv** - use Europe PMC instead -4. **Reference repo is gold** - `reference_repos/DeepCritical/` has working implementations - -Good luck! 🚀 diff --git a/docs/brainstorming/01_PUBMED_IMPROVEMENTS.md b/docs/brainstorming/01_PUBMED_IMPROVEMENTS.md deleted file mode 100644 index 6142e17b227eccca82eba26235de9d1e1f4f03b6..0000000000000000000000000000000000000000 --- a/docs/brainstorming/01_PUBMED_IMPROVEMENTS.md +++ /dev/null @@ -1,125 +0,0 @@ -# PubMed Tool: Current State & Future Improvements - -**Status**: Currently Implemented -**Priority**: High (Core Data Source) - ---- - -## Current Implementation - -### What We Have (`src/tools/pubmed.py`) - -- Basic E-utilities search via `esearch.fcgi` and `efetch.fcgi` -- Query preprocessing (strips question words, expands synonyms) -- Returns: title, abstract, authors, journal, PMID -- Rate limiting: None implemented (relying on NCBI defaults) - -### Current Limitations - -1. **No Full-Text Access**: Only retrieves abstracts, not full paper text -2. **No Rate Limiting**: Risk of being blocked by NCBI -3. **No BioC Format**: Missing structured full-text extraction -4. **No Figure Retrieval**: No supplementary materials access -5. **No PMC Integration**: Missing open-access full-text via PMC - ---- - -## Reference Implementation (DeepCritical Reference Repo) - -The reference repo at `reference_repos/DeepCritical/DeepResearch/src/tools/bioinformatics_tools.py` has a more sophisticated implementation: - -### Features We're Missing - -```python -# Rate limiting (lines 47-50) -from limits import parse -from limits.storage import MemoryStorage -from limits.strategies import MovingWindowRateLimiter - -storage = MemoryStorage() -limiter = MovingWindowRateLimiter(storage) -rate_limit = parse("3/second") # NCBI allows 3/sec without API key, 10/sec with - -# Full-text via BioC format (lines 108-120) -def _get_fulltext(pmid: int) -> dict[str, Any] | None: - pmid_url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode" - # Returns structured JSON with full text for open-access papers - -# Figure retrieval via Europe PMC (lines 123-149) -def _get_figures(pmcid: str) -> dict[str, str]: - suppl_url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles" - # Returns base64-encoded images from supplementary materials -``` - ---- - -## Recommended Improvements - -### Phase 1: Rate Limiting (Critical) - -```python -# Add to src/tools/pubmed.py -from limits import parse -from limits.storage import MemoryStorage -from limits.strategies import MovingWindowRateLimiter - -storage = MemoryStorage() -limiter = MovingWindowRateLimiter(storage) - -# With NCBI_API_KEY: 10/sec, without: 3/sec -def get_rate_limit(): - if settings.ncbi_api_key: - return parse("10/second") - return parse("3/second") -``` - -**Dependencies**: `pip install limits` - -### Phase 2: Full-Text Retrieval - -```python -async def get_fulltext(pmid: str) -> str | None: - """Get full text for open-access papers via BioC API.""" - url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode" - # Only works for PMC papers (open access) -``` - -### Phase 3: PMC ID Resolution - -```python -async def get_pmc_id(pmid: str) -> str | None: - """Convert PMID to PMCID for full-text access.""" - url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids={pmid}&format=json" -``` - ---- - -## Python 
Libraries to Consider - -| Library | Purpose | Notes | -|---------|---------|-------| -| [Biopython](https://biopython.org/) | `Bio.Entrez` module | Official, well-maintained | -| [PyMed](https://pypi.org/project/pymed/) | PubMed wrapper | Simpler API, less control | -| [metapub](https://pypi.org/project/metapub/) | Full-featured | Tested on 1/3 of PubMed | -| [limits](https://pypi.org/project/limits/) | Rate limiting | Used by reference repo | - ---- - -## API Endpoints Reference - -| Endpoint | Purpose | Rate Limit | -|----------|---------|------------| -| `esearch.fcgi` | Search for PMIDs | 3/sec (10 with key) | -| `efetch.fcgi` | Fetch metadata | 3/sec (10 with key) | -| `esummary.fcgi` | Quick metadata | 3/sec (10 with key) | -| `pmcoa.cgi/BioC_json` | Full text (PMC only) | Unknown | -| `idconv/v1.0` | PMID ↔ PMCID | Unknown | - ---- - -## Sources - -- [PubMed E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25501/) -- [NCBI BioC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/) -- [Searching PubMed with Python](https://marcobonzanini.com/2015/01/12/searching-pubmed-with-python/) -- [PyMed on PyPI](https://pypi.org/project/pymed/) diff --git a/docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md b/docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md deleted file mode 100644 index 5bf5722bdd16dadc80dd5b984de1185163cdc1f2..0000000000000000000000000000000000000000 --- a/docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md +++ /dev/null @@ -1,193 +0,0 @@ -# ClinicalTrials.gov Tool: Current State & Future Improvements - -**Status**: Currently Implemented -**Priority**: High (Core Data Source for Drug Repurposing) - ---- - -## Current Implementation - -### What We Have (`src/tools/clinicaltrials.py`) - -- V2 API search via `clinicaltrials.gov/api/v2/studies` -- Filters: `INTERVENTIONAL` study type, `RECRUITING` status -- Returns: NCT ID, title, conditions, interventions, phase, status -- Query preprocessing via shared `query_utils.py` - -### Current Strengths - -1. **Good Filtering**: Already filtering for interventional + recruiting -2. **V2 API**: Using the modern API (v1 deprecated) -3. **Phase Info**: Extracting trial phases for drug development context - -### Current Limitations - -1. **No Outcome Data**: Missing primary/secondary outcomes -2. **No Eligibility Criteria**: Missing inclusion/exclusion details -3. **No Sponsor Info**: Missing who's running the trial -4. **No Result Data**: For completed trials, no efficacy data -5. **Limited Drug Mapping**: No integration with drug databases - ---- - -## API Capabilities We're Not Using - -### Fields We Could Request - -```python -# Current fields -fields = ["NCTId", "BriefTitle", "Condition", "InterventionName", "Phase", "OverallStatus"] - -# Additional valuable fields -additional_fields = [ - "PrimaryOutcomeMeasure", # What are they measuring? - "SecondaryOutcomeMeasure", # Secondary endpoints - "EligibilityCriteria", # Who can participate? - "LeadSponsorName", # Who's funding? - "ResultsFirstPostDate", # Has results? - "StudyFirstPostDate", # When started? - "CompletionDate", # When finished? 
- "EnrollmentCount", # Sample size - "InterventionDescription", # Drug details - "ArmGroupLabel", # Treatment arms - "InterventionOtherName", # Drug aliases -] -``` - -### Filter Enhancements - -```python -# Current -aggFilters = "studyType:INTERVENTIONAL,status:RECRUITING" - -# Could add -"status:RECRUITING,ACTIVE_NOT_RECRUITING,COMPLETED" # Include completed for results -"phase:PHASE2,PHASE3" # Only later-stage trials -"resultsFirstPostDateRange:2020-01-01_" # Trials with posted results -``` - ---- - -## Recommended Improvements - -### Phase 1: Richer Metadata - -```python -EXTENDED_FIELDS = [ - "NCTId", - "BriefTitle", - "OfficialTitle", - "Condition", - "InterventionName", - "InterventionDescription", - "InterventionOtherName", # Drug synonyms! - "Phase", - "OverallStatus", - "PrimaryOutcomeMeasure", - "EnrollmentCount", - "LeadSponsorName", - "StudyFirstPostDate", -] -``` - -### Phase 2: Results Retrieval - -For completed trials, we can get actual efficacy data: - -```python -async def get_trial_results(nct_id: str) -> dict | None: - """Fetch results for completed trials.""" - url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}" - params = { - "fields": "ResultsSection", - } - # Returns outcome measures and statistics -``` - -### Phase 3: Drug Name Normalization - -Map intervention names to standard identifiers: - -```python -# Problem: "Metformin", "Metformin HCl", "Glucophage" are the same drug -# Solution: Use RxNorm or DrugBank for normalization - -async def normalize_drug_name(intervention: str) -> str: - """Normalize drug name via RxNorm API.""" - url = f"https://rxnav.nlm.nih.gov/REST/rxcui.json?name={intervention}" - # Returns standardized RxCUI -``` - ---- - -## Integration Opportunities - -### With PubMed - -Cross-reference trials with publications: -```python -# ClinicalTrials.gov provides PMID links -# Can correlate trial results with published papers -``` - -### With DrugBank/ChEMBL - -Map interventions to: -- Mechanism of action -- Known targets -- Adverse effects -- Drug-drug interactions - ---- - -## Python Libraries to Consider - -| Library | Purpose | Notes | -|---------|---------|-------| -| [pytrials](https://pypi.org/project/pytrials/) | CT.gov wrapper | V2 API support unclear | -| [clinicaltrials](https://github.com/ebmdatalab/clinicaltrials-act-tracker) | Data tracking | More for analysis | -| [drugbank-downloader](https://pypi.org/project/drugbank-downloader/) | Drug mapping | Requires license | - ---- - -## API Quirks & Gotchas - -1. **Rate Limiting**: Undocumented, be conservative -2. **Pagination**: Max 1000 results per request -3. **Field Names**: Case-sensitive, camelCase -4. **Empty Results**: Some fields may be null even if requested -5. 
**Status Changes**: Trials change status frequently - ---- - -## Example Enhanced Query - -```python -async def search_drug_repurposing_trials( - drug_name: str, - condition: str, - include_completed: bool = True, -) -> list[Evidence]: - """Search for trials repurposing a drug for a new condition.""" - - statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"] - if include_completed: - statuses.append("COMPLETED") - - params = { - "query.intr": drug_name, - "query.cond": condition, - "filter.overallStatus": ",".join(statuses), - "filter.studyType": "INTERVENTIONAL", - "fields": ",".join(EXTENDED_FIELDS), - "pageSize": 50, - } -``` - ---- - -## Sources - -- [ClinicalTrials.gov API Documentation](https://clinicaltrials.gov/data-api/api) -- [CT.gov Field Definitions](https://clinicaltrials.gov/data-api/about-api/study-data-structure) -- [RxNorm API](https://lhncbc.nlm.nih.gov/RxNav/APIs/api-RxNorm.findRxcuiByString.html) diff --git a/docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md b/docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md deleted file mode 100644 index dfec6cb16ac9d0539b43153e8c12fab206bb3009..0000000000000000000000000000000000000000 --- a/docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md +++ /dev/null @@ -1,211 +0,0 @@ -# Europe PMC Tool: Current State & Future Improvements - -**Status**: Currently Implemented (Replaced bioRxiv) -**Priority**: High (Preprint + Open Access Source) - ---- - -## Why Europe PMC Over bioRxiv? - -### bioRxiv API Limitations (Why We Abandoned It) - -1. **No Search API**: Only returns papers by date range or DOI -2. **No Query Capability**: Cannot search for "metformin cancer" -3. **Workaround Required**: Would need to download ALL preprints and build local search -4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation - -### Europe PMC Advantages - -1. **Full Search API**: Boolean queries, filters, facets -2. **Aggregates bioRxiv**: Includes bioRxiv, medRxiv content anyway -3. **Includes PubMed**: Also has MEDLINE content -4. **34 Preprint Servers**: Not just bioRxiv -5. **Open Access Focus**: Full-text when available - ---- - -## Current Implementation - -### What We Have (`src/tools/europepmc.py`) - -- REST API search via `europepmc.org/webservices/rest/search` -- Preprint flagging via `firstPublicationDate` heuristics -- Returns: title, abstract, authors, DOI, source -- Marks preprints for transparency - -### Current Limitations - -1. **No Full-Text Retrieval**: Only metadata/abstracts -2. **No Citation Network**: Missing references/citations -3. **No Supplementary Files**: Not fetching figures/data -4. 
**Basic Preprint Detection**: Heuristic, not explicit flag - ---- - -## Europe PMC API Capabilities - -### Endpoints We Could Use - -| Endpoint | Purpose | Currently Using | -|----------|---------|-----------------| -| `/search` | Query papers | Yes | -| `/fulltext/{ID}` | Full text (XML/JSON) | No | -| `/{PMCID}/supplementaryFiles` | Figures, data | No | -| `/citations/{ID}` | Who cited this | No | -| `/references/{ID}` | What this cites | No | -| `/annotations` | Text-mined entities | No | - -### Rich Query Syntax - -```python -# Current simple query -query = "metformin cancer" - -# Could use advanced syntax -query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)" -query += " AND (SRC:PPR)" # Only preprints -query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])" # Date range -query += " AND (OPEN_ACCESS:y)" # Only open access -``` - -### Source Filters - -```python -# Filter by source -"SRC:MED" # MEDLINE -"SRC:PMC" # PubMed Central -"SRC:PPR" # Preprints (bioRxiv, medRxiv, etc.) -"SRC:AGR" # Agricola -"SRC:CBA" # Chinese Biological Abstracts -``` - ---- - -## Recommended Improvements - -### Phase 1: Rich Metadata - -```python -# Add to search results -additional_fields = [ - "citedByCount", # Impact indicator - "source", # Explicit source (MED, PMC, PPR) - "isOpenAccess", # Boolean flag - "fullTextUrlList", # URLs for full text - "authorAffiliations", # Institution info - "grantsList", # Funding info -] -``` - -### Phase 2: Full-Text Retrieval - -```python -async def get_fulltext(pmcid: str) -> str | None: - """Get full text for open access papers.""" - # XML format - url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML" - # Or JSON - url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextJSON" -``` - -### Phase 3: Citation Network - -```python -async def get_citations(pmcid: str) -> list[str]: - """Get papers that cite this one.""" - url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/citations" - -async def get_references(pmcid: str) -> list[str]: - """Get papers this one cites.""" - url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/references" -``` - -### Phase 4: Text-Mined Annotations - -Europe PMC extracts entities automatically: - -```python -async def get_annotations(pmcid: str) -> dict: - """Get text-mined entities (genes, diseases, drugs).""" - url = f"https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds" - params = { - "articleIds": f"PMC:{pmcid}", - "type": "Gene_Proteins,Diseases,Chemicals", - "format": "JSON", - } - # Returns structured entity mentions with positions -``` - ---- - -## Supplementary File Retrieval - -From reference repo (`bioinformatics_tools.py` lines 123-149): - -```python -def get_figures(pmcid: str) -> dict[str, str]: - """Download figures and supplementary files.""" - url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true" - # Returns ZIP with images, returns base64-encoded -``` - ---- - -## Preprint-Specific Features - -### Identify Preprint Servers - -```python -PREPRINT_SOURCES = { - "PPR": "General preprints", - "bioRxiv": "Biology preprints", - "medRxiv": "Medical preprints", - "chemRxiv": "Chemistry preprints", - "Research Square": "Multi-disciplinary", - "Preprints.org": "MDPI preprints", -} - -# Check if published version exists -async def check_published_version(preprint_doi: str) -> str | None: - """Check if preprint has been peer-reviewed and published.""" - # Europe PMC links 
preprints to final versions -``` - ---- - -## Rate Limiting - -Europe PMC is more generous than NCBI: - -```python -# No documented hard limit, but be respectful -# Recommend: 10-20 requests/second max -# Use email in User-Agent for polite pool -headers = { - "User-Agent": "DeepCritical/1.0 (mailto:your@email.com)" -} -``` - ---- - -## vs. The Lens & OpenAlex - -| Feature | Europe PMC | The Lens | OpenAlex | -|---------|------------|----------|----------| -| Biomedical Focus | Yes | Partial | Partial | -| Preprints | Yes (34 servers) | Yes | Yes | -| Full Text | PMC papers | Links | No | -| Citations | Yes | Yes | Yes | -| Annotations | Yes (text-mined) | No | No | -| Rate Limits | Generous | Moderate | Very generous | -| API Key | Optional | Required | Optional | - ---- - -## Sources - -- [Europe PMC REST API](https://europepmc.org/RestfulWebService) -- [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi) -- [Europe PMC Articles API](https://europepmc.org/ArticlesApi) -- [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/) -- [bioRxiv TDM Resources](https://www.biorxiv.org/tdm) diff --git a/docs/brainstorming/04_OPENALEX_INTEGRATION.md b/docs/brainstorming/04_OPENALEX_INTEGRATION.md deleted file mode 100644 index 3a191e4ed7945003128e15ef866ddfc9a2873568..0000000000000000000000000000000000000000 --- a/docs/brainstorming/04_OPENALEX_INTEGRATION.md +++ /dev/null @@ -1,303 +0,0 @@ -# OpenAlex Integration: The Missing Piece? - -**Status**: NOT Implemented (Candidate for Addition) -**Priority**: HIGH - Could Replace Multiple Tools -**Reference**: Already implemented in `reference_repos/DeepCritical` - ---- - -## What is OpenAlex? - -OpenAlex is a **fully open** index of the global research system: - -- **209M+ works** (papers, books, datasets) -- **2B+ author records** (disambiguated) -- **124K+ venues** (journals, repositories) -- **109K+ institutions** -- **65K+ concepts** (hierarchical, linked to Wikidata) - -**Free. Open. No API key required.** - ---- - -## Why OpenAlex for DeepCritical? 
- -### Current Architecture - -``` -User Query - ↓ -┌──────────────────────────────────────┐ -│ PubMed ClinicalTrials Europe PMC │ ← 3 separate APIs -└──────────────────────────────────────┘ - ↓ -Orchestrator (deduplicate, judge, synthesize) -``` - -### With OpenAlex - -``` -User Query - ↓ -┌──────────────────────────────────────┐ -│ OpenAlex │ ← Single API -│ (includes PubMed + preprints + │ -│ citations + concepts + authors) │ -└──────────────────────────────────────┘ - ↓ -Orchestrator (enrich with CT.gov for trials) -``` - -**OpenAlex already aggregates**: -- PubMed/MEDLINE -- Crossref -- ORCID -- Unpaywall (open access links) -- Microsoft Academic Graph (legacy) -- Preprint servers - ---- - -## Reference Implementation - -From `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`: - -```python -class OpenAlexFetchTool(ToolRunner): - def __init__(self): - super().__init__( - ToolSpec( - name="openalex_fetch", - description="Fetch OpenAlex work or author", - inputs={"entity": "TEXT", "identifier": "TEXT"}, - outputs={"result": "JSON"}, - ) - ) - - def run(self, params: dict[str, Any]) -> ExecutionResult: - entity = params["entity"] # "works", "authors", "venues" - identifier = params["identifier"] - base = "https://api.openalex.org" - url = f"{base}/{entity}/{identifier}" - resp = requests.get(url, timeout=30) - return ExecutionResult(success=True, data={"result": resp.json()}) -``` - ---- - -## OpenAlex API Features - -### Search Works (Papers) - -```python -# Search for metformin + cancer papers -url = "https://api.openalex.org/works" -params = { - "search": "metformin cancer drug repurposing", - "filter": "publication_year:>2020,type:article", - "sort": "cited_by_count:desc", - "per_page": 50, -} -``` - -### Rich Filtering - -```python -# Filter examples -"publication_year:2023" -"type:article" # vs preprint, book, etc. -"is_oa:true" # Open access only -"concepts.id:C71924100" # Papers about "Medicine" -"authorships.institutions.id:I27837315" # From Harvard -"cited_by_count:>100" # Highly cited -"has_fulltext:true" # Full text available -``` - -### What You Get Back - -```json -{ - "id": "W2741809807", - "title": "Metformin: A candidate drug for...", - "publication_year": 2023, - "type": "article", - "cited_by_count": 45, - "is_oa": true, - "primary_location": { - "source": {"display_name": "Nature Medicine"}, - "pdf_url": "https://...", - "landing_page_url": "https://..." - }, - "concepts": [ - {"id": "C71924100", "display_name": "Medicine", "score": 0.95}, - {"id": "C54355233", "display_name": "Pharmacology", "score": 0.88} - ], - "authorships": [ - { - "author": {"id": "A123", "display_name": "John Smith"}, - "institutions": [{"display_name": "Harvard Medical School"}] - } - ], - "referenced_works": ["W123", "W456"], # Citations - "related_works": ["W789", "W012"] # Similar papers -} -``` - ---- - -## Key Advantages Over Current Tools - -### 1. Citation Network (We Don't Have This!) - -```python -# Get papers that cite a work -url = f"https://api.openalex.org/works?filter=cites:{work_id}" - -# Get papers cited by a work -# Already in `referenced_works` field -``` - -### 2. Concept Tagging (We Don't Have This!) - -OpenAlex auto-tags papers with hierarchical concepts: -- "Medicine" → "Pharmacology" → "Drug Repurposing" -- Can search by concept, not just keywords - -### 3. Author Disambiguation (We Don't Have This!) - -```python -# Find all works by an author -url = f"https://api.openalex.org/works?filter=authorships.author.id:{author_id}" -``` - -### 4. 
Institution Tracking - -```python -# Find drug repurposing papers from top institutions -url = "https://api.openalex.org/works" -params = { - "search": "drug repurposing", - "filter": "authorships.institutions.id:I27837315", # Harvard -} -``` - -### 5. Related Works - -Each paper comes with `related_works` - semantically similar papers discovered by OpenAlex's ML. - ---- - -## Proposed Implementation - -### New Tool: `src/tools/openalex.py` - -```python -"""OpenAlex search tool for comprehensive scholarly data.""" - -import httpx -from src.tools.base import SearchTool -from src.utils.models import Evidence - -class OpenAlexTool(SearchTool): - """Search OpenAlex for scholarly works with rich metadata.""" - - name = "openalex" - - async def search(self, query: str, max_results: int = 10) -> list[Evidence]: - async with httpx.AsyncClient() as client: - resp = await client.get( - "https://api.openalex.org/works", - params={ - "search": query, - "filter": "type:article,is_oa:true", - "sort": "cited_by_count:desc", - "per_page": max_results, - "mailto": "deepcritical@example.com", # Polite pool - }, - ) - data = resp.json() - - return [ - Evidence( - source="openalex", - title=work["title"], - abstract=work.get("abstract", ""), - url=work["primary_location"]["landing_page_url"], - metadata={ - "cited_by_count": work["cited_by_count"], - "concepts": [c["display_name"] for c in work["concepts"][:5]], - "is_open_access": work["is_oa"], - "pdf_url": work["primary_location"].get("pdf_url"), - }, - ) - for work in data["results"] - ] -``` - ---- - -## Rate Limits - -OpenAlex is **extremely generous**: - -- No hard rate limit documented -- Recommended: <100,000 requests/day -- **Polite pool**: Add `mailto=your@email.com` param for faster responses -- No API key required (optional for priority support) - ---- - -## Should We Add OpenAlex? - -### Arguments FOR - -1. **Already in reference repo** - proven pattern -2. **Richer data** - citations, concepts, authors -3. **Single source** - reduces API complexity -4. **Free & open** - no keys, no limits -5. **Institution adoption** - Leiden, Sorbonne switched to it - -### Arguments AGAINST - -1. **Adds complexity** - another data source -2. **Overlap** - duplicates some PubMed data -3. **Not biomedical-focused** - covers all disciplines -4. **No full text** - still need PMC/Europe PMC for that - -### Recommendation - -**Add OpenAlex as a 4th source**, don't replace existing tools. 
- -Use it for: -- Citation network analysis -- Concept-based discovery -- High-impact paper finding -- Author/institution tracking - -Keep PubMed, ClinicalTrials, Europe PMC for: -- Authoritative biomedical search -- Clinical trial data -- Full-text access -- Preprint tracking - ---- - -## Implementation Priority - -| Task | Effort | Value | -|------|--------|-------| -| Basic search | Low | High | -| Citation network | Medium | Very High | -| Concept filtering | Low | High | -| Related works | Low | High | -| Author tracking | Medium | Medium | - ---- - -## Sources - -- [OpenAlex Documentation](https://docs.openalex.org) -- [OpenAlex API Overview](https://docs.openalex.org/api) -- [OpenAlex Wikipedia](https://en.wikipedia.org/wiki/OpenAlex) -- [Leiden University Announcement](https://www.leidenranking.com/information/openalex) -- [OpenAlex: A fully-open index (Paper)](https://arxiv.org/abs/2205.01833) diff --git a/docs/brainstorming/implementation/15_PHASE_OPENALEX.md b/docs/brainstorming/implementation/15_PHASE_OPENALEX.md deleted file mode 100644 index 9fb3afcc752cb37d22bd6c31a3412b4cb002df30..0000000000000000000000000000000000000000 --- a/docs/brainstorming/implementation/15_PHASE_OPENALEX.md +++ /dev/null @@ -1,603 +0,0 @@ -# Phase 15: OpenAlex Integration - -**Priority**: HIGH - Biggest bang for buck -**Effort**: ~2-3 hours -**Dependencies**: None (existing codebase patterns sufficient) - ---- - -## Prerequisites (COMPLETED) - -The following model changes have been implemented to support this integration: - -1. **`SourceName` Literal Updated** (`src/utils/models.py:9`) - ```python - SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"] - ``` - - Without this, `source="openalex"` would fail Pydantic validation - -2. **`Evidence.metadata` Field Added** (`src/utils/models.py:39-42`) - ```python - metadata: dict[str, Any] = Field( - default_factory=dict, - description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)", - ) - ``` - - Required for storing `cited_by_count`, `concepts`, etc. - - Model is still frozen - metadata must be passed at construction time - -3. 
**`__init__.py` Exports Updated** (`src/tools/__init__.py`) - - All tools are now exported: `ClinicalTrialsTool`, `EuropePMCTool`, `PubMedTool` - - OpenAlexTool should be added here after implementation - ---- - -## Overview - -Add OpenAlex as a 4th data source for comprehensive scholarly data including: -- Citation networks (who cites whom) -- Concept tagging (hierarchical topic classification) -- Author disambiguation -- 209M+ works indexed - -**Why OpenAlex?** -- Free, no API key required -- Already implemented in reference repo -- Provides citation data we don't have -- Aggregates PubMed + preprints + more - ---- - -## TDD Implementation Plan - -### Step 1: Write the Tests First - -**File**: `tests/unit/tools/test_openalex.py` - -```python -"""Tests for OpenAlex search tool.""" - -import pytest -import respx -from httpx import Response - -from src.tools.openalex import OpenAlexTool -from src.utils.models import Evidence - - -class TestOpenAlexTool: - """Test suite for OpenAlex search functionality.""" - - @pytest.fixture - def tool(self) -> OpenAlexTool: - return OpenAlexTool() - - def test_name_property(self, tool: OpenAlexTool) -> None: - """Tool should identify itself as 'openalex'.""" - assert tool.name == "openalex" - - @respx.mock - @pytest.mark.asyncio - async def test_search_returns_evidence(self, tool: OpenAlexTool) -> None: - """Search should return list of Evidence objects.""" - mock_response = { - "results": [ - { - "id": "W2741809807", - "title": "Metformin and cancer: A systematic review", - "publication_year": 2023, - "cited_by_count": 45, - "type": "article", - "is_oa": True, - "primary_location": { - "source": {"display_name": "Nature Medicine"}, - "landing_page_url": "https://doi.org/10.1038/example", - "pdf_url": None, - }, - "abstract_inverted_index": { - "Metformin": [0], - "shows": [1], - "anticancer": [2], - "effects": [3], - }, - "concepts": [ - {"display_name": "Medicine", "score": 0.95}, - {"display_name": "Oncology", "score": 0.88}, - ], - "authorships": [ - { - "author": {"display_name": "John Smith"}, - "institutions": [{"display_name": "Harvard"}], - } - ], - } - ] - } - - respx.get("https://api.openalex.org/works").mock( - return_value=Response(200, json=mock_response) - ) - - results = await tool.search("metformin cancer", max_results=10) - - assert len(results) == 1 - assert isinstance(results[0], Evidence) - assert "Metformin and cancer" in results[0].citation.title - assert results[0].citation.source == "openalex" - - @respx.mock - @pytest.mark.asyncio - async def test_search_empty_results(self, tool: OpenAlexTool) -> None: - """Search with no results should return empty list.""" - respx.get("https://api.openalex.org/works").mock( - return_value=Response(200, json={"results": []}) - ) - - results = await tool.search("xyznonexistentquery123") - assert results == [] - - @respx.mock - @pytest.mark.asyncio - async def test_search_handles_missing_abstract(self, tool: OpenAlexTool) -> None: - """Tool should handle papers without abstracts.""" - mock_response = { - "results": [ - { - "id": "W123", - "title": "Paper without abstract", - "publication_year": 2023, - "cited_by_count": 10, - "type": "article", - "is_oa": False, - "primary_location": { - "source": {"display_name": "Journal"}, - "landing_page_url": "https://example.com", - }, - "abstract_inverted_index": None, - "concepts": [], - "authorships": [], - } - ] - } - - respx.get("https://api.openalex.org/works").mock( - return_value=Response(200, json=mock_response) - ) - - results = await 
tool.search("test query") - assert len(results) == 1 - assert results[0].content == "" # No abstract - - @respx.mock - @pytest.mark.asyncio - async def test_search_extracts_citation_count(self, tool: OpenAlexTool) -> None: - """Citation count should be in metadata.""" - mock_response = { - "results": [ - { - "id": "W456", - "title": "Highly cited paper", - "publication_year": 2020, - "cited_by_count": 500, - "type": "article", - "is_oa": True, - "primary_location": { - "source": {"display_name": "Science"}, - "landing_page_url": "https://example.com", - }, - "abstract_inverted_index": {"Test": [0]}, - "concepts": [], - "authorships": [], - } - ] - } - - respx.get("https://api.openalex.org/works").mock( - return_value=Response(200, json=mock_response) - ) - - results = await tool.search("highly cited") - assert results[0].metadata["cited_by_count"] == 500 - - @respx.mock - @pytest.mark.asyncio - async def test_search_extracts_concepts(self, tool: OpenAlexTool) -> None: - """Concepts should be extracted for semantic discovery.""" - mock_response = { - "results": [ - { - "id": "W789", - "title": "Drug repurposing study", - "publication_year": 2023, - "cited_by_count": 25, - "type": "article", - "is_oa": True, - "primary_location": { - "source": {"display_name": "PLOS ONE"}, - "landing_page_url": "https://example.com", - }, - "abstract_inverted_index": {"Drug": [0], "repurposing": [1]}, - "concepts": [ - {"display_name": "Pharmacology", "score": 0.92}, - {"display_name": "Drug Discovery", "score": 0.85}, - {"display_name": "Medicine", "score": 0.80}, - ], - "authorships": [], - } - ] - } - - respx.get("https://api.openalex.org/works").mock( - return_value=Response(200, json=mock_response) - ) - - results = await tool.search("drug repurposing") - assert "Pharmacology" in results[0].metadata["concepts"] - assert "Drug Discovery" in results[0].metadata["concepts"] - - @respx.mock - @pytest.mark.asyncio - async def test_search_api_error_raises_search_error( - self, tool: OpenAlexTool - ) -> None: - """API errors should raise SearchError.""" - from src.utils.exceptions import SearchError - - respx.get("https://api.openalex.org/works").mock( - return_value=Response(500, text="Internal Server Error") - ) - - with pytest.raises(SearchError): - await tool.search("test query") - - def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None: - """Test abstract reconstruction from inverted index.""" - inverted_index = { - "Metformin": [0, 5], - "is": [1], - "a": [2], - "diabetes": [3], - "drug": [4], - "effective": [6], - } - abstract = tool._reconstruct_abstract(inverted_index) - assert abstract == "Metformin is a diabetes drug Metformin effective" -``` - ---- - -### Step 2: Create the Implementation - -**File**: `src/tools/openalex.py` - -```python -"""OpenAlex search tool for comprehensive scholarly data.""" - -from typing import Any - -import httpx -from tenacity import retry, stop_after_attempt, wait_exponential - -from src.utils.exceptions import SearchError -from src.utils.models import Citation, Evidence - - -class OpenAlexTool: - """ - Search OpenAlex for scholarly works with rich metadata. - - OpenAlex provides: - - 209M+ scholarly works - - Citation counts and networks - - Concept tagging (hierarchical) - - Author disambiguation - - Open access links - - API Docs: https://docs.openalex.org/ - """ - - BASE_URL = "https://api.openalex.org/works" - - def __init__(self, email: str | None = None) -> None: - """ - Initialize OpenAlex tool. 
- - Args: - email: Optional email for polite pool (faster responses) - """ - self.email = email or "deepcritical@example.com" - - @property - def name(self) -> str: - return "openalex" - - @retry( - stop=stop_after_attempt(3), - wait=wait_exponential(multiplier=1, min=1, max=10), - reraise=True, - ) - async def search(self, query: str, max_results: int = 10) -> list[Evidence]: - """ - Search OpenAlex for scholarly works. - - Args: - query: Search terms - max_results: Maximum results to return (max 200 per request) - - Returns: - List of Evidence objects with citation metadata - - Raises: - SearchError: If API request fails - """ - params = { - "search": query, - "filter": "type:article", # Only peer-reviewed articles - "sort": "cited_by_count:desc", # Most cited first - "per_page": min(max_results, 200), - "mailto": self.email, # Polite pool for faster responses - } - - async with httpx.AsyncClient(timeout=30.0) as client: - try: - response = await client.get(self.BASE_URL, params=params) - response.raise_for_status() - - data = response.json() - results = data.get("results", []) - - return [self._to_evidence(work) for work in results[:max_results]] - - except httpx.HTTPStatusError as e: - raise SearchError(f"OpenAlex API error: {e}") from e - except httpx.RequestError as e: - raise SearchError(f"OpenAlex connection failed: {e}") from e - - def _to_evidence(self, work: dict[str, Any]) -> Evidence: - """Convert OpenAlex work to Evidence object.""" - title = work.get("title", "Untitled") - pub_year = work.get("publication_year", "Unknown") - cited_by = work.get("cited_by_count", 0) - is_oa = work.get("is_oa", False) - - # Reconstruct abstract from inverted index - abstract_index = work.get("abstract_inverted_index") - abstract = self._reconstruct_abstract(abstract_index) if abstract_index else "" - - # Extract concepts (top 5) - concepts = [ - c.get("display_name", "") - for c in work.get("concepts", [])[:5] - if c.get("display_name") - ] - - # Extract authors (top 5) - authorships = work.get("authorships", []) - authors = [ - a.get("author", {}).get("display_name", "") - for a in authorships[:5] - if a.get("author", {}).get("display_name") - ] - - # Get URL - primary_loc = work.get("primary_location") or {} - url = primary_loc.get("landing_page_url", "") - if not url: - # Fallback to OpenAlex page - work_id = work.get("id", "").replace("https://openalex.org/", "") - url = f"https://openalex.org/{work_id}" - - return Evidence( - content=abstract[:2000], - citation=Citation( - source="openalex", - title=title[:500], - url=url, - date=str(pub_year), - authors=authors, - ), - relevance=min(0.9, 0.5 + (cited_by / 1000)), # Boost by citations - metadata={ - "cited_by_count": cited_by, - "is_open_access": is_oa, - "concepts": concepts, - "pdf_url": primary_loc.get("pdf_url"), - }, - ) - - def _reconstruct_abstract( - self, inverted_index: dict[str, list[int]] - ) -> str: - """ - Reconstruct abstract from OpenAlex inverted index format. - - OpenAlex stores abstracts as {"word": [position1, position2, ...]}. - This rebuilds the original text. 
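-
-        Example (illustrative):
-            {"Metformin": [0], "is": [1], "effective": [2]} -> "Metformin is effective"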
- """ - if not inverted_index: - return "" - - # Build position -> word mapping - position_word: dict[int, str] = {} - for word, positions in inverted_index.items(): - for pos in positions: - position_word[pos] = word - - # Reconstruct in order - if not position_word: - return "" - - max_pos = max(position_word.keys()) - words = [position_word.get(i, "") for i in range(max_pos + 1)] - return " ".join(w for w in words if w) -``` - ---- - -### Step 3: Register in Search Handler - -**File**: `src/tools/search_handler.py` (add to imports and tool list) - -```python -# Add import -from src.tools.openalex import OpenAlexTool - -# Add to _create_tools method -def _create_tools(self) -> list[SearchTool]: - return [ - PubMedTool(), - ClinicalTrialsTool(), - EuropePMCTool(), - OpenAlexTool(), # NEW - ] -``` - ---- - -### Step 4: Update `__init__.py` - -**File**: `src/tools/__init__.py` - -```python -from src.tools.openalex import OpenAlexTool - -__all__ = [ - "PubMedTool", - "ClinicalTrialsTool", - "EuropePMCTool", - "OpenAlexTool", # NEW - # ... -] -``` - ---- - -## Demo Script - -**File**: `examples/openalex_demo.py` - -```python -#!/usr/bin/env python3 -"""Demo script to verify OpenAlex integration.""" - -import asyncio -from src.tools.openalex import OpenAlexTool - - -async def main(): - """Run OpenAlex search demo.""" - tool = OpenAlexTool() - - print("=" * 60) - print("OpenAlex Integration Demo") - print("=" * 60) - - # Test 1: Basic drug repurposing search - print("\n[Test 1] Searching for 'metformin cancer drug repurposing'...") - results = await tool.search("metformin cancer drug repurposing", max_results=5) - - for i, evidence in enumerate(results, 1): - print(f"\n--- Result {i} ---") - print(f"Title: {evidence.citation.title}") - print(f"Year: {evidence.citation.date}") - print(f"Citations: {evidence.metadata.get('cited_by_count', 'N/A')}") - print(f"Concepts: {', '.join(evidence.metadata.get('concepts', []))}") - print(f"Open Access: {evidence.metadata.get('is_open_access', False)}") - print(f"URL: {evidence.citation.url}") - if evidence.content: - print(f"Abstract: {evidence.content[:200]}...") - - # Test 2: High-impact papers - print("\n" + "=" * 60) - print("[Test 2] Finding highly-cited papers on 'long COVID treatment'...") - results = await tool.search("long COVID treatment", max_results=3) - - for evidence in results: - print(f"\n- {evidence.citation.title}") - print(f" Citations: {evidence.metadata.get('cited_by_count', 0)}") - - print("\n" + "=" * 60) - print("Demo complete!") - - -if __name__ == "__main__": - asyncio.run(main()) -``` - ---- - -## Verification Checklist - -### Unit Tests -```bash -# Run just OpenAlex tests -uv run pytest tests/unit/tools/test_openalex.py -v - -# Expected: All tests pass -``` - -### Integration Test (Manual) -```bash -# Run demo script with real API -uv run python examples/openalex_demo.py - -# Expected: Real results from OpenAlex API -``` - -### Full Test Suite -```bash -# Ensure nothing broke -make check - -# Expected: All 110+ tests pass, mypy clean -``` - ---- - -## Success Criteria - -1. **Unit tests pass**: All mocked tests in `test_openalex.py` pass -2. **Integration works**: Demo script returns real results -3. **No regressions**: `make check` passes completely -4. **SearchHandler integration**: OpenAlex appears in search results alongside other sources -5. **Citation metadata**: Results include `cited_by_count`, `concepts`, `is_open_access` - ---- - -## Future Enhancements (P2) - -Once basic integration works: - -1. 
**Citation Network Queries** - ```python - # Get papers citing a specific work - async def get_citing_works(self, work_id: str) -> list[Evidence]: - params = {"filter": f"cites:{work_id}"} - ... - ``` - -2. **Concept-Based Search** - ```python - # Search by OpenAlex concept ID - async def search_by_concept(self, concept_id: str) -> list[Evidence]: - params = {"filter": f"concepts.id:{concept_id}"} - ... - ``` - -3. **Author Tracking** - ```python - # Find all works by an author - async def search_by_author(self, author_id: str) -> list[Evidence]: - params = {"filter": f"authorships.author.id:{author_id}"} - ... - ``` - ---- - -## Notes - -- OpenAlex is **very generous** with rate limits (no documented hard limit) -- Adding `mailto` parameter gives priority access (polite pool) -- Abstract is stored as inverted index - must reconstruct -- Citation count is a good proxy for paper quality/impact -- Consider caching responses for repeated queries diff --git a/docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md b/docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md deleted file mode 100644 index 3284012fc70577f0d2cff5666b897c1799942102..0000000000000000000000000000000000000000 --- a/docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md +++ /dev/null @@ -1,586 +0,0 @@ -# Phase 16: PubMed Full-Text Retrieval - -**Priority**: MEDIUM - Enhances evidence quality -**Effort**: ~3 hours -**Dependencies**: None (existing PubMed tool sufficient) - ---- - -## Prerequisites (COMPLETED) - -The `Evidence.metadata` field has been added to `src/utils/models.py` to support: -```python -metadata={"has_fulltext": True} -``` - ---- - -## Architecture Decision: Constructor Parameter vs Method Parameter - -**IMPORTANT**: The original spec proposed `include_fulltext` as a method parameter: -```python -# WRONG - SearchHandler won't pass this parameter -async def search(self, query: str, max_results: int = 10, include_fulltext: bool = False): -``` - -**Problem**: `SearchHandler` calls `tool.search(query, max_results)` uniformly across all tools. -It has no mechanism to pass tool-specific parameters like `include_fulltext`. - -**Solution**: Use constructor parameter instead: -```python -# CORRECT - Configured at instantiation time -class PubMedTool: - def __init__(self, api_key: str | None = None, include_fulltext: bool = False): - self.include_fulltext = include_fulltext - ... 
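-        # search(query, max_results) keeps its uniform signature, so SearchHandler needs no changes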
-``` - -This way, you can create a full-text-enabled PubMed tool: -```python -# In orchestrator or wherever tools are created -tools = [ - PubMedTool(include_fulltext=True), # Full-text enabled - ClinicalTrialsTool(), - EuropePMCTool(), -] -``` - ---- - -## Overview - -Add full-text retrieval for PubMed papers via the BioC API, enabling: -- Complete paper text for open-access PMC papers -- Structured sections (intro, methods, results, discussion) -- Better evidence for LLM synthesis - -**Why Full-Text?** -- Abstracts only give ~200-300 words -- Full text provides detailed methods, results, figures -- Reference repo already has this implemented -- Makes LLM judgments more accurate - ---- - -## TDD Implementation Plan - -### Step 1: Write the Tests First - -**File**: `tests/unit/tools/test_pubmed_fulltext.py` - -```python -"""Tests for PubMed full-text retrieval.""" - -import pytest -import respx -from httpx import Response - -from src.tools.pubmed import PubMedTool - - -class TestPubMedFullText: - """Test suite for PubMed full-text functionality.""" - - @pytest.fixture - def tool(self) -> PubMedTool: - return PubMedTool() - - @respx.mock - @pytest.mark.asyncio - async def test_get_pmc_id_success(self, tool: PubMedTool) -> None: - """Should convert PMID to PMCID for full-text access.""" - mock_response = { - "records": [ - { - "pmid": "12345678", - "pmcid": "PMC1234567", - } - ] - } - - respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock( - return_value=Response(200, json=mock_response) - ) - - pmcid = await tool.get_pmc_id("12345678") - assert pmcid == "PMC1234567" - - @respx.mock - @pytest.mark.asyncio - async def test_get_pmc_id_not_in_pmc(self, tool: PubMedTool) -> None: - """Should return None if paper not in PMC.""" - mock_response = { - "records": [ - { - "pmid": "12345678", - # No pmcid means not in PMC - } - ] - } - - respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock( - return_value=Response(200, json=mock_response) - ) - - pmcid = await tool.get_pmc_id("12345678") - assert pmcid is None - - @respx.mock - @pytest.mark.asyncio - async def test_get_fulltext_success(self, tool: PubMedTool) -> None: - """Should retrieve full text for PMC papers.""" - # Mock BioC API response - mock_bioc = { - "documents": [ - { - "passages": [ - { - "infons": {"section_type": "INTRO"}, - "text": "Introduction text here.", - }, - { - "infons": {"section_type": "METHODS"}, - "text": "Methods description here.", - }, - { - "infons": {"section_type": "RESULTS"}, - "text": "Results summary here.", - }, - { - "infons": {"section_type": "DISCUSS"}, - "text": "Discussion and conclusions.", - }, - ] - } - ] - } - - respx.get( - "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode" - ).mock(return_value=Response(200, json=mock_bioc)) - - fulltext = await tool.get_fulltext("12345678") - - assert fulltext is not None - assert "Introduction text here" in fulltext - assert "Methods description here" in fulltext - assert "Results summary here" in fulltext - - @respx.mock - @pytest.mark.asyncio - async def test_get_fulltext_not_available(self, tool: PubMedTool) -> None: - """Should return None if full text not available.""" - respx.get( - "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/99999999/unicode" - ).mock(return_value=Response(404)) - - fulltext = await tool.get_fulltext("99999999") - assert fulltext is None - - @respx.mock - @pytest.mark.asyncio - async def test_get_fulltext_structured(self, tool: PubMedTool) 
-> None: - """Should return structured sections dict.""" - mock_bioc = { - "documents": [ - { - "passages": [ - {"infons": {"section_type": "INTRO"}, "text": "Intro..."}, - {"infons": {"section_type": "METHODS"}, "text": "Methods..."}, - {"infons": {"section_type": "RESULTS"}, "text": "Results..."}, - {"infons": {"section_type": "DISCUSS"}, "text": "Discussion..."}, - ] - } - ] - } - - respx.get( - "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode" - ).mock(return_value=Response(200, json=mock_bioc)) - - sections = await tool.get_fulltext_structured("12345678") - - assert sections is not None - assert "introduction" in sections - assert "methods" in sections - assert "results" in sections - assert "discussion" in sections - - @respx.mock - @pytest.mark.asyncio - async def test_search_with_fulltext_enabled(self) -> None: - """Search should include full text when tool is configured for it.""" - # Create tool WITH full-text enabled via constructor - tool = PubMedTool(include_fulltext=True) - - # Mock esearch - respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi").mock( - return_value=Response( - 200, json={"esearchresult": {"idlist": ["12345678"]}} - ) - ) - - # Mock efetch (abstract) - mock_xml = """ - - - - 12345678 -
-                    <ArticleTitle>Test Paper</ArticleTitle>
-                    <Abstract><AbstractText>Short abstract.</AbstractText></Abstract>
-                    <AuthorList><Author><LastName>Smith</LastName></Author></AuthorList>
-                </Article>
-            </MedlineCitation>
-        </PubmedArticle>
-    </PubmedArticleSet>
- """ - respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi").mock( - return_value=Response(200, text=mock_xml) - ) - - # Mock ID converter - respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock( - return_value=Response( - 200, json={"records": [{"pmid": "12345678", "pmcid": "PMC1234567"}]} - ) - ) - - # Mock BioC full text - mock_bioc = { - "documents": [ - { - "passages": [ - {"infons": {"section_type": "INTRO"}, "text": "Full intro..."}, - ] - } - ] - } - respx.get( - "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode" - ).mock(return_value=Response(200, json=mock_bioc)) - - # NOTE: No include_fulltext param - it's set via constructor - results = await tool.search("test", max_results=1) - - assert len(results) == 1 - # Full text should be appended or replace abstract - assert "Full intro" in results[0].content or "Short abstract" in results[0].content -``` - ---- - -### Step 2: Implement Full-Text Methods - -**File**: `src/tools/pubmed.py` (additions to existing class) - -```python -# Add these methods to PubMedTool class - -async def get_pmc_id(self, pmid: str) -> str | None: - """ - Convert PMID to PMCID for full-text access. - - Args: - pmid: PubMed ID - - Returns: - PMCID if paper is in PMC, None otherwise - """ - url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/" - params = {"ids": pmid, "format": "json"} - - async with httpx.AsyncClient(timeout=30.0) as client: - try: - response = await client.get(url, params=params) - response.raise_for_status() - data = response.json() - - records = data.get("records", []) - if records and records[0].get("pmcid"): - return records[0]["pmcid"] - return None - - except httpx.HTTPError: - return None - - -async def get_fulltext(self, pmid: str) -> str | None: - """ - Get full text for a PubMed paper via BioC API. - - Only works for open-access papers in PubMed Central. - - Args: - pmid: PubMed ID - - Returns: - Full text as string, or None if not available - """ - url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode" - - async with httpx.AsyncClient(timeout=60.0) as client: - try: - response = await client.get(url) - if response.status_code == 404: - return None - response.raise_for_status() - data = response.json() - - # Extract text from all passages - documents = data.get("documents", []) - if not documents: - return None - - passages = documents[0].get("passages", []) - text_parts = [p.get("text", "") for p in passages if p.get("text")] - - return "\n\n".join(text_parts) if text_parts else None - - except httpx.HTTPError: - return None - - -async def get_fulltext_structured(self, pmid: str) -> dict[str, str] | None: - """ - Get structured full text with sections. 
- - Args: - pmid: PubMed ID - - Returns: - Dict mapping section names to text, or None if not available - """ - url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode" - - async with httpx.AsyncClient(timeout=60.0) as client: - try: - response = await client.get(url) - if response.status_code == 404: - return None - response.raise_for_status() - data = response.json() - - documents = data.get("documents", []) - if not documents: - return None - - # Map section types to readable names - section_map = { - "INTRO": "introduction", - "METHODS": "methods", - "RESULTS": "results", - "DISCUSS": "discussion", - "CONCL": "conclusion", - "ABSTRACT": "abstract", - } - - sections: dict[str, list[str]] = {} - for passage in documents[0].get("passages", []): - section_type = passage.get("infons", {}).get("section_type", "other") - section_name = section_map.get(section_type, "other") - text = passage.get("text", "") - - if text: - if section_name not in sections: - sections[section_name] = [] - sections[section_name].append(text) - - # Join multiple passages per section - return {k: "\n\n".join(v) for k, v in sections.items()} - - except httpx.HTTPError: - return None -``` - ---- - -### Step 3: Update Constructor and Search Method - -Add full-text flag to constructor and update search to use it: - -```python -class PubMedTool: - """Search tool for PubMed/NCBI.""" - - def __init__( - self, - api_key: str | None = None, - include_fulltext: bool = False, # NEW CONSTRUCTOR PARAM - ) -> None: - self.api_key = api_key or settings.ncbi_api_key - if self.api_key == "your-ncbi-key-here": - self.api_key = None - self._last_request_time = 0.0 - self.include_fulltext = include_fulltext # Store for use in search() - - async def search(self, query: str, max_results: int = 10) -> list[Evidence]: - """ - Search PubMed and return evidence. - - Note: Full-text enrichment is controlled by constructor parameter, - not method parameter, because SearchHandler doesn't pass extra args. - """ - # ... existing search logic ... 
- - evidence_list = self._parse_pubmed_xml(fetch_resp.text) - - # Optionally enrich with full text (if configured at construction) - if self.include_fulltext: - evidence_list = await self._enrich_with_fulltext(evidence_list) - - return evidence_list - - -async def _enrich_with_fulltext( - self, evidence_list: list[Evidence] -) -> list[Evidence]: - """Attempt to add full text to evidence items.""" - enriched = [] - - for evidence in evidence_list: - # Extract PMID from URL - url = evidence.citation.url - pmid = url.rstrip("/").split("/")[-1] if url else None - - if pmid: - fulltext = await self.get_fulltext(pmid) - if fulltext: - # Replace abstract with full text (truncated) - evidence = Evidence( - content=fulltext[:8000], # Larger limit for full text - citation=evidence.citation, - relevance=evidence.relevance, - metadata={ - **evidence.metadata, - "has_fulltext": True, - }, - ) - - enriched.append(evidence) - - return enriched -``` - ---- - -## Demo Script - -**File**: `examples/pubmed_fulltext_demo.py` - -```python -#!/usr/bin/env python3 -"""Demo script to verify PubMed full-text retrieval.""" - -import asyncio -from src.tools.pubmed import PubMedTool - - -async def main(): - """Run PubMed full-text demo.""" - tool = PubMedTool() - - print("=" * 60) - print("PubMed Full-Text Demo") - print("=" * 60) - - # Test 1: Convert PMID to PMCID - print("\n[Test 1] Converting PMID to PMCID...") - # Use a known open-access paper - test_pmid = "34450029" # Example: COVID-related open-access paper - pmcid = await tool.get_pmc_id(test_pmid) - print(f"PMID {test_pmid} -> PMCID: {pmcid or 'Not in PMC'}") - - # Test 2: Get full text - print("\n[Test 2] Fetching full text...") - if pmcid: - fulltext = await tool.get_fulltext(test_pmid) - if fulltext: - print(f"Full text length: {len(fulltext)} characters") - print(f"Preview: {fulltext[:500]}...") - else: - print("Full text not available") - - # Test 3: Get structured sections - print("\n[Test 3] Fetching structured sections...") - if pmcid: - sections = await tool.get_fulltext_structured(test_pmid) - if sections: - print("Available sections:") - for section, text in sections.items(): - print(f" - {section}: {len(text)} chars") - else: - print("Structured text not available") - - # Test 4: Search with full text - print("\n[Test 4] Search with full-text enrichment...") - results = await tool.search( - "metformin cancer open access", - max_results=3, - include_fulltext=True - ) - - for i, evidence in enumerate(results, 1): - has_ft = evidence.metadata.get("has_fulltext", False) - print(f"\n--- Result {i} ---") - print(f"Title: {evidence.citation.title}") - print(f"Has Full Text: {has_ft}") - print(f"Content Length: {len(evidence.content)} chars") - - print("\n" + "=" * 60) - print("Demo complete!") - - -if __name__ == "__main__": - asyncio.run(main()) -``` - ---- - -## Verification Checklist - -### Unit Tests -```bash -# Run full-text tests -uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v - -# Run all PubMed tests -uv run pytest tests/unit/tools/test_pubmed.py -v - -# Expected: All tests pass -``` - -### Integration Test (Manual) -```bash -# Run demo with real API -uv run python examples/pubmed_fulltext_demo.py - -# Expected: Real full text from PMC papers -``` - -### Full Test Suite -```bash -make check -# Expected: All tests pass, mypy clean -``` - ---- - -## Success Criteria - -1. **ID Conversion works**: PMID -> PMCID conversion successful -2. **Full text retrieval works**: BioC API returns paper text -3. 
**Structured sections work**: Can get intro/methods/results/discussion separately -4. **Search integration works**: `include_fulltext=True` enriches results -5. **No regressions**: Existing tests still pass -6. **Graceful degradation**: Non-PMC papers still return abstracts - ---- - -## Notes - -- Only ~30% of PubMed papers have full text in PMC -- BioC API has no documented rate limit, but be respectful -- Full text can be very long - truncate appropriately -- Consider caching full text responses (they don't change) -- Timeout should be longer for full text (60s vs 30s) diff --git a/docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md b/docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md deleted file mode 100644 index 322a2c10194be56a40c1cbdbd54bd49ea0b0246c..0000000000000000000000000000000000000000 --- a/docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md +++ /dev/null @@ -1,540 +0,0 @@ -# Phase 17: Rate Limiting with `limits` Library - -**Priority**: P0 CRITICAL - Prevents API blocks -**Effort**: ~1 hour -**Dependencies**: None - ---- - -## CRITICAL: Async Safety Requirements - -**WARNING**: The rate limiter MUST be async-safe. Blocking the event loop will freeze: -- The Gradio UI -- All parallel searches -- The orchestrator - -**Rules**: -1. **NEVER use `time.sleep()`** - Always use `await asyncio.sleep()` -2. **NEVER use blocking while loops** - Use async-aware polling -3. **The `limits` library check is synchronous** - Wrap it carefully - -The implementation below uses a polling pattern that: -- Checks the limit (synchronous, fast) -- If exceeded, `await asyncio.sleep()` (non-blocking) -- Retry the check - -**Alternative**: If `limits` proves problematic, use `aiolimiter` which is pure-async. - ---- - -## Overview - -Replace naive `asyncio.sleep` rate limiting with proper rate limiter using the `limits` library, which provides: -- Moving window rate limiting -- Per-API configurable limits -- Thread-safe storage -- Already used in reference repo - -**Why This Matters?** -- NCBI will block us without proper rate limiting (3/sec without key, 10/sec with) -- Current implementation only has simple sleep delay -- Need coordinated limits across all PubMed calls -- Professional-grade rate limiting prevents production issues - ---- - -## Current State - -### What We Have (`src/tools/pubmed.py:20-21, 34-41`) - -```python -RATE_LIMIT_DELAY = 0.34 # ~3 requests/sec without API key - -async def _rate_limit(self) -> None: - """Enforce NCBI rate limiting.""" - loop = asyncio.get_running_loop() - now = loop.time() - elapsed = now - self._last_request_time - if elapsed < self.RATE_LIMIT_DELAY: - await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed) - self._last_request_time = loop.time() -``` - -### Problems - -1. **Not shared across instances**: Each `PubMedTool()` has its own counter -2. **Simple delay vs moving window**: Doesn't handle bursts properly -3. **Hardcoded rate**: Doesn't adapt to API key presence -4. **No backoff on 429**: Just retries blindly - ---- - -## TDD Implementation Plan - -### Step 1: Add Dependency - -**File**: `pyproject.toml` - -```toml -dependencies = [ - # ... existing deps ... 
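-    # Moving-window rate limiting for NCBI E-utilities (3/sec without key, 10/sec with key)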
- "limits>=3.0", -] -``` - -Then run: -```bash -uv sync -``` - ---- - -### Step 2: Write the Tests First - -**File**: `tests/unit/tools/test_rate_limiting.py` - -```python -"""Tests for rate limiting functionality.""" - -import asyncio -import time - -import pytest - -from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter - - -class TestRateLimiter: - """Test suite for rate limiter.""" - - def test_create_limiter_without_api_key(self) -> None: - """Should create 3/sec limiter without API key.""" - limiter = RateLimiter(rate="3/second") - assert limiter.rate == "3/second" - - def test_create_limiter_with_api_key(self) -> None: - """Should create 10/sec limiter with API key.""" - limiter = RateLimiter(rate="10/second") - assert limiter.rate == "10/second" - - @pytest.mark.asyncio - async def test_limiter_allows_requests_under_limit(self) -> None: - """Should allow requests under the rate limit.""" - limiter = RateLimiter(rate="10/second") - - # 3 requests should all succeed immediately - for _ in range(3): - allowed = await limiter.acquire() - assert allowed is True - - @pytest.mark.asyncio - async def test_limiter_blocks_when_exceeded(self) -> None: - """Should wait when rate limit exceeded.""" - limiter = RateLimiter(rate="2/second") - - # First 2 should be instant - await limiter.acquire() - await limiter.acquire() - - # Third should block briefly - start = time.monotonic() - await limiter.acquire() - elapsed = time.monotonic() - start - - # Should have waited ~0.5 seconds (half second window for 2/sec) - assert elapsed >= 0.3 - - @pytest.mark.asyncio - async def test_limiter_resets_after_window(self) -> None: - """Rate limit should reset after time window.""" - limiter = RateLimiter(rate="5/second") - - # Use up the limit - for _ in range(5): - await limiter.acquire() - - # Wait for window to pass - await asyncio.sleep(1.1) - - # Should be allowed again - start = time.monotonic() - await limiter.acquire() - elapsed = time.monotonic() - start - - assert elapsed < 0.1 # Should be nearly instant - - -class TestGetPubmedLimiter: - """Test PubMed-specific limiter factory.""" - - def test_limiter_without_api_key(self) -> None: - """Should return 3/sec limiter without key.""" - limiter = get_pubmed_limiter(api_key=None) - assert "3" in limiter.rate - - def test_limiter_with_api_key(self) -> None: - """Should return 10/sec limiter with key.""" - limiter = get_pubmed_limiter(api_key="my-api-key") - assert "10" in limiter.rate - - def test_limiter_is_singleton(self) -> None: - """Same API key should return same limiter instance.""" - limiter1 = get_pubmed_limiter(api_key="key1") - limiter2 = get_pubmed_limiter(api_key="key1") - assert limiter1 is limiter2 - - def test_different_keys_different_limiters(self) -> None: - """Different API keys should return different limiters.""" - limiter1 = get_pubmed_limiter(api_key="key1") - limiter2 = get_pubmed_limiter(api_key="key2") - # Clear cache for clean test - # Actually, different keys SHOULD share the same limiter - # since we're limiting against the same API - assert limiter1 is limiter2 # Shared NCBI rate limit -``` - ---- - -### Step 3: Create Rate Limiter Module - -**File**: `src/tools/rate_limiter.py` - -```python -"""Rate limiting utilities using the limits library.""" - -import asyncio -from typing import ClassVar - -from limits import RateLimitItem, parse -from limits.storage import MemoryStorage -from limits.strategies import MovingWindowRateLimiter - - -class RateLimiter: - """ - Async-compatible rate limiter using limits library. 
- - Uses moving window algorithm for smooth rate limiting. - """ - - def __init__(self, rate: str) -> None: - """ - Initialize rate limiter. - - Args: - rate: Rate string like "3/second" or "10/second" - """ - self.rate = rate - self._storage = MemoryStorage() - self._limiter = MovingWindowRateLimiter(self._storage) - self._rate_limit: RateLimitItem = parse(rate) - self._identity = "default" # Single identity for shared limiting - - async def acquire(self, wait: bool = True) -> bool: - """ - Acquire permission to make a request. - - ASYNC-SAFE: Uses asyncio.sleep(), never time.sleep(). - The polling pattern allows other coroutines to run while waiting. - - Args: - wait: If True, wait until allowed. If False, return immediately. - - Returns: - True if allowed, False if not (only when wait=False) - """ - while True: - # Check if we can proceed (synchronous, fast - ~microseconds) - if self._limiter.hit(self._rate_limit, self._identity): - return True - - if not wait: - return False - - # CRITICAL: Use asyncio.sleep(), NOT time.sleep() - # This yields control to the event loop, allowing other - # coroutines (UI, parallel searches) to run - await asyncio.sleep(0.1) - - def reset(self) -> None: - """Reset the rate limiter (for testing).""" - self._storage.reset() - - -# Singleton limiter for PubMed/NCBI -_pubmed_limiter: RateLimiter | None = None - - -def get_pubmed_limiter(api_key: str | None = None) -> RateLimiter: - """ - Get the shared PubMed rate limiter. - - Rate depends on whether API key is provided: - - Without key: 3 requests/second - - With key: 10 requests/second - - Args: - api_key: NCBI API key (optional) - - Returns: - Shared RateLimiter instance - """ - global _pubmed_limiter - - if _pubmed_limiter is None: - rate = "10/second" if api_key else "3/second" - _pubmed_limiter = RateLimiter(rate) - - return _pubmed_limiter - - -def reset_pubmed_limiter() -> None: - """Reset the PubMed limiter (for testing).""" - global _pubmed_limiter - _pubmed_limiter = None - - -# Factory for other APIs -class RateLimiterFactory: - """Factory for creating/getting rate limiters for different APIs.""" - - _limiters: ClassVar[dict[str, RateLimiter]] = {} - - @classmethod - def get(cls, api_name: str, rate: str) -> RateLimiter: - """ - Get or create a rate limiter for an API. - - Args: - api_name: Unique identifier for the API - rate: Rate limit string (e.g., "10/second") - - Returns: - RateLimiter instance (shared for same api_name) - """ - if api_name not in cls._limiters: - cls._limiters[api_name] = RateLimiter(rate) - return cls._limiters[api_name] - - @classmethod - def reset_all(cls) -> None: - """Reset all limiters (for testing).""" - cls._limiters.clear() -``` - ---- - -### Step 4: Update PubMed Tool - -**File**: `src/tools/pubmed.py` (replace rate limiting code) - -```python -# Replace imports and rate limiting - -from src.tools.rate_limiter import get_pubmed_limiter - - -class PubMedTool: - """Search tool for PubMed/NCBI.""" - - BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils" - HTTP_TOO_MANY_REQUESTS = 429 - - def __init__(self, api_key: str | None = None) -> None: - self.api_key = api_key or settings.ncbi_api_key - if self.api_key == "your-ncbi-key-here": - self.api_key = None - # Use shared rate limiter - self._limiter = get_pubmed_limiter(self.api_key) - - async def _rate_limit(self) -> None: - """Enforce NCBI rate limiting using shared limiter.""" - await self._limiter.acquire() - - # ... rest of class unchanged ... 
-``` - ---- - -### Step 5: Add Rate Limiters for Other APIs - -**File**: `src/tools/clinicaltrials.py` (optional) - -```python -from src.tools.rate_limiter import RateLimiterFactory - - -class ClinicalTrialsTool: - def __init__(self) -> None: - # ClinicalTrials.gov doesn't document limits, but be conservative - self._limiter = RateLimiterFactory.get("clinicaltrials", "5/second") - - async def search(self, query: str, max_results: int = 10) -> list[Evidence]: - await self._limiter.acquire() - # ... rest of method ... -``` - -**File**: `src/tools/europepmc.py` (optional) - -```python -from src.tools.rate_limiter import RateLimiterFactory - - -class EuropePMCTool: - def __init__(self) -> None: - # Europe PMC is generous, but still be respectful - self._limiter = RateLimiterFactory.get("europepmc", "10/second") - - async def search(self, query: str, max_results: int = 10) -> list[Evidence]: - await self._limiter.acquire() - # ... rest of method ... -``` - ---- - -## Demo Script - -**File**: `examples/rate_limiting_demo.py` - -```python -#!/usr/bin/env python3 -"""Demo script to verify rate limiting works correctly.""" - -import asyncio -import time - -from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter, reset_pubmed_limiter -from src.tools.pubmed import PubMedTool - - -async def test_basic_limiter(): - """Test basic rate limiter behavior.""" - print("=" * 60) - print("Rate Limiting Demo") - print("=" * 60) - - # Test 1: Basic limiter - print("\n[Test 1] Testing 3/second limiter...") - limiter = RateLimiter("3/second") - - start = time.monotonic() - for i in range(6): - await limiter.acquire() - elapsed = time.monotonic() - start - print(f" Request {i+1} at {elapsed:.2f}s") - - total = time.monotonic() - start - print(f" Total time for 6 requests: {total:.2f}s (expected ~2s)") - - -async def test_pubmed_limiter(): - """Test PubMed-specific limiter.""" - print("\n[Test 2] Testing PubMed limiter (shared)...") - - reset_pubmed_limiter() # Clean state - - # Without API key: 3/sec - limiter = get_pubmed_limiter(api_key=None) - print(f" Rate without key: {limiter.rate}") - - # Multiple tools should share the same limiter - tool1 = PubMedTool() - tool2 = PubMedTool() - - # Verify they share the limiter - print(f" Tools share limiter: {tool1._limiter is tool2._limiter}") - - -async def test_concurrent_requests(): - """Test rate limiting under concurrent load.""" - print("\n[Test 3] Testing concurrent request limiting...") - - limiter = RateLimiter("5/second") - - async def make_request(i: int): - await limiter.acquire() - return time.monotonic() - - start = time.monotonic() - # Launch 10 concurrent requests - tasks = [make_request(i) for i in range(10)] - times = await asyncio.gather(*tasks) - - # Calculate distribution - relative_times = [t - start for t in times] - print(f" Request times: {[f'{t:.2f}s' for t in sorted(relative_times)]}") - - total = max(relative_times) - print(f" All 10 requests completed in {total:.2f}s (expected ~2s)") - - -async def main(): - await test_basic_limiter() - await test_pubmed_limiter() - await test_concurrent_requests() - - print("\n" + "=" * 60) - print("Demo complete!") - - -if __name__ == "__main__": - asyncio.run(main()) -``` - ---- - -## Verification Checklist - -### Unit Tests -```bash -# Run rate limiting tests -uv run pytest tests/unit/tools/test_rate_limiting.py -v - -# Expected: All tests pass -``` - -### Integration Test (Manual) -```bash -# Run demo -uv run python examples/rate_limiting_demo.py - -# Expected: Requests properly spaced -``` - 
-### Full Test Suite -```bash -make check -# Expected: All tests pass, mypy clean -``` - ---- - -## Success Criteria - -1. **`limits` library installed**: Dependency added to pyproject.toml -2. **RateLimiter class works**: Can create and use limiters -3. **PubMed uses new limiter**: Shared limiter across instances -4. **Rate adapts to API key**: 3/sec without, 10/sec with -5. **Concurrent requests handled**: Multiple async requests properly queued -6. **No regressions**: All existing tests pass - ---- - -## API Rate Limit Reference - -| API | Without Key | With Key | -|-----|-------------|----------| -| PubMed/NCBI | 3/sec | 10/sec | -| ClinicalTrials.gov | Undocumented (~5/sec safe) | N/A | -| Europe PMC | ~10-20/sec (generous) | N/A | -| OpenAlex | ~100k/day (no per-sec limit) | Faster with `mailto` | - ---- - -## Notes - -- `limits` library uses moving window algorithm (fairer than fixed window) -- Singleton pattern ensures all PubMed calls share the limit -- The factory pattern allows easy extension to other APIs -- Consider adding 429 response detection + exponential backoff -- In production, consider Redis storage for distributed rate limiting diff --git a/docs/brainstorming/implementation/README.md b/docs/brainstorming/implementation/README.md deleted file mode 100644 index 6df1769754e718014f30f5a452d8366a0d2065c0..0000000000000000000000000000000000000000 --- a/docs/brainstorming/implementation/README.md +++ /dev/null @@ -1,143 +0,0 @@ -# Implementation Plans - -TDD implementation plans based on the brainstorming documents. Each phase is a self-contained vertical slice with tests, implementation, and demo scripts. - ---- - -## Prerequisites (COMPLETED) - -The following foundational changes have been implemented to support all three phases: - -| Change | File | Status | -|--------|------|--------| -| Add `"openalex"` to `SourceName` | `src/utils/models.py:9` | ✅ Done | -| Add `metadata` field to `Evidence` | `src/utils/models.py:39-42` | ✅ Done | -| Export all tools from `__init__.py` | `src/tools/__init__.py` | ✅ Done | - -All 110 tests pass after these changes. 
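-
-For orientation, the two model changes above amount to roughly the following (illustrative sketch only — defaults, field order, and the exact `SourceName` members are assumptions; `src/utils/models.py` is the source of truth):
-
-```python
-from typing import Any, Literal
-
-from pydantic import BaseModel, Field
-
-# "openalex" joins the allowed source names (other members assumed)
-SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "openalex"]
-
-
-class Citation(BaseModel):
-    """Stub of the existing Citation model (only the fields these plans reference)."""
-
-    title: str = ""
-    url: str | None = None
-
-
-class Evidence(BaseModel):
-    """Evidence item with the new free-form metadata field."""
-
-    content: str
-    citation: Citation
-    relevance: float = 0.0
-    metadata: dict[str, Any] = Field(default_factory=dict)  # e.g. {"has_fulltext": True}
-```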
- ---- - -## Priority Order - -| Phase | Name | Priority | Effort | Value | -|-------|------|----------|--------|-------| -| **17** | Rate Limiting | P0 CRITICAL | 1 hour | Stability | -| **15** | OpenAlex | HIGH | 2-3 hours | Very High | -| **16** | PubMed Full-Text | MEDIUM | 3 hours | High | - -**Recommended implementation order**: 17 → 15 → 16 - ---- - -## Phase 15: OpenAlex Integration - -**File**: [15_PHASE_OPENALEX.md](./15_PHASE_OPENALEX.md) - -Add OpenAlex as 4th data source for: -- Citation networks (who cites whom) -- Concept tagging (semantic discovery) -- 209M+ scholarly works -- Free, no API key required - -**Quick Start**: -```bash -# Create the tool -touch src/tools/openalex.py -touch tests/unit/tools/test_openalex.py - -# Run tests first (TDD) -uv run pytest tests/unit/tools/test_openalex.py -v - -# Demo -uv run python examples/openalex_demo.py -``` - ---- - -## Phase 16: PubMed Full-Text - -**File**: [16_PHASE_PUBMED_FULLTEXT.md](./16_PHASE_PUBMED_FULLTEXT.md) - -Add full-text retrieval via BioC API for: -- Complete paper text (not just abstracts) -- Structured sections (intro, methods, results) -- Better evidence for LLM synthesis - -**Quick Start**: -```bash -# Add methods to existing pubmed.py -# Tests in test_pubmed_fulltext.py - -# Run tests -uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v - -# Demo -uv run python examples/pubmed_fulltext_demo.py -``` - ---- - -## Phase 17: Rate Limiting - -**File**: [17_PHASE_RATE_LIMITING.md](./17_PHASE_RATE_LIMITING.md) - -Replace naive sleep-based rate limiting with `limits` library for: -- Moving window algorithm -- Shared limits across instances -- Configurable per-API rates -- Production-grade stability - -**Quick Start**: -```bash -# Add dependency -uv add limits - -# Create module -touch src/tools/rate_limiter.py -touch tests/unit/tools/test_rate_limiting.py - -# Run tests -uv run pytest tests/unit/tools/test_rate_limiting.py -v - -# Demo -uv run python examples/rate_limiting_demo.py -``` - ---- - -## TDD Workflow - -Each implementation doc follows this pattern: - -1. **Write tests first** - Define expected behavior -2. **Run tests** - Verify they fail (red) -3. **Implement** - Write minimal code to pass -4. **Run tests** - Verify they pass (green) -5. **Refactor** - Clean up if needed -6. **Demo** - Verify end-to-end with real APIs -7. 
**`make check`** - Ensure no regressions - ---- - -## Related Brainstorming Docs - -These implementation plans are derived from: - -- [00_ROADMAP_SUMMARY.md](../00_ROADMAP_SUMMARY.md) - Priority overview -- [01_PUBMED_IMPROVEMENTS.md](../01_PUBMED_IMPROVEMENTS.md) - PubMed details -- [02_CLINICALTRIALS_IMPROVEMENTS.md](../02_CLINICALTRIALS_IMPROVEMENTS.md) - CT.gov details -- [03_EUROPEPMC_IMPROVEMENTS.md](../03_EUROPEPMC_IMPROVEMENTS.md) - Europe PMC details -- [04_OPENALEX_INTEGRATION.md](../04_OPENALEX_INTEGRATION.md) - OpenAlex integration - ---- - -## Future Phases (Not Yet Documented) - -Based on brainstorming, these could be added later: - -- **Phase 18**: ClinicalTrials.gov Results Retrieval -- **Phase 19**: Europe PMC Annotations API -- **Phase 20**: Drug Name Normalization (RxNorm) -- **Phase 21**: Citation Network Queries (OpenAlex) -- **Phase 22**: Semantic Search with Embeddings diff --git a/docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md b/docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md deleted file mode 100644 index 77c443ae9f605904d9c55de3a729e4c06ac3f226..0000000000000000000000000000000000000000 --- a/docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md +++ /dev/null @@ -1,189 +0,0 @@ -# Situation Analysis: Pydantic-AI + Microsoft Agent Framework Integration - -**Date:** November 27, 2025 -**Status:** ACTIVE DECISION REQUIRED -**Risk Level:** HIGH - DO NOT MERGE PR #41 UNTIL RESOLVED - ---- - -## 1. The Problem - -We almost merged a refactor that would have **deleted** multi-agent orchestration capability from the codebase, mistakenly believing pydantic-ai and Microsoft Agent Framework were mutually exclusive. - -**They are not.** They are complementary: -- **pydantic-ai** (Library): Ensures LLM outputs match Pydantic schemas -- **Microsoft Agent Framework** (Framework): Orchestrates multi-agent workflows - ---- - -## 2. Current Branch State - -| Branch | Location | Has Agent Framework? | Has Pydantic-AI Improvements? | Status | -|--------|----------|---------------------|------------------------------|--------| -| `origin/dev` | GitHub | YES | NO | **SAFE - Source of Truth** | -| `huggingface-upstream/dev` | HF Spaces | YES | NO | **SAFE - Same as GitHub** | -| `origin/main` | GitHub | YES | NO | **SAFE** | -| `feat/pubmed-fulltext` | GitHub | NO (deleted) | YES | **DANGER - Has destructive refactor** | -| `refactor/pydantic-unification` | Local | NO (deleted) | YES | **DANGER - Redundant, delete** | -| Local `dev` | Local only | NO (deleted) | YES | **DANGER - NOT PUSHED (thankfully)** | - -### Key Files at Risk - -**On `origin/dev` (PRESERVED):** -```text -src/agents/ -├── analysis_agent.py # StatisticalAnalyzer wrapper -├── hypothesis_agent.py # Hypothesis generation -├── judge_agent.py # JudgeHandler wrapper -├── magentic_agents.py # Multi-agent definitions -├── report_agent.py # Report synthesis -├── search_agent.py # SearchHandler wrapper -├── state.py # Thread-safe state management -└── tools.py # @ai_function decorated tools - -src/orchestrator_magentic.py # Multi-agent orchestrator -src/utils/llm_factory.py # Centralized LLM client factory -``` - -**Deleted in refactor branch (would be lost if merged):** -- All of the above - ---- - -## 3. 
Target Architecture - -```text -┌─────────────────────────────────────────────────────────────────┐ -│ Microsoft Agent Framework (Orchestration Layer) │ -│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ -│ │ SearchAgent │→ │ JudgeAgent │→ │ ReportAgent │ │ -│ │ (BaseAgent) │ │ (BaseAgent) │ │ (BaseAgent) │ │ -│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ -│ │ │ │ │ -│ ▼ ▼ ▼ │ -│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ -│ │ pydantic-ai │ │ pydantic-ai │ │ pydantic-ai │ │ -│ │ Agent() │ │ Agent() │ │ Agent() │ │ -│ │ output_type= │ │ output_type= │ │ output_type= │ │ -│ │ SearchResult │ │ JudgeAssess │ │ Report │ │ -│ └──────────────┘ └──────────────┘ └──────────────┘ │ -└─────────────────────────────────────────────────────────────────┘ -``` - -**Why this architecture:** -1. **Agent Framework** handles: workflow coordination, state passing, middleware, observability -2. **pydantic-ai** handles: type-safe LLM calls within each agent - ---- - -## 4. CRITICAL: Naming Confusion Clarification - -> **Senior Agent Review Finding:** The codebase uses "magentic" in file names (e.g., `orchestrator_magentic.py`, `magentic_agents.py`) but this is **NOT** the `magentic` PyPI package by Jacky Liang. It's Microsoft Agent Framework (`agent-framework-core`). - -**The naming confusion:** -- `magentic` (PyPI package): A different library for structured LLM outputs -- "Magentic" (in our codebase): Our internal name for Microsoft Agent Framework integration -- `agent-framework-core` (PyPI package): Microsoft's actual multi-agent orchestration framework - -**Recommended future action:** Rename `orchestrator_magentic.py` → `orchestrator_advanced.py` to eliminate confusion. - ---- - -## 5. What the Refactor DID Get Right - -The refactor branch (`feat/pubmed-fulltext`) has some valuable improvements: - -1. **`judges.py` unified `get_model()`** - Supports OpenAI, Anthropic, AND HuggingFace via pydantic-ai -2. **HuggingFace free tier support** - `HuggingFaceModel` integration -3. **Test fix** - Properly mocks `HuggingFaceModel` class -4. **Removed broken magentic optional dependency** from pyproject.toml (this was correct - the old `magentic` package is different from Microsoft Agent Framework) - -**What it got WRONG:** -1. Deleted `src/agents/` entirely instead of refactoring them -2. Deleted `src/orchestrator_magentic.py` instead of fixing it -3. Conflated "magentic" (old package) with "Microsoft Agent Framework" (current framework) - ---- - -## 6. Options for Path Forward - -### Option A: Abandon Refactor, Start Fresh -- Close PR #41 -- Delete `feat/pubmed-fulltext` and `refactor/pydantic-unification` branches -- Reset local `dev` to match `origin/dev` -- Cherry-pick ONLY the good parts (judges.py improvements, HF support) -- **Pros:** Clean, safe -- **Cons:** Lose some work, need to redo carefully - -### Option B: Cherry-Pick Good Parts to origin/dev -- Do NOT merge PR #41 -- Create new branch from `origin/dev` -- Cherry-pick specific commits/changes that improve pydantic-ai usage -- Keep agent framework code intact -- **Pros:** Preserves both, surgical -- **Cons:** Requires careful file-by-file review - -### Option C: Revert Deletions in Refactor Branch -- On `feat/pubmed-fulltext`, restore deleted agent files from `origin/dev` -- Keep the pydantic-ai improvements -- Merge THAT to dev -- **Pros:** Gets both -- **Cons:** Complex git operations, risk of conflicts - ---- - -## 7. Recommended Action: Option B (Cherry-Pick) - -**Step-by-step:** - -1. **Close PR #41** (do not merge) -2. 
**Delete redundant branches:** - - `refactor/pydantic-unification` (local) - - Reset local `dev` to `origin/dev` -3. **Create new branch from origin/dev:** - ```bash - git checkout -b feat/pydantic-ai-improvements origin/dev - ``` -4. **Cherry-pick or manually port these improvements:** - - `src/agent_factory/judges.py` - the unified `get_model()` function - - `examples/free_tier_demo.py` - HuggingFace demo - - Test improvements -5. **Do NOT delete any agent framework files** -6. **Create PR for review** - ---- - -## 8. Files to Cherry-Pick (Safe Improvements) - -| File | What Changed | Safe to Port? | -|------|-------------|---------------| -| `src/agent_factory/judges.py` | Added `HuggingFaceModel` support in `get_model()` | YES | -| `examples/free_tier_demo.py` | New demo for HF inference | YES | -| `tests/unit/agent_factory/test_judges.py` | Fixed HF model mocking | YES | -| `pyproject.toml` | Removed old `magentic` optional dep | MAYBE (review carefully) | - ---- - -## 9. Questions to Answer Before Proceeding - -1. **For the hackathon**: Do we need full multi-agent orchestration, or is single-agent sufficient? -2. **For DeepCritical mainline**: Is the plan to use Microsoft Agent Framework for orchestration? -3. **Timeline**: How much time do we have to get this right? - ---- - -## 10. Immediate Actions (DO NOW) - -- [ ] **DO NOT merge PR #41** -- [ ] Close PR #41 with comment explaining the situation -- [ ] Do not push local `dev` branch anywhere -- [ ] Confirm HuggingFace Spaces is untouched (it is - verified) - ---- - -## 11. Decision Log - -| Date | Decision | Rationale | -|------|----------|-----------| -| 2025-11-27 | Pause refactor merge | Discovered agent framework and pydantic-ai are complementary, not exclusive | -| TBD | ? | Awaiting decision on path forward | diff --git a/docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md b/docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md deleted file mode 100644 index 7886c89b807f1dbfb54e878bb326715ad62675f9..0000000000000000000000000000000000000000 --- a/docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md +++ /dev/null @@ -1,289 +0,0 @@ -# Architecture Specification: Dual-Mode Agent System - -**Date:** November 27, 2025 -**Status:** SPECIFICATION -**Goal:** Graceful degradation from full multi-agent orchestration to simple single-agent mode - ---- - -## 1. Core Concept: Two Operating Modes - -```text -┌─────────────────────────────────────────────────────────────────────┐ -│ USER REQUEST │ -│ │ │ -│ ▼ │ -│ ┌─────────────────┐ │ -│ │ Mode Selection │ │ -│ │ (Auto-detect) │ │ -│ └────────┬────────┘ │ -│ │ │ -│ ┌───────────────┴───────────────┐ │ -│ │ │ │ -│ ▼ ▼ │ -│ ┌─────────────────┐ ┌─────────────────┐ │ -│ │ SIMPLE MODE │ │ ADVANCED MODE │ │ -│ │ (Free Tier) │ │ (Paid Tier) │ │ -│ │ │ │ │ │ -│ │ pydantic-ai │ │ MS Agent Fwk │ │ -│ │ single-agent │ │ + pydantic-ai │ │ -│ │ loop │ │ multi-agent │ │ -│ └─────────────────┘ └─────────────────┘ │ -│ │ │ │ -│ └───────────────┬───────────────┘ │ -│ ▼ │ -│ ┌─────────────────┐ │ -│ │ Research Report │ │ -│ │ with Citations │ │ -│ └─────────────────┘ │ -└─────────────────────────────────────────────────────────────────────┘ -``` - ---- - -## 2. 
Mode Comparison - -| Aspect | Simple Mode | Advanced Mode | -|--------|-------------|---------------| -| **Trigger** | No API key OR `LLM_PROVIDER=huggingface` | OpenAI API key present (currently OpenAI only) | -| **Framework** | pydantic-ai only | Microsoft Agent Framework + pydantic-ai | -| **Architecture** | Single orchestrator loop | Multi-agent coordination | -| **Agents** | One agent does Search→Judge→Report | SearchAgent, JudgeAgent, ReportAgent, AnalysisAgent | -| **State Management** | Simple dict | Thread-safe `MagenticState` with context vars | -| **Quality** | Good (functional) | Better (specialized agents, coordination) | -| **Cost** | Free (HuggingFace Inference) | Paid (OpenAI/Anthropic) | -| **Use Case** | Demos, hackathon, budget-constrained | Production, research quality | - ---- - -## 3. Simple Mode Architecture (pydantic-ai Only) - -```text -┌─────────────────────────────────────────────────────┐ -│ Orchestrator │ -│ │ -│ while not sufficient and iteration < max: │ -│ 1. SearchHandler.execute(query) │ -│ 2. JudgeHandler.assess(evidence) ◄── pydantic-ai Agent │ -│ 3. if sufficient: break │ -│ 4. query = judge.next_queries │ -│ │ -│ return ReportGenerator.generate(evidence) │ -└─────────────────────────────────────────────────────┘ -``` - -**Components:** -- `src/orchestrator.py` - Simple loop orchestrator -- `src/agent_factory/judges.py` - JudgeHandler with pydantic-ai -- `src/tools/search_handler.py` - Scatter-gather search -- `src/tools/pubmed.py`, `clinicaltrials.py`, `europepmc.py` - Search tools - ---- - -## 4. Advanced Mode Architecture (MS Agent Framework + pydantic-ai) - -```text -┌─────────────────────────────────────────────────────────────────────┐ -│ Microsoft Agent Framework Orchestrator │ -│ │ -│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ -│ │ SearchAgent │───▶│ JudgeAgent │───▶│ ReportAgent │ │ -│ │ (BaseAgent) │ │ (BaseAgent) │ │ (BaseAgent) │ │ -│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ -│ │ │ │ │ -│ ▼ ▼ ▼ │ -│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ -│ │ pydantic-ai │ │ pydantic-ai │ │ pydantic-ai │ │ -│ │ Agent() │ │ Agent() │ │ Agent() │ │ -│ │ output_type=│ │ output_type=│ │ output_type=│ │ -│ │ SearchResult│ │ JudgeAssess │ │ Report │ │ -│ └─────────────┘ └─────────────┘ └─────────────┘ │ -│ │ -│ Shared State: MagenticState (thread-safe via contextvars) │ -│ - evidence: list[Evidence] │ -│ - embedding_service: EmbeddingService │ -└─────────────────────────────────────────────────────────────────────┘ -``` - -**Components:** -- `src/orchestrator_magentic.py` - Multi-agent orchestrator -- `src/agents/search_agent.py` - SearchAgent (BaseAgent) -- `src/agents/judge_agent.py` - JudgeAgent (BaseAgent) -- `src/agents/report_agent.py` - ReportAgent (BaseAgent) -- `src/agents/analysis_agent.py` - AnalysisAgent (BaseAgent) -- `src/agents/state.py` - Thread-safe state management -- `src/agents/tools.py` - @ai_function decorated tools - ---- - -## 5. Mode Selection Logic - -```python -# src/orchestrator_factory.py (actual implementation) - -def create_orchestrator( - search_handler: SearchHandlerProtocol | None = None, - judge_handler: JudgeHandlerProtocol | None = None, - config: OrchestratorConfig | None = None, - mode: Literal["simple", "magentic", "advanced"] | None = None, -) -> Any: - """ - Auto-select orchestrator based on available credentials. - - Priority: - 1. If mode explicitly set, use that - 2. If OpenAI key available -> Advanced Mode (currently OpenAI only) - 3. 
Otherwise -> Simple Mode (HuggingFace free tier) - """ - effective_mode = _determine_mode(mode) - - if effective_mode == "advanced": - orchestrator_cls = _get_magentic_orchestrator_class() - return orchestrator_cls(max_rounds=config.max_iterations if config else 10) - - # Simple mode requires handlers - if search_handler is None or judge_handler is None: - raise ValueError("Simple mode requires search_handler and judge_handler") - - return Orchestrator( - search_handler=search_handler, - judge_handler=judge_handler, - config=config, - ) -``` - ---- - -## 6. Shared Components (Both Modes Use) - -These components work in both modes: - -| Component | Purpose | -|-----------|---------| -| `src/tools/pubmed.py` | PubMed search | -| `src/tools/clinicaltrials.py` | ClinicalTrials.gov search | -| `src/tools/europepmc.py` | Europe PMC search | -| `src/tools/search_handler.py` | Scatter-gather orchestration | -| `src/tools/rate_limiter.py` | Rate limiting | -| `src/utils/models.py` | Evidence, Citation, JudgeAssessment | -| `src/utils/config.py` | Settings | -| `src/services/embeddings.py` | Vector search (optional) | - ---- - -## 7. pydantic-ai Integration Points - -Both modes use pydantic-ai for structured LLM outputs: - -```python -# In JudgeHandler (both modes) -from pydantic_ai import Agent -from pydantic_ai.models.huggingface import HuggingFaceModel -from pydantic_ai.models.openai import OpenAIModel -from pydantic_ai.models.anthropic import AnthropicModel - -class JudgeHandler: - def __init__(self, model: Any = None): - self.model = model or get_model() # Auto-selects based on config - self.agent = Agent( - model=self.model, - output_type=JudgeAssessment, # Structured output! - system_prompt=SYSTEM_PROMPT, - ) - - async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment: - result = await self.agent.run(format_prompt(question, evidence)) - return result.output # Guaranteed to be JudgeAssessment -``` - ---- - -## 8. Microsoft Agent Framework Integration Points - -Advanced mode wraps pydantic-ai agents in BaseAgent: - -```python -# In JudgeAgent (advanced mode only) -from agent_framework import BaseAgent, AgentRunResponse, ChatMessage, Role - -class JudgeAgent(BaseAgent): - def __init__(self, judge_handler: JudgeHandlerProtocol): - super().__init__( - name="JudgeAgent", - description="Evaluates evidence quality", - ) - self._handler = judge_handler # Uses pydantic-ai internally - - async def run(self, messages, **kwargs) -> AgentRunResponse: - question = extract_question(messages) - evidence = self._evidence_store.get("current", []) - - # Delegate to pydantic-ai powered handler - assessment = await self._handler.assess(question, evidence) - - return AgentRunResponse( - messages=[ChatMessage(role=Role.ASSISTANT, text=format_response(assessment))], - additional_properties={"assessment": assessment.model_dump()}, - ) -``` - ---- - -## 9. Benefits of This Architecture - -1. **Graceful Degradation**: Works without API keys (free tier) -2. **Progressive Enhancement**: Better with API keys (orchestration) -3. **Code Reuse**: pydantic-ai handlers shared between modes -4. **Hackathon Ready**: Demo works without requiring paid keys -5. **Production Ready**: Full orchestration available when needed -6. **Future Proof**: Can add more agents to advanced mode -7. **Testable**: Simple mode is easier to unit test - ---- - -## 10. 
Known Risks and Mitigations - -> **From Senior Agent Review** - -### 10.1 Bridge Complexity (MEDIUM) - -**Risk:** In Advanced Mode, agents (Agent Framework) wrap handlers (pydantic-ai). Both are async. Context variables (`MagenticState`) must propagate correctly through the pydantic-ai call stack. - -**Mitigation:** -- pydantic-ai uses standard Python `contextvars`, which naturally propagate through `await` chains -- Test context propagation explicitly in integration tests -- If issues arise, pass state explicitly rather than via context vars - -### 10.2 Integration Drift (MEDIUM) - -**Risk:** Simple Mode and Advanced Mode might diverge in behavior over time (e.g., Simple Mode uses logic A, Advanced Mode uses logic B). - -**Mitigation:** -- Both modes MUST call the exact same underlying Tools (`src/tools/*`) and Handlers (`src/agent_factory/*`) -- Handlers are the single source of truth for business logic -- Agents are thin wrappers that delegate to handlers - -### 10.3 Testing Burden (LOW-MEDIUM) - -**Risk:** Two distinct orchestrators (`src/orchestrator.py` and `src/orchestrator_magentic.py`) doubles integration testing surface area. - -**Mitigation:** -- Unit test handlers independently (shared code) -- Integration tests for each mode separately -- End-to-end tests verify same output for same input (determinism permitting) - -### 10.4 Dependency Conflicts (LOW) - -**Risk:** `agent-framework-core` might conflict with `pydantic-ai`'s dependencies (e.g., different pydantic versions). - -**Status:** Both use `pydantic>=2.x`. Should be compatible. - ---- - -## 11. Naming Clarification - -> See `00_SITUATION_AND_PLAN.md` Section 4 for full details. - -**Important:** The codebase uses "magentic" in file names (`orchestrator_magentic.py`, `magentic_agents.py`) but this refers to our internal naming for Microsoft Agent Framework integration, **NOT** the `magentic` PyPI package. - -**Future action:** Rename to `orchestrator_advanced.py` to eliminate confusion. diff --git a/docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md b/docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md deleted file mode 100644 index 37e2791a4123e3f2e78d2c750ddc77eff7d05814..0000000000000000000000000000000000000000 --- a/docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md +++ /dev/null @@ -1,112 +0,0 @@ -# Implementation Phases: Dual-Mode Agent System - -**Date:** November 27, 2025 -**Status:** IMPLEMENTATION PLAN (REVISED) -**Strategy:** TDD (Test-Driven Development), SOLID Principles -**Dependency Strategy:** PyPI (agent-framework-core) - ---- - -## Phase 0: Environment Validation & Cleanup - -**Goal:** Ensure clean state and dependencies are correctly installed. - -### Step 0.1: Verify PyPI Package -The `agent-framework-core` package is published on PyPI by Microsoft. Verify installation: - -```bash -uv sync --all-extras -python -c "from agent_framework import ChatAgent; print('OK')" -``` - -### Step 0.2: Branch State -We are on `feat/dual-mode-architecture`. Ensure it is up to date with `origin/dev` before starting. - -**Note:** The `reference_repos/agent-framework` folder is kept for reference/documentation only. -The production dependency uses the official PyPI release. - ---- - -## Phase 1: Pydantic-AI Improvements (Simple Mode) - -**Goal:** Implement `HuggingFaceModel` support in `JudgeHandler` using strict TDD. 
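-
-To make the target concrete before the TDD steps, here is a rough sketch of the provider selection `get_model()` should end up performing. Every name below (`settings.llm_provider`, `settings.huggingface_model`, `settings.openai_model`, the import path, and the model constructor arguments) is an assumption to be pinned down in Steps 1.1 and 1.2 — consult the pydantic-ai docs for the real signatures:
-
-```python
-from pydantic_ai.models.huggingface import HuggingFaceModel
-from pydantic_ai.models.openai import OpenAIModel
-
-from src.utils.config import settings  # assumed import path
-
-
-def get_model() -> HuggingFaceModel | OpenAIModel:
-    """Select the judge model from configuration (sketch only)."""
-    if settings.llm_provider == "huggingface":
-        # Free tier via HuggingFace Inference; constructor args are an assumption
-        return HuggingFaceModel(settings.huggingface_model)
-    # Paid fallback; the model-name setting is an assumption
-    return OpenAIModel(settings.openai_model)
-```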
- -### Step 1.1: Test First (Red) -Create `tests/unit/agent_factory/test_judges_factory.py`: -- Test `get_model()` returns `HuggingFaceModel` when `LLM_PROVIDER=huggingface`. -- Test `get_model()` respects `HF_TOKEN`. -- Test fallback to OpenAI. - -### Step 1.2: Implementation (Green) -Update `src/utils/config.py`: -- Add `huggingface_model` and `hf_token` fields. - -Update `src/agent_factory/judges.py`: -- Implement `get_model` with the logic derived from the tests. -- Use dependency injection for the model where possible. - -### Step 1.3: Refactor -Ensure `JudgeHandler` is loosely coupled from the specific model provider. - ---- - -## Phase 2: Orchestrator Factory (The Switch) - -**Goal:** Implement the factory pattern to switch between Simple and Advanced modes. - -### Step 2.1: Test First (Red) -Create `tests/unit/test_orchestrator_factory.py`: -- Test `create_orchestrator` returns `Orchestrator` (simple) when API keys are missing. -- Test `create_orchestrator` returns `MagenticOrchestrator` (advanced) when OpenAI key exists. -- Test explicit mode override. - -### Step 2.2: Implementation (Green) -Update `src/orchestrator_factory.py` to implement the selection logic. - ---- - -## Phase 3: Agent Framework Integration (Advanced Mode) - -**Goal:** Integrate Microsoft Agent Framework from PyPI. - -### Step 3.1: Dependency Management -The `agent-framework-core` package is installed from PyPI: -```toml -[project.optional-dependencies] -magentic = [ - "agent-framework-core>=1.0.0b251120,<2.0.0", # Microsoft Agent Framework (PyPI) -] -``` -Install with: `uv sync --all-extras` - -### Step 3.2: Verify Imports (Test First) -Create `tests/unit/agents/test_agent_imports.py`: -- Verify `from agent_framework import ChatAgent` works. -- Verify instantiation of `ChatAgent` with a mock client. - -### Step 3.3: Update Agents -Refactor `src/agents/*.py` to ensure they match the exact signature of the local `ChatAgent` class. -- **SOLID:** Ensure agents have single responsibilities. -- **DRY:** Share tool definitions between Pydantic-AI simple mode and Agent Framework advanced mode. - ---- - -## Phase 4: UI & End-to-End Verification - -**Goal:** Update Gradio to reflect the active mode. - -### Step 4.1: UI Updates -Update `src/app.py` to display "Simple Mode" vs "Advanced Mode". - -### Step 4.2: End-to-End Test -Run the full loop: -1. Simple Mode (No Keys) -> Search -> Judge (HF) -> Report. -2. Advanced Mode (OpenAI Key) -> SearchAgent -> JudgeAgent -> ReportAgent. - ---- - -## Phase 5: Cleanup & Documentation - -- Remove unused code. -- Update main README.md. -- Final `make check`. \ No newline at end of file diff --git a/docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md b/docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md deleted file mode 100644 index b09b6db248a8ddb37fa8f6c2deba01c929f694a4..0000000000000000000000000000000000000000 --- a/docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md +++ /dev/null @@ -1,112 +0,0 @@ -# Immediate Actions Checklist - -**Date:** November 27, 2025 -**Priority:** Execute in order - ---- - -## Before Starting Implementation - -### 1. Close PR #41 (CRITICAL) - -```bash -gh pr close 41 --comment "Architecture decision changed. Cherry-picking improvements to preserve both pydantic-ai and Agent Framework capabilities." -``` - -### 2. 
Verify HuggingFace Spaces is Safe - -```bash -# Should show agent framework files exist -git ls-tree --name-only huggingface-upstream/dev -- src/agents/ -git ls-tree --name-only huggingface-upstream/dev -- src/orchestrator_magentic.py -``` - -Expected output: Files should exist (they do as of this writing). - -### 3. Clean Local Environment - -```bash -# Switch to main first -git checkout main - -# Delete problematic branches -git branch -D refactor/pydantic-unification 2>/dev/null || true -git branch -D feat/pubmed-fulltext 2>/dev/null || true - -# Reset local dev to origin/dev -git branch -D dev 2>/dev/null || true -git checkout -b dev origin/dev - -# Verify agent framework code exists -ls src/agents/ -# Expected: __init__.py, analysis_agent.py, hypothesis_agent.py, judge_agent.py, -# magentic_agents.py, report_agent.py, search_agent.py, state.py, tools.py - -ls src/orchestrator_magentic.py -# Expected: file exists -``` - -### 4. Create Fresh Feature Branch - -```bash -git checkout -b feat/dual-mode-architecture origin/dev -``` - ---- - -## Decision Points - -Before proceeding, confirm: - -1. **For hackathon**: Do we need advanced mode, or is simple mode sufficient? - - Simple mode = faster to implement, works today - - Advanced mode = better quality, more work - -2. **Timeline**: How much time do we have? - - If < 1 day: Focus on simple mode only - - If > 1 day: Implement dual-mode - -3. **Dependencies**: Is `agent-framework-core` available? - - Check: `pip index versions agent-framework-core` - - If not on PyPI, may need to install from GitHub - ---- - -## Quick Start (Simple Mode Only) - -If time is limited, implement only simple mode improvements: - -```bash -# On feat/dual-mode-architecture branch - -# 1. Update judges.py to add HuggingFace support -# 2. Update config.py to add HF settings -# 3. Create free_tier_demo.py -# 4. Run make check -# 5. Create PR to dev -``` - -This gives you free-tier capability without touching agent framework code. - ---- - -## Quick Start (Full Dual-Mode) - -If time permits, implement full dual-mode: - -Follow phases 1-6 in `02_IMPLEMENTATION_PHASES.md` - ---- - -## Emergency Rollback - -If anything goes wrong: - -```bash -# Reset to safe state -git checkout main -git branch -D feat/dual-mode-architecture -git checkout -b feat/dual-mode-architecture origin/dev -``` - -Origin/dev is the safe fallback - it has agent framework intact. diff --git a/docs/brainstorming/magentic-pydantic/04_FOLLOWUP_REVIEW_REQUEST.md b/docs/brainstorming/magentic-pydantic/04_FOLLOWUP_REVIEW_REQUEST.md deleted file mode 100644 index 98b021373d2b3928be993b791ac5a9197503c92a..0000000000000000000000000000000000000000 --- a/docs/brainstorming/magentic-pydantic/04_FOLLOWUP_REVIEW_REQUEST.md +++ /dev/null @@ -1,158 +0,0 @@ -# Follow-Up Review Request: Did We Implement Your Feedback? - -**Date:** November 27, 2025 -**Context:** You previously reviewed our dual-mode architecture plan and provided feedback. We have updated the documentation. Please verify we correctly implemented your recommendations. - ---- - -## Your Original Feedback vs Our Changes - -### 1. Naming Confusion Clarification - -**Your feedback:** "You are using Microsoft Agent Framework, but you've named your integration 'Magentic'. This caused the confusion." - -**Our change:** Added Section 4 in `00_SITUATION_AND_PLAN.md`: -```markdown -## 4. 
CRITICAL: Naming Confusion Clarification - -> **Senior Agent Review Finding:** The codebase uses "magentic" in file names -> (e.g., `orchestrator_magentic.py`, `magentic_agents.py`) but this is **NOT** -> the `magentic` PyPI package by Jacky Liang. It's Microsoft Agent Framework. - -**The naming confusion:** -- `magentic` (PyPI package): A different library for structured LLM outputs -- "Magentic" (in our codebase): Our internal name for Microsoft Agent Framework integration -- `agent-framework-core` (PyPI package): Microsoft's actual multi-agent orchestration framework - -**Recommended future action:** Rename `orchestrator_magentic.py` → `orchestrator_advanced.py` -``` - -**Status:** ✅ IMPLEMENTED - ---- - -### 2. Bridge Complexity Warning - -**Your feedback:** "You must ensure MagenticState (context vars) propagates correctly through the pydantic-ai call stack." - -**Our change:** Added Section 10.1 in `01_ARCHITECTURE_SPEC.md`: -```markdown -### 10.1 Bridge Complexity (MEDIUM) - -**Risk:** In Advanced Mode, agents (Agent Framework) wrap handlers (pydantic-ai). -Both are async. Context variables (`MagenticState`) must propagate correctly. - -**Mitigation:** -- pydantic-ai uses standard Python `contextvars`, which naturally propagate through `await` chains -- Test context propagation explicitly in integration tests -- If issues arise, pass state explicitly rather than via context vars -``` - -**Status:** ✅ IMPLEMENTED - ---- - -### 3. Integration Drift Warning - -**Your feedback:** "Simple Mode and Advanced Mode might diverge in behavior." - -**Our change:** Added Section 10.2 in `01_ARCHITECTURE_SPEC.md`: -```markdown -### 10.2 Integration Drift (MEDIUM) - -**Risk:** Simple Mode and Advanced Mode might diverge in behavior over time. - -**Mitigation:** -- Both modes MUST call the exact same underlying Tools (`src/tools/*`) and Handlers (`src/agent_factory/*`) -- Handlers are the single source of truth for business logic -- Agents are thin wrappers that delegate to handlers -``` - -**Status:** ✅ IMPLEMENTED - ---- - -### 4. Testing Burden Warning - -**Your feedback:** "You now have two distinct orchestrators to maintain. This doubles your integration testing surface area." - -**Our change:** Added Section 10.3 in `01_ARCHITECTURE_SPEC.md`: -```markdown -### 10.3 Testing Burden (LOW-MEDIUM) - -**Risk:** Two distinct orchestrators doubles integration testing surface area. - -**Mitigation:** -- Unit test handlers independently (shared code) -- Integration tests for each mode separately -- End-to-end tests verify same output for same input -``` - -**Status:** ✅ IMPLEMENTED - ---- - -### 5. Rename Recommendation - -**Your feedback:** "Rename `src/orchestrator_magentic.py` to `src/orchestrator_advanced.py`" - -**Our change:** Added Step 3.4 in `02_IMPLEMENTATION_PHASES.md`: -```markdown -### Step 3.4: (OPTIONAL) Rename "Magentic" to "Advanced" - -> **Senior Agent Recommendation:** Rename files to eliminate confusion. - -git mv src/orchestrator_magentic.py src/orchestrator_advanced.py -git mv src/agents/magentic_agents.py src/agents/advanced_agents.py - -**Note:** This is optional for the hackathon. Can be done in a follow-up PR. -``` - -**Status:** ✅ DOCUMENTED (marked as optional for hackathon) - ---- - -### 6. Standardize Wrapper Recommendation - -**Your feedback:** "Create a generic `PydanticAiAgentWrapper(BaseAgent)` class instead of manually wrapping each handler." - -**Our change:** NOT YET DOCUMENTED - -**Status:** ⚠️ NOT IMPLEMENTED - Should we add this? 
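-
-If we do adopt it, a first sketch (modelled on the `JudgeAgent` example in `01_ARCHITECTURE_SPEC.md` Section 8; the uniform `handler.run()` protocol and the last-message prompt extraction are assumptions) might look like:
-
-```python
-from typing import Any
-
-from agent_framework import AgentRunResponse, BaseAgent, ChatMessage, Role
-
-
-class PydanticAiAgentWrapper(BaseAgent):
-    """Generic bridge exposing a pydantic-ai handler as an Agent Framework agent."""
-
-    def __init__(self, name: str, description: str, handler: Any) -> None:
-        super().__init__(name=name, description=description)
-        self._handler = handler  # assumed: any object with `async run(prompt) -> BaseModel`
-
-    async def run(self, messages, **kwargs) -> AgentRunResponse:
-        # Assumed: the last message's text carries the task for this agent
-        prompt = messages[-1].text if messages else ""
-        result = await self._handler.run(prompt)  # structured pydantic-ai output
-        return AgentRunResponse(
-            messages=[ChatMessage(role=Role.ASSISTANT, text=str(result))],
-            additional_properties={"output": result.model_dump()},
-        )
-```
-
-Each concrete agent (Search, Judge, Report) would then be a single constructor call rather than a bespoke subclass, keeping agents as thin delegating wrappers per `01_ARCHITECTURE_SPEC.md` Section 10.2.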
- ---- - -## Questions for Your Review - -1. **Did we correctly implement your feedback?** Are there any misunderstandings in how we interpreted your recommendations? - -2. **Is the "Standardize Wrapper" recommendation critical?** Should we add it to the implementation phases, or is it a nice-to-have for later? - -3. **Dependency versioning:** You noted `agent-framework-core>=1.0.0b251120` might be ephemeral. Should we: - - Pin to a specific version? - - Use a version range? - - Install from GitHub source? - -4. **Anything else we missed?** - ---- - -## Files to Re-Review - -1. `00_SITUATION_AND_PLAN.md` - Added Section 4 (Naming Clarification) -2. `01_ARCHITECTURE_SPEC.md` - Added Sections 10-11 (Risks, Naming) -3. `02_IMPLEMENTATION_PHASES.md` - Added Step 3.4 (Optional Rename) - ---- - -## Current Branch State - -We are now on `feat/dual-mode-architecture` branched from `origin/dev`: -- ✅ Agent framework code intact (`src/agents/`, `src/orchestrator_magentic.py`) -- ✅ Documentation committed -- ❌ PR #41 still open (need to close it) -- ❌ Cherry-pick of pydantic-ai improvements not yet done - ---- - -Please confirm: **GO / NO-GO** to proceed with Phase 1 (cherry-picking pydantic-ai improvements)? diff --git a/docs/brainstorming/magentic-pydantic/REVIEW_PROMPT_FOR_SENIOR_AGENT.md b/docs/brainstorming/magentic-pydantic/REVIEW_PROMPT_FOR_SENIOR_AGENT.md deleted file mode 100644 index 9f25b1f52a79193a28d4d5f9029cdfece1928be5..0000000000000000000000000000000000000000 --- a/docs/brainstorming/magentic-pydantic/REVIEW_PROMPT_FOR_SENIOR_AGENT.md +++ /dev/null @@ -1,113 +0,0 @@ -# Senior Agent Review Prompt - -Copy and paste everything below this line to a fresh Claude/AI session: - ---- - -## Context - -I am a junior developer working on a HuggingFace hackathon project called DeepCritical. We made a significant architectural mistake and are now trying to course-correct. I need you to act as a **senior staff engineer** and critically review our proposed solution. - -## The Situation - -We almost merged a refactor that would have **deleted** our multi-agent orchestration capability, mistakenly believing that `pydantic-ai` (a library for structured LLM outputs) and Microsoft's `agent-framework` (a framework for multi-agent orchestration) were mutually exclusive alternatives. - -**They are not.** They are complementary: -- `pydantic-ai` ensures LLM responses match Pydantic schemas (type-safe outputs) -- `agent-framework` orchestrates multiple agents working together (coordination layer) - -We now want to implement a **dual-mode architecture** where: -- **Simple Mode (No API key):** Uses only pydantic-ai with HuggingFace free tier -- **Advanced Mode (With API key):** Uses Microsoft Agent Framework for orchestration, with pydantic-ai inside each agent for structured outputs - -## Your Task - -Please perform a **deep, critical review** of: - -1. **The architecture diagram** (image attached: `assets/magentic-pydantic.png`) -2. **Our documentation** (4 files listed below) -3. **The actual codebase** to verify our claims - -## Specific Questions to Answer - -### Architecture Validation -1. Is our understanding correct that pydantic-ai and agent-framework are complementary, not competing? -2. Does the dual-mode architecture diagram accurately represent how these should integrate? -3. Are there any architectural flaws or anti-patterns in our proposed design? - -### Documentation Accuracy -4. Are the branch states we documented accurate? (Check `git log`, `git ls-tree`) -5. 
Is our understanding of what code exists where correct? -6. Are the implementation phases realistic and in the correct order? -7. Are there any missing steps or dependencies we overlooked? - -### Codebase Reality Check -8. Does `origin/dev` actually have the agent framework code intact? Verify by checking: - - `git ls-tree origin/dev -- src/agents/` - - `git ls-tree origin/dev -- src/orchestrator_magentic.py` -9. What does the current `src/agents/` code actually import? Does it use `agent_framework` or `agent-framework-core`? -10. Is the `agent-framework-core` package actually available on PyPI, or do we need to install from source? - -### Implementation Feasibility -11. Can the cherry-pick strategy we outlined actually work, or are there merge conflicts we're not seeing? -12. Is the mode auto-detection logic sound? -13. What are the risks we haven't identified? - -### Critical Errors Check -14. Did we miss anything critical in our analysis? -15. Are there any factual errors in our documentation? -16. Would a Google/DeepMind senior engineer approve this plan, or would they flag issues? - -## Files to Review - -Please read these files in order: - -1. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md` -2. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md` -3. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md` -4. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md` - -And the architecture diagram: -5. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/assets/magentic-pydantic.png` - -## Reference Repositories to Consult - -We have local clones of the source-of-truth repositories: - -- **Original DeepCritical:** `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/reference_repos/DeepCritical/` -- **Microsoft Agent Framework:** `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/reference_repos/agent-framework/` -- **Microsoft AutoGen:** `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/reference_repos/autogen-microsoft/` - -Please cross-reference our hackathon fork against these to verify architectural alignment. - -## Codebase to Analyze - -Our hackathon fork is at: -`/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/` - -Key files to examine: -- `src/agents/` - Agent framework integration -- `src/agent_factory/judges.py` - pydantic-ai integration -- `src/orchestrator.py` - Simple mode orchestrator -- `src/orchestrator_magentic.py` - Advanced mode orchestrator -- `src/orchestrator_factory.py` - Mode selection -- `pyproject.toml` - Dependencies - -## Expected Output - -Please provide: - -1. **Validation Summary:** Is our plan sound? (YES/NO with explanation) -2. **Errors Found:** List any factual errors in our documentation -3. **Missing Items:** What did we overlook? -4. **Risk Assessment:** What could go wrong? -5. **Recommended Changes:** Specific edits to our documentation or plan -6. **Go/No-Go Recommendation:** Should we proceed with this plan? - -## Tone - -Be brutally honest. If our plan is flawed, say so directly. We would rather know now than after implementation. Don't soften criticism - we need accuracy. 
- ---- - -END OF PROMPT diff --git a/docs/bugs/FIX_PLAN_MAGENTIC_MODE.md b/docs/bugs/FIX_PLAN_MAGENTIC_MODE.md deleted file mode 100644 index a02e1a19a1de2b1937c7d181873879fbb1f1ddfb..0000000000000000000000000000000000000000 --- a/docs/bugs/FIX_PLAN_MAGENTIC_MODE.md +++ /dev/null @@ -1,227 +0,0 @@ -# Fix Plan: Magentic Mode Report Generation - -**Related Bug**: `P0_MAGENTIC_MODE_BROKEN.md` -**Approach**: Test-Driven Development (TDD) -**Estimated Scope**: 4 tasks, ~2-3 hours - ---- - -## Problem Summary - -Magentic mode runs but fails to produce readable reports due to: - -1. **Primary Bug**: `MagenticFinalResultEvent.message` returns `ChatMessage` object, not text -2. **Secondary Bug**: Max rounds (3) reached before ReportAgent completes -3. **Tertiary Issues**: Stale "bioRxiv" references in prompts - ---- - -## Fix Order (TDD) - -### Phase 1: Write Failing Tests - -**Task 1.1**: Create test for ChatMessage text extraction - -```python -# tests/unit/test_orchestrator_magentic.py - -def test_process_event_extracts_text_from_chat_message(): - """Final result event should extract text from ChatMessage object.""" - # Arrange: Mock ChatMessage with .content attribute - # Act: Call _process_event with MagenticFinalResultEvent - # Assert: Returned AgentEvent.message is a string, not object repr -``` - -**Task 1.2**: Create test for max rounds configuration - -```python -def test_orchestrator_uses_configured_max_rounds(): - """MagenticOrchestrator should use max_rounds from constructor.""" - # Arrange: Create orchestrator with max_rounds=10 - # Act: Build workflow - # Assert: Workflow has max_round_count=10 -``` - -**Task 1.3**: Create test for bioRxiv reference removal - -```python -def test_task_prompt_references_europe_pmc(): - """Task prompt should reference Europe PMC, not bioRxiv.""" - # Arrange: Create orchestrator - # Act: Check task string in run() - # Assert: Contains "Europe PMC", not "bioRxiv" -``` - ---- - -### Phase 2: Fix ChatMessage Text Extraction - -**File**: `src/orchestrator_magentic.py` -**Lines**: 192-199 - -**Current Code**: -```python -elif isinstance(event, MagenticFinalResultEvent): - text = event.message.text if event.message else "No result" -``` - -**Fixed Code**: -```python -elif isinstance(event, MagenticFinalResultEvent): - if event.message: - # ChatMessage may have .content or .text depending on version - if hasattr(event.message, 'content') and event.message.content: - text = str(event.message.content) - elif hasattr(event.message, 'text') and event.message.text: - text = str(event.message.text) - else: - # Fallback: convert entire message to string - text = str(event.message) - else: - text = "No result generated" -``` - -**Why**: The `agent_framework.ChatMessage` object structure may vary. We need defensive extraction. - ---- - -### Phase 3: Fix Max Rounds Configuration - -**File**: `src/orchestrator_magentic.py` -**Lines**: 97-99 - -**Current Code**: -```python -.with_standard_manager( - chat_client=manager_client, - max_round_count=self._max_rounds, # Already uses config - max_stall_count=3, - max_reset_count=2, -) -``` - -**Issue**: Default `max_rounds` in `__init__` is 10, but workflow may need more for complex queries. - -**Fix**: Verify the value flows through correctly. Add logging. 
- -```python -logger.info( - "Building Magentic workflow", - max_rounds=self._max_rounds, - max_stall=3, - max_reset=2, -) -``` - -**Also check**: `src/orchestrator_factory.py` passes config correctly: -```python -return MagenticOrchestrator( - max_rounds=config.max_iterations if config else 10, -) -``` - ---- - -### Phase 4: Fix Stale bioRxiv References - -**Files to update**: - -| File | Line | Change | -|------|------|--------| -| `src/orchestrator_magentic.py` | 131 | "bioRxiv" → "Europe PMC" | -| `src/agents/magentic_agents.py` | 32-33 | "bioRxiv" → "Europe PMC" | -| `src/app.py` | 202-203 | "bioRxiv" → "Europe PMC" | - -**Search command to verify**: -```bash -grep -rn "bioRxiv\|biorxiv" src/ -``` - ---- - -## Implementation Checklist - -``` -[ ] Phase 1: Write failing tests - [ ] 1.1 Test ChatMessage text extraction - [ ] 1.2 Test max rounds configuration - [ ] 1.3 Test Europe PMC references - -[ ] Phase 2: Fix ChatMessage extraction - [ ] Update _process_event() in orchestrator_magentic.py - [ ] Run test 1.1 - should pass - -[ ] Phase 3: Fix max rounds - [ ] Add logging to _build_workflow() - [ ] Verify factory passes config correctly - [ ] Run test 1.2 - should pass - -[ ] Phase 4: Fix bioRxiv references - [ ] Update orchestrator_magentic.py task prompt - [ ] Update magentic_agents.py descriptions - [ ] Update app.py UI text - [ ] Run test 1.3 - should pass - [ ] Run grep to verify no remaining refs - -[ ] Final Verification - [ ] make check passes - [ ] All tests pass (108+) - [ ] Manual test: run_magentic.py produces readable report -``` - ---- - -## Test Commands - -```bash -# Run specific test file -uv run pytest tests/unit/test_orchestrator_magentic.py -v - -# Run all tests -uv run pytest tests/unit/ -v - -# Full check -make check - -# Manual integration test -set -a && source .env && set +a -uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer" -``` - ---- - -## Success Criteria - -1. `run_magentic.py` outputs a readable research report (not ``) -2. Report includes: Executive Summary, Key Findings, Drug Candidates, References -3. No "Max round count reached" error with default settings -4. No "bioRxiv" references anywhere in codebase -5. All 108+ tests pass -6. `make check` passes - ---- - -## Files Modified - -``` -src/ -├── orchestrator_magentic.py # ChatMessage fix, logging -├── agents/magentic_agents.py # bioRxiv → Europe PMC -└── app.py # bioRxiv → Europe PMC - -tests/unit/ -└── test_orchestrator_magentic.py # NEW: 3 tests -``` - ---- - -## Notes for AI Agent - -When implementing this fix plan: - -1. **DO NOT** create mock data or fake responses -2. **DO** write real tests that verify actual behavior -3. **DO** run `make check` after each phase -4. **DO** test with real OpenAI API key via `.env` -5. **DO** preserve existing functionality - simple mode must still work -6. 
**DO NOT** over-engineer - minimal changes to fix the specific bugs diff --git a/docs/bugs/P0_MAGENTIC_MODE_BROKEN.md b/docs/bugs/P0_MAGENTIC_MODE_BROKEN.md deleted file mode 100644 index 5df9c0ee27df1b416923f445b08be928f34432a2..0000000000000000000000000000000000000000 --- a/docs/bugs/P0_MAGENTIC_MODE_BROKEN.md +++ /dev/null @@ -1,116 +0,0 @@ -# P0 Bug: Magentic Mode Returns ChatMessage Object Instead of Report Text - -**Status**: OPEN -**Priority**: P0 (Critical) -**Date**: 2025-11-27 - ---- - -## Actual Bug Found (Not What We Thought) - -**The OpenAI key works fine.** The real bug is different: - -### The Problem - -When Magentic mode completes, the final report returns a `ChatMessage` object instead of the actual text: - -``` -FINAL REPORT: - -``` - -### Evidence - -Full test output shows: -1. Magentic orchestrator starts correctly -2. SearchAgent finds evidence -3. HypothesisAgent generates hypotheses -4. JudgeAgent evaluates -5. **BUT**: Final output is `ChatMessage` object, not text - -### Root Cause - -In `src/orchestrator_magentic.py` line 193: - -```python -elif isinstance(event, MagenticFinalResultEvent): - text = event.message.text if event.message else "No result" -``` - -The `event.message` is a `ChatMessage` object, and `.text` may not extract the content correctly, or the message structure changed in the agent-framework library. - ---- - -## Secondary Issue: Max Rounds Reached - -The orchestrator hits max rounds before producing a report: - -``` -[ERROR] Magentic Orchestrator: Max round count reached -``` - -This means the workflow times out before the ReportAgent synthesizes the final output. - ---- - -## What Works - -- OpenAI API key: **Works** (loaded from .env) -- SearchAgent: **Works** (finds evidence from PubMed, ClinicalTrials, Europe PMC) -- HypothesisAgent: **Works** (generates Drug -> Target -> Pathway chains) -- JudgeAgent: **Partial** (evaluates but sometimes loses context) - ---- - -## Files to Fix - -| File | Line | Issue | -|------|------|-------| -| `src/orchestrator_magentic.py` | 193 | `event.message.text` returns object, not string | -| `src/orchestrator_magentic.py` | 97-99 | `max_round_count=3` too low for full pipeline | - ---- - -## Suggested Fix - -```python -# In _process_event, line 192-199 -elif isinstance(event, MagenticFinalResultEvent): - # Handle ChatMessage object properly - if event.message: - if hasattr(event.message, 'content'): - text = event.message.content - elif hasattr(event.message, 'text'): - text = event.message.text - else: - text = str(event.message) - else: - text = "No result" -``` - -And increase rounds: - -```python -# In _build_workflow, line 97 -max_round_count=self._max_rounds, # Use configured value, default 10 -``` - ---- - -## Test Command - -```bash -set -a && source .env && set +a && uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer" -``` - ---- - -## Simple Mode Works - -For reference, simple mode produces full reports: - -```bash -uv run python examples/orchestrator_demo/run_agent.py "metformin alzheimer" -``` - -Output includes structured report with Drug Candidates, Key Findings, etc. 
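A small regression test could lock the expected behaviour in once the suggested fix lands (a sketch; the helper below mirrors the defensive extraction proposed above rather than the shipped `_process_event` code, and `SimpleNamespace` stands in for the real `ChatMessage`):

```python
# tests/unit/test_orchestrator_magentic.py (sketch)
from types import SimpleNamespace


def _extract_final_text(message: object | None) -> str:
    """Mirror of the defensive extraction suggested above (assumption, not shipped code)."""
    if message is None:
        return "No result"
    content = getattr(message, "content", None)
    if content:
        return str(content)
    text = getattr(message, "text", None)
    if text:
        return str(text)
    return str(message)


def test_final_result_text_is_a_string_not_an_object_repr() -> None:
    fake_message = SimpleNamespace(content="# Final Report\n...", text=None)

    result = _extract_final_text(fake_message)

    assert isinstance(result, str)
    assert result.startswith("# Final Report")
```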
diff --git a/docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md b/docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md deleted file mode 100644 index 7197b1ec4ef09ea29b98cc447994264e6b4b0f54..0000000000000000000000000000000000000000 --- a/docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md +++ /dev/null @@ -1,81 +0,0 @@ -# P1 Bug: Gradio Settings Accordion Not Collapsing - -**Priority**: P1 (UX Bug) -**Status**: OPEN -**Date**: 2025-11-27 -**Target Component**: `src/app.py` - ---- - -## 1. Problem Description - -The "Settings" accordion in the Gradio UI (containing Orchestrator Mode, API Key, Provider) fails to collapse, even when configured with `open=False`. It remains permanently expanded, cluttering the interface and obscuring the chat history. - -### Symptoms -- Accordion arrow toggles visually, but content remains visible. -- Occurs in both local development (`uv run src/app.py`) and HuggingFace Spaces. - ---- - -## 2. Root Cause Analysis - -**Definitive Cause**: Nested `Blocks` Context Bug. -`gr.ChatInterface` is itself a high-level abstraction that creates a `gr.Blocks` context. Wrapping `gr.ChatInterface` inside an external `with gr.Blocks():` context causes event listener conflicts, specifically breaking the JavaScript state management for `additional_inputs_accordion`. - -**Reference**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) confirms that `additional_inputs_accordion` malfunctions when `ChatInterface` is not the top-level block. - ---- - -## 3. Solution Strategy: "The Unwrap Fix" - -We will remove the redundant `gr.Blocks` wrapper. This restores the native behavior of `ChatInterface`, ensuring the accordion respects `open=False`. - -### Implementation Plan - -**Refactor `src/app.py` / `create_demo()`**: - -1. **Remove** the `with gr.Blocks() as demo:` context manager. -2. **Instantiate** `gr.ChatInterface` directly as the `demo` object. -3. **Migrate UI Elements**: - * **Header**: Move the H1/Title text into the `title` parameter of `ChatInterface`. - * **Footer**: Move the footer text ("MCP Server Active...") into the `description` parameter. `ChatInterface` supports Markdown in `description`, making it the ideal place for static info below the title but above the chat. - -### Before (Buggy) -```python -def create_demo(): - with gr.Blocks() as demo: # <--- CAUSE OF BUG - gr.Markdown("# Title") - gr.ChatInterface(..., additional_inputs_accordion=gr.Accordion(open=False)) - gr.Markdown("Footer") - return demo -``` - -### After (Correct) -```python -def create_demo(): - return gr.ChatInterface( # <--- FIX: Top-level component - ..., - title="🧬 DeepCritical", - description="*AI-Powered Drug Repurposing Agent...*\n\n---\n**MCP Server Active**...", - additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False) - ) -``` - ---- - -## 4. Validation - -1. **Run**: `uv run python src/app.py` -2. **Check**: Open `http://localhost:7860` -3. **Verify**: - * Settings accordion starts **COLLAPSED**. - * Header title ("DeepCritical") is visible. - * Footer text ("MCP Server Active") is visible in the description area. - * Chat functionality works (Magentic/Simple modes). - ---- - -## 5. Constraints & Notes - -- **Layout**: We lose the ability to place arbitrary elements *below* the chat box (footer will move to top, under title), but this is an acceptable trade-off for a working UI. -- **CSS**: `ChatInterface` handles its own CSS; any custom class styling from the previous footer will be standardized to the description text style. 
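If an automated guard is wanted alongside the manual validation above, a minimal smoke test can assert the top-level type so a future refactor cannot silently re-introduce the `gr.Blocks` wrapper (a sketch; it assumes `create_demo()` is importable from `src.app` without side effects):

```python
# tests/unit/test_app_ui.py (sketch)
import gradio as gr

from src.app import create_demo


def test_create_demo_returns_top_level_chat_interface() -> None:
    """The demo must be a bare ChatInterface, not a ChatInterface nested in Blocks."""
    demo = create_demo()
    assert isinstance(demo, gr.ChatInterface)
```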
\ No newline at end of file diff --git a/docs/configuration/CONFIGURATION.md b/docs/configuration/CONFIGURATION.md new file mode 100644 index 0000000000000000000000000000000000000000..b33bdcce1d71eb39664e84dc94f25eca451f70e5 --- /dev/null +++ b/docs/configuration/CONFIGURATION.md @@ -0,0 +1,743 @@ +# Configuration Guide + +## Overview + +DeepCritical uses **Pydantic Settings** for centralized configuration management. All settings are defined in the `Settings` class in `src/utils/config.py` and can be configured via environment variables or a `.env` file. + +The configuration system provides: + +- **Type Safety**: Strongly-typed fields with Pydantic validation +- **Environment File Support**: Automatically loads from `.env` file (if present) +- **Case-Insensitive**: Environment variables are case-insensitive +- **Singleton Pattern**: Global `settings` instance for easy access throughout the codebase +- **Validation**: Automatic validation on load with helpful error messages + +## Quick Start + +1. Create a `.env` file in the project root +2. Set at least one LLM API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or `HF_TOKEN`) +3. Optionally configure other services as needed +4. The application will automatically load and validate your configuration + +## Configuration System Architecture + +### Settings Class + +The `Settings` class extends `BaseSettings` from `pydantic_settings` and defines all application configuration: + +```13:21:src/utils/config.py +class Settings(BaseSettings): + """Strongly-typed application settings.""" + + model_config = SettingsConfigDict( + env_file=".env", + env_file_encoding="utf-8", + case_sensitive=False, + extra="ignore", + ) +``` + +### Singleton Instance + +A global `settings` instance is available for import: + +```234:235:src/utils/config.py +# Singleton for easy import +settings = get_settings() +``` + +### Usage Pattern + +Access configuration throughout the codebase: + +```python +from src.utils.config import settings + +# Check if API keys are available +if settings.has_openai_key: + # Use OpenAI + pass + +# Access configuration values +max_iterations = settings.max_iterations +web_search_provider = settings.web_search_provider +``` + +## Required Configuration + +### LLM Provider + +You must configure at least one LLM provider. 
The system supports: + +- **OpenAI**: Requires `OPENAI_API_KEY` +- **Anthropic**: Requires `ANTHROPIC_API_KEY` +- **HuggingFace**: Optional `HF_TOKEN` or `HUGGINGFACE_API_KEY` (can work without key for public models) + +#### OpenAI Configuration + +```bash +LLM_PROVIDER=openai +OPENAI_API_KEY=your_openai_api_key_here +OPENAI_MODEL=gpt-5.1 +``` + +The default model is defined in the `Settings` class: + +```29:29:src/utils/config.py + openai_model: str = Field(default="gpt-5.1", description="OpenAI model name") +``` + +#### Anthropic Configuration + +```bash +LLM_PROVIDER=anthropic +ANTHROPIC_API_KEY=your_anthropic_api_key_here +ANTHROPIC_MODEL=claude-sonnet-4-5-20250929 +``` + +The default model is defined in the `Settings` class: + +```30:32:src/utils/config.py + anthropic_model: str = Field( + default="claude-sonnet-4-5-20250929", description="Anthropic model" + ) +``` + +#### HuggingFace Configuration + +HuggingFace can work without an API key for public models, but an API key provides higher rate limits: + +```bash +# Option 1: Using HF_TOKEN (preferred) +HF_TOKEN=your_huggingface_token_here + +# Option 2: Using HUGGINGFACE_API_KEY (alternative) +HUGGINGFACE_API_KEY=your_huggingface_api_key_here + +# Default model +HUGGINGFACE_MODEL=meta-llama/Llama-3.1-8B-Instruct +``` + +The HuggingFace token can be set via either environment variable: + +```33:35:src/utils/config.py + hf_token: str | None = Field( + default=None, alias="HF_TOKEN", description="HuggingFace API token" + ) +``` + +```57:59:src/utils/config.py + huggingface_api_key: str | None = Field( + default=None, description="HuggingFace API token (HF_TOKEN or HUGGINGFACE_API_KEY)" + ) +``` + +## Optional Configuration + +### Embedding Configuration + +DeepCritical supports multiple embedding providers for semantic search and RAG: + +```bash +# Embedding Provider: "openai", "local", or "huggingface" +EMBEDDING_PROVIDER=local + +# OpenAI Embedding Model (used by LlamaIndex RAG) +OPENAI_EMBEDDING_MODEL=text-embedding-3-small + +# Local Embedding Model (sentence-transformers, used by EmbeddingService) +LOCAL_EMBEDDING_MODEL=all-MiniLM-L6-v2 + +# HuggingFace Embedding Model +HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 +``` + +The embedding provider configuration: + +```47:50:src/utils/config.py + embedding_provider: Literal["openai", "local", "huggingface"] = Field( + default="local", + description="Embedding provider to use", + ) +``` + +**Note**: OpenAI embeddings require `OPENAI_API_KEY`. The local provider (default) uses sentence-transformers and requires no API key. + +### Web Search Configuration + +DeepCritical supports multiple web search providers: + +```bash +# Web Search Provider: "serper", "searchxng", "brave", "tavily", or "duckduckgo" +# Default: "duckduckgo" (no API key required) +WEB_SEARCH_PROVIDER=duckduckgo + +# Serper API Key (for Google search via Serper) +SERPER_API_KEY=your_serper_api_key_here + +# SearchXNG Host URL (for self-hosted search) +SEARCHXNG_HOST=http://localhost:8080 + +# Brave Search API Key +BRAVE_API_KEY=your_brave_api_key_here + +# Tavily API Key +TAVILY_API_KEY=your_tavily_api_key_here +``` + +The web search provider configuration: + +```71:74:src/utils/config.py + web_search_provider: Literal["serper", "searchxng", "brave", "tavily", "duckduckgo"] = Field( + default="duckduckgo", + description="Web search provider to use", + ) +``` + +**Note**: DuckDuckGo is the default and requires no API key, making it ideal for development and testing. 
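For quick experiments it can be handy to see which provider a given environment resolves to before launching the app (a sketch; constructing a throwaway `Settings` instance with keyword overrides is illustrative only and is not how the application wires its own configuration):

```python
# Hypothetical one-off script to inspect web search configuration.
from src.utils.config import Settings, settings

# The global singleton reflects whatever .env / environment variables provide.
print(settings.web_search_provider, settings.web_search_available)

# Keyword arguments take priority over the environment for this instance only,
# so a provider can be tried out without editing .env.
trial = Settings(web_search_provider="serper", serper_api_key="your_serper_api_key_here")
print(trial.web_search_provider, trial.web_search_available)  # serper True
```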
+ +### PubMed Configuration + +PubMed search supports optional NCBI API key for higher rate limits: + +```bash +# NCBI API Key (optional, for higher rate limits: 10 req/sec vs 3 req/sec) +NCBI_API_KEY=your_ncbi_api_key_here +``` + +The PubMed tool uses this configuration: + +```22:29:src/tools/pubmed.py + def __init__(self, api_key: str | None = None) -> None: + self.api_key = api_key or settings.ncbi_api_key + # Ignore placeholder values from .env.example + if self.api_key == "your-ncbi-key-here": + self.api_key = None + + # Use shared rate limiter + self._limiter = get_pubmed_limiter(self.api_key) +``` + +### Agent Configuration + +Control agent behavior and research loop execution: + +```bash +# Maximum iterations per research loop (1-50, default: 10) +MAX_ITERATIONS=10 + +# Search timeout in seconds +SEARCH_TIMEOUT=30 + +# Use graph-based execution for research flows +USE_GRAPH_EXECUTION=false +``` + +The agent configuration fields: + +```80:85:src/utils/config.py + # Agent Configuration + max_iterations: int = Field(default=10, ge=1, le=50) + search_timeout: int = Field(default=30, description="Seconds to wait for search") + use_graph_execution: bool = Field( + default=False, description="Use graph-based execution for research flows" + ) +``` + +### Budget & Rate Limiting Configuration + +Control resource limits for research loops: + +```bash +# Default token budget per research loop (1000-1000000, default: 100000) +DEFAULT_TOKEN_LIMIT=100000 + +# Default time limit per research loop in minutes (1-120, default: 10) +DEFAULT_TIME_LIMIT_MINUTES=10 + +# Default iterations limit per research loop (1-50, default: 10) +DEFAULT_ITERATIONS_LIMIT=10 +``` + +The budget configuration with validation: + +```87:105:src/utils/config.py + # Budget & Rate Limiting Configuration + default_token_limit: int = Field( + default=100000, + ge=1000, + le=1000000, + description="Default token budget per research loop", + ) + default_time_limit_minutes: int = Field( + default=10, + ge=1, + le=120, + description="Default time limit per research loop (minutes)", + ) + default_iterations_limit: int = Field( + default=10, + ge=1, + le=50, + description="Default iterations limit per research loop", + ) +``` + +### RAG Service Configuration + +Configure the Retrieval-Augmented Generation service: + +```bash +# ChromaDB collection name for RAG +RAG_COLLECTION_NAME=deepcritical_evidence + +# Number of top results to retrieve from RAG (1-50, default: 5) +RAG_SIMILARITY_TOP_K=5 + +# Automatically ingest evidence into RAG +RAG_AUTO_INGEST=true +``` + +The RAG configuration: + +```127:141:src/utils/config.py + # RAG Service Configuration + rag_collection_name: str = Field( + default="deepcritical_evidence", + description="ChromaDB collection name for RAG", + ) + rag_similarity_top_k: int = Field( + default=5, + ge=1, + le=50, + description="Number of top results to retrieve from RAG", + ) + rag_auto_ingest: bool = Field( + default=True, + description="Automatically ingest evidence into RAG", + ) +``` + +### ChromaDB Configuration + +Configure the vector database for embeddings and RAG: + +```bash +# ChromaDB storage path +CHROMA_DB_PATH=./chroma_db + +# Whether to persist ChromaDB to disk +CHROMA_DB_PERSIST=true + +# ChromaDB server host (for remote ChromaDB, optional) +CHROMA_DB_HOST=localhost + +# ChromaDB server port (for remote ChromaDB, optional) +CHROMA_DB_PORT=8000 +``` + +The ChromaDB configuration: + +```113:125:src/utils/config.py + chroma_db_path: str = Field(default="./chroma_db", description="ChromaDB storage 
path") + chroma_db_persist: bool = Field( + default=True, + description="Whether to persist ChromaDB to disk", + ) + chroma_db_host: str | None = Field( + default=None, + description="ChromaDB server host (for remote ChromaDB)", + ) + chroma_db_port: int | None = Field( + default=None, + description="ChromaDB server port (for remote ChromaDB)", + ) +``` + +### External Services + +#### Modal Configuration + +Modal is used for secure sandbox execution of statistical analysis: + +```bash +# Modal Token ID (for Modal sandbox execution) +MODAL_TOKEN_ID=your_modal_token_id_here + +# Modal Token Secret +MODAL_TOKEN_SECRET=your_modal_token_secret_here +``` + +The Modal configuration: + +```110:112:src/utils/config.py + # External Services + modal_token_id: str | None = Field(default=None, description="Modal token ID") + modal_token_secret: str | None = Field(default=None, description="Modal token secret") +``` + +### Logging Configuration + +Configure structured logging: + +```bash +# Log Level: "DEBUG", "INFO", "WARNING", or "ERROR" +LOG_LEVEL=INFO +``` + +The logging configuration: + +```107:108:src/utils/config.py + # Logging + log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = "INFO" +``` + +Logging is configured via the `configure_logging()` function: + +```212:231:src/utils/config.py +def configure_logging(settings: Settings) -> None: + """Configure structured logging with the configured log level.""" + # Set stdlib logging level from settings + logging.basicConfig( + level=getattr(logging, settings.log_level), + format="%(message)s", + ) + + structlog.configure( + processors=[ + structlog.stdlib.filter_by_level, + structlog.stdlib.add_logger_name, + structlog.stdlib.add_log_level, + structlog.processors.TimeStamper(fmt="iso"), + structlog.processors.JSONRenderer(), + ], + wrapper_class=structlog.stdlib.BoundLogger, + context_class=dict, + logger_factory=structlog.stdlib.LoggerFactory(), + ) +``` + +## Configuration Properties + +The `Settings` class provides helpful properties for checking configuration state: + +### API Key Availability + +Check which API keys are available: + +```171:189:src/utils/config.py + @property + def has_openai_key(self) -> bool: + """Check if OpenAI API key is available.""" + return bool(self.openai_api_key) + + @property + def has_anthropic_key(self) -> bool: + """Check if Anthropic API key is available.""" + return bool(self.anthropic_api_key) + + @property + def has_huggingface_key(self) -> bool: + """Check if HuggingFace API key is available.""" + return bool(self.huggingface_api_key or self.hf_token) + + @property + def has_any_llm_key(self) -> bool: + """Check if any LLM API key is available.""" + return self.has_openai_key or self.has_anthropic_key or self.has_huggingface_key +``` + +**Usage:** + +```python +from src.utils.config import settings + +# Check API key availability +if settings.has_openai_key: + # Use OpenAI + pass + +if settings.has_anthropic_key: + # Use Anthropic + pass + +if settings.has_huggingface_key: + # Use HuggingFace + pass + +if settings.has_any_llm_key: + # At least one LLM is available + pass +``` + +### Service Availability + +Check if external services are configured: + +```143:146:src/utils/config.py + @property + def modal_available(self) -> bool: + """Check if Modal credentials are configured.""" + return bool(self.modal_token_id and self.modal_token_secret) +``` + +```191:204:src/utils/config.py + @property + def web_search_available(self) -> bool: + """Check if web search is available (either no-key provider or API 
key present).""" + if self.web_search_provider == "duckduckgo": + return True # No API key required + if self.web_search_provider == "serper": + return bool(self.serper_api_key) + if self.web_search_provider == "searchxng": + return bool(self.searchxng_host) + if self.web_search_provider == "brave": + return bool(self.brave_api_key) + if self.web_search_provider == "tavily": + return bool(self.tavily_api_key) + return False +``` + +**Usage:** + +```python +from src.utils.config import settings + +# Check service availability +if settings.modal_available: + # Use Modal sandbox + pass + +if settings.web_search_available: + # Web search is configured + pass +``` + +### API Key Retrieval + +Get the API key for the configured provider: + +```148:160:src/utils/config.py + def get_api_key(self) -> str: + """Get the API key for the configured provider.""" + if self.llm_provider == "openai": + if not self.openai_api_key: + raise ConfigurationError("OPENAI_API_KEY not set") + return self.openai_api_key + + if self.llm_provider == "anthropic": + if not self.anthropic_api_key: + raise ConfigurationError("ANTHROPIC_API_KEY not set") + return self.anthropic_api_key + + raise ConfigurationError(f"Unknown LLM provider: {self.llm_provider}") +``` + +For OpenAI-specific operations (e.g., Magentic mode): + +```162:169:src/utils/config.py + def get_openai_api_key(self) -> str: + """Get OpenAI API key (required for Magentic function calling).""" + if not self.openai_api_key: + raise ConfigurationError( + "OPENAI_API_KEY not set. Magentic mode requires OpenAI for function calling. " + "Use mode='simple' for other providers." + ) + return self.openai_api_key +``` + +## Configuration Usage in Codebase + +The configuration system is used throughout the codebase: + +### LLM Factory + +The LLM factory uses settings to create appropriate models: + +```129:144:src/utils/llm_factory.py + if settings.llm_provider == "huggingface": + model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct" + hf_provider = HuggingFaceProvider(api_key=settings.hf_token) + return HuggingFaceModel(model_name, provider=hf_provider) + + if settings.llm_provider == "openai": + if not settings.openai_api_key: + raise ConfigurationError("OPENAI_API_KEY not set for pydantic-ai") + provider = OpenAIProvider(api_key=settings.openai_api_key) + return OpenAIModel(settings.openai_model, provider=provider) + + if settings.llm_provider == "anthropic": + if not settings.anthropic_api_key: + raise ConfigurationError("ANTHROPIC_API_KEY not set for pydantic-ai") + anthropic_provider = AnthropicProvider(api_key=settings.anthropic_api_key) + return AnthropicModel(settings.anthropic_model, provider=anthropic_provider) +``` + +### Embedding Service + +The embedding service uses local embedding model configuration: + +```29:31:src/services/embeddings.py + def __init__(self, model_name: str | None = None): + self._model_name = model_name or settings.local_embedding_model + self._model = SentenceTransformer(self._model_name) +``` + +### Orchestrator Factory + +The orchestrator factory uses settings to determine mode: + +```69:80:src/orchestrator_factory.py +def _determine_mode(explicit_mode: str | None) -> str: + """Determine which mode to use.""" + if explicit_mode: + if explicit_mode in ("magentic", "advanced"): + return "advanced" + return "simple" + + # Auto-detect: advanced if paid API key available + if settings.has_openai_key: + return "advanced" + + return "simple" +``` + +## Environment Variables Reference + +### Required (at least one 
LLM) + +- `OPENAI_API_KEY` - OpenAI API key (required for OpenAI provider) +- `ANTHROPIC_API_KEY` - Anthropic API key (required for Anthropic provider) +- `HF_TOKEN` or `HUGGINGFACE_API_KEY` - HuggingFace API token (optional, can work without for public models) + +#### LLM Configuration Variables + +- `LLM_PROVIDER` - Provider to use: `"openai"`, `"anthropic"`, or `"huggingface"` (default: `"huggingface"`) +- `OPENAI_MODEL` - OpenAI model name (default: `"gpt-5.1"`) +- `ANTHROPIC_MODEL` - Anthropic model name (default: `"claude-sonnet-4-5-20250929"`) +- `HUGGINGFACE_MODEL` - HuggingFace model ID (default: `"meta-llama/Llama-3.1-8B-Instruct"`) + +#### Embedding Configuration Variables + +- `EMBEDDING_PROVIDER` - Provider: `"openai"`, `"local"`, or `"huggingface"` (default: `"local"`) +- `OPENAI_EMBEDDING_MODEL` - OpenAI embedding model (default: `"text-embedding-3-small"`) +- `LOCAL_EMBEDDING_MODEL` - Local sentence-transformers model (default: `"all-MiniLM-L6-v2"`) +- `HUGGINGFACE_EMBEDDING_MODEL` - HuggingFace embedding model (default: `"sentence-transformers/all-MiniLM-L6-v2"`) + +#### Web Search Configuration Variables + +- `WEB_SEARCH_PROVIDER` - Provider: `"serper"`, `"searchxng"`, `"brave"`, `"tavily"`, or `"duckduckgo"` (default: `"duckduckgo"`) +- `SERPER_API_KEY` - Serper API key (required for Serper provider) +- `SEARCHXNG_HOST` - SearchXNG host URL (required for SearchXNG provider) +- `BRAVE_API_KEY` - Brave Search API key (required for Brave provider) +- `TAVILY_API_KEY` - Tavily API key (required for Tavily provider) + +#### PubMed Configuration Variables + +- `NCBI_API_KEY` - NCBI API key (optional, increases rate limit from 3 to 10 req/sec) + +#### Agent Configuration Variables + +- `MAX_ITERATIONS` - Maximum iterations per research loop (1-50, default: `10`) +- `SEARCH_TIMEOUT` - Search timeout in seconds (default: `30`) +- `USE_GRAPH_EXECUTION` - Use graph-based execution (default: `false`) + +#### Budget Configuration Variables + +- `DEFAULT_TOKEN_LIMIT` - Default token budget per research loop (1000-1000000, default: `100000`) +- `DEFAULT_TIME_LIMIT_MINUTES` - Default time limit in minutes (1-120, default: `10`) +- `DEFAULT_ITERATIONS_LIMIT` - Default iterations limit (1-50, default: `10`) + +#### RAG Configuration Variables + +- `RAG_COLLECTION_NAME` - ChromaDB collection name (default: `"deepcritical_evidence"`) +- `RAG_SIMILARITY_TOP_K` - Number of top results to retrieve (1-50, default: `5`) +- `RAG_AUTO_INGEST` - Automatically ingest evidence into RAG (default: `true`) + +#### ChromaDB Configuration Variables + +- `CHROMA_DB_PATH` - ChromaDB storage path (default: `"./chroma_db"`) +- `CHROMA_DB_PERSIST` - Whether to persist ChromaDB to disk (default: `true`) +- `CHROMA_DB_HOST` - ChromaDB server host (optional, for remote ChromaDB) +- `CHROMA_DB_PORT` - ChromaDB server port (optional, for remote ChromaDB) + +#### External Services Variables + +- `MODAL_TOKEN_ID` - Modal token ID (optional, for Modal sandbox execution) +- `MODAL_TOKEN_SECRET` - Modal token secret (optional, for Modal sandbox execution) + +#### Logging Configuration Variables + +- `LOG_LEVEL` - Log level: `"DEBUG"`, `"INFO"`, `"WARNING"`, or `"ERROR"` (default: `"INFO"`) + +## Validation + +Settings are validated on load using Pydantic validation: + +- **Type Checking**: All fields are strongly typed +- **Range Validation**: Numeric fields have min/max constraints (e.g., `ge=1, le=50` for `max_iterations`) +- **Literal Validation**: Enum fields only accept specific values (e.g., `Literal["openai", 
"anthropic", "huggingface"]`) +- **Required Fields**: API keys are checked when accessed via `get_api_key()` or `get_openai_api_key()` + +### Validation Examples + +The `max_iterations` field has range validation: + +```81:81:src/utils/config.py + max_iterations: int = Field(default=10, ge=1, le=50) +``` + +The `llm_provider` field has literal validation: + +```26:28:src/utils/config.py + llm_provider: Literal["openai", "anthropic", "huggingface"] = Field( + default="openai", description="Which LLM provider to use" + ) +``` + +## Error Handling + +Configuration errors raise `ConfigurationError` from `src/utils/exceptions.py`: + +```22:25:src/utils/exceptions.py +class ConfigurationError(DeepCriticalError): + """Raised when configuration is invalid.""" + + pass +``` + +### Error Handling Example + +```python +from src.utils.config import settings +from src.utils.exceptions import ConfigurationError + +try: + api_key = settings.get_api_key() +except ConfigurationError as e: + print(f"Configuration error: {e}") +``` + +### Common Configuration Errors + +1. **Missing API Key**: When `get_api_key()` is called but the required API key is not set +2. **Invalid Provider**: When `llm_provider` is set to an unsupported value +3. **Out of Range**: When numeric values exceed their min/max constraints +4. **Invalid Literal**: When enum fields receive unsupported values + +## Configuration Best Practices + +1. **Use `.env` File**: Store sensitive keys in `.env` file (add to `.gitignore`) +2. **Check Availability**: Use properties like `has_openai_key` before accessing API keys +3. **Handle Errors**: Always catch `ConfigurationError` when calling `get_api_key()` +4. **Validate Early**: Configuration is validated on import, so errors surface immediately +5. **Use Defaults**: Leverage sensible defaults for optional configuration + +## Future Enhancements + +The following configurations are planned for future phases: + +1. **Additional LLM Providers**: DeepSeek, OpenRouter, Gemini, Perplexity, Azure OpenAI, Local models +2. **Model Selection**: Reasoning/main/fast model configuration +3. **Service Integration**: Additional service integrations and configurations + diff --git a/docs/configuration/index.md b/docs/configuration/index.md new file mode 100644 index 0000000000000000000000000000000000000000..d7f10d2fff20e213c98f3871e71d9dd023a21a81 --- /dev/null +++ b/docs/configuration/index.md @@ -0,0 +1,746 @@ +# Configuration Guide + +## Overview + +DeepCritical uses **Pydantic Settings** for centralized configuration management. All settings are defined in the `Settings` class in `src/utils/config.py` and can be configured via environment variables or a `.env` file. + +The configuration system provides: + +- **Type Safety**: Strongly-typed fields with Pydantic validation +- **Environment File Support**: Automatically loads from `.env` file (if present) +- **Case-Insensitive**: Environment variables are case-insensitive +- **Singleton Pattern**: Global `settings` instance for easy access throughout the codebase +- **Validation**: Automatic validation on load with helpful error messages + +## Quick Start + +1. Create a `.env` file in the project root +2. Set at least one LLM API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or `HF_TOKEN`) +3. Optionally configure other services as needed +4. 
The application will automatically load and validate your configuration + +## Configuration System Architecture + +### Settings Class + +The [`Settings`][settings-class] class extends `BaseSettings` from `pydantic_settings` and defines all application configuration: + +```13:21:src/utils/config.py +class Settings(BaseSettings): + """Strongly-typed application settings.""" + + model_config = SettingsConfigDict( + env_file=".env", + env_file_encoding="utf-8", + case_sensitive=False, + extra="ignore", + ) +``` + +[View source](https://github.com/DeepCritical/GradioDemo/blob/main/src/utils/config.py#L13-L21) + +### Singleton Instance + +A global `settings` instance is available for import: + +```234:235:src/utils/config.py +# Singleton for easy import +settings = get_settings() +``` + +[View source](https://github.com/DeepCritical/GradioDemo/blob/main/src/utils/config.py#L234-L235) + +### Usage Pattern + +Access configuration throughout the codebase: + +```python +from src.utils.config import settings + +# Check if API keys are available +if settings.has_openai_key: + # Use OpenAI + pass + +# Access configuration values +max_iterations = settings.max_iterations +web_search_provider = settings.web_search_provider +``` + +## Required Configuration + +### LLM Provider + +You must configure at least one LLM provider. The system supports: + +- **OpenAI**: Requires `OPENAI_API_KEY` +- **Anthropic**: Requires `ANTHROPIC_API_KEY` +- **HuggingFace**: Optional `HF_TOKEN` or `HUGGINGFACE_API_KEY` (can work without key for public models) + +#### OpenAI Configuration + +```bash +LLM_PROVIDER=openai +OPENAI_API_KEY=your_openai_api_key_here +OPENAI_MODEL=gpt-5.1 +``` + +The default model is defined in the `Settings` class: + +```29:29:src/utils/config.py + openai_model: str = Field(default="gpt-5.1", description="OpenAI model name") +``` + +#### Anthropic Configuration + +```bash +LLM_PROVIDER=anthropic +ANTHROPIC_API_KEY=your_anthropic_api_key_here +ANTHROPIC_MODEL=claude-sonnet-4-5-20250929 +``` + +The default model is defined in the `Settings` class: + +```30:32:src/utils/config.py + anthropic_model: str = Field( + default="claude-sonnet-4-5-20250929", description="Anthropic model" + ) +``` + +#### HuggingFace Configuration + +HuggingFace can work without an API key for public models, but an API key provides higher rate limits: + +```bash +# Option 1: Using HF_TOKEN (preferred) +HF_TOKEN=your_huggingface_token_here + +# Option 2: Using HUGGINGFACE_API_KEY (alternative) +HUGGINGFACE_API_KEY=your_huggingface_api_key_here + +# Default model +HUGGINGFACE_MODEL=meta-llama/Llama-3.1-8B-Instruct +``` + +The HuggingFace token can be set via either environment variable: + +```33:35:src/utils/config.py + hf_token: str | None = Field( + default=None, alias="HF_TOKEN", description="HuggingFace API token" + ) +``` + +```57:59:src/utils/config.py + huggingface_api_key: str | None = Field( + default=None, description="HuggingFace API token (HF_TOKEN or HUGGINGFACE_API_KEY)" + ) +``` + +## Optional Configuration + +### Embedding Configuration + +DeepCritical supports multiple embedding providers for semantic search and RAG: + +```bash +# Embedding Provider: "openai", "local", or "huggingface" +EMBEDDING_PROVIDER=local + +# OpenAI Embedding Model (used by LlamaIndex RAG) +OPENAI_EMBEDDING_MODEL=text-embedding-3-small + +# Local Embedding Model (sentence-transformers, used by EmbeddingService) +LOCAL_EMBEDDING_MODEL=all-MiniLM-L6-v2 + +# HuggingFace Embedding Model 
+HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 +``` + +The embedding provider configuration: + +```47:50:src/utils/config.py + embedding_provider: Literal["openai", "local", "huggingface"] = Field( + default="local", + description="Embedding provider to use", + ) +``` + +**Note**: OpenAI embeddings require `OPENAI_API_KEY`. The local provider (default) uses sentence-transformers and requires no API key. + +### Web Search Configuration + +DeepCritical supports multiple web search providers: + +```bash +# Web Search Provider: "serper", "searchxng", "brave", "tavily", or "duckduckgo" +# Default: "duckduckgo" (no API key required) +WEB_SEARCH_PROVIDER=duckduckgo + +# Serper API Key (for Google search via Serper) +SERPER_API_KEY=your_serper_api_key_here + +# SearchXNG Host URL (for self-hosted search) +SEARCHXNG_HOST=http://localhost:8080 + +# Brave Search API Key +BRAVE_API_KEY=your_brave_api_key_here + +# Tavily API Key +TAVILY_API_KEY=your_tavily_api_key_here +``` + +The web search provider configuration: + +```71:74:src/utils/config.py + web_search_provider: Literal["serper", "searchxng", "brave", "tavily", "duckduckgo"] = Field( + default="duckduckgo", + description="Web search provider to use", + ) +``` + +**Note**: DuckDuckGo is the default and requires no API key, making it ideal for development and testing. + +### PubMed Configuration + +PubMed search supports optional NCBI API key for higher rate limits: + +```bash +# NCBI API Key (optional, for higher rate limits: 10 req/sec vs 3 req/sec) +NCBI_API_KEY=your_ncbi_api_key_here +``` + +The PubMed tool uses this configuration: + +```22:29:src/tools/pubmed.py + def __init__(self, api_key: str | None = None) -> None: + self.api_key = api_key or settings.ncbi_api_key + # Ignore placeholder values from .env.example + if self.api_key == "your-ncbi-key-here": + self.api_key = None + + # Use shared rate limiter + self._limiter = get_pubmed_limiter(self.api_key) +``` + +### Agent Configuration + +Control agent behavior and research loop execution: + +```bash +# Maximum iterations per research loop (1-50, default: 10) +MAX_ITERATIONS=10 + +# Search timeout in seconds +SEARCH_TIMEOUT=30 + +# Use graph-based execution for research flows +USE_GRAPH_EXECUTION=false +``` + +The agent configuration fields: + +```80:85:src/utils/config.py + # Agent Configuration + max_iterations: int = Field(default=10, ge=1, le=50) + search_timeout: int = Field(default=30, description="Seconds to wait for search") + use_graph_execution: bool = Field( + default=False, description="Use graph-based execution for research flows" + ) +``` + +### Budget & Rate Limiting Configuration + +Control resource limits for research loops: + +```bash +# Default token budget per research loop (1000-1000000, default: 100000) +DEFAULT_TOKEN_LIMIT=100000 + +# Default time limit per research loop in minutes (1-120, default: 10) +DEFAULT_TIME_LIMIT_MINUTES=10 + +# Default iterations limit per research loop (1-50, default: 10) +DEFAULT_ITERATIONS_LIMIT=10 +``` + +The budget configuration with validation: + +```87:105:src/utils/config.py + # Budget & Rate Limiting Configuration + default_token_limit: int = Field( + default=100000, + ge=1000, + le=1000000, + description="Default token budget per research loop", + ) + default_time_limit_minutes: int = Field( + default=10, + ge=1, + le=120, + description="Default time limit per research loop (minutes)", + ) + default_iterations_limit: int = Field( + default=10, + ge=1, + le=50, + description="Default iterations limit per 
research loop", + ) +``` + +### RAG Service Configuration + +Configure the Retrieval-Augmented Generation service: + +```bash +# ChromaDB collection name for RAG +RAG_COLLECTION_NAME=deepcritical_evidence + +# Number of top results to retrieve from RAG (1-50, default: 5) +RAG_SIMILARITY_TOP_K=5 + +# Automatically ingest evidence into RAG +RAG_AUTO_INGEST=true +``` + +The RAG configuration: + +```127:141:src/utils/config.py + # RAG Service Configuration + rag_collection_name: str = Field( + default="deepcritical_evidence", + description="ChromaDB collection name for RAG", + ) + rag_similarity_top_k: int = Field( + default=5, + ge=1, + le=50, + description="Number of top results to retrieve from RAG", + ) + rag_auto_ingest: bool = Field( + default=True, + description="Automatically ingest evidence into RAG", + ) +``` + +### ChromaDB Configuration + +Configure the vector database for embeddings and RAG: + +```bash +# ChromaDB storage path +CHROMA_DB_PATH=./chroma_db + +# Whether to persist ChromaDB to disk +CHROMA_DB_PERSIST=true + +# ChromaDB server host (for remote ChromaDB, optional) +CHROMA_DB_HOST=localhost + +# ChromaDB server port (for remote ChromaDB, optional) +CHROMA_DB_PORT=8000 +``` + +The ChromaDB configuration: + +```113:125:src/utils/config.py + chroma_db_path: str = Field(default="./chroma_db", description="ChromaDB storage path") + chroma_db_persist: bool = Field( + default=True, + description="Whether to persist ChromaDB to disk", + ) + chroma_db_host: str | None = Field( + default=None, + description="ChromaDB server host (for remote ChromaDB)", + ) + chroma_db_port: int | None = Field( + default=None, + description="ChromaDB server port (for remote ChromaDB)", + ) +``` + +### External Services + +#### Modal Configuration + +Modal is used for secure sandbox execution of statistical analysis: + +```bash +# Modal Token ID (for Modal sandbox execution) +MODAL_TOKEN_ID=your_modal_token_id_here + +# Modal Token Secret +MODAL_TOKEN_SECRET=your_modal_token_secret_here +``` + +The Modal configuration: + +```110:112:src/utils/config.py + # External Services + modal_token_id: str | None = Field(default=None, description="Modal token ID") + modal_token_secret: str | None = Field(default=None, description="Modal token secret") +``` + +### Logging Configuration + +Configure structured logging: + +```bash +# Log Level: "DEBUG", "INFO", "WARNING", or "ERROR" +LOG_LEVEL=INFO +``` + +The logging configuration: + +```107:108:src/utils/config.py + # Logging + log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = "INFO" +``` + +Logging is configured via the `configure_logging()` function: + +```212:231:src/utils/config.py +def configure_logging(settings: Settings) -> None: + """Configure structured logging with the configured log level.""" + # Set stdlib logging level from settings + logging.basicConfig( + level=getattr(logging, settings.log_level), + format="%(message)s", + ) + + structlog.configure( + processors=[ + structlog.stdlib.filter_by_level, + structlog.stdlib.add_logger_name, + structlog.stdlib.add_log_level, + structlog.processors.TimeStamper(fmt="iso"), + structlog.processors.JSONRenderer(), + ], + wrapper_class=structlog.stdlib.BoundLogger, + context_class=dict, + logger_factory=structlog.stdlib.LoggerFactory(), + ) +``` + +## Configuration Properties + +The `Settings` class provides helpful properties for checking configuration state: + +### API Key Availability + +Check which API keys are available: + +```171:189:src/utils/config.py + @property + def has_openai_key(self) -> 
bool: + """Check if OpenAI API key is available.""" + return bool(self.openai_api_key) + + @property + def has_anthropic_key(self) -> bool: + """Check if Anthropic API key is available.""" + return bool(self.anthropic_api_key) + + @property + def has_huggingface_key(self) -> bool: + """Check if HuggingFace API key is available.""" + return bool(self.huggingface_api_key or self.hf_token) + + @property + def has_any_llm_key(self) -> bool: + """Check if any LLM API key is available.""" + return self.has_openai_key or self.has_anthropic_key or self.has_huggingface_key +``` + +**Usage:** + +```python +from src.utils.config import settings + +# Check API key availability +if settings.has_openai_key: + # Use OpenAI + pass + +if settings.has_anthropic_key: + # Use Anthropic + pass + +if settings.has_huggingface_key: + # Use HuggingFace + pass + +if settings.has_any_llm_key: + # At least one LLM is available + pass +``` + +### Service Availability + +Check if external services are configured: + +```143:146:src/utils/config.py + @property + def modal_available(self) -> bool: + """Check if Modal credentials are configured.""" + return bool(self.modal_token_id and self.modal_token_secret) +``` + +```191:204:src/utils/config.py + @property + def web_search_available(self) -> bool: + """Check if web search is available (either no-key provider or API key present).""" + if self.web_search_provider == "duckduckgo": + return True # No API key required + if self.web_search_provider == "serper": + return bool(self.serper_api_key) + if self.web_search_provider == "searchxng": + return bool(self.searchxng_host) + if self.web_search_provider == "brave": + return bool(self.brave_api_key) + if self.web_search_provider == "tavily": + return bool(self.tavily_api_key) + return False +``` + +**Usage:** + +```python +from src.utils.config import settings + +# Check service availability +if settings.modal_available: + # Use Modal sandbox + pass + +if settings.web_search_available: + # Web search is configured + pass +``` + +### API Key Retrieval + +Get the API key for the configured provider: + +```148:160:src/utils/config.py + def get_api_key(self) -> str: + """Get the API key for the configured provider.""" + if self.llm_provider == "openai": + if not self.openai_api_key: + raise ConfigurationError("OPENAI_API_KEY not set") + return self.openai_api_key + + if self.llm_provider == "anthropic": + if not self.anthropic_api_key: + raise ConfigurationError("ANTHROPIC_API_KEY not set") + return self.anthropic_api_key + + raise ConfigurationError(f"Unknown LLM provider: {self.llm_provider}") +``` + +For OpenAI-specific operations (e.g., Magentic mode): + +```162:169:src/utils/config.py + def get_openai_api_key(self) -> str: + """Get OpenAI API key (required for Magentic function calling).""" + if not self.openai_api_key: + raise ConfigurationError( + "OPENAI_API_KEY not set. Magentic mode requires OpenAI for function calling. " + "Use mode='simple' for other providers." 
+ ) + return self.openai_api_key +``` + +## Configuration Usage in Codebase + +The configuration system is used throughout the codebase: + +### LLM Factory + +The LLM factory uses settings to create appropriate models: + +```129:144:src/utils/llm_factory.py + if settings.llm_provider == "huggingface": + model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct" + hf_provider = HuggingFaceProvider(api_key=settings.hf_token) + return HuggingFaceModel(model_name, provider=hf_provider) + + if settings.llm_provider == "openai": + if not settings.openai_api_key: + raise ConfigurationError("OPENAI_API_KEY not set for pydantic-ai") + provider = OpenAIProvider(api_key=settings.openai_api_key) + return OpenAIModel(settings.openai_model, provider=provider) + + if settings.llm_provider == "anthropic": + if not settings.anthropic_api_key: + raise ConfigurationError("ANTHROPIC_API_KEY not set for pydantic-ai") + anthropic_provider = AnthropicProvider(api_key=settings.anthropic_api_key) + return AnthropicModel(settings.anthropic_model, provider=anthropic_provider) +``` + +### Embedding Service + +The embedding service uses local embedding model configuration: + +```29:31:src/services/embeddings.py + def __init__(self, model_name: str | None = None): + self._model_name = model_name or settings.local_embedding_model + self._model = SentenceTransformer(self._model_name) +``` + +### Orchestrator Factory + +The orchestrator factory uses settings to determine mode: + +```69:80:src/orchestrator_factory.py +def _determine_mode(explicit_mode: str | None) -> str: + """Determine which mode to use.""" + if explicit_mode: + if explicit_mode in ("magentic", "advanced"): + return "advanced" + return "simple" + + # Auto-detect: advanced if paid API key available + if settings.has_openai_key: + return "advanced" + + return "simple" +``` + +## Environment Variables Reference + +### Required (at least one LLM) + +- `OPENAI_API_KEY` - OpenAI API key (required for OpenAI provider) +- `ANTHROPIC_API_KEY` - Anthropic API key (required for Anthropic provider) +- `HF_TOKEN` or `HUGGINGFACE_API_KEY` - HuggingFace API token (optional, can work without for public models) + +#### LLM Configuration Variables + +- `LLM_PROVIDER` - Provider to use: `"openai"`, `"anthropic"`, or `"huggingface"` (default: `"huggingface"`) +- `OPENAI_MODEL` - OpenAI model name (default: `"gpt-5.1"`) +- `ANTHROPIC_MODEL` - Anthropic model name (default: `"claude-sonnet-4-5-20250929"`) +- `HUGGINGFACE_MODEL` - HuggingFace model ID (default: `"meta-llama/Llama-3.1-8B-Instruct"`) + +#### Embedding Configuration Variables + +- `EMBEDDING_PROVIDER` - Provider: `"openai"`, `"local"`, or `"huggingface"` (default: `"local"`) +- `OPENAI_EMBEDDING_MODEL` - OpenAI embedding model (default: `"text-embedding-3-small"`) +- `LOCAL_EMBEDDING_MODEL` - Local sentence-transformers model (default: `"all-MiniLM-L6-v2"`) +- `HUGGINGFACE_EMBEDDING_MODEL` - HuggingFace embedding model (default: `"sentence-transformers/all-MiniLM-L6-v2"`) + +#### Web Search Configuration Variables + +- `WEB_SEARCH_PROVIDER` - Provider: `"serper"`, `"searchxng"`, `"brave"`, `"tavily"`, or `"duckduckgo"` (default: `"duckduckgo"`) +- `SERPER_API_KEY` - Serper API key (required for Serper provider) +- `SEARCHXNG_HOST` - SearchXNG host URL (required for SearchXNG provider) +- `BRAVE_API_KEY` - Brave Search API key (required for Brave provider) +- `TAVILY_API_KEY` - Tavily API key (required for Tavily provider) + +#### PubMed Configuration Variables + +- `NCBI_API_KEY` - NCBI API key 
(optional, increases rate limit from 3 to 10 req/sec) + +#### Agent Configuration Variables + +- `MAX_ITERATIONS` - Maximum iterations per research loop (1-50, default: `10`) +- `SEARCH_TIMEOUT` - Search timeout in seconds (default: `30`) +- `USE_GRAPH_EXECUTION` - Use graph-based execution (default: `false`) + +#### Budget Configuration Variables + +- `DEFAULT_TOKEN_LIMIT` - Default token budget per research loop (1000-1000000, default: `100000`) +- `DEFAULT_TIME_LIMIT_MINUTES` - Default time limit in minutes (1-120, default: `10`) +- `DEFAULT_ITERATIONS_LIMIT` - Default iterations limit (1-50, default: `10`) + +#### RAG Configuration Variables + +- `RAG_COLLECTION_NAME` - ChromaDB collection name (default: `"deepcritical_evidence"`) +- `RAG_SIMILARITY_TOP_K` - Number of top results to retrieve (1-50, default: `5`) +- `RAG_AUTO_INGEST` - Automatically ingest evidence into RAG (default: `true`) + +#### ChromaDB Configuration Variables + +- `CHROMA_DB_PATH` - ChromaDB storage path (default: `"./chroma_db"`) +- `CHROMA_DB_PERSIST` - Whether to persist ChromaDB to disk (default: `true`) +- `CHROMA_DB_HOST` - ChromaDB server host (optional, for remote ChromaDB) +- `CHROMA_DB_PORT` - ChromaDB server port (optional, for remote ChromaDB) + +#### External Services Variables + +- `MODAL_TOKEN_ID` - Modal token ID (optional, for Modal sandbox execution) +- `MODAL_TOKEN_SECRET` - Modal token secret (optional, for Modal sandbox execution) + +#### Logging Configuration Variables + +- `LOG_LEVEL` - Log level: `"DEBUG"`, `"INFO"`, `"WARNING"`, or `"ERROR"` (default: `"INFO"`) + +## Validation + +Settings are validated on load using Pydantic validation: + +- **Type Checking**: All fields are strongly typed +- **Range Validation**: Numeric fields have min/max constraints (e.g., `ge=1, le=50` for `max_iterations`) +- **Literal Validation**: Enum fields only accept specific values (e.g., `Literal["openai", "anthropic", "huggingface"]`) +- **Required Fields**: API keys are checked when accessed via `get_api_key()` or `get_openai_api_key()` + +### Validation Examples + +The `max_iterations` field has range validation: + +```81:81:src/utils/config.py + max_iterations: int = Field(default=10, ge=1, le=50) +``` + +The `llm_provider` field has literal validation: + +```26:28:src/utils/config.py + llm_provider: Literal["openai", "anthropic", "huggingface"] = Field( + default="openai", description="Which LLM provider to use" + ) +``` + +## Error Handling + +Configuration errors raise `ConfigurationError` from `src/utils/exceptions.py`: + +```22:25:src/utils/exceptions.py +class ConfigurationError(DeepCriticalError): + """Raised when configuration is invalid.""" + + pass +``` + +### Error Handling Example + +```python +from src.utils.config import settings +from src.utils.exceptions import ConfigurationError + +try: + api_key = settings.get_api_key() +except ConfigurationError as e: + print(f"Configuration error: {e}") +``` + +### Common Configuration Errors + +1. **Missing API Key**: When `get_api_key()` is called but the required API key is not set +2. **Invalid Provider**: When `llm_provider` is set to an unsupported value +3. **Out of Range**: When numeric values exceed their min/max constraints +4. **Invalid Literal**: When enum fields receive unsupported values + +## Configuration Best Practices + +1. **Use `.env` File**: Store sensitive keys in `.env` file (add to `.gitignore`) +2. **Check Availability**: Use properties like `has_openai_key` before accessing API keys +3. 
**Handle Errors**: Always catch `ConfigurationError` when calling `get_api_key()` +4. **Validate Early**: Configuration is validated on import, so errors surface immediately +5. **Use Defaults**: Leverage sensible defaults for optional configuration + +## Future Enhancements + +The following configurations are planned for future phases: + +1. **Additional LLM Providers**: DeepSeek, OpenRouter, Gemini, Perplexity, Azure OpenAI, Local models +2. **Model Selection**: Reasoning/main/fast model configuration +3. **Service Integration**: Additional service integrations and configurations diff --git a/docs/contributing.md b/docs/contributing.md new file mode 100644 index 0000000000000000000000000000000000000000..ddfb1c06dbd53064b62040c4ded4fa9e4e942f72 --- /dev/null +++ b/docs/contributing.md @@ -0,0 +1,428 @@ +# Contributing to DeepCritical + +Thank you for your interest in contributing to DeepCritical! This guide will help you get started. + +## Table of Contents + +- [Git Workflow](#git-workflow) +- [Getting Started](#getting-started) +- [Development Commands](#development-commands) +- [Code Style & Conventions](#code-style--conventions) +- [Type Safety](#type-safety) +- [Error Handling & Logging](#error-handling--logging) +- [Testing Requirements](#testing-requirements) +- [Implementation Patterns](#implementation-patterns) +- [Code Quality & Documentation](#code-quality--documentation) +- [Prompt Engineering & Citation Validation](#prompt-engineering--citation-validation) +- [MCP Integration](#mcp-integration) +- [Common Pitfalls](#common-pitfalls) +- [Key Principles](#key-principles) +- [Pull Request Process](#pull-request-process) + +## Git Workflow + +- `main`: Production-ready (GitHub) +- `dev`: Development integration (GitHub) +- Use feature branches: `yourname-dev` +- **NEVER** push directly to `main` or `dev` on HuggingFace +- GitHub is source of truth; HuggingFace is for deployment + +## Getting Started + +1. **Fork the repository** on GitHub +2. **Clone your fork**: + + ```bash + git clone https://github.com/yourusername/GradioDemo.git + cd GradioDemo + ``` + +3. **Install dependencies**: + + ```bash + make install + ``` + +4. **Create a feature branch**: + + ```bash + git checkout -b yourname-feature-name + ``` + +5. **Make your changes** following the guidelines below +6. **Run checks**: + + ```bash + make check + ``` + +7. **Commit and push**: + + ```bash + git commit -m "Description of changes" + git push origin yourname-feature-name + ``` +8. 
**Create a pull request** on GitHub + +## Development Commands + +```bash +make install # Install dependencies + pre-commit +make check # Lint + typecheck + test (MUST PASS) +make test # Run unit tests +make lint # Run ruff +make format # Format with ruff +make typecheck # Run mypy +make test-cov # Test with coverage +make docs-build # Build documentation +make docs-serve # Serve documentation locally +``` + +## Code Style & Conventions + +### Type Safety + +- **ALWAYS** use type hints for all function parameters and return types +- Use `mypy --strict` compliance (no `Any` unless absolutely necessary) +- Use `TYPE_CHECKING` imports for circular dependencies: + +```python +from typing import TYPE_CHECKING +if TYPE_CHECKING: + from src.services.embeddings import EmbeddingService +``` + +### Pydantic Models + +- All data exchange uses Pydantic models (`src/utils/models.py`) +- Models are frozen (`model_config = {"frozen": True}`) for immutability +- Use `Field()` with descriptions for all model fields +- Validate with `ge=`, `le=`, `min_length=`, `max_length=` constraints + +### Async Patterns + +- **ALL** I/O operations must be async (`async def`, `await`) +- Use `asyncio.gather()` for parallel operations +- CPU-bound work (embeddings, parsing) must use `run_in_executor()`: + +```python +loop = asyncio.get_running_loop() +result = await loop.run_in_executor(None, cpu_bound_function, args) +``` + +- Never block the event loop with synchronous I/O + +### Linting + +- Ruff with 100-char line length +- Ignore rules documented in `pyproject.toml`: + - `PLR0913`: Too many arguments (agents need many params) + - `PLR0912`: Too many branches (complex orchestrator logic) + - `PLR0911`: Too many return statements (complex agent logic) + - `PLR2004`: Magic values (statistical constants) + - `PLW0603`: Global statement (singleton pattern) + - `PLC0415`: Lazy imports for optional dependencies + +### Pre-commit + +- Run `make check` before committing +- Must pass: lint + typecheck + test-cov +- Pre-commit hooks installed via `make install` +- **CRITICAL**: Make sure you run the full pre-commit checks before opening a PR (not draft), otherwise Obstacle is the Way will lose his mind + +## Error Handling & Logging + +### Exception Hierarchy + +Use custom exception hierarchy (`src/utils/exceptions.py`): + +- `DeepCriticalError` (base) +- `SearchError` → `RateLimitError` +- `JudgeError` +- `ConfigurationError` + +### Error Handling Rules + +- Always chain exceptions: `raise SearchError(...) 
from e` +- Log errors with context using `structlog`: + +```python +logger.error("Operation failed", error=str(e), context=value) +``` + +- Never silently swallow exceptions +- Provide actionable error messages + +### Logging + +- Use `structlog` for all logging (NOT `print` or `logging`) +- Import: `import structlog; logger = structlog.get_logger()` +- Log with structured data: `logger.info("event", key=value)` +- Use appropriate levels: DEBUG, INFO, WARNING, ERROR + +### Logging Examples + +```python +logger.info("Starting search", query=query, tools=[t.name for t in tools]) +logger.warning("Search tool failed", tool=tool.name, error=str(result)) +logger.error("Assessment failed", error=str(e)) +``` + +### Error Chaining + +Always preserve exception context: + +```python +try: + result = await api_call() +except httpx.HTTPError as e: + raise SearchError(f"API call failed: {e}") from e +``` + +## Testing Requirements + +### Test Structure + +- Unit tests in `tests/unit/` (mocked, fast) +- Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`) +- Use markers: `unit`, `integration`, `slow` + +### Mocking + +- Use `respx` for httpx mocking +- Use `pytest-mock` for general mocking +- Mock LLM calls in unit tests (use `MockJudgeHandler`) +- Fixtures in `tests/conftest.py`: `mock_httpx_client`, `mock_llm_response` + +### TDD Workflow + +1. Write failing test in `tests/unit/` +2. Implement in `src/` +3. Ensure test passes +4. Run `make check` (lint + typecheck + test) + +### Test Examples + +```python +@pytest.mark.unit +async def test_pubmed_search(mock_httpx_client): + tool = PubMedTool() + results = await tool.search("metformin", max_results=5) + assert len(results) > 0 + assert all(isinstance(r, Evidence) for r in results) + +@pytest.mark.integration +async def test_real_pubmed_search(): + tool = PubMedTool() + results = await tool.search("metformin", max_results=3) + assert len(results) <= 3 +``` + +### Test Coverage + +- Run `make test-cov` for coverage report +- Aim for >80% coverage on critical paths +- Exclude: `__init__.py`, `TYPE_CHECKING` blocks + +## Implementation Patterns + +### Search Tools + +All tools implement `SearchTool` protocol (`src/tools/base.py`): + +- Must have `name` property +- Must implement `async def search(query, max_results) -> list[Evidence]` +- Use `@retry` decorator from tenacity for resilience +- Rate limiting: Implement `_rate_limit()` for APIs with limits (e.g., PubMed) +- Error handling: Raise `SearchError` or `RateLimitError` on failures + +Example pattern: + +```python +class MySearchTool: + @property + def name(self) -> str: + return "mytool" + + @retry(stop=stop_after_attempt(3), wait=wait_exponential(...)) + async def search(self, query: str, max_results: int = 10) -> list[Evidence]: + # Implementation + return evidence_list +``` + +### Judge Handlers + +- Implement `JudgeHandlerProtocol` (`async def assess(question, evidence) -> JudgeAssessment`) +- Use pydantic-ai `Agent` with `output_type=JudgeAssessment` +- System prompts in `src/prompts/judge.py` +- Support fallback handlers: `MockJudgeHandler`, `HFInferenceJudgeHandler` +- Always return valid `JudgeAssessment` (never raise exceptions) + +### Agent Factory Pattern + +- Use factory functions for creating agents (`src/agent_factory/`) +- Lazy initialization for optional dependencies (e.g., embeddings, Modal) +- Check requirements before initialization: + +```python +def check_magentic_requirements() -> None: + if not settings.has_openai_key: + raise 
ConfigurationError("Magentic requires OpenAI") +``` + +### State Management + +- **Magentic Mode**: Use `ContextVar` for thread-safe state (`src/agents/state.py`) +- **Simple Mode**: Pass state via function parameters +- Never use global mutable state (except singletons via `@lru_cache`) + +### Singleton Pattern + +Use `@lru_cache(maxsize=1)` for singletons: + +```python +@lru_cache(maxsize=1) +def get_embedding_service() -> EmbeddingService: + return EmbeddingService() +``` + +- Lazy initialization to avoid requiring dependencies at import time + +## Code Quality & Documentation + +### Docstrings + +- Google-style docstrings for all public functions +- Include Args, Returns, Raises sections +- Use type hints in docstrings only if needed for clarity + +Example: + +```python +async def search(self, query: str, max_results: int = 10) -> list[Evidence]: + """Search PubMed and return evidence. + + Args: + query: The search query string + max_results: Maximum number of results to return + + Returns: + List of Evidence objects + + Raises: + SearchError: If the search fails + RateLimitError: If we hit rate limits + """ +``` + +### Code Comments + +- Explain WHY, not WHAT +- Document non-obvious patterns (e.g., why `requests` not `httpx` for ClinicalTrials) +- Mark critical sections: `# CRITICAL: ...` +- Document rate limiting rationale +- Explain async patterns when non-obvious + +## Prompt Engineering & Citation Validation + +### Judge Prompts + +- System prompt in `src/prompts/judge.py` +- Format evidence with truncation (1500 chars per item) +- Handle empty evidence case separately +- Always request structured JSON output +- Use `format_user_prompt()` and `format_empty_evidence_prompt()` helpers + +### Hypothesis Prompts + +- Use diverse evidence selection (MMR algorithm) +- Sentence-aware truncation (`truncate_at_sentence()`) +- Format: Drug → Target → Pathway → Effect +- System prompt emphasizes mechanistic reasoning +- Use `format_hypothesis_prompt()` with embeddings for diversity + +### Report Prompts + +- Include full citation details for validation +- Use diverse evidence selection (n=20) +- **CRITICAL**: Emphasize citation validation rules +- Format hypotheses with support/contradiction counts +- System prompt includes explicit JSON structure requirements + +### Citation Validation + +- **ALWAYS** validate references before returning reports +- Use `validate_references()` from `src/utils/citation_validator.py` +- Remove hallucinated citations (URLs not in evidence) +- Log warnings for removed citations +- Never trust LLM-generated citations without validation + +### Citation Validation Rules + +1. Every reference URL must EXACTLY match a provided evidence URL +2. Do NOT invent, fabricate, or hallucinate any references +3. Do NOT modify paper titles, authors, dates, or URLs +4. If unsure about a citation, OMIT it rather than guess +5. 
Copy URLs exactly as provided - do not create similar-looking URLs + +### Evidence Selection + +- Use `select_diverse_evidence()` for MMR-based selection +- Balance relevance vs diversity (lambda=0.7 default) +- Sentence-aware truncation preserves meaning +- Limit evidence per prompt to avoid context overflow + +## MCP Integration + +### MCP Tools + +- Functions in `src/mcp_tools.py` for Claude Desktop +- Full type hints required +- Google-style docstrings with Args/Returns sections +- Formatted string returns (markdown) + +### Gradio MCP Server + +- Enable with `mcp_server=True` in `demo.launch()` +- Endpoint: `/gradio_api/mcp/` +- Use `ssr_mode=False` to fix hydration issues in HF Spaces + +## Common Pitfalls + +1. **Blocking the event loop**: Never use sync I/O in async functions +2. **Missing type hints**: All functions must have complete type annotations +3. **Hallucinated citations**: Always validate references +4. **Global mutable state**: Use ContextVar or pass via parameters +5. **Import errors**: Lazy-load optional dependencies (magentic, modal, embeddings) +6. **Rate limiting**: Always implement for external APIs +7. **Error chaining**: Always use `from e` when raising exceptions + +## Key Principles + +1. **Type Safety First**: All code must pass `mypy --strict` +2. **Async Everything**: All I/O must be async +3. **Test-Driven**: Write tests before implementation +4. **No Hallucinations**: Validate all citations +5. **Graceful Degradation**: Support free tier (HF Inference) when no API keys +6. **Lazy Loading**: Don't require optional dependencies at import time +7. **Structured Logging**: Use structlog, never print() +8. **Error Chaining**: Always preserve exception context + +## Pull Request Process + +1. Ensure all checks pass: `make check` +2. Update documentation if needed +3. Add tests for new features +4. Update CHANGELOG if applicable +5. Request review from maintainers +6. Address review feedback +7. Wait for approval before merging + +## Questions? + +- Open an issue on GitHub +- Check existing documentation +- Review code examples in the codebase + +Thank you for contributing to DeepCritical! + diff --git a/docs/contributing/code-quality.md b/docs/contributing/code-quality.md new file mode 100644 index 0000000000000000000000000000000000000000..d9e9eb9a0524e80c53cebd1b86b3af8958b2b357 --- /dev/null +++ b/docs/contributing/code-quality.md @@ -0,0 +1,70 @@ +# Code Quality & Documentation + +This document outlines code quality standards and documentation requirements. 
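+
+Quality here also covers data modeling: all exchanged data uses frozen Pydantic models with described and constrained fields (see the code style guide). Below is a minimal sketch of what that looks like; the field names are illustrative, not the actual definitions in `src/utils/models.py`.
+
+```python
+from pydantic import BaseModel, Field
+
+
+class Citation(BaseModel):
+    """Illustrative citation model; the real definitions live in src/utils/models.py."""
+
+    model_config = {"frozen": True}  # immutable after construction
+
+    source: str = Field(description="Origin of the evidence, e.g. 'pubmed'")
+    title: str = Field(min_length=1, description="Paper or trial title")
+    url: str = Field(description="Canonical URL; must match the evidence exactly")
+
+
+class Evidence(BaseModel):
+    """Illustrative evidence model."""
+
+    model_config = {"frozen": True}
+
+    content: str = Field(min_length=1, description="Excerpt used for judging")
+    relevance: float = Field(ge=0.0, le=1.0, description="Judge-assigned relevance score")
+    citation: Citation = Field(description="Where this evidence came from")
+```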
+ +## Linting + +- Ruff with 100-char line length +- Ignore rules documented in `pyproject.toml`: + - `PLR0913`: Too many arguments (agents need many params) + - `PLR0912`: Too many branches (complex orchestrator logic) + - `PLR0911`: Too many return statements (complex agent logic) + - `PLR2004`: Magic values (statistical constants) + - `PLW0603`: Global statement (singleton pattern) + - `PLC0415`: Lazy imports for optional dependencies + +## Type Checking + +- `mypy --strict` compliance +- `ignore_missing_imports = true` (for optional dependencies) +- Exclude: `reference_repos/`, `examples/` +- All functions must have complete type annotations + +## Pre-commit + +- Run `make check` before committing +- Must pass: lint + typecheck + test-cov +- Pre-commit hooks installed via `make install` + +## Documentation + +### Docstrings + +- Google-style docstrings for all public functions +- Include Args, Returns, Raises sections +- Use type hints in docstrings only if needed for clarity + +Example: + +```python +async def search(self, query: str, max_results: int = 10) -> list[Evidence]: + """Search PubMed and return evidence. + + Args: + query: The search query string + max_results: Maximum number of results to return + + Returns: + List of Evidence objects + + Raises: + SearchError: If the search fails + RateLimitError: If we hit rate limits + """ +``` + +### Code Comments + +- Explain WHY, not WHAT +- Document non-obvious patterns (e.g., why `requests` not `httpx` for ClinicalTrials) +- Mark critical sections: `# CRITICAL: ...` +- Document rate limiting rationale +- Explain async patterns when non-obvious + +## See Also + +- [Code Style](code-style.md) - Code style guidelines +- [Testing](testing.md) - Testing guidelines + + + diff --git a/docs/contributing/code-style.md b/docs/contributing/code-style.md new file mode 100644 index 0000000000000000000000000000000000000000..0eac3447df9adab75602cd682eb904ca9be9b75a --- /dev/null +++ b/docs/contributing/code-style.md @@ -0,0 +1,50 @@ +# Code Style & Conventions + +This document outlines the code style and conventions for DeepCritical. + +## Type Safety + +- **ALWAYS** use type hints for all function parameters and return types +- Use `mypy --strict` compliance (no `Any` unless absolutely necessary) +- Use `TYPE_CHECKING` imports for circular dependencies: + +```python +from typing import TYPE_CHECKING +if TYPE_CHECKING: + from src.services.embeddings import EmbeddingService +``` + +## Pydantic Models + +- All data exchange uses Pydantic models (`src/utils/models.py`) +- Models are frozen (`model_config = {"frozen": True}`) for immutability +- Use `Field()` with descriptions for all model fields +- Validate with `ge=`, `le=`, `min_length=`, `max_length=` constraints + +## Async Patterns + +- **ALL** I/O operations must be async (`async def`, `await`) +- Use `asyncio.gather()` for parallel operations +- CPU-bound work (embeddings, parsing) must use `run_in_executor()`: + +```python +loop = asyncio.get_running_loop() +result = await loop.run_in_executor(None, cpu_bound_function, args) +``` + +- Never block the event loop with synchronous I/O + +## Common Pitfalls + +1. **Blocking the event loop**: Never use sync I/O in async functions +2. **Missing type hints**: All functions must have complete type annotations +3. **Global mutable state**: Use ContextVar or pass via parameters +4. 
**Import errors**: Lazy-load optional dependencies (magentic, modal, embeddings) + +## See Also + +- [Error Handling](error-handling.md) - Error handling guidelines +- [Implementation Patterns](implementation-patterns.md) - Common patterns + + + diff --git a/docs/contributing/error-handling.md b/docs/contributing/error-handling.md new file mode 100644 index 0000000000000000000000000000000000000000..4995e4628ef6c2ce20a220acb574688cc25d04b4 --- /dev/null +++ b/docs/contributing/error-handling.md @@ -0,0 +1,58 @@ +# Error Handling & Logging + +This document outlines error handling and logging conventions for DeepCritical. + +## Exception Hierarchy + +Use custom exception hierarchy (`src/utils/exceptions.py`): + +- `DeepCriticalError` (base) +- `SearchError` → `RateLimitError` +- `JudgeError` +- `ConfigurationError` + +## Error Handling Rules + +- Always chain exceptions: `raise SearchError(...) from e` +- Log errors with context using `structlog`: + +```python +logger.error("Operation failed", error=str(e), context=value) +``` + +- Never silently swallow exceptions +- Provide actionable error messages + +## Logging + +- Use `structlog` for all logging (NOT `print` or `logging`) +- Import: `import structlog; logger = structlog.get_logger()` +- Log with structured data: `logger.info("event", key=value)` +- Use appropriate levels: DEBUG, INFO, WARNING, ERROR + +## Logging Examples + +```python +logger.info("Starting search", query=query, tools=[t.name for t in tools]) +logger.warning("Search tool failed", tool=tool.name, error=str(result)) +logger.error("Assessment failed", error=str(e)) +``` + +## Error Chaining + +Always preserve exception context: + +```python +try: + result = await api_call() +except httpx.HTTPError as e: + raise SearchError(f"API call failed: {e}") from e +``` + +## See Also + +- [Code Style](code-style.md) - Code style guidelines +- [Testing](testing.md) - Testing guidelines + + + diff --git a/docs/contributing/implementation-patterns.md b/docs/contributing/implementation-patterns.md new file mode 100644 index 0000000000000000000000000000000000000000..f1cc909e4a6572de0dc16de8c4e08af7104ced69 --- /dev/null +++ b/docs/contributing/implementation-patterns.md @@ -0,0 +1,73 @@ +# Implementation Patterns + +This document outlines common implementation patterns used in DeepCritical. 
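+
+One pattern described below but not shown in code is the `ContextVar`-based state used in Magentic mode (see the State Management section). The following is a minimal, hypothetical sketch; the real API lives in `src/agents/state.py` and its fields and helpers will differ.
+
+```python
+from contextvars import ContextVar
+
+from pydantic import BaseModel
+
+
+class ResearchState(BaseModel):
+    """Illustrative state container, not the actual src/agents/state.py model."""
+
+    model_config = {"frozen": True}
+
+    iteration: int = 0
+    tokens_used: int = 0
+
+
+# Each asyncio task sees its own value, so concurrent research runs don't collide.
+_state: ContextVar[ResearchState] = ContextVar("research_state", default=ResearchState())
+
+
+def get_state() -> ResearchState:
+    return _state.get()
+
+
+def update_state(**changes: int) -> None:
+    # Frozen models are replaced rather than mutated, keeping updates explicit.
+    _state.set(_state.get().model_copy(update=changes))
+```
+
+In simple mode the same data is passed as plain function parameters, so both modes stay behaviorally equivalent.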
+ +## Search Tools + +All tools implement `SearchTool` protocol (`src/tools/base.py`): + +- Must have `name` property +- Must implement `async def search(query, max_results) -> list[Evidence]` +- Use `@retry` decorator from tenacity for resilience +- Rate limiting: Implement `_rate_limit()` for APIs with limits (e.g., PubMed) +- Error handling: Raise `SearchError` or `RateLimitError` on failures + +Example pattern: + +```python +class MySearchTool: + @property + def name(self) -> str: + return "mytool" + + @retry(stop=stop_after_attempt(3), wait=wait_exponential(...)) + async def search(self, query: str, max_results: int = 10) -> list[Evidence]: + # Implementation + return evidence_list +``` + +## Judge Handlers + +- Implement `JudgeHandlerProtocol` (`async def assess(question, evidence) -> JudgeAssessment`) +- Use pydantic-ai `Agent` with `output_type=JudgeAssessment` +- System prompts in `src/prompts/judge.py` +- Support fallback handlers: `MockJudgeHandler`, `HFInferenceJudgeHandler` +- Always return valid `JudgeAssessment` (never raise exceptions) + +## Agent Factory Pattern + +- Use factory functions for creating agents (`src/agent_factory/`) +- Lazy initialization for optional dependencies (e.g., embeddings, Modal) +- Check requirements before initialization: + +```python +def check_magentic_requirements() -> None: + if not settings.has_openai_key: + raise ConfigurationError("Magentic requires OpenAI") +``` + +## State Management + +- **Magentic Mode**: Use `ContextVar` for thread-safe state (`src/agents/state.py`) +- **Simple Mode**: Pass state via function parameters +- Never use global mutable state (except singletons via `@lru_cache`) + +## Singleton Pattern + +Use `@lru_cache(maxsize=1)` for singletons: + +```python +@lru_cache(maxsize=1) +def get_embedding_service() -> EmbeddingService: + return EmbeddingService() +``` + +- Lazy initialization to avoid requiring dependencies at import time + +## See Also + +- [Code Style](code-style.md) - Code style guidelines +- [Error Handling](error-handling.md) - Error handling guidelines + + + diff --git a/docs/contributing/index.md b/docs/contributing/index.md new file mode 100644 index 0000000000000000000000000000000000000000..b006d66afad3af513379a38502b9d656772f5afb --- /dev/null +++ b/docs/contributing/index.md @@ -0,0 +1,152 @@ +# Contributing to DeepCritical + +Thank you for your interest in contributing to DeepCritical! This guide will help you get started. + +## Git Workflow + +- `main`: Production-ready (GitHub) +- `dev`: Development integration (GitHub) +- Use feature branches: `yourname-dev` +- **NEVER** push directly to `main` or `dev` on HuggingFace +- GitHub is source of truth; HuggingFace is for deployment + +## Development Commands + +```bash +make install # Install dependencies + pre-commit +make check # Lint + typecheck + test (MUST PASS) +make test # Run unit tests +make lint # Run ruff +make format # Format with ruff +make typecheck # Run mypy +make test-cov # Test with coverage +``` + +## Getting Started + +1. **Fork the repository** on GitHub +2. **Clone your fork**: + ```bash + git clone https://github.com/yourusername/GradioDemo.git + cd GradioDemo + ``` +3. **Install dependencies**: + ```bash + make install + ``` +4. **Create a feature branch**: + ```bash + git checkout -b yourname-feature-name + ``` +5. **Make your changes** following the guidelines below +6. **Run checks**: + ```bash + make check + ``` +7. 
**Commit and push**: + ```bash + git commit -m "Description of changes" + git push origin yourname-feature-name + ``` +8. **Create a pull request** on GitHub + +## Development Guidelines + +### Code Style + +- Follow [Code Style Guidelines](code-style.md) +- All code must pass `mypy --strict` +- Use `ruff` for linting and formatting +- Line length: 100 characters + +### Error Handling + +- Follow [Error Handling Guidelines](error-handling.md) +- Always chain exceptions: `raise SearchError(...) from e` +- Use structured logging with `structlog` +- Never silently swallow exceptions + +### Testing + +- Follow [Testing Guidelines](testing.md) +- Write tests before implementation (TDD) +- Aim for >80% coverage on critical paths +- Use markers: `unit`, `integration`, `slow` + +### Implementation Patterns + +- Follow [Implementation Patterns](implementation-patterns.md) +- Use factory functions for agent/tool creation +- Implement protocols for extensibility +- Use singleton pattern with `@lru_cache(maxsize=1)` + +### Prompt Engineering + +- Follow [Prompt Engineering Guidelines](prompt-engineering.md) +- Always validate citations +- Use diverse evidence selection +- Never trust LLM-generated citations without validation + +### Code Quality + +- Follow [Code Quality Guidelines](code-quality.md) +- Google-style docstrings for all public functions +- Explain WHY, not WHAT in comments +- Mark critical sections: `# CRITICAL: ...` + +## MCP Integration + +### MCP Tools + +- Functions in `src/mcp_tools.py` for Claude Desktop +- Full type hints required +- Google-style docstrings with Args/Returns sections +- Formatted string returns (markdown) + +### Gradio MCP Server + +- Enable with `mcp_server=True` in `demo.launch()` +- Endpoint: `/gradio_api/mcp/` +- Use `ssr_mode=False` to fix hydration issues in HF Spaces + +## Common Pitfalls + +1. **Blocking the event loop**: Never use sync I/O in async functions +2. **Missing type hints**: All functions must have complete type annotations +3. **Hallucinated citations**: Always validate references +4. **Global mutable state**: Use ContextVar or pass via parameters +5. **Import errors**: Lazy-load optional dependencies (magentic, modal, embeddings) +6. **Rate limiting**: Always implement for external APIs +7. **Error chaining**: Always use `from e` when raising exceptions + +## Key Principles + +1. **Type Safety First**: All code must pass `mypy --strict` +2. **Async Everything**: All I/O must be async +3. **Test-Driven**: Write tests before implementation +4. **No Hallucinations**: Validate all citations +5. **Graceful Degradation**: Support free tier (HF Inference) when no API keys +6. **Lazy Loading**: Don't require optional dependencies at import time +7. **Structured Logging**: Use structlog, never print() +8. **Error Chaining**: Always preserve exception context + +## Pull Request Process + +1. Ensure all checks pass: `make check` +2. Update documentation if needed +3. Add tests for new features +4. Update CHANGELOG if applicable +5. Request review from maintainers +6. Address review feedback +7. Wait for approval before merging + +## Questions? + +- Open an issue on GitHub +- Check existing documentation +- Review code examples in the codebase + +Thank you for contributing to DeepCritical! 
+ + + diff --git a/docs/contributing/prompt-engineering.md b/docs/contributing/prompt-engineering.md new file mode 100644 index 0000000000000000000000000000000000000000..869f06463e4533b789110718ed82d91aa39df2b3 --- /dev/null +++ b/docs/contributing/prompt-engineering.md @@ -0,0 +1,58 @@ +# Prompt Engineering & Citation Validation + +This document outlines prompt engineering guidelines and citation validation rules. + +## Judge Prompts + +- System prompt in `src/prompts/judge.py` +- Format evidence with truncation (1500 chars per item) +- Handle empty evidence case separately +- Always request structured JSON output +- Use `format_user_prompt()` and `format_empty_evidence_prompt()` helpers + +## Hypothesis Prompts + +- Use diverse evidence selection (MMR algorithm) +- Sentence-aware truncation (`truncate_at_sentence()`) +- Format: Drug → Target → Pathway → Effect +- System prompt emphasizes mechanistic reasoning +- Use `format_hypothesis_prompt()` with embeddings for diversity + +## Report Prompts + +- Include full citation details for validation +- Use diverse evidence selection (n=20) +- **CRITICAL**: Emphasize citation validation rules +- Format hypotheses with support/contradiction counts +- System prompt includes explicit JSON structure requirements + +## Citation Validation + +- **ALWAYS** validate references before returning reports +- Use `validate_references()` from `src/utils/citation_validator.py` +- Remove hallucinated citations (URLs not in evidence) +- Log warnings for removed citations +- Never trust LLM-generated citations without validation + +## Citation Validation Rules + +1. Every reference URL must EXACTLY match a provided evidence URL +2. Do NOT invent, fabricate, or hallucinate any references +3. Do NOT modify paper titles, authors, dates, or URLs +4. If unsure about a citation, OMIT it rather than guess +5. Copy URLs exactly as provided - do not create similar-looking URLs + +## Evidence Selection + +- Use `select_diverse_evidence()` for MMR-based selection +- Balance relevance vs diversity (lambda=0.7 default) +- Sentence-aware truncation preserves meaning +- Limit evidence per prompt to avoid context overflow + +## See Also + +- [Code Quality](code-quality.md) - Code quality guidelines +- [Error Handling](error-handling.md) - Error handling guidelines + + + diff --git a/docs/contributing/testing.md b/docs/contributing/testing.md new file mode 100644 index 0000000000000000000000000000000000000000..55d812def182f778e4445ae8108abd2395935110 --- /dev/null +++ b/docs/contributing/testing.md @@ -0,0 +1,54 @@ +# Testing Requirements + +This document outlines testing requirements and guidelines for DeepCritical. + +## Test Structure + +- Unit tests in `tests/unit/` (mocked, fast) +- Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`) +- Use markers: `unit`, `integration`, `slow` + +## Mocking + +- Use `respx` for httpx mocking +- Use `pytest-mock` for general mocking +- Mock LLM calls in unit tests (use `MockJudgeHandler`) +- Fixtures in `tests/conftest.py`: `mock_httpx_client`, `mock_llm_response` + +## TDD Workflow + +1. Write failing test in `tests/unit/` +2. Implement in `src/` +3. Ensure test passes +4. 
Run `make check` (lint + typecheck + test) + +## Test Examples + +```python +@pytest.mark.unit +async def test_pubmed_search(mock_httpx_client): + tool = PubMedTool() + results = await tool.search("metformin", max_results=5) + assert len(results) > 0 + assert all(isinstance(r, Evidence) for r in results) + +@pytest.mark.integration +async def test_real_pubmed_search(): + tool = PubMedTool() + results = await tool.search("metformin", max_results=3) + assert len(results) <= 3 +``` + +## Test Coverage + +- Run `make test-cov` for coverage report +- Aim for >80% coverage on critical paths +- Exclude: `__init__.py`, `TYPE_CHECKING` blocks + +## See Also + +- [Code Style](code-style.md) - Code style guidelines +- [Implementation Patterns](implementation-patterns.md) - Common patterns + + + diff --git a/docs/development/testing.md b/docs/development/testing.md deleted file mode 100644 index 47c8c32d6a96ebfc01ed9c54627e6287b5c0e722..0000000000000000000000000000000000000000 --- a/docs/development/testing.md +++ /dev/null @@ -1,139 +0,0 @@ -# Testing Strategy -## ensuring DeepCritical is Ironclad - ---- - -## Overview - -Our testing strategy follows a strict **Pyramid of Reliability**: -1. **Unit Tests**: Fast, isolated logic checks (60% of tests) -2. **Integration Tests**: Tool interactions & Agent loops (30% of tests) -3. **E2E / Regression Tests**: Full research workflows (10% of tests) - -**Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident. - ---- - -## 1. Unit Tests (Fast & Cheap) - -**Location**: `tests/unit/` - -Focus on individual components without external network calls. Mock everything. - -### Key Test Cases - -#### Agent Logic -- **Initialization**: Verify default config loads correctly. -- **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment). -- **Budget Checks**: Test `should_continue()` returns `False` when budget exceeded. -- **Error Handling**: Test partial failure recovery (e.g., one tool fails, agent continues). - -#### Tools (Mocked) -- **Parser Logic**: Feed raw XML/JSON to tool parsers and verify `Evidence` objects. -- **Validation**: Ensure tools reject invalid queries (empty strings, etc.). - -#### Judge Prompts -- **Schema Compliance**: Verify prompt templates generate valid JSON structure instructions. -- **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts. - -```python -# Example: Testing State Logic -def test_budget_stop(): - state = ResearchState(tokens_used=50001, max_tokens=50000) - assert should_continue(state) is False -``` - ---- - -## 2. Integration Tests (Realistic & Mocked I/O) - -**Location**: `tests/integration/` - -Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or **Replay** patterns to record/replay API calls to save money/time. - -### Key Test Cases - -#### Search Loop -- **Iteration Flow**: Verify agent performs Search -> Judge -> Search loop. -- **Tool Selection**: Verify correct tools are called based on judge output (mocked judge). -- **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2. - -#### MCP Server Integration -- **Server Startup**: Verify MCP server starts and exposes tools. -- **Client Connection**: Verify agent can call tools via MCP protocol. 
- -```python -# Example: Testing Search Loop with Mocked Tools -async def test_search_loop_flow(): - agent = ResearchAgent(tools=[MockPubMed(), MockWeb()]) - report = await agent.run("test query") - assert agent.state.iterations > 0 - assert len(report.sources) > 0 -``` - ---- - -## 3. End-to-End (E2E) Tests (The "Real Deal") - -**Location**: `tests/e2e/` - -Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit. - -### Key Test Cases - -#### The "Golden Query" -Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"* -- **Success Criteria**: - - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN). - - Includes citations from PubMed. - - Completes within 3 iterations. - - JSON output matches schema. - -#### Deployment Smoke Test -- **Gradio UI**: Verify UI launches and accepts input. -- **Streaming**: Verify generator yields chunks (first chunk within 2s). - ---- - -## 4. Tools & Config - -### Pytest Configuration -```toml -# pyproject.toml -[tool.pytest.ini_options] -markers = [ - "unit: fast, isolated tests", - "integration: mocked network tests", - "e2e: real network tests (slow, expensive)" -] -asyncio_mode = "auto" -``` - -### CI/CD Pipeline (GitHub Actions) -1. **Lint**: `ruff check .` -2. **Type Check**: `mypy .` -3. **Unit**: `pytest -m unit` -4. **Integration**: `pytest -m integration` -5. **E2E**: (Manual trigger only) - ---- - -## 5. Anti-Hallucination Validation - -How do we test if the agent is lying? - -1. **Citation Check**: - - Regex verify that every `[PMID: 12345]` in the report exists in the `Evidence` list. - - Fail if a citation is "orphaned" (hallucinated ID). - -2. **Negative Constraints**: - - Test queries for fake diseases ("Ligma syndrome") -> Agent should return "No evidence found". - ---- - -## Checklist for Implementation - -- [ ] Set up `tests/` directory structure -- [ ] Configure `pytest` and `vcrpy` -- [ ] Create `tests/fixtures/` for mock data (PubMed XMLs) -- [ ] Write first unit test for `ResearchState` diff --git a/docs/examples/writer_agents_usage.md b/docs/examples/writer_agents_usage.md deleted file mode 100644 index cb30feff24a641f5dba4caa1eca6e0569c9cd222..0000000000000000000000000000000000000000 --- a/docs/examples/writer_agents_usage.md +++ /dev/null @@ -1,425 +0,0 @@ -# Writer Agents Usage Examples - -This document provides examples of how to use the writer agents in DeepCritical for generating research reports. - -## Overview - -DeepCritical provides three writer agents for different report generation scenarios: - -1. **WriterAgent** - Basic writer for simple reports from findings -2. **LongWriterAgent** - Iterative writer for long-form multi-section reports -3. **ProofreaderAgent** - Finalizes and polishes report drafts - -## WriterAgent - -The `WriterAgent` generates final reports from research findings. It's used in iterative research flows. - -### Basic Usage - -```python -from src.agent_factory.agents import create_writer_agent - -# Create writer agent -writer = create_writer_agent() - -# Generate report -query = "What is the capital of France?" -findings = """ -Paris is the capital of France [1]. -It is located in the north-central part of the country [2]. 
- -[1] https://example.com/france-info -[2] https://example.com/paris-info -""" - -report = await writer.write_report( - query=query, - findings=findings, -) - -print(report) -``` - -### With Output Length Specification - -```python -report = await writer.write_report( - query="Explain machine learning", - findings=findings, - output_length="500 words", -) -``` - -### With Additional Instructions - -```python -report = await writer.write_report( - query="Explain machine learning", - findings=findings, - output_length="A comprehensive overview", - output_instructions="Use formal academic language and include examples", -) -``` - -### Integration with IterativeResearchFlow - -The `WriterAgent` is automatically used by `IterativeResearchFlow`: - -```python -from src.agent_factory.agents import create_iterative_flow - -flow = create_iterative_flow(max_iterations=5, max_time_minutes=10) -report = await flow.run( - query="What is quantum computing?", - output_length="A detailed explanation", - output_instructions="Include practical applications", -) -``` - -## LongWriterAgent - -The `LongWriterAgent` iteratively writes report sections with proper citation management. It's used in deep research flows. - -### Basic Usage - -```python -from src.agent_factory.agents import create_long_writer_agent -from src.utils.models import ReportDraft, ReportDraftSection - -# Create long writer agent -long_writer = create_long_writer_agent() - -# Create report draft with sections -report_draft = ReportDraft( - sections=[ - ReportDraftSection( - section_title="Introduction", - section_content="Draft content for introduction with [1].", - ), - ReportDraftSection( - section_title="Methods", - section_content="Draft content for methods with [2].", - ), - ReportDraftSection( - section_title="Results", - section_content="Draft content for results with [3].", - ), - ] -) - -# Generate full report -report = await long_writer.write_report( - original_query="What are the main features of Python?", - report_title="Python Programming Language Overview", - report_draft=report_draft, -) - -print(report) -``` - -### Writing Individual Sections - -You can also write sections one at a time: - -```python -# Write first section -section_output = await long_writer.write_next_section( - original_query="What is Python?", - report_draft="", # No existing draft - next_section_title="Introduction", - next_section_draft="Python is a programming language...", -) - -print(section_output.next_section_markdown) -print(section_output.references) - -# Write second section with existing draft -section_output = await long_writer.write_next_section( - original_query="What is Python?", - report_draft="# Report\n\n## Introduction\n\nContent...", - next_section_title="Features", - next_section_draft="Python features include...", -) -``` - -### Integration with DeepResearchFlow - -The `LongWriterAgent` is automatically used by `DeepResearchFlow`: - -```python -from src.agent_factory.agents import create_deep_flow - -flow = create_deep_flow( - max_iterations=5, - max_time_minutes=10, - use_long_writer=True, # Use long writer (default) -) - -report = await flow.run("What are the main features of Python programming language?") -``` - -## ProofreaderAgent - -The `ProofreaderAgent` finalizes and polishes report drafts by removing duplicates, adding summaries, and refining wording. 
- -### Basic Usage - -```python -from src.agent_factory.agents import create_proofreader_agent -from src.utils.models import ReportDraft, ReportDraftSection - -# Create proofreader agent -proofreader = create_proofreader_agent() - -# Create report draft -report_draft = ReportDraft( - sections=[ - ReportDraftSection( - section_title="Introduction", - section_content="Python is a programming language [1].", - ), - ReportDraftSection( - section_title="Features", - section_content="Python has many features [2].", - ), - ] -) - -# Proofread and finalize -final_report = await proofreader.proofread( - query="What is Python?", - report_draft=report_draft, -) - -print(final_report) -``` - -### Integration with DeepResearchFlow - -Use `ProofreaderAgent` instead of `LongWriterAgent`: - -```python -from src.agent_factory.agents import create_deep_flow - -flow = create_deep_flow( - max_iterations=5, - max_time_minutes=10, - use_long_writer=False, # Use proofreader instead -) - -report = await flow.run("What are the main features of Python?") -``` - -## Error Handling - -All writer agents include robust error handling: - -### Handling Empty Inputs - -```python -# WriterAgent handles empty findings gracefully -report = await writer.write_report( - query="Test query", - findings="", # Empty findings -) -# Returns a fallback report - -# LongWriterAgent handles empty sections -report = await long_writer.write_report( - original_query="Test", - report_title="Test Report", - report_draft=ReportDraft(sections=[]), # Empty draft -) -# Returns minimal report - -# ProofreaderAgent handles empty drafts -report = await proofreader.proofread( - query="Test", - report_draft=ReportDraft(sections=[]), -) -# Returns minimal report -``` - -### Retry Logic - -All agents automatically retry on transient errors (timeouts, connection errors): - -```python -# Automatically retries up to 3 times on transient failures -report = await writer.write_report( - query="Test query", - findings=findings, -) -``` - -### Fallback Reports - -If all retries fail, agents return fallback reports: - -```python -# Returns fallback report with query and findings -report = await writer.write_report( - query="Test query", - findings=findings, -) -# Fallback includes: "# Research Report\n\n## Query\n...\n\n## Findings\n..." 
-``` - -## Citation Validation - -### For Markdown Reports - -Use the markdown citation validator: - -```python -from src.utils.citation_validator import validate_markdown_citations -from src.utils.models import Evidence, Citation - -# Collect evidence during research -evidence = [ - Evidence( - content="Paris is the capital of France", - citation=Citation( - source="web", - title="France Information", - url="https://example.com/france", - date="2024-01-01", - ), - ), -] - -# Generate report -report = await writer.write_report(query="What is the capital of France?", findings=findings) - -# Validate citations -validated_report, removed_count = validate_markdown_citations(report, evidence) - -if removed_count > 0: - print(f"Removed {removed_count} invalid citations") -``` - -### For ResearchReport Objects - -Use the structured citation validator: - -```python -from src.utils.citation_validator import validate_references - -# For ResearchReport objects (from ReportAgent) -validated_report = validate_references(report, evidence) -``` - -## Custom Model Configuration - -All writer agents support custom model configuration: - -```python -from pydantic_ai import Model - -# Create custom model -custom_model = Model("openai", "gpt-4") - -# Use with writer agents -writer = create_writer_agent(model=custom_model) -long_writer = create_long_writer_agent(model=custom_model) -proofreader = create_proofreader_agent(model=custom_model) -``` - -## Best Practices - -1. **Use WriterAgent for simple reports** - When you have findings as a string and need a quick report -2. **Use LongWriterAgent for structured reports** - When you need multiple sections with proper citation management -3. **Use ProofreaderAgent for final polish** - When you have draft sections and need a polished final report -4. **Validate citations** - Always validate citations against collected evidence -5. **Handle errors gracefully** - All agents return fallback reports on failure -6. **Specify output length** - Use `output_length` parameter to control report size -7. 
**Provide instructions** - Use `output_instructions` for specific formatting requirements - -## Integration Examples - -### Full Iterative Research Flow - -```python -from src.agent_factory.agents import create_iterative_flow - -flow = create_iterative_flow( - max_iterations=5, - max_time_minutes=10, -) - -report = await flow.run( - query="What is machine learning?", - output_length="A comprehensive 1000-word explanation", - output_instructions="Include practical examples and use cases", -) -``` - -### Full Deep Research Flow with Long Writer - -```python -from src.agent_factory.agents import create_deep_flow - -flow = create_deep_flow( - max_iterations=5, - max_time_minutes=10, - use_long_writer=True, -) - -report = await flow.run("What are the main features of Python programming language?") -``` - -### Full Deep Research Flow with Proofreader - -```python -from src.agent_factory.agents import create_deep_flow - -flow = create_deep_flow( - max_iterations=5, - max_time_minutes=10, - use_long_writer=False, # Use proofreader -) - -report = await flow.run("Explain quantum computing basics") -``` - -## Troubleshooting - -### Empty Reports - -If you get empty reports, check: -- Input validation logs (agents log warnings for empty inputs) -- LLM API key configuration -- Network connectivity - -### Citation Issues - -If citations are missing or invalid: -- Use `validate_markdown_citations()` to check citations -- Ensure Evidence objects are properly collected during research -- Check that URLs in findings match Evidence URLs - -### Performance Issues - -For large reports: -- Use `LongWriterAgent` for better section management -- Consider truncating very long findings (agents do this automatically) -- Use appropriate `max_time_minutes` settings - -## See Also - -- [Research Flows Documentation](../orchestrator/research_flows.md) -- [Citation Validation](../utils/citation_validation.md) -- [Agent Factory](../agent_factory/agents.md) - - - - - - - - - - - - - diff --git a/docs/getting-started/examples.md b/docs/getting-started/examples.md new file mode 100644 index 0000000000000000000000000000000000000000..2e516725ba568c33234166d5169c5708cb5f3a73 --- /dev/null +++ b/docs/getting-started/examples.md @@ -0,0 +1,198 @@ +# Examples + +This page provides examples of using DeepCritical for various research tasks. + +## Basic Research Query + +### Example 1: Drug Information + +**Query**: +``` +What are the latest treatments for Alzheimer's disease? +``` + +**What DeepCritical Does**: +1. Searches PubMed for recent papers +2. Searches ClinicalTrials.gov for active trials +3. Evaluates evidence quality +4. Synthesizes findings into a comprehensive report + +### Example 2: Clinical Trial Search + +**Query**: +``` +What clinical trials are investigating metformin for cancer prevention? +``` + +**What DeepCritical Does**: +1. Searches ClinicalTrials.gov for relevant trials +2. Searches PubMed for supporting literature +3. Provides trial details and status +4. Summarizes findings + +## Advanced Research Queries + +### Example 3: Comprehensive Review + +**Query**: +``` +Review the evidence for using metformin as an anti-aging intervention, +including clinical trials, mechanisms of action, and safety profile. +``` + +**What DeepCritical Does**: +1. Uses deep research mode (multi-section) +2. Searches multiple sources in parallel +3. Generates sections on: + - Clinical trials + - Mechanisms of action + - Safety profile +4. 
Synthesizes comprehensive report + +### Example 4: Hypothesis Testing + +**Query**: +``` +Test the hypothesis that regular exercise reduces Alzheimer's disease risk. +``` + +**What DeepCritical Does**: +1. Generates testable hypotheses +2. Searches for supporting/contradicting evidence +3. Performs statistical analysis (if Modal configured) +4. Provides verdict: SUPPORTED, REFUTED, or INCONCLUSIVE + +## MCP Tool Examples + +### Using search_pubmed + +``` +Search PubMed for "CRISPR gene editing cancer therapy" +``` + +### Using search_clinical_trials + +``` +Find active clinical trials for "diabetes type 2 treatment" +``` + +### Using search_all + +``` +Search all sources for "COVID-19 vaccine side effects" +``` + +### Using analyze_hypothesis + +``` +Analyze whether vitamin D supplementation reduces COVID-19 severity +``` + +## Code Examples + +### Python API Usage + +```python +from src.orchestrator_factory import create_orchestrator +from src.tools.search_handler import SearchHandler +from src.agent_factory.judges import create_judge_handler + +# Create orchestrator +search_handler = SearchHandler() +judge_handler = create_judge_handler() +orchestrator = create_orchestrator( + search_handler=search_handler, + judge_handler=judge_handler, + config={}, + mode="advanced" +) + +# Run research query +query = "What are the latest treatments for Alzheimer's disease?" +async for event in orchestrator.run(query): + print(f"Event: {event.type} - {event.data}") +``` + +### Gradio UI Integration + +```python +import gradio as gr +from src.app import create_research_interface + +# Create interface +interface = create_research_interface() + +# Launch +interface.launch(server_name="0.0.0.0", server_port=7860) +``` + +## Research Patterns + +### Iterative Research + +Single-loop research with search-judge-synthesize cycles: + +```python +from src.orchestrator.research_flow import IterativeResearchFlow + +flow = IterativeResearchFlow( + search_handler=search_handler, + judge_handler=judge_handler, + use_graph=False +) + +async for event in flow.run(query): + # Handle events + pass +``` + +### Deep Research + +Multi-section parallel research: + +```python +from src.orchestrator.research_flow import DeepResearchFlow + +flow = DeepResearchFlow( + search_handler=search_handler, + judge_handler=judge_handler, + use_graph=True +) + +async for event in flow.run(query): + # Handle events + pass +``` + +## Configuration Examples + +### Basic Configuration + +```bash +# .env file +LLM_PROVIDER=openai +OPENAI_API_KEY=your_key_here +MAX_ITERATIONS=10 +``` + +### Advanced Configuration + +```bash +# .env file +LLM_PROVIDER=anthropic +ANTHROPIC_API_KEY=your_key_here +EMBEDDING_PROVIDER=local +WEB_SEARCH_PROVIDER=duckduckgo +MAX_ITERATIONS=20 +DEFAULT_TOKEN_LIMIT=200000 +USE_GRAPH_EXECUTION=true +``` + +## Next Steps + +- Read the [Configuration Guide](../configuration/index.md) for all options +- Explore the [Architecture Documentation](../architecture/graph-orchestration.md) +- Check out the [API Reference](../api/agents.md) for programmatic usage + + + diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md new file mode 100644 index 0000000000000000000000000000000000000000..f60a26867517b8a3a3465abcf8abc8df52889af1 --- /dev/null +++ b/docs/getting-started/installation.md @@ -0,0 +1,137 @@ +# Installation + +This guide will help you install and set up DeepCritical on your system. 
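+
+If you want a quick sanity check at any point during the steps below, a small script like this verifies the Python version and that the core dependencies resolve (the module list is indicative, not exhaustive):
+
+```python
+import importlib.util
+import sys
+
+assert sys.version_info >= (3, 11), "DeepCritical requires Python 3.11 or higher"
+
+for module in ("gradio", "pydantic", "httpx"):
+    if importlib.util.find_spec(module) is None:
+        print(f"Missing dependency: {module} (run `uv sync`)")
+    else:
+        print(f"OK: {module}")
+```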
+ +## Prerequisites + +- Python 3.11 or higher +- `uv` package manager (recommended) or `pip` +- At least one LLM API key (OpenAI, Anthropic, or HuggingFace) + +## Installation Steps + +### 1. Install uv (Recommended) + +`uv` is a fast Python package installer and resolver. Install it with: + +```bash +pip install uv +``` + +### 2. Clone the Repository + +```bash +git clone https://github.com/DeepCritical/GradioDemo.git +cd GradioDemo +``` + +### 3. Install Dependencies + +Using `uv` (recommended): + +```bash +uv sync +``` + +Using `pip`: + +```bash +pip install -e . +``` + +### 4. Install Optional Dependencies + +For embeddings support (local sentence-transformers): + +```bash +uv sync --extra embeddings +``` + +For Modal sandbox execution: + +```bash +uv sync --extra modal +``` + +For Magentic orchestration: + +```bash +uv sync --extra magentic +``` + +Install all extras: + +```bash +uv sync --all-extras +``` + +### 5. Configure Environment Variables + +Create a `.env` file in the project root: + +```bash +# Required: At least one LLM provider +LLM_PROVIDER=openai # or "anthropic" or "huggingface" +OPENAI_API_KEY=your_openai_api_key_here + +# Optional: Other services +NCBI_API_KEY=your_ncbi_api_key_here # For higher PubMed rate limits +MODAL_TOKEN_ID=your_modal_token_id +MODAL_TOKEN_SECRET=your_modal_token_secret +``` + +See the [Configuration Guide](../configuration/index.md) for all available options. + +### 6. Verify Installation + +Run the application: + +```bash +uv run gradio run src/app.py +``` + +Open your browser to `http://localhost:7860` to verify the installation. + +## Development Setup + +For development, install dev dependencies: + +```bash +uv sync --all-extras --dev +``` + +Install pre-commit hooks: + +```bash +uv run pre-commit install +``` + +## Troubleshooting + +### Common Issues + +**Import Errors**: +- Ensure you've installed all required dependencies +- Check that Python 3.11+ is being used + +**API Key Errors**: +- Verify your `.env` file is in the project root +- Check that API keys are correctly formatted +- Ensure at least one LLM provider is configured + +**Module Not Found**: +- Run `uv sync` or `pip install -e .` again +- Check that you're in the correct virtual environment + +**Port Already in Use**: +- Change the port in `src/app.py` or use environment variable +- Kill the process using port 7860 + +## Next Steps + +- Read the [Quick Start Guide](quick-start.md) +- Learn about [MCP Integration](mcp-integration.md) +- Explore [Examples](examples.md) + + + diff --git a/docs/getting-started/mcp-integration.md b/docs/getting-started/mcp-integration.md new file mode 100644 index 0000000000000000000000000000000000000000..61f3601ce4fcc6f16149a1f2e590855ba998f54f --- /dev/null +++ b/docs/getting-started/mcp-integration.md @@ -0,0 +1,204 @@ +# MCP Integration + +DeepCritical exposes a Model Context Protocol (MCP) server, allowing you to use its search tools directly from Claude Desktop or other MCP clients. + +## What is MCP? + +The Model Context Protocol (MCP) is a standard for connecting AI assistants to external tools and data sources. DeepCritical implements an MCP server that exposes its search capabilities as MCP tools. + +## MCP Server URL + +When running locally: + +``` +http://localhost:7860/gradio_api/mcp/ +``` + +## Claude Desktop Configuration + +### 1. 
Locate Configuration File + +**macOS**: +``` +~/Library/Application Support/Claude/claude_desktop_config.json +``` + +**Windows**: +``` +%APPDATA%\Claude\claude_desktop_config.json +``` + +**Linux**: +``` +~/.config/Claude/claude_desktop_config.json +``` + +### 2. Add DeepCritical Server + +Edit `claude_desktop_config.json` and add: + +```json +{ + "mcpServers": { + "deepcritical": { + "url": "http://localhost:7860/gradio_api/mcp/" + } + } +} +``` + +### 3. Restart Claude Desktop + +Close and restart Claude Desktop for changes to take effect. + +### 4. Verify Connection + +In Claude Desktop, you should see DeepCritical tools available: +- `search_pubmed` +- `search_clinical_trials` +- `search_biorxiv` +- `search_all` +- `analyze_hypothesis` + +## Available Tools + +### search_pubmed + +Search peer-reviewed biomedical literature from PubMed. + +**Parameters**: +- `query` (string): Search query +- `max_results` (integer, optional): Maximum number of results (default: 10) + +**Example**: +``` +Search PubMed for "metformin diabetes" +``` + +### search_clinical_trials + +Search ClinicalTrials.gov for interventional studies. + +**Parameters**: +- `query` (string): Search query +- `max_results` (integer, optional): Maximum number of results (default: 10) + +**Example**: +``` +Search clinical trials for "Alzheimer's disease treatment" +``` + +### search_biorxiv + +Search bioRxiv/medRxiv preprints via Europe PMC. + +**Parameters**: +- `query` (string): Search query +- `max_results` (integer, optional): Maximum number of results (default: 10) + +**Example**: +``` +Search bioRxiv for "CRISPR gene editing" +``` + +### search_all + +Search all sources simultaneously (PubMed, ClinicalTrials.gov, Europe PMC). + +**Parameters**: +- `query` (string): Search query +- `max_results` (integer, optional): Maximum number of results per source (default: 10) + +**Example**: +``` +Search all sources for "COVID-19 vaccine efficacy" +``` + +### analyze_hypothesis + +Perform secure statistical analysis using Modal sandboxes. + +**Parameters**: +- `hypothesis` (string): Hypothesis to analyze +- `data` (string, optional): Data description or code + +**Example**: +``` +Analyze the hypothesis that metformin reduces cancer risk +``` + +## Using Tools in Claude Desktop + +Once configured, you can ask Claude to use DeepCritical tools: + +``` +Use DeepCritical to search PubMed for recent papers on Alzheimer's disease treatments. +``` + +Claude will automatically: +1. Call the appropriate DeepCritical tool +2. Retrieve results +3. 
Use the results in its response + +## Troubleshooting + +### Connection Issues + +**Server Not Found**: +- Ensure DeepCritical is running (`uv run gradio run src/app.py`) +- Verify the URL in `claude_desktop_config.json` is correct +- Check that port 7860 is not blocked by firewall + +**Tools Not Appearing**: +- Restart Claude Desktop after configuration changes +- Check Claude Desktop logs for errors +- Verify MCP server is accessible at the configured URL + +### Authentication + +If DeepCritical requires authentication: +- Configure API keys in DeepCritical settings +- Use HuggingFace OAuth login +- Ensure API keys are valid + +## Advanced Configuration + +### Custom Port + +If running on a different port, update the URL: + +```json +{ + "mcpServers": { + "deepcritical": { + "url": "http://localhost:8080/gradio_api/mcp/" + } + } +} +``` + +### Multiple Instances + +You can configure multiple DeepCritical instances: + +```json +{ + "mcpServers": { + "deepcritical-local": { + "url": "http://localhost:7860/gradio_api/mcp/" + }, + "deepcritical-remote": { + "url": "https://your-server.com/gradio_api/mcp/" + } + } +} +``` + +## Next Steps + +- Learn about [Configuration](../configuration/index.md) for advanced settings +- Explore [Examples](examples.md) for use cases +- Read the [Architecture Documentation](../architecture/graph-orchestration.md) + + + diff --git a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md new file mode 100644 index 0000000000000000000000000000000000000000..485c9ce375892340e1e9e3282f8d06d30fd655f7 --- /dev/null +++ b/docs/getting-started/quick-start.md @@ -0,0 +1,108 @@ +# Quick Start Guide + +Get up and running with DeepCritical in minutes. + +## Start the Application + +```bash +uv run gradio run src/app.py +``` + +Open your browser to `http://localhost:7860`. + +## First Research Query + +1. **Enter a Research Question** + + Type your research question in the chat interface, for example: + - "What are the latest treatments for Alzheimer's disease?" + - "Review the evidence for metformin in cancer prevention" + - "What clinical trials are investigating COVID-19 vaccines?" + +2. **Submit the Query** + + Click "Submit" or press Enter. The system will: + - Generate observations about your query + - Identify knowledge gaps + - Search multiple sources (PubMed, ClinicalTrials.gov, Europe PMC) + - Evaluate evidence quality + - Synthesize findings into a report + +3. **Review Results** + + Watch the real-time progress in the chat interface: + - Search operations and results + - Evidence evaluation + - Report generation + - Final research report with citations + +## Authentication + +### HuggingFace OAuth (Recommended) + +1. Click "Sign in with HuggingFace" at the top of the app +2. Authorize the application +3. Your HuggingFace API token will be automatically used +4. No need to manually enter API keys + +### Manual API Key + +1. Open the Settings accordion +2. Enter your API key: + - OpenAI API key + - Anthropic API key + - HuggingFace API key +3. Click "Save Settings" +4. 
Manual keys take priority over OAuth tokens + +## Understanding the Interface + +### Chat Interface + +- **Input**: Enter your research questions here +- **Messages**: View conversation history and research progress +- **Streaming**: Real-time updates as research progresses + +### Status Indicators + +- **Searching**: Active search operations +- **Evaluating**: Evidence quality assessment +- **Synthesizing**: Report generation +- **Complete**: Research finished + +### Settings + +- **API Keys**: Configure LLM providers +- **Research Mode**: Choose iterative or deep research +- **Budget Limits**: Set token, time, and iteration limits + +## Example Queries + +### Simple Query + +``` +What are the side effects of metformin? +``` + +### Complex Query + +``` +Review the evidence for using metformin as an anti-aging intervention, +including clinical trials, mechanisms of action, and safety profile. +``` + +### Clinical Trial Query + +``` +What are the active clinical trials investigating Alzheimer's disease treatments? +``` + +## Next Steps + +- Learn about [MCP Integration](mcp-integration.md) to use DeepCritical from Claude Desktop +- Explore [Examples](examples.md) for more use cases +- Read the [Configuration Guide](../configuration/index.md) for advanced settings +- Check out the [Architecture Documentation](../architecture/graph-orchestration.md) to understand how it works + + + diff --git a/docs/guides/deployment.md b/docs/guides/deployment.md deleted file mode 100644 index 35fe7e49ab2ea9f80ff502506cb22f37ce17608a..0000000000000000000000000000000000000000 --- a/docs/guides/deployment.md +++ /dev/null @@ -1,142 +0,0 @@ -# Deployment Guide -## Launching DeepCritical: Gradio, MCP, & Modal - ---- - -## Overview - -DeepCritical is designed for a multi-platform deployment strategy to maximize hackathon impact: - -1. **HuggingFace Spaces**: Host the Gradio UI (User Interface). -2. **MCP Server**: Expose research tools to Claude Desktop/Agents. -3. **Modal (Optional)**: Run heavy inference or local LLMs if API costs are prohibitive. - ---- - -## 1. HuggingFace Spaces (Gradio UI) - -**Goal**: A public URL where judges/users can try the research agent. - -### Prerequisites -- HuggingFace Account -- `gradio` installed (`uv add gradio`) - -### Steps - -1. **Create Space**: - - Go to HF Spaces -> Create New Space. - - SDK: **Gradio**. - - Hardware: **CPU Basic** (Free) is sufficient (since we use APIs). - -2. **Prepare Files**: - - Ensure `app.py` contains the Gradio interface construction. - - Ensure `requirements.txt` or `pyproject.toml` lists all dependencies. - -3. **Secrets**: - - Go to Space Settings -> **Repository secrets**. - - Add `ANTHROPIC_API_KEY` (or your chosen LLM provider key). - - Add `BRAVE_API_KEY` (for web search). - -4. **Deploy**: - - Push code to the Space's git repo. - - Watch "Build" logs. - -### Streaming Optimization -Ensure `app.py` uses generator functions for the chat interface to prevent timeouts: -```python -# app.py -def predict(message, history): - agent = ResearchAgent() - for update in agent.research_stream(message): - yield update -``` - ---- - -## 2. MCP Server Deployment - -**Goal**: Allow other agents (like Claude Desktop) to use our PubMed/Research tools directly. - -### Local Usage (Claude Desktop) - -1. **Install**: - ```bash - uv sync - ``` - -2. 
**Configure Claude Desktop**: - Edit `~/Library/Application Support/Claude/claude_desktop_config.json`: - ```json - { - "mcpServers": { - "deepcritical": { - "command": "uv", - "args": ["run", "fastmcp", "run", "src/mcp_servers/pubmed_server.py"], - "cwd": "/absolute/path/to/DeepCritical" - } - } - } - ``` - -3. **Restart Claude**: You should see a 🔌 icon indicating connected tools. - -### Remote Deployment (Smithery/Glama) -*Target for "MCP Track" bonus points.* - -1. **Dockerize**: Create a `Dockerfile` for the MCP server. - ```dockerfile - FROM python:3.11-slim - COPY . /app - RUN pip install fastmcp httpx - CMD ["fastmcp", "run", "src/mcp_servers/pubmed_server.py", "--transport", "sse"] - ``` - *Note: Use SSE transport for remote/HTTP servers.* - -2. **Deploy**: Host on Fly.io or Railway. - ---- - -## 3. Modal (GPU/Heavy Compute) - -**Goal**: Run a local LLM (e.g., Llama-3-70B) or handle massive parallel searches if APIs are too slow/expensive. - -### Setup -1. **Install**: `uv add modal` -2. **Auth**: `modal token new` - -### Logic -Instead of calling Anthropic API, we call a Modal function: - -```python -# src/llm/modal_client.py -import modal - -stub = modal.Stub("deepcritical-inference") - -@stub.function(gpu="A100") -def generate_text(prompt: str): - # Load vLLM or similar - ... -``` - -### When to use? -- **Hackathon Demo**: Stick to Anthropic/OpenAI APIs for speed/reliability. -- **Production/Stretch**: Use Modal if you hit rate limits or want to show off "Open Source Models" capability. - ---- - -## Deployment Checklist - -### Pre-Flight -- [ ] Run `pytest -m unit` to ensure logic is sound. -- [ ] Run `pytest -m e2e` (one pass) to verify APIs connect. -- [ ] Check `requirements.txt` matches `pyproject.toml`. - -### Secrets Management -- [ ] **NEVER** commit `.env` files. -- [ ] Verify keys are added to HF Space settings. - -### Post-Launch -- [ ] Test the live URL. -- [ ] Verify "Stop" button in Gradio works (interrupts the agent). -- [ ] Record a walkthrough video (crucial for hackathon submission). diff --git a/docs/implementation/01_phase_foundation.md b/docs/implementation/01_phase_foundation.md deleted file mode 100644 index 2b44c4c0629e32444493f9ec60cf5ab0bfd22796..0000000000000000000000000000000000000000 --- a/docs/implementation/01_phase_foundation.md +++ /dev/null @@ -1,587 +0,0 @@ -# Phase 1 Implementation Spec: Foundation & Tooling - -**Goal**: Establish a "Gucci Banger" development environment using 2025 best practices. -**Philosophy**: "If the build isn't solid, the agent won't be." - ---- - -## 1. Prerequisites - -Before starting, ensure these are installed: - -```bash -# Install uv (Rust-based package manager) -curl -LsSf https://astral.sh/uv/install.sh | sh - -# Verify -uv --version # Should be >= 0.4.0 -``` - ---- - -## 2. Project Initialization - -```bash -# From project root -uv init --name deepcritical -uv python install 3.11 # Pin Python version -``` - ---- - -## 3. 
The Tooling Stack (Exact Dependencies) - -### `pyproject.toml` (Complete, Copy-Paste Ready) - -```toml -[project] -name = "deepcritical" -version = "0.1.0" -description = "AI-Native Drug Repurposing Research Agent" -readme = "README.md" -requires-python = ">=3.11" -dependencies = [ - # Core - "pydantic>=2.7", - "pydantic-settings>=2.2", # For BaseSettings (config) - "pydantic-ai>=0.0.16", # Agent framework - - # HTTP & Parsing - "httpx>=0.27", # Async HTTP client - "beautifulsoup4>=4.12", # HTML parsing - "xmltodict>=0.13", # PubMed XML -> dict - - # Search - "duckduckgo-search>=6.0", # Free web search - - # UI - "gradio>=5.0", # Chat interface - - # Utils - "python-dotenv>=1.0", # .env loading - "tenacity>=8.2", # Retry logic - "structlog>=24.1", # Structured logging -] - -[project.optional-dependencies] -dev = [ - # Testing - "pytest>=8.0", - "pytest-asyncio>=0.23", - "pytest-sugar>=1.0", - "pytest-cov>=5.0", - "pytest-mock>=3.12", - "respx>=0.21", # Mock httpx requests - - # Quality - "ruff>=0.4.0", - "mypy>=1.10", - "pre-commit>=3.7", -] - -[build-system] -requires = ["hatchling"] -build-backend = "hatchling.build" - -[tool.hatch.build.targets.wheel] -packages = ["src"] - -# ============== RUFF CONFIG ============== -[tool.ruff] -line-length = 100 -target-version = "py311" -src = ["src", "tests"] - -[tool.ruff.lint] -select = [ - "E", # pycodestyle errors - "F", # pyflakes - "B", # flake8-bugbear - "I", # isort - "N", # pep8-naming - "UP", # pyupgrade - "PL", # pylint - "RUF", # ruff-specific -] -ignore = [ - "PLR0913", # Too many arguments (agents need many params) -] - -[tool.ruff.lint.isort] -known-first-party = ["src"] - -# ============== MYPY CONFIG ============== -[tool.mypy] -python_version = "3.11" -strict = true -ignore_missing_imports = true -disallow_untyped_defs = true -warn_return_any = true -warn_unused_ignores = true - -# ============== PYTEST CONFIG ============== -[tool.pytest.ini_options] -testpaths = ["tests"] -asyncio_mode = "auto" -addopts = [ - "-v", - "--tb=short", - "--strict-markers", -] -markers = [ - "unit: Unit tests (mocked)", - "integration: Integration tests (real APIs)", - "slow: Slow tests", -] - -# ============== COVERAGE CONFIG ============== -[tool.coverage.run] -source = ["src"] -omit = ["*/__init__.py"] - -[tool.coverage.report] -exclude_lines = [ - "pragma: no cover", - "if TYPE_CHECKING:", - "raise NotImplementedError", -] -``` - ---- - -## 4. 
Directory Structure (Maintainer's Structure) - -```bash -# Execute these commands to create the directory structure -mkdir -p src/utils -mkdir -p src/tools -mkdir -p src/prompts -mkdir -p src/agent_factory -mkdir -p src/middleware -mkdir -p src/database_services -mkdir -p src/retrieval_factory -mkdir -p tests/unit/tools -mkdir -p tests/unit/agent_factory -mkdir -p tests/unit/utils -mkdir -p tests/integration - -# Create __init__.py files (required for imports) -touch src/__init__.py -touch src/utils/__init__.py -touch src/tools/__init__.py -touch src/prompts/__init__.py -touch src/agent_factory/__init__.py -touch tests/__init__.py -touch tests/unit/__init__.py -touch tests/unit/tools/__init__.py -touch tests/unit/agent_factory/__init__.py -touch tests/unit/utils/__init__.py -touch tests/integration/__init__.py -``` - -### Final Structure: - -``` -src/ -├── __init__.py -├── app.py # Entry point (Gradio UI) -├── orchestrator.py # Agent loop -├── agent_factory/ # Agent creation and judges -│ ├── __init__.py -│ ├── agents.py -│ └── judges.py -├── tools/ # Search tools -│ ├── __init__.py -│ ├── pubmed.py -│ ├── websearch.py -│ └── search_handler.py -├── prompts/ # Prompt templates -│ ├── __init__.py -│ └── judge.py -├── utils/ # Shared utilities -│ ├── __init__.py -│ ├── config.py -│ ├── exceptions.py -│ ├── models.py -│ ├── dataloaders.py -│ └── parsers.py -├── middleware/ # (Future) -├── database_services/ # (Future) -└── retrieval_factory/ # (Future) - -tests/ -├── __init__.py -├── conftest.py -├── unit/ -│ ├── __init__.py -│ ├── tools/ -│ │ ├── __init__.py -│ │ ├── test_pubmed.py -│ │ ├── test_websearch.py -│ │ └── test_search_handler.py -│ ├── agent_factory/ -│ │ ├── __init__.py -│ │ └── test_judges.py -│ ├── utils/ -│ │ ├── __init__.py -│ │ └── test_config.py -│ └── test_orchestrator.py -└── integration/ - ├── __init__.py - └── test_pubmed_live.py -``` - ---- - -## 5. 
Configuration Files - -### `.env.example` (Copy to `.env` and fill) - -```bash -# LLM Provider (choose one) -OPENAI_API_KEY=sk-your-key-here -ANTHROPIC_API_KEY=sk-ant-your-key-here - -# Optional: PubMed API key (higher rate limits) -NCBI_API_KEY=your-ncbi-key-here - -# Optional: For HuggingFace deployment -HF_TOKEN=hf_your-token-here - -# Agent Config -MAX_ITERATIONS=10 -LOG_LEVEL=INFO -``` - -### `.pre-commit-config.yaml` - -```yaml -repos: - - repo: https://github.com/astral-sh/ruff-pre-commit - rev: v0.4.4 - hooks: - - id: ruff - args: [--fix] - - id: ruff-format - - - repo: https://github.com/pre-commit/mirrors-mypy - rev: v1.10.0 - hooks: - - id: mypy - additional_dependencies: - - pydantic>=2.7 - - pydantic-settings>=2.2 - args: [--ignore-missing-imports] -``` - -### `tests/conftest.py` (Pytest Fixtures) - -```python -"""Shared pytest fixtures for all tests.""" -import pytest -from unittest.mock import AsyncMock - - -@pytest.fixture -def mock_httpx_client(mocker): - """Mock httpx.AsyncClient for API tests.""" - mock = mocker.patch("httpx.AsyncClient") - mock.return_value.__aenter__ = AsyncMock(return_value=mock.return_value) - mock.return_value.__aexit__ = AsyncMock(return_value=None) - return mock - - -@pytest.fixture -def mock_llm_response(): - """Factory fixture for mocking LLM responses.""" - def _mock(content: str): - return AsyncMock(return_value=content) - return _mock - - -@pytest.fixture -def sample_evidence(): - """Sample Evidence objects for testing.""" - from src.utils.models import Evidence, Citation - return [ - Evidence( - content="Metformin shows promise in Alzheimer's...", - citation=Citation( - source="pubmed", - title="Metformin and Alzheimer's Disease", - url="https://pubmed.ncbi.nlm.nih.gov/12345678/", - date="2024-01-15" - ), - relevance=0.85 - ) - ] -``` - ---- - -## 6. 
Core Utilities Implementation - -### `src/utils/config.py` - -```python -"""Application configuration using Pydantic Settings.""" -from pydantic_settings import BaseSettings, SettingsConfigDict -from pydantic import Field -from typing import Literal -import structlog - - -class Settings(BaseSettings): - """Strongly-typed application settings.""" - - model_config = SettingsConfigDict( - env_file=".env", - env_file_encoding="utf-8", - case_sensitive=False, - extra="ignore", - ) - - # LLM Configuration - openai_api_key: str | None = Field(default=None, description="OpenAI API key") - anthropic_api_key: str | None = Field(default=None, description="Anthropic API key") - llm_provider: Literal["openai", "anthropic"] = Field( - default="openai", - description="Which LLM provider to use" - ) - openai_model: str = Field(default="gpt-4o", description="OpenAI model name") - anthropic_model: str = Field(default="claude-3-5-sonnet-20241022", description="Anthropic model") - - # PubMed Configuration - ncbi_api_key: str | None = Field(default=None, description="NCBI API key for higher rate limits") - - # Agent Configuration - max_iterations: int = Field(default=10, ge=1, le=50) - search_timeout: int = Field(default=30, description="Seconds to wait for search") - - # Logging - log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = "INFO" - - def get_api_key(self) -> str: - """Get the API key for the configured provider.""" - if self.llm_provider == "openai": - if not self.openai_api_key: - raise ValueError("OPENAI_API_KEY not set") - return self.openai_api_key - else: - if not self.anthropic_api_key: - raise ValueError("ANTHROPIC_API_KEY not set") - return self.anthropic_api_key - - -def get_settings() -> Settings: - """Factory function to get settings (allows mocking in tests).""" - return Settings() - - -def configure_logging(settings: Settings) -> None: - """Configure structured logging.""" - structlog.configure( - processors=[ - structlog.stdlib.filter_by_level, - structlog.stdlib.add_logger_name, - structlog.stdlib.add_log_level, - structlog.processors.TimeStamper(fmt="iso"), - structlog.processors.JSONRenderer(), - ], - wrapper_class=structlog.stdlib.BoundLogger, - context_class=dict, - logger_factory=structlog.stdlib.LoggerFactory(), - ) - - -# Singleton for easy import -settings = get_settings() -``` - -### `src/utils/exceptions.py` - -```python -"""Custom exceptions for DeepCritical.""" - - -class DeepCriticalError(Exception): - """Base exception for all DeepCritical errors.""" - pass - - -class SearchError(DeepCriticalError): - """Raised when a search operation fails.""" - pass - - -class JudgeError(DeepCriticalError): - """Raised when the judge fails to assess evidence.""" - pass - - -class ConfigurationError(DeepCriticalError): - """Raised when configuration is invalid.""" - pass - - -class RateLimitError(SearchError): - """Raised when we hit API rate limits.""" - pass -``` - ---- - -## 7. 
TDD Workflow: First Test - -### `tests/unit/utils/test_config.py` - -```python -"""Unit tests for configuration loading.""" -import pytest -from unittest.mock import patch -import os - - -class TestSettings: - """Tests for Settings class.""" - - def test_default_max_iterations(self): - """Settings should have default max_iterations of 10.""" - from src.utils.config import Settings - - # Clear any env vars - with patch.dict(os.environ, {}, clear=True): - settings = Settings() - assert settings.max_iterations == 10 - - def test_max_iterations_from_env(self): - """Settings should read MAX_ITERATIONS from env.""" - from src.utils.config import Settings - - with patch.dict(os.environ, {"MAX_ITERATIONS": "25"}): - settings = Settings() - assert settings.max_iterations == 25 - - def test_invalid_max_iterations_raises(self): - """Settings should reject invalid max_iterations.""" - from src.utils.config import Settings - from pydantic import ValidationError - - with patch.dict(os.environ, {"MAX_ITERATIONS": "100"}): - with pytest.raises(ValidationError): - Settings() # 100 > 50 (max) - - def test_get_api_key_openai(self): - """get_api_key should return OpenAI key when provider is openai.""" - from src.utils.config import Settings - - with patch.dict(os.environ, { - "LLM_PROVIDER": "openai", - "OPENAI_API_KEY": "sk-test-key" - }): - settings = Settings() - assert settings.get_api_key() == "sk-test-key" - - def test_get_api_key_missing_raises(self): - """get_api_key should raise when key is not set.""" - from src.utils.config import Settings - - with patch.dict(os.environ, {"LLM_PROVIDER": "openai"}, clear=True): - settings = Settings() - with pytest.raises(ValueError, match="OPENAI_API_KEY not set"): - settings.get_api_key() -``` - ---- - -## 8. Makefile (Developer Experience) - -Create a `Makefile` for standard devex commands: - -```makefile -.PHONY: install test lint format typecheck check clean - -install: - uv sync --all-extras - uv run pre-commit install - -test: - uv run pytest tests/unit/ -v - -test-cov: - uv run pytest --cov=src --cov-report=term-missing - -lint: - uv run ruff check src tests - -format: - uv run ruff format src tests - -typecheck: - uv run mypy src - -check: lint typecheck test - @echo "All checks passed!" - -clean: - rm -rf .pytest_cache .mypy_cache .ruff_cache __pycache__ .coverage - find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true -``` - ---- - -## 9. Execution Commands - -```bash -# Install all dependencies -uv sync --all-extras - -# Run tests (should pass after implementing config.py) -uv run pytest tests/unit/utils/test_config.py -v - -# Run full test suite with coverage -uv run pytest --cov=src --cov-report=term-missing - -# Run linting -uv run ruff check src tests -uv run ruff format src tests - -# Run type checking -uv run mypy src - -# Set up pre-commit hooks -uv run pre-commit install -``` - ---- - -## 10. 
Implementation Checklist - -- [ ] Install `uv` and verify version -- [ ] Run `uv init --name deepcritical` -- [ ] Create `pyproject.toml` (copy from above) -- [ ] Create directory structure (run mkdir commands) -- [ ] Create `.env.example` and `.env` -- [ ] Create `.pre-commit-config.yaml` -- [ ] Create `Makefile` (copy from above) -- [ ] Create `tests/conftest.py` -- [ ] Implement `src/utils/config.py` -- [ ] Implement `src/utils/exceptions.py` -- [ ] Write tests in `tests/unit/utils/test_config.py` -- [ ] Run `make install` -- [ ] Run `make check` — **ALL CHECKS MUST PASS** -- [ ] Commit: `git commit -m "feat: phase 1 foundation complete"` - ---- - -## 11. Definition of Done - -Phase 1 is **COMPLETE** when: - -1. `uv run pytest` passes with 100% of tests green -2. `uv run ruff check src tests` has 0 errors -3. `uv run mypy src` has 0 errors -4. Pre-commit hooks are installed and working -5. `from src.utils.config import settings` works in Python REPL - -**Proceed to Phase 2 ONLY after all checkboxes are complete.** diff --git a/docs/implementation/02_phase_search.md b/docs/implementation/02_phase_search.md deleted file mode 100644 index 62a0564d0d21f9963e72d42728f5090b37523369..0000000000000000000000000000000000000000 --- a/docs/implementation/02_phase_search.md +++ /dev/null @@ -1,822 +0,0 @@ -# Phase 2 Implementation Spec: Search Vertical Slice - -**Goal**: Implement the "Eyes and Ears" of the agent — retrieving real biomedical data. -**Philosophy**: "Real data, mocked connections." -**Prerequisite**: Phase 1 complete (all tests passing) - -> **⚠️ Implementation Note (2025-01-27)**: The DuckDuckGo WebTool specified in this phase was removed in favor of the Europe PMC tool (see Phase 11). Europe PMC provides better coverage for biomedical research by including preprints, peer-reviewed articles, and patents. The current implementation uses PubMed, ClinicalTrials.gov, and Europe PMC as search sources. - ---- - -## 1. The Slice Definition - -This slice covers: -1. **Input**: A string query (e.g., "metformin Alzheimer's disease"). -2. **Process**: - - Fetch from PubMed (E-utilities API). - - ~~Fetch from Web (DuckDuckGo).~~ **REMOVED** - Replaced by Europe PMC in Phase 11 - - Normalize results into `Evidence` models. -3. **Output**: A list of `Evidence` objects. - -**Files to Create**: -- `src/utils/models.py` - Pydantic models (Evidence, Citation, SearchResult) -- `src/tools/pubmed.py` - PubMed E-utilities tool -- ~~`src/tools/websearch.py` - DuckDuckGo search tool~~ **REMOVED** - See Phase 11 for Europe PMC replacement -- `src/tools/search_handler.py` - Orchestrates multiple tools -- `src/tools/__init__.py` - Exports - -**Additional Files (Post-Phase 2 Enhancements)**: -- `src/tools/query_utils.py` - Query preprocessing (removes question words, expands medical synonyms) - ---- - -## 2. PubMed E-utilities API Reference - -**Base URL**: `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/` - -### Key Endpoints - -| Endpoint | Purpose | Example | -|----------|---------|---------| -| `esearch.fcgi` | Search for article IDs | `?db=pubmed&term=metformin+alzheimer&retmax=10` | -| `efetch.fcgi` | Fetch article details | `?db=pubmed&id=12345,67890&rettype=abstract&retmode=xml` | - -### Rate Limiting (CRITICAL!) 
- -NCBI **requires** rate limiting: -- **Without API key**: 3 requests/second -- **With API key**: 10 requests/second - -Get a free API key: https://www.ncbi.nlm.nih.gov/account/settings/ - -```python -# Add to .env -NCBI_API_KEY=your-key-here # Optional but recommended -``` - -### Example Search Flow - -``` -1. esearch: "metformin alzheimer" → [PMID: 12345, 67890, ...] -2. efetch: PMIDs → Full abstracts/metadata -3. Parse XML → Evidence objects -``` - ---- - -## 3. Models (`src/utils/models.py`) - -```python -"""Data models for the Search feature.""" -from pydantic import BaseModel, Field -from typing import Literal - - -class Citation(BaseModel): - """A citation to a source document.""" - - source: Literal["pubmed", "web"] = Field(description="Where this came from") - title: str = Field(min_length=1, max_length=500) - url: str = Field(description="URL to the source") - date: str = Field(description="Publication date (YYYY-MM-DD or 'Unknown')") - authors: list[str] = Field(default_factory=list) - - @property - def formatted(self) -> str: - """Format as a citation string.""" - author_str = ", ".join(self.authors[:3]) - if len(self.authors) > 3: - author_str += " et al." - return f"{author_str} ({self.date}). {self.title}. {self.source.upper()}" - - -class Evidence(BaseModel): - """A piece of evidence retrieved from search.""" - - content: str = Field(min_length=1, description="The actual text content") - citation: Citation - relevance: float = Field(default=0.0, ge=0.0, le=1.0, description="Relevance score 0-1") - - class Config: - frozen = True # Immutable after creation - - -class SearchResult(BaseModel): - """Result of a search operation.""" - - query: str - evidence: list[Evidence] - sources_searched: list[Literal["pubmed", "web"]] - total_found: int - errors: list[str] = Field(default_factory=list) -``` - ---- - -## 4. Tool Protocol (`src/tools/pubmed.py` and `src/tools/websearch.py`) - -### The Interface (Protocol) - Add to `src/tools/__init__.py` - -```python -"""Search tools package.""" -from typing import Protocol, List - -# Import implementations -from src.tools.pubmed import PubMedTool -from src.tools.websearch import WebTool -from src.tools.search_handler import SearchHandler - -# Re-export -__all__ = ["SearchTool", "PubMedTool", "WebTool", "SearchHandler"] - - -class SearchTool(Protocol): - """Protocol defining the interface for all search tools.""" - - @property - def name(self) -> str: - """Human-readable name of this tool.""" - ... - - async def search(self, query: str, max_results: int = 10) -> List["Evidence"]: - """ - Execute a search and return evidence. - - Args: - query: The search query string - max_results: Maximum number of results to return - - Returns: - List of Evidence objects - - Raises: - SearchError: If the search fails - RateLimitError: If we hit rate limits - """ - ... 
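-
-
-# Because SearchTool is a typing.Protocol, any class that exposes a matching
-# `name` property and an async `search` method satisfies it implicitly -- no
-# inheritance required. A minimal sketch (hypothetical stub, for illustration
-# only):
-#
-#     class StubTool:
-#         @property
-#         def name(self) -> str:
-#             return "stub"
-#
-#         async def search(self, query: str, max_results: int = 10) -> List["Evidence"]:
-#             return []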
-``` - -### PubMed Tool Implementation (`src/tools/pubmed.py`) - -```python -"""PubMed search tool using NCBI E-utilities.""" -import asyncio -import httpx -import xmltodict -from typing import List -from tenacity import retry, stop_after_attempt, wait_exponential - -from src.utils.config import settings -from src.utils.exceptions import SearchError, RateLimitError -from src.utils.models import Evidence, Citation - - -class PubMedTool: - """Search tool for PubMed/NCBI.""" - - BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils" - RATE_LIMIT_DELAY = 0.34 # ~3 requests/sec without API key - - def __init__(self, api_key: str | None = None): - self.api_key = api_key or getattr(settings, "ncbi_api_key", None) - self._last_request_time = 0.0 - - @property - def name(self) -> str: - return "pubmed" - - async def _rate_limit(self) -> None: - """Enforce NCBI rate limiting.""" - now = asyncio.get_event_loop().time() - elapsed = now - self._last_request_time - if elapsed < self.RATE_LIMIT_DELAY: - await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed) - self._last_request_time = asyncio.get_event_loop().time() - - def _build_params(self, **kwargs) -> dict: - """Build request params with optional API key.""" - params = {**kwargs, "retmode": "json"} - if self.api_key: - params["api_key"] = self.api_key - return params - - @retry( - stop=stop_after_attempt(3), - wait=wait_exponential(multiplier=1, min=1, max=10), - reraise=True, - ) - async def search(self, query: str, max_results: int = 10) -> List[Evidence]: - """ - Search PubMed and return evidence. - - 1. ESearch: Get PMIDs matching query - 2. EFetch: Get abstracts for those PMIDs - 3. Parse and return Evidence objects - """ - await self._rate_limit() - - async with httpx.AsyncClient(timeout=30.0) as client: - # Step 1: Search for PMIDs - search_params = self._build_params( - db="pubmed", - term=query, - retmax=max_results, - sort="relevance", - ) - - try: - search_resp = await client.get( - f"{self.BASE_URL}/esearch.fcgi", - params=search_params, - ) - search_resp.raise_for_status() - except httpx.HTTPStatusError as e: - if e.response.status_code == 429: - raise RateLimitError("PubMed rate limit exceeded") - raise SearchError(f"PubMed search failed: {e}") - - search_data = search_resp.json() - pmids = search_data.get("esearchresult", {}).get("idlist", []) - - if not pmids: - return [] - - # Step 2: Fetch abstracts - await self._rate_limit() - fetch_params = self._build_params( - db="pubmed", - id=",".join(pmids), - rettype="abstract", - ) - # Use XML for fetch (more reliable parsing) - fetch_params["retmode"] = "xml" - - fetch_resp = await client.get( - f"{self.BASE_URL}/efetch.fcgi", - params=fetch_params, - ) - fetch_resp.raise_for_status() - - # Step 3: Parse XML to Evidence - return self._parse_pubmed_xml(fetch_resp.text) - - def _parse_pubmed_xml(self, xml_text: str) -> List[Evidence]: - """Parse PubMed XML into Evidence objects.""" - try: - data = xmltodict.parse(xml_text) - except Exception as e: - raise SearchError(f"Failed to parse PubMed XML: {e}") - - articles = data.get("PubmedArticleSet", {}).get("PubmedArticle", []) - - # Handle single article (xmltodict returns dict instead of list) - if isinstance(articles, dict): - articles = [articles] - - evidence_list = [] - for article in articles: - try: - evidence = self._article_to_evidence(article) - if evidence: - evidence_list.append(evidence) - except Exception: - continue # Skip malformed articles - - return evidence_list - - def _article_to_evidence(self, article: dict) -> Evidence 
| None: - """Convert a single PubMed article to Evidence.""" - medline = article.get("MedlineCitation", {}) - article_data = medline.get("Article", {}) - - # Extract PMID - pmid = medline.get("PMID", {}) - if isinstance(pmid, dict): - pmid = pmid.get("#text", "") - - # Extract title - title = article_data.get("ArticleTitle", "") - if isinstance(title, dict): - title = title.get("#text", str(title)) - - # Extract abstract - abstract_data = article_data.get("Abstract", {}).get("AbstractText", "") - if isinstance(abstract_data, list): - abstract = " ".join( - item.get("#text", str(item)) if isinstance(item, dict) else str(item) - for item in abstract_data - ) - elif isinstance(abstract_data, dict): - abstract = abstract_data.get("#text", str(abstract_data)) - else: - abstract = str(abstract_data) - - if not abstract or not title: - return None - - # Extract date - pub_date = article_data.get("Journal", {}).get("JournalIssue", {}).get("PubDate", {}) - year = pub_date.get("Year", "Unknown") - month = pub_date.get("Month", "01") - day = pub_date.get("Day", "01") - date_str = f"{year}-{month}-{day}" if year != "Unknown" else "Unknown" - - # Extract authors - author_list = article_data.get("AuthorList", {}).get("Author", []) - if isinstance(author_list, dict): - author_list = [author_list] - authors = [] - for author in author_list[:5]: # Limit to 5 authors - last = author.get("LastName", "") - first = author.get("ForeName", "") - if last: - authors.append(f"{last} {first}".strip()) - - return Evidence( - content=abstract[:2000], # Truncate long abstracts - citation=Citation( - source="pubmed", - title=title[:500], - url=f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/", - date=date_str, - authors=authors, - ), - ) -``` - -### DuckDuckGo Tool Implementation (`src/tools/websearch.py`) - -```python -"""Web search tool using DuckDuckGo.""" -from typing import List -from duckduckgo_search import DDGS - -from src.utils.exceptions import SearchError -from src.utils.models import Evidence, Citation - - -class WebTool: - """Search tool for general web search via DuckDuckGo.""" - - def __init__(self): - pass - - @property - def name(self) -> str: - return "web" - - async def search(self, query: str, max_results: int = 10) -> List[Evidence]: - """ - Search DuckDuckGo and return evidence. - - Note: duckduckgo-search is synchronous, so we run it in executor. - """ - import asyncio - - loop = asyncio.get_event_loop() - try: - results = await loop.run_in_executor( - None, - lambda: self._sync_search(query, max_results), - ) - return results - except Exception as e: - raise SearchError(f"Web search failed: {e}") - - def _sync_search(self, query: str, max_results: int) -> List[Evidence]: - """Synchronous search implementation.""" - evidence_list = [] - - with DDGS() as ddgs: - results = list(ddgs.text(query, max_results=max_results)) - - for result in results: - evidence_list.append( - Evidence( - content=result.get("body", "")[:1000], - citation=Citation( - source="web", - title=result.get("title", "Unknown")[:500], - url=result.get("href", ""), - date="Unknown", - authors=[], - ), - ) - ) - - return evidence_list -``` - ---- - -## 5. Search Handler (`src/tools/search_handler.py`) - -The handler orchestrates multiple tools using the **Scatter-Gather** pattern. 
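-
-A minimal usage sketch (illustrative; it wires the PubMedTool and WebTool from this phase into the handler, whose implementation follows below):
-
-```python
-import asyncio
-
-from src.tools.pubmed import PubMedTool
-from src.tools.search_handler import SearchHandler
-from src.tools.websearch import WebTool
-
-
-async def demo() -> None:
-    # Scatter: each tool searches in parallel; gather: results are merged.
-    handler = SearchHandler(tools=[PubMedTool(), WebTool()], timeout=30.0)
-    result = await handler.execute("metformin alzheimer", max_results_per_tool=5)
-    print(f"Found {result.total_found} evidence items from {result.sources_searched}")
-    for error in result.errors:
-        print(f"Tool failure (non-fatal): {error}")
-
-
-asyncio.run(demo())
-```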
- -```python -"""Search handler - orchestrates multiple search tools.""" -import asyncio -from typing import List, Protocol -import structlog - -from src.utils.exceptions import SearchError -from src.utils.models import Evidence, SearchResult - -logger = structlog.get_logger() - - -class SearchTool(Protocol): - """Protocol defining the interface for all search tools.""" - - @property - def name(self) -> str: - ... - - async def search(self, query: str, max_results: int = 10) -> List[Evidence]: - ... - - -def flatten(nested: List[List[Evidence]]) -> List[Evidence]: - """Flatten a list of lists into a single list.""" - return [item for sublist in nested for item in sublist] - - -class SearchHandler: - """Orchestrates parallel searches across multiple tools.""" - - def __init__(self, tools: List[SearchTool], timeout: float = 30.0): - """ - Initialize the search handler. - - Args: - tools: List of search tools to use - timeout: Timeout for each search in seconds - """ - self.tools = tools - self.timeout = timeout - - async def execute(self, query: str, max_results_per_tool: int = 10) -> SearchResult: - """ - Execute search across all tools in parallel. - - Args: - query: The search query - max_results_per_tool: Max results from each tool - - Returns: - SearchResult containing all evidence and metadata - """ - logger.info("Starting search", query=query, tools=[t.name for t in self.tools]) - - # Create tasks for parallel execution - tasks = [ - self._search_with_timeout(tool, query, max_results_per_tool) - for tool in self.tools - ] - - # Gather results (don't fail if one tool fails) - results = await asyncio.gather(*tasks, return_exceptions=True) - - # Process results - all_evidence: List[Evidence] = [] - sources_searched: List[str] = [] - errors: List[str] = [] - - for tool, result in zip(self.tools, results): - if isinstance(result, Exception): - errors.append(f"{tool.name}: {str(result)}") - logger.warning("Search tool failed", tool=tool.name, error=str(result)) - else: - all_evidence.extend(result) - sources_searched.append(tool.name) - logger.info("Search tool succeeded", tool=tool.name, count=len(result)) - - return SearchResult( - query=query, - evidence=all_evidence, - sources_searched=sources_searched, - total_found=len(all_evidence), - errors=errors, - ) - - async def _search_with_timeout( - self, - tool: SearchTool, - query: str, - max_results: int, - ) -> List[Evidence]: - """Execute a single tool search with timeout.""" - try: - return await asyncio.wait_for( - tool.search(query, max_results), - timeout=self.timeout, - ) - except asyncio.TimeoutError: - raise SearchError(f"{tool.name} search timed out after {self.timeout}s") -``` - ---- - -## 6. TDD Workflow - -### Test File: `tests/unit/tools/test_pubmed.py` - -```python -"""Unit tests for PubMed tool.""" -import pytest -from unittest.mock import AsyncMock, MagicMock - - -# Sample PubMed XML response for mocking -SAMPLE_PUBMED_XML = """ - - - - 12345678 -
-            <Article>
-                <ArticleTitle>Metformin in Alzheimer's Disease: A Systematic Review</ArticleTitle>
-                <Abstract>
-                    <AbstractText>Metformin shows neuroprotective properties...</AbstractText>
-                </Abstract>
-                <AuthorList>
-                    <Author>
-                        <LastName>Smith</LastName>
-                        <ForeName>John</ForeName>
-                    </Author>
-                </AuthorList>
-                <Journal>
-                    <JournalIssue>
-                        <PubDate>
-                            <Year>2024</Year>
-                            <Month>01</Month>
-                        </PubDate>
-                    </JournalIssue>
-                </Journal>
-            </Article>
-        </MedlineCitation>
-    </PubmedArticle>
-</PubmedArticleSet>
-""" - - -class TestPubMedTool: - """Tests for PubMedTool.""" - - @pytest.mark.asyncio - async def test_search_returns_evidence(self, mocker): - """PubMedTool should return Evidence objects from search.""" - from src.tools.pubmed import PubMedTool - - # Mock the HTTP responses - mock_search_response = MagicMock() - mock_search_response.json.return_value = { - "esearchresult": {"idlist": ["12345678"]} - } - mock_search_response.raise_for_status = MagicMock() - - mock_fetch_response = MagicMock() - mock_fetch_response.text = SAMPLE_PUBMED_XML - mock_fetch_response.raise_for_status = MagicMock() - - mock_client = AsyncMock() - mock_client.get = AsyncMock(side_effect=[mock_search_response, mock_fetch_response]) - mock_client.__aenter__ = AsyncMock(return_value=mock_client) - mock_client.__aexit__ = AsyncMock(return_value=None) - - mocker.patch("httpx.AsyncClient", return_value=mock_client) - - # Act - tool = PubMedTool() - results = await tool.search("metformin alzheimer") - - # Assert - assert len(results) == 1 - assert results[0].citation.source == "pubmed" - assert "Metformin" in results[0].citation.title - assert "12345678" in results[0].citation.url - - @pytest.mark.asyncio - async def test_search_empty_results(self, mocker): - """PubMedTool should return empty list when no results.""" - from src.tools.pubmed import PubMedTool - - mock_response = MagicMock() - mock_response.json.return_value = {"esearchresult": {"idlist": []}} - mock_response.raise_for_status = MagicMock() - - mock_client = AsyncMock() - mock_client.get = AsyncMock(return_value=mock_response) - mock_client.__aenter__ = AsyncMock(return_value=mock_client) - mock_client.__aexit__ = AsyncMock(return_value=None) - - mocker.patch("httpx.AsyncClient", return_value=mock_client) - - tool = PubMedTool() - results = await tool.search("xyznonexistentquery123") - - assert results == [] - - def test_parse_pubmed_xml(self): - """PubMedTool should correctly parse XML.""" - from src.tools.pubmed import PubMedTool - - tool = PubMedTool() - results = tool._parse_pubmed_xml(SAMPLE_PUBMED_XML) - - assert len(results) == 1 - assert results[0].citation.source == "pubmed" - assert "Smith John" in results[0].citation.authors -``` - -### Test File: `tests/unit/tools/test_websearch.py` - -```python -"""Unit tests for WebTool.""" -import pytest -from unittest.mock import MagicMock - - -class TestWebTool: - """Tests for WebTool.""" - - @pytest.mark.asyncio - async def test_search_returns_evidence(self, mocker): - """WebTool should return Evidence objects from search.""" - from src.tools.websearch import WebTool - - mock_results = [ - { - "title": "Drug Repurposing Article", - "href": "https://example.com/article", - "body": "Some content about drug repurposing...", - } - ] - - mock_ddgs = MagicMock() - mock_ddgs.__enter__ = MagicMock(return_value=mock_ddgs) - mock_ddgs.__exit__ = MagicMock(return_value=None) - mock_ddgs.text = MagicMock(return_value=mock_results) - - mocker.patch("src.tools.websearch.DDGS", return_value=mock_ddgs) - - tool = WebTool() - results = await tool.search("drug repurposing") - - assert len(results) == 1 - assert results[0].citation.source == "web" - assert "Drug Repurposing" in results[0].citation.title -``` - -### Test File: `tests/unit/tools/test_search_handler.py` - -```python -"""Unit tests for SearchHandler.""" -import pytest -from unittest.mock import AsyncMock - -from src.utils.models import Evidence, Citation -from src.utils.exceptions import SearchError - - -class TestSearchHandler: - """Tests for SearchHandler.""" 
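-
-    # The tools below are AsyncMock instances, so these tests exercise the
-    # scatter-gather logic without any network access.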
- - @pytest.mark.asyncio - async def test_execute_aggregates_results(self): - """SearchHandler should aggregate results from all tools.""" - from src.tools.search_handler import SearchHandler - - # Create mock tools - mock_tool_1 = AsyncMock() - mock_tool_1.name = "mock1" - mock_tool_1.search = AsyncMock(return_value=[ - Evidence( - content="Result 1", - citation=Citation(source="pubmed", title="T1", url="u1", date="2024"), - ) - ]) - - mock_tool_2 = AsyncMock() - mock_tool_2.name = "mock2" - mock_tool_2.search = AsyncMock(return_value=[ - Evidence( - content="Result 2", - citation=Citation(source="web", title="T2", url="u2", date="2024"), - ) - ]) - - handler = SearchHandler(tools=[mock_tool_1, mock_tool_2]) - result = await handler.execute("test query") - - assert result.total_found == 2 - assert "mock1" in result.sources_searched - assert "mock2" in result.sources_searched - assert len(result.errors) == 0 - - @pytest.mark.asyncio - async def test_execute_handles_tool_failure(self): - """SearchHandler should continue if one tool fails.""" - from src.tools.search_handler import SearchHandler - - mock_tool_ok = AsyncMock() - mock_tool_ok.name = "ok_tool" - mock_tool_ok.search = AsyncMock(return_value=[ - Evidence( - content="Good result", - citation=Citation(source="pubmed", title="T", url="u", date="2024"), - ) - ]) - - mock_tool_fail = AsyncMock() - mock_tool_fail.name = "fail_tool" - mock_tool_fail.search = AsyncMock(side_effect=SearchError("API down")) - - handler = SearchHandler(tools=[mock_tool_ok, mock_tool_fail]) - result = await handler.execute("test") - - assert result.total_found == 1 - assert "ok_tool" in result.sources_searched - assert len(result.errors) == 1 - assert "fail_tool" in result.errors[0] -``` - ---- - -## 7. Integration Test (Optional, Real API) - -```python -# tests/integration/test_pubmed_live.py -"""Integration tests that hit real APIs (run manually).""" -import pytest - - -@pytest.mark.integration -@pytest.mark.slow -@pytest.mark.asyncio -async def test_pubmed_live_search(): - """Test real PubMed search (requires network).""" - from src.tools.pubmed import PubMedTool - - tool = PubMedTool() - results = await tool.search("metformin diabetes", max_results=3) - - assert len(results) > 0 - assert results[0].citation.source == "pubmed" - assert "pubmed.ncbi.nlm.nih.gov" in results[0].citation.url - - -# Run with: uv run pytest tests/integration -m integration -``` - ---- - -## 8. 
Implementation Checklist - -- [x] Create `src/utils/models.py` with all Pydantic models (Evidence, Citation, SearchResult) - **COMPLETE** -- [x] Create `src/tools/__init__.py` with SearchTool Protocol and exports - **COMPLETE** -- [x] Implement `src/tools/pubmed.py` with PubMedTool class - **COMPLETE** -- [ ] ~~Implement `src/tools/websearch.py` with WebTool class~~ - **REMOVED** (replaced by Europe PMC in Phase 11) -- [x] Create `src/tools/search_handler.py` with SearchHandler class - **COMPLETE** -- [x] Write tests in `tests/unit/tools/test_pubmed.py` - **COMPLETE** (basic tests) -- [ ] Write tests in `tests/unit/tools/test_websearch.py` - **N/A** (WebTool removed) -- [x] Write tests in `tests/unit/tools/test_search_handler.py` - **COMPLETE** (basic tests) -- [x] Run `uv run pytest tests/unit/tools/ -v` — **ALL TESTS MUST PASS** - **PASSING** -- [ ] (Optional) Run integration test: `uv run pytest -m integration` -- [ ] Add edge case tests (rate limiting, error handling, timeouts) - **PENDING** -- [ ] Commit: `git commit -m "feat: phase 2 search slice complete"` - **DONE** - -**Post-Phase 2 Enhancements**: -- [x] Query preprocessing (`src/tools/query_utils.py`) - **ADDED** -- [x] Europe PMC tool (Phase 11) - **ADDED** -- [x] ClinicalTrials tool (Phase 10) - **ADDED** - ---- - -## 9. Definition of Done - -Phase 2 is **COMPLETE** when: - -1. ✅ All unit tests pass: `uv run pytest tests/unit/tools/ -v` - **PASSING** -2. ✅ `SearchHandler` can execute with search tools - **WORKING** -3. ✅ Graceful degradation: if one tool fails, other tools still return results - **IMPLEMENTED** -4. ✅ Rate limiting is enforced (verify no 429 errors) - **IMPLEMENTED** -5. ✅ Can run this in Python REPL: - -```python -import asyncio -from src.tools.pubmed import PubMedTool -from src.tools.search_handler import SearchHandler - -async def test(): - handler = SearchHandler([PubMedTool()]) - result = await handler.execute("metformin alzheimer") - print(f"Found {result.total_found} results") - for e in result.evidence[:3]: - print(f"- {e.citation.title}") - -asyncio.run(test()) -``` - -**Note**: WebTool was removed in favor of Europe PMC (Phase 11). The current implementation uses PubMed as the primary Phase 2 tool, with Europe PMC and ClinicalTrials added in later phases. - -**Proceed to Phase 3 ONLY after all checkboxes are complete.** diff --git a/docs/implementation/03_phase_judge.md b/docs/implementation/03_phase_judge.md deleted file mode 100644 index f97ff8b814233e6d26046ba52906b39c2ae1b742..0000000000000000000000000000000000000000 --- a/docs/implementation/03_phase_judge.md +++ /dev/null @@ -1,1052 +0,0 @@ -# Phase 3 Implementation Spec: Judge Vertical Slice - -**Goal**: Implement the "Brain" of the agent — evaluating evidence quality. -**Philosophy**: "Structured Output or Bust." -**Prerequisite**: Phase 2 complete (all search tests passing) - ---- - -## 1. The Slice Definition - -This slice covers: -1. **Input**: A user question + a list of `Evidence` (from Phase 2). -2. **Process**: - - Construct a prompt with the evidence. - - Call LLM (PydanticAI / OpenAI / Anthropic). - - Force JSON structured output. -3. **Output**: A `JudgeAssessment` object. - -**Files to Create**: -- `src/utils/models.py` - Add JudgeAssessment models (extend from Phase 2) -- `src/prompts/judge.py` - Judge prompt templates -- `src/agent_factory/judges.py` - JudgeHandler with PydanticAI -- `tests/unit/agent_factory/test_judges.py` - Unit tests - ---- - -## 2. 
Models (Add to `src/utils/models.py`) - -The output schema must be strict for reliable structured output. - -```python -"""Add these models to src/utils/models.py (after Evidence models from Phase 2).""" -from pydantic import BaseModel, Field -from typing import List, Literal - - -class AssessmentDetails(BaseModel): - """Detailed assessment of evidence quality.""" - - mechanism_score: int = Field( - ..., - ge=0, - le=10, - description="How well does the evidence explain the mechanism? 0-10" - ) - mechanism_reasoning: str = Field( - ..., - min_length=10, - description="Explanation of mechanism score" - ) - clinical_evidence_score: int = Field( - ..., - ge=0, - le=10, - description="Strength of clinical/preclinical evidence. 0-10" - ) - clinical_reasoning: str = Field( - ..., - min_length=10, - description="Explanation of clinical evidence score" - ) - drug_candidates: List[str] = Field( - default_factory=list, - description="List of specific drug candidates mentioned" - ) - key_findings: List[str] = Field( - default_factory=list, - description="Key findings from the evidence" - ) - - -class JudgeAssessment(BaseModel): - """Complete assessment from the Judge.""" - - details: AssessmentDetails - sufficient: bool = Field( - ..., - description="Is evidence sufficient to provide a recommendation?" - ) - confidence: float = Field( - ..., - ge=0.0, - le=1.0, - description="Confidence in the assessment (0-1)" - ) - recommendation: Literal["continue", "synthesize"] = Field( - ..., - description="continue = need more evidence, synthesize = ready to answer" - ) - next_search_queries: List[str] = Field( - default_factory=list, - description="If continue, what queries to search next" - ) - reasoning: str = Field( - ..., - min_length=20, - description="Overall reasoning for the recommendation" - ) -``` - ---- - -## 3. Prompt Engineering (`src/prompts/judge.py`) - -We treat prompts as code. They should be versioned and clean. - -```python -"""Judge prompts for evidence assessment.""" -from typing import List -from src.utils.models import Evidence - - -SYSTEM_PROMPT = """You are an expert drug repurposing research judge. - -Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to recommend drug candidates for a given condition. - -## Evaluation Criteria - -1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism? - - 0-3: No clear mechanism, speculative - - 4-6: Some mechanistic insight, but gaps exist - - 7-10: Clear, well-supported mechanism of action - -2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support? - - 0-3: No clinical data, only theoretical - - 4-6: Preclinical or early clinical data - - 7-10: Strong clinical evidence (trials, meta-analyses) - -3. **Sufficiency**: Evidence is sufficient when: - - Combined scores >= 12 AND - - At least one specific drug candidate identified AND - - Clear mechanistic rationale exists - -## Output Rules - -- Always output valid JSON matching the schema -- Be conservative: only recommend "synthesize" when truly confident -- If continuing, suggest specific, actionable search queries -- Never hallucinate drug names or findings not in the evidence -""" - - -def format_user_prompt(question: str, evidence: List[Evidence]) -> str: - """ - Format the user prompt with question and evidence. 
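-
-    Evidence content longer than 1500 characters is truncated when it is
-    embedded in the prompt.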
- - Args: - question: The user's research question - evidence: List of Evidence objects from search - - Returns: - Formatted prompt string - """ - evidence_text = "\n\n".join([ - f"### Evidence {i+1}\n" - f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n" - f"**URL**: {e.citation.url}\n" - f"**Date**: {e.citation.date}\n" - f"**Content**:\n{e.content[:1500]}..." - if len(e.content) > 1500 else - f"### Evidence {i+1}\n" - f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n" - f"**URL**: {e.citation.url}\n" - f"**Date**: {e.citation.date}\n" - f"**Content**:\n{e.content}" - for i, e in enumerate(evidence) - ]) - - return f"""## Research Question -{question} - -## Available Evidence ({len(evidence)} sources) - -{evidence_text} - -## Your Task - -Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates. -Respond with a JSON object matching the JudgeAssessment schema. -""" - - -def format_empty_evidence_prompt(question: str) -> str: - """ - Format prompt when no evidence was found. - - Args: - question: The user's research question - - Returns: - Formatted prompt string - """ - return f"""## Research Question -{question} - -## Available Evidence - -No evidence was found from the search. - -## Your Task - -Since no evidence was found, recommend search queries that might yield better results. -Set sufficient=False and recommendation="continue". -Suggest 3-5 specific search queries. -""" -``` - ---- - -## 4. JudgeHandler Implementation (`src/agent_factory/judges.py`) - -Using PydanticAI for structured output with retry logic. - -```python -"""Judge handler for evidence assessment using PydanticAI.""" -import os -from typing import List -import structlog -from pydantic_ai import Agent -from pydantic_ai.models.openai import OpenAIModel -from pydantic_ai.models.anthropic import AnthropicModel - -from src.utils.models import Evidence, JudgeAssessment, AssessmentDetails -from src.utils.config import settings -from src.prompts.judge import SYSTEM_PROMPT, format_user_prompt, format_empty_evidence_prompt - -logger = structlog.get_logger() - - -def get_model(): - """Get the LLM model based on configuration.""" - provider = getattr(settings, "llm_provider", "openai") - - if provider == "anthropic": - return AnthropicModel( - model_name=getattr(settings, "anthropic_model", "claude-3-5-sonnet-20241022"), - api_key=os.getenv("ANTHROPIC_API_KEY"), - ) - else: - return OpenAIModel( - model_name=getattr(settings, "openai_model", "gpt-4o"), - api_key=os.getenv("OPENAI_API_KEY"), - ) - - -class JudgeHandler: - """ - Handles evidence assessment using an LLM with structured output. - - Uses PydanticAI to ensure responses match the JudgeAssessment schema. - """ - - def __init__(self, model=None): - """ - Initialize the JudgeHandler. - - Args: - model: Optional PydanticAI model. If None, uses config default. - """ - self.model = model or get_model() - self.agent = Agent( - model=self.model, - result_type=JudgeAssessment, - system_prompt=SYSTEM_PROMPT, - retries=3, - ) - - async def assess( - self, - question: str, - evidence: List[Evidence], - ) -> JudgeAssessment: - """ - Assess evidence and determine if it's sufficient. 
- - Args: - question: The user's research question - evidence: List of Evidence objects from search - - Returns: - JudgeAssessment with evaluation results - - Raises: - JudgeError: If assessment fails after retries - """ - logger.info( - "Starting evidence assessment", - question=question[:100], - evidence_count=len(evidence), - ) - - # Format the prompt based on whether we have evidence - if evidence: - user_prompt = format_user_prompt(question, evidence) - else: - user_prompt = format_empty_evidence_prompt(question) - - try: - # Run the agent with structured output - result = await self.agent.run(user_prompt) - assessment = result.data - - logger.info( - "Assessment complete", - sufficient=assessment.sufficient, - recommendation=assessment.recommendation, - confidence=assessment.confidence, - ) - - return assessment - - except Exception as e: - logger.error("Assessment failed", error=str(e)) - # Return a safe default assessment on failure - return self._create_fallback_assessment(question, str(e)) - - def _create_fallback_assessment( - self, - question: str, - error: str, - ) -> JudgeAssessment: - """ - Create a fallback assessment when LLM fails. - - Args: - question: The original question - error: The error message - - Returns: - Safe fallback JudgeAssessment - """ - return JudgeAssessment( - details=AssessmentDetails( - mechanism_score=0, - mechanism_reasoning="Assessment failed due to LLM error", - clinical_evidence_score=0, - clinical_reasoning="Assessment failed due to LLM error", - drug_candidates=[], - key_findings=[], - ), - sufficient=False, - confidence=0.0, - recommendation="continue", - next_search_queries=[ - f"{question} mechanism", - f"{question} clinical trials", - f"{question} drug candidates", - ], - reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.", - ) - - -class HFInferenceJudgeHandler: - """ - JudgeHandler using HuggingFace Inference API for FREE LLM calls. - - This is the DEFAULT for demo mode - provides real AI analysis without - requiring users to have OpenAI/Anthropic API keys. - - Model Fallback Chain (handles gated models and rate limits): - 1. meta-llama/Llama-3.1-8B-Instruct (best quality, requires HF_TOKEN) - 2. mistralai/Mistral-7B-Instruct-v0.3 (good quality, may require token) - 3. HuggingFaceH4/zephyr-7b-beta (ungated, always works) - - Rate Limit Handling: - - Exponential backoff with 3 retries - - Falls back to next model on persistent 429/503 errors - """ - - # Model fallback chain: gated (best) → ungated (fallback) - FALLBACK_MODELS = [ - "meta-llama/Llama-3.1-8B-Instruct", # Best quality (gated) - "mistralai/Mistral-7B-Instruct-v0.3", # Good quality - "HuggingFaceH4/zephyr-7b-beta", # Ungated fallback - ] - - def __init__(self, model_id: str | None = None) -> None: - """ - Initialize with HF Inference client. - - Args: - model_id: Optional specific model ID. If None, uses FALLBACK_MODELS chain. - """ - self.model_id = model_id - # Will automatically use HF_TOKEN from env if available - self.client = InferenceClient() - self.call_count = 0 - self.last_question: str | None = None - self.last_evidence: list[Evidence] | None = None - - def _extract_json(self, text: str) -> dict[str, Any] | None: - """ - Robust JSON extraction that handles markdown blocks and nested braces. 
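-
-        Returns None when no parseable JSON object can be found in the text.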
- """ - text = text.strip() - - # Remove markdown code blocks if present (with bounds checking) - if "```json" in text: - parts = text.split("```json", 1) - if len(parts) > 1: - inner_parts = parts[1].split("```", 1) - text = inner_parts[0] - elif "```" in text: - parts = text.split("```", 1) - if len(parts) > 1: - inner_parts = parts[1].split("```", 1) - text = inner_parts[0] - - text = text.strip() - - # Find first '{' - start_idx = text.find("{") - if start_idx == -1: - return None - - # Stack-based parsing ignoring chars in strings - count = 0 - in_string = False - escape = False - - for i, char in enumerate(text[start_idx:], start=start_idx): - if in_string: - if escape: - escape = False - elif char == "\\": - escape = True - elif char == '"': - in_string = False - elif char == '"': - in_string = True - elif char == "{": - count += 1 - elif char == "}": - count -= 1 - if count == 0: - try: - result = json.loads(text[start_idx : i + 1]) - if isinstance(result, dict): - return result - return None - except json.JSONDecodeError: - return None - - return None - - @retry( - stop=stop_after_attempt(3), - wait=wait_exponential(multiplier=1, min=1, max=4), - retry=retry_if_exception_type(Exception), - reraise=True, - ) - async def _call_with_retry(self, model: str, prompt: str, question: str) -> JudgeAssessment: - """Make API call with retry logic using chat_completion.""" - loop = asyncio.get_running_loop() - - # Build messages for chat_completion (model-agnostic) - messages = [ - { - "role": "system", - "content": f"""{SYSTEM_PROMPT} - -IMPORTANT: Respond with ONLY valid JSON matching this schema: -{{ - "details": {{ - "mechanism_score": , - "mechanism_reasoning": "", - "clinical_evidence_score": , - "clinical_reasoning": "", - "drug_candidates": ["", ...], - "key_findings": ["", ...] - }}, - "sufficient": , - "confidence": , - "recommendation": "continue" | "synthesize", - "next_search_queries": ["", ...], - "reasoning": "" -}}""", - }, - {"role": "user", "content": prompt}, - ] - - # Use chat_completion (conversational task - supported by all models) - response = await loop.run_in_executor( - None, - lambda: self.client.chat_completion( - messages=messages, - model=model, - max_tokens=1024, - temperature=0.1, - ), - ) - - # Extract content from response - content = response.choices[0].message.content - if not content: - raise ValueError("Empty response from model") - - # Extract and parse JSON - json_data = self._extract_json(content) - if not json_data: - raise ValueError("No valid JSON found in response") - - return JudgeAssessment(**json_data) - - async def assess( - self, - question: str, - evidence: list[Evidence], - ) -> JudgeAssessment: - """ - Assess evidence using HuggingFace Inference API. - Attempts models in order until one succeeds. 
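-
-        Falls back to a conservative default assessment if every model in the
-        fallback chain fails.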
- """ - self.call_count += 1 - self.last_question = question - self.last_evidence = evidence - - # Format the user prompt - if evidence: - user_prompt = format_user_prompt(question, evidence) - else: - user_prompt = format_empty_evidence_prompt(question) - - models_to_try: list[str] = [self.model_id] if self.model_id else self.FALLBACK_MODELS - last_error: Exception | None = None - - for model in models_to_try: - try: - return await self._call_with_retry(model, user_prompt, question) - except Exception as e: - logger.warning("Model failed", model=model, error=str(e)) - last_error = e - continue - - # All models failed - logger.error("All HF models failed", error=str(last_error)) - return self._create_fallback_assessment(question, str(last_error)) - - def _create_fallback_assessment( - self, - question: str, - error: str, - ) -> JudgeAssessment: - """Create a fallback assessment when inference fails.""" - return JudgeAssessment( - details=AssessmentDetails( - mechanism_score=0, - mechanism_reasoning=f"Assessment failed: {error}", - clinical_evidence_score=0, - clinical_reasoning=f"Assessment failed: {error}", - drug_candidates=[], - key_findings=[], - ), - sufficient=False, - confidence=0.0, - recommendation="continue", - next_search_queries=[ - f"{question} mechanism", - f"{question} clinical trials", - f"{question} drug candidates", - ], - reasoning=f"HF Inference failed: {error}. Recommend retrying.", - ) - - -class MockJudgeHandler: - """ - Mock JudgeHandler for UNIT TESTING ONLY. - - NOT for production use. Use HFInferenceJudgeHandler for demo mode. - """ - - def __init__(self, mock_response: JudgeAssessment | None = None): - """Initialize with optional mock response for testing.""" - self.mock_response = mock_response - self.call_count = 0 - self.last_question = None - self.last_evidence = None - - async def assess( - self, - question: str, - evidence: List[Evidence], - ) -> JudgeAssessment: - """Return the mock response (for testing only).""" - self.call_count += 1 - self.last_question = question - self.last_evidence = evidence - - if self.mock_response: - return self.mock_response - - # Default mock response for tests - return JudgeAssessment( - details=AssessmentDetails( - mechanism_score=7, - mechanism_reasoning="Mock assessment for testing", - clinical_evidence_score=6, - clinical_reasoning="Mock assessment for testing", - drug_candidates=["TestDrug"], - key_findings=["Test finding"], - ), - sufficient=len(evidence) >= 3, - confidence=0.75, - recommendation="synthesize" if len(evidence) >= 3 else "continue", - next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [], - reasoning="Mock assessment for unit testing only", - ) -``` - ---- - -## 5. 
TDD Workflow - -### Test File: `tests/unit/agent_factory/test_judges.py` - -```python -"""Unit tests for JudgeHandler.""" -import pytest -from unittest.mock import AsyncMock, MagicMock, patch - -from src.utils.models import ( - Evidence, - Citation, - JudgeAssessment, - AssessmentDetails, -) - - -class TestJudgeHandler: - """Tests for JudgeHandler.""" - - @pytest.mark.asyncio - async def test_assess_returns_assessment(self): - """JudgeHandler should return JudgeAssessment from LLM.""" - from src.agent_factory.judges import JudgeHandler - - # Create mock assessment - mock_assessment = JudgeAssessment( - details=AssessmentDetails( - mechanism_score=8, - mechanism_reasoning="Strong mechanistic evidence", - clinical_evidence_score=7, - clinical_reasoning="Good clinical support", - drug_candidates=["Metformin"], - key_findings=["Neuroprotective effects"], - ), - sufficient=True, - confidence=0.85, - recommendation="synthesize", - next_search_queries=[], - reasoning="Evidence is sufficient for synthesis", - ) - - # Mock the PydanticAI agent - mock_result = MagicMock() - mock_result.data = mock_assessment - - with patch("src.agent_factory.judges.Agent") as mock_agent_class: - mock_agent = AsyncMock() - mock_agent.run = AsyncMock(return_value=mock_result) - mock_agent_class.return_value = mock_agent - - handler = JudgeHandler() - # Replace the agent with our mock - handler.agent = mock_agent - - evidence = [ - Evidence( - content="Metformin shows neuroprotective properties...", - citation=Citation( - source="pubmed", - title="Metformin in AD", - url="https://pubmed.ncbi.nlm.nih.gov/12345/", - date="2024-01-01", - ), - ) - ] - - result = await handler.assess("metformin alzheimer", evidence) - - assert result.sufficient is True - assert result.recommendation == "synthesize" - assert result.confidence == 0.85 - assert "Metformin" in result.details.drug_candidates - - @pytest.mark.asyncio - async def test_assess_empty_evidence(self): - """JudgeHandler should handle empty evidence gracefully.""" - from src.agent_factory.judges import JudgeHandler - - mock_assessment = JudgeAssessment( - details=AssessmentDetails( - mechanism_score=0, - mechanism_reasoning="No evidence to assess", - clinical_evidence_score=0, - clinical_reasoning="No evidence to assess", - drug_candidates=[], - key_findings=[], - ), - sufficient=False, - confidence=0.0, - recommendation="continue", - next_search_queries=["metformin alzheimer mechanism"], - reasoning="No evidence found, need to search more", - ) - - mock_result = MagicMock() - mock_result.data = mock_assessment - - with patch("src.agent_factory.judges.Agent") as mock_agent_class: - mock_agent = AsyncMock() - mock_agent.run = AsyncMock(return_value=mock_result) - mock_agent_class.return_value = mock_agent - - handler = JudgeHandler() - handler.agent = mock_agent - - result = await handler.assess("metformin alzheimer", []) - - assert result.sufficient is False - assert result.recommendation == "continue" - assert len(result.next_search_queries) > 0 - - @pytest.mark.asyncio - async def test_assess_handles_llm_failure(self): - """JudgeHandler should return fallback on LLM failure.""" - from src.agent_factory.judges import JudgeHandler - - with patch("src.agent_factory.judges.Agent") as mock_agent_class: - mock_agent = AsyncMock() - mock_agent.run = AsyncMock(side_effect=Exception("API Error")) - mock_agent_class.return_value = mock_agent - - handler = JudgeHandler() - handler.agent = mock_agent - - evidence = [ - Evidence( - content="Some content", - citation=Citation( - 
source="pubmed", - title="Title", - url="url", - date="2024", - ), - ) - ] - - result = await handler.assess("test question", evidence) - - # Should return fallback, not raise - assert result.sufficient is False - assert result.recommendation == "continue" - assert "failed" in result.reasoning.lower() - - -class TestHFInferenceJudgeHandler: - """Tests for HFInferenceJudgeHandler.""" - - @pytest.mark.asyncio - async def test_extract_json_raw(self): - """Should extract raw JSON.""" - from src.agent_factory.judges import HFInferenceJudgeHandler - - handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler) - # Bypass __init__ for unit testing extraction - - result = handler._extract_json('{"key": "value"}') - assert result == {"key": "value"} - - @pytest.mark.asyncio - async def test_extract_json_markdown_block(self): - """Should extract JSON from markdown code block.""" - from src.agent_factory.judges import HFInferenceJudgeHandler - - handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler) - - response = '''Here is the assessment: -```json -{"key": "value", "nested": {"inner": 1}} -``` -''' - result = handler._extract_json(response) - assert result == {"key": "value", "nested": {"inner": 1}} - - @pytest.mark.asyncio - async def test_extract_json_with_preamble(self): - """Should extract JSON with preamble text.""" - from src.agent_factory.judges import HFInferenceJudgeHandler - - handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler) - - response = 'Here is your JSON response:\n{"sufficient": true, "confidence": 0.85}' - result = handler._extract_json(response) - assert result == {"sufficient": True, "confidence": 0.85} - - @pytest.mark.asyncio - async def test_extract_json_nested_braces(self): - """Should handle nested braces correctly.""" - from src.agent_factory.judges import HFInferenceJudgeHandler - - handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler) - - response = '{"details": {"mechanism_score": 8}, "reasoning": "test"}' - result = handler._extract_json(response) - assert result["details"]["mechanism_score"] == 8 - - @pytest.mark.asyncio - async def test_hf_handler_uses_fallback_models(self): - """HFInferenceJudgeHandler should have fallback model chain.""" - from src.agent_factory.judges import HFInferenceJudgeHandler - - # Check class has fallback models defined - assert len(HFInferenceJudgeHandler.FALLBACK_MODELS) >= 3 - assert "zephyr-7b-beta" in HFInferenceJudgeHandler.FALLBACK_MODELS[-1] - - @pytest.mark.asyncio - async def test_hf_handler_fallback_on_auth_error(self): - """Should fall back to ungated model on auth error.""" - from src.agent_factory.judges import HFInferenceJudgeHandler - from unittest.mock import MagicMock, patch - - with patch("src.agent_factory.judges.InferenceClient") as mock_client_class: - # First call raises 403, second succeeds - mock_client = MagicMock() - mock_client.chat_completion.side_effect = [ - Exception("403 Forbidden: gated model"), - MagicMock(choices=[MagicMock(message=MagicMock(content='{"sufficient": false}'))]) - ] - mock_client_class.return_value = mock_client - - handler = HFInferenceJudgeHandler() - # Manually trigger fallback test - assert handler._try_fallback_model() is True - assert handler.model_id != "meta-llama/Llama-3.1-8B-Instruct" - - -class TestMockJudgeHandler: - """Tests for MockJudgeHandler (UNIT TESTING ONLY).""" - - @pytest.mark.asyncio - async def test_mock_handler_returns_default(self): - """MockJudgeHandler should return default assessment.""" - from 
src.agent_factory.judges import MockJudgeHandler - - handler = MockJudgeHandler() - - evidence = [ - Evidence( - content="Content 1", - citation=Citation(source="pubmed", title="T1", url="u1", date="2024"), - ), - Evidence( - content="Content 2", - citation=Citation(source="web", title="T2", url="u2", date="2024"), - ), - ] - - result = await handler.assess("test", evidence) - - assert handler.call_count == 1 - assert handler.last_question == "test" - assert len(handler.last_evidence) == 2 - assert result.details.mechanism_score == 7 - - @pytest.mark.asyncio - async def test_mock_handler_custom_response(self): - """MockJudgeHandler should return custom response when provided.""" - from src.agent_factory.judges import MockJudgeHandler - - custom_assessment = JudgeAssessment( - details=AssessmentDetails( - mechanism_score=10, - mechanism_reasoning="Custom reasoning", - clinical_evidence_score=10, - clinical_reasoning="Custom clinical", - drug_candidates=["CustomDrug"], - key_findings=["Custom finding"], - ), - sufficient=True, - confidence=1.0, - recommendation="synthesize", - next_search_queries=[], - reasoning="Custom assessment", - ) - - handler = MockJudgeHandler(mock_response=custom_assessment) - result = await handler.assess("test", []) - - assert result.details.mechanism_score == 10 - assert result.details.drug_candidates == ["CustomDrug"] - - @pytest.mark.asyncio - async def test_mock_handler_insufficient_with_few_evidence(self): - """MockJudgeHandler should recommend continue with < 3 evidence.""" - from src.agent_factory.judges import MockJudgeHandler - - handler = MockJudgeHandler() - - # Only 2 pieces of evidence - evidence = [ - Evidence( - content="Content", - citation=Citation(source="pubmed", title="T", url="u", date="2024"), - ), - Evidence( - content="Content 2", - citation=Citation(source="web", title="T2", url="u2", date="2024"), - ), - ] - - result = await handler.assess("test", evidence) - - assert result.sufficient is False - assert result.recommendation == "continue" - assert len(result.next_search_queries) > 0 -``` - ---- - -## 6. Dependencies - -Add to `pyproject.toml`: - -```toml -[project] -dependencies = [ - # ... existing deps ... - "pydantic-ai>=0.0.16", - "openai>=1.0.0", - "anthropic>=0.18.0", - "huggingface-hub>=0.20.0", # For HFInferenceJudgeHandler (FREE LLM) -] -``` - -**Note**: `huggingface-hub` is required for the free tier to work. It: -- Provides `InferenceClient` for API calls -- Auto-reads `HF_TOKEN` from environment (optional, for gated models) -- Works without any token for ungated models like `zephyr-7b-beta` - ---- - -## 7. Configuration (`src/utils/config.py`) - -Add LLM configuration: - -```python -"""Add to src/utils/config.py.""" -from pydantic_settings import BaseSettings -from typing import Literal - - -class Settings(BaseSettings): - """Application settings.""" - - # LLM Configuration - llm_provider: Literal["openai", "anthropic"] = "openai" - openai_model: str = "gpt-4o" - anthropic_model: str = "claude-3-5-sonnet-20241022" - - # API Keys (loaded from environment) - openai_api_key: str | None = None - anthropic_api_key: str | None = None - ncbi_api_key: str | None = None - - class Config: - env_file = ".env" - env_file_encoding = "utf-8" - - -settings = Settings() -``` - ---- - -## 8. 
Implementation Checklist - -- [ ] Add `AssessmentDetails` and `JudgeAssessment` models to `src/utils/models.py` -- [ ] Create `src/prompts/__init__.py` (empty, for package) -- [ ] Create `src/prompts/judge.py` with prompt templates -- [ ] Create `src/agent_factory/__init__.py` with exports -- [ ] Implement `src/agent_factory/judges.py` with JudgeHandler -- [ ] Update `src/utils/config.py` with LLM settings -- [ ] Create `tests/unit/agent_factory/__init__.py` -- [ ] Write tests in `tests/unit/agent_factory/test_judges.py` -- [ ] Run `uv run pytest tests/unit/agent_factory/ -v` — **ALL TESTS MUST PASS** -- [ ] Commit: `git commit -m "feat: phase 3 judge slice complete"` - ---- - -## 9. Definition of Done - -Phase 3 is **COMPLETE** when: - -1. All unit tests pass: `uv run pytest tests/unit/agent_factory/ -v` -2. `JudgeHandler` can assess evidence and return structured output -3. Graceful degradation: if LLM fails, returns safe fallback -4. MockJudgeHandler works for testing without API calls -5. Can run this in Python REPL: - -```python -import asyncio -import os -from src.utils.models import Evidence, Citation -from src.agent_factory.judges import JudgeHandler, MockJudgeHandler - -# Test with mock (no API key needed) -async def test_mock(): - handler = MockJudgeHandler() - evidence = [ - Evidence( - content="Metformin shows neuroprotective effects in AD models", - citation=Citation( - source="pubmed", - title="Metformin and Alzheimer's", - url="https://pubmed.ncbi.nlm.nih.gov/12345/", - date="2024-01-01", - ), - ), - ] - result = await handler.assess("metformin alzheimer", evidence) - print(f"Sufficient: {result.sufficient}") - print(f"Recommendation: {result.recommendation}") - print(f"Drug candidates: {result.details.drug_candidates}") - -asyncio.run(test_mock()) - -# Test with real LLM (requires API key) -async def test_real(): - os.environ["OPENAI_API_KEY"] = "your-key-here" # Or set in .env - handler = JudgeHandler() - evidence = [ - Evidence( - content="Metformin shows neuroprotective effects in AD models...", - citation=Citation( - source="pubmed", - title="Metformin and Alzheimer's", - url="https://pubmed.ncbi.nlm.nih.gov/12345/", - date="2024-01-01", - ), - ), - ] - result = await handler.assess("metformin alzheimer", evidence) - print(f"Sufficient: {result.sufficient}") - print(f"Confidence: {result.confidence}") - print(f"Reasoning: {result.reasoning}") - -# asyncio.run(test_real()) # Uncomment with valid API key -``` - -**Proceed to Phase 4 ONLY after all checkboxes are complete.** diff --git a/docs/implementation/04_phase_ui.md b/docs/implementation/04_phase_ui.md deleted file mode 100644 index 90767d7515a40a9958287e9a171f6adf2bb702b6..0000000000000000000000000000000000000000 --- a/docs/implementation/04_phase_ui.md +++ /dev/null @@ -1,1104 +0,0 @@ -# Phase 4 Implementation Spec: Orchestrator & UI - -**Goal**: Connect the Brain and the Body, then give it a Face. -**Philosophy**: "Streaming is Trust." -**Prerequisite**: Phase 3 complete (all judge tests passing) - ---- - -## 1. The Slice Definition - -This slice connects: -1. **Orchestrator**: The state machine (While loop) calling Search -> Judge. -2. **UI**: Gradio interface that visualizes the loop. - -**Files to Create/Modify**: -- `src/orchestrator.py` - Agent loop logic -- `src/app.py` - Gradio UI -- `tests/unit/test_orchestrator.py` - Unit tests -- `Dockerfile` - Container for deployment -- `README.md` - Usage instructions (update) - ---- - -## 2. 
Agent Events (`src/utils/models.py`) - -Add event types for streaming UI updates: - -```python -"""Add to src/utils/models.py (after JudgeAssessment models).""" -from pydantic import BaseModel, Field -from typing import Literal, Any -from datetime import datetime - - -class AgentEvent(BaseModel): - """Event emitted by the orchestrator for UI streaming.""" - - type: Literal[ - "started", - "searching", - "search_complete", - "judging", - "judge_complete", - "looping", - "synthesizing", - "complete", - "error", - ] - message: str - data: Any = None - timestamp: datetime = Field(default_factory=datetime.now) - iteration: int = 0 - - def to_markdown(self) -> str: - """Format event as markdown for chat display.""" - icons = { - "started": "🚀", - "searching": "🔍", - "search_complete": "📚", - "judging": "🧠", - "judge_complete": "✅", - "looping": "🔄", - "synthesizing": "📝", - "complete": "🎉", - "error": "❌", - } - icon = icons.get(self.type, "•") - return f"{icon} **{self.type.upper()}**: {self.message}" - - -class OrchestratorConfig(BaseModel): - """Configuration for the orchestrator.""" - - max_iterations: int = Field(default=5, ge=1, le=10) - max_results_per_tool: int = Field(default=10, ge=1, le=50) - search_timeout: float = Field(default=30.0, ge=5.0, le=120.0) -``` - ---- - -## 3. The Orchestrator (`src/orchestrator.py`) - -This is the "Agent" logic — the while loop that drives search and judgment. - -```python -"""Orchestrator - the agent loop connecting Search and Judge.""" -import asyncio -from typing import AsyncGenerator, List, Protocol -import structlog - -from src.utils.models import ( - Evidence, - SearchResult, - JudgeAssessment, - AgentEvent, - OrchestratorConfig, -) - -logger = structlog.get_logger() - - -class SearchHandlerProtocol(Protocol): - """Protocol for search handler.""" - async def execute(self, query: str, max_results_per_tool: int = 10) -> SearchResult: - ... - - -class JudgeHandlerProtocol(Protocol): - """Protocol for judge handler.""" - async def assess(self, question: str, evidence: List[Evidence]) -> JudgeAssessment: - ... - - -class Orchestrator: - """ - The agent orchestrator - runs the Search -> Judge -> Loop cycle. - - This is a generator-based design that yields events for real-time UI updates. - """ - - def __init__( - self, - search_handler: SearchHandlerProtocol, - judge_handler: JudgeHandlerProtocol, - config: OrchestratorConfig | None = None, - ): - """ - Initialize the orchestrator. - - Args: - search_handler: Handler for executing searches - judge_handler: Handler for assessing evidence - config: Optional configuration (uses defaults if not provided) - """ - self.search = search_handler - self.judge = judge_handler - self.config = config or OrchestratorConfig() - self.history: List[dict] = [] - - async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]: - """ - Run the agent loop for a query. - - Yields AgentEvent objects for each step, allowing real-time UI updates. 
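-        A typical iteration yields: searching -> search_complete -> judging ->
-        judge_complete, followed by either synthesizing and complete (when the
-        judge deems the evidence sufficient) or looping (with refined queries).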
- - Args: - query: The user's research question - - Yields: - AgentEvent objects for each step of the process - """ - logger.info("Starting orchestrator", query=query) - - yield AgentEvent( - type="started", - message=f"Starting research for: {query}", - iteration=0, - ) - - all_evidence: List[Evidence] = [] - current_queries = [query] - iteration = 0 - - while iteration < self.config.max_iterations: - iteration += 1 - logger.info("Iteration", iteration=iteration, queries=current_queries) - - # === SEARCH PHASE === - yield AgentEvent( - type="searching", - message=f"Searching for: {', '.join(current_queries[:3])}...", - iteration=iteration, - ) - - try: - # Execute searches for all current queries - search_tasks = [ - self.search.execute(q, self.config.max_results_per_tool) - for q in current_queries[:3] # Limit to 3 queries per iteration - ] - search_results = await asyncio.gather(*search_tasks, return_exceptions=True) - - # Collect evidence from successful searches - new_evidence: List[Evidence] = [] - errors: List[str] = [] - - for q, result in zip(current_queries[:3], search_results): - if isinstance(result, Exception): - errors.append(f"Search for '{q}' failed: {str(result)}") - else: - new_evidence.extend(result.evidence) - errors.extend(result.errors) - - # Deduplicate evidence by URL - seen_urls = {e.citation.url for e in all_evidence} - unique_new = [e for e in new_evidence if e.citation.url not in seen_urls] - all_evidence.extend(unique_new) - - yield AgentEvent( - type="search_complete", - message=f"Found {len(unique_new)} new sources ({len(all_evidence)} total)", - data={"new_count": len(unique_new), "total_count": len(all_evidence)}, - iteration=iteration, - ) - - if errors: - logger.warning("Search errors", errors=errors) - - except Exception as e: - logger.error("Search phase failed", error=str(e)) - yield AgentEvent( - type="error", - message=f"Search failed: {str(e)}", - iteration=iteration, - ) - continue - - # === JUDGE PHASE === - yield AgentEvent( - type="judging", - message=f"Evaluating {len(all_evidence)} sources...", - iteration=iteration, - ) - - try: - assessment = await self.judge.assess(query, all_evidence) - - yield AgentEvent( - type="judge_complete", - message=f"Assessment: {assessment.recommendation} (confidence: {assessment.confidence:.0%})", - data={ - "sufficient": assessment.sufficient, - "confidence": assessment.confidence, - "mechanism_score": assessment.details.mechanism_score, - "clinical_score": assessment.details.clinical_evidence_score, - }, - iteration=iteration, - ) - - # Record this iteration in history - self.history.append({ - "iteration": iteration, - "queries": current_queries, - "evidence_count": len(all_evidence), - "assessment": assessment.model_dump(), - }) - - # === DECISION PHASE === - if assessment.sufficient and assessment.recommendation == "synthesize": - yield AgentEvent( - type="synthesizing", - message="Evidence sufficient! 
Preparing synthesis...", - iteration=iteration, - ) - - # Generate final response - final_response = self._generate_synthesis(query, all_evidence, assessment) - - yield AgentEvent( - type="complete", - message=final_response, - data={ - "evidence_count": len(all_evidence), - "iterations": iteration, - "drug_candidates": assessment.details.drug_candidates, - "key_findings": assessment.details.key_findings, - }, - iteration=iteration, - ) - return - - else: - # Need more evidence - prepare next queries - current_queries = assessment.next_search_queries or [ - f"{query} mechanism of action", - f"{query} clinical evidence", - ] - - yield AgentEvent( - type="looping", - message=f"Need more evidence. Next searches: {', '.join(current_queries[:2])}...", - data={"next_queries": current_queries}, - iteration=iteration, - ) - - except Exception as e: - logger.error("Judge phase failed", error=str(e)) - yield AgentEvent( - type="error", - message=f"Assessment failed: {str(e)}", - iteration=iteration, - ) - continue - - # Max iterations reached - yield AgentEvent( - type="complete", - message=self._generate_partial_synthesis(query, all_evidence), - data={ - "evidence_count": len(all_evidence), - "iterations": iteration, - "max_reached": True, - }, - iteration=iteration, - ) - - def _generate_synthesis( - self, - query: str, - evidence: List[Evidence], - assessment: JudgeAssessment, - ) -> str: - """ - Generate the final synthesis response. - - Args: - query: The original question - evidence: All collected evidence - assessment: The final assessment - - Returns: - Formatted synthesis as markdown - """ - drug_list = "\n".join([f"- **{d}**" for d in assessment.details.drug_candidates]) or "- No specific candidates identified" - findings_list = "\n".join([f"- {f}" for f in assessment.details.key_findings]) or "- See evidence below" - - citations = "\n".join([ - f"{i+1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()}, {e.citation.date})" - for i, e in enumerate(evidence[:10]) # Limit to 10 citations - ]) - - return f"""## Drug Repurposing Analysis - -### Question -{query} - -### Drug Candidates -{drug_list} - -### Key Findings -{findings_list} - -### Assessment -- **Mechanism Score**: {assessment.details.mechanism_score}/10 -- **Clinical Evidence Score**: {assessment.details.clinical_evidence_score}/10 -- **Confidence**: {assessment.confidence:.0%} - -### Reasoning -{assessment.reasoning} - -### Citations ({len(evidence)} sources) -{citations} - ---- -*Analysis based on {len(evidence)} sources across {len(self.history)} iterations.* -""" - - def _generate_partial_synthesis( - self, - query: str, - evidence: List[Evidence], - ) -> str: - """ - Generate a partial synthesis when max iterations reached. - - Args: - query: The original question - evidence: All collected evidence - - Returns: - Formatted partial synthesis as markdown - """ - citations = "\n".join([ - f"{i+1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})" - for i, e in enumerate(evidence[:10]) - ]) - - return f"""## Partial Analysis (Max Iterations Reached) - -### Question -{query} - -### Status -Maximum search iterations reached. The evidence gathered may be incomplete. - -### Evidence Collected -Found {len(evidence)} sources. Consider refining your query for more specific results. - -### Citations -{citations} - ---- -*Consider searching with more specific terms or drug names.* -""" -``` - ---- - -## 4. The Gradio UI (`src/app.py`) - -Using Gradio 5 generator pattern for real-time streaming. 
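-
- Note that `gr.ChatInterface` treats each `yield` from the async generator as a replacement for the in-progress assistant message rather than an append, which is why the code below accumulates `response_parts` and yields the joined progress log until the final `complete` event supplies the full answer.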
- -```python -"""Gradio UI for DeepCritical agent.""" -import asyncio -import gradio as gr -from typing import AsyncGenerator - -from src.orchestrator import Orchestrator -from src.tools.pubmed import PubMedTool -from src.tools.clinicaltrials import ClinicalTrialsTool -from src.tools.biorxiv import BioRxivTool -from src.tools.search_handler import SearchHandler -from src.agent_factory.judges import JudgeHandler, HFInferenceJudgeHandler -from src.utils.models import OrchestratorConfig, AgentEvent - - -def create_orchestrator( - user_api_key: str | None = None, - api_provider: str = "openai", -) -> tuple[Orchestrator, str]: - """ - Create an orchestrator instance. - - Args: - user_api_key: Optional user-provided API key (BYOK) - api_provider: API provider ("openai" or "anthropic") - - Returns: - Tuple of (Configured Orchestrator instance, backend_name) - - Priority: - 1. User-provided API key → JudgeHandler (OpenAI/Anthropic) - 2. Environment API key → JudgeHandler (OpenAI/Anthropic) - 3. No key → HFInferenceJudgeHandler (FREE, automatic fallback chain) - - HF Inference Fallback Chain: - 1. Llama 3.1 8B (requires HF_TOKEN for gated model) - 2. Mistral 7B (may require token) - 3. Zephyr 7B (ungated, always works) - """ - import os - - # Create search tools - search_handler = SearchHandler( - tools=[PubMedTool(), ClinicalTrialsTool(), BioRxivTool()], - timeout=30.0, - ) - - # Determine which judge to use - has_env_key = bool(os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY")) - has_user_key = bool(user_api_key) - has_hf_token = bool(os.getenv("HF_TOKEN")) - - if has_user_key: - # User provided their own key - judge_handler = JudgeHandler(model=None) - backend_name = f"your {api_provider.upper()} API key" - elif has_env_key: - # Environment has API key configured - judge_handler = JudgeHandler(model=None) - backend_name = "configured API key" - else: - # Use FREE HuggingFace Inference with automatic fallback - judge_handler = HFInferenceJudgeHandler() - if has_hf_token: - backend_name = "HuggingFace Inference (Llama 3.1)" - else: - backend_name = "HuggingFace Inference (free tier)" - - # Create orchestrator - config = OrchestratorConfig( - max_iterations=5, - max_results_per_tool=10, - ) - - return Orchestrator( - search_handler=search_handler, - judge_handler=judge_handler, - config=config, - ), backend_name - - -async def research_agent( - message: str, - history: list[dict], - api_key: str = "", - api_provider: str = "openai", -) -> AsyncGenerator[str, None]: - """ - Gradio chat function that runs the research agent. - - Args: - message: User's research question - history: Chat history (Gradio format) - api_key: Optional user-provided API key (BYOK) - api_provider: API provider ("openai" or "anthropic") - - Yields: - Markdown-formatted responses for streaming - """ - if not message.strip(): - yield "Please enter a research question." 
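-        # Nothing to research; end the generator before any orchestrator work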
- return - - import os - - # Clean user-provided API key - user_api_key = api_key.strip() if api_key else None - - # Create orchestrator with appropriate judge - orchestrator, backend_name = create_orchestrator( - user_api_key=user_api_key, - api_provider=api_provider, - ) - - # Determine icon based on backend - has_hf_token = bool(os.getenv("HF_TOKEN")) - if "HuggingFace" in backend_name: - icon = "🤗" - extra_note = ( - "\n*For premium analysis, enter an OpenAI or Anthropic API key.*" - if not has_hf_token else "" - ) - else: - icon = "🔑" - extra_note = "" - - # Inform user which backend is being used - yield f"{icon} **Using {backend_name}**{extra_note}\n\n" - - # Run the agent and stream events - response_parts = [] - - try: - async for event in orchestrator.run(message): - # Format event as markdown - event_md = event.to_markdown() - response_parts.append(event_md) - - # If complete, show full response - if event.type == "complete": - yield event.message - else: - # Show progress - yield "\n\n".join(response_parts) - - except Exception as e: - yield f"❌ **Error**: {str(e)}" - - -def create_demo() -> gr.Blocks: - """ - Create the Gradio demo interface. - - Returns: - Configured Gradio Blocks interface - """ - with gr.Blocks( - title="DeepCritical - Drug Repurposing Research Agent", - theme=gr.themes.Soft(), - ) as demo: - gr.Markdown(""" - # 🧬 DeepCritical - ## AI-Powered Drug Repurposing Research Agent - - Ask questions about potential drug repurposing opportunities. - The agent will search PubMed and the web, evaluate evidence, and provide recommendations. - - **Example questions:** - - "What drugs could be repurposed for Alzheimer's disease?" - - "Is metformin effective for cancer treatment?" - - "What existing medications show promise for Long COVID?" - """) - - # Note: additional_inputs render in an accordion below the chat input - gr.ChatInterface( - fn=research_agent, - examples=[ - [ - "What drugs could be repurposed for Alzheimer's disease?", - "simple", - "", - "openai", - ], - [ - "Is metformin effective for treating cancer?", - "simple", - "", - "openai", - ], - ], - additional_inputs=[ - gr.Radio( - choices=["simple", "magentic"], - value="simple", - label="Orchestrator Mode", - info="Simple: Linear | Magentic: Multi-Agent (OpenAI)", - ), - gr.Textbox( - label="API Key (Optional - Bring Your Own Key)", - placeholder="sk-... or sk-ant-...", - type="password", - info="Enter your own API key for full AI analysis. Never stored.", - ), - gr.Radio( - choices=["openai", "anthropic"], - value="openai", - label="API Provider", - info="Select the provider for your API key", - ), - ], - ) - - gr.Markdown(""" - --- - **Note**: This is a research tool and should not be used for medical decisions. - Always consult healthcare professionals for medical advice. - - Built with 🤖 PydanticAI + 🔬 PubMed + 🦆 DuckDuckGo - """) - - return demo - - -def main(): - """Run the Gradio app.""" - demo = create_demo() - demo.launch( - server_name="0.0.0.0", - server_port=7860, - share=False, - ) - - -if __name__ == "__main__": - main() -``` - ---- - -## 5. 
TDD Workflow - -### Test File: `tests/unit/test_orchestrator.py` - -```python -"""Unit tests for Orchestrator.""" -import pytest -from unittest.mock import AsyncMock, MagicMock - -from src.utils.models import ( - Evidence, - Citation, - SearchResult, - JudgeAssessment, - AssessmentDetails, - OrchestratorConfig, -) - - -class TestOrchestrator: - """Tests for Orchestrator.""" - - @pytest.fixture - def mock_search_handler(self): - """Create a mock search handler.""" - handler = AsyncMock() - handler.execute = AsyncMock(return_value=SearchResult( - query="test", - evidence=[ - Evidence( - content="Test content", - citation=Citation( - source="pubmed", - title="Test Title", - url="https://pubmed.ncbi.nlm.nih.gov/12345/", - date="2024-01-01", - ), - ), - ], - sources_searched=["pubmed"], - total_found=1, - errors=[], - )) - return handler - - @pytest.fixture - def mock_judge_sufficient(self): - """Create a mock judge that returns sufficient.""" - handler = AsyncMock() - handler.assess = AsyncMock(return_value=JudgeAssessment( - details=AssessmentDetails( - mechanism_score=8, - mechanism_reasoning="Good mechanism", - clinical_evidence_score=7, - clinical_reasoning="Good clinical", - drug_candidates=["Drug A"], - key_findings=["Finding 1"], - ), - sufficient=True, - confidence=0.85, - recommendation="synthesize", - next_search_queries=[], - reasoning="Evidence is sufficient", - )) - return handler - - @pytest.fixture - def mock_judge_insufficient(self): - """Create a mock judge that returns insufficient.""" - handler = AsyncMock() - handler.assess = AsyncMock(return_value=JudgeAssessment( - details=AssessmentDetails( - mechanism_score=4, - mechanism_reasoning="Weak mechanism", - clinical_evidence_score=3, - clinical_reasoning="Weak clinical", - drug_candidates=[], - key_findings=[], - ), - sufficient=False, - confidence=0.3, - recommendation="continue", - next_search_queries=["more specific query"], - reasoning="Need more evidence", - )) - return handler - - @pytest.mark.asyncio - async def test_orchestrator_completes_with_sufficient_evidence( - self, - mock_search_handler, - mock_judge_sufficient, - ): - """Orchestrator should complete when evidence is sufficient.""" - from src.orchestrator import Orchestrator - - config = OrchestratorConfig(max_iterations=5) - orchestrator = Orchestrator( - search_handler=mock_search_handler, - judge_handler=mock_judge_sufficient, - config=config, - ) - - events = [] - async for event in orchestrator.run("test query"): - events.append(event) - - # Should have started, searched, judged, and completed - event_types = [e.type for e in events] - assert "started" in event_types - assert "searching" in event_types - assert "search_complete" in event_types - assert "judging" in event_types - assert "judge_complete" in event_types - assert "complete" in event_types - - # Should only have 1 iteration - complete_event = [e for e in events if e.type == "complete"][0] - assert complete_event.iteration == 1 - - @pytest.mark.asyncio - async def test_orchestrator_loops_when_insufficient( - self, - mock_search_handler, - mock_judge_insufficient, - ): - """Orchestrator should loop when evidence is insufficient.""" - from src.orchestrator import Orchestrator - - config = OrchestratorConfig(max_iterations=3) - orchestrator = Orchestrator( - search_handler=mock_search_handler, - judge_handler=mock_judge_insufficient, - config=config, - ) - - events = [] - async for event in orchestrator.run("test query"): - events.append(event) - - # Should have looping events - event_types = 
[e.type for e in events] - assert event_types.count("looping") >= 2 # At least 2 loop events - - # Should hit max iterations - complete_event = [e for e in events if e.type == "complete"][0] - assert complete_event.data.get("max_reached") is True - - @pytest.mark.asyncio - async def test_orchestrator_respects_max_iterations( - self, - mock_search_handler, - mock_judge_insufficient, - ): - """Orchestrator should stop at max_iterations.""" - from src.orchestrator import Orchestrator - - config = OrchestratorConfig(max_iterations=2) - orchestrator = Orchestrator( - search_handler=mock_search_handler, - judge_handler=mock_judge_insufficient, - config=config, - ) - - events = [] - async for event in orchestrator.run("test query"): - events.append(event) - - # Should have exactly 2 iterations - max_iteration = max(e.iteration for e in events) - assert max_iteration == 2 - - @pytest.mark.asyncio - async def test_orchestrator_handles_search_error(self): - """Orchestrator should handle search errors gracefully.""" - from src.orchestrator import Orchestrator - - mock_search = AsyncMock() - mock_search.execute = AsyncMock(side_effect=Exception("Search failed")) - - mock_judge = AsyncMock() - mock_judge.assess = AsyncMock(return_value=JudgeAssessment( - details=AssessmentDetails( - mechanism_score=0, - mechanism_reasoning="N/A", - clinical_evidence_score=0, - clinical_reasoning="N/A", - drug_candidates=[], - key_findings=[], - ), - sufficient=False, - confidence=0.0, - recommendation="continue", - next_search_queries=["retry query"], - reasoning="Search failed", - )) - - config = OrchestratorConfig(max_iterations=2) - orchestrator = Orchestrator( - search_handler=mock_search, - judge_handler=mock_judge, - config=config, - ) - - events = [] - async for event in orchestrator.run("test query"): - events.append(event) - - # Should have error events - event_types = [e.type for e in events] - assert "error" in event_types - - @pytest.mark.asyncio - async def test_orchestrator_deduplicates_evidence(self, mock_judge_insufficient): - """Orchestrator should deduplicate evidence by URL.""" - from src.orchestrator import Orchestrator - - # Search returns same evidence each time - duplicate_evidence = Evidence( - content="Duplicate content", - citation=Citation( - source="pubmed", - title="Same Title", - url="https://pubmed.ncbi.nlm.nih.gov/12345/", # Same URL - date="2024-01-01", - ), - ) - - mock_search = AsyncMock() - mock_search.execute = AsyncMock(return_value=SearchResult( - query="test", - evidence=[duplicate_evidence], - sources_searched=["pubmed"], - total_found=1, - errors=[], - )) - - config = OrchestratorConfig(max_iterations=2) - orchestrator = Orchestrator( - search_handler=mock_search, - judge_handler=mock_judge_insufficient, - config=config, - ) - - events = [] - async for event in orchestrator.run("test query"): - events.append(event) - - # Second search_complete should show 0 new evidence - search_complete_events = [e for e in events if e.type == "search_complete"] - assert len(search_complete_events) == 2 - - # First iteration should have 1 new - assert search_complete_events[0].data["new_count"] == 1 - - # Second iteration should have 0 new (duplicate) - assert search_complete_events[1].data["new_count"] == 0 - - -class TestAgentEvent: - """Tests for AgentEvent.""" - - def test_to_markdown(self): - """AgentEvent should format to markdown correctly.""" - from src.utils.models import AgentEvent - - event = AgentEvent( - type="searching", - message="Searching for: metformin alzheimer", - 
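-            # "searching" maps to the 🔍 icon in AgentEvent.to_markdown()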
iteration=1, - ) - - md = event.to_markdown() - assert "🔍" in md - assert "SEARCHING" in md - assert "metformin alzheimer" in md - - def test_complete_event_icon(self): - """Complete event should have celebration icon.""" - from src.utils.models import AgentEvent - - event = AgentEvent( - type="complete", - message="Done!", - iteration=3, - ) - - md = event.to_markdown() - assert "🎉" in md -``` - ---- - -## 6. Dockerfile - -```dockerfile -# Dockerfile for DeepCritical -FROM python:3.11-slim - -# Set working directory -WORKDIR /app - -# Install system dependencies -RUN apt-get update && apt-get install -y \ - git \ - && rm -rf /var/lib/apt/lists/* - -# Install uv -RUN pip install uv - -# Copy project files -COPY pyproject.toml . -COPY src/ src/ - -# Install dependencies -RUN uv pip install --system . - -# Expose port -EXPOSE 7860 - -# Set environment variables -ENV GRADIO_SERVER_NAME=0.0.0.0 -ENV GRADIO_SERVER_PORT=7860 - -# Run the app -CMD ["python", "-m", "src.app"] -``` - ---- - -## 7. HuggingFace Spaces Configuration - -Create `README.md` header for HuggingFace Spaces: - -```markdown ---- -title: DeepCritical -emoji: 🧬 -colorFrom: blue -colorTo: purple -sdk: gradio -sdk_version: 5.0.0 -app_file: src/app.py -pinned: false -license: mit ---- - -# DeepCritical - -AI-Powered Drug Repurposing Research Agent -``` - ---- - -## 8. Implementation Checklist - -- [ ] Add `AgentEvent` and `OrchestratorConfig` models to `src/utils/models.py` -- [ ] Implement `src/orchestrator.py` with full Orchestrator class -- [ ] Implement `src/app.py` with Gradio interface -- [ ] Create `tests/unit/test_orchestrator.py` with all tests -- [ ] Create `Dockerfile` for deployment -- [ ] Update project `README.md` with usage instructions -- [ ] Run `uv run pytest tests/unit/test_orchestrator.py -v` — **ALL TESTS MUST PASS** -- [ ] Test locally: `uv run python -m src.app` -- [ ] Commit: `git commit -m "feat: phase 4 orchestrator and UI complete"` - ---- - -## 9. Definition of Done - -Phase 4 is **COMPLETE** when: - -1. All unit tests pass: `uv run pytest tests/unit/test_orchestrator.py -v` -2. Orchestrator correctly loops Search -> Judge until sufficient -3. Max iterations limit is enforced -4. Graceful error handling throughout -5. Gradio UI streams events in real-time -6. Can run locally: - -```bash -# Start the UI -uv run python -m src.app - -# Open browser to http://localhost:7860 -# Enter a question like "What drugs could be repurposed for Alzheimer's disease?" -# Watch the agent search, evaluate, and respond -``` - -7. 
Can run the full flow in Python: - -```python -import asyncio -from src.orchestrator import Orchestrator -from src.tools.pubmed import PubMedTool -from src.tools.biorxiv import BioRxivTool -from src.tools.clinicaltrials import ClinicalTrialsTool -from src.tools.search_handler import SearchHandler -from src.agent_factory.judges import HFInferenceJudgeHandler, MockJudgeHandler -from src.utils.models import OrchestratorConfig - -async def test_full_flow(): - # Create components - search_handler = SearchHandler([PubMedTool(), ClinicalTrialsTool(), BioRxivTool()]) - - # Option 1: Use FREE HuggingFace Inference (real AI analysis) - judge_handler = HFInferenceJudgeHandler() - - # Option 2: Use MockJudgeHandler for UNIT TESTING ONLY - # judge_handler = MockJudgeHandler() - - config = OrchestratorConfig(max_iterations=3) - - # Create orchestrator - orchestrator = Orchestrator( - search_handler=search_handler, - judge_handler=judge_handler, - config=config, - ) - - # Run and collect events - print("Starting agent...") - async for event in orchestrator.run("metformin alzheimer"): - print(event.to_markdown()) - - print("\nDone!") - -asyncio.run(test_full_flow()) -``` - -**Important**: `MockJudgeHandler` is for **unit testing only**. For actual demo/production use, always use `HFInferenceJudgeHandler` (free) or `JudgeHandler` (with API key). - ---- - -## 10. Deployment Verification - -After deployment to HuggingFace Spaces: - -1. **Visit the Space URL** and verify the UI loads -2. **Test with example queries**: - - "What drugs could be repurposed for Alzheimer's disease?" - - "Is metformin effective for cancer treatment?" -3. **Verify streaming** - events should appear in real-time -4. **Check error handling** - try an empty query, verify graceful handling -5. **Monitor logs** for any errors - ---- - -## Project Complete! 🎉 - -When Phase 4 is done, the DeepCritical MVP is complete: - -- **Phase 1**: Foundation (uv, pytest, config) ✅ -- **Phase 2**: Search Slice (PubMed, DuckDuckGo) ✅ -- **Phase 3**: Judge Slice (PydanticAI, structured output) ✅ -- **Phase 4**: Orchestrator + UI (Gradio, streaming) ✅ - -The agent can: -1. Accept a drug repurposing question -2. Search PubMed and the web for evidence -3. Evaluate evidence quality with an LLM -4. Loop until confident or max iterations -5. Synthesize a research-backed recommendation -6. Display real-time progress in a beautiful UI diff --git a/docs/implementation/05_phase_magentic.md b/docs/implementation/05_phase_magentic.md deleted file mode 100644 index fd5de5fc30fea1802c6198923bc0b542a4f566aa..0000000000000000000000000000000000000000 --- a/docs/implementation/05_phase_magentic.md +++ /dev/null @@ -1,1091 +0,0 @@ -# Phase 5 Implementation Spec: Magentic Integration - -**Goal**: Upgrade orchestrator to use Microsoft Agent Framework's Magentic-One pattern. -**Philosophy**: "Same API, Better Engine." -**Prerequisite**: Phase 4 complete (MVP working end-to-end) - ---- - -## 1. Why Magentic? - -Magentic-One provides: -- **LLM-powered manager** that dynamically plans, selects agents, tracks progress -- **Built-in stall detection** and automatic replanning -- **Checkpointing** for pause/resume workflows -- **Event streaming** for real-time UI updates -- **Multi-agent coordination** with round limits and reset logic - ---- - -## 2. 
Critical Architecture Understanding - -### 2.1 How Magentic Actually Works - -``` -┌─────────────────────────────────────────────────────────────────────────┐ -│ MagenticBuilder Workflow │ -├─────────────────────────────────────────────────────────────────────────┤ -│ │ -│ User Task: "Research drug repurposing for metformin alzheimer" │ -│ ↓ │ -│ ┌──────────────────────────────────────────────────────────────────┐ │ -│ │ StandardMagenticManager │ │ -│ │ │ │ -│ │ 1. plan() → LLM generates facts & plan │ │ -│ │ 2. create_progress_ledger() → LLM decides: │ │ -│ │ - is_request_satisfied? │ │ -│ │ - next_speaker: "searcher" │ │ -│ │ - instruction_or_question: "Search for clinical trials..." │ │ -│ │ │ │ -│ └──────────────────────────────────────────────────────────────────┘ │ -│ ↓ │ -│ NATURAL LANGUAGE INSTRUCTION sent to agent │ -│ "Search for clinical trials about metformin..." │ -│ ↓ │ -│ ┌──────────────────────────────────────────────────────────────────┐ │ -│ │ ChatAgent (searcher) │ │ -│ │ │ │ -│ │ chat_client (INTERNAL LLM) ← understands instruction │ │ -│ │ ↓ │ │ -│ │ "I'll search for metformin alzheimer clinical trials" │ │ -│ │ ↓ │ │ -│ │ tools=[search_pubmed, search_clinicaltrials] ← calls tools │ │ -│ │ ↓ │ │ -│ │ Returns natural language response to manager │ │ -│ │ │ │ -│ └──────────────────────────────────────────────────────────────────┘ │ -│ ↓ │ -│ Manager evaluates response │ -│ Decides next agent or completion │ -│ │ -└─────────────────────────────────────────────────────────────────────────┘ -``` - -### 2.2 The Critical Insight - -**Microsoft's ChatAgent has an INTERNAL LLM (`chat_client`) that:** -1. Receives natural language instructions from the manager -2. Understands what action to take -3. Calls attached tools (functions) -4. Returns natural language responses - -**Our previous implementation was WRONG because:** -- We wrapped handlers as bare `BaseAgent` subclasses -- No internal LLM to understand instructions -- Raw instruction text was passed directly to APIs (PubMed doesn't understand "Search for clinical trials...") - -### 2.3 Correct Pattern: ChatAgent with Tools - -```python -# CORRECT: Agent backed by LLM that calls tools -from agent_framework import ChatAgent, AIFunction -from agent_framework.openai import OpenAIChatClient - -# Define tool that ChatAgent can call -@AIFunction -async def search_pubmed(query: str, max_results: int = 10) -> str: - """Search PubMed for biomedical literature. - - Args: - query: Search keywords (e.g., "metformin alzheimer mechanism") - max_results: Maximum number of results to return - """ - result = await pubmed_tool.search(query, max_results) - return format_results(result) - -# ChatAgent with internal LLM + tools -search_agent = ChatAgent( - name="SearchAgent", - description="Searches biomedical databases for drug repurposing evidence", - instructions="You search PubMed, ClinicalTrials.gov, and bioRxiv for evidence.", - chat_client=OpenAIChatClient(model_id="gpt-4o-mini"), # INTERNAL LLM - tools=[search_pubmed, search_clinicaltrials, search_biorxiv], # TOOLS -) -``` - ---- - -## 3. Correct Implementation - -### 3.1 Shared State Module (`src/agents/state.py`) - -**CRITICAL**: Tools must update shared state so: -1. EmbeddingService can deduplicate across searches -2. ReportAgent can access structured Evidence objects for citations - -```python -"""Shared state for Magentic agents. - -This module provides global state that tools update as a side effect. 
-ChatAgent tools return strings to the LLM, but also update this state -for semantic deduplication and structured citation access. -""" -from __future__ import annotations - -from typing import TYPE_CHECKING - -import structlog - -if TYPE_CHECKING: - from src.services.embeddings import EmbeddingService - -from src.utils.models import Evidence - -logger = structlog.get_logger() - - -class MagenticState: - """Shared state container for Magentic workflow. - - Maintains: - - evidence_store: All collected Evidence objects (for citations) - - embedding_service: Optional semantic search (for deduplication) - """ - - def __init__(self) -> None: - self.evidence_store: list[Evidence] = [] - self.embedding_service: EmbeddingService | None = None - self._seen_urls: set[str] = set() - - def init_embedding_service(self) -> None: - """Lazy-initialize embedding service if available.""" - if self.embedding_service is not None: - return - try: - from src.services.embeddings import get_embedding_service - self.embedding_service = get_embedding_service() - logger.info("Embedding service enabled for Magentic mode") - except Exception as e: - logger.warning("Embedding service unavailable", error=str(e)) - - async def add_evidence(self, evidence_list: list[Evidence]) -> list[Evidence]: - """Add evidence with semantic deduplication. - - Args: - evidence_list: New evidence from search - - Returns: - List of unique evidence (not duplicates) - """ - if not evidence_list: - return [] - - # URL-based deduplication first (fast) - url_unique = [ - e for e in evidence_list - if e.citation.url not in self._seen_urls - ] - - # Semantic deduplication if available - if self.embedding_service and url_unique: - try: - unique = await self.embedding_service.deduplicate(url_unique, threshold=0.85) - logger.info( - "Semantic deduplication", - before=len(url_unique), - after=len(unique), - ) - except Exception as e: - logger.warning("Deduplication failed, using URL-based", error=str(e)) - unique = url_unique - else: - unique = url_unique - - # Update state - for e in unique: - self._seen_urls.add(e.citation.url) - self.evidence_store.append(e) - - return unique - - async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]: - """Find semantically related evidence from vector store. 
- - Args: - query: Search query - n_results: Number of related items - - Returns: - Related Evidence objects (reconstructed from vector store) - """ - if not self.embedding_service: - return [] - - try: - from src.utils.models import Citation - - related = await self.embedding_service.search_similar(query, n_results) - evidence = [] - - for item in related: - if item["id"] in self._seen_urls: - continue # Already in results - - meta = item.get("metadata", {}) - authors_str = meta.get("authors", "") - authors = [a.strip() for a in authors_str.split(",") if a.strip()] - - ev = Evidence( - content=item["content"], - citation=Citation( - title=meta.get("title", "Related Evidence"), - url=item["id"], - source=meta.get("source", "pubmed"), - date=meta.get("date", "n.d."), - authors=authors, - ), - relevance=max(0.0, 1.0 - item.get("distance", 0.5)), - ) - evidence.append(ev) - - return evidence - except Exception as e: - logger.warning("Related search failed", error=str(e)) - return [] - - def reset(self) -> None: - """Reset state for new workflow run.""" - self.evidence_store.clear() - self._seen_urls.clear() - - -# Global singleton for workflow -_state: MagenticState | None = None - - -def get_magentic_state() -> MagenticState: - """Get or create the global Magentic state.""" - global _state - if _state is None: - _state = MagenticState() - return _state - - -def reset_magentic_state() -> None: - """Reset state for a fresh workflow run.""" - global _state - if _state is not None: - _state.reset() - else: - _state = MagenticState() -``` - -### 3.2 Tool Functions (`src/agents/tools.py`) - -Tools call APIs AND update shared state. Return strings to LLM, but also store structured Evidence. - -```python -"""Tool functions for Magentic agents. - -IMPORTANT: These tools do TWO things: -1. Return formatted strings to the ChatAgent's internal LLM -2. Update shared state (evidence_store, embeddings) as a side effect - -This preserves semantic deduplication and structured citation access. -""" -from agent_framework import AIFunction - -from src.agents.state import get_magentic_state -from src.tools.biorxiv import BioRxivTool -from src.tools.clinicaltrials import ClinicalTrialsTool -from src.tools.pubmed import PubMedTool - -# Singleton tool instances -_pubmed = PubMedTool() -_clinicaltrials = ClinicalTrialsTool() -_biorxiv = BioRxivTool() - - -def _format_results(results: list, source_name: str, query: str) -> str: - """Format search results for LLM consumption.""" - if not results: - return f"No {source_name} results found for: {query}" - - output = [f"Found {len(results)} {source_name} results:\n"] - for i, r in enumerate(results[:10], 1): - output.append(f"{i}. **{r.citation.title}**") - output.append(f" Source: {r.citation.source} | Date: {r.citation.date}") - output.append(f" {r.content[:300]}...") - output.append(f" URL: {r.citation.url}\n") - - return "\n".join(output) - - -@AIFunction -async def search_pubmed(query: str, max_results: int = 10) -> str: - """Search PubMed for biomedical research papers. - - Use this tool to find peer-reviewed scientific literature about - drugs, diseases, mechanisms of action, and clinical studies. - - Args: - query: Search keywords (e.g., "metformin alzheimer mechanism") - max_results: Maximum results to return (default 10) - - Returns: - Formatted list of papers with titles, abstracts, and citations - """ - # 1. Execute search - results = await _pubmed.search(query, max_results) - - # 2. 
Update shared state (semantic dedup + evidence store) - state = get_magentic_state() - unique = await state.add_evidence(results) - - # 3. Also get related evidence from vector store - related = await state.search_related(query, n_results=3) - if related: - await state.add_evidence(related) - - # 4. Return formatted string for LLM - total_new = len(unique) - total_stored = len(state.evidence_store) - - output = _format_results(results, "PubMed", query) - output += f"\n[State: {total_new} new, {total_stored} total in evidence store]" - - if related: - output += f"\n[Also found {len(related)} semantically related items from previous searches]" - - return output - - -@AIFunction -async def search_clinical_trials(query: str, max_results: int = 10) -> str: - """Search ClinicalTrials.gov for clinical studies. - - Use this tool to find ongoing and completed clinical trials - for drug repurposing candidates. - - Args: - query: Search terms (e.g., "metformin cancer phase 3") - max_results: Maximum results to return (default 10) - - Returns: - Formatted list of clinical trials with status and details - """ - # 1. Execute search - results = await _clinicaltrials.search(query, max_results) - - # 2. Update shared state - state = get_magentic_state() - unique = await state.add_evidence(results) - - # 3. Return formatted string - total_new = len(unique) - total_stored = len(state.evidence_store) - - output = _format_results(results, "ClinicalTrials.gov", query) - output += f"\n[State: {total_new} new, {total_stored} total in evidence store]" - - return output - - -@AIFunction -async def search_preprints(query: str, max_results: int = 10) -> str: - """Search bioRxiv/medRxiv for preprint papers. - - Use this tool to find the latest research that hasn't been - peer-reviewed yet. Good for cutting-edge findings. - - Args: - query: Search terms (e.g., "long covid treatment") - max_results: Maximum results to return (default 10) - - Returns: - Formatted list of preprints with abstracts and links - """ - # 1. Execute search - results = await _biorxiv.search(query, max_results) - - # 2. Update shared state - state = get_magentic_state() - unique = await state.add_evidence(results) - - # 3. Return formatted string - total_new = len(unique) - total_stored = len(state.evidence_store) - - output = _format_results(results, "bioRxiv/medRxiv", query) - output += f"\n[State: {total_new} new, {total_stored} total in evidence store]" - - return output - - -@AIFunction -async def get_evidence_summary() -> str: - """Get summary of all collected evidence. - - Use this tool when you need to review what evidence has been collected - before making an assessment or generating a report. - - Returns: - Summary of evidence store with counts and key citations - """ - state = get_magentic_state() - evidence = state.evidence_store - - if not evidence: - return "No evidence collected yet." - - # Group by source - by_source: dict[str, list] = {} - for e in evidence: - src = e.citation.source - if src not in by_source: - by_source[src] = [] - by_source[src].append(e) - - output = [f"**Evidence Store Summary** ({len(evidence)} total items)\n"] - - for source, items in by_source.items(): - output.append(f"\n### {source.upper()} ({len(items)} items)") - for e in items[:5]: # First 5 per source - output.append(f"- {e.citation.title[:80]}...") - - return "\n".join(output) - - -@AIFunction -async def get_bibliography() -> str: - """Get full bibliography of all collected evidence. 
- - Use this tool when generating a final report to get properly - formatted citations for all evidence. - - Returns: - Numbered bibliography with full citation details - """ - state = get_magentic_state() - evidence = state.evidence_store - - if not evidence: - return "No evidence collected for bibliography." - - output = ["## References\n"] - - for i, e in enumerate(evidence, 1): - # Format: Authors (Year). Title. Source. URL - authors = ", ".join(e.citation.authors[:3]) if e.citation.authors else "Unknown" - if e.citation.authors and len(e.citation.authors) > 3: - authors += " et al." - - year = e.citation.date[:4] if e.citation.date else "n.d." - - output.append( - f"{i}. {authors} ({year}). {e.citation.title}. " - f"*{e.citation.source.upper()}*. [{e.citation.url}]({e.citation.url})" - ) - - return "\n".join(output) -``` - -### 3.3 ChatAgent-Based Agents (`src/agents/magentic_agents.py`) - -```python -"""Magentic-compatible agents using ChatAgent pattern.""" -from agent_framework import ChatAgent -from agent_framework.openai import OpenAIChatClient - -from src.agents.tools import ( - get_bibliography, - get_evidence_summary, - search_clinical_trials, - search_preprints, - search_pubmed, -) -from src.utils.config import settings - - -def create_search_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent: - """Create a search agent with internal LLM and search tools. - - Args: - chat_client: Optional custom chat client. If None, uses default. - - Returns: - ChatAgent configured for biomedical search - """ - client = chat_client or OpenAIChatClient( - model_id="gpt-4o-mini", # Fast, cheap for tool orchestration - api_key=settings.openai_api_key, - ) - - return ChatAgent( - name="SearchAgent", - description="Searches biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) for drug repurposing evidence", - instructions="""You are a biomedical search specialist. When asked to find evidence: - -1. Analyze the request to determine what to search for -2. Extract key search terms (drug names, disease names, mechanisms) -3. Use the appropriate search tools: - - search_pubmed for peer-reviewed papers - - search_clinical_trials for clinical studies - - search_preprints for cutting-edge findings -4. Summarize what you found and highlight key evidence - -Be thorough - search multiple databases when appropriate. -Focus on finding: mechanisms of action, clinical evidence, and specific drug candidates.""", - chat_client=client, - tools=[search_pubmed, search_clinical_trials, search_preprints], - temperature=0.3, # More deterministic for tool use - ) - - -def create_judge_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent: - """Create a judge agent that evaluates evidence quality. - - Args: - chat_client: Optional custom chat client. If None, uses default. - - Returns: - ChatAgent configured for evidence assessment - """ - client = chat_client or OpenAIChatClient( - model_id="gpt-4o", # Better model for nuanced judgment - api_key=settings.openai_api_key, - ) - - return ChatAgent( - name="JudgeAgent", - description="Evaluates evidence quality and determines if sufficient for synthesis", - instructions="""You are an evidence quality assessor. When asked to evaluate: - -1. First, call get_evidence_summary() to see all collected evidence -2. Score on two dimensions (0-10 each): - - Mechanism Score: How well is the biological mechanism explained? - - Clinical Score: How strong is the clinical/preclinical evidence? -3. 
Determine if evidence is SUFFICIENT for a final report: - - Sufficient: Clear mechanism + supporting clinical data - - Insufficient: Gaps in mechanism OR weak clinical evidence -4. If insufficient, suggest specific search queries to fill gaps - -Be rigorous but fair. Look for: -- Molecular targets and pathways -- Animal model studies -- Human clinical trials -- Safety data -- Drug-drug interactions""", - chat_client=client, - tools=[get_evidence_summary], # Can review collected evidence - temperature=0.2, # Consistent judgments - ) - - -def create_hypothesis_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent: - """Create a hypothesis generation agent. - - Args: - chat_client: Optional custom chat client. If None, uses default. - - Returns: - ChatAgent configured for hypothesis generation - """ - client = chat_client or OpenAIChatClient( - model_id="gpt-4o", - api_key=settings.openai_api_key, - ) - - return ChatAgent( - name="HypothesisAgent", - description="Generates mechanistic hypotheses for drug repurposing", - instructions="""You are a biomedical hypothesis generator. Based on evidence: - -1. Identify the key molecular targets involved -2. Map the biological pathways affected -3. Generate testable hypotheses in this format: - - DRUG → TARGET → PATHWAY → THERAPEUTIC EFFECT - - Example: - Metformin → AMPK activation → mTOR inhibition → Reduced tau phosphorylation - -4. Explain the rationale for each hypothesis -5. Suggest what additional evidence would support or refute it - -Focus on mechanistic plausibility and existing evidence.""", - chat_client=client, - temperature=0.5, # Some creativity for hypothesis generation - ) - - -def create_report_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent: - """Create a report synthesis agent. - - Args: - chat_client: Optional custom chat client. If None, uses default. - - Returns: - ChatAgent configured for report generation - """ - client = chat_client or OpenAIChatClient( - model_id="gpt-4o", - api_key=settings.openai_api_key, - ) - - return ChatAgent( - name="ReportAgent", - description="Synthesizes research findings into structured reports", - instructions="""You are a scientific report writer. When asked to synthesize: - -1. First, call get_evidence_summary() to review all collected evidence -2. Then call get_bibliography() to get properly formatted citations - -Generate a structured report with these sections: - -## Executive Summary -Brief overview of findings and recommendation - -## Methodology -Databases searched, queries used, evidence reviewed - -## Key Findings -### Mechanism of Action -- Molecular targets -- Biological pathways -- Proposed mechanism - -### Clinical Evidence -- Preclinical studies -- Clinical trials -- Safety profile - -## Drug Candidates -List specific drugs with repurposing potential - -## Limitations -Gaps in evidence, conflicting data, caveats - -## Conclusion -Final recommendation with confidence level - -## References -Use the output from get_bibliography() - do not make up citations! - -Be comprehensive but concise. 
Cite evidence for all claims.""", - chat_client=client, - tools=[get_evidence_summary, get_bibliography], # Access to collected evidence - temperature=0.3, - ) -``` - -### 3.4 Magentic Orchestrator (`src/orchestrator_magentic.py`) - -```python -"""Magentic-based orchestrator using ChatAgent pattern.""" -from collections.abc import AsyncGenerator -from typing import Any - -import structlog -from agent_framework import ( - MagenticAgentDeltaEvent, - MagenticAgentMessageEvent, - MagenticBuilder, - MagenticFinalResultEvent, - MagenticOrchestratorMessageEvent, - WorkflowOutputEvent, -) -from agent_framework.openai import OpenAIChatClient - -from src.agents.magentic_agents import ( - create_hypothesis_agent, - create_judge_agent, - create_report_agent, - create_search_agent, -) -from src.agents.state import get_magentic_state, reset_magentic_state -from src.utils.config import settings -from src.utils.exceptions import ConfigurationError -from src.utils.models import AgentEvent - -logger = structlog.get_logger() - - -class MagenticOrchestrator: - """ - Magentic-based orchestrator using ChatAgent pattern. - - Each agent has an internal LLM that understands natural language - instructions from the manager and can call tools appropriately. - """ - - def __init__( - self, - max_rounds: int = 10, - chat_client: OpenAIChatClient | None = None, - ) -> None: - """Initialize orchestrator. - - Args: - max_rounds: Maximum coordination rounds - chat_client: Optional shared chat client for agents - """ - if not settings.openai_api_key: - raise ConfigurationError( - "Magentic mode requires OPENAI_API_KEY. " - "Set the key or use mode='simple'." - ) - - self._max_rounds = max_rounds - self._chat_client = chat_client - - def _build_workflow(self) -> Any: - """Build the Magentic workflow with ChatAgent participants.""" - # Create agents with internal LLMs - search_agent = create_search_agent(self._chat_client) - judge_agent = create_judge_agent(self._chat_client) - hypothesis_agent = create_hypothesis_agent(self._chat_client) - report_agent = create_report_agent(self._chat_client) - - # Manager chat client (orchestrates the agents) - manager_client = OpenAIChatClient( - model_id="gpt-4o", # Good model for planning/coordination - api_key=settings.openai_api_key, - ) - - return ( - MagenticBuilder() - .participants( - searcher=search_agent, - hypothesizer=hypothesis_agent, - judge=judge_agent, - reporter=report_agent, - ) - .with_standard_manager( - chat_client=manager_client, - max_round_count=self._max_rounds, - max_stall_count=3, - max_reset_count=2, - ) - .build() - ) - - async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]: - """ - Run the Magentic workflow. - - Args: - query: User's research question - - Yields: - AgentEvent objects for real-time UI updates - """ - logger.info("Starting Magentic orchestrator", query=query) - - # CRITICAL: Reset state for fresh workflow run - reset_magentic_state() - - # Initialize embedding service if available - state = get_magentic_state() - state.init_embedding_service() - - yield AgentEvent( - type="started", - message=f"Starting research (Magentic mode): {query}", - iteration=0, - ) - - workflow = self._build_workflow() - - task = f"""Research drug repurposing opportunities for: {query} - -Workflow: -1. SearchAgent: Find evidence from PubMed, ClinicalTrials.gov, and bioRxiv -2. HypothesisAgent: Generate mechanistic hypotheses (Drug → Target → Pathway → Effect) -3. JudgeAgent: Evaluate if evidence is sufficient -4. 
If insufficient → SearchAgent refines search based on gaps -5. If sufficient → ReportAgent synthesizes final report - -Focus on: -- Identifying specific molecular targets -- Understanding mechanism of action -- Finding clinical evidence supporting hypotheses - -The final output should be a structured research report.""" - - iteration = 0 - try: - async for event in workflow.run_stream(task): - agent_event = self._process_event(event, iteration) - if agent_event: - if isinstance(event, MagenticAgentMessageEvent): - iteration += 1 - yield agent_event - - except Exception as e: - logger.error("Magentic workflow failed", error=str(e)) - yield AgentEvent( - type="error", - message=f"Workflow error: {e!s}", - iteration=iteration, - ) - - def _process_event(self, event: Any, iteration: int) -> AgentEvent | None: - """Process workflow event into AgentEvent.""" - if isinstance(event, MagenticOrchestratorMessageEvent): - text = event.message.text if event.message else "" - if text: - return AgentEvent( - type="judging", - message=f"Manager ({event.kind}): {text[:200]}...", - iteration=iteration, - ) - - elif isinstance(event, MagenticAgentMessageEvent): - agent_name = event.agent_id or "unknown" - text = event.message.text if event.message else "" - - event_type = "judging" - if "search" in agent_name.lower(): - event_type = "search_complete" - elif "judge" in agent_name.lower(): - event_type = "judge_complete" - elif "hypothes" in agent_name.lower(): - event_type = "hypothesizing" - elif "report" in agent_name.lower(): - event_type = "synthesizing" - - return AgentEvent( - type=event_type, - message=f"{agent_name}: {text[:200]}...", - iteration=iteration + 1, - ) - - elif isinstance(event, MagenticFinalResultEvent): - text = event.message.text if event.message else "No result" - return AgentEvent( - type="complete", - message=text, - data={"iterations": iteration}, - iteration=iteration, - ) - - elif isinstance(event, MagenticAgentDeltaEvent): - if event.text: - return AgentEvent( - type="streaming", - message=event.text, - data={"agent_id": event.agent_id}, - iteration=iteration, - ) - - elif isinstance(event, WorkflowOutputEvent): - if event.data: - return AgentEvent( - type="complete", - message=str(event.data), - iteration=iteration, - ) - - return None -``` - -### 3.4 Updated Factory (`src/orchestrator_factory.py`) - -```python -"""Factory for creating orchestrators.""" -from typing import Any, Literal - -from src.orchestrator import JudgeHandlerProtocol, Orchestrator, SearchHandlerProtocol -from src.utils.models import OrchestratorConfig - - -def create_orchestrator( - search_handler: SearchHandlerProtocol | None = None, - judge_handler: JudgeHandlerProtocol | None = None, - config: OrchestratorConfig | None = None, - mode: Literal["simple", "magentic"] = "simple", -) -> Any: - """ - Create an orchestrator instance. - - Args: - search_handler: The search handler (required for simple mode) - judge_handler: The judge handler (required for simple mode) - config: Optional configuration - mode: "simple" for Phase 4 loop, "magentic" for ChatAgent-based multi-agent - - Returns: - Orchestrator instance - - Note: - Magentic mode does NOT use search_handler/judge_handler. - It creates ChatAgent instances with internal LLMs that call tools directly. 
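
    Example (illustrative sketch; `search_handler`/`judge_handler` are assumed to be
    built by the earlier simple-mode slices):

        # Magentic mode builds its own ChatAgents; handlers are ignored.
        orchestrator = create_orchestrator(mode="magentic")

        # Simple mode wires the Phase 4 loop from explicit handlers.
        orchestrator = create_orchestrator(
            search_handler=search_handler,
            judge_handler=judge_handler,
            mode="simple",
        )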
- """ - if mode == "magentic": - try: - from src.orchestrator_magentic import MagenticOrchestrator - - return MagenticOrchestrator( - max_rounds=config.max_iterations if config else 10, - ) - except ImportError: - # Fallback to simple if agent-framework not installed - pass - - # Simple mode requires handlers - if search_handler is None or judge_handler is None: - raise ValueError("Simple mode requires search_handler and judge_handler") - - return Orchestrator( - search_handler=search_handler, - judge_handler=judge_handler, - config=config, - ) -``` - ---- - -## 4. Why This Works - -### 4.1 The Manager → Agent Communication - -``` -Manager LLM decides: "Tell SearchAgent to find clinical trials for metformin" - ↓ -Sends instruction: "Search for clinical trials about metformin and cancer" - ↓ -SearchAgent's INTERNAL LLM receives this - ↓ -Internal LLM understands: "I should call search_clinical_trials('metformin cancer')" - ↓ -Tool executes: ClinicalTrials.gov API - ↓ -Internal LLM formats response: "I found 15 trials. Here are the key ones..." - ↓ -Manager receives natural language response -``` - -### 4.2 Why Our Old Implementation Failed - -``` -Manager sends: "Search for clinical trials about metformin..." - ↓ -OLD SearchAgent.run() extracts: query = "Search for clinical trials about metformin..." - ↓ -Passes to PubMed: pubmed.search("Search for clinical trials about metformin...") - ↓ -PubMed doesn't understand English instructions → garbage results or error -``` - ---- - -## 5. Directory Structure - -```text -src/ -├── agents/ -│ ├── __init__.py -│ ├── state.py # MagenticState (evidence_store + embeddings) -│ ├── tools.py # AIFunction tool definitions (update state) -│ └── magentic_agents.py # ChatAgent factory functions -├── services/ -│ └── embeddings.py # EmbeddingService (semantic dedup) -├── orchestrator.py # Simple mode (unchanged) -├── orchestrator_magentic.py # Magentic mode with ChatAgents -└── orchestrator_factory.py # Mode selection -``` - ---- - -## 6. Dependencies - -```toml -[project.optional-dependencies] -magentic = [ - "agent-framework-core>=1.0.0b", - "agent-framework-openai>=1.0.0b", # For OpenAIChatClient -] -embeddings = [ - "chromadb>=0.4.0", - "sentence-transformers>=2.2.0", -] -``` - -**IMPORTANT: Magentic mode REQUIRES OpenAI API key.** - -The Microsoft Agent Framework's standard manager and ChatAgent use OpenAIChatClient internally. -There is no AnthropicChatClient in the framework. If only `ANTHROPIC_API_KEY` is set: -- `mode="simple"` works fine -- `mode="magentic"` throws `ConfigurationError` - -This is enforced in `MagenticOrchestrator.__init__`. - ---- - -## 7. Implementation Checklist - -- [ ] Create `src/agents/state.py` with MagenticState class -- [ ] Create `src/agents/tools.py` with AIFunction search tools + state updates -- [ ] Create `src/agents/magentic_agents.py` with ChatAgent factories -- [ ] Rewrite `src/orchestrator_magentic.py` to use ChatAgent pattern -- [ ] Update `src/orchestrator_factory.py` for new signature -- [ ] Test with real OpenAI API -- [ ] Verify manager properly coordinates agents -- [ ] Ensure tools are called with correct parameters -- [ ] Verify semantic deduplication works (evidence_store populates) -- [ ] Verify bibliography generation in final reports - ---- - -## 8. Definition of Done - -Phase 5 is **COMPLETE** when: - -1. Magentic mode runs without hanging -2. Manager successfully coordinates agents via natural language -3. SearchAgent calls tools with proper search keywords (not raw instructions) -4. 
JudgeAgent evaluates evidence from conversation history -5. ReportAgent generates structured final report -6. Events stream to UI correctly - ---- - -## 9. Testing Magentic Mode - -```bash -# Test with real API -OPENAI_API_KEY=sk-... uv run python -c " -import asyncio -from src.orchestrator_factory import create_orchestrator - -async def test(): - orch = create_orchestrator(mode='magentic') - async for event in orch.run('metformin alzheimer'): - print(f'[{event.type}] {event.message[:100]}') - -asyncio.run(test()) -" -``` - -Expected output: -``` -[started] Starting research (Magentic mode): metformin alzheimer -[judging] Manager (plan): I will coordinate the agents to research... -[search_complete] SearchAgent: Found 25 PubMed results for metformin alzheimer... -[hypothesizing] HypothesisAgent: Based on the evidence, I propose... -[judge_complete] JudgeAgent: Mechanism Score: 7/10, Clinical Score: 6/10... -[synthesizing] ReportAgent: ## Executive Summary... -[complete] -``` - ---- - -## 10. Key Differences from Old Spec - -| Aspect | OLD (Wrong) | NEW (Correct) | -|--------|-------------|---------------| -| Agent type | `BaseAgent` subclass | `ChatAgent` with `chat_client` | -| Internal LLM | None | OpenAIChatClient | -| How tools work | Handler.execute(raw_instruction) | LLM understands instruction, calls AIFunction | -| Message handling | Extract text → pass to API | LLM interprets → extracts keywords → calls tool | -| State management | Passed to agent constructors | Global MagenticState singleton | -| Evidence storage | In agent instance | In MagenticState.evidence_store | -| Semantic search | Coupled to agents | Tools call state.add_evidence() | -| Citations for report | From agent's store | Via get_bibliography() tool | - -**Key Insights:** -1. Magentic agents must have internal LLMs to understand natural language instructions -2. Tools must update shared state as a side effect (return strings, but also store Evidence) -3. ReportAgent uses `get_bibliography()` tool to access structured citations -4. State is reset at start of each workflow run via `reset_magentic_state()` diff --git a/docs/implementation/06_phase_embeddings.md b/docs/implementation/06_phase_embeddings.md deleted file mode 100644 index e71887baa02988ea23f201b806ac9d31cb677d2c..0000000000000000000000000000000000000000 --- a/docs/implementation/06_phase_embeddings.md +++ /dev/null @@ -1,409 +0,0 @@ -# Phase 6 Implementation Spec: Embeddings & Semantic Search - -**Goal**: Add vector search for semantic evidence retrieval. -**Philosophy**: "Find what you mean, not just what you type." -**Prerequisite**: Phase 5 complete (Magentic working) - ---- - -## 1. Why Embeddings? - -Current limitation: **Keyword-only search misses semantically related papers.** - -Example problem: -- User searches: "metformin alzheimer" -- PubMed returns: Papers with exact keywords -- MISSED: Papers about "AMPK activation neuroprotection" (same mechanism, different words) - -With embeddings: -- Embed the query AND all evidence -- Find semantically similar papers even without keyword match -- Deduplicate by meaning, not just URL - ---- - -## 2. 
Architecture - -### Current (Phase 5) -``` -Query → SearchAgent → PubMed/Web (keyword) → Evidence -``` - -### Phase 6 -``` -Query → Embed(Query) → SearchAgent - ├── PubMed/Web (keyword) → Evidence - └── VectorDB (semantic) → Related Evidence - ↑ - Evidence → Embed → Store -``` - -### Shared Context Enhancement -```python -# Current -evidence_store = {"current": []} - -# Phase 6 -evidence_store = { - "current": [], # Raw evidence - "embeddings": {}, # URL -> embedding vector - "vector_index": None, # ChromaDB collection -} -``` - ---- - -## 3. Technology Choice - -### ChromaDB (Recommended) -- **Free**, open-source, local-first -- No API keys, no cloud dependency -- Supports sentence-transformers out of the box -- Perfect for hackathon (no infra setup) - -### Embedding Model -- `sentence-transformers/all-MiniLM-L6-v2` (fast, good quality) -- Or `BAAI/bge-small-en-v1.5` (better quality, still fast) - ---- - -## 4. Implementation - -### 4.1 Dependencies - -Add to `pyproject.toml`: -```toml -[project.optional-dependencies] -embeddings = [ - "chromadb>=0.4.0", - "sentence-transformers>=2.2.0", -] -``` - -### 4.2 Embedding Service (`src/services/embeddings.py`) - -> **CRITICAL: Async Pattern Required** -> -> `sentence-transformers` is synchronous and CPU-bound. Running it directly in async code -> will **block the event loop**, freezing the UI and halting all concurrent operations. -> -> **Solution**: Use `asyncio.run_in_executor()` to offload to thread pool. -> This pattern already exists in `src/tools/websearch.py:28-34`. - -```python -"""Embedding service for semantic search. - -IMPORTANT: All public methods are async to avoid blocking the event loop. -The sentence-transformers model is CPU-bound, so we use run_in_executor(). -""" -import asyncio -from typing import List - -import chromadb -from sentence_transformers import SentenceTransformer - - -class EmbeddingService: - """Handles text embedding and vector storage. - - All embedding operations run in a thread pool to avoid blocking - the async event loop. See src/tools/websearch.py for the pattern. - """ - - def __init__(self, model_name: str = "all-MiniLM-L6-v2"): - self._model = SentenceTransformer(model_name) - self._client = chromadb.Client() # In-memory for hackathon - self._collection = self._client.create_collection( - name="evidence", - metadata={"hnsw:space": "cosine"} - ) - - # ───────────────────────────────────────────────────────────────── - # Sync internal methods (run in thread pool) - # ───────────────────────────────────────────────────────────────── - - def _sync_embed(self, text: str) -> List[float]: - """Synchronous embedding - DO NOT call directly from async code.""" - return self._model.encode(text).tolist() - - def _sync_batch_embed(self, texts: List[str]) -> List[List[float]]: - """Batch embedding for efficiency - DO NOT call directly from async code.""" - return [e.tolist() for e in self._model.encode(texts)] - - # ───────────────────────────────────────────────────────────────── - # Async public methods (safe for event loop) - # ───────────────────────────────────────────────────────────────── - - async def embed(self, text: str) -> List[float]: - """Embed a single text (async-safe). - - Uses run_in_executor to avoid blocking the event loop. 
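
        Example (illustrative, assuming ``service = EmbeddingService()``):

            vec = await service.embed("metformin alzheimer")
            # vec is a plain list[float], ready to pass to ChromaDB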
- """ - loop = asyncio.get_running_loop() - return await loop.run_in_executor(None, self._sync_embed, text) - - async def embed_batch(self, texts: List[str]) -> List[List[float]]: - """Batch embed multiple texts (async-safe, more efficient).""" - loop = asyncio.get_running_loop() - return await loop.run_in_executor(None, self._sync_batch_embed, texts) - - async def add_evidence(self, evidence_id: str, content: str, metadata: dict) -> None: - """Add evidence to vector store (async-safe).""" - embedding = await self.embed(content) - # ChromaDB operations are fast, but wrap for consistency - loop = asyncio.get_running_loop() - await loop.run_in_executor( - None, - lambda: self._collection.add( - ids=[evidence_id], - embeddings=[embedding], - metadatas=[metadata], - documents=[content] - ) - ) - - async def search_similar(self, query: str, n_results: int = 5) -> List[dict]: - """Find semantically similar evidence (async-safe).""" - query_embedding = await self.embed(query) - - loop = asyncio.get_running_loop() - results = await loop.run_in_executor( - None, - lambda: self._collection.query( - query_embeddings=[query_embedding], - n_results=n_results - ) - ) - - # Handle empty results gracefully - if not results["ids"] or not results["ids"][0]: - return [] - - return [ - {"id": id, "content": doc, "metadata": meta, "distance": dist} - for id, doc, meta, dist in zip( - results["ids"][0], - results["documents"][0], - results["metadatas"][0], - results["distances"][0] - ) - ] - - async def deduplicate(self, new_evidence: List, threshold: float = 0.9) -> List: - """Remove semantically duplicate evidence (async-safe).""" - unique = [] - for evidence in new_evidence: - similar = await self.search_similar(evidence.content, n_results=1) - if not similar or similar[0]["distance"] > (1 - threshold): - unique.append(evidence) - await self.add_evidence( - evidence_id=evidence.citation.url, - content=evidence.content, - metadata={"source": evidence.citation.source} - ) - return unique -``` - -### 4.3 Enhanced SearchAgent (`src/agents/search_agent.py`) - -Update SearchAgent to use embeddings. **Note**: All embedding calls are `await`ed: - -```python -class SearchAgent(BaseAgent): - def __init__( - self, - search_handler: SearchHandlerProtocol, - evidence_store: dict, - embedding_service: EmbeddingService | None = None, # NEW - ): - # ... existing init ... - self._embeddings = embedding_service - - async def run(self, messages, *, thread=None, **kwargs) -> AgentRunResponse: - # ... extract query ... - - # Execute keyword search - result = await self._handler.execute(query, max_results_per_tool=10) - - # Semantic deduplication (NEW) - ALL CALLS ARE AWAITED - if self._embeddings: - # Deduplicate by semantic similarity (async-safe) - unique_evidence = await self._embeddings.deduplicate(result.evidence) - - # Also search for semantically related evidence (async-safe) - related = await self._embeddings.search_similar(query, n_results=5) - - # Merge related evidence not already in results - existing_urls = {e.citation.url for e in unique_evidence} - for item in related: - if item["id"] not in existing_urls: - # Reconstruct Evidence from stored data - # ... merge logic ... - - # ... rest of method ... -``` - -### 4.4 Semantic Expansion in Orchestrator - -The MagenticOrchestrator can use embeddings to expand queries: - -```python -# In task instruction -task = f"""Research drug repurposing opportunities for: {query} - -The system has semantic search enabled. When evidence is found: -1. 
Related concepts will be automatically surfaced -2. Duplicates are removed by meaning, not just URL -3. Use the surfaced related concepts to refine searches -""" -``` - -### 4.5 HuggingFace Spaces Deployment - -> **⚠️ Important for HF Spaces** -> -> `sentence-transformers` downloads models (~500MB) to `~/.cache` on first use. -> HuggingFace Spaces have **ephemeral storage** - the cache is wiped on restart. -> This causes slow cold starts and bandwidth usage. - -**Solution**: Pre-download the model in your Dockerfile: - -```dockerfile -# In Dockerfile -FROM python:3.11-slim - -# Set cache directory -ENV HF_HOME=/app/.cache -ENV TRANSFORMERS_CACHE=/app/.cache - -# Pre-download the embedding model during build -RUN pip install sentence-transformers && \ - python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')" - -# ... rest of Dockerfile -``` - -**Alternative**: Use environment variable to specify persistent path: - -```yaml -# In HF Spaces settings or app.yaml -env: - - name: HF_HOME - value: /data/.cache # Persistent volume -``` - ---- - -## 5. Directory Structure After Phase 6 - -``` -src/ -├── services/ # NEW -│ ├── __init__.py -│ └── embeddings.py # EmbeddingService -├── agents/ -│ ├── search_agent.py # Updated with embeddings -│ └── judge_agent.py -└── ... -``` - ---- - -## 6. Tests - -### 6.1 Unit Tests (`tests/unit/services/test_embeddings.py`) - -> **Note**: All tests are async since the EmbeddingService methods are async. - -```python -"""Unit tests for EmbeddingService.""" -import pytest -from src.services.embeddings import EmbeddingService - - -class TestEmbeddingService: - @pytest.mark.asyncio - async def test_embed_returns_vector(self): - """Embedding should return a float vector.""" - service = EmbeddingService() - embedding = await service.embed("metformin diabetes") - assert isinstance(embedding, list) - assert len(embedding) > 0 - assert all(isinstance(x, float) for x in embedding) - - @pytest.mark.asyncio - async def test_similar_texts_have_close_embeddings(self): - """Semantically similar texts should have similar embeddings.""" - service = EmbeddingService() - e1 = await service.embed("metformin treats diabetes") - e2 = await service.embed("metformin is used for diabetes treatment") - e3 = await service.embed("the weather is sunny today") - - # Cosine similarity helper - from numpy import dot - from numpy.linalg import norm - cosine = lambda a, b: dot(a, b) / (norm(a) * norm(b)) - - # Similar texts should be closer - assert cosine(e1, e2) > cosine(e1, e3) - - @pytest.mark.asyncio - async def test_batch_embed_efficient(self): - """Batch embedding should be more efficient than individual calls.""" - service = EmbeddingService() - texts = ["text one", "text two", "text three"] - - # Batch embed - batch_results = await service.embed_batch(texts) - assert len(batch_results) == 3 - assert all(isinstance(e, list) for e in batch_results) - - @pytest.mark.asyncio - async def test_add_and_search(self): - """Should be able to add evidence and search for similar.""" - service = EmbeddingService() - await service.add_evidence( - evidence_id="test1", - content="Metformin activates AMPK pathway", - metadata={"source": "pubmed"} - ) - - results = await service.search_similar("AMPK activation drugs", n_results=1) - assert len(results) == 1 - assert "AMPK" in results[0]["content"] - - @pytest.mark.asyncio - async def test_search_similar_empty_collection(self): - """Search on empty collection should return empty list, not error.""" - 
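        # A fresh EmbeddingService starts with an empty in-memory ChromaDB
        # collection, so the empty-results guard in search_similar() should
        # return [] instead of raising.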
service = EmbeddingService() - results = await service.search_similar("anything", n_results=5) - assert results == [] -``` - ---- - -## 7. Definition of Done - -Phase 6 is **COMPLETE** when: - -1. `EmbeddingService` implemented with ChromaDB -2. SearchAgent uses embeddings for deduplication -3. Semantic search surfaces related evidence -4. All unit tests pass -5. Integration test shows improved recall (finds related papers) - ---- - -## 8. Value Delivered - -| Before (Phase 5) | After (Phase 6) | -|------------------|-----------------| -| Keyword-only search | Semantic + keyword search | -| URL-based deduplication | Meaning-based deduplication | -| Miss related papers | Surface related concepts | -| Exact match required | Fuzzy semantic matching | - -**Real example improvement:** -- Query: "metformin alzheimer" -- Before: Only papers mentioning both words -- After: Also finds "AMPK neuroprotection", "biguanide cognitive", etc. diff --git a/docs/implementation/07_phase_hypothesis.md b/docs/implementation/07_phase_hypothesis.md deleted file mode 100644 index ee587cab2be7faee8b954906ee88cc35234fa067..0000000000000000000000000000000000000000 --- a/docs/implementation/07_phase_hypothesis.md +++ /dev/null @@ -1,630 +0,0 @@ -# Phase 7 Implementation Spec: Hypothesis Agent - -**Goal**: Add an agent that generates scientific hypotheses to guide targeted searches. -**Philosophy**: "Don't just find evidence—understand the mechanisms." -**Prerequisite**: Phase 6 complete (Embeddings working) - ---- - -## 1. Why Hypothesis Agent? - -Current limitation: **Search is reactive, not hypothesis-driven.** - -Current flow: -1. User asks about "metformin alzheimer" -2. Search finds papers -3. Judge says "need more evidence" -4. Search again with slightly different keywords - -With Hypothesis Agent: -1. User asks about "metformin alzheimer" -2. Search finds initial papers -3. **Hypothesis Agent analyzes**: "Evidence suggests metformin → AMPK activation → autophagy → amyloid clearance" -4. Search can now target: "metformin AMPK", "autophagy neurodegeneration", "amyloid clearance drugs" - -**Key insight**: Scientific research is hypothesis-driven. The agent should think like a researcher. - ---- - -## 2. Architecture - -### Current (Phase 6) -``` -User Query → Magentic Manager - ├── SearchAgent → Evidence - └── JudgeAgent → Sufficient? → Synthesize/Continue -``` - -### Phase 7 -``` -User Query → Magentic Manager - ├── SearchAgent → Evidence - ├── HypothesisAgent → Mechanistic Hypotheses ← NEW - └── JudgeAgent → Sufficient? → Synthesize/Continue - ↑ - Uses hypotheses to guide next search -``` - -### Shared Context Enhancement -```python -evidence_store = { - "current": [], - "embeddings": {}, - "vector_index": None, - "hypotheses": [], # NEW: Generated hypotheses - "tested_hypotheses": [], # NEW: Hypotheses with supporting/contradicting evidence -} -``` - ---- - -## 3. 
Hypothesis Model - -### 3.1 Data Model (`src/utils/models.py`) - -```python -class MechanismHypothesis(BaseModel): - """A scientific hypothesis about drug mechanism.""" - - drug: str = Field(description="The drug being studied") - target: str = Field(description="Molecular target (e.g., AMPK, mTOR)") - pathway: str = Field(description="Biological pathway affected") - effect: str = Field(description="Downstream effect on disease") - confidence: float = Field(ge=0, le=1, description="Confidence in hypothesis") - supporting_evidence: list[str] = Field( - default_factory=list, - description="PMIDs or URLs supporting this hypothesis" - ) - contradicting_evidence: list[str] = Field( - default_factory=list, - description="PMIDs or URLs contradicting this hypothesis" - ) - search_suggestions: list[str] = Field( - default_factory=list, - description="Suggested searches to test this hypothesis" - ) - - def to_search_queries(self) -> list[str]: - """Generate search queries to test this hypothesis.""" - return [ - f"{self.drug} {self.target}", - f"{self.target} {self.pathway}", - f"{self.pathway} {self.effect}", - *self.search_suggestions - ] -``` - -### 3.2 Hypothesis Assessment - -```python -class HypothesisAssessment(BaseModel): - """Assessment of evidence against hypotheses.""" - - hypotheses: list[MechanismHypothesis] - primary_hypothesis: MechanismHypothesis | None = Field( - description="Most promising hypothesis based on current evidence" - ) - knowledge_gaps: list[str] = Field( - description="What we don't know yet" - ) - recommended_searches: list[str] = Field( - description="Searches to fill knowledge gaps" - ) -``` - ---- - -## 4. Implementation - -### 4.0 Text Utilities (`src/utils/text_utils.py`) - -> **Why These Utilities?** -> -> The original spec used arbitrary truncation (`evidence[:10]` and `content[:300]`). -> This loses important information randomly. These utilities provide: -> 1. **Sentence-aware truncation** - cuts at sentence boundaries, not mid-word -> 2. **Diverse evidence selection** - uses embeddings to select varied evidence (MMR) - -```python -"""Text processing utilities for evidence handling.""" -from typing import TYPE_CHECKING - -if TYPE_CHECKING: - from src.services.embeddings import EmbeddingService - from src.utils.models import Evidence - - -def truncate_at_sentence(text: str, max_chars: int = 300) -> str: - """Truncate text at sentence boundary, preserving meaning. - - Args: - text: The text to truncate - max_chars: Maximum characters (default 300) - - Returns: - Text truncated at last complete sentence within limit - """ - if len(text) <= max_chars: - return text - - # Find truncation point - truncated = text[:max_chars] - - # Look for sentence endings: . ! ? followed by space or end - for sep in ['. ', '! ', '? ', '.\n', '!\n', '?\n']: - last_sep = truncated.rfind(sep) - if last_sep > max_chars // 2: # Don't truncate too aggressively - return text[:last_sep + 1].strip() - - # Fallback: find last period - last_period = truncated.rfind('.') - if last_period > max_chars // 2: - return text[:last_period + 1].strip() - - # Last resort: truncate at word boundary - last_space = truncated.rfind(' ') - if last_space > 0: - return text[:last_space].strip() + "..." - - return truncated + "..." - - -async def select_diverse_evidence( - evidence: list["Evidence"], - n: int, - query: str, - embeddings: "EmbeddingService | None" = None -) -> list["Evidence"]: - """Select n most diverse and relevant evidence items. 
- - Uses Maximal Marginal Relevance (MMR) when embeddings available, - falls back to relevance_score sorting otherwise. - - Args: - evidence: All available evidence - n: Number of items to select - query: Original query for relevance scoring - embeddings: Optional EmbeddingService for semantic diversity - - Returns: - Selected evidence items, diverse and relevant - """ - if not evidence: - return [] - - if n >= len(evidence): - return evidence - - # Fallback: sort by relevance score if no embeddings - if embeddings is None: - return sorted( - evidence, - key=lambda e: e.relevance_score, - reverse=True - )[:n] - - # MMR: Maximal Marginal Relevance for diverse selection - # Score = λ * relevance - (1-λ) * max_similarity_to_selected - lambda_param = 0.7 # Balance relevance vs diversity - - # Get query embedding - query_emb = await embeddings.embed(query) - - # Get all evidence embeddings - evidence_embs = await embeddings.embed_batch([e.content for e in evidence]) - - # Compute relevance scores (cosine similarity to query) - from numpy import dot - from numpy.linalg import norm - cosine = lambda a, b: float(dot(a, b) / (norm(a) * norm(b))) - - relevance_scores = [cosine(query_emb, emb) for emb in evidence_embs] - - # Greedy MMR selection - selected_indices: list[int] = [] - remaining = set(range(len(evidence))) - - for _ in range(n): - best_score = float('-inf') - best_idx = -1 - - for idx in remaining: - # Relevance component - relevance = relevance_scores[idx] - - # Diversity component: max similarity to already selected - if selected_indices: - max_sim = max( - cosine(evidence_embs[idx], evidence_embs[sel]) - for sel in selected_indices - ) - else: - max_sim = 0 - - # MMR score - mmr_score = lambda_param * relevance - (1 - lambda_param) * max_sim - - if mmr_score > best_score: - best_score = mmr_score - best_idx = idx - - if best_idx >= 0: - selected_indices.append(best_idx) - remaining.remove(best_idx) - - return [evidence[i] for i in selected_indices] -``` - -### 4.1 Hypothesis Prompts (`src/prompts/hypothesis.py`) - -```python -"""Prompts for Hypothesis Agent.""" -from src.utils.text_utils import truncate_at_sentence, select_diverse_evidence - -SYSTEM_PROMPT = """You are a biomedical research scientist specializing in drug repurposing. - -Your role is to generate mechanistic hypotheses based on evidence. - -A good hypothesis: -1. Proposes a MECHANISM: Drug → Target → Pathway → Effect -2. Is TESTABLE: Can be supported or refuted by literature search -3. Is SPECIFIC: Names actual molecular targets and pathways -4. Generates SEARCH QUERIES: Helps find more evidence - -Example hypothesis format: -- Drug: Metformin -- Target: AMPK (AMP-activated protein kinase) -- Pathway: mTOR inhibition → autophagy activation -- Effect: Enhanced clearance of amyloid-beta in Alzheimer's -- Confidence: 0.7 -- Search suggestions: ["metformin AMPK brain", "autophagy amyloid clearance"] - -Be specific. Use actual gene/protein names when possible.""" - - -async def format_hypothesis_prompt( - query: str, - evidence: list, - embeddings=None -) -> str: - """Format prompt for hypothesis generation. - - Uses smart evidence selection instead of arbitrary truncation. 
- - Args: - query: The research query - evidence: All collected evidence - embeddings: Optional EmbeddingService for diverse selection - """ - # Select diverse, relevant evidence (not arbitrary first 10) - selected = await select_diverse_evidence( - evidence, n=10, query=query, embeddings=embeddings - ) - - # Format with sentence-aware truncation - evidence_text = "\n".join([ - f"- **{e.citation.title}** ({e.citation.source}): {truncate_at_sentence(e.content, 300)}" - for e in selected - ]) - - return f"""Based on the following evidence about "{query}", generate mechanistic hypotheses. - -## Evidence ({len(selected)} papers selected for diversity) -{evidence_text} - -## Task -1. Identify potential drug targets mentioned in the evidence -2. Propose mechanism hypotheses (Drug → Target → Pathway → Effect) -3. Rate confidence based on evidence strength -4. Suggest searches to test each hypothesis - -Generate 2-4 hypotheses, prioritized by confidence.""" -``` - -### 4.2 Hypothesis Agent (`src/agents/hypothesis_agent.py`) - -```python -"""Hypothesis agent for mechanistic reasoning.""" -from collections.abc import AsyncIterable -from typing import TYPE_CHECKING, Any - -from agent_framework import ( - AgentRunResponse, - AgentRunResponseUpdate, - AgentThread, - BaseAgent, - ChatMessage, - Role, -) -from pydantic_ai import Agent - -from src.prompts.hypothesis import SYSTEM_PROMPT, format_hypothesis_prompt -from src.utils.config import settings -from src.utils.models import Evidence, HypothesisAssessment - -if TYPE_CHECKING: - from src.services.embeddings import EmbeddingService - - -class HypothesisAgent(BaseAgent): - """Generates mechanistic hypotheses based on evidence.""" - - def __init__( - self, - evidence_store: dict[str, list[Evidence]], - embedding_service: "EmbeddingService | None" = None, # NEW: for diverse selection - ) -> None: - super().__init__( - name="HypothesisAgent", - description="Generates scientific hypotheses about drug mechanisms to guide research", - ) - self._evidence_store = evidence_store - self._embeddings = embedding_service # Used for MMR evidence selection - self._agent = Agent( - model=settings.llm_provider, # Uses configured LLM - output_type=HypothesisAssessment, - system_prompt=SYSTEM_PROMPT, - ) - - async def run( - self, - messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None, - *, - thread: AgentThread | None = None, - **kwargs: Any, - ) -> AgentRunResponse: - """Generate hypotheses based on current evidence.""" - # Extract query - query = self._extract_query(messages) - - # Get current evidence - evidence = self._evidence_store.get("current", []) - - if not evidence: - return AgentRunResponse( - messages=[ChatMessage( - role=Role.ASSISTANT, - text="No evidence available yet. Search for evidence first." 
- )], - response_id="hypothesis-no-evidence", - ) - - # Generate hypotheses with diverse evidence selection - # NOTE: format_hypothesis_prompt is now async - prompt = await format_hypothesis_prompt( - query, evidence, embeddings=self._embeddings - ) - result = await self._agent.run(prompt) - assessment = result.output - - # Store hypotheses in shared context - existing = self._evidence_store.get("hypotheses", []) - self._evidence_store["hypotheses"] = existing + assessment.hypotheses - - # Format response - response_text = self._format_response(assessment) - - return AgentRunResponse( - messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)], - response_id=f"hypothesis-{len(assessment.hypotheses)}", - additional_properties={"assessment": assessment.model_dump()}, - ) - - def _format_response(self, assessment: HypothesisAssessment) -> str: - """Format hypothesis assessment as markdown.""" - lines = ["## Generated Hypotheses\n"] - - for i, h in enumerate(assessment.hypotheses, 1): - lines.append(f"### Hypothesis {i} (Confidence: {h.confidence:.0%})") - lines.append(f"**Mechanism**: {h.drug} → {h.target} → {h.pathway} → {h.effect}") - lines.append(f"**Suggested searches**: {', '.join(h.search_suggestions)}\n") - - if assessment.primary_hypothesis: - lines.append(f"### Primary Hypothesis") - h = assessment.primary_hypothesis - lines.append(f"{h.drug} → {h.target} → {h.pathway} → {h.effect}\n") - - if assessment.knowledge_gaps: - lines.append("### Knowledge Gaps") - for gap in assessment.knowledge_gaps: - lines.append(f"- {gap}") - - if assessment.recommended_searches: - lines.append("\n### Recommended Next Searches") - for search in assessment.recommended_searches: - lines.append(f"- `{search}`") - - return "\n".join(lines) - - def _extract_query(self, messages) -> str: - """Extract query from messages.""" - if isinstance(messages, str): - return messages - elif isinstance(messages, ChatMessage): - return messages.text or "" - elif isinstance(messages, list): - for msg in reversed(messages): - if isinstance(msg, ChatMessage) and msg.role == Role.USER: - return msg.text or "" - elif isinstance(msg, str): - return msg - return "" - - async def run_stream( - self, - messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None, - *, - thread: AgentThread | None = None, - **kwargs: Any, - ) -> AsyncIterable[AgentRunResponseUpdate]: - """Streaming wrapper.""" - result = await self.run(messages, thread=thread, **kwargs) - yield AgentRunResponseUpdate( - messages=result.messages, - response_id=result.response_id - ) -``` - -### 4.3 Update MagenticOrchestrator - -Add HypothesisAgent to the workflow: - -```python -# In MagenticOrchestrator.__init__ -self._hypothesis_agent = HypothesisAgent(self._evidence_store) - -# In workflow building -workflow = ( - MagenticBuilder() - .participants( - searcher=search_agent, - hypothesizer=self._hypothesis_agent, # NEW - judge=judge_agent, - ) - .with_standard_manager(...) - .build() -) - -# Update task instruction -task = f"""Research drug repurposing opportunities for: {query} - -Workflow: -1. SearchAgent: Find initial evidence from PubMed and web -2. HypothesisAgent: Generate mechanistic hypotheses (Drug → Target → Pathway → Effect) -3. SearchAgent: Use hypothesis-suggested queries for targeted search -4. JudgeAgent: Evaluate if evidence supports hypotheses -5. 
Repeat until confident or max rounds - -Focus on: -- Identifying specific molecular targets -- Understanding mechanism of action -- Finding supporting/contradicting evidence for hypotheses -""" -``` - ---- - -## 5. Directory Structure After Phase 7 - -``` -src/ -├── agents/ -│ ├── search_agent.py -│ ├── judge_agent.py -│ └── hypothesis_agent.py # NEW -├── prompts/ -│ ├── judge.py -│ └── hypothesis.py # NEW -├── services/ -│ └── embeddings.py -└── utils/ - └── models.py # Updated with hypothesis models -``` - ---- - -## 6. Tests - -### 6.1 Unit Tests (`tests/unit/agents/test_hypothesis_agent.py`) - -```python -"""Unit tests for HypothesisAgent.""" -import pytest -from unittest.mock import AsyncMock, MagicMock, patch - -from src.agents.hypothesis_agent import HypothesisAgent -from src.utils.models import Citation, Evidence, HypothesisAssessment, MechanismHypothesis - - -@pytest.fixture -def sample_evidence(): - return [ - Evidence( - content="Metformin activates AMPK, which inhibits mTOR signaling...", - citation=Citation( - source="pubmed", - title="Metformin and AMPK", - url="https://pubmed.ncbi.nlm.nih.gov/12345/", - date="2023" - ) - ) - ] - - -@pytest.fixture -def mock_assessment(): - return HypothesisAssessment( - hypotheses=[ - MechanismHypothesis( - drug="Metformin", - target="AMPK", - pathway="mTOR inhibition", - effect="Reduced cancer cell proliferation", - confidence=0.75, - search_suggestions=["metformin AMPK cancer", "mTOR cancer therapy"] - ) - ], - primary_hypothesis=None, - knowledge_gaps=["Clinical trial data needed"], - recommended_searches=["metformin clinical trial cancer"] - ) - - -@pytest.mark.asyncio -async def test_hypothesis_agent_generates_hypotheses(sample_evidence, mock_assessment): - """HypothesisAgent should generate mechanistic hypotheses.""" - store = {"current": sample_evidence, "hypotheses": []} - - with patch("src.agents.hypothesis_agent.Agent") as MockAgent: - mock_result = MagicMock() - mock_result.output = mock_assessment - MockAgent.return_value.run = AsyncMock(return_value=mock_result) - - agent = HypothesisAgent(store) - response = await agent.run("metformin cancer") - - assert "AMPK" in response.messages[0].text - assert len(store["hypotheses"]) == 1 - - -@pytest.mark.asyncio -async def test_hypothesis_agent_no_evidence(): - """HypothesisAgent should handle empty evidence gracefully.""" - store = {"current": [], "hypotheses": []} - agent = HypothesisAgent(store) - - response = await agent.run("test query") - - assert "No evidence" in response.messages[0].text -``` - ---- - -## 7. Definition of Done - -Phase 7 is **COMPLETE** when: - -1. `MechanismHypothesis` and `HypothesisAssessment` models implemented -2. `HypothesisAgent` generates hypotheses from evidence -3. Hypotheses stored in shared context -4. Search queries generated from hypotheses -5. Magentic workflow includes HypothesisAgent -6. All unit tests pass - ---- - -## 8. 
Value Delivered - -| Before (Phase 6) | After (Phase 7) | -|------------------|-----------------| -| Reactive search | Hypothesis-driven search | -| Generic queries | Mechanism-targeted queries | -| No scientific reasoning | Drug → Target → Pathway → Effect | -| Judge says "need more" | Hypothesis says "search for X to test Y" | - -**Real example improvement:** -- Query: "metformin alzheimer" -- Before: "metformin alzheimer mechanism", "metformin brain" -- After: "metformin AMPK activation", "AMPK autophagy neurodegeneration", "autophagy amyloid clearance" - -The search becomes **scientifically targeted** rather than keyword variations. diff --git a/docs/implementation/08_phase_report.md b/docs/implementation/08_phase_report.md deleted file mode 100644 index 3618734ce9102b7aa40d8332c0049b88e3bb6653..0000000000000000000000000000000000000000 --- a/docs/implementation/08_phase_report.md +++ /dev/null @@ -1,854 +0,0 @@ -# Phase 8 Implementation Spec: Report Agent - -**Goal**: Generate structured scientific reports with proper citations and methodology. -**Philosophy**: "Research isn't complete until it's communicated clearly." -**Prerequisite**: Phase 7 complete (Hypothesis Agent working) - ---- - -## 1. Why Report Agent? - -Current limitation: **Synthesis is basic markdown, not a scientific report.** - -Current output: -```markdown -## Drug Repurposing Analysis -### Drug Candidates -- Metformin -### Key Findings -- Some findings -### Citations -1. [Paper 1](url) -``` - -With Report Agent: -```markdown -## Executive Summary -One-paragraph summary for busy readers... - -## Research Question -Clear statement of what was investigated... - -## Methodology -- Sources searched: PubMed, DuckDuckGo -- Date range: ... -- Inclusion criteria: ... - -## Hypotheses Tested -1. Metformin → AMPK → neuroprotection (Supported: 7 papers, Contradicted: 2) - -## Findings -### Mechanistic Evidence -... -### Clinical Evidence -... - -## Limitations -- Only English language papers -- Abstract-level analysis only - -## Conclusion -... - -## References -Properly formatted citations... -``` - ---- - -## 2. Architecture - -### Phase 8 Addition -```text -Evidence + Hypotheses + Assessment - ↓ - Report Agent - ↓ - Structured Scientific Report -``` - -### Report Generation Flow -```text -1. JudgeAgent says "synthesize" -2. Magentic Manager selects ReportAgent -3. ReportAgent gathers: - - All evidence from shared context - - All hypotheses (supported/contradicted) - - Assessment scores -4. ReportAgent generates structured report -5. Final output to user -``` - ---- - -## 3. 
Report Model - -### 3.1 Data Model (`src/utils/models.py`) - -```python -class ReportSection(BaseModel): - """A section of the research report.""" - title: str - content: str - citations: list[str] = Field(default_factory=list) - - -class ResearchReport(BaseModel): - """Structured scientific report.""" - - title: str = Field(description="Report title") - executive_summary: str = Field( - description="One-paragraph summary for quick reading", - min_length=100, - max_length=500 - ) - research_question: str = Field(description="Clear statement of what was investigated") - - methodology: ReportSection = Field(description="How the research was conducted") - hypotheses_tested: list[dict] = Field( - description="Hypotheses with supporting/contradicting evidence counts" - ) - - mechanistic_findings: ReportSection = Field( - description="Findings about drug mechanisms" - ) - clinical_findings: ReportSection = Field( - description="Findings from clinical/preclinical studies" - ) - - drug_candidates: list[str] = Field(description="Identified drug candidates") - limitations: list[str] = Field(description="Study limitations") - conclusion: str = Field(description="Overall conclusion") - - references: list[dict] = Field( - description="Formatted references with title, authors, source, URL" - ) - - # Metadata - sources_searched: list[str] = Field(default_factory=list) - total_papers_reviewed: int = 0 - search_iterations: int = 0 - confidence_score: float = Field(ge=0, le=1) - - def to_markdown(self) -> str: - """Render report as markdown.""" - sections = [ - f"# {self.title}\n", - f"## Executive Summary\n{self.executive_summary}\n", - f"## Research Question\n{self.research_question}\n", - f"## Methodology\n{self.methodology.content}\n", - ] - - # Hypotheses - sections.append("## Hypotheses Tested\n") - for h in self.hypotheses_tested: - status = "✅ Supported" if h.get("supported", 0) > h.get("contradicted", 0) else "⚠️ Mixed" - sections.append( - f"- **{h['mechanism']}** ({status}): " - f"{h.get('supported', 0)} supporting, {h.get('contradicted', 0)} contradicting\n" - ) - - # Findings - sections.append(f"## Mechanistic Findings\n{self.mechanistic_findings.content}\n") - sections.append(f"## Clinical Findings\n{self.clinical_findings.content}\n") - - # Drug candidates - sections.append("## Drug Candidates\n") - for drug in self.drug_candidates: - sections.append(f"- **{drug}**\n") - - # Limitations - sections.append("## Limitations\n") - for lim in self.limitations: - sections.append(f"- {lim}\n") - - # Conclusion - sections.append(f"## Conclusion\n{self.conclusion}\n") - - # References - sections.append("## References\n") - for i, ref in enumerate(self.references, 1): - sections.append( - f"{i}. {ref.get('authors', 'Unknown')}. " - f"*{ref.get('title', 'Untitled')}*. " - f"{ref.get('source', '')} ({ref.get('date', '')}). " - f"[Link]({ref.get('url', '#')})\n" - ) - - # Metadata footer - sections.append("\n---\n") - sections.append( - f"*Report generated from {self.total_papers_reviewed} papers " - f"across {self.search_iterations} search iterations. " - f"Confidence: {self.confidence_score:.0%}*" - ) - - return "\n".join(sections) -``` - ---- - -## 4. Implementation - -### 4.0 Citation Validation (`src/utils/citation_validator.py`) - -> **🚨 CRITICAL: Why Citation Validation?** -> -> LLMs frequently **hallucinate** citations - inventing paper titles, authors, and URLs -> that don't exist. For a medical research tool, fake citations are **dangerous**. 
-> -> This validation layer ensures every reference in the report actually exists -> in the collected evidence. - -```python -"""Citation validation to prevent LLM hallucination. - -CRITICAL: Medical research requires accurate citations. -This module validates that all references exist in collected evidence. -""" -import logging -from typing import TYPE_CHECKING - -if TYPE_CHECKING: - from src.utils.models import Evidence, ResearchReport - -logger = logging.getLogger(__name__) - - -def validate_references( - report: "ResearchReport", - evidence: list["Evidence"] -) -> "ResearchReport": - """Ensure all references actually exist in collected evidence. - - CRITICAL: Prevents LLM hallucination of citations. - - Args: - report: The generated research report - evidence: All evidence collected during research - - Returns: - Report with only valid references (hallucinated ones removed) - """ - # Build set of valid URLs from evidence - valid_urls = {e.citation.url for e in evidence} - valid_titles = {e.citation.title.lower() for e in evidence} - - validated_refs = [] - removed_count = 0 - - for ref in report.references: - ref_url = ref.get("url", "") - ref_title = ref.get("title", "").lower() - - # Check if URL matches collected evidence - if ref_url in valid_urls: - validated_refs.append(ref) - # Fallback: check title match (URLs might differ slightly) - elif ref_title and any(ref_title in t or t in ref_title for t in valid_titles): - validated_refs.append(ref) - else: - removed_count += 1 - logger.warning( - f"Removed hallucinated reference: '{ref.get('title', 'Unknown')}' " - f"(URL: {ref_url[:50]}...)" - ) - - if removed_count > 0: - logger.info( - f"Citation validation removed {removed_count} hallucinated references. " - f"{len(validated_refs)} valid references remain." - ) - - # Update report with validated references - report.references = validated_refs - return report - - -def build_reference_from_evidence(evidence: "Evidence") -> dict: - """Build a properly formatted reference from evidence. - - Use this to ensure references match the original evidence exactly. - """ - return { - "title": evidence.citation.title, - "authors": evidence.citation.authors or ["Unknown"], - "source": evidence.citation.source, - "date": evidence.citation.date or "n.d.", - "url": evidence.citation.url, - } -``` - -### 4.1 Report Prompts (`src/prompts/report.py`) - -```python -"""Prompts for Report Agent.""" -from src.utils.text_utils import truncate_at_sentence, select_diverse_evidence - -SYSTEM_PROMPT = """You are a scientific writer specializing in drug repurposing research reports. - -Your role is to synthesize evidence and hypotheses into a clear, structured report. - -A good report: -1. Has a clear EXECUTIVE SUMMARY (one paragraph, key takeaways) -2. States the RESEARCH QUESTION clearly -3. Describes METHODOLOGY (what was searched, how) -4. Evaluates HYPOTHESES with evidence counts -5. Separates MECHANISTIC and CLINICAL findings -6. Lists specific DRUG CANDIDATES -7. Acknowledges LIMITATIONS honestly -8. Provides a balanced CONCLUSION -9. Includes properly formatted REFERENCES - -Write in scientific but accessible language. Be specific about evidence strength. - -───────────────────────────────────────────────────────────────────────────── -🚨 CRITICAL CITATION REQUIREMENTS 🚨 -───────────────────────────────────────────────────────────────────────────── - -You MUST follow these rules for the References section: - -1. You may ONLY cite papers that appear in the Evidence section above -2. 
Every reference URL must EXACTLY match a provided evidence URL -3. Do NOT invent, fabricate, or hallucinate any references -4. Do NOT modify paper titles, authors, dates, or URLs -5. If unsure about a citation, OMIT it rather than guess -6. Copy URLs exactly as provided - do not create similar-looking URLs - -VIOLATION OF THESE RULES PRODUCES DANGEROUS MISINFORMATION. -─────────────────────────────────────────────────────────────────────────────""" - - -async def format_report_prompt( - query: str, - evidence: list, - hypotheses: list, - assessment: dict, - metadata: dict, - embeddings=None -) -> str: - """Format prompt for report generation. - - Includes full evidence details for accurate citation. - """ - # Select diverse evidence (not arbitrary truncation) - selected = await select_diverse_evidence( - evidence, n=20, query=query, embeddings=embeddings - ) - - # Include FULL citation details for each evidence item - # This helps the LLM create accurate references - evidence_summary = "\n".join([ - f"- **Title**: {e.citation.title}\n" - f" **URL**: {e.citation.url}\n" - f" **Authors**: {', '.join(e.citation.authors or ['Unknown'])}\n" - f" **Date**: {e.citation.date or 'n.d.'}\n" - f" **Source**: {e.citation.source}\n" - f" **Content**: {truncate_at_sentence(e.content, 200)}\n" - for e in selected - ]) - - hypotheses_summary = "\n".join([ - f"- {h.drug} → {h.target} → {h.pathway} → {h.effect} (Confidence: {h.confidence:.0%})" - for h in hypotheses - ]) if hypotheses else "No hypotheses generated yet." - - return f"""Generate a structured research report for the following query. - -## Original Query -{query} - -## Evidence Collected ({len(selected)} papers, selected for diversity) - -{evidence_summary} - -## Hypotheses Generated -{hypotheses_summary} - -## Assessment Scores -- Mechanism Score: {assessment.get('mechanism_score', 'N/A')}/10 -- Clinical Evidence Score: {assessment.get('clinical_score', 'N/A')}/10 -- Overall Confidence: {assessment.get('confidence', 0):.0%} - -## Metadata -- Sources Searched: {', '.join(metadata.get('sources', []))} -- Search Iterations: {metadata.get('iterations', 0)} - -Generate a complete ResearchReport with all sections filled in. - -REMINDER: Only cite papers from the Evidence section above. 
Copy URLs exactly.""" -``` - -### 4.2 Report Agent (`src/agents/report_agent.py`) - -```python -"""Report agent for generating structured research reports.""" -from collections.abc import AsyncIterable -from typing import TYPE_CHECKING, Any - -from agent_framework import ( - AgentRunResponse, - AgentRunResponseUpdate, - AgentThread, - BaseAgent, - ChatMessage, - Role, -) -from pydantic_ai import Agent - -from src.prompts.report import SYSTEM_PROMPT, format_report_prompt -from src.utils.citation_validator import validate_references # CRITICAL -from src.utils.config import settings -from src.utils.models import Evidence, MechanismHypothesis, ResearchReport - -if TYPE_CHECKING: - from src.services.embeddings import EmbeddingService - - -class ReportAgent(BaseAgent): - """Generates structured scientific reports from evidence and hypotheses.""" - - def __init__( - self, - evidence_store: dict[str, list[Evidence]], - embedding_service: "EmbeddingService | None" = None, # For diverse selection - ) -> None: - super().__init__( - name="ReportAgent", - description="Generates structured scientific research reports with citations", - ) - self._evidence_store = evidence_store - self._embeddings = embedding_service - self._agent = Agent( - model=settings.llm_provider, - output_type=ResearchReport, - system_prompt=SYSTEM_PROMPT, - ) - - async def run( - self, - messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None, - *, - thread: AgentThread | None = None, - **kwargs: Any, - ) -> AgentRunResponse: - """Generate research report.""" - query = self._extract_query(messages) - - # Gather all context - evidence = self._evidence_store.get("current", []) - hypotheses = self._evidence_store.get("hypotheses", []) - assessment = self._evidence_store.get("last_assessment", {}) - - if not evidence: - return AgentRunResponse( - messages=[ChatMessage( - role=Role.ASSISTANT, - text="Cannot generate report: No evidence collected." 
- )], - response_id="report-no-evidence", - ) - - # Build metadata - metadata = { - "sources": list(set(e.citation.source for e in evidence)), - "iterations": self._evidence_store.get("iteration_count", 0), - } - - # Generate report (format_report_prompt is now async) - prompt = await format_report_prompt( - query=query, - evidence=evidence, - hypotheses=hypotheses, - assessment=assessment, - metadata=metadata, - embeddings=self._embeddings, - ) - - result = await self._agent.run(prompt) - report = result.output - - # ═══════════════════════════════════════════════════════════════════ - # 🚨 CRITICAL: Validate citations to prevent hallucination - # ═══════════════════════════════════════════════════════════════════ - report = validate_references(report, evidence) - - # Store validated report - self._evidence_store["final_report"] = report - - # Return markdown version - return AgentRunResponse( - messages=[ChatMessage(role=Role.ASSISTANT, text=report.to_markdown())], - response_id="report-complete", - additional_properties={"report": report.model_dump()}, - ) - - def _extract_query(self, messages) -> str: - """Extract query from messages.""" - if isinstance(messages, str): - return messages - elif isinstance(messages, ChatMessage): - return messages.text or "" - elif isinstance(messages, list): - for msg in reversed(messages): - if isinstance(msg, ChatMessage) and msg.role == Role.USER: - return msg.text or "" - elif isinstance(msg, str): - return msg - return "" - - async def run_stream( - self, - messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None, - *, - thread: AgentThread | None = None, - **kwargs: Any, - ) -> AsyncIterable[AgentRunResponseUpdate]: - """Streaming wrapper.""" - result = await self.run(messages, thread=thread, **kwargs) - yield AgentRunResponseUpdate( - messages=result.messages, - response_id=result.response_id - ) -``` - -### 4.3 Update MagenticOrchestrator - -Add ReportAgent as the final synthesis step: - -```python -# In MagenticOrchestrator.__init__ -self._report_agent = ReportAgent(self._evidence_store) - -# In workflow building -workflow = ( - MagenticBuilder() - .participants( - searcher=search_agent, - hypothesizer=hypothesis_agent, - judge=judge_agent, - reporter=self._report_agent, # NEW - ) - .with_standard_manager(...) - .build() -) - -# Update task instruction -task = f"""Research drug repurposing opportunities for: {query} - -Workflow: -1. SearchAgent: Find evidence from PubMed and web -2. HypothesisAgent: Generate mechanistic hypotheses -3. SearchAgent: Targeted search based on hypotheses -4. JudgeAgent: Evaluate evidence sufficiency -5. If sufficient → ReportAgent: Generate structured research report -6. If not sufficient → Repeat from step 1 with refined queries - -The final output should be a complete research report with: -- Executive summary -- Methodology -- Hypotheses tested -- Mechanistic and clinical findings -- Drug candidates -- Limitations -- Conclusion with references -""" -``` - ---- - -## 5. Directory Structure After Phase 8 - -``` -src/ -├── agents/ -│ ├── search_agent.py -│ ├── judge_agent.py -│ ├── hypothesis_agent.py -│ └── report_agent.py # NEW -├── prompts/ -│ ├── judge.py -│ ├── hypothesis.py -│ └── report.py # NEW -├── services/ -│ └── embeddings.py -└── utils/ - └── models.py # Updated with report models -``` - ---- - -## 6. 
Tests - -### 6.1 Unit Tests (`tests/unit/agents/test_report_agent.py`) - -```python -"""Unit tests for ReportAgent.""" -import pytest -from unittest.mock import AsyncMock, MagicMock, patch - -from src.agents.report_agent import ReportAgent -from src.utils.models import ( - Citation, Evidence, MechanismHypothesis, - ResearchReport, ReportSection -) - - -@pytest.fixture -def sample_evidence(): - return [ - Evidence( - content="Metformin activates AMPK...", - citation=Citation( - source="pubmed", - title="Metformin mechanisms", - url="https://pubmed.ncbi.nlm.nih.gov/12345/", - date="2023", - authors=["Smith J", "Jones A"] - ) - ) - ] - - -@pytest.fixture -def sample_hypotheses(): - return [ - MechanismHypothesis( - drug="Metformin", - target="AMPK", - pathway="mTOR inhibition", - effect="Neuroprotection", - confidence=0.8, - search_suggestions=[] - ) - ] - - -@pytest.fixture -def mock_report(): - return ResearchReport( - title="Drug Repurposing Analysis: Metformin for Alzheimer's", - executive_summary="This report analyzes metformin as a potential...", - research_question="Can metformin be repurposed for Alzheimer's disease?", - methodology=ReportSection( - title="Methodology", - content="Searched PubMed and web sources..." - ), - hypotheses_tested=[ - {"mechanism": "Metformin → AMPK → neuroprotection", "supported": 5, "contradicted": 1} - ], - mechanistic_findings=ReportSection( - title="Mechanistic Findings", - content="Evidence suggests AMPK activation..." - ), - clinical_findings=ReportSection( - title="Clinical Findings", - content="Limited clinical data available..." - ), - drug_candidates=["Metformin"], - limitations=["Abstract-level analysis only"], - conclusion="Metformin shows promise...", - references=[], - sources_searched=["pubmed", "web"], - total_papers_reviewed=10, - search_iterations=3, - confidence_score=0.75 - ) - - -@pytest.mark.asyncio -async def test_report_agent_generates_report( - sample_evidence, sample_hypotheses, mock_report -): - """ReportAgent should generate structured report.""" - store = { - "current": sample_evidence, - "hypotheses": sample_hypotheses, - "last_assessment": {"mechanism_score": 8, "clinical_score": 6} - } - - with patch("src.agents.report_agent.Agent") as MockAgent: - mock_result = MagicMock() - mock_result.output = mock_report - MockAgent.return_value.run = AsyncMock(return_value=mock_result) - - agent = ReportAgent(store) - response = await agent.run("metformin alzheimer") - - assert "Executive Summary" in response.messages[0].text - assert "Methodology" in response.messages[0].text - assert "References" in response.messages[0].text - - -@pytest.mark.asyncio -async def test_report_agent_no_evidence(): - """ReportAgent should handle empty evidence gracefully.""" - store = {"current": [], "hypotheses": []} - agent = ReportAgent(store) - - response = await agent.run("test query") - - assert "Cannot generate report" in response.messages[0].text - - -# ═══════════════════════════════════════════════════════════════════════════ -# 🚨 CRITICAL: Citation Validation Tests -# ═══════════════════════════════════════════════════════════════════════════ - -@pytest.mark.asyncio -async def test_report_agent_removes_hallucinated_citations(sample_evidence): - """ReportAgent should remove citations not in evidence.""" - from src.utils.citation_validator import validate_references - - # Create report with mix of valid and hallucinated references - report_with_hallucinations = ResearchReport( - title="Test Report", - executive_summary="This is a test report for 
citation validation...", - research_question="Testing citation validation", - methodology=ReportSection(title="Methodology", content="Test"), - hypotheses_tested=[], - mechanistic_findings=ReportSection(title="Mechanistic", content="Test"), - clinical_findings=ReportSection(title="Clinical", content="Test"), - drug_candidates=["TestDrug"], - limitations=["Test limitation"], - conclusion="Test conclusion", - references=[ - # Valid reference (matches sample_evidence) - { - "title": "Metformin mechanisms", - "url": "https://pubmed.ncbi.nlm.nih.gov/12345/", - "authors": ["Smith J", "Jones A"], - "date": "2023", - "source": "pubmed" - }, - # HALLUCINATED reference (URL doesn't exist in evidence) - { - "title": "Fake Paper That Doesn't Exist", - "url": "https://fake-journal.com/made-up-paper", - "authors": ["Hallucinated A"], - "date": "2024", - "source": "fake" - }, - # Another HALLUCINATED reference - { - "title": "Invented Research", - "url": "https://pubmed.ncbi.nlm.nih.gov/99999999/", - "authors": ["NotReal B"], - "date": "2025", - "source": "pubmed" - } - ], - sources_searched=["pubmed"], - total_papers_reviewed=1, - search_iterations=1, - confidence_score=0.5 - ) - - # Validate - should remove hallucinated references - validated_report = validate_references(report_with_hallucinations, sample_evidence) - - # Only the valid reference should remain - assert len(validated_report.references) == 1 - assert validated_report.references[0]["title"] == "Metformin mechanisms" - assert "Fake Paper" not in str(validated_report.references) - - -def test_citation_validator_handles_empty_references(): - """Citation validator should handle reports with no references.""" - from src.utils.citation_validator import validate_references - - report = ResearchReport( - title="Empty Refs Report", - executive_summary="This report has no references...", - research_question="Testing empty refs", - methodology=ReportSection(title="Methodology", content="Test"), - hypotheses_tested=[], - mechanistic_findings=ReportSection(title="Mechanistic", content="Test"), - clinical_findings=ReportSection(title="Clinical", content="Test"), - drug_candidates=[], - limitations=[], - conclusion="Test", - references=[], # Empty! - sources_searched=[], - total_papers_reviewed=0, - search_iterations=0, - confidence_score=0.0 - ) - - validated = validate_references(report, []) - assert validated.references == [] -``` - ---- - -## 7. Definition of Done - -Phase 8 is **COMPLETE** when: - -1. `ResearchReport` model implemented with all sections -2. `ReportAgent` generates structured reports -3. Reports include proper citations and methodology -4. Magentic workflow uses ReportAgent for final synthesis -5. Report renders as clean markdown -6. All unit tests pass - ---- - -## 8. Value Delivered - -| Before (Phase 7) | After (Phase 8) | -|------------------|-----------------| -| Basic synthesis | Structured scientific report | -| Simple bullet points | Executive summary + methodology | -| List of citations | Formatted references | -| No methodology | Clear research process | -| No limitations | Honest limitations section | - -**Sample output comparison:** - -Before: -``` -## Analysis -- Metformin might help -- Found 5 papers -[Link 1] [Link 2] -``` - -After: -``` -# Drug Repurposing Analysis: Metformin for Alzheimer's Disease - -## Executive Summary -Analysis of 15 papers suggests metformin may provide neuroprotection -through AMPK activation. Mechanistic evidence is strong (8/10), -while clinical evidence is moderate (6/10)... 
- -## Methodology -Systematic search of PubMed and web sources using queries... - -## Hypotheses Tested -- ✅ Metformin → AMPK → neuroprotection (7 supporting, 2 contradicting) - -## References -1. Smith J, Jones A. *Metformin mechanisms*. Nature (2023). [Link](...) -``` - ---- - -## 9. Complete Magentic Architecture (Phases 5-8) - -``` -User Query - ↓ -Gradio UI - ↓ -Magentic Manager (LLM Coordinator) - ├── SearchAgent ←→ PubMed + Web + VectorDB - ├── HypothesisAgent ←→ Mechanistic Reasoning - ├── JudgeAgent ←→ Evidence Assessment - └── ReportAgent ←→ Final Synthesis - ↓ -Structured Research Report -``` - -**This matches Mario's diagram** with the practical agents that add real value for drug repurposing research. diff --git a/docs/implementation/09_phase_source_cleanup.md b/docs/implementation/09_phase_source_cleanup.md deleted file mode 100644 index b4b9c818e1a51491acdfa9c25634b62aa7f62371..0000000000000000000000000000000000000000 --- a/docs/implementation/09_phase_source_cleanup.md +++ /dev/null @@ -1,257 +0,0 @@ -# Phase 9 Implementation Spec: Remove DuckDuckGo - -**Goal**: Remove unreliable web search, focus on credible scientific sources. -**Philosophy**: "Scientific credibility over source quantity." -**Prerequisite**: Phase 8 complete (all agents working) -**Estimated Time**: 30-45 minutes - ---- - -## 1. Why Remove DuckDuckGo? - -### Current Problems - -| Issue | Impact | -|-------|--------| -| Rate-limited aggressively | Returns 0 results frequently | -| Not peer-reviewed | Random blogs, news, misinformation | -| Not citable | Cannot use in scientific reports | -| Adds noise | Dilutes quality evidence | - -### After Removal - -| Benefit | Impact | -|---------|--------| -| Cleaner codebase | -150 lines of dead code | -| No rate limit failures | 100% source reliability | -| Scientific credibility | All sources peer-reviewed/preprint | -| Simpler debugging | Fewer failure modes | - ---- - -## 2. Files to Modify/Delete - -### 2.1 DELETE: `src/tools/websearch.py` - -```bash -# File to delete entirely -src/tools/websearch.py # ~80 lines -``` - -### 2.2 MODIFY: SearchHandler Usage - -Update all files that instantiate `SearchHandler` with `WebTool()`: - -| File | Change | -|------|--------| -| `examples/search_demo/run_search.py` | Remove `WebTool()` from tools list | -| `examples/hypothesis_demo/run_hypothesis.py` | Remove `WebTool()` from tools list | -| `examples/full_stack_demo/run_full.py` | Remove `WebTool()` from tools list | -| `examples/orchestrator_demo/run_agent.py` | Remove `WebTool()` from tools list | -| `examples/orchestrator_demo/run_magentic.py` | Remove `WebTool()` from tools list | - -### 2.3 MODIFY: Type Definitions - -Update `src/utils/models.py`: - -```python -# BEFORE -sources_searched: list[Literal["pubmed", "web"]] - -# AFTER (Phase 9) -sources_searched: list[Literal["pubmed"]] - -# AFTER (Phase 10-11) -sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]] -``` - -### 2.4 DELETE: Tests for WebTool - -```bash -# File to delete -tests/unit/tools/test_websearch.py -``` - ---- - -## 3. 
TDD Implementation - -### 3.1 Test: SearchHandler Works Without WebTool - -```python -# tests/unit/tools/test_search_handler.py - -@pytest.mark.asyncio -async def test_search_handler_pubmed_only(): - """SearchHandler should work with only PubMed tool.""" - from src.tools.pubmed import PubMedTool - from src.tools.search_handler import SearchHandler - - handler = SearchHandler(tools=[PubMedTool()], timeout=30.0) - - # Should not raise - result = await handler.execute("metformin diabetes", max_results_per_tool=3) - - assert result.sources_searched == ["pubmed"] - assert "web" not in result.sources_searched - assert len(result.errors) == 0 # No failures -``` - -### 3.2 Test: WebTool Import Fails (Deleted) - -```python -# tests/unit/tools/test_websearch_removed.py - -def test_websearch_module_deleted(): - """WebTool should no longer exist.""" - with pytest.raises(ImportError): - from src.tools.websearch import WebTool -``` - -### 3.3 Test: Examples Don't Reference WebTool - -```python -# tests/unit/test_no_webtool_references.py - -import ast -import pathlib - -def test_examples_no_webtool_imports(): - """No example files should import WebTool.""" - examples_dir = pathlib.Path("examples") - - for py_file in examples_dir.rglob("*.py"): - content = py_file.read_text() - tree = ast.parse(content) - - for node in ast.walk(tree): - if isinstance(node, ast.ImportFrom): - if node.module and "websearch" in node.module: - pytest.fail(f"{py_file} imports websearch (should be removed)") - if isinstance(node, ast.Import): - for alias in node.names: - if "websearch" in alias.name: - pytest.fail(f"{py_file} imports websearch (should be removed)") -``` - ---- - -## 4. Step-by-Step Implementation - -### Step 1: Write Tests First (TDD) - -```bash -# Create the test file -touch tests/unit/tools/test_websearch_removed.py -# Write the tests from section 3 -``` - -### Step 2: Run Tests (Should Fail) - -```bash -uv run pytest tests/unit/tools/test_websearch_removed.py -v -# Expected: FAIL (websearch still exists) -``` - -### Step 3: Delete WebTool - -```bash -rm src/tools/websearch.py -rm tests/unit/tools/test_websearch.py -``` - -### Step 4: Update SearchHandler Usages - -```python -# BEFORE (in each example file) -from src.tools.websearch import WebTool -search_handler = SearchHandler(tools=[PubMedTool(), WebTool()], timeout=30.0) - -# AFTER -from src.tools.pubmed import PubMedTool -search_handler = SearchHandler(tools=[PubMedTool()], timeout=30.0) -``` - -### Step 5: Update Type Definitions - -```python -# src/utils/models.py -# BEFORE -sources_searched: list[Literal["pubmed", "web"]] - -# AFTER -sources_searched: list[Literal["pubmed"]] -``` - -### Step 6: Run All Tests - -```bash -uv run pytest tests/unit/ -v -# Expected: ALL PASS -``` - -### Step 7: Run Lints - -```bash -uv run ruff check src tests examples -uv run mypy src -# Expected: No errors -``` - ---- - -## 5. Definition of Done - -Phase 9 is **COMPLETE** when: - -- [ ] `src/tools/websearch.py` deleted -- [ ] `tests/unit/tools/test_websearch.py` deleted -- [ ] All example files updated (no WebTool imports) -- [ ] Type definitions updated in models.py -- [ ] New tests verify WebTool is removed -- [ ] All existing tests pass -- [ ] Lints pass -- [ ] Examples run successfully with PubMed only - ---- - -## 6. Verification Commands - -```bash -# 1. Verify websearch.py is gone -ls src/tools/websearch.py 2>&1 | grep "No such file" - -# 2. 
Verify no WebTool imports remain -grep -r "WebTool" src/ examples/ && echo "FAIL: WebTool references found" || echo "PASS" -grep -r "websearch" src/ examples/ && echo "FAIL: websearch references found" || echo "PASS" - -# 3. Run tests -uv run pytest tests/unit/ -v - -# 4. Run example (should work) -source .env && uv run python examples/search_demo/run_search.py "metformin cancer" -``` - ---- - -## 7. Rollback Plan - -If something breaks: - -```bash -git checkout HEAD -- src/tools/websearch.py -git checkout HEAD -- tests/unit/tools/test_websearch.py -``` - ---- - -## 8. Value Delivered - -| Before | After | -|--------|-------| -| 2 search sources (1 broken) | 1 reliable source | -| Rate limit failures | No failures | -| Web noise in results | Pure scientific sources | -| ~230 lines for websearch | 0 lines | - -**Net effect**: Simpler, more reliable, more credible. diff --git a/docs/implementation/10_phase_clinicaltrials.md b/docs/implementation/10_phase_clinicaltrials.md deleted file mode 100644 index 382b5fb631fc10b029404133375b01ed5375bde0..0000000000000000000000000000000000000000 --- a/docs/implementation/10_phase_clinicaltrials.md +++ /dev/null @@ -1,437 +0,0 @@ -# Phase 10 Implementation Spec: ClinicalTrials.gov Integration - -**Goal**: Add clinical trial search for drug repurposing evidence. -**Philosophy**: "Clinical trials are the bridge from hypothesis to therapy." -**Prerequisite**: Phase 9 complete (DuckDuckGo removed) -**Estimated Time**: 2-3 hours - ---- - -## 1. Why ClinicalTrials.gov? - -### Scientific Value - -| Feature | Value for Drug Repurposing | -|---------|---------------------------| -| **400,000+ studies** | Massive evidence base | -| **Trial phase data** | Phase I/II/III = evidence strength | -| **Intervention details** | Exact drug + dosing | -| **Outcome measures** | What was measured | -| **Status tracking** | Completed vs recruiting | -| **Free API** | No cost, no key required | - -### Example Query Response - -Query: "metformin Alzheimer's" - -```json -{ - "studies": [ - { - "nctId": "NCT04098666", - "briefTitle": "Metformin in Alzheimer's Dementia Prevention", - "phase": "Phase 2", - "status": "Recruiting", - "conditions": ["Alzheimer Disease"], - "interventions": ["Drug: Metformin"] - } - ] -} -``` - -**This is GOLD for drug repurposing** - actual trials testing the hypothesis! - ---- - -## 2. API Specification - -### Endpoint - -``` -Base URL: https://clinicaltrials.gov/api/v2/studies -``` - -### Key Parameters - -| Parameter | Description | Example | -|-----------|-------------|---------| -| `query.cond` | Condition/disease | `Alzheimer` | -| `query.intr` | Intervention/drug | `Metformin` | -| `query.term` | General search | `metformin alzheimer` | -| `pageSize` | Results per page | `20` | -| `fields` | Fields to return | See below | - -### Fields We Need - -``` -NCTId, BriefTitle, Phase, OverallStatus, Condition, -InterventionName, StartDate, CompletionDate, BriefSummary -``` - -### Rate Limits - -- ~50 requests/minute per IP -- No authentication required -- Paginated (100 results max per call) - -### Documentation - -- [API v2 Docs](https://clinicaltrials.gov/data-api/api) -- [Migration Guide](https://www.nlm.nih.gov/pubs/techbull/ma24/ma24_clinicaltrials_api.html) - ---- - -## 3. 
Data Model - -### 3.1 Update Citation Source Type (`src/utils/models.py`) - -```python -# BEFORE -source: Literal["pubmed", "web"] - -# AFTER -source: Literal["pubmed", "clinicaltrials", "biorxiv"] -``` - -### 3.2 Evidence from Clinical Trials - -Clinical trial data maps to our existing `Evidence` model: - -```python -Evidence( - content=f"{brief_summary}. Phase: {phase}. Status: {status}.", - citation=Citation( - source="clinicaltrials", - title=brief_title, - url=f"https://clinicaltrials.gov/study/{nct_id}", - date=start_date or "Unknown", - authors=[] # Trials don't have authors in the same way - ), - relevance=0.8 # Trials are highly relevant for repurposing -) -``` - ---- - -## 4. Implementation - -### 4.0 Important: HTTP Client Selection - -**ClinicalTrials.gov's WAF blocks `httpx`'s TLS fingerprint.** Use `requests` instead. - -| Library | Status | Notes | -|---------|--------|-------| -| `httpx` | ❌ 403 Blocked | TLS/JA3 fingerprint flagged | -| `httpx[http2]` | ❌ 403 Blocked | HTTP/2 doesn't help | -| `requests` | ✅ Works | Industry standard, not blocked | -| `urllib` | ✅ Works | Stdlib alternative | - -We use `requests` wrapped in `asyncio.to_thread()` for async compatibility. - -### 4.1 ClinicalTrials Tool (`src/tools/clinicaltrials.py`) - -```python -"""ClinicalTrials.gov search tool using API v2.""" - -import asyncio -from typing import Any, ClassVar - -import requests -from tenacity import retry, stop_after_attempt, wait_exponential - -from src.utils.exceptions import SearchError -from src.utils.models import Citation, Evidence - - -class ClinicalTrialsTool: - """Search tool for ClinicalTrials.gov. - - Note: Uses `requests` library instead of `httpx` because ClinicalTrials.gov's - WAF blocks httpx's TLS fingerprint. The `requests` library is not blocked. 
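-
-    The blocking `requests.get` call is run via `asyncio.to_thread()` inside
-    `search()`, so the tool stays usable from async code.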
- """ - - BASE_URL = "https://clinicaltrials.gov/api/v2/studies" - FIELDS: ClassVar[list[str]] = [ - "NCTId", - "BriefTitle", - "Phase", - "OverallStatus", - "Condition", - "InterventionName", - "StartDate", - "BriefSummary", - ] - - @property - def name(self) -> str: - return "clinicaltrials" - - @retry( - stop=stop_after_attempt(3), - wait=wait_exponential(multiplier=1, min=1, max=10), - reraise=True, - ) - async def search(self, query: str, max_results: int = 10) -> list[Evidence]: - """Search ClinicalTrials.gov for studies.""" - params = { - "query.term": query, - "pageSize": min(max_results, 100), - "fields": "|".join(self.FIELDS), - } - - try: - # Run blocking requests.get in a separate thread for async compatibility - response = await asyncio.to_thread( - requests.get, - self.BASE_URL, - params=params, - headers={"User-Agent": "DeepCritical-Research-Agent/1.0"}, - timeout=30, - ) - response.raise_for_status() - - data = response.json() - studies = data.get("studies", []) - return [self._study_to_evidence(study) for study in studies[:max_results]] - - except requests.HTTPError as e: - raise SearchError(f"ClinicalTrials.gov API error: {e}") from e - except requests.RequestException as e: - raise SearchError(f"ClinicalTrials.gov request failed: {e}") from e - - def _study_to_evidence(self, study: dict) -> Evidence: - """Convert a clinical trial study to Evidence.""" - # Navigate nested structure - protocol = study.get("protocolSection", {}) - id_module = protocol.get("identificationModule", {}) - status_module = protocol.get("statusModule", {}) - desc_module = protocol.get("descriptionModule", {}) - design_module = protocol.get("designModule", {}) - conditions_module = protocol.get("conditionsModule", {}) - arms_module = protocol.get("armsInterventionsModule", {}) - - nct_id = id_module.get("nctId", "Unknown") - title = id_module.get("briefTitle", "Untitled Study") - status = status_module.get("overallStatus", "Unknown") - start_date = status_module.get("startDateStruct", {}).get("date", "Unknown") - - # Get phase (might be a list) - phases = design_module.get("phases", []) - phase = phases[0] if phases else "Not Applicable" - - # Get conditions - conditions = conditions_module.get("conditions", []) - conditions_str = ", ".join(conditions[:3]) if conditions else "Unknown" - - # Get interventions - interventions = arms_module.get("interventions", []) - intervention_names = [i.get("name", "") for i in interventions[:3]] - interventions_str = ", ".join(intervention_names) if intervention_names else "Unknown" - - # Get summary - summary = desc_module.get("briefSummary", "No summary available.") - - # Build content with key trial info - content = ( - f"{summary[:500]}... " - f"Trial Phase: {phase}. " - f"Status: {status}. " - f"Conditions: {conditions_str}. " - f"Interventions: {interventions_str}." - ) - - return Evidence( - content=content[:2000], - citation=Citation( - source="clinicaltrials", - title=title[:500], - url=f"https://clinicaltrials.gov/study/{nct_id}", - date=start_date, - authors=[], # Trials don't have traditional authors - ), - relevance=0.85, # Trials are highly relevant for repurposing - ) -``` - ---- - -## 5. TDD Test Suite - -### 5.1 Unit Tests (`tests/unit/tools/test_clinicaltrials.py`) - -Uses `unittest.mock.patch` to mock `requests.get` (not `respx` since we're not using `httpx`). 
- -```python -"""Unit tests for ClinicalTrials.gov tool.""" - -from unittest.mock import MagicMock, patch - -import pytest -import requests - -from src.tools.clinicaltrials import ClinicalTrialsTool -from src.utils.exceptions import SearchError -from src.utils.models import Evidence - - -@pytest.fixture -def mock_clinicaltrials_response() -> dict: - """Mock ClinicalTrials.gov API response.""" - return { - "studies": [ - { - "protocolSection": { - "identificationModule": { - "nctId": "NCT04098666", - "briefTitle": "Metformin in Alzheimer's Dementia Prevention", - }, - "statusModule": { - "overallStatus": "Recruiting", - "startDateStruct": {"date": "2020-01-15"}, - }, - "descriptionModule": { - "briefSummary": "This study evaluates metformin for Alzheimer's prevention." - }, - "designModule": {"phases": ["PHASE2"]}, - "conditionsModule": {"conditions": ["Alzheimer Disease", "Dementia"]}, - "armsInterventionsModule": { - "interventions": [{"name": "Metformin", "type": "Drug"}] - }, - } - } - ] - } - - -class TestClinicalTrialsTool: - """Tests for ClinicalTrialsTool.""" - - def test_tool_name(self) -> None: - """Tool should have correct name.""" - tool = ClinicalTrialsTool() - assert tool.name == "clinicaltrials" - - @pytest.mark.asyncio - async def test_search_returns_evidence( - self, mock_clinicaltrials_response: dict - ) -> None: - """Search should return Evidence objects.""" - with patch("src.tools.clinicaltrials.requests.get") as mock_get: - mock_response = MagicMock() - mock_response.json.return_value = mock_clinicaltrials_response - mock_response.raise_for_status = MagicMock() - mock_get.return_value = mock_response - - tool = ClinicalTrialsTool() - results = await tool.search("metformin alzheimer", max_results=5) - - assert len(results) == 1 - assert isinstance(results[0], Evidence) - assert results[0].citation.source == "clinicaltrials" - assert "NCT04098666" in results[0].citation.url - assert "Metformin" in results[0].citation.title - - @pytest.mark.asyncio - async def test_search_api_error(self) -> None: - """Search should raise SearchError on API failure.""" - with patch("src.tools.clinicaltrials.requests.get") as mock_get: - mock_response = MagicMock() - mock_response.raise_for_status.side_effect = requests.HTTPError( - "500 Server Error" - ) - mock_get.return_value = mock_response - - tool = ClinicalTrialsTool() - - with pytest.raises(SearchError): - await tool.search("metformin alzheimer") - - -class TestClinicalTrialsIntegration: - """Integration tests (marked for separate run).""" - - @pytest.mark.integration - @pytest.mark.asyncio - async def test_real_api_call(self) -> None: - """Test actual API call (requires network).""" - tool = ClinicalTrialsTool() - results = await tool.search("metformin diabetes", max_results=3) - - assert len(results) > 0 - assert all(isinstance(r, Evidence) for r in results) - assert all(r.citation.source == "clinicaltrials" for r in results) -``` - ---- - -## 6. Integration with SearchHandler - -### 6.1 Update Example Files - -```python -# examples/search_demo/run_search.py -from src.tools.clinicaltrials import ClinicalTrialsTool -from src.tools.pubmed import PubMedTool -from src.tools.search_handler import SearchHandler - -search_handler = SearchHandler( - tools=[PubMedTool(), ClinicalTrialsTool()], - timeout=30.0 -) -``` - -### 6.2 Update SearchResult Type - -```python -# src/utils/models.py -sources_searched: list[Literal["pubmed", "clinicaltrials"]] -``` - ---- - -## 7. 
Definition of Done - -Phase 10 is **COMPLETE** when: - -- [ ] `src/tools/clinicaltrials.py` implemented -- [ ] Unit tests in `tests/unit/tools/test_clinicaltrials.py` -- [ ] Integration test marked with `@pytest.mark.integration` -- [ ] SearchHandler updated to include ClinicalTrialsTool -- [ ] Type definitions updated in models.py -- [ ] Example files updated -- [ ] All unit tests pass -- [ ] Lints pass -- [ ] Manual verification with real API - ---- - -## 8. Verification Commands - -```bash -# 1. Run unit tests -uv run pytest tests/unit/tools/test_clinicaltrials.py -v - -# 2. Run integration test (requires network) -uv run pytest tests/unit/tools/test_clinicaltrials.py -v -m integration - -# 3. Run full test suite -uv run pytest tests/unit/ -v - -# 4. Run example -source .env && uv run python examples/search_demo/run_search.py "metformin alzheimer" -# Should show results from BOTH PubMed AND ClinicalTrials.gov -``` - ---- - -## 9. Value Delivered - -| Before | After | -|--------|-------| -| Papers only | Papers + Clinical Trials | -| "Drug X might help" | "Drug X is in Phase II trial" | -| No trial status | Recruiting/Completed/Terminated | -| No phase info | Phase I/II/III evidence strength | - -**Demo pitch addition**: -> "DeepCritical searches PubMed for peer-reviewed evidence AND ClinicalTrials.gov for 400,000+ clinical trials." diff --git a/docs/implementation/11_phase_biorxiv.md b/docs/implementation/11_phase_biorxiv.md deleted file mode 100644 index 4e17d3c8c16c7e0bd9ec6b28086141337e98b40c..0000000000000000000000000000000000000000 --- a/docs/implementation/11_phase_biorxiv.md +++ /dev/null @@ -1,572 +0,0 @@ -# Phase 11 Implementation Spec: bioRxiv Preprint Integration - -**Goal**: Add cutting-edge preprint search for the latest research. -**Philosophy**: "Preprints are where breakthroughs appear first." -**Prerequisite**: Phase 10 complete (ClinicalTrials.gov working) -**Estimated Time**: 2-3 hours - ---- - -## 1. Why bioRxiv? - -### Scientific Value - -| Feature | Value for Drug Repurposing | -|---------|---------------------------| -| **Cutting-edge research** | 6-12 months ahead of PubMed | -| **Rapid publication** | Days, not months | -| **Free full-text** | Complete papers, not just abstracts | -| **medRxiv included** | Medical preprints via same API | -| **No API key required** | Free and open | - -### The Preprint Advantage - -``` -Traditional Publication Timeline: - Research → Submit → Review → Revise → Accept → Publish - |___________________________ 6-18 months _______________| - -Preprint Timeline: - Research → Upload → Available - |______ 1-3 days ______| -``` - -**For drug repurposing**: Preprints contain the newest hypotheses and evidence! - ---- - -## 2. API Specification - -### Endpoint - -``` -Base URL: https://api.biorxiv.org/details/[server]/[interval]/[cursor]/[format] -``` - -### Servers - -| Server | Content | -|--------|---------| -| `biorxiv` | Biology preprints | -| `medrxiv` | Medical preprints (more relevant for us!) 
| - -### Interval Formats - -| Format | Example | Description | -|--------|---------|-------------| -| Date range | `2024-01-01/2024-12-31` | Papers between dates | -| Recent N | `50` | Most recent N papers | -| Recent N days | `30d` | Papers from last N days | - -### Response Format - -```json -{ - "collection": [ - { - "doi": "10.1101/2024.01.15.123456", - "title": "Metformin repurposing for neurodegeneration", - "authors": "Smith, J; Jones, A", - "date": "2024-01-15", - "category": "neuroscience", - "abstract": "We investigated metformin's potential..." - } - ], - "messages": [{"status": "ok", "count": 100}] -} -``` - -### Rate Limits - -- No official limit, but be respectful -- Results paginated (100 per call) -- Use cursor for pagination - -### Documentation - -- [bioRxiv API](https://api.biorxiv.org/) -- [medrxivr R package docs](https://docs.ropensci.org/medrxivr/) - ---- - -## 3. Search Strategy - -### Challenge: bioRxiv API Limitations - -The bioRxiv API does NOT support keyword search directly. It returns papers by: -- Date range -- Recent count - -### Solution: Client-Side Filtering - -```python -# Strategy: -# 1. Fetch recent papers (e.g., last 90 days) -# 2. Filter by keyword matching in title/abstract -# 3. Use embeddings for semantic matching (leverage Phase 6!) -``` - -### Alternative: Content Search Endpoint - -``` -https://api.biorxiv.org/pubs/[server]/[doi_prefix] -``` - -For searching, we can use the publisher endpoint with filtering. - ---- - -## 4. Data Model - -### 4.1 Update Citation Source Type (`src/utils/models.py`) - -```python -# After Phase 11 -source: Literal["pubmed", "clinicaltrials", "biorxiv"] -``` - -### 4.2 Evidence from Preprints - -```python -Evidence( - content=abstract[:2000], - citation=Citation( - source="biorxiv", # or "medrxiv" - title=title, - url=f"https://doi.org/{doi}", - date=date, - authors=authors.split("; ")[:5] - ), - relevance=0.75 # Preprints slightly lower than peer-reviewed -) -``` - ---- - -## 5. Implementation - -### 5.1 bioRxiv Tool (`src/tools/biorxiv.py`) - -```python -"""bioRxiv/medRxiv preprint search tool.""" - -import re -from datetime import datetime, timedelta - -import httpx -from tenacity import retry, stop_after_attempt, wait_exponential - -from src.utils.exceptions import SearchError -from src.utils.models import Citation, Evidence - - -class BioRxivTool: - """Search tool for bioRxiv and medRxiv preprints.""" - - BASE_URL = "https://api.biorxiv.org/details" - # Use medRxiv for medical/clinical content (more relevant for drug repurposing) - DEFAULT_SERVER = "medrxiv" - # Fetch papers from last N days - DEFAULT_DAYS = 90 - - def __init__(self, server: str = DEFAULT_SERVER, days: int = DEFAULT_DAYS): - """ - Initialize bioRxiv tool. - - Args: - server: "biorxiv" or "medrxiv" - days: How many days back to search - """ - self.server = server - self.days = days - - @property - def name(self) -> str: - return "biorxiv" - - @retry( - stop=stop_after_attempt(3), - wait=wait_exponential(multiplier=1, min=1, max=10), - reraise=True, - ) - async def search(self, query: str, max_results: int = 10) -> list[Evidence]: - """ - Search bioRxiv/medRxiv for preprints matching query. - - Note: bioRxiv API doesn't support keyword search directly. - We fetch recent papers and filter client-side. 
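-        Only the first page of results (cursor 0, up to 100 papers) is
-        fetched, so matches are limited to the most recent window.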
- - Args: - query: Search query (keywords) - max_results: Maximum results to return - - Returns: - List of Evidence objects from preprints - """ - # Build date range for last N days - end_date = datetime.now().strftime("%Y-%m-%d") - start_date = (datetime.now() - timedelta(days=self.days)).strftime("%Y-%m-%d") - interval = f"{start_date}/{end_date}" - - # Fetch recent papers - url = f"{self.BASE_URL}/{self.server}/{interval}/0/json" - - async with httpx.AsyncClient(timeout=30.0) as client: - try: - response = await client.get(url) - response.raise_for_status() - except httpx.HTTPStatusError as e: - raise SearchError(f"bioRxiv search failed: {e}") from e - - data = response.json() - papers = data.get("collection", []) - - # Filter papers by query keywords - query_terms = self._extract_terms(query) - matching = self._filter_by_keywords(papers, query_terms, max_results) - - return [self._paper_to_evidence(paper) for paper in matching] - - def _extract_terms(self, query: str) -> list[str]: - """Extract search terms from query.""" - # Simple tokenization, lowercase - terms = re.findall(r'\b\w+\b', query.lower()) - # Filter out common stop words - stop_words = {'the', 'a', 'an', 'in', 'on', 'for', 'and', 'or', 'of', 'to'} - return [t for t in terms if t not in stop_words and len(t) > 2] - - def _filter_by_keywords( - self, papers: list[dict], terms: list[str], max_results: int - ) -> list[dict]: - """Filter papers that contain query terms in title or abstract.""" - scored_papers = [] - - for paper in papers: - title = paper.get("title", "").lower() - abstract = paper.get("abstract", "").lower() - text = f"{title} {abstract}" - - # Count matching terms - matches = sum(1 for term in terms if term in text) - - if matches > 0: - scored_papers.append((matches, paper)) - - # Sort by match count (descending) - scored_papers.sort(key=lambda x: x[0], reverse=True) - - return [paper for _, paper in scored_papers[:max_results]] - - def _paper_to_evidence(self, paper: dict) -> Evidence: - """Convert a preprint paper to Evidence.""" - doi = paper.get("doi", "") - title = paper.get("title", "Untitled") - authors_str = paper.get("authors", "Unknown") - date = paper.get("date", "Unknown") - abstract = paper.get("abstract", "No abstract available.") - category = paper.get("category", "") - - # Parse authors (format: "Smith, J; Jones, A") - authors = [a.strip() for a in authors_str.split(";")][:5] - - # Note this is a preprint in the content - content = ( - f"[PREPRINT - Not peer-reviewed] " - f"{abstract[:1800]}... " - f"Category: {category}." - ) - - return Evidence( - content=content[:2000], - citation=Citation( - source="biorxiv", - title=title[:500], - url=f"https://doi.org/{doi}" if doi else f"https://www.medrxiv.org/", - date=date, - authors=authors, - ), - relevance=0.75, # Slightly lower than peer-reviewed - ) -``` - ---- - -## 6. TDD Test Suite - -### 6.1 Unit Tests (`tests/unit/tools/test_biorxiv.py`) - -```python -"""Unit tests for bioRxiv tool.""" - -import pytest -import respx -from httpx import Response - -from src.tools.biorxiv import BioRxivTool -from src.utils.models import Evidence - - -@pytest.fixture -def mock_biorxiv_response(): - """Mock bioRxiv API response.""" - return { - "collection": [ - { - "doi": "10.1101/2024.01.15.24301234", - "title": "Metformin repurposing for Alzheimer's disease: a systematic review", - "authors": "Smith, John; Jones, Alice; Brown, Bob", - "date": "2024-01-15", - "category": "neurology", - "abstract": "Background: Metformin has shown neuroprotective effects. 
" - "We conducted a systematic review of metformin's potential " - "for Alzheimer's disease treatment." - }, - { - "doi": "10.1101/2024.01.10.24301111", - "title": "COVID-19 vaccine efficacy study", - "authors": "Wilson, C", - "date": "2024-01-10", - "category": "infectious diseases", - "abstract": "This study evaluates COVID-19 vaccine efficacy." - } - ], - "messages": [{"status": "ok", "count": 2}] - } - - -class TestBioRxivTool: - """Tests for BioRxivTool.""" - - def test_tool_name(self): - """Tool should have correct name.""" - tool = BioRxivTool() - assert tool.name == "biorxiv" - - def test_default_server_is_medrxiv(self): - """Default server should be medRxiv for medical relevance.""" - tool = BioRxivTool() - assert tool.server == "medrxiv" - - @pytest.mark.asyncio - @respx.mock - async def test_search_returns_evidence(self, mock_biorxiv_response): - """Search should return Evidence objects.""" - respx.get(url__startswith="https://api.biorxiv.org/details").mock( - return_value=Response(200, json=mock_biorxiv_response) - ) - - tool = BioRxivTool() - results = await tool.search("metformin alzheimer", max_results=5) - - assert len(results) == 1 # Only the matching paper - assert isinstance(results[0], Evidence) - assert results[0].citation.source == "biorxiv" - assert "metformin" in results[0].citation.title.lower() - - @pytest.mark.asyncio - @respx.mock - async def test_search_filters_by_keywords(self, mock_biorxiv_response): - """Search should filter papers by query keywords.""" - respx.get(url__startswith="https://api.biorxiv.org/details").mock( - return_value=Response(200, json=mock_biorxiv_response) - ) - - tool = BioRxivTool() - - # Search for metformin - should match first paper - results = await tool.search("metformin") - assert len(results) == 1 - assert "metformin" in results[0].citation.title.lower() - - # Search for COVID - should match second paper - results = await tool.search("covid vaccine") - assert len(results) == 1 - assert "covid" in results[0].citation.title.lower() - - @pytest.mark.asyncio - @respx.mock - async def test_search_marks_as_preprint(self, mock_biorxiv_response): - """Evidence content should note it's a preprint.""" - respx.get(url__startswith="https://api.biorxiv.org/details").mock( - return_value=Response(200, json=mock_biorxiv_response) - ) - - tool = BioRxivTool() - results = await tool.search("metformin") - - assert "PREPRINT" in results[0].content - assert "Not peer-reviewed" in results[0].content - - @pytest.mark.asyncio - @respx.mock - async def test_search_empty_results(self): - """Search should handle empty results gracefully.""" - respx.get(url__startswith="https://api.biorxiv.org/details").mock( - return_value=Response(200, json={"collection": [], "messages": []}) - ) - - tool = BioRxivTool() - results = await tool.search("xyznonexistent") - - assert results == [] - - @pytest.mark.asyncio - @respx.mock - async def test_search_api_error(self): - """Search should raise SearchError on API failure.""" - from src.utils.exceptions import SearchError - - respx.get(url__startswith="https://api.biorxiv.org/details").mock( - return_value=Response(500, text="Internal Server Error") - ) - - tool = BioRxivTool() - - with pytest.raises(SearchError): - await tool.search("metformin") - - def test_extract_terms(self): - """Should extract meaningful search terms.""" - tool = BioRxivTool() - - terms = tool._extract_terms("metformin for Alzheimer's disease") - - assert "metformin" in terms - assert "alzheimer" in terms - assert "disease" in terms - assert "for" 
not in terms # Stop word - assert "the" not in terms # Stop word - - -class TestBioRxivIntegration: - """Integration tests (marked for separate run).""" - - @pytest.mark.integration - @pytest.mark.asyncio - async def test_real_api_call(self): - """Test actual API call (requires network).""" - tool = BioRxivTool(days=30) # Last 30 days - results = await tool.search("diabetes", max_results=3) - - # May or may not find results depending on recent papers - assert isinstance(results, list) - for r in results: - assert isinstance(r, Evidence) - assert r.citation.source == "biorxiv" -``` - ---- - -## 7. Integration with SearchHandler - -### 7.1 Final SearchHandler Configuration - -```python -# examples/search_demo/run_search.py -from src.tools.biorxiv import BioRxivTool -from src.tools.clinicaltrials import ClinicalTrialsTool -from src.tools.pubmed import PubMedTool -from src.tools.search_handler import SearchHandler - -search_handler = SearchHandler( - tools=[ - PubMedTool(), # Peer-reviewed papers - ClinicalTrialsTool(), # Clinical trials - BioRxivTool(), # Preprints (cutting edge) - ], - timeout=30.0 -) -``` - -### 7.2 Final Type Definition - -```python -# src/utils/models.py -sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]] -``` - ---- - -## 8. Definition of Done - -Phase 11 is **COMPLETE** when: - -- [ ] `src/tools/biorxiv.py` implemented -- [ ] Unit tests in `tests/unit/tools/test_biorxiv.py` -- [ ] Integration test marked with `@pytest.mark.integration` -- [ ] SearchHandler updated to include BioRxivTool -- [ ] Type definitions updated in models.py -- [ ] Example files updated -- [ ] All unit tests pass -- [ ] Lints pass -- [ ] Manual verification with real API - ---- - -## 9. Verification Commands - -```bash -# 1. Run unit tests -uv run pytest tests/unit/tools/test_biorxiv.py -v - -# 2. Run integration test (requires network) -uv run pytest tests/unit/tools/test_biorxiv.py -v -m integration - -# 3. Run full test suite -uv run pytest tests/unit/ -v - -# 4. Run example with all three sources -source .env && uv run python examples/search_demo/run_search.py "metformin diabetes" -# Should show results from PubMed, ClinicalTrials.gov, AND bioRxiv/medRxiv -``` - ---- - -## 10. Value Delivered - -| Before | After | -|--------|-------| -| Only published papers | Published + Preprints | -| 6-18 month lag | Near real-time research | -| Miss cutting-edge | Catch breakthroughs early | - -**Demo pitch (final)**: -> "DeepCritical searches PubMed for peer-reviewed evidence, ClinicalTrials.gov for 400,000+ clinical trials, and bioRxiv/medRxiv for cutting-edge preprints - then uses LLMs to generate mechanistic hypotheses and synthesize findings into publication-quality reports." - ---- - -## 11. Complete Source Architecture (After Phase 11) - -``` -User Query: "Can metformin treat Alzheimer's?" 
- | - v - SearchHandler - | - ┌───────────────┼───────────────┐ - | | | - v v v -PubMedTool ClinicalTrials BioRxivTool - | Tool | - | | | - v v v -"15 peer- "3 Phase II "2 preprints -reviewed trials from last -papers" recruiting" 90 days" - | | | - └───────────────┼───────────────┘ - | - v - Evidence Pool - | - v - EmbeddingService.deduplicate() - | - v - HypothesisAgent → JudgeAgent → ReportAgent - | - v - Structured Research Report -``` - -**This is the Gucci Banger stack.** diff --git a/docs/implementation/12_phase_mcp_server.md b/docs/implementation/12_phase_mcp_server.md deleted file mode 100644 index 64bc5559e3e4986eb362382627ea8cd7c753a2e2..0000000000000000000000000000000000000000 --- a/docs/implementation/12_phase_mcp_server.md +++ /dev/null @@ -1,832 +0,0 @@ -# Phase 12 Implementation Spec: MCP Server Integration - -**Goal**: Expose DeepCritical search tools as MCP servers for Track 2 compliance. -**Philosophy**: "MCP is the bridge between tools and LLMs." -**Prerequisite**: Phase 11 complete (all search tools working) -**Priority**: P0 - REQUIRED FOR HACKATHON TRACK 2 -**Estimated Time**: 2-3 hours - ---- - -## 1. Why MCP Server? - -### Hackathon Requirement - -| Requirement | Status Before | Status After | -|-------------|---------------|--------------| -| Must use MCP servers as tools | **MISSING** | **COMPLIANT** | -| Autonomous Agent behavior | **Have it** | Have it | -| Must be Gradio app | **Have it** | Have it | -| Planning/reasoning/execution | **Have it** | Have it | - -**Bottom Line**: Without MCP server, we're disqualified from Track 2. - -### What MCP Enables - -```text -Current State: - Our Tools → Called directly by Python code → Only our app can use them - -After MCP: - Our Tools → Exposed via MCP protocol → Claude Desktop, Cursor, ANY MCP client -``` - ---- - -## 2. Implementation Options Analysis - -### Option A: Gradio MCP (Recommended) - -**Pros:** -- Single parameter: `demo.launch(mcp_server=True)` -- Already have Gradio app -- Automatic tool schema generation from docstrings -- Built into Gradio 5.0+ - -**Cons:** -- Requires Gradio 5.0+ with MCP extras -- Must follow strict docstring format - -### Option B: Native MCP SDK (FastMCP) - -**Pros:** -- More control over tool definitions -- Explicit server configuration -- Separate from UI concerns - -**Cons:** -- Separate server process -- More code to maintain -- Additional dependency - -### Decision: **Gradio MCP (Option A)** - -Rationale: -1. Already have Gradio app (`src/app.py`) -2. Minimal code changes -3. Judges will appreciate simplicity -4. Follows hackathon's official Gradio guide - ---- - -## 3. Technical Specification - -### 3.1 Dependencies - -```toml -# pyproject.toml - add MCP extras -dependencies = [ - "gradio[mcp]>=5.0.0", # Updated from gradio>=4.0 - # ... existing deps -] -``` - -### 3.2 MCP Tool Functions - -Each tool needs: -1. **Type hints** on all parameters -2. **Docstring** with Args section (Google style) -3. **Return type** annotation -4. **`api_name`** parameter for explicit endpoint naming - -```python -async def search_pubmed(query: str, max_results: int = 10) -> str: - """Search PubMed for biomedical literature. 
- - Args: - query: Search query for PubMed (e.g., "metformin alzheimer") - max_results: Maximum number of results to return (1-50) - - Returns: - Formatted search results with titles, citations, and abstracts - """ -``` - -### 3.3 MCP Server URL - -Once launched: -```text -http://localhost:7860/gradio_api/mcp/ -``` - -Or on HuggingFace Spaces: -```text -https://[space-id].hf.space/gradio_api/mcp/ -``` - ---- - -## 4. Implementation - -### 4.1 MCP Tool Wrappers (`src/mcp_tools.py`) - -```python -"""MCP tool wrappers for DeepCritical search tools. - -These functions expose our search tools via MCP protocol. -Each function follows the MCP tool contract: -- Full type hints -- Google-style docstrings with Args section -- Formatted string returns -""" - -from src.tools.biorxiv import BioRxivTool -from src.tools.clinicaltrials import ClinicalTrialsTool -from src.tools.pubmed import PubMedTool - - -# Singleton instances (avoid recreating on each call) -_pubmed = PubMedTool() -_trials = ClinicalTrialsTool() -_biorxiv = BioRxivTool() - - -async def search_pubmed(query: str, max_results: int = 10) -> str: - """Search PubMed for peer-reviewed biomedical literature. - - Searches NCBI PubMed database for scientific papers matching your query. - Returns titles, authors, abstracts, and citation information. - - Args: - query: Search query (e.g., "metformin alzheimer", "drug repurposing cancer") - max_results: Maximum results to return (1-50, default 10) - - Returns: - Formatted search results with paper titles, authors, dates, and abstracts - """ - max_results = max(1, min(50, max_results)) # Clamp to valid range - - results = await _pubmed.search(query, max_results) - - if not results: - return f"No PubMed results found for: {query}" - - formatted = [f"## PubMed Results for: {query}\n"] - for i, evidence in enumerate(results, 1): - formatted.append(f"### {i}. {evidence.citation.title}") - formatted.append(f"**Authors**: {', '.join(evidence.citation.authors[:3])}") - formatted.append(f"**Date**: {evidence.citation.date}") - formatted.append(f"**URL**: {evidence.citation.url}") - formatted.append(f"\n{evidence.content}\n") - - return "\n".join(formatted) - - -async def search_clinical_trials(query: str, max_results: int = 10) -> str: - """Search ClinicalTrials.gov for clinical trial data. - - Searches the ClinicalTrials.gov database for trials matching your query. - Returns trial titles, phases, status, conditions, and interventions. - - Args: - query: Search query (e.g., "metformin alzheimer", "diabetes phase 3") - max_results: Maximum results to return (1-50, default 10) - - Returns: - Formatted clinical trial information with NCT IDs, phases, and status - """ - max_results = max(1, min(50, max_results)) - - results = await _trials.search(query, max_results) - - if not results: - return f"No clinical trials found for: {query}" - - formatted = [f"## Clinical Trials for: {query}\n"] - for i, evidence in enumerate(results, 1): - formatted.append(f"### {i}. {evidence.citation.title}") - formatted.append(f"**URL**: {evidence.citation.url}") - formatted.append(f"**Date**: {evidence.citation.date}") - formatted.append(f"\n{evidence.content}\n") - - return "\n".join(formatted) - - -async def search_biorxiv(query: str, max_results: int = 10) -> str: - """Search bioRxiv/medRxiv for preprint research. - - Searches bioRxiv and medRxiv preprint servers for cutting-edge research. - Note: Preprints are NOT peer-reviewed but contain the latest findings. 
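-    Results come from the medRxiv server by default and are keyword-filtered
-    client-side over the recent window (see BioRxivTool).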
- - Args: - query: Search query (e.g., "metformin neuroprotection", "long covid treatment") - max_results: Maximum results to return (1-50, default 10) - - Returns: - Formatted preprint results with titles, authors, and abstracts - """ - max_results = max(1, min(50, max_results)) - - results = await _biorxiv.search(query, max_results) - - if not results: - return f"No bioRxiv/medRxiv preprints found for: {query}" - - formatted = [f"## Preprint Results for: {query}\n"] - for i, evidence in enumerate(results, 1): - formatted.append(f"### {i}. {evidence.citation.title}") - formatted.append(f"**Authors**: {', '.join(evidence.citation.authors[:3])}") - formatted.append(f"**Date**: {evidence.citation.date}") - formatted.append(f"**URL**: {evidence.citation.url}") - formatted.append(f"\n{evidence.content}\n") - - return "\n".join(formatted) - - -async def search_all_sources(query: str, max_per_source: int = 5) -> str: - """Search all biomedical sources simultaneously. - - Performs parallel search across PubMed, ClinicalTrials.gov, and bioRxiv. - This is the most comprehensive search option for drug repurposing research. - - Args: - query: Search query (e.g., "metformin alzheimer", "aspirin cancer prevention") - max_per_source: Maximum results per source (1-20, default 5) - - Returns: - Combined results from all sources with source labels - """ - import asyncio - - max_per_source = max(1, min(20, max_per_source)) - - # Run all searches in parallel - pubmed_task = search_pubmed(query, max_per_source) - trials_task = search_clinical_trials(query, max_per_source) - biorxiv_task = search_biorxiv(query, max_per_source) - - pubmed_results, trials_results, biorxiv_results = await asyncio.gather( - pubmed_task, trials_task, biorxiv_task, return_exceptions=True - ) - - formatted = [f"# Comprehensive Search: {query}\n"] - - # Add each result section (handle exceptions gracefully) - if isinstance(pubmed_results, str): - formatted.append(pubmed_results) - else: - formatted.append(f"## PubMed\n*Error: {pubmed_results}*\n") - - if isinstance(trials_results, str): - formatted.append(trials_results) - else: - formatted.append(f"## Clinical Trials\n*Error: {trials_results}*\n") - - if isinstance(biorxiv_results, str): - formatted.append(biorxiv_results) - else: - formatted.append(f"## Preprints\n*Error: {biorxiv_results}*\n") - - return "\n---\n".join(formatted) -``` - -### 4.2 Update Gradio App (`src/app.py`) - -```python -"""Gradio UI for DeepCritical agent with MCP server support.""" - -import os -from collections.abc import AsyncGenerator -from typing import Any - -import gradio as gr - -from src.agent_factory.judges import JudgeHandler, MockJudgeHandler -from src.mcp_tools import ( - search_all_sources, - search_biorxiv, - search_clinical_trials, - search_pubmed, -) -from src.orchestrator_factory import create_orchestrator -from src.tools.biorxiv import BioRxivTool -from src.tools.clinicaltrials import ClinicalTrialsTool -from src.tools.pubmed import PubMedTool -from src.tools.search_handler import SearchHandler -from src.utils.models import OrchestratorConfig - - -# ... (existing configure_orchestrator and research_agent functions unchanged) - - -def create_demo() -> Any: - """ - Create the Gradio demo interface with MCP support. 
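-
-    Besides the chat interface, the app mounts four gr.Interface tabs
-    (PubMed, Clinical Trials, Preprints, Search All) whose api_name
-    endpoints become MCP tools when launched with mcp_server=True.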
- - Returns: - Configured Gradio Blocks interface with MCP server enabled - """ - with gr.Blocks( - title="DeepCritical - Drug Repurposing Research Agent", - theme=gr.themes.Soft(), - ) as demo: - gr.Markdown(""" - # DeepCritical - ## AI-Powered Drug Repurposing Research Agent - - Ask questions about potential drug repurposing opportunities. - The agent searches PubMed, ClinicalTrials.gov, and bioRxiv/medRxiv preprints. - - **Example questions:** - - "What drugs could be repurposed for Alzheimer's disease?" - - "Is metformin effective for cancer treatment?" - - "What existing medications show promise for Long COVID?" - """) - - # Main chat interface (existing) - gr.ChatInterface( - fn=research_agent, - type="messages", - title="", - examples=[ - "What drugs could be repurposed for Alzheimer's disease?", - "Is metformin effective for treating cancer?", - "What medications show promise for Long COVID treatment?", - "Can statins be repurposed for neurological conditions?", - ], - additional_inputs=[ - gr.Radio( - choices=["simple", "magentic"], - value="simple", - label="Orchestrator Mode", - info="Simple: Linear (OpenAI/Anthropic) | Magentic: Multi-Agent (OpenAI)", - ) - ], - ) - - # MCP Tool Interfaces (exposed via MCP protocol) - gr.Markdown("---\n## MCP Tools (Also Available via Claude Desktop)") - - with gr.Tab("PubMed Search"): - gr.Interface( - fn=search_pubmed, - inputs=[ - gr.Textbox(label="Query", placeholder="metformin alzheimer"), - gr.Slider(1, 50, value=10, step=1, label="Max Results"), - ], - outputs=gr.Markdown(label="Results"), - api_name="search_pubmed", - ) - - with gr.Tab("Clinical Trials"): - gr.Interface( - fn=search_clinical_trials, - inputs=[ - gr.Textbox(label="Query", placeholder="diabetes phase 3"), - gr.Slider(1, 50, value=10, step=1, label="Max Results"), - ], - outputs=gr.Markdown(label="Results"), - api_name="search_clinical_trials", - ) - - with gr.Tab("Preprints"): - gr.Interface( - fn=search_biorxiv, - inputs=[ - gr.Textbox(label="Query", placeholder="long covid treatment"), - gr.Slider(1, 50, value=10, step=1, label="Max Results"), - ], - outputs=gr.Markdown(label="Results"), - api_name="search_biorxiv", - ) - - with gr.Tab("Search All"): - gr.Interface( - fn=search_all_sources, - inputs=[ - gr.Textbox(label="Query", placeholder="metformin cancer"), - gr.Slider(1, 20, value=5, step=1, label="Max Per Source"), - ], - outputs=gr.Markdown(label="Results"), - api_name="search_all", - ) - - gr.Markdown(""" - --- - **Note**: This is a research tool and should not be used for medical decisions. - Always consult healthcare professionals for medical advice. - - Built with PydanticAI + PubMed, ClinicalTrials.gov & bioRxiv - - **MCP Server**: Available at `/gradio_api/mcp/` for Claude Desktop integration - """) - - return demo - - -def main() -> None: - """Run the Gradio app with MCP server enabled.""" - demo = create_demo() - demo.launch( - server_name="0.0.0.0", - server_port=7860, - share=False, - mcp_server=True, # Enable MCP server - ) - - -if __name__ == "__main__": - main() -``` - ---- - -## 5. 
TDD Test Suite - -### 5.1 Unit Tests (`tests/unit/test_mcp_tools.py`) - -```python -"""Unit tests for MCP tool wrappers.""" - -from unittest.mock import AsyncMock, patch - -import pytest - -from src.mcp_tools import ( - search_all_sources, - search_biorxiv, - search_clinical_trials, - search_pubmed, -) -from src.utils.models import Citation, Evidence - - -@pytest.fixture -def mock_evidence() -> Evidence: - """Sample evidence for testing.""" - return Evidence( - content="Metformin shows neuroprotective effects in preclinical models.", - citation=Citation( - source="pubmed", - title="Metformin and Alzheimer's Disease", - url="https://pubmed.ncbi.nlm.nih.gov/12345678/", - date="2024-01-15", - authors=["Smith J", "Jones M", "Brown K"], - ), - relevance=0.85, - ) - - -class TestSearchPubMed: - """Tests for search_pubmed MCP tool.""" - - @pytest.mark.asyncio - async def test_returns_formatted_string(self, mock_evidence: Evidence) -> None: - """Should return formatted markdown string.""" - with patch("src.mcp_tools._pubmed") as mock_tool: - mock_tool.search = AsyncMock(return_value=[mock_evidence]) - - result = await search_pubmed("metformin alzheimer", 10) - - assert isinstance(result, str) - assert "PubMed Results" in result - assert "Metformin and Alzheimer's Disease" in result - assert "Smith J" in result - - @pytest.mark.asyncio - async def test_clamps_max_results(self) -> None: - """Should clamp max_results to valid range (1-50).""" - with patch("src.mcp_tools._pubmed") as mock_tool: - mock_tool.search = AsyncMock(return_value=[]) - - # Test lower bound - await search_pubmed("test", 0) - mock_tool.search.assert_called_with("test", 1) - - # Test upper bound - await search_pubmed("test", 100) - mock_tool.search.assert_called_with("test", 50) - - @pytest.mark.asyncio - async def test_handles_no_results(self) -> None: - """Should return appropriate message when no results.""" - with patch("src.mcp_tools._pubmed") as mock_tool: - mock_tool.search = AsyncMock(return_value=[]) - - result = await search_pubmed("xyznonexistent", 10) - - assert "No PubMed results found" in result - - -class TestSearchClinicalTrials: - """Tests for search_clinical_trials MCP tool.""" - - @pytest.mark.asyncio - async def test_returns_formatted_string(self, mock_evidence: Evidence) -> None: - """Should return formatted markdown string.""" - mock_evidence.citation.source = "clinicaltrials" # type: ignore - - with patch("src.mcp_tools._trials") as mock_tool: - mock_tool.search = AsyncMock(return_value=[mock_evidence]) - - result = await search_clinical_trials("diabetes", 10) - - assert isinstance(result, str) - assert "Clinical Trials" in result - - -class TestSearchBiorxiv: - """Tests for search_biorxiv MCP tool.""" - - @pytest.mark.asyncio - async def test_returns_formatted_string(self, mock_evidence: Evidence) -> None: - """Should return formatted markdown string.""" - mock_evidence.citation.source = "biorxiv" # type: ignore - - with patch("src.mcp_tools._biorxiv") as mock_tool: - mock_tool.search = AsyncMock(return_value=[mock_evidence]) - - result = await search_biorxiv("preprint search", 10) - - assert isinstance(result, str) - assert "Preprint Results" in result - - -class TestSearchAllSources: - """Tests for search_all_sources MCP tool.""" - - @pytest.mark.asyncio - async def test_combines_all_sources(self, mock_evidence: Evidence) -> None: - """Should combine results from all sources.""" - with patch("src.mcp_tools.search_pubmed", new_callable=AsyncMock) as mock_pubmed, \ - 
patch("src.mcp_tools.search_clinical_trials", new_callable=AsyncMock) as mock_trials, \ - patch("src.mcp_tools.search_biorxiv", new_callable=AsyncMock) as mock_biorxiv: - - mock_pubmed.return_value = "## PubMed Results" - mock_trials.return_value = "## Clinical Trials" - mock_biorxiv.return_value = "## Preprints" - - result = await search_all_sources("metformin", 5) - - assert "Comprehensive Search" in result - assert "PubMed" in result - assert "Clinical Trials" in result - assert "Preprints" in result - - @pytest.mark.asyncio - async def test_handles_partial_failures(self) -> None: - """Should handle partial failures gracefully.""" - with patch("src.mcp_tools.search_pubmed", new_callable=AsyncMock) as mock_pubmed, \ - patch("src.mcp_tools.search_clinical_trials", new_callable=AsyncMock) as mock_trials, \ - patch("src.mcp_tools.search_biorxiv", new_callable=AsyncMock) as mock_biorxiv: - - mock_pubmed.return_value = "## PubMed Results" - mock_trials.side_effect = Exception("API Error") - mock_biorxiv.return_value = "## Preprints" - - result = await search_all_sources("metformin", 5) - - # Should still contain working sources - assert "PubMed" in result - assert "Preprints" in result - # Should show error for failed source - assert "Error" in result - - -class TestMCPDocstrings: - """Tests that docstrings follow MCP format.""" - - def test_search_pubmed_has_args_section(self) -> None: - """Docstring must have Args section for MCP schema generation.""" - assert search_pubmed.__doc__ is not None - assert "Args:" in search_pubmed.__doc__ - assert "query:" in search_pubmed.__doc__ - assert "max_results:" in search_pubmed.__doc__ - assert "Returns:" in search_pubmed.__doc__ - - def test_search_clinical_trials_has_args_section(self) -> None: - """Docstring must have Args section for MCP schema generation.""" - assert search_clinical_trials.__doc__ is not None - assert "Args:" in search_clinical_trials.__doc__ - - def test_search_biorxiv_has_args_section(self) -> None: - """Docstring must have Args section for MCP schema generation.""" - assert search_biorxiv.__doc__ is not None - assert "Args:" in search_biorxiv.__doc__ - - def test_search_all_sources_has_args_section(self) -> None: - """Docstring must have Args section for MCP schema generation.""" - assert search_all_sources.__doc__ is not None - assert "Args:" in search_all_sources.__doc__ - - -class TestMCPTypeHints: - """Tests that type hints are complete for MCP.""" - - def test_search_pubmed_type_hints(self) -> None: - """All parameters and return must have type hints.""" - import inspect - - sig = inspect.signature(search_pubmed) - - # Check parameter hints - assert sig.parameters["query"].annotation == str - assert sig.parameters["max_results"].annotation == int - - # Check return hint - assert sig.return_annotation == str - - def test_search_clinical_trials_type_hints(self) -> None: - """All parameters and return must have type hints.""" - import inspect - - sig = inspect.signature(search_clinical_trials) - assert sig.parameters["query"].annotation == str - assert sig.parameters["max_results"].annotation == int - assert sig.return_annotation == str -``` - -### 5.2 Integration Test (`tests/integration/test_mcp_server.py`) - -```python -"""Integration tests for MCP server functionality.""" - -import pytest - - -class TestMCPServerIntegration: - """Integration tests for MCP server (requires running app).""" - - @pytest.mark.integration - @pytest.mark.asyncio - async def test_mcp_tools_work_end_to_end(self) -> None: - """Test that MCP tools 
execute real searches.""" - from src.mcp_tools import search_pubmed - - result = await search_pubmed("metformin diabetes", 3) - - assert isinstance(result, str) - assert "PubMed Results" in result - # Should have actual content (not just "no results") - assert len(result) > 100 -``` - ---- - -## 6. Claude Desktop Configuration - -### 6.1 Local Development - -```json -// ~/.config/claude/claude_desktop_config.json (Linux/Mac) -// %APPDATA%\Claude\claude_desktop_config.json (Windows) -{ - "mcpServers": { - "deepcritical": { - "url": "http://localhost:7860/gradio_api/mcp/" - } - } -} -``` - -### 6.2 HuggingFace Spaces - -```json -{ - "mcpServers": { - "deepcritical": { - "url": "https://MCP-1st-Birthday-deepcritical.hf.space/gradio_api/mcp/" - } - } -} -``` - -### 6.3 Private Spaces (with auth) - -```json -{ - "mcpServers": { - "deepcritical": { - "url": "https://your-space.hf.space/gradio_api/mcp/", - "headers": { - "Authorization": "Bearer hf_xxxxxxxxxxxxx" - } - } - } -} -``` - ---- - -## 7. Verification Commands - -```bash -# 1. Install MCP extras -uv add "gradio[mcp]>=5.0.0" - -# 2. Run unit tests -uv run pytest tests/unit/test_mcp_tools.py -v - -# 3. Run full test suite -make check - -# 4. Start server with MCP -uv run python src/app.py - -# 5. Verify MCP schema (in another terminal) -curl http://localhost:7860/gradio_api/mcp/schema - -# 6. Test with MCP Inspector -npx @anthropic/mcp-inspector http://localhost:7860/gradio_api/mcp/ - -# 7. Integration test (requires running server) -uv run pytest tests/integration/test_mcp_server.py -v -m integration -``` - ---- - -## 8. Definition of Done - -Phase 12 is **COMPLETE** when: - -- [ ] `src/mcp_tools.py` created with all 4 MCP tools -- [ ] `src/app.py` updated with `mcp_server=True` -- [ ] Unit tests in `tests/unit/test_mcp_tools.py` -- [ ] Integration test in `tests/integration/test_mcp_server.py` -- [ ] `pyproject.toml` updated with `gradio[mcp]` -- [ ] MCP schema accessible at `/gradio_api/mcp/schema` -- [ ] Claude Desktop can connect and use tools -- [ ] All unit tests pass -- [ ] Lints pass - ---- - -## 9. Demo Script for Judges - -### Show MCP Integration Works - -1. **Start the server**: - ```bash - uv run python src/app.py - ``` - -2. **Show Claude Desktop using our tools**: - - Open Claude Desktop with DeepCritical MCP configured - - Ask: "Search PubMed for metformin Alzheimer's" - - Show real results appearing - - Ask: "Now search clinical trials for the same" - - Show combined analysis - -3. **Show MCP Inspector**: - ```bash - npx @anthropic/mcp-inspector http://localhost:7860/gradio_api/mcp/ - ``` - - Show all 4 tools listed - - Execute `search_pubmed` from inspector - - Show results - ---- - -## 10. Value Delivered - -| Before | After | -|--------|-------| -| Tools only usable in our app | Tools usable by ANY MCP client | -| Not Track 2 compliant | **FULLY TRACK 2 COMPLIANT** | -| Can't use with Claude Desktop | Full Claude Desktop integration | - -**Prize Impact**: -- Without MCP: **Disqualified from Track 2** -- With MCP: **Eligible for $2,500 1st place** - ---- - -## 11. 
Files to Create/Modify - -| File | Action | Purpose | -|------|--------|---------| -| `src/mcp_tools.py` | CREATE | MCP tool wrapper functions | -| `src/app.py` | MODIFY | Add `mcp_server=True`, add tool tabs | -| `pyproject.toml` | MODIFY | Add `gradio[mcp]>=5.0.0` | -| `tests/unit/test_mcp_tools.py` | CREATE | Unit tests for MCP tools | -| `tests/integration/test_mcp_server.py` | CREATE | Integration tests | -| `README.md` | MODIFY | Add MCP usage instructions | - ---- - -## 12. Architecture After Phase 12 - -```text -┌────────────────────────────────────────────────────────────────┐ -│ Claude Desktop / Cursor │ -│ (MCP Client) │ -└─────────────────────────────┬──────────────────────────────────┘ - │ MCP Protocol - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ Gradio MCP Server │ -│ /gradio_api/mcp/ │ -│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────┐ │ -│ │search_pubmed │ │search_trials │ │search_biorxiv│ │search_ │ │ -│ │ │ │ │ │ │ │all │ │ -│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └────┬────┘ │ -└─────────┼────────────────┼────────────────┼──────────────┼──────┘ - │ │ │ │ - ▼ ▼ ▼ ▼ - ┌──────────┐ ┌──────────┐ ┌──────────┐ (calls all) - │PubMedTool│ │Trials │ │BioRxiv │ - │ │ │Tool │ │Tool │ - └──────────┘ └──────────┘ └──────────┘ -``` - -**This is the MCP compliance stack.** diff --git a/docs/implementation/13_phase_modal_integration.md b/docs/implementation/13_phase_modal_integration.md deleted file mode 100644 index edb0f7c628f3b7cf1687ab0ad24c188e6ecbfff8..0000000000000000000000000000000000000000 --- a/docs/implementation/13_phase_modal_integration.md +++ /dev/null @@ -1,1195 +0,0 @@ -# Phase 13 Implementation Spec: Modal Pipeline Integration - -**Goal**: Wire existing Modal code execution into the agent pipeline. -**Philosophy**: "Sandboxed execution makes AI-generated code trustworthy." -**Prerequisite**: Phase 12 complete (MCP server working) -**Priority**: P1 - HIGH VALUE ($2,500 Modal Innovation Award) -**Estimated Time**: 2-3 hours - ---- - -## 1. Why Modal Integration? - -### Current State Analysis - -Mario already implemented `src/tools/code_execution.py`: - -| Component | Status | Notes | -|-----------|--------|-------| -| `ModalCodeExecutor` class | Built | Executes Python in Modal sandbox | -| `SANDBOX_LIBRARIES` | Defined | pandas, numpy, scipy, etc. | -| `execute()` method | Implemented | Stdout/stderr capture | -| `execute_with_return()` | Implemented | Returns `result` variable | -| `AnalysisAgent` | Built | Uses Modal for statistical analysis | -| **Pipeline Integration** | **MISSING** | Not wired into main orchestrator | - -### What's Missing - -```text -Current Flow: - User Query → Orchestrator → Search → Judge → [Report] → Done - -With Modal: - User Query → Orchestrator → Search → Judge → [Analysis*] → Report → Done - ↓ - Modal Sandbox Execution -``` - -*The AnalysisAgent exists but is NOT called by either orchestrator. - ---- - -## 2. Critical Dependency Analysis - -### The Problem (Senior Feedback) - -```python -# src/agents/analysis_agent.py - Line 8 -from agent_framework import ( - AgentRunResponse, - BaseAgent, - ... 
-) -``` - -```toml -# pyproject.toml - agent-framework is OPTIONAL -[project.optional-dependencies] -magentic = [ - "agent-framework-core", -] -``` - -**If we import `AnalysisAgent` in the simple orchestrator without the `magentic` extra installed, the app CRASHES on startup.** - -### The SOLID Solution - -**Single Responsibility Principle**: Decouple Modal execution logic from `agent_framework`. - -```text -BEFORE (Coupled): - AnalysisAgent (requires agent_framework) - ↓ - ModalCodeExecutor - -AFTER (Decoupled): - StatisticalAnalyzer (no agent_framework dependency) ← Simple mode uses this - ↓ - ModalCodeExecutor - ↑ - AnalysisAgent (wraps StatisticalAnalyzer) ← Magentic mode uses this -``` - -**Key insight**: Create `src/services/statistical_analyzer.py` with ZERO agent_framework imports. - ---- - -## 3. Prize Opportunity - -### Modal Innovation Award: $2,500 - -**Judging Criteria**: -1. **Sandbox Isolation** - Code runs in container, not local -2. **Scientific Computing** - Real pandas/scipy analysis -3. **Safety** - Can't access local filesystem -4. **Speed** - Modal's fast cold starts - -### What We Need to Show - -```python -# LLM generates analysis code -code = """ -import pandas as pd -import scipy.stats as stats - -data = pd.DataFrame({ - 'study': ['Study1', 'Study2', 'Study3'], - 'effect_size': [0.45, 0.52, 0.38], - 'sample_size': [120, 85, 200] -}) - -weighted_mean = (data['effect_size'] * data['sample_size']).sum() / data['sample_size'].sum() -t_stat, p_value = stats.ttest_1samp(data['effect_size'], 0) - -print(f"Weighted Effect Size: {weighted_mean:.3f}") -print(f"P-value: {p_value:.4f}") - -result = "SUPPORTED" if p_value < 0.05 else "INCONCLUSIVE" -""" - -# Executed SAFELY in Modal sandbox -executor = get_code_executor() -output = executor.execute(code) # Runs in isolated container! -``` - ---- - -## 4. Technical Specification - -### 4.1 Dependencies - -```toml -# pyproject.toml - NO CHANGES to dependencies -# StatisticalAnalyzer uses only: -# - pydantic-ai (already in main deps) -# - modal (already in main deps) -# - src.tools.code_execution (no agent_framework) -``` - -### 4.2 Environment Variables - -```bash -# .env -MODAL_TOKEN_ID=your-token-id -MODAL_TOKEN_SECRET=your-token-secret -``` - -### 4.3 Integration Points - -| Integration Point | File | Change Required | -|-------------------|------|-----------------| -| New Service | `src/services/statistical_analyzer.py` | CREATE (no agent_framework) | -| Simple Orchestrator | `src/orchestrator.py` | Use `StatisticalAnalyzer` | -| Config | `src/utils/config.py` | Add `enable_modal_analysis` setting | -| AnalysisAgent | `src/agents/analysis_agent.py` | Refactor to wrap `StatisticalAnalyzer` | -| MCP Tool | `src/mcp_tools.py` | Add `analyze_hypothesis` tool | - ---- - -## 5. Implementation - -### 5.1 Configuration Update (`src/utils/config.py`) - -```python -class Settings(BaseSettings): - # ... existing settings ... - - # Modal Configuration - modal_token_id: str | None = None - modal_token_secret: str | None = None - enable_modal_analysis: bool = False # Opt-in for hackathon demo - - @property - def modal_available(self) -> bool: - """Check if Modal credentials are configured.""" - return bool(self.modal_token_id and self.modal_token_secret) -``` - -### 5.2 StatisticalAnalyzer Service (`src/services/statistical_analyzer.py`) - -**This is the key fix - NO agent_framework imports.** - -```python -"""Statistical analysis service using Modal code execution. 
- -This module provides Modal-based statistical analysis WITHOUT depending on -agent_framework. This allows it to be used in the simple orchestrator mode -without requiring the magentic optional dependency. - -The AnalysisAgent (in src/agents/) wraps this service for magentic mode. -""" - -import asyncio -import re -from functools import partial -from typing import Any - -from pydantic import BaseModel, Field -from pydantic_ai import Agent - -from src.agent_factory.judges import get_model -from src.tools.code_execution import ( - CodeExecutionError, - get_code_executor, - get_sandbox_library_prompt, -) -from src.utils.models import Evidence - - -class AnalysisResult(BaseModel): - """Result of statistical analysis.""" - - verdict: str = Field( - description="SUPPORTED, REFUTED, or INCONCLUSIVE", - ) - confidence: float = Field(ge=0.0, le=1.0, description="Confidence in verdict (0-1)") - statistical_evidence: str = Field( - description="Summary of statistical findings from code execution" - ) - code_generated: str = Field(description="Python code that was executed") - execution_output: str = Field(description="Output from code execution") - key_findings: list[str] = Field(default_factory=list, description="Key takeaways") - limitations: list[str] = Field(default_factory=list, description="Limitations") - - -class StatisticalAnalyzer: - """Performs statistical analysis using Modal code execution. - - This service: - 1. Generates Python code for statistical analysis using LLM - 2. Executes code in Modal sandbox - 3. Interprets results - 4. Returns verdict (SUPPORTED/REFUTED/INCONCLUSIVE) - - Note: This class has NO agent_framework dependency, making it safe - to use in the simple orchestrator without the magentic extra. - """ - - def __init__(self) -> None: - """Initialize the analyzer.""" - self._code_executor: Any = None - self._agent: Agent[None, str] | None = None - - def _get_code_executor(self) -> Any: - """Lazy initialization of code executor.""" - if self._code_executor is None: - self._code_executor = get_code_executor() - return self._code_executor - - def _get_agent(self) -> Agent[None, str]: - """Lazy initialization of LLM agent for code generation.""" - if self._agent is None: - library_versions = get_sandbox_library_prompt() - self._agent = Agent( - model=get_model(), - output_type=str, - system_prompt=f"""You are a biomedical data scientist. - -Generate Python code to analyze research evidence and test hypotheses. - -Guidelines: -1. Use pandas, numpy, scipy.stats for analysis -2. Print clear, interpretable results -3. Include statistical tests (t-tests, chi-square, etc.) -4. Calculate effect sizes and confidence intervals -5. Keep code concise (<50 lines) -6. Set 'result' variable to SUPPORTED, REFUTED, or INCONCLUSIVE - -Available libraries: -{library_versions} - -Output format: Return ONLY executable Python code, no explanations.""", - ) - return self._agent - - async def analyze( - self, - query: str, - evidence: list[Evidence], - hypothesis: dict[str, Any] | None = None, - ) -> AnalysisResult: - """Run statistical analysis on evidence. 
- - Args: - query: The research question - evidence: List of Evidence objects to analyze - hypothesis: Optional hypothesis dict with drug, target, pathway, effect - - Returns: - AnalysisResult with verdict and statistics - """ - # Build analysis prompt - evidence_summary = self._summarize_evidence(evidence[:10]) - hypothesis_text = "" - if hypothesis: - hypothesis_text = f""" -Hypothesis: {hypothesis.get('drug', 'Unknown')} → {hypothesis.get('target', '?')} → {hypothesis.get('pathway', '?')} → {hypothesis.get('effect', '?')} -Confidence: {hypothesis.get('confidence', 0.5):.0%} -""" - - prompt = f"""Generate Python code to statistically analyze: - -**Research Question**: {query} -{hypothesis_text} - -**Evidence Summary**: -{evidence_summary} - -Generate executable Python code to analyze this evidence.""" - - try: - # Generate code - agent = self._get_agent() - code_result = await agent.run(prompt) - generated_code = code_result.output - - # Execute in Modal sandbox - loop = asyncio.get_running_loop() - executor = self._get_code_executor() - execution = await loop.run_in_executor( - None, partial(executor.execute, generated_code, timeout=120) - ) - - if not execution["success"]: - return AnalysisResult( - verdict="INCONCLUSIVE", - confidence=0.0, - statistical_evidence=f"Execution failed: {execution['error']}", - code_generated=generated_code, - execution_output=execution.get("stderr", ""), - key_findings=[], - limitations=["Code execution failed"], - ) - - # Interpret results - return self._interpret_results(generated_code, execution) - - except CodeExecutionError as e: - return AnalysisResult( - verdict="INCONCLUSIVE", - confidence=0.0, - statistical_evidence=str(e), - code_generated="", - execution_output="", - key_findings=[], - limitations=[f"Analysis error: {e}"], - ) - - def _summarize_evidence(self, evidence: list[Evidence]) -> str: - """Summarize evidence for code generation prompt.""" - if not evidence: - return "No evidence available." - - lines = [] - for i, ev in enumerate(evidence[:5], 1): - lines.append(f"{i}. 
{ev.content[:200]}...") - lines.append(f" Source: {ev.citation.title}") - lines.append(f" Relevance: {ev.relevance:.0%}\n") - - return "\n".join(lines) - - def _interpret_results( - self, - code: str, - execution: dict[str, Any], - ) -> AnalysisResult: - """Interpret code execution results.""" - stdout = execution["stdout"] - stdout_upper = stdout.upper() - - # Extract verdict with robust word-boundary matching - verdict = "INCONCLUSIVE" - if re.search(r"\bSUPPORTED\b", stdout_upper) and not re.search( - r"\b(?:NOT|UN)SUPPORTED\b", stdout_upper - ): - verdict = "SUPPORTED" - elif re.search(r"\bREFUTED\b", stdout_upper): - verdict = "REFUTED" - - # Extract key findings - key_findings = [] - for line in stdout.split("\n"): - line_lower = line.lower() - if any(kw in line_lower for kw in ["p-value", "significant", "effect", "mean"]): - key_findings.append(line.strip()) - - # Calculate confidence from p-values - confidence = self._calculate_confidence(stdout) - - return AnalysisResult( - verdict=verdict, - confidence=confidence, - statistical_evidence=stdout.strip(), - code_generated=code, - execution_output=stdout, - key_findings=key_findings[:5], - limitations=[ - "Analysis based on summary data only", - "Limited to available evidence", - "Statistical tests assume data independence", - ], - ) - - def _calculate_confidence(self, output: str) -> float: - """Calculate confidence based on statistical results.""" - p_values = re.findall(r"p[-\s]?value[:\s]+(\d+\.?\d*)", output.lower()) - - if p_values: - try: - min_p = min(float(p) for p in p_values) - if min_p < 0.001: - return 0.95 - elif min_p < 0.01: - return 0.90 - elif min_p < 0.05: - return 0.80 - else: - return 0.60 - except ValueError: - pass - - return 0.70 # Default - - -# Singleton for reuse -_analyzer: StatisticalAnalyzer | None = None - - -def get_statistical_analyzer() -> StatisticalAnalyzer: - """Get or create singleton StatisticalAnalyzer instance.""" - global _analyzer - if _analyzer is None: - _analyzer = StatisticalAnalyzer() - return _analyzer -``` - -### 5.3 Simple Orchestrator Update (`src/orchestrator.py`) - -**Uses `StatisticalAnalyzer` directly - NO agent_framework import.** - -```python -"""Main orchestrator with optional Modal analysis.""" - -from src.utils.config import settings - -# ... existing imports ... - - -class Orchestrator: - """Search-Judge-Analyze orchestration loop.""" - - def __init__( - self, - search_handler: SearchHandlerProtocol, - judge_handler: JudgeHandlerProtocol, - config: OrchestratorConfig | None = None, - enable_analysis: bool = False, # New parameter - ) -> None: - self.search = search_handler - self.judge = judge_handler - self.config = config or OrchestratorConfig() - self.history: list[dict[str, Any]] = [] - self._enable_analysis = enable_analysis and settings.modal_available - - # Lazy-load analysis (NO agent_framework dependency!) - self._analyzer: Any = None - - def _get_analyzer(self) -> Any: - """Lazy initialization of StatisticalAnalyzer. - - Note: This imports from src.services, NOT src.agents, - so it works without the magentic optional dependency. - """ - if self._analyzer is None: - from src.services.statistical_analyzer import get_statistical_analyzer - - self._analyzer = get_statistical_analyzer() - return self._analyzer - - async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]: - """Main orchestration loop with optional Modal analysis.""" - # ... existing search/judge loop ... 
- - # After judge says "synthesize", optionally run analysis - if self._enable_analysis and assessment.recommendation == "synthesize": - yield AgentEvent( - type="analyzing", - message="Running statistical analysis in Modal sandbox...", - data={}, - iteration=iteration, - ) - - try: - analyzer = self._get_analyzer() - - # Run Modal analysis (no agent_framework needed!) - analysis_result = await analyzer.analyze( - query=query, - evidence=all_evidence, - hypothesis=None, # Could add hypothesis generation later - ) - - yield AgentEvent( - type="analysis_complete", - message=f"Analysis verdict: {analysis_result.verdict}", - data=analysis_result.model_dump(), - iteration=iteration, - ) - - except Exception as e: - yield AgentEvent( - type="error", - message=f"Modal analysis failed: {e}", - data={"error": str(e)}, - iteration=iteration, - ) - - # Continue to synthesis... -``` - -### 5.4 Refactor AnalysisAgent (`src/agents/analysis_agent.py`) - -**Wrap `StatisticalAnalyzer` for magentic mode.** - -```python -"""Analysis agent for statistical analysis using Modal code execution. - -This agent wraps StatisticalAnalyzer for use in magentic multi-agent mode. -The core logic is in src/services/statistical_analyzer.py to avoid -coupling agent_framework to the simple orchestrator. -""" - -from collections.abc import AsyncIterable -from typing import TYPE_CHECKING, Any - -from agent_framework import ( - AgentRunResponse, - AgentRunResponseUpdate, - AgentThread, - BaseAgent, - ChatMessage, - Role, -) - -from src.services.statistical_analyzer import ( - AnalysisResult, - get_statistical_analyzer, -) -from src.utils.models import Evidence - -if TYPE_CHECKING: - from src.services.embeddings import EmbeddingService - - -class AnalysisAgent(BaseAgent): # type: ignore[misc] - """Wraps StatisticalAnalyzer for magentic multi-agent mode.""" - - def __init__( - self, - evidence_store: dict[str, Any], - embedding_service: "EmbeddingService | None" = None, - ) -> None: - super().__init__( - name="AnalysisAgent", - description="Performs statistical analysis using Modal sandbox", - ) - self._evidence_store = evidence_store - self._embeddings = embedding_service - self._analyzer = get_statistical_analyzer() - - async def run( - self, - messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None, - *, - thread: AgentThread | None = None, - **kwargs: Any, - ) -> AgentRunResponse: - """Analyze evidence and return verdict.""" - query = self._extract_query(messages) - hypotheses = self._evidence_store.get("hypotheses", []) - evidence = self._evidence_store.get("current", []) - - if not evidence: - return self._error_response("No evidence available.") - - # Get primary hypothesis if available - hypothesis_dict = None - if hypotheses: - h = hypotheses[0] - hypothesis_dict = { - "drug": getattr(h, "drug", "Unknown"), - "target": getattr(h, "target", "?"), - "pathway": getattr(h, "pathway", "?"), - "effect": getattr(h, "effect", "?"), - "confidence": getattr(h, "confidence", 0.5), - } - - # Delegate to StatisticalAnalyzer - result = await self._analyzer.analyze( - query=query, - evidence=evidence, - hypothesis=hypothesis_dict, - ) - - # Store in shared context - self._evidence_store["analysis"] = result.model_dump() - - # Format response - response_text = self._format_response(result) - - return AgentRunResponse( - messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)], - response_id=f"analysis-{result.verdict.lower()}", - additional_properties={"analysis": result.model_dump()}, - ) - - def 
_format_response(self, result: AnalysisResult) -> str: - """Format analysis result as markdown.""" - lines = [ - "## Statistical Analysis Complete\n", - f"### Verdict: **{result.verdict}**", - f"**Confidence**: {result.confidence:.0%}\n", - "### Key Findings", - ] - for finding in result.key_findings: - lines.append(f"- {finding}") - - lines.extend([ - "\n### Statistical Evidence", - "```", - result.statistical_evidence, - "```", - ]) - return "\n".join(lines) - - def _error_response(self, message: str) -> AgentRunResponse: - """Create error response.""" - return AgentRunResponse( - messages=[ChatMessage(role=Role.ASSISTANT, text=f"**Error**: {message}")], - response_id="analysis-error", - ) - - def _extract_query( - self, messages: str | ChatMessage | list[str] | list[ChatMessage] | None - ) -> str: - """Extract query from messages.""" - if isinstance(messages, str): - return messages - elif isinstance(messages, ChatMessage): - return messages.text or "" - elif isinstance(messages, list): - for msg in reversed(messages): - if isinstance(msg, ChatMessage) and msg.role == Role.USER: - return msg.text or "" - elif isinstance(msg, str): - return msg - return "" - - async def run_stream( - self, - messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None, - *, - thread: AgentThread | None = None, - **kwargs: Any, - ) -> AsyncIterable[AgentRunResponseUpdate]: - """Streaming wrapper.""" - result = await self.run(messages, thread=thread, **kwargs) - yield AgentRunResponseUpdate(messages=result.messages, response_id=result.response_id) -``` - -### 5.5 MCP Tool for Modal Analysis (`src/mcp_tools.py`) - -Add to existing MCP tools: - -```python -async def analyze_hypothesis( - drug: str, - condition: str, - evidence_summary: str, -) -> str: - """Perform statistical analysis of drug repurposing hypothesis using Modal. - - Executes AI-generated Python code in a secure Modal sandbox to analyze - the statistical evidence for a drug repurposing hypothesis. - - Args: - drug: The drug being evaluated (e.g., "metformin") - condition: The target condition (e.g., "Alzheimer's disease") - evidence_summary: Summary of evidence to analyze - - Returns: - Analysis result with verdict (SUPPORTED/REFUTED/INCONCLUSIVE) and statistics - """ - from src.services.statistical_analyzer import get_statistical_analyzer - from src.utils.config import settings - from src.utils.models import Citation, Evidence - - if not settings.modal_available: - return "Error: Modal credentials not configured. Set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET." - - # Create evidence from summary - evidence = [ - Evidence( - content=evidence_summary, - citation=Citation( - source="pubmed", - title=f"Evidence for {drug} in {condition}", - url="https://example.com", - date="2024-01-01", - authors=["User Provided"], - ), - relevance=0.9, - ) - ] - - analyzer = get_statistical_analyzer() - result = await analyzer.analyze( - query=f"Can {drug} treat {condition}?", - evidence=evidence, - hypothesis={"drug": drug, "target": "unknown", "pathway": "unknown", "effect": condition}, - ) - - return f"""## Statistical Analysis: {drug} for {condition} - -### Verdict: **{result.verdict}** -**Confidence**: {result.confidence:.0%} - -### Key Findings -{chr(10).join(f"- {f}" for f in result.key_findings) or "- No specific findings extracted"} - -### Execution Output -``` -{result.execution_output} -``` - -### Generated Code -```python -{result.code_generated} -``` - -**Executed in Modal Sandbox** - Isolated, secure, reproducible. 
-""" -``` - -### 5.6 Demo Scripts - -#### `examples/modal_demo/verify_sandbox.py` - -```python -#!/usr/bin/env python3 -"""Verify that Modal sandbox is properly isolated. - -This script proves to judges that code runs in Modal, not locally. -NO agent_framework dependency - uses only src.tools.code_execution. - -Usage: - uv run python examples/modal_demo/verify_sandbox.py -""" - -import asyncio -from functools import partial - -from src.tools.code_execution import get_code_executor -from src.utils.config import settings - - -async def main() -> None: - """Verify Modal sandbox isolation.""" - if not settings.modal_available: - print("Error: Modal credentials not configured.") - print("Set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in .env") - return - - executor = get_code_executor() - loop = asyncio.get_running_loop() - - print("=" * 60) - print("Modal Sandbox Isolation Verification") - print("=" * 60 + "\n") - - # Test 1: Hostname - print("Test 1: Check hostname (should NOT be your machine)") - code1 = "import socket; print(f'Hostname: {socket.gethostname()}')" - result1 = await loop.run_in_executor(None, partial(executor.execute, code1)) - print(f" {result1['stdout'].strip()}\n") - - # Test 2: Scientific libraries - print("Test 2: Verify scientific libraries") - code2 = """ -import pandas as pd -import numpy as np -import scipy -print(f"pandas: {pd.__version__}") -print(f"numpy: {np.__version__}") -print(f"scipy: {scipy.__version__}") -""" - result2 = await loop.run_in_executor(None, partial(executor.execute, code2)) - print(f" {result2['stdout'].strip()}\n") - - # Test 3: Network blocked - print("Test 3: Verify network isolation") - code3 = """ -import urllib.request -try: - urllib.request.urlopen("https://google.com", timeout=2) - print("Network: ALLOWED (unexpected!)") -except Exception: - print("Network: BLOCKED (as expected)") -""" - result3 = await loop.run_in_executor(None, partial(executor.execute, code3)) - print(f" {result3['stdout'].strip()}\n") - - # Test 4: Real statistics - print("Test 4: Execute statistical analysis") - code4 = """ -import pandas as pd -import scipy.stats as stats - -data = pd.DataFrame({'effect': [0.42, 0.38, 0.51]}) -mean = data['effect'].mean() -t_stat, p_val = stats.ttest_1samp(data['effect'], 0) - -print(f"Mean Effect: {mean:.3f}") -print(f"P-value: {p_val:.4f}") -print(f"Verdict: {'SUPPORTED' if p_val < 0.05 else 'INCONCLUSIVE'}") -""" - result4 = await loop.run_in_executor(None, partial(executor.execute, code4)) - print(f" {result4['stdout'].strip()}\n") - - print("=" * 60) - print("All tests complete - Modal sandbox verified!") - print("=" * 60) - - -if __name__ == "__main__": - asyncio.run(main()) -``` - -#### `examples/modal_demo/run_analysis.py` - -```python -#!/usr/bin/env python3 -"""Demo: Modal-powered statistical analysis. - -This script uses StatisticalAnalyzer directly (NO agent_framework dependency). 
- -Usage: - uv run python examples/modal_demo/run_analysis.py "metformin alzheimer" -""" - -import argparse -import asyncio -import os -import sys - -from src.services.statistical_analyzer import get_statistical_analyzer -from src.tools.pubmed import PubMedTool -from src.utils.config import settings - - -async def main() -> None: - """Run the Modal analysis demo.""" - parser = argparse.ArgumentParser(description="Modal Analysis Demo") - parser.add_argument("query", help="Research query") - args = parser.parse_args() - - if not settings.modal_available: - print("Error: Modal credentials not configured.") - sys.exit(1) - - if not (os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY")): - print("Error: No LLM API key found.") - sys.exit(1) - - print(f"\n{'=' * 60}") - print("DeepCritical Modal Analysis Demo") - print(f"Query: {args.query}") - print(f"{'=' * 60}\n") - - # Step 1: Gather Evidence - print("Step 1: Gathering evidence from PubMed...") - pubmed = PubMedTool() - evidence = await pubmed.search(args.query, max_results=5) - print(f" Found {len(evidence)} papers\n") - - # Step 2: Run Modal Analysis - print("Step 2: Running statistical analysis in Modal sandbox...") - analyzer = get_statistical_analyzer() - result = await analyzer.analyze(query=args.query, evidence=evidence) - - # Step 3: Display Results - print("\n" + "=" * 60) - print("ANALYSIS RESULTS") - print("=" * 60) - print(f"\nVerdict: {result.verdict}") - print(f"Confidence: {result.confidence:.0%}") - print("\nKey Findings:") - for finding in result.key_findings: - print(f" - {finding}") - - print("\n[Demo Complete - Code executed in Modal, not locally]") - - -if __name__ == "__main__": - asyncio.run(main()) -``` - ---- - -## 6. TDD Test Suite - -### 6.1 Unit Tests (`tests/unit/services/test_statistical_analyzer.py`) - -```python -"""Unit tests for StatisticalAnalyzer service.""" - -from unittest.mock import AsyncMock, MagicMock, patch - -import pytest - -from src.services.statistical_analyzer import ( - AnalysisResult, - StatisticalAnalyzer, - get_statistical_analyzer, -) -from src.utils.models import Citation, Evidence - - -@pytest.fixture -def sample_evidence() -> list[Evidence]: - """Sample evidence for testing.""" - return [ - Evidence( - content="Metformin shows effect size of 0.45.", - citation=Citation( - source="pubmed", - title="Metformin Study", - url="https://pubmed.ncbi.nlm.nih.gov/12345/", - date="2024-01-15", - authors=["Smith J"], - ), - relevance=0.9, - ) - ] - - -class TestStatisticalAnalyzer: - """Tests for StatisticalAnalyzer (no agent_framework dependency).""" - - def test_no_agent_framework_import(self) -> None: - """StatisticalAnalyzer must NOT import agent_framework.""" - import src.services.statistical_analyzer as module - - # Check module doesn't import agent_framework - source = open(module.__file__).read() - assert "agent_framework" not in source - assert "BaseAgent" not in source - - @pytest.mark.asyncio - async def test_analyze_returns_result( - self, sample_evidence: list[Evidence] - ) -> None: - """analyze() should return AnalysisResult.""" - analyzer = StatisticalAnalyzer() - - with patch.object(analyzer, "_get_agent") as mock_agent, \ - patch.object(analyzer, "_get_code_executor") as mock_executor: - - # Mock LLM - mock_agent.return_value.run = AsyncMock( - return_value=MagicMock(output="print('SUPPORTED')") - ) - - # Mock Modal - mock_executor.return_value.execute.return_value = { - "stdout": "SUPPORTED\np-value: 0.01", - "stderr": "", - "success": True, - } - - result = await 
analyzer.analyze("test query", sample_evidence) - - assert isinstance(result, AnalysisResult) - assert result.verdict == "SUPPORTED" - - def test_singleton(self) -> None: - """get_statistical_analyzer should return singleton.""" - a1 = get_statistical_analyzer() - a2 = get_statistical_analyzer() - assert a1 is a2 - - -class TestAnalysisResult: - """Tests for AnalysisResult model.""" - - def test_verdict_values(self) -> None: - """Verdict should be one of the expected values.""" - for verdict in ["SUPPORTED", "REFUTED", "INCONCLUSIVE"]: - result = AnalysisResult( - verdict=verdict, - confidence=0.8, - statistical_evidence="test", - code_generated="print('test')", - execution_output="test", - ) - assert result.verdict == verdict - - def test_confidence_bounds(self) -> None: - """Confidence must be 0.0-1.0.""" - with pytest.raises(ValueError): - AnalysisResult( - verdict="SUPPORTED", - confidence=1.5, # Invalid - statistical_evidence="test", - code_generated="test", - execution_output="test", - ) -``` - -### 6.2 Integration Test (`tests/integration/test_modal.py`) - -```python -"""Integration tests for Modal (requires credentials).""" - -import pytest - -from src.utils.config import settings - - -@pytest.mark.integration -@pytest.mark.skipif(not settings.modal_available, reason="Modal not configured") -class TestModalIntegration: - """Integration tests requiring Modal credentials.""" - - @pytest.mark.asyncio - async def test_sandbox_executes_code(self) -> None: - """Modal sandbox should execute Python code.""" - import asyncio - from functools import partial - - from src.tools.code_execution import get_code_executor - - executor = get_code_executor() - code = "import pandas as pd; print(pd.DataFrame({'a': [1,2,3]})['a'].sum())" - - loop = asyncio.get_running_loop() - result = await loop.run_in_executor( - None, partial(executor.execute, code, timeout=30) - ) - - assert result["success"] - assert "6" in result["stdout"] - - @pytest.mark.asyncio - async def test_statistical_analyzer_works(self) -> None: - """StatisticalAnalyzer should work end-to-end.""" - from src.services.statistical_analyzer import get_statistical_analyzer - from src.utils.models import Citation, Evidence - - evidence = [ - Evidence( - content="Drug shows 40% improvement in trial.", - citation=Citation( - source="pubmed", - title="Test", - url="https://test.com", - date="2024-01-01", - authors=["Test"], - ), - relevance=0.9, - ) - ] - - analyzer = get_statistical_analyzer() - result = await analyzer.analyze("test drug efficacy", evidence) - - assert result.verdict in ["SUPPORTED", "REFUTED", "INCONCLUSIVE"] - assert 0.0 <= result.confidence <= 1.0 -``` - ---- - -## 7. Verification Commands - -```bash -# 1. Verify NO agent_framework in StatisticalAnalyzer -grep -r "agent_framework" src/services/statistical_analyzer.py -# Should return nothing! - -# 2. Run unit tests (no Modal needed) -uv run pytest tests/unit/services/test_statistical_analyzer.py -v - -# 3. Run verification script (requires Modal) -uv run python examples/modal_demo/verify_sandbox.py - -# 4. Run analysis demo (requires Modal + LLM) -uv run python examples/modal_demo/run_analysis.py "metformin alzheimer" - -# 5. Run integration tests -uv run pytest tests/integration/test_modal.py -v -m integration - -# 6. Full test suite -make check -``` - ---- - -## 8. 
Definition of Done - -Phase 13 is **COMPLETE** when: - -- [ ] `src/services/statistical_analyzer.py` created (NO agent_framework) -- [ ] `src/utils/config.py` has `enable_modal_analysis` setting -- [ ] `src/orchestrator.py` uses `StatisticalAnalyzer` directly -- [ ] `src/agents/analysis_agent.py` refactored to wrap `StatisticalAnalyzer` -- [ ] `src/mcp_tools.py` has `analyze_hypothesis` tool -- [ ] `examples/modal_demo/verify_sandbox.py` working -- [ ] `examples/modal_demo/run_analysis.py` working -- [ ] Unit tests pass WITHOUT magentic extra installed -- [ ] Integration tests pass WITH Modal credentials -- [ ] All lints pass - ---- - -## 9. Architecture After Phase 13 - -```text -┌─────────────────────────────────────────────────────────────────┐ -│ MCP Clients │ -│ (Claude Desktop, Cursor, etc.) │ -└───────────────────────────┬─────────────────────────────────────┘ - │ MCP Protocol - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ Gradio App + MCP Server │ -│ ┌──────────────────────────────────────────────────────────┐ │ -│ │ MCP Tools: search_pubmed, search_trials, search_biorxiv │ │ -│ │ search_all, analyze_hypothesis │ │ -│ └──────────────────────────────────────────────────────────┘ │ -└───────────────────────────┬─────────────────────────────────────┘ - │ - ┌───────────────────┴───────────────────┐ - │ │ - ▼ ▼ -┌───────────────────────┐ ┌───────────────────────────┐ -│ Simple Orchestrator │ │ Magentic Orchestrator │ -│ (no agent_framework) │ │ (with agent_framework) │ -│ │ │ │ -│ SearchHandler │ │ SearchAgent │ -│ JudgeHandler │ │ JudgeAgent │ -│ StatisticalAnalyzer ─┼────────────┼→ AnalysisAgent ───────────┤ -│ │ │ (wraps StatisticalAnalyzer) -└───────────┬───────────┘ └───────────────────────────┘ - │ - ▼ -┌──────────────────────────────────────────────────────────────────┐ -│ StatisticalAnalyzer │ -│ (src/services/statistical_analyzer.py) │ -│ NO agent_framework dependency │ -│ │ -│ 1. Generate code with pydantic-ai │ -│ 2. Execute in Modal sandbox │ -│ 3. Return AnalysisResult │ -└───────────────────────────┬──────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ Modal Sandbox │ -│ ┌─────────────────────────────────────────────────────────┐ │ -│ │ - pandas, numpy, scipy, sklearn, statsmodels │ │ -│ │ - Network: BLOCKED │ │ -│ │ - Filesystem: ISOLATED │ │ -│ │ - Timeout: ENFORCED │ │ -│ └─────────────────────────────────────────────────────────┘ │ -└─────────────────────────────────────────────────────────────────┘ -``` - -**This is the dependency-safe Modal stack.** - ---- - -## 10. Files Summary - -| File | Action | Purpose | -|------|--------|---------| -| `src/services/statistical_analyzer.py` | **CREATE** | Core analysis (no agent_framework) | -| `src/utils/config.py` | MODIFY | Add `enable_modal_analysis` | -| `src/orchestrator.py` | MODIFY | Use `StatisticalAnalyzer` | -| `src/agents/analysis_agent.py` | MODIFY | Wrap `StatisticalAnalyzer` | -| `src/mcp_tools.py` | MODIFY | Add `analyze_hypothesis` | -| `examples/modal_demo/verify_sandbox.py` | CREATE | Sandbox verification | -| `examples/modal_demo/run_analysis.py` | CREATE | Demo script | -| `tests/unit/services/test_statistical_analyzer.py` | CREATE | Unit tests | -| `tests/integration/test_modal.py` | CREATE | Integration tests | - -**Key Fix**: `StatisticalAnalyzer` has ZERO agent_framework imports, making it safe for the simple orchestrator. 
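
As an extra guard on that key fix, here is a minimal sketch (hypothetical, not part of the spec above) of verifying the decoupling at import time instead of grepping source text — if `StatisticalAnalyzer` ever gains a transitive `agent_framework` import, this fails even when the string never appears in the file:

```python
"""Hypothetical import-time check for the agent_framework decoupling."""

import sys


def test_import_does_not_load_agent_framework() -> None:
    """Importing the analyzer service must not pull in agent_framework."""
    # Drop any copies loaded by earlier tests so the check is meaningful.
    for name in [m for m in sys.modules if m.startswith("agent_framework")]:
        del sys.modules[name]

    import src.services.statistical_analyzer  # noqa: F401

    loaded = [m for m in sys.modules if m.startswith("agent_framework")]
    assert not loaded, f"agent_framework was imported transitively: {loaded}"
```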
diff --git a/docs/implementation/14_phase_demo_submission.md b/docs/implementation/14_phase_demo_submission.md deleted file mode 100644 index 3dee9bc235fbe58e5aea4e0b48135bf4b08d4da5..0000000000000000000000000000000000000000 --- a/docs/implementation/14_phase_demo_submission.md +++ /dev/null @@ -1,464 +0,0 @@ -# Phase 14 Implementation Spec: Demo Video & Hackathon Submission - -**Goal**: Create compelling demo video and complete hackathon submission. -**Philosophy**: "Ship it with style." -**Prerequisite**: Phases 12-13 complete (MCP + Modal working) -**Priority**: P0 - REQUIRED FOR SUBMISSION -**Deadline**: November 30, 2025 11:59 PM UTC -**Estimated Time**: 2-3 hours - ---- - -## 1. Submission Requirements - -### MCP's 1st Birthday Hackathon Checklist - -| Requirement | Status | Action | -|-------------|--------|--------| -| HuggingFace Space in `MCP-1st-Birthday` org | Pending | Transfer or create | -| Track tag in README.md | Pending | Add tag | -| Social media post link | Pending | Create post | -| Demo video (1-5 min) | Pending | Record | -| Team members registered | Pending | Verify | -| Original work (Nov 14-30) | **DONE** | All commits in range | - -### Track 2: MCP in Action - Tags - -```yaml -# Add to HuggingFace Space README.md -tags: - - mcp-in-action-track-enterprise # Healthcare/enterprise focus -``` - ---- - -## 2. Prize Eligibility Summary - -### After Phases 12-13 - -| Award | Amount | Eligible | Requirements Met | -|-------|--------|----------|------------------| -| Track 2: MCP in Action (1st) | $2,500 | **YES** | MCP server working | -| Modal Innovation | $2,500 | **YES** | Sandbox demo ready | -| LlamaIndex | $1,000 | **YES** | Using RAG | -| Community Choice | $1,000 | Possible | Need great demo | -| **Total Potential** | **$7,000** | | | - ---- - -## 3. Demo Video Specification - -### 3.1 Duration & Format - -- **Length**: 3-4 minutes (sweet spot) -- **Format**: Screen recording + voice-over -- **Resolution**: 1080p minimum -- **Audio**: Clear narration, no background music - -### 3.2 Recommended Tools - -| Tool | Purpose | Notes | -|------|---------|-------| -| OBS Studio | Screen recording | Free, cross-platform | -| Loom | Quick recording | Good for demos | -| QuickTime | Mac screen recording | Built-in | -| DaVinci Resolve | Editing | Free, professional | - -### 3.3 Demo Script (4 minutes) - -```markdown -## Section 1: Hook (30 seconds) - -[Show Gradio UI] - -"DeepCritical is an AI-powered drug repurposing research agent. -It searches peer-reviewed literature, clinical trials, and cutting-edge preprints -to find new uses for existing drugs." - -"Let me show you how it works." - ---- - -## Section 2: Core Functionality (60 seconds) - -[Type query: "Can metformin treat Alzheimer's disease?"] - -"When I ask about metformin for Alzheimer's, DeepCritical: -1. Searches PubMed for peer-reviewed papers -2. Queries ClinicalTrials.gov for active trials -3. Scans bioRxiv for the latest preprints" - -[Show search results streaming] - -"It then uses an LLM to assess the evidence quality and -synthesize findings into a structured research report." - -[Show final report] - ---- - -## Section 3: MCP Integration (60 seconds) - -[Switch to Claude Desktop] - -"What makes DeepCritical unique is full MCP integration. -These same tools are available to any MCP client." 
- -[Show Claude Desktop with DeepCritical tools] - -"I can ask Claude: 'Search PubMed for aspirin cancer prevention'" - -[Show results appearing in Claude Desktop] - -"The agent uses our MCP server to search real biomedical databases." - -[Show MCP Inspector briefly] - -"Here's the MCP schema - four tools exposed for any AI to use." - ---- - -## Section 4: Modal Innovation (45 seconds) - -[Run verify_sandbox.py] - -"For statistical analysis, we use Modal for secure code execution." - -[Show sandbox verification output] - -"Notice the hostname is NOT my machine - code runs in an isolated container. -Network is blocked. The AI can't reach the internet from the sandbox." - -[Run analysis demo] - -"Modal executes LLM-generated statistical code safely, -returning verdicts like SUPPORTED, REFUTED, or INCONCLUSIVE." - ---- - -## Section 5: Close (45 seconds) - -[Return to Gradio UI] - -"DeepCritical brings together: -- Three biomedical data sources -- MCP protocol for universal tool access -- Modal sandboxes for safe code execution -- LlamaIndex for semantic search - -All in a beautiful Gradio interface." - -"Check out the code on GitHub, try it on HuggingFace Spaces, -and let us know what you think." - -"Thanks for watching!" - -[Show links: GitHub, HuggingFace, Team names] -``` - ---- - -## 4. HuggingFace Space Configuration - -### 4.1 Space README.md - -```markdown ---- -title: DeepCritical -emoji: 🧬 -colorFrom: blue -colorTo: purple -sdk: gradio -sdk_version: "5.0.0" -app_file: src/app.py -pinned: false -license: mit -tags: - - mcp-in-action-track-enterprise - - mcp-hackathon - - drug-repurposing - - biomedical-ai - - pydantic-ai - - llamaindex - - modal ---- - -# DeepCritical - -AI-Powered Drug Repurposing Research Agent - -## Features - -- **Multi-Source Search**: PubMed, ClinicalTrials.gov, bioRxiv/medRxiv -- **MCP Integration**: Use our tools from Claude Desktop or any MCP client -- **Modal Sandbox**: Secure execution of AI-generated statistical code -- **LlamaIndex RAG**: Semantic search and evidence synthesis - -## MCP Tools - -Connect to our MCP server at: -``` -https://MCP-1st-Birthday-deepcritical.hf.space/gradio_api/mcp/ -``` - -Available tools: -- `search_pubmed` - Search peer-reviewed biomedical literature -- `search_clinical_trials` - Search ClinicalTrials.gov -- `search_biorxiv` - Search bioRxiv/medRxiv preprints -- `search_all` - Search all sources simultaneously - -## Team - -- The-Obstacle-Is-The-Way -- MarioAderman - -## Links - -- [GitHub Repository](https://github.com/The-Obstacle-Is-The-Way/DeepCritical-1) -- [Demo Video](link-to-video) -``` - -### 4.2 Environment Variables (Secrets) - -Set in HuggingFace Space settings: - -``` -OPENAI_API_KEY=sk-... -ANTHROPIC_API_KEY=sk-ant-... -NCBI_API_KEY=... -MODAL_TOKEN_ID=... -MODAL_TOKEN_SECRET=... -``` - ---- - -## 5. Social Media Post - -### Twitter/X Template - -``` -🧬 Excited to submit DeepCritical to MCP's 1st Birthday Hackathon! - -An AI agent that: -✅ Searches PubMed, ClinicalTrials.gov & bioRxiv -✅ Exposes tools via MCP protocol -✅ Runs statistical code in Modal sandboxes -✅ Uses LlamaIndex for semantic search - -Try it: [HuggingFace link] -Demo: [Video link] - -#MCPHackathon #AIAgents #DrugRepurposing @huggingface @AnthropicAI -``` - -### LinkedIn Template - -``` -Thrilled to share DeepCritical - our submission to MCP's 1st Birthday Hackathon! 
- -🔬 What it does: -DeepCritical is an AI-powered drug repurposing research agent that searches -peer-reviewed literature, clinical trials, and preprints to find new uses -for existing drugs. - -🛠️ Technical highlights: -• Full MCP integration - tools work with Claude Desktop -• Modal sandboxes for secure AI-generated code execution -• LlamaIndex RAG for semantic evidence search -• Three biomedical data sources in parallel - -Built with PydanticAI, Gradio, and deployed on HuggingFace Spaces. - -Try it: [link] -Watch the demo: [link] - -#ArtificialIntelligence #Healthcare #DrugDiscovery #MCP #Hackathon -``` - ---- - -## 6. Pre-Submission Checklist - -### 6.1 Code Quality - -```bash -# Run all checks -make check - -# Expected output: -# ✅ Linting passed (ruff) -# ✅ Type checking passed (mypy) -# ✅ All 80+ tests passed (pytest) -``` - -### 6.2 Documentation - -- [ ] README.md updated with MCP instructions -- [ ] All demo scripts have docstrings -- [ ] Example files work end-to-end -- [ ] CLAUDE.md is current - -### 6.3 Deployment Verification - -```bash -# Test locally -uv run python src/app.py -# Visit http://localhost:7860 - -# Test MCP schema -curl http://localhost:7860/gradio_api/mcp/schema - -# Test Modal (if configured) -uv run python examples/modal_demo/verify_sandbox.py -``` - -### 6.4 HuggingFace Space - -- [ ] Space created in `MCP-1st-Birthday` organization -- [ ] Secrets configured (API keys) -- [ ] App starts without errors -- [ ] MCP endpoint accessible -- [ ] Track tag in README - ---- - -## 7. Recording Checklist - -### Before Recording - -- [ ] Close unnecessary apps/notifications -- [ ] Clear browser history/tabs -- [ ] Test all demos work -- [ ] Prepare terminal windows -- [ ] Write down talking points - -### During Recording - -- [ ] Speak clearly and at moderate pace -- [ ] Pause briefly between sections -- [ ] Show your face? (optional, adds personality) -- [ ] Don't rush - 3-4 min is enough time - -### After Recording - -- [ ] Watch playback for errors -- [ ] Trim dead air at start/end -- [ ] Add title/end cards -- [ ] Export at 1080p -- [ ] Upload to YouTube/Loom - ---- - -## 8. Submission Steps - -### Step 1: Finalize Code - -```bash -# Ensure clean state -git status -make check - -# Push to GitHub -git push origin main - -# Sync to HuggingFace -git push huggingface-upstream main -``` - -### Step 2: Verify HuggingFace Space - -1. Visit Space URL -2. Test the chat interface -3. Test MCP endpoint: `/gradio_api/mcp/schema` -4. Verify README has track tag - -### Step 3: Record Demo Video - -1. Follow script from Section 3.3 -2. Edit and export -3. Upload to YouTube (unlisted) or Loom -4. Copy shareable link - -### Step 4: Create Social Post - -1. Write post (see templates) -2. Include video link -3. Tag relevant accounts -4. Post and copy link - -### Step 5: Submit - -1. Ensure Space is in `MCP-1st-Birthday` org -2. Verify track tag in README -3. Submit entry (check hackathon page for form) -4. Include all links - ---- - -## 9. Verification Commands - -```bash -# 1. Full test suite -make check - -# 2. Start local server -uv run python src/app.py - -# 3. Verify MCP works -curl http://localhost:7860/gradio_api/mcp/schema | jq - -# 4. Test with MCP Inspector -npx @anthropic/mcp-inspector http://localhost:7860/gradio_api/mcp/ - -# 5. Run Modal verification -uv run python examples/modal_demo/verify_sandbox.py - -# 6. Run full demo -uv run python examples/orchestrator_demo/run_agent.py "metformin alzheimer" -``` - ---- - -## 10. 
Definition of Done - -Phase 14 is **COMPLETE** when: - -- [ ] Demo video recorded (3-4 min) -- [ ] Video uploaded (YouTube/Loom) -- [ ] Social media post created with link -- [ ] HuggingFace Space in `MCP-1st-Birthday` org -- [ ] Track tag in Space README -- [ ] All team members registered -- [ ] Entry submitted before deadline -- [ ] Confirmation received - ---- - -## 11. Timeline - -| Task | Time | Deadline | -|------|------|----------| -| Phase 12: MCP Server | 2-3 hours | Nov 28 | -| Phase 13: Modal Integration | 2-3 hours | Nov 29 | -| Phase 14: Demo & Submit | 2-3 hours | Nov 30 | -| **Buffer** | ~24 hours | Before 11:59 PM UTC | - ---- - -## 12. Contact & Support - -### Hackathon Resources - -- Discord: `#agents-mcp-hackathon-winter25` -- HuggingFace: [MCP-1st-Birthday org](https://huggingface.co/MCP-1st-Birthday) -- MCP Docs: [modelcontextprotocol.io](https://modelcontextprotocol.io/) - -### Team Communication - -- Coordinate on final review -- Agree on who submits -- Celebrate when done! 🎉 - ---- - -**Good luck! Ship it with confidence.** diff --git a/docs/implementation/roadmap.md b/docs/implementation/roadmap.md deleted file mode 100644 index 1f4862e9ee898881d04dbecd8c27b8bc4848fd61..0000000000000000000000000000000000000000 --- a/docs/implementation/roadmap.md +++ /dev/null @@ -1,247 +0,0 @@ -# Implementation Roadmap: DeepCritical (Vertical Slices) - -**Philosophy:** AI-Native Engineering, Vertical Slice Architecture, TDD, Modern Tooling (2025). - -This roadmap defines the execution strategy to deliver **DeepCritical** effectively. We reject "overplanning" in favor of **ironclad, testable vertical slices**. Each phase delivers a fully functional slice of end-to-end value. - ---- - -## The 2025 "Gucci" Tooling Stack - -We are using the bleeding edge of Python engineering to ensure speed, safety, and developer joy. - -| Category | Tool | Why? | -|----------|------|------| -| **Package Manager** | **`uv`** | Rust-based, 10-100x faster than pip/poetry. Manages python versions, venvs, and deps. | -| **Linting/Format** | **`ruff`** | Rust-based, instant. Replaces black, isort, flake8. | -| **Type Checking** | **`mypy`** | Strict static typing. Run via `uv run mypy`. | -| **Testing** | **`pytest`** | The standard. | -| **Test Plugins** | **`pytest-sugar`** | Instant feedback, progress bars. "Gucci" visuals. | -| **Test Plugins** | **`pytest-asyncio`** | Essential for our async agent loop. | -| **Test Plugins** | **`pytest-cov`** | Coverage reporting to ensure TDD adherence. | -| **Git Hooks** | **`pre-commit`** | Enforce ruff/mypy before commit. | - ---- - -## Architecture: Vertical Slices - -Instead of horizontal layers (e.g., "Building the Database Layer"), we build **Vertical Slices**. -Each slice implements a feature from **Entry Point (UI/API) -> Logic -> Data/External**. 
- -### Directory Structure (Maintainer's Structure) - -```bash -src/ -├── app.py # Entry point (Gradio UI) -├── orchestrator.py # Agent loop (Search -> Judge -> Loop) -├── agent_factory/ # Agent creation and judges -│ ├── __init__.py -│ ├── agents.py # PydanticAI agent definitions -│ └── judges.py # JudgeHandler for evidence assessment -├── tools/ # Search tools -│ ├── __init__.py -│ ├── pubmed.py # PubMed E-utilities tool -│ ├── clinicaltrials.py # ClinicalTrials.gov API -│ ├── biorxiv.py # bioRxiv/medRxiv preprints -│ ├── code_execution.py # Modal sandbox execution -│ └── search_handler.py # Orchestrates multiple tools -├── prompts/ # Prompt templates -│ ├── __init__.py -│ └── judge.py # Judge prompts -├── utils/ # Shared utilities -│ ├── __init__.py -│ ├── config.py # Settings/configuration -│ ├── exceptions.py # Custom exceptions -│ ├── models.py # Shared Pydantic models -│ ├── dataloaders.py # Data loading utilities -│ └── parsers.py # Parsing utilities -├── middleware/ # (Future: middleware components) -├── database_services/ # (Future: database integrations) -└── retrieval_factory/ # (Future: RAG components) - -tests/ -├── unit/ -│ ├── tools/ -│ │ ├── test_pubmed.py -│ │ ├── test_clinicaltrials.py -│ │ ├── test_biorxiv.py -│ │ └── test_search_handler.py -│ ├── agent_factory/ -│ │ └── test_judges.py -│ └── test_orchestrator.py -└── integration/ - └── test_pubmed_live.py -``` - ---- - -## Phased Execution Plan - -### **Phase 1: Foundation & Tooling (Day 1)** - -*Goal: A rock-solid, CI-ready environment with `uv` and `pytest` configured.* - -- [ ] Initialize `pyproject.toml` with `uv`. -- [ ] Configure `ruff` (strict) and `mypy` (strict). -- [ ] Set up `pytest` with sugar and coverage. -- [ ] Implement `src/utils/config.py` (Configuration Slice). -- [ ] Implement `src/utils/exceptions.py` (Custom exceptions). -- **Deliverable**: A repo that passes CI with `uv run pytest`. - -### **Phase 2: The "Search" Vertical Slice (Day 2)** - -*Goal: Agent can receive a query and get raw results from PubMed/Web.* - -- [ ] **TDD**: Write test for `SearchHandler`. -- [ ] Implement `src/tools/pubmed.py` (PubMed E-utilities). -- [ ] Implement `src/tools/websearch.py` (DuckDuckGo). -- [ ] Implement `src/tools/search_handler.py` (Orchestrates tools). -- [ ] Implement `src/utils/models.py` (Evidence, Citation, SearchResult). -- **Deliverable**: Function that takes "long covid" -> returns `List[Evidence]`. - -### **Phase 3: The "Judge" Vertical Slice (Day 3)** - -*Goal: Agent can decide if evidence is sufficient.* - -- [ ] **TDD**: Write test for `JudgeHandler` (Mocked LLM). -- [ ] Implement `src/prompts/judge.py` (Structured outputs). -- [ ] Implement `src/agent_factory/judges.py` (LLM interaction). -- **Deliverable**: Function that takes `List[Evidence]` -> returns `JudgeAssessment`. - -### **Phase 4: The "Loop" & UI Slice (Day 4)** - -*Goal: End-to-End User Value.* - -- [ ] Implement `src/orchestrator.py` (Connects Search + Judge loops). -- [ ] Build `src/app.py` (Gradio with Streaming). -- **Deliverable**: Working DeepCritical Agent on HuggingFace. - ---- - -### **Phase 5: Magentic Integration** ✅ COMPLETE - -*Goal: Upgrade orchestrator to use Microsoft Agent Framework patterns.* - -- [x] Wrap SearchHandler as `AgentProtocol` (SearchAgent) with strict protocol compliance. -- [x] Wrap JudgeHandler as `AgentProtocol` (JudgeAgent) with strict protocol compliance. -- [x] Implement `MagenticOrchestrator` using `MagenticBuilder`. -- [x] Create factory pattern for switching implementations. 
-- **Deliverable**: Same API, better multi-agent orchestration engine. - ---- - -### **Phase 6: Embeddings & Semantic Search** - -*Goal: Add vector search for semantic evidence retrieval.* - -- [ ] Implement `EmbeddingService` with ChromaDB. -- [ ] Add semantic deduplication to SearchAgent. -- [ ] Enable semantic search for related evidence. -- [ ] Store embeddings in shared context. -- **Deliverable**: Find semantically related papers, not just keyword matches. - ---- - -### **Phase 7: Hypothesis Agent** - -*Goal: Generate scientific hypotheses to guide targeted searches.* - -- [ ] Implement `MechanismHypothesis` and `HypothesisAssessment` models. -- [ ] Implement `HypothesisAgent` for mechanistic reasoning. -- [ ] Add hypothesis-driven search queries. -- [ ] Integrate into Magentic workflow. -- **Deliverable**: Drug → Target → Pathway → Effect hypotheses that guide research. - ---- - -### **Phase 8: Report Agent** - -*Goal: Generate structured scientific reports with proper citations.* - -- [ ] Implement `ResearchReport` model with all sections. -- [ ] Implement `ReportAgent` for synthesis. -- [ ] Include methodology, limitations, formatted references. -- [ ] Integrate as final synthesis step in Magentic workflow. -- **Deliverable**: Publication-quality research reports. - ---- - -## Complete Architecture (Phases 1-8) - -```text -User Query - ↓ -Gradio UI (Phase 4) - ↓ -Magentic Manager (Phase 5) - ├── SearchAgent (Phase 2+5) ←→ PubMed + Web + VectorDB (Phase 6) - ├── HypothesisAgent (Phase 7) ←→ Mechanistic Reasoning - ├── JudgeAgent (Phase 3+5) ←→ Evidence Assessment - └── ReportAgent (Phase 8) ←→ Final Synthesis - ↓ -Structured Research Report -``` - ---- - -## Spec Documents - -### Core Platform (Phases 1-8) - -1. **[Phase 1 Spec: Foundation](01_phase_foundation.md)** ✅ -2. **[Phase 2 Spec: Search Slice](02_phase_search.md)** ✅ -3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)** ✅ -4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)** ✅ -5. **[Phase 5 Spec: Magentic Integration](05_phase_magentic.md)** ✅ -6. **[Phase 6 Spec: Embeddings & Semantic Search](06_phase_embeddings.md)** ✅ -7. **[Phase 7 Spec: Hypothesis Agent](07_phase_hypothesis.md)** ✅ -8. **[Phase 8 Spec: Report Agent](08_phase_report.md)** ✅ - -### Multi-Source Search (Phases 9-11) - -9. **[Phase 9 Spec: Remove DuckDuckGo](09_phase_source_cleanup.md)** ✅ -10. **[Phase 10 Spec: ClinicalTrials.gov](10_phase_clinicaltrials.md)** ✅ -11. **[Phase 11 Spec: bioRxiv Preprints](11_phase_biorxiv.md)** ✅ - -### Hackathon Integration (Phases 12-14) - -12. **[Phase 12 Spec: MCP Server](12_phase_mcp_server.md)** ✅ COMPLETE -13. **[Phase 13 Spec: Modal Pipeline](13_phase_modal_integration.md)** 📝 P1 - $2,500 -14. 
**[Phase 14 Spec: Demo & Submission](14_phase_demo_submission.md)** 📝 P0 - REQUIRED - ---- - -## Progress Summary - -| Phase | Status | Deliverable | -|-------|--------|-------------| -| Phase 1: Foundation | ✅ COMPLETE | CI-ready repo with uv/pytest | -| Phase 2: Search | ✅ COMPLETE | PubMed + Web search | -| Phase 3: Judge | ✅ COMPLETE | LLM evidence assessment | -| Phase 4: UI & Loop | ✅ COMPLETE | Working Gradio app | -| Phase 5: Magentic | ✅ COMPLETE | Multi-agent orchestration | -| Phase 6: Embeddings | ✅ COMPLETE | Semantic search + ChromaDB | -| Phase 7: Hypothesis | ✅ COMPLETE | Mechanistic reasoning chains | -| Phase 8: Report | ✅ COMPLETE | Structured scientific reports | -| Phase 9: Source Cleanup | ✅ COMPLETE | Remove DuckDuckGo | -| Phase 10: ClinicalTrials | ✅ COMPLETE | ClinicalTrials.gov API | -| Phase 11: bioRxiv | ✅ COMPLETE | Preprint search | -| Phase 12: MCP Server | ✅ COMPLETE | MCP protocol integration | -| Phase 13: Modal Pipeline | 📝 SPEC READY | Sandboxed code execution | -| Phase 14: Demo & Submit | 📝 SPEC READY | Hackathon submission | - -*Phases 1-12 COMPLETE. Phases 13-14 for hackathon prizes.* - ---- - -## Hackathon Prize Potential - -| Award | Amount | Requirement | Phase | -|-------|--------|-------------|-------| -| Track 2: MCP in Action (1st) | $2,500 | MCP server working | 12 | -| Modal Innovation | $2,500 | Sandbox demo ready | 13 | -| LlamaIndex | $1,000 | Using RAG | ✅ Done | -| Community Choice | $1,000 | Great demo video | 14 | -| **Total Potential** | **$7,000** | | | - -**Deadline: November 30, 2025 11:59 PM UTC** diff --git a/docs/index.md b/docs/index.md index 400ddfa44d974f61407c1754bfe57e5d6dfedace..43c57235acdba5ff7dacf09ad33960b133bcbaa3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,92 +1,63 @@ -# DeepCritical Documentation +# DeepCritical -## Medical Drug Repurposing Research Agent +**AI-Native Drug Repurposing Research Agent** -AI-powered deep research system for accelerating drug repurposing discovery. +DeepCritical is a deep research agent system that uses iterative search-and-judge loops to comprehensively answer research questions. The system supports multiple orchestration patterns, graph-based execution, parallel research workflows, and long-running task management with real-time streaming. 
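For orientation, the search-and-judge loop described above can be sketched in a few lines of Python. This is a minimal illustration only; `search`, `judge`, and `synthesize` are hypothetical callables standing in for the project's search tools, judge handler, and report synthesis, not the actual DeepCritical API.

```python
from collections.abc import Callable, Sequence
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """Accumulated evidence and iteration count for one research loop."""

    evidence: list[str] = field(default_factory=list)
    iterations: int = 0


def run_search_judge_loop(
    question: str,
    search: Callable[[str, Sequence[str]], list[str]],
    judge: Callable[[str, Sequence[str]], bool],
    synthesize: Callable[[str, Sequence[str]], str],
    max_iterations: int = 3,
) -> str:
    """Search, judge sufficiency, and repeat until satisfied or the budget is spent."""
    state = LoopState()
    while state.iterations < max_iterations:
        state.evidence.extend(search(question, state.evidence))  # gather new evidence
        state.iterations += 1
        if judge(question, state.evidence):  # evidence judged sufficient -> stop early
            break
    return synthesize(question, state.evidence)  # final report text
```

The real orchestrators layer budget tracking, event streaming, and parallel per-section loops on top of this basic cycle.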
---- - -## Quick Links - -### Architecture -- **[Overview](architecture/overview.md)** - Project overview, use case, architecture -- **[Design Patterns](architecture/design-patterns.md)** - Technical patterns, data models - -### Implementation -- **[Roadmap](implementation/roadmap.md)** - Phased execution plan with TDD -- **[Phase 1: Foundation](implementation/01_phase_foundation.md)** ✅ - Tooling, config, first tests -- **[Phase 2: Search](implementation/02_phase_search.md)** ✅ - PubMed search -- **[Phase 3: Judge](implementation/03_phase_judge.md)** ✅ - LLM evidence assessment -- **[Phase 4: UI](implementation/04_phase_ui.md)** ✅ - Orchestrator + Gradio -- **[Phase 5: Magentic](implementation/05_phase_magentic.md)** ✅ - Multi-agent orchestration -- **[Phase 6: Embeddings](implementation/06_phase_embeddings.md)** ✅ - Semantic search + dedup -- **[Phase 7: Hypothesis](implementation/07_phase_hypothesis.md)** ✅ - Mechanistic reasoning -- **[Phase 8: Report](implementation/08_phase_report.md)** ✅ - Structured scientific reports -- **[Phase 9: Source Cleanup](implementation/09_phase_source_cleanup.md)** ✅ - Remove DuckDuckGo -- **[Phase 10: ClinicalTrials](implementation/10_phase_clinicaltrials.md)** ✅ - Clinical trials API -- **[Phase 11: bioRxiv](implementation/11_phase_biorxiv.md)** ✅ - Preprint search -- **[Phase 12: MCP Server](implementation/12_phase_mcp_server.md)** ✅ - Claude Desktop integration -- **[Phase 13: Modal Integration](implementation/13_phase_modal_integration.md)** ✅ - Secure code execution -- **[Phase 14: Demo Submission](implementation/14_phase_demo_submission.md)** ✅ - Hackathon submission - -### Guides -- **[Deployment Guide](guides/deployment.md)** - Gradio, MCP, and Modal launch steps - -### Development -- **[Testing Strategy](development/testing.md)** - Unit, Integration, and E2E testing patterns +## Features ---- +- **Multi-Source Search**: PubMed, ClinicalTrials.gov, Europe PMC (includes bioRxiv/medRxiv) +- **MCP Integration**: Use our tools from Claude Desktop or any MCP client +- **HuggingFace OAuth**: Sign in with your HuggingFace account to automatically use your API token +- **Modal Sandbox**: Secure execution of AI-generated statistical code +- **LlamaIndex RAG**: Semantic search and evidence synthesis +- **HuggingFace Inference**: Free tier support with automatic fallback +- **Strongly Typed Composable Graphs**: Graph-based orchestration with Pydantic AI +- **Specialized Research Teams of Agents**: Multi-agent coordination for complex research tasks -## What We're Building +## Quick Start -**One-liner**: AI agent that searches medical literature to find existing drugs that might treat new diseases. +```bash +# Install uv if you haven't already +pip install uv -**Example Query**: -> "What existing drugs might help treat long COVID fatigue?" +# Sync dependencies +uv sync -**Output**: Research report with drug candidates, mechanisms, evidence quality, and citations. +# Start the Gradio app +uv run gradio run src/app.py +``` ---- +Open your browser to `http://localhost:7860`. -## Architecture Summary +For detailed installation and setup instructions, see the [Getting Started Guide](getting-started/installation.md). -``` -User Question → Research Agent (Orchestrator) - ↓ - Search Loop: - → Tools (PubMed, ClinicalTrials, bioRxiv) - → Judge (Quality + Budget) - → Repeat or Synthesize - ↓ - Research Report with Citations -``` +## Architecture ---- +DeepCritical uses a Vertical Slice Architecture: -## Features +1. 
**Search Slice**: Retrieving evidence from PubMed, ClinicalTrials.gov, and Europe PMC +2. **Judge Slice**: Evaluating evidence quality using LLMs +3. **Orchestrator Slice**: Managing the research loop and UI -| Feature | Status | Description | -|---------|--------|-------------| -| **Gradio UI** | ✅ Complete | Streaming chat interface | -| **MCP Server** | ✅ Complete | Tools accessible from Claude Desktop | -| **Modal Sandbox** | ✅ Complete | Secure statistical analysis | -| **Multi-Source Search** | ✅ Complete | PubMed, ClinicalTrials, bioRxiv | +The system supports three main research patterns: ---- +- **Iterative Research**: Single research loop with search-judge-synthesize cycles +- **Deep Research**: Multi-section parallel research with planning and synthesis +- **Research Team**: Multi-agent coordination using Magentic framework -## Team +Learn more about the [Architecture](overview/architecture.md). -- The-Obstacle-Is-The-Way -- MarioAderman -- Josephrp +## Documentation ---- +- [Overview](overview/architecture.md) - System architecture and design +- [Getting Started](getting-started/installation.md) - Installation and setup +- [Configuration](configuration/index.md) - Configuration guide +- [API Reference](api/agents.md) - API documentation +- [Contributing](contributing.md) - Development guidelines -## Status +## Links -| Phase | Status | -|-------|--------| -| Phases 1-14 | ✅ COMPLETE | +- [GitHub Repository](https://github.com/DeepCritical/GradioDemo) +- [HuggingFace Space](https://huggingface.co/spaces/DataQuests/DeepCritical) -**Test Coverage**: 65% (96 tests passing) -**Architecture Review**: PASSED (98-99/100) diff --git a/docs/license.md b/docs/license.md new file mode 100644 index 0000000000000000000000000000000000000000..c6244b068f976a978a0334b98eda9c0f09e3cad0 --- /dev/null +++ b/docs/license.md @@ -0,0 +1,28 @@ +# License + +DeepCritical is licensed under the MIT License. + +## MIT License + +Copyright (c) 2024 DeepCritical Team + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. + + + diff --git a/docs/overview/architecture.md b/docs/overview/architecture.md new file mode 100644 index 0000000000000000000000000000000000000000..4c59b53a4807a4d1bcefc80ba0086a62b07ebba5 --- /dev/null +++ b/docs/overview/architecture.md @@ -0,0 +1,185 @@ +# Architecture Overview + +DeepCritical is a deep research agent system that uses iterative search-and-judge loops to comprehensively answer research questions. 
The system supports multiple orchestration patterns, graph-based execution, parallel research workflows, and long-running task management with real-time streaming. + +## Core Architecture + +### Orchestration Patterns + +1. **Graph Orchestrator** (`src/orchestrator/graph_orchestrator.py`): + - Graph-based execution using Pydantic AI agents as nodes + - Supports both iterative and deep research patterns + - Node types: Agent, State, Decision, Parallel + - Edge types: Sequential, Conditional, Parallel + - Conditional routing based on knowledge gaps, budget, and iterations + - Parallel execution for concurrent research loops + - Event streaming via `AsyncGenerator[AgentEvent]` for real-time UI updates + - Fallback to agent chains when graph execution is disabled + +2. **Deep Research Flow** (`src/orchestrator/research_flow.py`): + - **Pattern**: Planner → Parallel Iterative Loops (one per section) → Synthesis + - Uses `PlannerAgent` to break query into report sections + - Runs `IterativeResearchFlow` instances in parallel per section via `WorkflowManager` + - Synthesizes results using `LongWriterAgent` or `ProofreaderAgent` + - Supports both graph execution (`use_graph=True`) and agent chains (`use_graph=False`) + - Budget tracking per section and globally + - State synchronization across parallel loops + +3. **Iterative Research Flow** (`src/orchestrator/research_flow.py`): + - **Pattern**: Generate observations → Evaluate gaps → Select tools → Execute → Judge → Continue/Complete + - Uses `KnowledgeGapAgent`, `ToolSelectorAgent`, `ThinkingAgent`, `WriterAgent` + - `JudgeHandler` assesses evidence sufficiency + - Iterates until research complete or constraints met (iterations, time, tokens) + - Supports graph execution and agent chains + +4. **Magentic Orchestrator** (`src/orchestrator_magentic.py`): + - Multi-agent coordination using `agent-framework-core` + - ChatAgent pattern with internal LLMs per agent + - Uses `MagenticBuilder` with participants: searcher, hypothesizer, judge, reporter + - Manager orchestrates agents via `OpenAIChatClient` + - Requires OpenAI API key (function calling support) + - Event-driven: converts Magentic events to `AgentEvent` for UI streaming + - Supports long-running workflows with max rounds and stall/reset handling + +5. **Hierarchical Orchestrator** (`src/orchestrator_hierarchical.py`): + - Uses `SubIterationMiddleware` with `ResearchTeam` and `LLMSubIterationJudge` + - Adapts Magentic ChatAgent to `SubIterationTeam` protocol + - Event-driven via `asyncio.Queue` for coordination + - Supports sub-iteration patterns for complex research tasks + +6. **Legacy Simple Mode** (`src/legacy_orchestrator.py`): + - Linear search-judge-synthesize loop + - Uses `SearchHandlerProtocol` and `JudgeHandlerProtocol` + - Generator-based design yielding `AgentEvent` objects + - Backward compatibility for simple use cases + +## Long-Running Task Support + +The system is designed for long-running research tasks with comprehensive state management and streaming: + +1. **Event Streaming**: + - All orchestrators yield `AgentEvent` objects via `AsyncGenerator` + - Real-time UI updates through Gradio chat interface + - Event types: `started`, `searching`, `search_complete`, `judging`, `judge_complete`, `looping`, `synthesizing`, `hypothesizing`, `complete`, `error` + - Metadata includes iteration numbers, tool names, result counts, durations + +2. 
**Budget Tracking** (`src/middleware/budget_tracker.py`): + - Per-loop and global budget management + - Tracks: tokens, time (seconds), iterations + - Budget enforcement at decision nodes + - Token estimation (~4 chars per token) + - Early termination when budgets exceeded + - Budget summaries for monitoring + +3. **Workflow Manager** (`src/middleware/workflow_manager.py`): + - Coordinates parallel research loops + - Tracks loop status: `pending`, `running`, `completed`, `failed`, `cancelled` + - Synchronizes evidence between loops and global state + - Handles errors per loop (doesn't fail all if one fails) + - Supports loop cancellation and timeout handling + - Evidence deduplication across parallel loops + +4. **State Management** (`src/middleware/state_machine.py`): + - Thread-safe isolation using `ContextVar` for concurrent requests + - `WorkflowState` tracks: evidence, conversation history, embedding service + - Evidence deduplication by URL + - Semantic search via embedding service + - State persistence across long-running workflows + - Supports both iterative and deep research patterns + +5. **Gradio UI** (`src/app.py`): + - Real-time streaming of research progress + - Accordion-based UI for pending/done operations + - OAuth integration (HuggingFace) + - Multiple backend support (API keys, free tier) + - Handles long-running tasks with progress indicators + - Event accumulation for pending operations + +## Graph Architecture + +The graph orchestrator (`src/orchestrator/graph_orchestrator.py`) implements a flexible graph-based execution model: + +**Node Types**: + +- **Agent Nodes**: Execute Pydantic AI agents (e.g., `KnowledgeGapAgent`, `ToolSelectorAgent`) +- **State Nodes**: Update or read workflow state (evidence, conversation) +- **Decision Nodes**: Make routing decisions (research complete?, budget exceeded?) +- **Parallel Nodes**: Execute multiple nodes concurrently (parallel research loops) + +**Edge Types**: + +- **Sequential Edges**: Always traversed (no condition) +- **Conditional Edges**: Traversed based on condition (e.g., if research complete → writer, else → tool selector) +- **Parallel Edges**: Used for parallel execution branches + +**Graph Patterns**: + +- **Iterative Graph**: `[Input] → [Thinking] → [Knowledge Gap] → [Decision: Complete?] → [Tool Selector] or [Writer]` +- **Deep Research Graph**: `[Input] → [Planner] → [Parallel Iterative Loops] → [Synthesizer]` + +**Execution Flow**: + +1. Graph construction from nodes and edges +2. Graph validation (no cycles, all nodes reachable) +3. Graph execution from entry node +4. Node execution based on type +5. Edge evaluation for next node(s) +6. Parallel execution via `asyncio.gather()` +7. State updates at state nodes +8. 
Event streaming for UI + +## Key Components + +- **Orchestrators**: Multiple orchestration patterns (`src/orchestrator/`, `src/orchestrator_*.py`) +- **Research Flows**: Iterative and deep research patterns (`src/orchestrator/research_flow.py`) +- **Graph Builder**: Graph construction utilities (`src/agent_factory/graph_builder.py`) +- **Agents**: Pydantic AI agents (`src/agents/`, `src/agent_factory/agents.py`) +- **Search Tools**: PubMed, ClinicalTrials.gov, Europe PMC, RAG (`src/tools/`) +- **Judge Handler**: LLM-based evidence assessment (`src/agent_factory/judges.py`) +- **Embeddings**: Semantic search & deduplication (`src/services/embeddings.py`) +- **Statistical Analyzer**: Modal sandbox execution (`src/services/statistical_analyzer.py`) +- **Middleware**: State management, budget tracking, workflow coordination (`src/middleware/`) +- **MCP Tools**: Claude Desktop integration (`src/mcp_tools.py`) +- **Gradio UI**: Web interface with MCP server and streaming (`src/app.py`) + +## Research Team & Parallel Execution + +The system supports complex research workflows through: + +1. **WorkflowManager**: Coordinates multiple parallel research loops + - Creates and tracks `ResearchLoop` instances + - Runs loops in parallel via `asyncio.gather()` + - Synchronizes evidence to global state + - Handles loop failures gracefully + +2. **Deep Research Pattern**: Breaks complex queries into sections + - Planner creates report outline with sections + - Each section runs as independent iterative research loop + - Loops execute in parallel + - Evidence shared across loops via global state + - Final synthesis combines all section results + +3. **State Synchronization**: Thread-safe evidence sharing + - Evidence deduplication by URL + - Global state accessible to all loops + - Semantic search across all collected evidence + - Conversation history tracking per iteration + +## Configuration & Modes + +- **Orchestrator Factory** (`src/orchestrator_factory.py`): + - Auto-detects mode: "advanced" if OpenAI key available, else "simple" + - Supports explicit mode selection: "simple", "magentic", "advanced" + - Lazy imports for optional dependencies + +- **Research Modes**: + - `iterative`: Single research loop + - `deep`: Multi-section parallel research + - `auto`: Auto-detect based on query complexity + +- **Execution Modes**: + - `use_graph=True`: Graph-based execution (parallel, conditional routing) + - `use_graph=False`: Agent chains (sequential, backward compatible) + + + diff --git a/docs/overview/features.md b/docs/overview/features.md new file mode 100644 index 0000000000000000000000000000000000000000..01bf0df3ddea6a69ab264c2ee11857dea31266ef --- /dev/null +++ b/docs/overview/features.md @@ -0,0 +1,137 @@ +# Features + +DeepCritical provides a comprehensive set of features for AI-assisted research: + +## Core Features + +### Multi-Source Search + +- **PubMed**: Search peer-reviewed biomedical literature via NCBI E-utilities +- **ClinicalTrials.gov**: Search interventional clinical trials +- **Europe PMC**: Search preprints and peer-reviewed articles (includes bioRxiv/medRxiv) +- **RAG**: Semantic search within collected evidence using LlamaIndex + +### MCP Integration + +- **Model Context Protocol**: Expose search tools via MCP server +- **Claude Desktop**: Use DeepCritical tools directly from Claude Desktop +- **MCP Clients**: Compatible with any MCP-compatible client + +### Authentication + +- **HuggingFace OAuth**: Sign in with HuggingFace account for automatic API token usage +- **Manual API 
Keys**: Support for OpenAI, Anthropic, and HuggingFace API keys +- **Free Tier Support**: Automatic fallback to HuggingFace Inference API + +### Secure Code Execution + +- **Modal Sandbox**: Secure execution of AI-generated statistical code +- **Isolated Environment**: Network isolation and package version pinning +- **Safe Execution**: Prevents malicious code execution + +### Semantic Search & RAG + +- **LlamaIndex Integration**: Advanced RAG capabilities +- **Vector Storage**: ChromaDB for embedding storage +- **Semantic Deduplication**: Automatic detection of similar evidence +- **Embedding Service**: Local sentence-transformers (no API key required) + +### Orchestration Patterns + +- **Graph-Based Execution**: Flexible graph orchestration with conditional routing +- **Parallel Research Loops**: Run multiple research tasks concurrently +- **Iterative Research**: Single-loop research with search-judge-synthesize cycles +- **Deep Research**: Multi-section parallel research with planning and synthesis +- **Magentic Orchestration**: Multi-agent coordination using Microsoft Agent Framework + +### Real-Time Streaming + +- **Event Streaming**: Real-time updates via `AsyncGenerator[AgentEvent]` +- **Progress Tracking**: Monitor research progress with detailed event metadata +- **UI Integration**: Seamless integration with Gradio chat interface + +### Budget Management + +- **Token Budget**: Track and limit LLM token usage +- **Time Budget**: Enforce time limits per research loop +- **Iteration Budget**: Limit maximum iterations +- **Per-Loop Budgets**: Independent budgets for parallel research loops + +### State Management + +- **Thread-Safe Isolation**: ContextVar-based state management +- **Evidence Deduplication**: Automatic URL-based deduplication +- **Conversation History**: Track iteration history and agent interactions +- **State Synchronization**: Share evidence across parallel loops + +## Advanced Features + +### Agent System + +- **Pydantic AI Agents**: Type-safe agent implementation +- **Structured Output**: Pydantic models for agent responses +- **Agent Factory**: Centralized agent creation with fallback support +- **Specialized Agents**: Knowledge gap, tool selector, writer, proofreader, and more + +### Search Tools + +- **Rate Limiting**: Built-in rate limiting for external APIs +- **Retry Logic**: Automatic retry with exponential backoff +- **Query Preprocessing**: Automatic query enhancement and synonym expansion +- **Evidence Conversion**: Automatic conversion to structured Evidence objects + +### Error Handling + +- **Custom Exceptions**: Hierarchical exception system +- **Error Chaining**: Preserve exception context +- **Structured Logging**: Comprehensive logging with structlog +- **Graceful Degradation**: Fallback handlers for missing dependencies + +### Configuration + +- **Pydantic Settings**: Type-safe configuration management +- **Environment Variables**: Support for `.env` files +- **Validation**: Automatic configuration validation +- **Flexible Providers**: Support for multiple LLM and embedding providers + +### Testing + +- **Unit Tests**: Comprehensive unit test coverage +- **Integration Tests**: Real API integration tests +- **Mock Support**: Extensive mocking utilities +- **Coverage Reports**: Code coverage tracking + +## UI Features + +### Gradio Interface + +- **Real-Time Chat**: Interactive chat interface +- **Streaming Updates**: Live progress updates +- **Accordion UI**: Organized display of pending/done operations +- **OAuth Integration**: Seamless 
HuggingFace authentication + +### MCP Server + +- **RESTful API**: HTTP-based MCP server +- **Tool Discovery**: Automatic tool registration +- **Request Handling**: Async request processing +- **Error Responses**: Structured error responses + +## Development Features + +### Code Quality + +- **Type Safety**: Full type hints with mypy strict mode +- **Linting**: Ruff for code quality +- **Formatting**: Automatic code formatting +- **Pre-commit Hooks**: Automated quality checks + +### Documentation + +- **Comprehensive Docs**: Detailed documentation for all components +- **Code Examples**: Extensive code examples +- **Architecture Diagrams**: Visual architecture documentation +- **API Reference**: Complete API documentation + + + diff --git a/docs/overview/quick-start.md b/docs/overview/quick-start.md new file mode 100644 index 0000000000000000000000000000000000000000..b9b45df09e850296c0659435d2916f36187cd614 --- /dev/null +++ b/docs/overview/quick-start.md @@ -0,0 +1,82 @@ +# Quick Start + +Get started with DeepCritical in minutes. + +## Installation + +```bash +# Install uv if you haven't already +pip install uv + +# Sync dependencies +uv sync +``` + +## Run the UI + +```bash +# Start the Gradio app +uv run gradio run src/app.py +``` + +Open your browser to `http://localhost:7860`. + +## Basic Usage + +### 1. Authentication (Optional) + +**HuggingFace OAuth Login**: +- Click the "Sign in with HuggingFace" button at the top of the app +- Your HuggingFace API token will be automatically used for AI inference +- No need to manually enter API keys when logged in + +**Manual API Key (BYOK)**: +- Provide your own API key in the Settings accordion +- Supports HuggingFace, OpenAI, or Anthropic API keys +- Manual keys take priority over OAuth tokens + +### 2. Start a Research Query + +1. Enter your research question in the chat interface +2. Click "Submit" or press Enter +3. Watch the real-time progress as the system: + - Generates observations + - Identifies knowledge gaps + - Searches multiple sources + - Evaluates evidence + - Synthesizes findings +4. Review the final research report + +### 3. MCP Integration (Optional) + +Connect DeepCritical to Claude Desktop: + +1. Add to your `claude_desktop_config.json`: +```json +{ + "mcpServers": { + "deepcritical": { + "url": "http://localhost:7860/gradio_api/mcp/" + } + } +} +``` + +2. Restart Claude Desktop +3. Use DeepCritical tools directly from Claude Desktop + +## Available Tools + +- `search_pubmed`: Search peer-reviewed biomedical literature +- `search_clinical_trials`: Search ClinicalTrials.gov +- `search_biorxiv`: Search bioRxiv/medRxiv preprints +- `search_all`: Search all sources simultaneously +- `analyze_hypothesis`: Secure statistical analysis using Modal sandboxes + +## Next Steps + +- Read the [Installation Guide](../getting-started/installation.md) for detailed setup +- Learn about [Configuration](../configuration/index.md) +- Explore the [Architecture](../architecture/graph-orchestration.md) +- Check out [Examples](../getting-started/examples.md) + diff --git a/docs/team.md b/docs/team.md new file mode 100644 index 0000000000000000000000000000000000000000..802e7eaed95c67acefb4f30fe9fa7388305e7fcb --- /dev/null +++ b/docs/team.md @@ -0,0 +1,33 @@ +# Team + +DeepCritical is developed by a team of researchers and developers working on AI-assisted research. 
+ +## Team Members + +### The-Obstacle-Is-The-Way + +- GitHub: [The-Obstacle-Is-The-Way](https://github.com/The-Obstacle-Is-The-Way) + +### MarioAderman + +- GitHub: [MarioAderman](https://github.com/MarioAderman) + +### Josephrp + +- GitHub: [Josephrp](https://github.com/Josephrp) + +## About + +The DeepCritical team met online in the Alzheimer's Critical Literature Review Group in the Hugging Science initiative. We're building the agent framework we want to use for AI-assisted research to turn the vast amounts of clinical data into cures. + +## Contributing + +We welcome contributions! See the [Contributing Guide](contributing/index.md) for details. + +## Links + +- [GitHub Repository](https://github.com/DeepCritical/GradioDemo) +- [HuggingFace Space](https://huggingface.co/spaces/DataQuests/DeepCritical) + + + diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000000000000000000000000000000000000..f47cac29d5f98dae511c817420e7a6e3bb64d390 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,118 @@ +site_name: DeepCritical +site_description: AI-Native Drug Repurposing Research Agent +site_author: DeepCritical Team +site_url: https://deepcritical.github.io/GradioDemo/ + +repo_name: DeepCritical/GradioDemo +repo_url: https://github.com/DeepCritical/GradioDemo +edit_uri: edit/main/docs/ + +theme: + name: material + palette: + # Light mode + - scheme: default + primary: orange + accent: red + toggle: + icon: material/brightness-7 + name: Switch to dark mode + # Dark mode + - scheme: slate + primary: orange + accent: red + toggle: + icon: material/brightness-4 + name: Switch to light mode + features: + - navigation.tabs + - navigation.sections + - navigation.expand + - navigation.top + - search.suggest + - search.highlight + - content.code.annotate + - content.code.copy + icon: + repo: fontawesome/brands/github + +plugins: + - search + - mermaid2 + - codeinclude + - minify: + minify_html: true + minify_js: true + minify_css: true + +markdown_extensions: + - pymdownx.highlight: + anchor_linenums: true + - pymdownx.inlinehilite + - pymdownx.superfences: + custom_fences: + - name: mermaid + class: mermaid + format: !!python/name:pymdownx.superfences.fence_code_format + preserve_tabs: true + - dev.docs_plugins: + base_path: . 
+ - pymdownx.tabbed: + alternate_style: true + - pymdownx.tasklist: + custom_checkbox: true + - pymdownx.emoji: + emoji_index: !!python/name:material.extensions.emoji.twemoji + emoji_generator: !!python/name:material.extensions.emoji.to_svg + - admonition + - pymdownx.details + - pymdownx.superfences + - attr_list + - md_in_html + - tables + - toc: + permalink: true + +nav: + - Home: index.md + - Overview: + - overview/architecture.md + - overview/features.md + - overview/quick-start.md + - Getting Started: + - getting-started/installation.md + - getting-started/quick-start.md + - getting-started/mcp-integration.md + - getting-started/examples.md + - Configuration: + - configuration/index.md + - configuration/CONFIGURATION.md + - Architecture: + - architecture/graph-orchestration.md + - architecture/graph_orchestration.md + - architecture/workflows.md + - architecture/workflow-diagrams.md + - architecture/agents.md + - architecture/orchestrators.md + - architecture/tools.md + - architecture/middleware.md + - architecture/services.md + - API Reference: + - api/agents.md + - api/tools.md + - api/orchestrators.md + - api/services.md + - api/models.md + - Contributing: contributing.md + - License: license.md + - Team: team.md + +extra: + social: + - icon: fontawesome/brands/github + link: https://github.com/DeepCritical/GradioDemo + - icon: material/web + link: https://huggingface.co/spaces/DataQuests/DeepCritical + +copyright: Copyright © 2024 DeepCritical Team + diff --git a/pyproject.toml b/pyproject.toml index c055dd28068c414e8059ec55e7d4ea0fd3213d8b..f225baf560876c99e860d92f82c6ef9606295117 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -7,18 +7,18 @@ requires-python = ">=3.11" dependencies = [ # Core "pydantic>=2.7", - "pydantic-settings>=2.2", # For BaseSettings (config) - "pydantic-ai>=0.0.16", # Agent framework + "pydantic-settings>=2.2", + "pydantic-ai>=0.0.16", # AI Providers "openai>=1.0.0", "anthropic>=0.18.0", # HTTP & Parsing - "httpx>=0.27", # Async HTTP client (PubMed) - "beautifulsoup4>=4.12", # HTML parsing - "xmltodict>=0.13", # PubMed XML -> dict - "huggingface-hub>=0.20.0", # Hugging Face Inference API + "httpx>=0.27", + "beautifulsoup4>=4.12", + "xmltodict>=0.13", + "huggingface-hub>=0.20.0", # UI - "gradio[mcp]>=6.0.0", # Chat interface with MCP server support (6.0 required for css in launch()) + "gradio[mcp,oauth]>=6.0.0", # Utils "python-dotenv>=1.0", # .env loading "tenacity>=8.2", # Retry logic @@ -31,6 +31,15 @@ dependencies = [ "llama-index-llms-huggingface-api>=0.6.1", "llama-index-vector-stores-chroma>=0.5.3", "llama-index>=0.14.8", + "tokenizers>=0.22.0,<=0.23.0", + "transformers>=4.57.2", + "chromadb>=0.4.0", + "sentence-transformers>=2.2.0", + "numpy<2.0", + "agent-framework-core>=1.0.0b251120,<2.0.0", + "modal>=0.63.0", + "llama-index-llms-openai>=0.6.9", + "llama-index-embeddings-openai>=0.5.1", ] [project.optional-dependencies] @@ -41,31 +50,18 @@ dev = [ "pytest-sugar>=1.0", "pytest-cov>=5.0", "pytest-mock>=3.12", - "respx>=0.21", # Mock httpx requests - "typer>=0.9.0", # Gradio CLI dependency for smoke tests + "respx>=0.21", + "typer>=0.9.0", # Quality "ruff>=0.4.0", "mypy>=1.10", "pre-commit>=3.7", -] -magentic = [ - "agent-framework-core>=1.0.0b251120,<2.0.0", # Microsoft Agent Framework (PyPI) -] -embeddings = [ - "chromadb>=0.4.0", - "sentence-transformers>=2.2.0", - "numpy<2.0", # chromadb compatibility: uses np.float_ removed in NumPy 2.0 -] -modal = [ - # Mario's Modal code execution + LlamaIndex RAG - "modal>=0.63.0", - 
"llama-index>=0.11.0", - "llama-index-llms-openai", - "llama-index-embeddings-openai", - "llama-index-vector-stores-chroma", - "chromadb>=0.4.0", - "numpy<2.0", # chromadb compatibility: uses np.float_ removed in NumPy 2.0 + # Documentation + "mkdocs>=1.5.0", + "mkdocs-material>=9.0.0", + "mkdocs-mermaid2-plugin>=1.1.0", + "mkdocs-minify-plugin>=0.7.0", ] [build-system] @@ -164,9 +160,14 @@ exclude_lines = [ [dependency-groups] dev = [ + "mkdocs-codeinclude-plugin>=0.2.1", + "mkdocs-macros-plugin>=1.5.0", + "pytest>=9.0.1", + "pytest-asyncio>=1.3.0", + "pytest-cov>=7.0.0", + "pytest-mock>=3.15.1", + "pytest-sugar>=1.1.1", + "respx>=0.22.0", "structlog>=25.5.0", "ty>=0.0.1a28", ] - -# Note: agent-framework-core is optional for magentic mode (multi-agent orchestration) -# Version pinned to 1.0.0b* to avoid breaking changes. CI skips tests via pytest.importorskip diff --git a/requirements.txt b/requirements.txt index a9b57bb8ae90cc64a2ab0294b9848354d44f3b87..b182988edd4cfe2e992b3e87628a3ecb3180bca5 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,12 +1,18 @@ -# Core dependencies for HuggingFace Spaces +########################## +# DO NOT USE THIS FILE +# FOR GRADIO DEMO ONLY +########################## + + +#Core dependencies for HuggingFace Spaces pydantic>=2.7 pydantic-settings>=2.2 pydantic-ai>=0.0.16 -# AI Providers +# OPTIONAL AI Providers openai>=1.0.0 -anthropic>=0.18.0 +# anthropic>=0.18.0 # Multi-agent orchestration (Advanced mode) agent-framework-core>=1.0.0b251120 @@ -20,14 +26,14 @@ beautifulsoup4>=4.12 xmltodict>=0.13 # UI (Gradio with MCP server support) -gradio[mcp]>=6.0.0 +# gradio[mcp]>=6.0.0 # Utils python-dotenv>=1.0 tenacity>=8.2 structlog>=24.1 requests>=2.32.5 -limits>=3.0 # Rate limiting (Phase 17) +limits>=3.0 # Rate limiting # Optional: Modal for code execution modal>=0.63.0 @@ -35,7 +41,7 @@ modal>=0.63.0 # Optional: LlamaIndex RAG llama-index>=0.11.0 llama-index-llms-openai -llama-index-llms-huggingface # Optional: For HuggingFace LLM support in RAG +llama-index-llms-huggingface llama-index-embeddings-openai llama-index-vector-stores-chroma chromadb>=0.4.0 diff --git a/src/app.py b/src/app.py index 39b33550aa8f258cfb35b903d9479b2d1be8daaa..67d0cfbda503c5ce4c09271551769f9b68e06034 100644 --- a/src/app.py +++ b/src/app.py @@ -5,12 +5,8 @@ from collections.abc import AsyncGenerator from typing import Any import gradio as gr -from pydantic_ai.models.anthropic import AnthropicModel from pydantic_ai.models.huggingface import HuggingFaceModel -from pydantic_ai.models.openai import OpenAIChatModel as OpenAIModel -from pydantic_ai.providers.anthropic import AnthropicProvider from pydantic_ai.providers.huggingface import HuggingFaceProvider -from pydantic_ai.providers.openai import OpenAIProvider from src.agent_factory.judges import HFInferenceJudgeHandler, JudgeHandler, MockJudgeHandler from src.orchestrator_factory import create_orchestrator @@ -19,14 +15,13 @@ from src.tools.europepmc import EuropePMCTool from src.tools.pubmed import PubMedTool from src.tools.search_handler import SearchHandler from src.utils.config import settings -from src.utils.models import OrchestratorConfig +from src.utils.models import AgentEvent, OrchestratorConfig def configure_orchestrator( use_mock: bool = False, mode: str = "simple", - user_api_key: str | None = None, - api_provider: str = "huggingface", + oauth_token: str | None = None, ) -> tuple[Any, str]: """ Create an orchestrator instance. 
@@ -34,8 +29,7 @@ def configure_orchestrator( Args: use_mock: If True, use MockJudgeHandler (no API key needed) mode: Orchestrator mode ("simple" or "advanced") - user_api_key: Optional user-provided API key (BYOK) - api_provider: API provider ("huggingface", "openai", or "anthropic") + oauth_token: Optional OAuth token from HuggingFace login Returns: Tuple of (Orchestrator instance, backend_name) @@ -61,37 +55,16 @@ def configure_orchestrator( judge_handler = MockJudgeHandler() backend_info = "Mock (Testing)" - # 2. API Key (User provided or Env) - HuggingFace, OpenAI, or Anthropic - elif ( - user_api_key - or ( - api_provider == "huggingface" - and (os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")) - ) - or (api_provider == "openai" and os.getenv("OPENAI_API_KEY")) - or (api_provider == "anthropic" and os.getenv("ANTHROPIC_API_KEY")) - ): - model: AnthropicModel | HuggingFaceModel | OpenAIModel | None = None - if user_api_key: - # Validate key/provider match to prevent silent auth failures - if api_provider == "openai" and user_api_key.startswith("sk-ant-"): - raise ValueError("Anthropic key provided but OpenAI provider selected") - is_openai_key = user_api_key.startswith("sk-") and not user_api_key.startswith( - "sk-ant-" - ) - if api_provider == "anthropic" and is_openai_key: - raise ValueError("OpenAI key provided but Anthropic provider selected") - if api_provider == "huggingface": - model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct" - hf_provider = HuggingFaceProvider(api_key=user_api_key) - model = HuggingFaceModel(model_name, provider=hf_provider) - elif api_provider == "anthropic": - anthropic_provider = AnthropicProvider(api_key=user_api_key) - model = AnthropicModel(settings.anthropic_model, provider=anthropic_provider) - elif api_provider == "openai": - openai_provider = OpenAIProvider(api_key=user_api_key) - model = OpenAIModel(settings.openai_model, provider=openai_provider) - backend_info = f"API ({api_provider.upper()})" + # 2. API Key (OAuth or Env) - HuggingFace only (OAuth provides HF token) + # Priority: oauth_token > env vars + effective_api_key = oauth_token + if effective_api_key or (os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")): + model: HuggingFaceModel | None = None + if effective_api_key: + model_name = settings.huggingface_model or "meta-llama/Llama-3.1-8B-Instruct" + hf_provider = HuggingFaceProvider(api_key=effective_api_key) + model = HuggingFaceModel(model_name, provider=hf_provider) + backend_info = "API (HuggingFace OAuth)" else: backend_info = "API (Env Config)" @@ -112,13 +85,255 @@ def configure_orchestrator( return orchestrator, backend_info +def event_to_chat_message(event: AgentEvent) -> gr.ChatMessage: + """ + Convert AgentEvent to gr.ChatMessage with metadata for accordion display. 
+ + Args: + event: The AgentEvent to convert + + Returns: + ChatMessage with metadata for collapsible accordion + """ + # Map event types to accordion titles and determine if pending + event_configs: dict[str, dict[str, Any]] = { + "started": {"title": "🚀 Starting Research", "status": "done", "icon": "🚀"}, + "searching": {"title": "🔍 Searching Literature", "status": "pending", "icon": "🔍"}, + "search_complete": {"title": "📚 Search Results", "status": "done", "icon": "📚"}, + "judging": {"title": "🧠 Evaluating Evidence", "status": "pending", "icon": "🧠"}, + "judge_complete": {"title": "✅ Evidence Assessment", "status": "done", "icon": "✅"}, + "looping": {"title": "🔄 Research Iteration", "status": "pending", "icon": "🔄"}, + "synthesizing": {"title": "📝 Synthesizing Report", "status": "pending", "icon": "📝"}, + "hypothesizing": {"title": "🔬 Generating Hypothesis", "status": "pending", "icon": "🔬"}, + "analyzing": {"title": "📊 Statistical Analysis", "status": "pending", "icon": "📊"}, + "analysis_complete": {"title": "📈 Analysis Results", "status": "done", "icon": "📈"}, + "streaming": {"title": "📡 Processing", "status": "pending", "icon": "📡"}, + "complete": {"title": None, "status": "done", "icon": "🎉"}, # Main response, no accordion + "error": {"title": "❌ Error", "status": "done", "icon": "❌"}, + } + + config = event_configs.get( + event.type, {"title": f"• {event.type}", "status": "done", "icon": "•"} + ) + + # For complete events, return main response without accordion + if event.type == "complete": + return gr.ChatMessage( + role="assistant", + content=event.message, + ) + + # Build metadata for accordion + metadata: dict[str, Any] = {} + if config["title"]: + metadata["title"] = config["title"] + + # Set status (pending shows spinner, done is collapsed) + if config["status"] == "pending": + metadata["status"] = "pending" + + # Add duration if available in data + if event.data and isinstance(event.data, dict) and "duration" in event.data: + metadata["duration"] = event.data["duration"] + + # Add log info (iteration number, etc.) + log_parts: list[str] = [] + if event.iteration > 0: + log_parts.append(f"Iteration {event.iteration}") + if event.data and isinstance(event.data, dict): + if "tool" in event.data: + log_parts.append(f"Tool: {event.data['tool']}") + if "results_count" in event.data: + log_parts.append(f"Results: {event.data['results_count']}") + if log_parts: + metadata["log"] = " | ".join(log_parts) + + return gr.ChatMessage( + role="assistant", + content=event.message, + metadata=metadata if metadata else None, + ) + + +def extract_oauth_info(request: gr.Request | None) -> tuple[str | None, str | None]: + """ + Extract OAuth token and username from Gradio request. 
+ + Args: + request: Gradio request object containing OAuth information + + Returns: + Tuple of (oauth_token, oauth_username) + """ + oauth_token: str | None = None + oauth_username: str | None = None + + if request is None: + return oauth_token, oauth_username + + # Try multiple ways to access OAuth token (Gradio API may vary) + # Pattern 1: request.oauth_token.token + if hasattr(request, "oauth_token") and request.oauth_token is not None: + if hasattr(request.oauth_token, "token"): + oauth_token = request.oauth_token.token + elif isinstance(request.oauth_token, str): + oauth_token = request.oauth_token + # Pattern 2: request.headers (fallback) + elif hasattr(request, "headers"): + # OAuth token might be in headers + auth_header = request.headers.get("authorization") or request.headers.get("Authorization") + if auth_header and auth_header.startswith("Bearer "): + oauth_token = auth_header.replace("Bearer ", "") + + # Access username from request + if hasattr(request, "username") and request.username: + oauth_username = request.username + # Also try accessing via oauth_profile if available + elif hasattr(request, "oauth_profile") and request.oauth_profile is not None: + if hasattr(request.oauth_profile, "username"): + oauth_username = request.oauth_profile.username + elif hasattr(request.oauth_profile, "name"): + oauth_username = request.oauth_profile.name + + return oauth_token, oauth_username + + +async def yield_auth_messages( + oauth_username: str | None, + oauth_token: str | None, + has_huggingface: bool, + mode: str, +) -> AsyncGenerator[gr.ChatMessage, None]: + """ + Yield authentication and mode status messages. + + Args: + oauth_username: OAuth username if available + oauth_token: OAuth token if available + has_huggingface: Whether HuggingFace credentials are available + mode: Orchestrator mode + + Yields: + ChatMessage objects with authentication status + """ + # Show user greeting if logged in via OAuth + if oauth_username: + yield gr.ChatMessage( + role="assistant", + content=f"👋 **Welcome, {oauth_username}!** Using your HuggingFace account.\n\n", + ) + + # Advanced mode is not supported without OpenAI (which requires manual setup) + # For now, we only support simple mode with HuggingFace + if mode == "advanced": + yield gr.ChatMessage( + role="assistant", + content=( + "⚠️ **Warning**: Advanced mode requires OpenAI API key configuration. " + "Falling back to simple mode.\n\n" + ), + ) + + # Inform user about authentication status + if oauth_token: + yield gr.ChatMessage( + role="assistant", + content=( + "🔐 **Using HuggingFace OAuth token** - " + "Authenticated via your HuggingFace account.\n\n" + ), + ) + elif not has_huggingface: + # No keys at all - will use FREE HuggingFace Inference (public models) + yield gr.ChatMessage( + role="assistant", + content=( + "🤗 **Free Tier**: Using HuggingFace Inference (Llama 3.1 / Mistral) for AI analysis.\n" + "For premium models or higher rate limits, sign in with HuggingFace above.\n\n" + ), + ) + + +async def handle_orchestrator_events( + orchestrator: Any, + message: str, +) -> AsyncGenerator[gr.ChatMessage, None]: + """ + Handle orchestrator events and yield ChatMessages. 
+ + Args: + orchestrator: The orchestrator instance + message: The research question + + Yields: + ChatMessage objects from orchestrator events + """ + # Track pending accordions for real-time updates + pending_accordions: dict[str, str] = {} # title -> accumulated content + + async for event in orchestrator.run(message): + # Convert event to ChatMessage with metadata + chat_msg = event_to_chat_message(event) + + # Handle complete events (main response) + if event.type == "complete": + # Close any pending accordions first + if pending_accordions: + for title, content in pending_accordions.items(): + yield gr.ChatMessage( + role="assistant", + content=content.strip(), + metadata={"title": title, "status": "done"}, + ) + pending_accordions.clear() + + # Yield final response (no accordion for main response) + yield chat_msg + continue + + # Handle events with metadata (accordions) + if chat_msg.metadata: + title = chat_msg.metadata.get("title") + status = chat_msg.metadata.get("status") + + if title: + # For pending operations, accumulate content and show spinner + if status == "pending": + if title not in pending_accordions: + pending_accordions[title] = "" + pending_accordions[title] += chat_msg.content + "\n" + # Yield updated accordion with accumulated content + yield gr.ChatMessage( + role="assistant", + content=pending_accordions[title].strip(), + metadata=chat_msg.metadata, + ) + elif title in pending_accordions: + # Combine pending content with final content + final_content = pending_accordions[title] + chat_msg.content + del pending_accordions[title] + yield gr.ChatMessage( + role="assistant", + content=final_content.strip(), + metadata={"title": title, "status": "done"}, + ) + else: + # New done accordion (no pending state) + yield chat_msg + else: + # No title, yield as-is + yield chat_msg + else: + # No metadata, yield as plain message + yield chat_msg + + async def research_agent( message: str, history: list[dict[str, Any]], mode: str = "simple", - api_key: str = "", - api_provider: str = "huggingface", -) -> AsyncGenerator[str, None]: + request: gr.Request | None = None, +) -> AsyncGenerator[gr.ChatMessage | list[gr.ChatMessage], None]: """ Gradio chat function that runs the research agent. @@ -126,140 +341,101 @@ async def research_agent( message: User's research question history: Chat history (Gradio format) mode: Orchestrator mode ("simple" or "advanced") - api_key: Optional user-provided API key (BYOK - Bring Your Own Key) - api_provider: API provider ("huggingface", "openai", or "anthropic") + request: Gradio request object containing OAuth information Yields: - Markdown-formatted responses for streaming + ChatMessage objects with metadata for accordion display """ if not message.strip(): - yield "Please enter a research question." 
+ yield gr.ChatMessage( + role="assistant", + content="Please enter a research question.", + ) return - # Clean user-provided API key - user_api_key = api_key.strip() if api_key else None + # Extract OAuth token from request if available + oauth_token, oauth_username = extract_oauth_info(request) # Check available keys - has_huggingface = bool(os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")) - has_openai = bool(os.getenv("OPENAI_API_KEY")) - has_anthropic = bool(os.getenv("ANTHROPIC_API_KEY")) - has_user_key = bool(user_api_key) - has_paid_key = has_openai or has_anthropic or has_user_key - - # Advanced mode requires OpenAI specifically (due to agent-framework binding) - if mode == "advanced" and not (has_openai or (has_user_key and api_provider == "openai")): - yield ( - "⚠️ **Warning**: Advanced mode currently requires OpenAI API key. " - "Falling back to simple mode.\n\n" - ) - mode = "simple" + has_huggingface = bool(os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_API_KEY") or oauth_token) - # Inform user about their key being used - if has_user_key: - yield ( - f"🔑 **Using your {api_provider.upper()} API key** - " - "Your key is used only for this session and is never stored.\n\n" - ) - elif not has_paid_key and not has_huggingface: - # No keys at all - will use FREE HuggingFace Inference (public models) - yield ( - "🤗 **Free Tier**: Using HuggingFace Inference (Llama 3.1 / Mistral) for AI analysis.\n" - "For premium models or higher rate limits, enter a HuggingFace, OpenAI, or Anthropic API key below.\n\n" - ) + # Adjust mode if needed + effective_mode = mode + if mode == "advanced": + effective_mode = "simple" - # Run the agent and stream events - response_parts: list[str] = [] + # Yield authentication and mode status messages + async for msg in yield_auth_messages(oauth_username, oauth_token, has_huggingface, mode): + yield msg + # Run the agent and stream events try: # use_mock=False - let configure_orchestrator decide based on available keys - # It will use: Paid API > HF Inference (free tier) + # It will use: OAuth token > Env vars > HF Inference (free tier) orchestrator, backend_name = configure_orchestrator( use_mock=False, # Never use mock in production - HF Inference is the free fallback - mode=mode, - user_api_key=user_api_key, - api_provider=api_provider, + mode=effective_mode, + oauth_token=oauth_token, ) - yield f"🧠 **Backend**: {backend_name}\n\n" - - async for event in orchestrator.run(message): - # Format event as markdown - event_md = event.to_markdown() - response_parts.append(event_md) + yield gr.ChatMessage( + role="assistant", + content=f"🧠 **Backend**: {backend_name}\n\n", + ) - # If complete, show full response - if event.type == "complete": - yield event.message - else: - # Show progress - yield "\n\n".join(response_parts) + # Handle orchestrator events + async for msg in handle_orchestrator_events(orchestrator, message): + yield msg except Exception as e: - yield f"❌ **Error**: {e!s}" + yield gr.ChatMessage( + role="assistant", + content=f"❌ **Error**: {e!s}", + metadata={"title": "❌ Error", "status": "done"}, + ) -def create_demo() -> gr.ChatInterface: +def create_demo() -> gr.Blocks: """ - Create the Gradio demo interface with MCP support. + Create the Gradio demo interface with MCP support and OAuth login. Returns: - Configured Gradio Blocks interface with MCP server enabled + Configured Gradio Blocks interface with MCP server and OAuth enabled """ - # 1. 
Unwrapped ChatInterface (Fixes Accordion Bug) - demo = gr.ChatInterface( - fn=research_agent, - title="🧬 DeepCritical", - description=( - "*AI-Powered Drug Repurposing Agent — searches PubMed, " - "ClinicalTrials.gov & Europe PMC*\n\n" - "---\n" - "*Research tool only — not for medical advice.* \n" - "**MCP Server Active**: Connect Claude Desktop to `/gradio_api/mcp/`" - ), - examples=[ - [ - "What drugs could be repurposed for Alzheimer's disease?", - "simple", - "", - "openai", - ], - [ - "Is metformin effective for treating cancer?", - "simple", - "", - "openai", - ], - [ - "What medications show promise for Long COVID treatment?", - "simple", - "", - "openai", + with gr.Blocks(title="🧬 DeepCritical") as demo: + # Add login button at the top + with gr.Row(): + gr.LoginButton() + + # Chat interface + gr.ChatInterface( + fn=research_agent, + title="🧬 DeepCritical", + description=( + "*AI-Powered Drug Repurposing Agent — searches PubMed, " + "ClinicalTrials.gov & Europe PMC*\n\n" + "---\n" + "*Research tool only — not for medical advice.* \n" + "**MCP Server Active**: Connect Claude Desktop to `/gradio_api/mcp/`\n\n" + "**Sign in with HuggingFace** above to use your account's API token automatically." + ), + examples=[ + ["What drugs could be repurposed for Alzheimer's disease?", "simple"], + ["Is metformin effective for treating cancer?", "simple"], + ["What medications show promise for Long COVID treatment?", "simple"], ], - ], - additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False), - additional_inputs=[ - gr.Radio( - choices=["simple", "advanced"], - value="simple", - label="Orchestrator Mode", - info=( - "Simple: Linear (Free Tier Friendly) | Advanced: Multi-Agent (Requires OpenAI)" + additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False), + additional_inputs=[ + gr.Radio( + choices=["simple", "advanced"], + value="simple", + label="Orchestrator Mode", + info=( + "Simple: Linear (Free Tier Friendly) | Advanced: Multi-Agent (Requires OpenAI - not available without manual config)" + ), ), - ), - gr.Textbox( - label="🔑 API Key (Optional - BYOK)", - placeholder="sk-... or sk-ant-...", - type="password", - info="Enter your own API key. 
Never stored.", - ), - gr.Radio( - choices=["huggingface", "openai", "anthropic"], - value="huggingface", - label="API Provider", - info="Select the provider for your API key (HuggingFace is default and free)", - ), - ], - ) + ], + ) return demo diff --git a/tests/unit/middleware/__init__.py b/tests/unit/middleware/__init__.py index 9a6293b6a156552a4775c4a9078865571fc02e64..8ce16913b7aa4c259711e35f3e2788007d202677 100644 --- a/tests/unit/middleware/__init__.py +++ b/tests/unit/middleware/__init__.py @@ -1 +1,4 @@ """Unit tests for middleware components.""" + + + diff --git a/tests/unit/middleware/test_budget_tracker_phase7.py b/tests/unit/middleware/test_budget_tracker_phase7.py index f466e7abde38a6ca329a8f0ea451b03f7bf73905..903addc1e7d14866eabd709099748524cf4919d5 100644 --- a/tests/unit/middleware/test_budget_tracker_phase7.py +++ b/tests/unit/middleware/test_budget_tracker_phase7.py @@ -157,3 +157,6 @@ class TestIterationTokenTracking: assert budget2 is not None assert budget1.iteration_tokens[1] == 100 assert budget2.iteration_tokens[1] == 200 + + + diff --git a/tests/unit/middleware/test_state_machine.py b/tests/unit/middleware/test_state_machine.py index ce86a63c4ce6e4ca3192ffa8ad70668c0b4c4710..d03722dce65c5710935238d192e4127ab7dc78e9 100644 --- a/tests/unit/middleware/test_state_machine.py +++ b/tests/unit/middleware/test_state_machine.py @@ -354,3 +354,6 @@ class TestContextVarIsolation: assert len(state2.evidence) == 1 assert state1.evidence[0].citation.url == "https://example.com/1" assert state2.evidence[0].citation.url == "https://example.com/2" + + + diff --git a/tests/unit/middleware/test_workflow_manager.py b/tests/unit/middleware/test_workflow_manager.py index 9bcff9b6fa8b96114e937e100523086e9040148a..af28c58c60d3e137243b2684543e4ce9b4289ce1 100644 --- a/tests/unit/middleware/test_workflow_manager.py +++ b/tests/unit/middleware/test_workflow_manager.py @@ -284,3 +284,6 @@ class TestWorkflowManager: assert len(shared) == 1 assert shared[0].content == "Shared" + + + diff --git a/tests/unit/orchestrator/__init__.py b/tests/unit/orchestrator/__init__.py index c36d5661288f15ea08112f6ad1b1b122a94658a9..8040ddaee36e0b8fe4e114e815e0a3409f9f26ff 100644 --- a/tests/unit/orchestrator/__init__.py +++ b/tests/unit/orchestrator/__init__.py @@ -1 +1,4 @@ """Unit tests for orchestrator module.""" + + + diff --git a/tests/unit/orchestrator/test_research_flow.py b/tests/unit/orchestrator/test_research_flow.py index 2691ec15f9cfb7f465be33781b4d0fd009cbf0c0..c9a8f407ec2027feadddc0df68d615197001b3b4 100644 --- a/tests/unit/orchestrator/test_research_flow.py +++ b/tests/unit/orchestrator/test_research_flow.py @@ -37,6 +37,7 @@ class TestIterativeResearchFlow: patch("src.orchestrator.research_flow.create_thinking_agent") as mock_thinking, patch("src.orchestrator.research_flow.create_writer_agent") as mock_writer, patch("src.orchestrator.research_flow.execute_tool_tasks") as mock_execute, + patch("src.orchestrator.research_flow.get_rag_service") as mock_rag, ): mock_kg.return_value = mock_agents["knowledge_gap"] mock_ts.return_value = mock_agents["tool_selector"] @@ -45,6 +46,8 @@ class TestIterativeResearchFlow: mock_execute.return_value = { "task_1": ToolAgentOutput(output="Finding 1", sources=["url1"]), } + # Mock RAG service to return None to avoid ChromaDB initialization + mock_rag.return_value = None yield IterativeResearchFlow(max_iterations=2, max_time_minutes=5) diff --git a/tests/unit/orchestrator/test_research_flow_phase7.py b/tests/unit/orchestrator/test_research_flow_phase7.py index 
56a28b54ec118e37904ddedb115f88f870df67bb..1ca57b06fd3c9c7a9e16ac80c1396a5f8587bd2e 100644 --- a/tests/unit/orchestrator/test_research_flow_phase7.py +++ b/tests/unit/orchestrator/test_research_flow_phase7.py @@ -68,6 +68,7 @@ def flow_with_judge(mock_agents, mock_judge_handler): patch("src.orchestrator.research_flow.create_judge_handler") as mock_judge_factory, patch("src.orchestrator.research_flow.execute_tool_tasks") as mock_execute, patch("src.orchestrator.research_flow.get_workflow_state") as mock_state, + patch("src.orchestrator.research_flow.get_rag_service") as mock_rag, ): mock_kg.return_value = mock_agents["knowledge_gap"] mock_ts.return_value = mock_agents["tool_selector"] @@ -77,6 +78,8 @@ def flow_with_judge(mock_agents, mock_judge_handler): mock_execute.return_value = { "task_1": ToolAgentOutput(output="Finding 1", sources=["url1"]), } + # Mock RAG service to return None to avoid ChromaDB initialization + mock_rag.return_value = None # Mock workflow state mock_state_obj = MagicMock() @@ -84,7 +87,7 @@ def flow_with_judge(mock_agents, mock_judge_handler): mock_state_obj.add_evidence = MagicMock(return_value=1) mock_state.return_value = mock_state_obj - return IterativeResearchFlow(max_iterations=2, max_time_minutes=5) + yield IterativeResearchFlow(max_iterations=2, max_time_minutes=5) @pytest.mark.unit diff --git a/tests/unit/test_app_smoke.py b/tests/unit/test_app_smoke.py index 3fb347f9d01e294eb0d36c9b374290ba50b6a0bb..74e88245814f12c1d80af1975ddf25b5b0dd634f 100644 --- a/tests/unit/test_app_smoke.py +++ b/tests/unit/test_app_smoke.py @@ -26,8 +26,15 @@ class TestAppSmoke: from src.app import create_demo - demo = create_demo() - assert demo is not None + # OAuth dependencies may not be available in test environment + # This is acceptable - OAuth is optional functionality + try: + demo = create_demo() + assert demo is not None + except ImportError as e: + if "oauth" in str(e).lower() or "itsdangerous" in str(e).lower(): + pytest.skip(f"OAuth dependencies not available: {e}") + raise def test_mcp_tools_importable(self) -> None: """MCP tool functions should be importable. 
diff --git a/tests/unit/tools/test_pubmed.py b/tests/unit/tools/test_pubmed.py
index f0e1717374c7c497300ffa29abc1653e839ca110..e4d93883bf058459039c51f51c1647d8a5a73ab4 100644
--- a/tests/unit/tools/test_pubmed.py
+++ b/tests/unit/tools/test_pubmed.py
@@ -1,6 +1,6 @@
 """Unit tests for PubMed tool."""

-from unittest.mock import AsyncMock, MagicMock
+from unittest.mock import AsyncMock, MagicMock, patch

 import pytest

@@ -42,7 +42,7 @@ class TestPubMedTool:
     """Tests for PubMedTool."""

     @pytest.mark.asyncio
-    async def test_search_returns_evidence(self, mocker):
+    async def test_search_returns_evidence(self):
         """PubMedTool should return Evidence objects from search."""
         # Mock the HTTP responses
         mock_search_response = MagicMock()
@@ -58,20 +58,20 @@ class TestPubMedTool:
         mock_client.__aenter__ = AsyncMock(return_value=mock_client)
         mock_client.__aexit__ = AsyncMock(return_value=None)

-        mocker.patch("httpx.AsyncClient", return_value=mock_client)
+        with patch("httpx.AsyncClient", return_value=mock_client):

-        # Act
-        tool = PubMedTool()
-        results = await tool.search("metformin alzheimer")
+            # Act
+            tool = PubMedTool()
+            results = await tool.search("metformin alzheimer")

-        # Assert
-        assert len(results) == 1
-        assert results[0].citation.source == "pubmed"
-        assert "Metformin" in results[0].citation.title
-        assert "12345678" in results[0].citation.url
+            # Assert
+            assert len(results) == 1
+            assert results[0].citation.source == "pubmed"
+            assert "Metformin" in results[0].citation.title
+            assert "12345678" in results[0].citation.url

     @pytest.mark.asyncio
-    async def test_search_empty_results(self, mocker):
+    async def test_search_empty_results(self):
         """PubMedTool should return empty list when no results."""
         mock_response = MagicMock()
         mock_response.json.return_value = {"esearchresult": {"idlist": []}}
@@ -82,12 +82,11 @@ class TestPubMedTool:
         mock_client.__aenter__ = AsyncMock(return_value=mock_client)
         mock_client.__aexit__ = AsyncMock(return_value=None)

-        mocker.patch("httpx.AsyncClient", return_value=mock_client)
-
-        tool = PubMedTool()
-        results = await tool.search("xyznonexistentquery123")
+        with patch("httpx.AsyncClient", return_value=mock_client):
+            tool = PubMedTool()
+            results = await tool.search("xyznonexistentquery123")

-        assert results == []
+            assert results == []

     def test_parse_pubmed_xml(self):
         """PubMedTool should correctly parse XML."""
@@ -99,7 +98,7 @@ class TestPubMedTool:
         assert "Smith John" in results[0].citation.authors

     @pytest.mark.asyncio
-    async def test_search_preprocesses_query(self, mocker):
+    async def test_search_preprocesses_query(self):
         """Test that queries are preprocessed before search."""
         mock_search_response = MagicMock()
         mock_search_response.json.return_value = {"esearchresult": {"idlist": []}}
@@ -110,27 +109,24 @@ class TestPubMedTool:
         mock_client.__aenter__ = AsyncMock(return_value=mock_client)
         mock_client.__aexit__ = AsyncMock(return_value=None)

-        mocker.patch("httpx.AsyncClient", return_value=mock_client)
-
-        tool = PubMedTool()
-        await tool.search("What drugs help with Long COVID?")
+        with patch("httpx.AsyncClient", return_value=mock_client):
+            tool = PubMedTool()
+            await tool.search("What drugs help with Long COVID?")

-        # Verify call args
-        call_args = mock_client.get.call_args
-        params = call_args[1]["params"]
-        term = params["term"]
+            # Verify call args
+            call_args = mock_client.get.call_args
+            params = call_args[1]["params"]
+            term = params["term"]

-        # "what" and "help" should be stripped
-        assert "what" not in term.lower()
-        assert "help" not in term.lower()
-        # "long covid" should be expanded
-        assert "PASC" in term or "post-COVID" in term
+            # "what" and "help" should be stripped
+            assert "what" not in term.lower()
+            assert "help" not in term.lower()
+            # "long covid" should be expanded
+            assert "PASC" in term or "post-COVID" in term

     @pytest.mark.asyncio
-    async def test_rate_limiting_enforced(self, mocker):
+    async def test_rate_limiting_enforced(self):
         """PubMedTool should enforce rate limiting between requests."""
-        from unittest.mock import patch
-
         mock_search_response = MagicMock()
         mock_search_response.json.return_value = {"esearchresult": {"idlist": []}}
         mock_search_response.raise_for_status = MagicMock()
@@ -140,45 +136,35 @@ class TestPubMedTool:
         mock_client.__aenter__ = AsyncMock(return_value=mock_client)
         mock_client.__aexit__ = AsyncMock(return_value=None)

-        mocker.patch("httpx.AsyncClient", return_value=mock_client)
-
-        from src.tools.rate_limiter import reset_pubmed_limiter
-
-        # Reset the rate limiter to ensure clean state
-        reset_pubmed_limiter()
-
-        mock_search_response = MagicMock()
-        mock_search_response.json.return_value = {"esearchresult": {"idlist": []}}
-        mock_search_response.raise_for_status = MagicMock()
-        mock_client = AsyncMock()
-        mock_client.get = AsyncMock(return_value=mock_search_response)
-        mock_client.__aenter__ = AsyncMock(return_value=mock_client)
-        mock_client.__aexit__ = AsyncMock(return_value=None)
-        mocker.patch("httpx.AsyncClient", return_value=mock_client)
-
-        tool = PubMedTool()
-        tool._limiter.reset()  # Reset storage to start fresh
-
-        # For 3 requests/second rate limit, we need to make 4 requests quickly to trigger the limit
-        # Make first 3 requests - should all succeed without sleep (within rate limit)
-        with patch("asyncio.sleep") as mock_sleep_first:
-            for i in range(3):
-                await tool.search(f"test query {i + 1}")
-            # First 3 requests should not sleep (within 3/second limit)
-            assert mock_sleep_first.call_count == 0
-
-        # Make 4th request immediately - should trigger rate limit
-        # For 3 requests/second, the 4th request should wait
-        with patch("asyncio.sleep") as mock_sleep:
-            await tool.search("test query 4")
-            # Rate limiter uses polling with 0.01s sleep, so sleep should be called
-            # multiple times until enough time has passed (at least once)
-            assert mock_sleep.call_count > 0, (
-                f"Rate limiter should call sleep when rate limit is hit. Call count: {mock_sleep.call_count}"
-            )
+        with patch("httpx.AsyncClient", return_value=mock_client):
+            from src.tools.rate_limiter import reset_pubmed_limiter
+
+            # Reset the rate limiter to ensure clean state
+            reset_pubmed_limiter()
+
+            tool = PubMedTool()
+            tool._limiter.reset()  # Reset storage to start fresh
+
+            # For 3 requests/second rate limit, we need to make 4 requests quickly to trigger the limit
+            # Make first 3 requests - should all succeed without sleep (within rate limit)
+            with patch("asyncio.sleep") as mock_sleep_first:
+                for i in range(3):
+                    await tool.search(f"test query {i + 1}")
+                # First 3 requests should not sleep (within 3/second limit)
+                assert mock_sleep_first.call_count == 0
+
+            # Make 4th request immediately - should trigger rate limit
+            # For 3 requests/second, the 4th request should wait
+            with patch("asyncio.sleep") as mock_sleep:
+                await tool.search("test query 4")
+                # Rate limiter uses polling with 0.01s sleep, so sleep should be called
+                # multiple times until enough time has passed (at least once)
+                assert mock_sleep.call_count > 0, (
+                    f"Rate limiter should call sleep when rate limit is hit. Call count: {mock_sleep.call_count}"
+                )

     @pytest.mark.asyncio
-    async def test_api_key_included_in_params(self, mocker):
+    async def test_api_key_included_in_params(self):
         """PubMedTool should include API key in params when provided."""
         mock_search_response = MagicMock()
         mock_search_response.json.return_value = {"esearchresult": {"idlist": []}}
@@ -189,29 +175,28 @@ class TestPubMedTool:
         mock_client.__aenter__ = AsyncMock(return_value=mock_client)
         mock_client.__aexit__ = AsyncMock(return_value=None)

-        mocker.patch("httpx.AsyncClient", return_value=mock_client)
-
-        # Test with API key
-        tool = PubMedTool(api_key="test-api-key-123")
-        await tool.search("test query")
+        with patch("httpx.AsyncClient", return_value=mock_client):
+            # Test with API key
+            tool = PubMedTool(api_key="test-api-key-123")
+            await tool.search("test query")

-        # Verify API key was included in params
-        call_args = mock_client.get.call_args
-        params = call_args[1]["params"]
-        assert "api_key" in params
-        assert params["api_key"] == "test-api-key-123"
+            # Verify API key was included in params
+            call_args = mock_client.get.call_args
+            params = call_args[1]["params"]
+            assert "api_key" in params
+            assert params["api_key"] == "test-api-key-123"

-        # Test without API key
-        tool_no_key = PubMedTool(api_key=None)
-        mock_client.get.reset_mock()
-        await tool_no_key.search("test query")
+            # Test without API key
+            tool_no_key = PubMedTool(api_key=None)
+            mock_client.get.reset_mock()
+            await tool_no_key.search("test query")

-        call_args = mock_client.get.call_args
-        params = call_args[1]["params"]
-        assert "api_key" not in params
+            call_args = mock_client.get.call_args
+            params = call_args[1]["params"]
+            assert "api_key" not in params

     @pytest.mark.asyncio
-    async def test_handles_429_rate_limit(self, mocker):
+    async def test_handles_429_rate_limit(self):
         """PubMedTool should raise RateLimitError on 429 response."""
         import httpx

@@ -228,14 +213,13 @@ class TestPubMedTool:
         mock_client.__aenter__ = AsyncMock(return_value=mock_client)
         mock_client.__aexit__ = AsyncMock(return_value=None)

-        mocker.patch("httpx.AsyncClient", return_value=mock_client)
-
-        tool = PubMedTool()
-        with pytest.raises(RateLimitError, match="rate limit exceeded"):
-            await tool.search("test query")
+        with patch("httpx.AsyncClient", return_value=mock_client):
+            tool = PubMedTool()
+            with pytest.raises(RateLimitError, match="rate limit exceeded"):
+                await tool.search("test query")

     @pytest.mark.asyncio
-    async def test_handles_500_server_error(self, mocker):
+    async def test_handles_500_server_error(self):
         """PubMedTool should raise SearchError on 500 response."""
         import httpx

@@ -252,14 +236,13 @@ class TestPubMedTool:
         mock_client.__aenter__ = AsyncMock(return_value=mock_client)
         mock_client.__aexit__ = AsyncMock(return_value=None)

-        mocker.patch("httpx.AsyncClient", return_value=mock_client)
-
-        tool = PubMedTool()
-        with pytest.raises(SearchError, match="PubMed search failed"):
-            await tool.search("test query")
+        with patch("httpx.AsyncClient", return_value=mock_client):
+            tool = PubMedTool()
+            with pytest.raises(SearchError, match="PubMed search failed"):
+                await tool.search("test query")

     @pytest.mark.asyncio
-    async def test_handles_network_timeout(self, mocker):
+    async def test_handles_network_timeout(self):
         """PubMedTool should handle network timeout errors."""
         import httpx

@@ -270,12 +253,11 @@ class TestPubMedTool:
         mock_client.__aenter__ = AsyncMock(return_value=mock_client)
         mock_client.__aexit__ = AsyncMock(return_value=None)

-        mocker.patch("httpx.AsyncClient",
return_value=mock_client) - - tool = PubMedTool() - # Should be retried by tenacity, but eventually raise SearchError - with pytest.raises(SearchError): - await tool.search("test query") + with patch("httpx.AsyncClient", return_value=mock_client): + tool = PubMedTool() + # Should be retried by tenacity, but eventually raise SearchError + with pytest.raises(SearchError): + await tool.search("test query") def test_parse_empty_xml(self): """PubMedTool should handle empty XML gracefully.""" diff --git a/uv.lock b/uv.lock index df0cbcd203a9184ee315e90eacd5b6d07841345a..802fdea427b37d5723b8ff81e356f700d5e70dc3 100644 --- a/uv.lock +++ b/uv.lock @@ -370,6 +370,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/83/7b/5652771e24fff12da9dde4c20ecf4682e606b104f26419d139758cc935a6/azure_identity-1.25.1-py3-none-any.whl", hash = "sha256:e9edd720af03dff020223cd269fa3a61e8f345ea75443858273bcb44844ab651", size = 191317, upload-time = "2025-10-06T20:30:04.251Z" }, ] +[[package]] +name = "babel" +version = "2.17.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/7d/6b/d52e42361e1aa00709585ecc30b3f9684b3ab62530771402248b1b1d6240/babel-2.17.0.tar.gz", hash = "sha256:0c54cffb19f690cdcc52a3b50bcbf71e07a808d1c80d549f2459b9d2cf0afb9d", size = 9951852, upload-time = "2025-02-01T15:17:41.026Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/b7/b8/3fe70c75fe32afc4bb507f75563d39bc5642255d1d94f1f23604725780bf/babel-2.17.0-py3-none-any.whl", hash = "sha256:4d0b53093fdfb4b21c92b5213dba5a1b23885afa8383709427046b21c366e5f2", size = 10182537, upload-time = "2025-02-01T15:17:37.39Z" }, +] + [[package]] name = "backoff" version = "2.2.1" @@ -388,6 +397,20 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/b9/fa/123043af240e49752f1c4bd24da5053b6bd00cad78c2be53c0d1e8b975bc/backports.tarfile-1.2.0-py3-none-any.whl", hash = "sha256:77e284d754527b01fb1e6fa8a1afe577858ebe4e9dad8919e34c862cb399bc34", size = 30181, upload-time = "2024-05-28T17:01:53.112Z" }, ] +[[package]] +name = "backrefs" +version = "6.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/86/e3/bb3a439d5cb255c4774724810ad8073830fac9c9dee123555820c1bcc806/backrefs-6.1.tar.gz", hash = "sha256:3bba1749aafe1db9b915f00e0dd166cba613b6f788ffd63060ac3485dc9be231", size = 7011962, upload-time = "2025-11-15T14:52:08.323Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/3b/ee/c216d52f58ea75b5e1841022bbae24438b19834a29b163cb32aa3a2a7c6e/backrefs-6.1-py310-none-any.whl", hash = "sha256:2a2ccb96302337ce61ee4717ceacfbf26ba4efb1d55af86564b8bbaeda39cac1", size = 381059, upload-time = "2025-11-15T14:51:59.758Z" }, + { url = "https://files.pythonhosted.org/packages/e6/9a/8da246d988ded941da96c7ed945d63e94a445637eaad985a0ed88787cb89/backrefs-6.1-py311-none-any.whl", hash = "sha256:e82bba3875ee4430f4de4b6db19429a27275d95a5f3773c57e9e18abc23fd2b7", size = 392854, upload-time = "2025-11-15T14:52:01.194Z" }, + { url = "https://files.pythonhosted.org/packages/37/c9/fd117a6f9300c62bbc33bc337fd2b3c6bfe28b6e9701de336b52d7a797ad/backrefs-6.1-py312-none-any.whl", hash = "sha256:c64698c8d2269343d88947c0735cb4b78745bd3ba590e10313fbf3f78c34da5a", size = 398770, upload-time = "2025-11-15T14:52:02.584Z" }, + { url = "https://files.pythonhosted.org/packages/eb/95/7118e935b0b0bd3f94dfec2d852fd4e4f4f9757bdb49850519acd245cd3a/backrefs-6.1-py313-none-any.whl", hash = 
"sha256:4c9d3dc1e2e558965202c012304f33d4e0e477e1c103663fd2c3cc9bb18b0d05", size = 400726, upload-time = "2025-11-15T14:52:04.093Z" }, + { url = "https://files.pythonhosted.org/packages/1d/72/6296bad135bfafd3254ae3648cd152980a424bd6fed64a101af00cc7ba31/backrefs-6.1-py314-none-any.whl", hash = "sha256:13eafbc9ccd5222e9c1f0bec563e6d2a6d21514962f11e7fc79872fd56cbc853", size = 412584, upload-time = "2025-11-15T14:52:05.233Z" }, + { url = "https://files.pythonhosted.org/packages/02/e3/a4fa1946722c4c7b063cc25043a12d9ce9b4323777f89643be74cef2993c/backrefs-6.1-py39-none-any.whl", hash = "sha256:a9e99b8a4867852cad177a6430e31b0f6e495d65f8c6c134b68c14c3c95bf4b0", size = 381058, upload-time = "2025-11-15T14:52:06.698Z" }, +] + [[package]] name = "banks" version = "2.2.0" @@ -1046,6 +1069,12 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/0d/c3/e90f4a4feae6410f914f8ebac129b9ae7a8c92eb60a638012dde42030a9d/cryptography-46.0.3-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:6b5063083824e5509fdba180721d55909ffacccc8adbec85268b48439423d78c", size = 3438528, upload-time = "2025-10-15T23:18:26.227Z" }, ] +[[package]] +name = "csscompressor" +version = "0.9.5" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/f1/2a/8c3ac3d8bc94e6de8d7ae270bb5bc437b210bb9d6d9e46630c98f4abd20c/csscompressor-0.9.5.tar.gz", hash = "sha256:afa22badbcf3120a4f392e4d22f9fff485c044a1feda4a950ecc5eba9dd31a05", size = 237808, upload-time = "2017-11-26T21:13:08.238Z" } + [[package]] name = "cyclopts" version = "4.2.5" @@ -1079,17 +1108,23 @@ name = "deepcritical" version = "0.1.0" source = { editable = "." } dependencies = [ + { name = "agent-framework-core" }, { name = "anthropic" }, { name = "beautifulsoup4" }, + { name = "chromadb" }, { name = "duckduckgo-search" }, - { name = "gradio", extra = ["mcp"] }, + { name = "gradio", extra = ["mcp", "oauth"] }, { name = "httpx" }, { name = "huggingface-hub" }, { name = "limits" }, { name = "llama-index" }, + { name = "llama-index-embeddings-openai" }, { name = "llama-index-llms-huggingface" }, { name = "llama-index-llms-huggingface-api" }, + { name = "llama-index-llms-openai" }, { name = "llama-index-vector-stores-chroma" }, + { name = "modal" }, + { name = "numpy" }, { name = "openai" }, { name = "pydantic" }, { name = "pydantic-ai" }, @@ -1097,13 +1132,20 @@ dependencies = [ { name = "pydantic-settings" }, { name = "python-dotenv" }, { name = "requests" }, + { name = "sentence-transformers" }, { name = "structlog" }, { name = "tenacity" }, + { name = "tokenizers" }, + { name = "transformers" }, { name = "xmltodict" }, ] [package.optional-dependencies] dev = [ + { name = "mkdocs" }, + { name = "mkdocs-material" }, + { name = "mkdocs-mermaid2-plugin" }, + { name = "mkdocs-minify-plugin" }, { name = "mypy" }, { name = "pre-commit" }, { name = "pytest" }, @@ -1115,54 +1157,45 @@ dev = [ { name = "ruff" }, { name = "typer" }, ] -embeddings = [ - { name = "chromadb" }, - { name = "numpy" }, - { name = "sentence-transformers" }, -] -magentic = [ - { name = "agent-framework-core" }, -] -modal = [ - { name = "chromadb" }, - { name = "llama-index" }, - { name = "llama-index-embeddings-openai" }, - { name = "llama-index-llms-openai" }, - { name = "llama-index-vector-stores-chroma" }, - { name = "modal" }, - { name = "numpy" }, -] [package.dev-dependencies] dev = [ + { name = "mkdocs-codeinclude-plugin" }, + { name = "mkdocs-macros-plugin" }, + { name = "pytest" }, + { name = "pytest-asyncio" }, + { name = "pytest-cov" }, + { 
name = "pytest-mock" }, + { name = "pytest-sugar" }, + { name = "respx" }, { name = "structlog" }, { name = "ty" }, ] [package.metadata] requires-dist = [ - { name = "agent-framework-core", marker = "extra == 'magentic'", specifier = ">=1.0.0b251120,<2.0.0" }, + { name = "agent-framework-core", specifier = ">=1.0.0b251120,<2.0.0" }, { name = "anthropic", specifier = ">=0.18.0" }, { name = "beautifulsoup4", specifier = ">=4.12" }, - { name = "chromadb", marker = "extra == 'embeddings'", specifier = ">=0.4.0" }, - { name = "chromadb", marker = "extra == 'modal'", specifier = ">=0.4.0" }, + { name = "chromadb", specifier = ">=0.4.0" }, { name = "duckduckgo-search", specifier = ">=5.0" }, - { name = "gradio", extras = ["mcp"], specifier = ">=6.0.0" }, + { name = "gradio", extras = ["mcp", "oauth"], specifier = ">=6.0.0" }, { name = "httpx", specifier = ">=0.27" }, { name = "huggingface-hub", specifier = ">=0.20.0" }, { name = "limits", specifier = ">=3.0" }, { name = "llama-index", specifier = ">=0.14.8" }, - { name = "llama-index", marker = "extra == 'modal'", specifier = ">=0.11.0" }, - { name = "llama-index-embeddings-openai", marker = "extra == 'modal'" }, + { name = "llama-index-embeddings-openai", specifier = ">=0.5.1" }, { name = "llama-index-llms-huggingface", specifier = ">=0.6.1" }, { name = "llama-index-llms-huggingface-api", specifier = ">=0.6.1" }, - { name = "llama-index-llms-openai", marker = "extra == 'modal'" }, + { name = "llama-index-llms-openai", specifier = ">=0.6.9" }, { name = "llama-index-vector-stores-chroma", specifier = ">=0.5.3" }, - { name = "llama-index-vector-stores-chroma", marker = "extra == 'modal'" }, - { name = "modal", marker = "extra == 'modal'", specifier = ">=0.63.0" }, + { name = "mkdocs", marker = "extra == 'dev'", specifier = ">=1.5.0" }, + { name = "mkdocs-material", marker = "extra == 'dev'", specifier = ">=9.0.0" }, + { name = "mkdocs-mermaid2-plugin", marker = "extra == 'dev'", specifier = ">=1.1.0" }, + { name = "mkdocs-minify-plugin", marker = "extra == 'dev'", specifier = ">=0.7.0" }, + { name = "modal", specifier = ">=0.63.0" }, { name = "mypy", marker = "extra == 'dev'", specifier = ">=1.10" }, - { name = "numpy", marker = "extra == 'embeddings'", specifier = "<2.0" }, - { name = "numpy", marker = "extra == 'modal'", specifier = "<2.0" }, + { name = "numpy", specifier = "<2.0" }, { name = "openai", specifier = ">=1.0.0" }, { name = "pre-commit", marker = "extra == 'dev'", specifier = ">=3.7" }, { name = "pydantic", specifier = ">=2.7" }, @@ -1178,16 +1211,26 @@ requires-dist = [ { name = "requests", specifier = ">=2.32.5" }, { name = "respx", marker = "extra == 'dev'", specifier = ">=0.21" }, { name = "ruff", marker = "extra == 'dev'", specifier = ">=0.4.0" }, - { name = "sentence-transformers", marker = "extra == 'embeddings'", specifier = ">=2.2.0" }, + { name = "sentence-transformers", specifier = ">=2.2.0" }, { name = "structlog", specifier = ">=24.1" }, { name = "tenacity", specifier = ">=8.2" }, + { name = "tokenizers", specifier = ">=0.22.0,<=0.23.0" }, + { name = "transformers", specifier = ">=4.57.2" }, { name = "typer", marker = "extra == 'dev'", specifier = ">=0.9.0" }, { name = "xmltodict", specifier = ">=0.13" }, ] -provides-extras = ["dev", "magentic", "embeddings", "modal"] +provides-extras = ["dev"] [package.metadata.requires-dev] dev = [ + { name = "mkdocs-codeinclude-plugin", specifier = ">=0.2.1" }, + { name = "mkdocs-macros-plugin", specifier = ">=1.5.0" }, + { name = "pytest", specifier = ">=9.0.1" }, + { name = 
"pytest-asyncio", specifier = ">=1.3.0" }, + { name = "pytest-cov", specifier = ">=7.0.0" }, + { name = "pytest-mock", specifier = ">=3.15.1" }, + { name = "pytest-sugar", specifier = ">=1.1.1" }, + { name = "respx", specifier = ">=0.22.0" }, { name = "structlog", specifier = ">=25.5.0" }, { name = "ty", specifier = ">=0.0.1a28" }, ] @@ -1299,6 +1342,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/b0/0d/9feae160378a3553fa9a339b0e9c1a048e147a4127210e286ef18b730f03/durationpy-0.10-py3-none-any.whl", hash = "sha256:3b41e1b601234296b4fb368338fdcd3e13e0b4fb5b67345948f4f2bf9868b286", size = 3922, upload-time = "2025-05-17T13:52:36.463Z" }, ] +[[package]] +name = "editorconfig" +version = "0.17.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/88/3a/a61d9a1f319a186b05d14df17daea42fcddea63c213bcd61a929fb3a6796/editorconfig-0.17.1.tar.gz", hash = "sha256:23c08b00e8e08cc3adcddb825251c497478df1dada6aefeb01e626ad37303745", size = 14695, upload-time = "2025-06-09T08:21:37.097Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/96/fd/a40c621ff207f3ce8e484aa0fc8ba4eb6e3ecf52e15b42ba764b457a9550/editorconfig-0.17.1-py3-none-any.whl", hash = "sha256:1eda9c2c0db8c16dbd50111b710572a5e6de934e39772de1959d41f64fc17c82", size = 16360, upload-time = "2025-06-09T08:21:35.654Z" }, +] + [[package]] name = "email-validator" version = "2.3.0" @@ -1587,6 +1639,18 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/ef/cf/9a31e8df3116cd8684baf950d1b05ec4dfb08d719a075fe7fe7bd78b453a/genai_prices-0.0.44-py3-none-any.whl", hash = "sha256:668debbd3d670f0e46af4f5bd0ce815a74847ee8d62d292ee33319ada3733009", size = 52425, upload-time = "2025-11-21T17:58:23.641Z" }, ] +[[package]] +name = "ghp-import" +version = "2.1.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "python-dateutil" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/d9/29/d40217cbe2f6b1359e00c6c307bb3fc876ba74068cbab3dde77f03ca0dc4/ghp-import-2.1.0.tar.gz", hash = "sha256:9c535c4c61193c2df8871222567d7fd7e5014d835f97dc7b7439069e2413d343", size = 10943, upload-time = "2022-05-02T15:47:16.11Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/f7/ec/67fbef5d497f86283db54c22eec6f6140243aae73265799baaaa19cd17fb/ghp_import-2.1.0-py3-none-any.whl", hash = "sha256:8337dd7b50877f163d4c0289bc1f1c7f127550241988d568c1db512c4324a619", size = 11034, upload-time = "2022-05-02T15:47:14.552Z" }, +] + [[package]] name = "google-auth" version = "2.43.0" @@ -1676,6 +1740,10 @@ mcp = [ { name = "mcp" }, { name = "pydantic" }, ] +oauth = [ + { name = "authlib" }, + { name = "itsdangerous" }, +] [[package]] name = "gradio-client" @@ -1896,6 +1964,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/cb/44/870d44b30e1dcfb6a65932e3e1506c103a8a5aea9103c337e7a53180322c/hf_xet-1.2.0-cp37-abi3-win_amd64.whl", hash = "sha256:e6584a52253f72c9f52f9e549d5895ca7a471608495c4ecaa6cc73dba2b24d69", size = 2905735, upload-time = "2025-10-24T19:04:35.928Z" }, ] +[[package]] +name = "hjson" +version = "3.1.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/82/e5/0b56d723a76ca67abadbf7fb71609fb0ea7e6926e94fcca6c65a85b36a0e/hjson-3.1.0.tar.gz", hash = "sha256:55af475a27cf83a7969c808399d7bccdec8fb836a07ddbd574587593b9cdcf75", size = 40541, upload-time = "2022-08-13T02:53:01.919Z" } +wheels = [ + { url = 
"https://files.pythonhosted.org/packages/1f/7f/13cd798d180af4bf4c0ceddeefba2b864a63c71645abc0308b768d67bb81/hjson-3.1.0-py3-none-any.whl", hash = "sha256:65713cdcf13214fb554eb8b4ef803419733f4f5e551047c9b711098ab7186b89", size = 54018, upload-time = "2022-08-13T02:52:59.899Z" }, +] + [[package]] name = "hpack" version = "4.1.0" @@ -1905,6 +1982,14 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/07/c6/80c95b1b2b94682a72cbdbfb85b81ae2daffa4291fbfa1b1464502ede10d/hpack-4.1.0-py3-none-any.whl", hash = "sha256:157ac792668d995c657d93111f46b4535ed114f0c9c8d672271bbec7eae1b496", size = 34357, upload-time = "2025-01-22T21:44:56.92Z" }, ] +[[package]] +name = "htmlmin2" +version = "0.1.13" +source = { registry = "https://pypi.org/simple" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/be/31/a76f4bfa885f93b8167cb4c85cf32b54d1f64384d0b897d45bc6d19b7b45/htmlmin2-0.1.13-py3-none-any.whl", hash = "sha256:75609f2a42e64f7ce57dbff28a39890363bde9e7e5885db633317efbdf8c79a2", size = 34486, upload-time = "2023-03-14T21:28:30.388Z" }, +] + [[package]] name = "httpcore" version = "1.0.9" @@ -2080,6 +2165,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/32/4b/b99e37f88336009971405cbb7630610322ed6fbfa31e1d7ab3fbf3049a2d/invoke-2.2.1-py3-none-any.whl", hash = "sha256:2413bc441b376e5cd3f55bb5d364f973ad8bdd7bf87e53c79de3c11bf3feecc8", size = 160287, upload-time = "2025-10-11T00:36:33.703Z" }, ] +[[package]] +name = "itsdangerous" +version = "2.2.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/9c/cb/8ac0172223afbccb63986cc25049b154ecfb5e85932587206f42317be31d/itsdangerous-2.2.0.tar.gz", hash = "sha256:e0050c0b7da1eea53ffaf149c0cfbb5c6e2e2b69c4bef22c81fa6eb73e5f6173", size = 54410, upload-time = "2024-04-16T21:28:15.614Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/04/96/92447566d16df59b2a776c0fb82dbc4d9e07cd95062562af01e408583fc4/itsdangerous-2.2.0-py3-none-any.whl", hash = "sha256:c6242fc49e35958c8b15141343aa660db5fc54d4f13a1db01a3f5891b98700ef", size = 16234, upload-time = "2024-04-16T21:28:14.499Z" }, +] + [[package]] name = "jaraco-classes" version = "3.4.0" @@ -2240,6 +2334,25 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/1e/e8/685f47e0d754320684db4425a0967f7d3fa70126bffd76110b7009a0090f/joblib-1.5.2-py3-none-any.whl", hash = "sha256:4e1f0bdbb987e6d843c70cf43714cb276623def372df3c22fe5266b2670bc241", size = 308396, upload-time = "2025-08-27T12:15:45.188Z" }, ] +[[package]] +name = "jsbeautifier" +version = "1.15.4" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "editorconfig" }, + { name = "six" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/ea/98/d6cadf4d5a1c03b2136837a435682418c29fdeb66be137128544cecc5b7a/jsbeautifier-1.15.4.tar.gz", hash = "sha256:5bb18d9efb9331d825735fbc5360ee8f1aac5e52780042803943aa7f854f7592", size = 75257, upload-time = "2025-02-27T17:53:53.252Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/2d/14/1c65fccf8413d5f5c6e8425f84675169654395098000d8bddc4e9d3390e1/jsbeautifier-1.15.4-py3-none-any.whl", hash = "sha256:72f65de312a3f10900d7685557f84cb61a9733c50dcc27271a39f5b0051bf528", size = 94707, upload-time = "2025-02-27T17:53:46.152Z" }, +] + +[[package]] +name = "jsmin" +version = "3.0.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = 
"https://files.pythonhosted.org/packages/5e/73/e01e4c5e11ad0494f4407a3f623ad4d87714909f50b17a06ed121034ff6e/jsmin-3.0.1.tar.gz", hash = "sha256:c0959a121ef94542e807a674142606f7e90214a2b3d1eb17300244bbb5cc2bfc", size = 13925, upload-time = "2022-01-16T20:35:59.13Z" } + [[package]] name = "jsonschema" version = "4.25.1" @@ -2721,6 +2834,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/6c/77/d7f491cbc05303ac6801651aabeb262d43f319288c1ea96c66b1d2692ff3/lxml-6.0.2-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:27220da5be049e936c3aca06f174e8827ca6445a4353a1995584311487fc4e3e", size = 3518768, upload-time = "2025-09-22T04:04:57.097Z" }, ] +[[package]] +name = "markdown" +version = "3.10" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/7d/ab/7dd27d9d863b3376fcf23a5a13cb5d024aed1db46f963f1b5735ae43b3be/markdown-3.10.tar.gz", hash = "sha256:37062d4f2aa4b2b6b32aefb80faa300f82cc790cb949a35b8caede34f2b68c0e", size = 364931, upload-time = "2025-11-03T19:51:15.007Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/70/81/54e3ce63502cd085a0c556652a4e1b919c45a446bd1e5300e10c44c8c521/markdown-3.10-py3-none-any.whl", hash = "sha256:b5b99d6951e2e4948d939255596523444c0e677c669700b1d17aa4a8a464cb7c", size = 107678, upload-time = "2025-11-03T19:51:13.887Z" }, +] + [[package]] name = "markdown-it-py" version = "4.0.0" @@ -2858,6 +2980,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/b3/38/89ba8ad64ae25be8de66a6d463314cf1eb366222074cfda9ee839c56a4b4/mdurl-0.1.2-py3-none-any.whl", hash = "sha256:84008a41e51615a49fc9966191ff91509e3c40b939176e643fd50a5c2196b8f8", size = 9979, upload-time = "2022-08-14T12:40:09.779Z" }, ] +[[package]] +name = "mergedeep" +version = "1.3.4" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/3a/41/580bb4006e3ed0361b8151a01d324fb03f420815446c7def45d02f74c270/mergedeep-1.3.4.tar.gz", hash = "sha256:0096d52e9dad9939c3d975a774666af186eda617e6ca84df4c94dec30004f2a8", size = 4661, upload-time = "2021-02-05T18:55:30.623Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/2c/19/04f9b178c2d8a15b076c8b5140708fa6ffc5601fb6f1e975537072df5b2a/mergedeep-1.3.4-py3-none-any.whl", hash = "sha256:70775750742b25c0d8f36c55aed03d24c3384d17c951b3175d898bd778ef0307", size = 6354, upload-time = "2021-02-05T18:55:29.583Z" }, +] + [[package]] name = "mistralai" version = "1.9.11" @@ -2876,6 +3007,141 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/fe/76/4ce12563aea5a76016f8643eff30ab731e6656c845e9e4d090ef10c7b925/mistralai-1.9.11-py3-none-any.whl", hash = "sha256:7a3dc2b8ef3fceaa3582220234261b5c4e3e03a972563b07afa150e44a25a6d3", size = 442796, upload-time = "2025-10-02T15:53:39.134Z" }, ] +[[package]] +name = "mkdocs" +version = "1.6.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "click" }, + { name = "colorama", marker = "sys_platform == 'win32'" }, + { name = "ghp-import" }, + { name = "jinja2" }, + { name = "markdown" }, + { name = "markupsafe" }, + { name = "mergedeep" }, + { name = "mkdocs-get-deps" }, + { name = "packaging" }, + { name = "pathspec" }, + { name = "pyyaml" }, + { name = "pyyaml-env-tag" }, + { name = "watchdog" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/bc/c6/bbd4f061bd16b378247f12953ffcb04786a618ce5e904b8c5a01a0309061/mkdocs-1.6.1.tar.gz", hash = "sha256:7b432f01d928c084353ab39c57282f29f92136665bdd6abf7c1ec8d822ef86f2", size = 3889159, 
upload-time = "2024-08-30T12:24:06.899Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/22/5b/dbc6a8cddc9cfa9c4971d59fb12bb8d42e161b7e7f8cc89e49137c5b279c/mkdocs-1.6.1-py3-none-any.whl", hash = "sha256:db91759624d1647f3f34aa0c3f327dd2601beae39a366d6e064c03468d35c20e", size = 3864451, upload-time = "2024-08-30T12:24:05.054Z" }, +] + +[[package]] +name = "mkdocs-codeinclude-plugin" +version = "0.2.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "mkdocs" }, + { name = "pygments" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/1b/b5/f72df157abc7f85e33ffa417464e9dd535ef5fda7654eda41190047a53b6/mkdocs-codeinclude-plugin-0.2.1.tar.gz", hash = "sha256:305387f67a885f0e36ec1cf977324fe1fe50d31301147194b63631d0864601b1", size = 8140, upload-time = "2023-03-01T19:57:06.724Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/4d/7b/60573ebf2a22b144eeaf3b29db9a6d4d342d68273f716ea2723d1ad723ba/mkdocs_codeinclude_plugin-0.2.1-py3-none-any.whl", hash = "sha256:172a917c9b257fa62850b669336151f85d3cd40312b2b52520cbcceab557ea6c", size = 8093, upload-time = "2023-03-01T19:57:05.207Z" }, +] + +[[package]] +name = "mkdocs-get-deps" +version = "0.2.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "mergedeep" }, + { name = "platformdirs" }, + { name = "pyyaml" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/98/f5/ed29cd50067784976f25ed0ed6fcd3c2ce9eb90650aa3b2796ddf7b6870b/mkdocs_get_deps-0.2.0.tar.gz", hash = "sha256:162b3d129c7fad9b19abfdcb9c1458a651628e4b1dea628ac68790fb3061c60c", size = 10239, upload-time = "2023-11-20T17:51:09.981Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/9f/d4/029f984e8d3f3b6b726bd33cafc473b75e9e44c0f7e80a5b29abc466bdea/mkdocs_get_deps-0.2.0-py3-none-any.whl", hash = "sha256:2bf11d0b133e77a0dd036abeeb06dec8775e46efa526dc70667d8863eefc6134", size = 9521, upload-time = "2023-11-20T17:51:08.587Z" }, +] + +[[package]] +name = "mkdocs-macros-plugin" +version = "1.5.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "hjson" }, + { name = "jinja2" }, + { name = "mkdocs" }, + { name = "packaging" }, + { name = "pathspec" }, + { name = "python-dateutil" }, + { name = "pyyaml" }, + { name = "requests" }, + { name = "super-collections" }, + { name = "termcolor" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/92/15/e6a44839841ebc9c5872fa0e6fad1c3757424e4fe026093b68e9f386d136/mkdocs_macros_plugin-1.5.0.tar.gz", hash = "sha256:12aa45ce7ecb7a445c66b9f649f3dd05e9b92e8af6bc65e4acd91d26f878c01f", size = 37730, upload-time = "2025-11-13T08:08:55.545Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/51/62/9fffba5bb9ed3d31a932ad35038ba9483d59850256ee0fea7f1187173983/mkdocs_macros_plugin-1.5.0-py3-none-any.whl", hash = "sha256:c10fabd812bf50f9170609d0ed518e54f1f0e12c334ac29141723a83c881dd6f", size = 44626, upload-time = "2025-11-13T08:08:53.878Z" }, +] + +[[package]] +name = "mkdocs-material" +version = "9.7.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "babel" }, + { name = "backrefs" }, + { name = "colorama" }, + { name = "jinja2" }, + { name = "markdown" }, + { name = "mkdocs" }, + { name = "mkdocs-material-extensions" }, + { name = "paginate" }, + { name = "pygments" }, + { name = "pymdown-extensions" }, + { name = "requests" }, +] +sdist = { url = 
"https://files.pythonhosted.org/packages/9c/3b/111b84cd6ff28d9e955b5f799ef217a17bc1684ac346af333e6100e413cb/mkdocs_material-9.7.0.tar.gz", hash = "sha256:602b359844e906ee402b7ed9640340cf8a474420d02d8891451733b6b02314ec", size = 4094546, upload-time = "2025-11-11T08:49:09.73Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/04/87/eefe8d5e764f4cf50ed91b943f8e8f96b5efd65489d8303b7a36e2e79834/mkdocs_material-9.7.0-py3-none-any.whl", hash = "sha256:da2866ea53601125ff5baa8aa06404c6e07af3c5ce3d5de95e3b52b80b442887", size = 9283770, upload-time = "2025-11-11T08:49:06.26Z" }, +] + +[[package]] +name = "mkdocs-material-extensions" +version = "1.3.1" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/79/9b/9b4c96d6593b2a541e1cb8b34899a6d021d208bb357042823d4d2cabdbe7/mkdocs_material_extensions-1.3.1.tar.gz", hash = "sha256:10c9511cea88f568257f960358a467d12b970e1f7b2c0e5fb2bb48cab1928443", size = 11847, upload-time = "2023-11-22T19:09:45.208Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/5b/54/662a4743aa81d9582ee9339d4ffa3c8fd40a4965e033d77b9da9774d3960/mkdocs_material_extensions-1.3.1-py3-none-any.whl", hash = "sha256:adff8b62700b25cb77b53358dad940f3ef973dd6db797907c49e3c2ef3ab4e31", size = 8728, upload-time = "2023-11-22T19:09:43.465Z" }, +] + +[[package]] +name = "mkdocs-mermaid2-plugin" +version = "1.2.3" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "beautifulsoup4" }, + { name = "jsbeautifier" }, + { name = "mkdocs" }, + { name = "pymdown-extensions" }, + { name = "requests" }, + { name = "setuptools" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/2a/6d/308f443a558b6a97ce55782658174c0d07c414405cfc0a44d36ad37e36f9/mkdocs_mermaid2_plugin-1.2.3.tar.gz", hash = "sha256:fb6f901d53e5191e93db78f93f219cad926ccc4d51e176271ca5161b6cc5368c", size = 16220, upload-time = "2025-10-17T19:38:53.047Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/1a/4b/6fd6dd632019b7f522f1b1f794ab6115cd79890330986614be56fd18f0eb/mkdocs_mermaid2_plugin-1.2.3-py3-none-any.whl", hash = "sha256:33f60c582be623ed53829a96e19284fc7f1b74a1dbae78d4d2e47fe00c3e190d", size = 17299, upload-time = "2025-10-17T19:38:51.874Z" }, +] + +[[package]] +name = "mkdocs-minify-plugin" +version = "0.8.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "csscompressor" }, + { name = "htmlmin2" }, + { name = "jsmin" }, + { name = "mkdocs" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/52/67/fe4b77e7a8ae7628392e28b14122588beaf6078b53eb91c7ed000fd158ac/mkdocs-minify-plugin-0.8.0.tar.gz", hash = "sha256:bc11b78b8120d79e817308e2b11539d790d21445eb63df831e393f76e52e753d", size = 8366, upload-time = "2024-01-29T16:11:32.982Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/1b/cd/2e8d0d92421916e2ea4ff97f10a544a9bd5588eb747556701c983581df13/mkdocs_minify_plugin-0.8.0-py3-none-any.whl", hash = "sha256:5fba1a3f7bd9a2142c9954a6559a57e946587b21f133165ece30ea145c66aee6", size = 6723, upload-time = "2024-01-29T16:11:31.851Z" }, +] + [[package]] name = "mmh3" version = "5.2.0" @@ -3724,6 +3990,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/20/12/38679034af332785aac8774540895e234f4d07f7545804097de4b666afd8/packaging-25.0-py3-none-any.whl", hash = "sha256:29572ef2b1f17581046b3a2227d5c611fb25ec70ca1ba8554b24b0e69331a484", size = 66469, upload-time = "2025-04-19T11:48:57.875Z" }, ] +[[package]] +name = "paginate" +version = "0.5.7" 
+source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/ec/46/68dde5b6bc00c1296ec6466ab27dddede6aec9af1b99090e1107091b3b84/paginate-0.5.7.tar.gz", hash = "sha256:22bd083ab41e1a8b4f3690544afb2c60c25e5c9a63a30fa2f483f6c60c8e5945", size = 19252, upload-time = "2024-08-25T14:17:24.139Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/90/96/04b8e52da071d28f5e21a805b19cb9390aa17a47462ac87f5e2696b9566d/paginate-0.5.7-py2.py3-none-any.whl", hash = "sha256:b885e2af73abcf01d9559fd5216b57ef722f8c42affbb63942377668e35c7591", size = 13746, upload-time = "2024-08-25T14:17:22.55Z" }, +] + [[package]] name = "pandas" version = "2.2.3" @@ -4585,6 +4860,19 @@ crypto = [ { name = "cryptography" }, ] +[[package]] +name = "pymdown-extensions" +version = "10.17.2" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "markdown" }, + { name = "pyyaml" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/25/6d/af5378dbdb379fddd9a277f8b9888c027db480cde70028669ebd009d642a/pymdown_extensions-10.17.2.tar.gz", hash = "sha256:26bb3d7688e651606260c90fb46409fbda70bf9fdc3623c7868643a1aeee4713", size = 847344, upload-time = "2025-11-26T15:43:57.004Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/93/78/b93cb80bd673bdc9f6ede63d8eb5b4646366953df15667eb3603be57a2b1/pymdown_extensions-10.17.2-py3-none-any.whl", hash = "sha256:bffae79a2e8b9e44aef0d813583a8fea63457b7a23643a43988055b7b79b4992", size = 266556, upload-time = "2025-11-26T15:43:55.162Z" }, +] + [[package]] name = "pypdf" version = "6.4.0" @@ -4817,6 +5105,18 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/f1/12/de94a39c2ef588c7e6455cfbe7343d3b2dc9d6b6b2f40c4c6565744c873d/pyyaml-6.0.3-cp314-cp314t-win_arm64.whl", hash = "sha256:ebc55a14a21cb14062aa4162f906cd962b28e2e9ea38f9b4391244cd8de4ae0b", size = 149341, upload-time = "2025-09-25T21:32:56.828Z" }, ] +[[package]] +name = "pyyaml-env-tag" +version = "1.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "pyyaml" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/eb/2e/79c822141bfd05a853236b504869ebc6b70159afc570e1d5a20641782eaa/pyyaml_env_tag-1.1.tar.gz", hash = "sha256:2eb38b75a2d21ee0475d6d97ec19c63287a7e140231e4214969d0eac923cd7ff", size = 5737, upload-time = "2025-05-13T15:24:01.64Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/04/11/432f32f8097b03e3cd5fe57e88efb685d964e2e5178a48ed61e841f7fdce/pyyaml_env_tag-1.1-py3-none-any.whl", hash = "sha256:17109e1a528561e32f026364712fee1264bc2ea6715120891174ed1b980d2e04", size = 4722, upload-time = "2025-05-13T15:23:59.629Z" }, +] + [[package]] name = "referencing" version = "0.36.2" @@ -5462,6 +5762,18 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/a8/45/a132b9074aa18e799b891b91ad72133c98d8042c70f6240e4c5f9dabee2f/structlog-25.5.0-py3-none-any.whl", hash = "sha256:a8453e9b9e636ec59bd9e79bbd4a72f025981b3ba0f5837aebf48f02f37a7f9f", size = 72510, upload-time = "2025-10-27T08:28:21.535Z" }, ] +[[package]] +name = "super-collections" +version = "0.6.2" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "hjson" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/e0/de/a0c3d1244912c260638f0f925e190e493ccea37ecaea9bbad7c14413b803/super_collections-0.6.2.tar.gz", hash = "sha256:0c8d8abacd9fad2c7c1c715f036c29f5db213f8cac65f24d45ecba12b4da187a", size = 31315, upload-time = "2025-09-30T00:37:08.067Z" } +wheels = [ + { url = 
"https://files.pythonhosted.org/packages/17/43/47c7cf84b3bd74a8631b02d47db356656bb8dff6f2e61a4c749963814d0d/super_collections-0.6.2-py3-none-any.whl", hash = "sha256:291b74d26299e9051d69ad9d89e61b07b6646f86a57a2f5ab3063d206eee9c56", size = 16173, upload-time = "2025-09-30T00:37:07.104Z" }, +] + [[package]] name = "sympy" version = "1.14.0" @@ -5992,6 +6304,33 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/79/0c/c05523fa3181fdf0c9c52a6ba91a23fbf3246cc095f26f6516f9c60e6771/virtualenv-20.35.4-py3-none-any.whl", hash = "sha256:c21c9cede36c9753eeade68ba7d523529f228a403463376cf821eaae2b650f1b", size = 6005095, upload-time = "2025-10-29T06:57:37.598Z" }, ] +[[package]] +name = "watchdog" +version = "6.0.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/db/7d/7f3d619e951c88ed75c6037b246ddcf2d322812ee8ea189be89511721d54/watchdog-6.0.0.tar.gz", hash = "sha256:9ddf7c82fda3ae8e24decda1338ede66e1c99883db93711d8fb941eaa2d8c282", size = 131220, upload-time = "2024-11-01T14:07:13.037Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/e0/24/d9be5cd6642a6aa68352ded4b4b10fb0d7889cb7f45814fb92cecd35f101/watchdog-6.0.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:6eb11feb5a0d452ee41f824e271ca311a09e250441c262ca2fd7ebcf2461a06c", size = 96393, upload-time = "2024-11-01T14:06:31.756Z" }, + { url = "https://files.pythonhosted.org/packages/63/7a/6013b0d8dbc56adca7fdd4f0beed381c59f6752341b12fa0886fa7afc78b/watchdog-6.0.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:ef810fbf7b781a5a593894e4f439773830bdecb885e6880d957d5b9382a960d2", size = 88392, upload-time = "2024-11-01T14:06:32.99Z" }, + { url = "https://files.pythonhosted.org/packages/d1/40/b75381494851556de56281e053700e46bff5b37bf4c7267e858640af5a7f/watchdog-6.0.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:afd0fe1b2270917c5e23c2a65ce50c2a4abb63daafb0d419fde368e272a76b7c", size = 89019, upload-time = "2024-11-01T14:06:34.963Z" }, + { url = "https://files.pythonhosted.org/packages/39/ea/3930d07dafc9e286ed356a679aa02d777c06e9bfd1164fa7c19c288a5483/watchdog-6.0.0-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:bdd4e6f14b8b18c334febb9c4425a878a2ac20efd1e0b231978e7b150f92a948", size = 96471, upload-time = "2024-11-01T14:06:37.745Z" }, + { url = "https://files.pythonhosted.org/packages/12/87/48361531f70b1f87928b045df868a9fd4e253d9ae087fa4cf3f7113be363/watchdog-6.0.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:c7c15dda13c4eb00d6fb6fc508b3c0ed88b9d5d374056b239c4ad1611125c860", size = 88449, upload-time = "2024-11-01T14:06:39.748Z" }, + { url = "https://files.pythonhosted.org/packages/5b/7e/8f322f5e600812e6f9a31b75d242631068ca8f4ef0582dd3ae6e72daecc8/watchdog-6.0.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:6f10cb2d5902447c7d0da897e2c6768bca89174d0c6e1e30abec5421af97a5b0", size = 89054, upload-time = "2024-11-01T14:06:41.009Z" }, + { url = "https://files.pythonhosted.org/packages/68/98/b0345cabdce2041a01293ba483333582891a3bd5769b08eceb0d406056ef/watchdog-6.0.0-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:490ab2ef84f11129844c23fb14ecf30ef3d8a6abafd3754a6f75ca1e6654136c", size = 96480, upload-time = "2024-11-01T14:06:42.952Z" }, + { url = "https://files.pythonhosted.org/packages/85/83/cdf13902c626b28eedef7ec4f10745c52aad8a8fe7eb04ed7b1f111ca20e/watchdog-6.0.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:76aae96b00ae814b181bb25b1b98076d5fc84e8a53cd8885a318b42b6d3a5134", size = 88451, upload-time = 
"2024-11-01T14:06:45.084Z" }, + { url = "https://files.pythonhosted.org/packages/fe/c4/225c87bae08c8b9ec99030cd48ae9c4eca050a59bf5c2255853e18c87b50/watchdog-6.0.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:a175f755fc2279e0b7312c0035d52e27211a5bc39719dd529625b1930917345b", size = 89057, upload-time = "2024-11-01T14:06:47.324Z" }, + { url = "https://files.pythonhosted.org/packages/a9/c7/ca4bf3e518cb57a686b2feb4f55a1892fd9a3dd13f470fca14e00f80ea36/watchdog-6.0.0-py3-none-manylinux2014_aarch64.whl", hash = "sha256:7607498efa04a3542ae3e05e64da8202e58159aa1fa4acddf7678d34a35d4f13", size = 79079, upload-time = "2024-11-01T14:06:59.472Z" }, + { url = "https://files.pythonhosted.org/packages/5c/51/d46dc9332f9a647593c947b4b88e2381c8dfc0942d15b8edc0310fa4abb1/watchdog-6.0.0-py3-none-manylinux2014_armv7l.whl", hash = "sha256:9041567ee8953024c83343288ccc458fd0a2d811d6a0fd68c4c22609e3490379", size = 79078, upload-time = "2024-11-01T14:07:01.431Z" }, + { url = "https://files.pythonhosted.org/packages/d4/57/04edbf5e169cd318d5f07b4766fee38e825d64b6913ca157ca32d1a42267/watchdog-6.0.0-py3-none-manylinux2014_i686.whl", hash = "sha256:82dc3e3143c7e38ec49d61af98d6558288c415eac98486a5c581726e0737c00e", size = 79076, upload-time = "2024-11-01T14:07:02.568Z" }, + { url = "https://files.pythonhosted.org/packages/ab/cc/da8422b300e13cb187d2203f20b9253e91058aaf7db65b74142013478e66/watchdog-6.0.0-py3-none-manylinux2014_ppc64.whl", hash = "sha256:212ac9b8bf1161dc91bd09c048048a95ca3a4c4f5e5d4a7d1b1a7d5752a7f96f", size = 79077, upload-time = "2024-11-01T14:07:03.893Z" }, + { url = "https://files.pythonhosted.org/packages/2c/3b/b8964e04ae1a025c44ba8e4291f86e97fac443bca31de8bd98d3263d2fcf/watchdog-6.0.0-py3-none-manylinux2014_ppc64le.whl", hash = "sha256:e3df4cbb9a450c6d49318f6d14f4bbc80d763fa587ba46ec86f99f9e6876bb26", size = 79078, upload-time = "2024-11-01T14:07:05.189Z" }, + { url = "https://files.pythonhosted.org/packages/62/ae/a696eb424bedff7407801c257d4b1afda455fe40821a2be430e173660e81/watchdog-6.0.0-py3-none-manylinux2014_s390x.whl", hash = "sha256:2cce7cfc2008eb51feb6aab51251fd79b85d9894e98ba847408f662b3395ca3c", size = 79077, upload-time = "2024-11-01T14:07:06.376Z" }, + { url = "https://files.pythonhosted.org/packages/b5/e8/dbf020b4d98251a9860752a094d09a65e1b436ad181faf929983f697048f/watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl", hash = "sha256:20ffe5b202af80ab4266dcd3e91aae72bf2da48c0d33bdb15c66658e685e94e2", size = 79078, upload-time = "2024-11-01T14:07:07.547Z" }, + { url = "https://files.pythonhosted.org/packages/07/f6/d0e5b343768e8bcb4cda79f0f2f55051bf26177ecd5651f84c07567461cf/watchdog-6.0.0-py3-none-win32.whl", hash = "sha256:07df1fdd701c5d4c8e55ef6cf55b8f0120fe1aef7ef39a1c6fc6bc2e606d517a", size = 79065, upload-time = "2024-11-01T14:07:09.525Z" }, + { url = "https://files.pythonhosted.org/packages/db/d9/c495884c6e548fce18a8f40568ff120bc3a4b7b99813081c8ac0c936fa64/watchdog-6.0.0-py3-none-win_amd64.whl", hash = "sha256:cbafb470cf848d93b5d013e2ecb245d4aa1c8fd0504e863ccefa32445359d680", size = 79070, upload-time = "2024-11-01T14:07:10.686Z" }, + { url = "https://files.pythonhosted.org/packages/33/e8/e40370e6d74ddba47f002a32919d91310d6074130fe4e17dabcafc15cbf1/watchdog-6.0.0-py3-none-win_ia64.whl", hash = "sha256:a1914259fa9e1454315171103c6a30961236f508b9b623eae470268bbcc6a22f", size = 79067, upload-time = "2024-11-01T14:07:11.845Z" }, +] + [[package]] name = "watchfiles" version = "1.1.1"