Joseph Pollack committed
Commit 4a653e3 · unverified · 0 parents

Initial commit - Independent repository - Breaking fork relationship

Files changed (50). This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full list.
  1. .cursorrules +240 -0
  2. .env.example +48 -0
  3. .gitattributes +35 -0
  4. .github/README.md +203 -0
  5. .github/workflows/ci.yml +67 -0
  6. .gitignore +77 -0
  7. .pre-commit-config.yaml +64 -0
  8. .pre-commit-hooks/run_pytest.ps1 +14 -0
  9. .pre-commit-hooks/run_pytest.sh +15 -0
  10. .python-version +1 -0
  11. AGENTS.txt +236 -0
  12. CONTRIBUTING.md +1 -0
  13. Dockerfile +52 -0
  14. Makefile +42 -0
  15. README.md +196 -0
  16. docs/CONFIGURATION.md +301 -0
  17. docs/architecture/design-patterns.md +1509 -0
  18. docs/architecture/graph_orchestration.md +151 -0
  19. docs/architecture/overview.md +474 -0
  20. docs/brainstorming/00_ROADMAP_SUMMARY.md +194 -0
  21. docs/brainstorming/01_PUBMED_IMPROVEMENTS.md +125 -0
  22. docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md +193 -0
  23. docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md +211 -0
  24. docs/brainstorming/04_OPENALEX_INTEGRATION.md +303 -0
  25. docs/brainstorming/implementation/15_PHASE_OPENALEX.md +603 -0
  26. docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md +586 -0
  27. docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md +540 -0
  28. docs/brainstorming/implementation/README.md +143 -0
  29. docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md +189 -0
  30. docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md +289 -0
  31. docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md +112 -0
  32. docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md +112 -0
  33. docs/brainstorming/magentic-pydantic/04_FOLLOWUP_REVIEW_REQUEST.md +158 -0
  34. docs/brainstorming/magentic-pydantic/REVIEW_PROMPT_FOR_SENIOR_AGENT.md +113 -0
  35. docs/bugs/FIX_PLAN_MAGENTIC_MODE.md +227 -0
  36. docs/bugs/P0_MAGENTIC_MODE_BROKEN.md +116 -0
  37. docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md +81 -0
  38. docs/development/testing.md +139 -0
  39. docs/examples/writer_agents_usage.md +425 -0
  40. docs/guides/deployment.md +142 -0
  41. docs/implementation/01_phase_foundation.md +587 -0
  42. docs/implementation/02_phase_search.md +822 -0
  43. docs/implementation/03_phase_judge.md +1052 -0
  44. docs/implementation/04_phase_ui.md +1104 -0
  45. docs/implementation/05_phase_magentic.md +1091 -0
  46. docs/implementation/06_phase_embeddings.md +409 -0
  47. docs/implementation/07_phase_hypothesis.md +630 -0
  48. docs/implementation/08_phase_report.md +854 -0
  49. docs/implementation/09_phase_source_cleanup.md +257 -0
  50. docs/implementation/10_phase_clinicaltrials.md +437 -0
.cursorrules ADDED
@@ -0,0 +1,240 @@
1
+ # DeepCritical Project - Cursor Rules
2
+
3
+ ## Project-Wide Rules
4
+
5
+ **Architecture**: Multi-agent research system using Pydantic AI for agent orchestration, supporting iterative and deep research patterns. Uses middleware for state management, budget tracking, and workflow coordination.
6
+
7
+ **Type Safety**: ALWAYS use complete type hints. All functions must have parameter and return type annotations. Use `mypy --strict` compliance. Use `TYPE_CHECKING` imports for circular dependencies: `from typing import TYPE_CHECKING; if TYPE_CHECKING: from src.services.embeddings import EmbeddingService`
8
+
9
+ **Async Patterns**: ALL I/O operations must be async (`async def`, `await`). Use `asyncio.gather()` for parallel operations. CPU-bound work must use `run_in_executor()`: `loop = asyncio.get_running_loop(); result = await loop.run_in_executor(None, cpu_bound_function, args)`. Never block the event loop.
10
+
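+ A minimal sketch of this async pattern (the tool objects and `rank_by_relevance` helper are illustrative, not actual module members):
+ ```python
+ import asyncio
+
+ async def gather_and_rank(query: str) -> list["Evidence"]:
+     # I/O-bound searches run concurrently instead of sequentially.
+     pubmed_hits, trial_hits = await asyncio.gather(
+         pubmed_tool.search(query, max_results=10),
+         trials_tool.search(query, max_results=10),
+     )
+     # CPU-bound ranking is pushed to a worker thread so the event loop stays free.
+     loop = asyncio.get_running_loop()
+     return await loop.run_in_executor(None, rank_by_relevance, pubmed_hits + trial_hits)
+ ```
+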
11
+ **Error Handling**: Use custom exceptions from `src/utils/exceptions.py`: `DeepCriticalError`, `SearchError`, `RateLimitError`, `JudgeError`, `ConfigurationError`. Always chain exceptions: `raise SearchError(...) from e`. Log with structlog: `logger.error("Operation failed", error=str(e), context=value)`.
12
+
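+ A sketch of the chaining-plus-structlog convention, assuming the exception module named above (the endpoint and parameters are illustrative):
+ ```python
+ import httpx
+ import structlog
+
+ from src.utils.exceptions import SearchError
+
+ logger = structlog.get_logger()
+
+ async def fetch(client: httpx.AsyncClient, url: str, query: str) -> httpx.Response:
+     try:
+         response = await client.get(url, params={"term": query})
+         response.raise_for_status()
+         return response
+     except httpx.HTTPError as e:
+         # Structured log first, then re-raise as a domain error with the cause chained.
+         logger.error("search_request_failed", error=str(e), query=query, url=url)
+         raise SearchError(f"Search request failed: {e}") from e
+ ```
+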
13
+ **Logging**: Use `structlog` for ALL logging (NOT `print` or `logging`). Import: `import structlog; logger = structlog.get_logger()`. Log with structured data: `logger.info("event", key=value)`. Use appropriate levels: DEBUG, INFO, WARNING, ERROR.
14
+
15
+ **Pydantic Models**: All data exchange uses Pydantic models from `src/utils/models.py`. Models are frozen (`model_config = {"frozen": True}`) for immutability. Use `Field()` with descriptions. Validate with `ge=`, `le=`, `min_length=`, `max_length=` constraints.
16
+
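+ For illustration, a frozen model with `Field()` constraints in this style (the field names are hypothetical, not the actual `Evidence` schema):
+ ```python
+ from pydantic import BaseModel, Field
+
+ class EvidenceItem(BaseModel):
+     model_config = {"frozen": True}  # immutable after construction
+
+     title: str = Field(..., min_length=1, description="Title of the source document")
+     url: str = Field(..., description="Canonical URL of the source")
+     relevance: float = Field(0.5, ge=0.0, le=1.0, description="Relevance score in [0, 1]")
+ ```
+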
17
+ **Code Style**: Ruff with 100-char line length. Ignore rules: `PLR0913` (too many arguments), `PLR0912` (too many branches), `PLR0911` (too many returns), `PLR2004` (magic values), `PLW0603` (global statement), `PLC0415` (lazy imports).
18
+
19
+ **Docstrings**: Google-style docstrings for all public functions. Include Args, Returns, Raises sections. Use type hints in docstrings only if needed for clarity.
20
+
21
+ **Testing**: Unit tests in `tests/unit/` (mocked, fast). Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`). Use `respx` for httpx mocking, `pytest-mock` for general mocking.
22
+
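+ A unit-test sketch using `respx` (the tool class, its empty-result behaviour, and the `pytest-asyncio` marker are assumptions for illustration):
+ ```python
+ import httpx
+ import pytest
+ import respx
+
+ @pytest.mark.asyncio
+ @respx.mock
+ async def test_pubmed_search_handles_empty_results() -> None:
+     respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi").mock(
+         return_value=httpx.Response(200, json={"esearchresult": {"idlist": []}})
+     )
+     tool = PubMedTool()  # hypothetical import from src.tools.pubmed
+     results = await tool.search("aspirin repurposing", max_results=5)
+     assert results == []
+ ```
+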
23
+ **State Management**: Use `ContextVar` in middleware for thread-safe isolation. Never use global mutable state (except singletons via `@lru_cache`). Use `WorkflowState` from `src/middleware/state_machine.py` for workflow state.
24
+
25
+ **Citation Validation**: ALWAYS validate references before returning reports. Use `validate_references()` from `src/utils/citation_validator.py`. Remove hallucinated citations. Log warnings for removed citations.
26
+
27
+ ---
28
+
29
+ ## src/agents/ - Agent Implementation Rules
30
+
31
+ **Pattern**: All agents use Pydantic AI `Agent` class. Agents have structured output types (Pydantic models) or return strings. Use factory functions in `src/agent_factory/agents.py` for creation.
32
+
33
+ **Agent Structure** (a minimal sketch follows this list):
34
+ - System prompt as module-level constant (with date injection: `datetime.now().strftime("%Y-%m-%d")`)
35
+ - Agent class with `__init__(model: Any | None = None)`
36
+ - Main method (e.g., `async def evaluate()`, `async def write_report()`)
37
+ - Factory function: `def create_agent_name(model: Any | None = None) -> AgentName`
38
+
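+ A minimal sketch of that structure (kwarg and attribute names such as `output_type`, `retries`, and `.output` differ between pydantic-ai releases; treat this as the shape, not the exact API):
+ ```python
+ from datetime import datetime
+ from typing import Any
+
+ from pydantic_ai import Agent
+
+ from src.agent_factory.judges import get_model
+ from src.utils.models import KnowledgeGapOutput
+
+ SYSTEM_PROMPT = (
+     "Evaluate whether the research so far is complete. "
+     f"Today's date is {datetime.now().strftime('%Y-%m-%d')}."
+ )
+
+ class KnowledgeGapAgent:
+     def __init__(self, model: Any | None = None) -> None:
+         self._agent = Agent(
+             model or get_model(),
+             output_type=KnowledgeGapOutput,
+             system_prompt=SYSTEM_PROMPT,
+             retries=3,
+         )
+
+     async def evaluate(self, findings: str) -> KnowledgeGapOutput:
+         result = await self._agent.run(findings)
+         return result.output  # `.data` on older pydantic-ai versions
+
+ def create_knowledge_gap_agent(model: Any | None = None) -> KnowledgeGapAgent:
+     return KnowledgeGapAgent(model)
+ ```
+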
39
+ **Model Initialization**: Use `get_model()` from `src/agent_factory/judges.py` if no model provided. Support OpenAI/Anthropic/HF Inference via settings.
40
+
41
+ **Error Handling**: Return fallback values (e.g., `KnowledgeGapOutput(research_complete=False, outstanding_gaps=[...])`) on failure. Log errors with context. Use retry logic (3 retries) in Pydantic AI Agent initialization.
42
+
43
+ **Input Validation**: Validate query/inputs are not empty. Truncate very long inputs with warnings. Handle None values gracefully.
44
+
45
+ **Output Types**: Use structured output types from `src/utils/models.py` (e.g., `KnowledgeGapOutput`, `AgentSelectionPlan`, `ReportDraft`). For text output (writer agents), return `str` directly.
46
+
47
+ **Agent-Specific Rules**:
48
+ - `knowledge_gap.py`: Outputs `KnowledgeGapOutput`. Evaluates research completeness.
49
+ - `tool_selector.py`: Outputs `AgentSelectionPlan`. Selects tools (RAG/web/database).
50
+ - `writer.py`: Returns markdown string. Includes citations in numbered format.
51
+ - `long_writer.py`: Uses `ReportDraft` input/output. Handles section-by-section writing.
52
+ - `proofreader.py`: Takes `ReportDraft`, returns polished markdown.
53
+ - `thinking.py`: Returns observation string from conversation history.
54
+ - `input_parser.py`: Outputs `ParsedQuery` with research mode detection.
55
+
56
+ ---
57
+
58
+ ## src/tools/ - Search Tool Rules
59
+
60
+ **Protocol**: All tools implement `SearchTool` protocol from `src/tools/base.py`: `name` property and `async def search(query, max_results) -> list[Evidence]`.
61
+
62
+ **Rate Limiting**: Use `@retry` decorator from tenacity: `@retry(stop=stop_after_attempt(3), wait=wait_exponential(...))`. Implement `_rate_limit()` method for APIs with limits. Use shared rate limiters from `src/tools/rate_limiter.py`.
63
+
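+ A sketch of the retry-plus-rate-limit pattern (class internals are illustrative; the 0.34 s spacing mirrors the PubMed rule below):
+ ```python
+ import asyncio
+ import time
+
+ from tenacity import retry, stop_after_attempt, wait_exponential
+
+ class PubMedTool:
+     name = "pubmed"
+     _min_interval = 0.34  # seconds between NCBI requests
+     _last_request = 0.0
+
+     async def _rate_limit(self) -> None:
+         elapsed = time.monotonic() - self._last_request
+         if elapsed < self._min_interval:
+             await asyncio.sleep(self._min_interval - elapsed)
+         self._last_request = time.monotonic()
+
+     @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
+     async def search(self, query: str, max_results: int = 10) -> list["Evidence"]:
+         await self._rate_limit()
+         ...  # ESearch -> EFetch, then convert responses to Evidence objects
+ ```
+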
64
+ **Error Handling**: Raise `SearchError` or `RateLimitError` on failures. Handle HTTP errors (429, 500, timeout). Return empty list on non-critical errors (log warning).
65
+
66
+ **Query Preprocessing**: Use `preprocess_query()` from `src/tools/query_utils.py` to remove noise and expand synonyms.
67
+
68
+ **Evidence Conversion**: Convert API responses to `Evidence` objects with `Citation`. Extract metadata (title, url, date, authors). Set relevance scores (0.0-1.0). Handle missing fields gracefully.
69
+
70
+ **Tool-Specific Rules**:
71
+ - `pubmed.py`: Use NCBI E-utilities (ESearch → EFetch). Rate limit: 0.34s between requests. Parse XML with `xmltodict`. Handle single vs. multiple articles.
72
+ - `clinicaltrials.py`: Use `requests` library (NOT httpx - WAF blocks httpx). Run in thread pool: `await asyncio.to_thread(requests.get, ...)`. Filter: Only interventional studies, active/completed.
73
+ - `europepmc.py`: Handle preprint markers: `[PREPRINT - Not peer-reviewed]`. Build URLs from DOI or PMID.
74
+ - `rag_tool.py`: Wraps `LlamaIndexRAGService`. Returns Evidence from RAG results. Handles ingestion.
75
+ - `search_handler.py`: Orchestrates parallel searches across multiple tools. Uses `asyncio.gather()` with `return_exceptions=True`. Aggregates results into `SearchResult`. A sketch of this aggregation follows the list.
76
+
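+ The parallel-search aggregation, sketched under the assumption that each tool satisfies the `SearchTool` protocol (handler internals are simplified):
+ ```python
+ import asyncio
+ import structlog
+
+ logger = structlog.get_logger()
+
+ async def run_searches(tools: list["SearchTool"], query: str, max_results: int) -> list["Evidence"]:
+     results = await asyncio.gather(
+         *(tool.search(query, max_results) for tool in tools),
+         return_exceptions=True,
+     )
+     evidence: list["Evidence"] = []
+     for tool, result in zip(tools, results):
+         if isinstance(result, Exception):
+             # One failing tool should not sink the whole search.
+             logger.warning("search_tool_failed", tool=tool.name, error=str(result))
+             continue
+         evidence.extend(result)
+     return evidence
+ ```
+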
77
+ ---
78
+
79
+ ## src/middleware/ - Middleware Rules
80
+
81
+ **State Management**: Use `ContextVar` for thread-safe isolation. `WorkflowState` uses `ContextVar[WorkflowState | None]`. Initialize with `init_workflow_state(embedding_service)`. Access with `get_workflow_state()` (auto-initializes if missing).
82
+
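+ A sketch of the `ContextVar` accessor pattern (the `WorkflowState` constructor arguments here are assumptions):
+ ```python
+ from contextvars import ContextVar
+
+ _workflow_state: ContextVar["WorkflowState | None"] = ContextVar("workflow_state", default=None)
+
+ def init_workflow_state(embedding_service: object | None = None) -> "WorkflowState":
+     state = WorkflowState(evidence=[], conversation=Conversation(), embedding_service=embedding_service)
+     _workflow_state.set(state)
+     return state
+
+ def get_workflow_state() -> "WorkflowState":
+     state = _workflow_state.get()
+     if state is None:  # auto-initialize when accessed before explicit init
+         state = init_workflow_state()
+     return state
+ ```
+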
83
+ **WorkflowState**: Tracks `evidence: list[Evidence]`, `conversation: Conversation`, `embedding_service: Any`. Methods: `add_evidence()` (deduplicates by URL), `async search_related()` (semantic search).
84
+
85
+ **WorkflowManager**: Manages parallel research loops. Methods: `add_loop()`, `run_loops_parallel()`, `update_loop_status()`, `sync_loop_evidence_to_state()`. Uses `asyncio.gather()` for parallel execution. Handles errors per loop (don't fail all if one fails).
86
+
87
+ **BudgetTracker**: Tracks tokens, time, iterations per loop and globally. Methods: `create_budget()`, `add_tokens()`, `start_timer()`, `update_timer()`, `increment_iteration()`, `check_budget()`, `can_continue()`. Token estimation: `estimate_tokens(text)` (~4 chars per token), `estimate_llm_call_tokens(prompt, response)`.
88
+
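+ The token heuristic written out as a rough sketch (the real tracker may round or cap differently):
+ ```python
+ def estimate_tokens(text: str) -> int:
+     # Heuristic from the rule above: roughly 4 characters per token.
+     return max(1, len(text) // 4)
+
+ def estimate_llm_call_tokens(prompt: str, response: str) -> int:
+     return estimate_tokens(prompt) + estimate_tokens(response)
+ ```
+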
89
+ **Models**: All middleware models in `src/utils/models.py`. `IterationData`, `Conversation`, `ResearchLoop`, `BudgetStatus` are used by middleware.
90
+
91
+ ---
92
+
93
+ ## src/orchestrator/ - Orchestration Rules
94
+
95
+ **Research Flows**: Two patterns: `IterativeResearchFlow` (single loop) and `DeepResearchFlow` (plan → parallel loops → synthesis). Both support agent chains (`use_graph=False`) and graph execution (`use_graph=True`).
96
+
97
+ **IterativeResearchFlow**: Pattern: Generate observations → Evaluate gaps → Select tools → Execute → Judge → Continue/Complete. Uses `KnowledgeGapAgent`, `ToolSelectorAgent`, `ThinkingAgent`, `WriterAgent`, `JudgeHandler`. Tracks iterations, time, budget.
98
+
99
+ **DeepResearchFlow**: Pattern: Planner → Parallel iterative loops per section → Synthesizer. Uses `PlannerAgent`, `IterativeResearchFlow` (per section), `LongWriterAgent` or `ProofreaderAgent`. Uses `WorkflowManager` for parallel execution.
100
+
101
+ **Graph Orchestrator**: Uses Pydantic AI Graphs (when available) or agent chains (fallback). Routes based on research mode (iterative/deep/auto). Streams `AgentEvent` objects for UI.
102
+
103
+ **State Initialization**: Always call `init_workflow_state()` before running flows. Initialize `BudgetTracker` per loop. Use `WorkflowManager` for parallel coordination.
104
+
105
+ **Event Streaming**: Yield `AgentEvent` objects during execution. Event types: "started", "search_complete", "judge_complete", "hypothesizing", "synthesizing", "complete", "error". Include iteration numbers and data payloads.
106
+
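+ A sketch of the event-streaming shape (the `AgentEvent` field names and helper methods are assumptions):
+ ```python
+ from collections.abc import AsyncGenerator
+
+ async def run(self, query: str) -> AsyncGenerator["AgentEvent", None]:
+     # Sketch of a method on an orchestrator/flow class.
+     yield AgentEvent(type="started", iteration=0, data={"query": query})
+     iteration = 0
+     for iteration in range(1, self.max_iterations + 1):
+         evidence = await self._search(query)
+         yield AgentEvent(type="search_complete", iteration=iteration, data={"evidence_count": len(evidence)})
+         if await self._judge_says_done(evidence):
+             break
+     yield AgentEvent(type="complete", iteration=iteration, data={})
+ ```
+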
107
+ ---
108
+
109
+ ## src/services/ - Service Rules
110
+
111
+ **EmbeddingService**: Local sentence-transformers (NO API key required). All operations async-safe via `run_in_executor()`. ChromaDB for vector storage. Deduplication threshold: 0.85 (85% similarity = duplicate).
112
+
113
+ **LlamaIndexRAGService**: Uses OpenAI embeddings (requires `OPENAI_API_KEY`). Methods: `ingest_evidence()`, `retrieve()`, `query()`. Returns documents with metadata (source, title, url, date, authors). Lazy initialization with graceful fallback.
114
+
115
+ **StatisticalAnalyzer**: Generates Python code via LLM. Executes in Modal sandbox (secure, isolated). Library versions pinned in `SANDBOX_LIBRARIES` dict. Returns `AnalysisResult` with verdict (SUPPORTED/REFUTED/INCONCLUSIVE).
116
+
117
+ **Singleton Pattern**: Use `@lru_cache(maxsize=1)` for singletons: `@lru_cache(maxsize=1); def get_service() -> Service: return Service()`. Lazy initialization to avoid requiring dependencies at import time.
118
+
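+ The singleton pattern as a sketch, with the lazy import that keeps heavy dependencies optional:
+ ```python
+ from functools import lru_cache
+
+ @lru_cache(maxsize=1)
+ def get_embedding_service() -> "EmbeddingService":
+     # Import inside the function so sentence-transformers/ChromaDB are only
+     # required when the service is actually used.
+     from src.services.embeddings import EmbeddingService
+     return EmbeddingService()
+ ```
+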
119
+ ---
120
+
121
+ ## src/utils/ - Utility Rules
122
+
123
+ **Models**: All Pydantic models in `src/utils/models.py`. Use frozen models (`model_config = {"frozen": True}`) except where mutation needed. Use `Field()` with descriptions. Validate with constraints.
124
+
125
+ **Config**: Settings via Pydantic Settings (`src/utils/config.py`). Load from `.env` automatically. Use `settings` singleton: `from src.utils.config import settings`. Validate API keys with properties: `has_openai_key`, `has_anthropic_key`.
126
+
127
+ **Exceptions**: Custom exception hierarchy in `src/utils/exceptions.py`. Base: `DeepCriticalError`. Specific: `SearchError`, `RateLimitError`, `JudgeError`, `ConfigurationError`. Always chain exceptions.
128
+
129
+ **LLM Factory**: Centralized LLM model creation in `src/utils/llm_factory.py`. Supports OpenAI, Anthropic, HF Inference. Use `get_model()` or factory functions. Check requirements before initialization.
130
+
131
+ **Citation Validator**: Use `validate_references()` from `src/utils/citation_validator.py`. Removes hallucinated citations (URLs not in evidence). Logs warnings. Returns validated report string.
132
+
133
+ ---
134
+
135
+ ## src/orchestrator_factory.py Rules
136
+
137
+ **Purpose**: Factory for creating orchestrators. Supports "simple" (legacy) and "advanced" (magentic) modes. Auto-detects mode based on API key availability.
138
+
139
+ **Pattern**: Lazy import for optional dependencies (`_get_magentic_orchestrator_class()`). Handles `ImportError` gracefully with clear error messages.
140
+
141
+ **Mode Detection**: `_determine_mode()` checks explicit mode or auto-detects: "advanced" if `settings.has_openai_key`, else "simple". Maps "magentic" → "advanced".
142
+
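+ A sketch of the detection logic described above (exact argument handling may differ):
+ ```python
+ from src.utils.config import settings
+
+ def _determine_mode(mode: str | None) -> str:
+     if mode == "magentic":
+         mode = "advanced"
+     if mode in ("simple", "advanced"):
+         return mode
+     # Auto-detect: advanced (magentic) needs an OpenAI key, otherwise fall back to simple.
+     return "advanced" if settings.has_openai_key else "simple"
+ ```
+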
143
+ **Function Signature**: `create_orchestrator(search_handler, judge_handler, config, mode) -> Any`. Simple mode requires handlers. Advanced mode uses MagenticOrchestrator.
144
+
145
+ **Error Handling**: Raise `ValueError` with clear messages if requirements not met. Log mode selection with structlog.
146
+
147
+ ---
148
+
149
+ ## src/orchestrator_hierarchical.py Rules
150
+
151
+ **Purpose**: Hierarchical orchestrator using middleware and sub-teams. Adapts Magentic ChatAgent to SubIterationTeam protocol.
152
+
153
+ **Pattern**: Uses `SubIterationMiddleware` with `ResearchTeam` and `LLMSubIterationJudge`. Event-driven via callback queue.
154
+
155
+ **State Initialization**: Initialize embedding service with graceful fallback. Use `init_magentic_state()` (deprecated, but kept for compatibility).
156
+
157
+ **Event Streaming**: Uses `asyncio.Queue` for event coordination. Yields `AgentEvent` objects. Handles event callback pattern with `asyncio.wait()`.
158
+
159
+ **Error Handling**: Log errors with context. Yield error events. Process remaining events after task completion.
160
+
161
+ ---
162
+
163
+ ## src/orchestrator_magentic.py Rules
164
+
165
+ **Purpose**: Magentic-based orchestrator using ChatAgent pattern. Each agent has internal LLM. Manager orchestrates agents.
166
+
167
+ **Pattern**: Uses `MagenticBuilder` with participants (searcher, hypothesizer, judge, reporter). Manager uses `OpenAIChatClient`. Workflow built in `_build_workflow()`.
168
+
169
+ **Event Processing**: `_process_event()` converts Magentic events to `AgentEvent`. Handles: `MagenticOrchestratorMessageEvent`, `MagenticAgentMessageEvent`, `MagenticFinalResultEvent`, `MagenticAgentDeltaEvent`, `WorkflowOutputEvent`.
170
+
171
+ **Text Extraction**: `_extract_text()` defensively extracts text from messages. Priority: `.content` → `.text` → `str(message)`. Handles buggy message objects.
172
+
173
+ **State Initialization**: Initialize embedding service with graceful fallback. Use `init_magentic_state()` (deprecated).
174
+
175
+ **Requirements**: Must call `check_magentic_requirements()` in `__init__`. Requires `agent-framework-core` and OpenAI API key.
176
+
177
+ **Event Types**: Maps agent names to event types: "search" → "search_complete", "judge" → "judge_complete", "hypothes" → "hypothesizing", "report" → "synthesizing".
178
+
179
+ ---
180
+
181
+ ## src/agent_factory/ - Factory Rules
182
+
183
+ **Pattern**: Factory functions for creating agents and handlers. Lazy initialization for optional dependencies. Support OpenAI/Anthropic/HF Inference.
184
+
185
+ **Judges**: `create_judge_handler()` creates `JudgeHandler` with structured output (`JudgeAssessment`). Supports `MockJudgeHandler`, `HFInferenceJudgeHandler` as fallbacks.
186
+
187
+ **Agents**: Factory functions in `agents.py` for all Pydantic AI agents. Pattern: `create_agent_name(model: Any | None = None) -> AgentName`. Use `get_model()` if model not provided.
188
+
189
+ **Graph Builder**: `graph_builder.py` contains utilities for building research graphs. Supports iterative and deep research graph construction.
190
+
191
+ **Error Handling**: Raise `ConfigurationError` if required API keys missing. Log agent creation. Handle import errors gracefully.
192
+
193
+ ---
194
+
195
+ ## src/prompts/ - Prompt Rules
196
+
197
+ **Pattern**: System prompts stored as module-level constants. Include date injection: `datetime.now().strftime("%Y-%m-%d")`. Format evidence with truncation (1500 chars per item).
198
+
199
+ **Judge Prompts**: In `judge.py`. Handle empty evidence case separately. Always request structured JSON output.
200
+
201
+ **Hypothesis Prompts**: In `hypothesis.py`. Use diverse evidence selection (MMR algorithm). Sentence-aware truncation.
202
+
203
+ **Report Prompts**: In `report.py`. Include full citation details. Use diverse evidence selection (n=20). Emphasize citation validation rules.
204
+
205
+ ---
206
+
207
+ ## Testing Rules
208
+
209
+ **Structure**: Unit tests in `tests/unit/` (mocked, fast). Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`).
210
+
211
+ **Mocking**: Use `respx` for httpx mocking. Use `pytest-mock` for general mocking. Mock LLM calls in unit tests (use `MockJudgeHandler`).
212
+
213
+ **Fixtures**: Common fixtures in `tests/conftest.py`: `mock_httpx_client`, `mock_llm_response`.
214
+
215
+ **Coverage**: Aim for >80% coverage. Test error handling, edge cases, and integration paths.
216
+
217
+ ---
218
+
219
+ ## File-Specific Agent Rules
220
+
221
+ **knowledge_gap.py**: Outputs `KnowledgeGapOutput`. System prompt evaluates research completeness. Handles conversation history. Returns fallback on error.
222
+
223
+ **writer.py**: Returns markdown string. System prompt includes citation format examples. Validates inputs. Truncates long findings. Retry logic for transient failures.
224
+
225
+ **long_writer.py**: Uses `ReportDraft` input/output. Writes sections iteratively. Reformats references (deduplicates, renumbers). Reformats section headings.
226
+
227
+ **proofreader.py**: Takes `ReportDraft`, returns polished markdown. Removes duplicates. Adds summary. Preserves references.
228
+
229
+ **tool_selector.py**: Outputs `AgentSelectionPlan`. System prompt lists available agents (WebSearchAgent, SiteCrawlerAgent, RAGAgent). Guidelines for when to use each.
230
+
231
+ **thinking.py**: Returns observation string. Generates observations from conversation history. Uses query and background context.
232
+
233
+ **input_parser.py**: Outputs `ParsedQuery`. Detects research mode (iterative/deep). Extracts entities and research questions. Improves/refines query.
234
+
235
+
236
+
237
+
238
+
239
+
240
+
.env.example ADDED
@@ -0,0 +1,48 @@
1
+ # ============== LLM CONFIGURATION ==============
2
+
3
+ # Provider: "openai" or "anthropic"
4
+ LLM_PROVIDER=openai
5
+
6
+ # API Keys (at least one required for full LLM analysis)
7
+ OPENAI_API_KEY=sk-your-key-here
8
+ ANTHROPIC_API_KEY=sk-ant-your-key-here
9
+
10
+ # Model names (optional - sensible defaults set in config.py)
11
+ # ANTHROPIC_MODEL=claude-sonnet-4-5-20250929
12
+ # OPENAI_MODEL=gpt-5.1
13
+
14
+ # ============== EMBEDDINGS ==============
15
+
16
+ # OpenAI Embedding Model (used if LLM_PROVIDER is openai and performing RAG/Embeddings)
17
+ OPENAI_EMBEDDING_MODEL=text-embedding-3-small
18
+
19
+ # Local Embedding Model (used for local/offline embeddings)
20
+ LOCAL_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
21
+
22
+ # ============== HUGGINGFACE (FREE TIER) ==============
23
+
24
+ # HuggingFace Token - enables Llama 3.1 (best quality free model)
25
+ # Get yours at: https://huggingface.co/settings/tokens
26
+ #
27
+ # WITHOUT HF_TOKEN: Falls back to ungated models (zephyr-7b-beta)
28
+ # WITH HF_TOKEN: Uses Llama 3.1 8B Instruct (requires accepting license)
29
+ #
30
+ # For HuggingFace Spaces deployment:
31
+ # Set this as a "Secret" in Space Settings -> Variables and secrets
32
+ # Users/judges don't need their own token - the Space secret is used
33
+ #
34
+ HF_TOKEN=hf_your-token-here
35
+
36
+ # ============== AGENT CONFIGURATION ==============
37
+
38
+ MAX_ITERATIONS=10
39
+ SEARCH_TIMEOUT=30
40
+ LOG_LEVEL=INFO
41
+
42
+ # ============== EXTERNAL SERVICES ==============
43
+
44
+ # PubMed (optional - higher rate limits)
45
+ NCBI_API_KEY=your-ncbi-key-here
46
+
47
+ # Vector Database (optional - for LlamaIndex RAG)
48
+ CHROMA_DB_PATH=./chroma_db
.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.github/README.md ADDED
@@ -0,0 +1,203 @@
1
+ ---
2
+ title: DeepCritical
3
+ emoji: 🧬
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: gradio
7
+ sdk_version: "6.0.1"
8
+ python_version: "3.11"
9
+ app_file: src/app.py
10
+ pinned: false
11
+ license: mit
12
+ tags:
13
+ - mcp-in-action-track-enterprise
14
+ - mcp-hackathon
15
+ - drug-repurposing
16
+ - biomedical-ai
17
+ - pydantic-ai
18
+ - llamaindex
19
+ - modal
20
+ ---
21
+
22
+ # DeepCritical
23
+
24
+ ## Intro
25
+
+ DeepCritical is a multi-agent deep-research system for biomedical questions such as drug repurposing. It searches PubMed, ClinicalTrials.gov, and bioRxiv/medRxiv, judges the retrieved evidence with LLMs, and synthesizes cited research reports, available through a Gradio UI and as MCP tools.
+
26
+ ## Features
27
+
28
+ - **Multi-Source Search**: PubMed, ClinicalTrials.gov, bioRxiv/medRxiv
29
+ - **MCP Integration**: Use our tools from Claude Desktop or any MCP client
30
+ - **Modal Sandbox**: Secure execution of AI-generated statistical code
31
+ - **LlamaIndex RAG**: Semantic search and evidence synthesis
32
+ - **Hugging Face Inference**: Free-tier LLM support (Llama 3.1 8B Instruct with `HF_TOKEN`, ungated fallback models without)
33
+ - **Hugging Face MCP Custom Config**: Use community tools through a custom Hugging Face MCP configuration
34
+ - **Strongly Typed Composable Graphs**: Research workflows composed as strongly typed graphs (Pydantic AI Graphs, with agent-chain fallback)
35
+ - **Specialized Research Teams of Agents**: Hierarchical orchestration with dedicated searcher, hypothesizer, judge, and reporter agents
36
+
37
+ ## Quick Start
38
+
39
+ ### 1. Environment Setup
40
+
41
+ ```bash
42
+ # Install uv if you haven't already
43
+ pip install uv
44
+
45
+ # Sync dependencies
46
+ uv sync
47
+ ```
48
+
49
+ ### 2. Run the UI
50
+
51
+ ```bash
52
+ # Start the Gradio app
53
+ uv run python -m src.app
54
+ ```
55
+
56
+ Open your browser to `http://localhost:7860`.
57
+
58
+ ### 3. Connect via MCP
59
+
60
+ This application exposes a Model Context Protocol (MCP) server, allowing you to use its search tools directly from Claude Desktop or other MCP clients.
61
+
62
+ **MCP Server URL**: `http://localhost:7860/gradio_api/mcp/`
63
+
64
+ **Claude Desktop Configuration**:
65
+ Add this to your `claude_desktop_config.json`:
66
+ ```json
67
+ {
68
+ "mcpServers": {
69
+ "deepcritical": {
70
+ "url": "http://localhost:7860/gradio_api/mcp/"
71
+ }
72
+ }
73
+ }
74
+ ```
75
+
76
+ **Available Tools**:
77
+ - `search_pubmed`: Search peer-reviewed biomedical literature.
78
+ - `search_clinical_trials`: Search ClinicalTrials.gov.
79
+ - `search_biorxiv`: Search bioRxiv/medRxiv preprints.
80
+ - `search_all`: Search all sources simultaneously.
81
+ - `analyze_hypothesis`: Secure statistical analysis using Modal sandboxes.
82
+
83
+
84
+ ## Deep Research Flows
85
+
86
+ - iterativeResearch
87
+ - deepResearch
88
+ - researchTeam
89
+
90
+ ### Iterative Research
91
+
92
+ ```mermaid
+ sequenceDiagram
93
+ participant IterativeFlow
94
+ participant ThinkingAgent
95
+ participant KnowledgeGapAgent
96
+ participant ToolSelector
97
+ participant ToolExecutor
98
+ participant JudgeHandler
99
+ participant WriterAgent
100
+
101
+ IterativeFlow->>IterativeFlow: run(query)
102
+
103
+ loop Until complete or max_iterations
104
+ IterativeFlow->>ThinkingAgent: generate_observations()
105
+ ThinkingAgent-->>IterativeFlow: observations
106
+
107
+ IterativeFlow->>KnowledgeGapAgent: evaluate_gaps()
108
+ KnowledgeGapAgent-->>IterativeFlow: KnowledgeGapOutput
109
+
110
+ alt Research complete
111
+ IterativeFlow->>WriterAgent: create_final_report()
112
+ WriterAgent-->>IterativeFlow: final_report
113
+ else Gaps remain
114
+ IterativeFlow->>ToolSelector: select_agents(gap)
115
+ ToolSelector-->>IterativeFlow: AgentSelectionPlan
116
+
117
+ IterativeFlow->>ToolExecutor: execute_tool_tasks()
118
+ ToolExecutor-->>IterativeFlow: ToolAgentOutput[]
119
+
120
+ IterativeFlow->>JudgeHandler: assess_evidence()
121
+ JudgeHandler-->>IterativeFlow: should_continue
122
+ end
123
+ end
+ ```
124
+
125
+
126
+ ### Deep Research
127
+
128
+ ```mermaid
+ sequenceDiagram
129
+ actor User
130
+ participant GraphOrchestrator
131
+ participant InputParser
132
+ participant GraphBuilder
133
+ participant GraphExecutor
134
+ participant Agent
135
+ participant BudgetTracker
136
+ participant WorkflowState
137
+
138
+ User->>GraphOrchestrator: run(query)
139
+ GraphOrchestrator->>InputParser: detect_research_mode(query)
140
+ InputParser-->>GraphOrchestrator: mode (iterative/deep)
141
+ GraphOrchestrator->>GraphBuilder: build_graph(mode)
142
+ GraphBuilder-->>GraphOrchestrator: ResearchGraph
143
+ GraphOrchestrator->>WorkflowState: init_workflow_state()
144
+ GraphOrchestrator->>BudgetTracker: create_budget()
145
+ GraphOrchestrator->>GraphExecutor: _execute_graph(graph)
146
+
147
+ loop For each node in graph
148
+ GraphExecutor->>Agent: execute_node(agent_node)
149
+ Agent->>Agent: process_input
150
+ Agent-->>GraphExecutor: result
151
+ GraphExecutor->>WorkflowState: update_state(result)
152
+ GraphExecutor->>BudgetTracker: add_tokens(used)
153
+ GraphExecutor->>BudgetTracker: check_budget()
154
+ alt Budget exceeded
155
+ GraphExecutor->>GraphOrchestrator: emit(error_event)
156
+ else Continue
157
+ GraphExecutor->>GraphOrchestrator: emit(progress_event)
158
+ end
159
+ end
160
+
161
+ GraphOrchestrator->>User: AsyncGenerator[AgentEvent]
+ ```
162
+
163
+ ### Research Team
164
+ Critical Deep Research Agent
165
+
166
+ ## Development
167
+
168
+ ### Run Tests
169
+
170
+ ```bash
171
+ uv run pytest
172
+ ```
173
+
174
+ ### Run Checks
175
+
176
+ ```bash
177
+ make check
178
+ ```
179
+
180
+ ## Architecture
181
+
182
+ DeepCritical uses a Vertical Slice Architecture:
183
+
184
+ 1. **Search Slice**: Retrieving evidence from PubMed, ClinicalTrials.gov, and bioRxiv.
185
+ 2. **Judge Slice**: Evaluating evidence quality using LLMs.
186
+ 3. **Orchestrator Slice**: Managing the research loop and UI.
187
+
188
+ Built with:
189
+ - **PydanticAI**: For robust agent interactions.
190
+ - **Gradio**: For the streaming user interface.
191
+ - **PubMed, ClinicalTrials.gov, bioRxiv**: For biomedical data.
192
+ - **MCP**: For universal tool access.
193
+ - **Modal**: For secure code execution.
194
+
195
+ ## Team
196
+
197
+ - The-Obstacle-Is-The-Way
198
+ - MarioAderman
199
+ - Josephrp
200
+
201
+ ## Links
202
+
203
+ - [GitHub Repository](https://github.com/The-Obstacle-Is-The-Way/DeepCritical-1)
.github/workflows/ci.yml ADDED
@@ -0,0 +1,67 @@
1
+ name: CI
2
+
3
+ on:
4
+ push:
5
+ branches: [main, develop]
6
+ pull_request:
7
+ branches: [main, develop]
8
+
9
+ jobs:
10
+ test:
11
+ runs-on: ubuntu-latest
12
+ strategy:
13
+ matrix:
14
+ python-version: ["3.11"]
15
+
16
+ steps:
17
+ - uses: actions/checkout@v4
18
+
19
+ - name: Set up Python ${{ matrix.python-version }}
20
+ uses: actions/setup-python@v5
21
+ with:
22
+ python-version: ${{ matrix.python-version }}
23
+
24
+ - name: Install dependencies
25
+ run: |
26
+ python -m pip install --upgrade pip
27
+ pip install -e ".[dev]"
28
+
29
+ - name: Lint with ruff
30
+ run: |
31
+ ruff check . --exclude tests
32
+ ruff format --check . --exclude tests
33
+
34
+ - name: Type check with mypy
35
+ run: |
36
+ mypy src
37
+
38
+ - name: Install embedding dependencies
39
+ run: |
40
+ pip install -e ".[embeddings]"
41
+
42
+ - name: Run unit tests (excluding OpenAI and embedding providers)
43
+ env:
44
+ HF_TOKEN: ${{ secrets.HF_TOKEN }}
45
+ run: |
46
+ pytest tests/unit/ -v -m "not openai and not embedding_provider" --tb=short -p no:logfire
47
+
48
+ - name: Run local embeddings tests
49
+ env:
50
+ HF_TOKEN: ${{ secrets.HF_TOKEN }}
51
+ run: |
52
+ pytest tests/ -v -m "local_embeddings" --tb=short -p no:logfire || true
53
+ continue-on-error: true # Allow failures if dependencies not available
54
+
55
+ - name: Run HuggingFace integration tests
56
+ env:
57
+ HF_TOKEN: ${{ secrets.HF_TOKEN }}
58
+ run: |
59
+ pytest tests/integration/ -v -m "huggingface and not embedding_provider" --tb=short -p no:logfire || true
60
+ continue-on-error: true # Allow failures if HF_TOKEN not set
61
+
62
+ - name: Run non-OpenAI integration tests (excluding embedding providers)
63
+ env:
64
+ HF_TOKEN: ${{ secrets.HF_TOKEN }}
65
+ run: |
66
+ pytest tests/integration/ -v -m "integration and not openai and not embedding_provider" --tb=short -p no:logfire || true
67
+ continue-on-error: true # Allow failures if dependencies not available
.gitignore ADDED
@@ -0,0 +1,77 @@
1
+ folder/
2
+ .cursor/
3
+ .ruff_cache/
4
+ # Python
5
+ __pycache__/
6
+ *.py[cod]
7
+ *$py.class
8
+ *.so
9
+ .Python
10
+ build/
11
+ develop-eggs/
12
+ dist/
13
+ downloads/
14
+ eggs/
15
+ .eggs/
16
+ lib/
17
+ lib64/
18
+ parts/
19
+ sdist/
20
+ var/
21
+ wheels/
22
+ *.egg-info/
23
+ .installed.cfg
24
+ *.egg
25
+
26
+ # Virtual environments
27
+ .venv/
28
+ venv/
29
+ ENV/
30
+ env/
31
+
32
+ # IDE
33
+ .vscode/
34
+ .idea/
35
+ *.swp
36
+ *.swo
37
+
38
+ # Environment
39
+ .env
40
+ .env.local
41
+ *.local
42
+
43
+ # Claude
44
+ .claude/
45
+
46
+ # Burner docs (working drafts, not for commit)
47
+ burner_docs/
48
+
49
+ # Reference repos (clone locally, don't commit)
50
+ reference_repos/autogen-microsoft/
51
+ reference_repos/claude-agent-sdk/
52
+ reference_repos/pydanticai-research-agent/
53
+ reference_repos/pubmed-mcp-server/
54
+ reference_repos/DeepCritical/
55
+
56
+ # Keep the README in reference_repos
57
+ !reference_repos/README.md
58
+
59
+ # OS
60
+ .DS_Store
61
+ Thumbs.db
62
+
63
+ # Logs
64
+ *.log
65
+ logs/
66
+
67
+ # Testing
68
+ .pytest_cache/
69
+ .mypy_cache/
70
+ .coverage
71
+ htmlcov/
72
+
73
+ # Database files
74
+ chroma_db/
75
+ *.sqlite3
76
+
77
+ # Trigger rebuild Wed Nov 26 17:51:41 EST 2025
.pre-commit-config.yaml ADDED
@@ -0,0 +1,64 @@
1
+ repos:
2
+ - repo: https://github.com/astral-sh/ruff-pre-commit
3
+ rev: v0.4.4
4
+ hooks:
5
+ - id: ruff
6
+ args: [--fix, --exclude, tests]
7
+ exclude: ^reference_repos/
8
+ - id: ruff-format
9
+ args: [--exclude, tests]
10
+ exclude: ^reference_repos/
11
+
12
+ - repo: https://github.com/pre-commit/mirrors-mypy
13
+ rev: v1.10.0
14
+ hooks:
15
+ - id: mypy
16
+ files: ^src/
17
+ exclude: ^folder
18
+ additional_dependencies:
19
+ - pydantic>=2.7
20
+ - pydantic-settings>=2.2
21
+ - tenacity>=8.2
22
+ - pydantic-ai>=0.0.16
23
+ args: [--ignore-missing-imports]
24
+
25
+ - repo: local
26
+ hooks:
27
+ - id: pytest-unit
28
+ name: pytest unit tests (no OpenAI)
29
+ entry: uv
30
+ language: system
31
+ types: [python]
32
+ args: [
33
+ "run",
34
+ "pytest",
35
+ "tests/unit/",
36
+ "-v",
37
+ "-m",
38
+ "not openai and not embedding_provider",
39
+ "--tb=short",
40
+ "-p",
41
+ "no:logfire",
42
+ ]
43
+ pass_filenames: false
44
+ always_run: true
45
+ require_serial: false
46
+ - id: pytest-local-embeddings
47
+ name: pytest local embeddings tests
48
+ entry: uv
49
+ language: system
50
+ types: [python]
51
+ args: [
52
+ "run",
53
+ "pytest",
54
+ "tests/",
55
+ "-v",
56
+ "-m",
57
+ "local_embeddings",
58
+ "--tb=short",
59
+ "-p",
60
+ "no:logfire",
61
+ ]
62
+ pass_filenames: false
63
+ always_run: true
64
+ require_serial: false
.pre-commit-hooks/run_pytest.ps1 ADDED
@@ -0,0 +1,14 @@
1
+ # PowerShell pytest runner for pre-commit (Windows)
2
+ # Uses uv if available, otherwise falls back to pytest
3
+
4
+ if (Get-Command uv -ErrorAction SilentlyContinue) {
5
+ uv run pytest $args
6
+ } else {
7
+ Write-Warning "uv not found, using system pytest (may have missing dependencies)"
8
+ pytest $args
9
+ }
10
+
11
+
12
+
13
+
14
+
.pre-commit-hooks/run_pytest.sh ADDED
@@ -0,0 +1,15 @@
1
+ #!/bin/bash
2
+ # Cross-platform pytest runner for pre-commit
3
+ # Uses uv if available, otherwise falls back to pytest
4
+
5
+ if command -v uv >/dev/null 2>&1; then
6
+ uv run pytest "$@"
7
+ else
8
+ echo "Warning: uv not found, using system pytest (may have missing dependencies)"
9
+ pytest "$@"
10
+ fi
11
+
12
+
13
+
14
+
15
+
.python-version ADDED
@@ -0,0 +1 @@
1
+ 3.11
AGENTS.txt ADDED
@@ -0,0 +1,236 @@
1
+ # DeepCritical Project - Rules
2
+
3
+ ## Project-Wide Rules
4
+
5
+ **Architecture**: Multi-agent research system using Pydantic AI for agent orchestration, supporting iterative and deep research patterns. Uses middleware for state management, budget tracking, and workflow coordination.
6
+
7
+ **Type Safety**: ALWAYS use complete type hints. All functions must have parameter and return type annotations. Use `mypy --strict` compliance. Use `TYPE_CHECKING` imports for circular dependencies: `from typing import TYPE_CHECKING; if TYPE_CHECKING: from src.services.embeddings import EmbeddingService`
8
+
9
+ **Async Patterns**: ALL I/O operations must be async (`async def`, `await`). Use `asyncio.gather()` for parallel operations. CPU-bound work must use `run_in_executor()`: `loop = asyncio.get_running_loop(); result = await loop.run_in_executor(None, cpu_bound_function, args)`. Never block the event loop.
10
+
11
+ **Error Handling**: Use custom exceptions from `src/utils/exceptions.py`: `DeepCriticalError`, `SearchError`, `RateLimitError`, `JudgeError`, `ConfigurationError`. Always chain exceptions: `raise SearchError(...) from e`. Log with structlog: `logger.error("Operation failed", error=str(e), context=value)`.
12
+
13
+ **Logging**: Use `structlog` for ALL logging (NOT `print` or `logging`). Import: `import structlog; logger = structlog.get_logger()`. Log with structured data: `logger.info("event", key=value)`. Use appropriate levels: DEBUG, INFO, WARNING, ERROR.
14
+
15
+ **Pydantic Models**: All data exchange uses Pydantic models from `src/utils/models.py`. Models are frozen (`model_config = {"frozen": True}`) for immutability. Use `Field()` with descriptions. Validate with `ge=`, `le=`, `min_length=`, `max_length=` constraints.
16
+
17
+ **Code Style**: Ruff with 100-char line length. Ignore rules: `PLR0913` (too many arguments), `PLR0912` (too many branches), `PLR0911` (too many returns), `PLR2004` (magic values), `PLW0603` (global statement), `PLC0415` (lazy imports).
18
+
19
+ **Docstrings**: Google-style docstrings for all public functions. Include Args, Returns, Raises sections. Use type hints in docstrings only if needed for clarity.
20
+
21
+ **Testing**: Unit tests in `tests/unit/` (mocked, fast). Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`). Use `respx` for httpx mocking, `pytest-mock` for general mocking.
22
+
23
+ **State Management**: Use `ContextVar` in middleware for thread-safe isolation. Never use global mutable state (except singletons via `@lru_cache`). Use `WorkflowState` from `src/middleware/state_machine.py` for workflow state.
24
+
25
+ **Citation Validation**: ALWAYS validate references before returning reports. Use `validate_references()` from `src/utils/citation_validator.py`. Remove hallucinated citations. Log warnings for removed citations.
26
+
27
+ ---
28
+
29
+ ## src/agents/ - Agent Implementation Rules
30
+
31
+ **Pattern**: All agents use Pydantic AI `Agent` class. Agents have structured output types (Pydantic models) or return strings. Use factory functions in `src/agent_factory/agents.py` for creation.
32
+
33
+ **Agent Structure**:
34
+ - System prompt as module-level constant (with date injection: `datetime.now().strftime("%Y-%m-%d")`)
35
+ - Agent class with `__init__(model: Any | None = None)`
36
+ - Main method (e.g., `async def evaluate()`, `async def write_report()`)
37
+ - Factory function: `def create_agent_name(model: Any | None = None) -> AgentName`
38
+
39
+ **Model Initialization**: Use `get_model()` from `src/agent_factory/judges.py` if no model provided. Support OpenAI/Anthropic/HF Inference via settings.
40
+
41
+ **Error Handling**: Return fallback values (e.g., `KnowledgeGapOutput(research_complete=False, outstanding_gaps=[...])`) on failure. Log errors with context. Use retry logic (3 retries) in Pydantic AI Agent initialization.
42
+
43
+ **Input Validation**: Validate query/inputs are not empty. Truncate very long inputs with warnings. Handle None values gracefully.
44
+
45
+ **Output Types**: Use structured output types from `src/utils/models.py` (e.g., `KnowledgeGapOutput`, `AgentSelectionPlan`, `ReportDraft`). For text output (writer agents), return `str` directly.
46
+
47
+ **Agent-Specific Rules**:
48
+ - `knowledge_gap.py`: Outputs `KnowledgeGapOutput`. Evaluates research completeness.
49
+ - `tool_selector.py`: Outputs `AgentSelectionPlan`. Selects tools (RAG/web/database).
50
+ - `writer.py`: Returns markdown string. Includes citations in numbered format.
51
+ - `long_writer.py`: Uses `ReportDraft` input/output. Handles section-by-section writing.
52
+ - `proofreader.py`: Takes `ReportDraft`, returns polished markdown.
53
+ - `thinking.py`: Returns observation string from conversation history.
54
+ - `input_parser.py`: Outputs `ParsedQuery` with research mode detection.
55
+
56
+ ---
57
+
58
+ ## src/tools/ - Search Tool Rules
59
+
60
+ **Protocol**: All tools implement `SearchTool` protocol from `src/tools/base.py`: `name` property and `async def search(query, max_results) -> list[Evidence]`.
61
+
62
+ **Rate Limiting**: Use `@retry` decorator from tenacity: `@retry(stop=stop_after_attempt(3), wait=wait_exponential(...))`. Implement `_rate_limit()` method for APIs with limits. Use shared rate limiters from `src/tools/rate_limiter.py`.
63
+
64
+ **Error Handling**: Raise `SearchError` or `RateLimitError` on failures. Handle HTTP errors (429, 500, timeout). Return empty list on non-critical errors (log warning).
65
+
66
+ **Query Preprocessing**: Use `preprocess_query()` from `src/tools/query_utils.py` to remove noise and expand synonyms.
67
+
68
+ **Evidence Conversion**: Convert API responses to `Evidence` objects with `Citation`. Extract metadata (title, url, date, authors). Set relevance scores (0.0-1.0). Handle missing fields gracefully.
69
+
70
+ **Tool-Specific Rules**:
71
+ - `pubmed.py`: Use NCBI E-utilities (ESearch → EFetch). Rate limit: 0.34s between requests. Parse XML with `xmltodict`. Handle single vs. multiple articles.
72
+ - `clinicaltrials.py`: Use `requests` library (NOT httpx - WAF blocks httpx). Run in thread pool: `await asyncio.to_thread(requests.get, ...)`. Filter: Only interventional studies, active/completed.
73
+ - `europepmc.py`: Handle preprint markers: `[PREPRINT - Not peer-reviewed]`. Build URLs from DOI or PMID.
74
+ - `rag_tool.py`: Wraps `LlamaIndexRAGService`. Returns Evidence from RAG results. Handles ingestion.
75
+ - `search_handler.py`: Orchestrates parallel searches across multiple tools. Uses `asyncio.gather()` with `return_exceptions=True`. Aggregates results into `SearchResult`.
76
+
77
+ ---
78
+
79
+ ## src/middleware/ - Middleware Rules
80
+
81
+ **State Management**: Use `ContextVar` for thread-safe isolation. `WorkflowState` uses `ContextVar[WorkflowState | None]`. Initialize with `init_workflow_state(embedding_service)`. Access with `get_workflow_state()` (auto-initializes if missing).
82
+
83
+ **WorkflowState**: Tracks `evidence: list[Evidence]`, `conversation: Conversation`, `embedding_service: Any`. Methods: `add_evidence()` (deduplicates by URL), `async search_related()` (semantic search).
84
+
85
+ **WorkflowManager**: Manages parallel research loops. Methods: `add_loop()`, `run_loops_parallel()`, `update_loop_status()`, `sync_loop_evidence_to_state()`. Uses `asyncio.gather()` for parallel execution. Handles errors per loop (don't fail all if one fails).
86
+
87
+ **BudgetTracker**: Tracks tokens, time, iterations per loop and globally. Methods: `create_budget()`, `add_tokens()`, `start_timer()`, `update_timer()`, `increment_iteration()`, `check_budget()`, `can_continue()`. Token estimation: `estimate_tokens(text)` (~4 chars per token), `estimate_llm_call_tokens(prompt, response)`.
88
+
89
+ **Models**: All middleware models in `src/utils/models.py`. `IterationData`, `Conversation`, `ResearchLoop`, `BudgetStatus` are used by middleware.
90
+
91
+ ---
92
+
93
+ ## src/orchestrator/ - Orchestration Rules
94
+
95
+ **Research Flows**: Two patterns: `IterativeResearchFlow` (single loop) and `DeepResearchFlow` (plan → parallel loops → synthesis). Both support agent chains (`use_graph=False`) and graph execution (`use_graph=True`).
96
+
97
+ **IterativeResearchFlow**: Pattern: Generate observations → Evaluate gaps → Select tools → Execute → Judge → Continue/Complete. Uses `KnowledgeGapAgent`, `ToolSelectorAgent`, `ThinkingAgent`, `WriterAgent`, `JudgeHandler`. Tracks iterations, time, budget.
98
+
99
+ **DeepResearchFlow**: Pattern: Planner → Parallel iterative loops per section → Synthesizer. Uses `PlannerAgent`, `IterativeResearchFlow` (per section), `LongWriterAgent` or `ProofreaderAgent`. Uses `WorkflowManager` for parallel execution.
100
+
101
+ **Graph Orchestrator**: Uses Pydantic AI Graphs (when available) or agent chains (fallback). Routes based on research mode (iterative/deep/auto). Streams `AgentEvent` objects for UI.
102
+
103
+ **State Initialization**: Always call `init_workflow_state()` before running flows. Initialize `BudgetTracker` per loop. Use `WorkflowManager` for parallel coordination.
104
+
105
+ **Event Streaming**: Yield `AgentEvent` objects during execution. Event types: "started", "search_complete", "judge_complete", "hypothesizing", "synthesizing", "complete", "error". Include iteration numbers and data payloads.
106
+
107
+ ---
108
+
109
+ ## src/services/ - Service Rules
110
+
111
+ **EmbeddingService**: Local sentence-transformers (NO API key required). All operations async-safe via `run_in_executor()`. ChromaDB for vector storage. Deduplication threshold: 0.85 (85% similarity = duplicate).
112
+
113
+ **LlamaIndexRAGService**: Uses OpenAI embeddings (requires `OPENAI_API_KEY`). Methods: `ingest_evidence()`, `retrieve()`, `query()`. Returns documents with metadata (source, title, url, date, authors). Lazy initialization with graceful fallback.
114
+
115
+ **StatisticalAnalyzer**: Generates Python code via LLM. Executes in Modal sandbox (secure, isolated). Library versions pinned in `SANDBOX_LIBRARIES` dict. Returns `AnalysisResult` with verdict (SUPPORTED/REFUTED/INCONCLUSIVE).
116
+
117
+ **Singleton Pattern**: Use `@lru_cache(maxsize=1)` for singletons: `@lru_cache(maxsize=1); def get_service() -> Service: return Service()`. Lazy initialization to avoid requiring dependencies at import time.
118
+
119
+ ---
120
+
121
+ ## src/utils/ - Utility Rules
122
+
123
+ **Models**: All Pydantic models in `src/utils/models.py`. Use frozen models (`model_config = {"frozen": True}`) except where mutation needed. Use `Field()` with descriptions. Validate with constraints.
124
+
125
+ **Config**: Settings via Pydantic Settings (`src/utils/config.py`). Load from `.env` automatically. Use `settings` singleton: `from src.utils.config import settings`. Validate API keys with properties: `has_openai_key`, `has_anthropic_key`.
126
+
127
+ **Exceptions**: Custom exception hierarchy in `src/utils/exceptions.py`. Base: `DeepCriticalError`. Specific: `SearchError`, `RateLimitError`, `JudgeError`, `ConfigurationError`. Always chain exceptions.
128
+
129
+ **LLM Factory**: Centralized LLM model creation in `src/utils/llm_factory.py`. Supports OpenAI, Anthropic, HF Inference. Use `get_model()` or factory functions. Check requirements before initialization.
130
+
131
+ **Citation Validator**: Use `validate_references()` from `src/utils/citation_validator.py`. Removes hallucinated citations (URLs not in evidence). Logs warnings. Returns validated report string.
132
+
133
+ ---
134
+
135
+ ## src/orchestrator_factory.py Rules
136
+
137
+ **Purpose**: Factory for creating orchestrators. Supports "simple" (legacy) and "advanced" (magentic) modes. Auto-detects mode based on API key availability.
138
+
139
+ **Pattern**: Lazy import for optional dependencies (`_get_magentic_orchestrator_class()`). Handles `ImportError` gracefully with clear error messages.
140
+
141
+ **Mode Detection**: `_determine_mode()` checks explicit mode or auto-detects: "advanced" if `settings.has_openai_key`, else "simple". Maps "magentic" → "advanced".
142
+
143
+ **Function Signature**: `create_orchestrator(search_handler, judge_handler, config, mode) -> Any`. Simple mode requires handlers. Advanced mode uses MagenticOrchestrator.
144
+
145
+ **Error Handling**: Raise `ValueError` with clear messages if requirements not met. Log mode selection with structlog.
146
+
147
+ ---
148
+
149
+ ## src/orchestrator_hierarchical.py Rules
150
+
151
+ **Purpose**: Hierarchical orchestrator using middleware and sub-teams. Adapts Magentic ChatAgent to SubIterationTeam protocol.
152
+
153
+ **Pattern**: Uses `SubIterationMiddleware` with `ResearchTeam` and `LLMSubIterationJudge`. Event-driven via callback queue.
154
+
155
+ **State Initialization**: Initialize embedding service with graceful fallback. Use `init_magentic_state()` (deprecated, but kept for compatibility).
156
+
157
+ **Event Streaming**: Uses `asyncio.Queue` for event coordination. Yields `AgentEvent` objects. Handles event callback pattern with `asyncio.wait()`.
158
+
159
+ **Error Handling**: Log errors with context. Yield error events. Process remaining events after task completion.
160
+
161
+ ---
162
+
163
+ ## src/orchestrator_magentic.py Rules
164
+
165
+ **Purpose**: Magentic-based orchestrator using ChatAgent pattern. Each agent has internal LLM. Manager orchestrates agents.
166
+
167
+ **Pattern**: Uses `MagenticBuilder` with participants (searcher, hypothesizer, judge, reporter). Manager uses `OpenAIChatClient`. Workflow built in `_build_workflow()`.
168
+
169
+ **Event Processing**: `_process_event()` converts Magentic events to `AgentEvent`. Handles: `MagenticOrchestratorMessageEvent`, `MagenticAgentMessageEvent`, `MagenticFinalResultEvent`, `MagenticAgentDeltaEvent`, `WorkflowOutputEvent`.
170
+
171
+ **Text Extraction**: `_extract_text()` defensively extracts text from messages. Priority: `.content` → `.text` → `str(message)`. Handles buggy message objects.
172
+
173
+ **State Initialization**: Initialize embedding service with graceful fallback. Use `init_magentic_state()` (deprecated).
174
+
175
+ **Requirements**: Must call `check_magentic_requirements()` in `__init__`. Requires `agent-framework-core` and OpenAI API key.
176
+
177
+ **Event Types**: Maps agent names to event types: "search" → "search_complete", "judge" → "judge_complete", "hypothes" → "hypothesizing", "report" → "synthesizing".
178
+
179
+ ---
180
+
181
+ ## src/agent_factory/ - Factory Rules
182
+
183
+ **Pattern**: Factory functions for creating agents and handlers. Lazy initialization for optional dependencies. Support OpenAI/Anthropic/HF Inference.
184
+
185
+ **Judges**: `create_judge_handler()` creates `JudgeHandler` with structured output (`JudgeAssessment`). Supports `MockJudgeHandler`, `HFInferenceJudgeHandler` as fallbacks.
186
+
187
+ **Agents**: Factory functions in `agents.py` for all Pydantic AI agents. Pattern: `create_agent_name(model: Any | None = None) -> AgentName`. Use `get_model()` if model not provided.
188
+
189
+ **Graph Builder**: `graph_builder.py` contains utilities for building research graphs. Supports iterative and deep research graph construction.
190
+
191
+ **Error Handling**: Raise `ConfigurationError` if required API keys missing. Log agent creation. Handle import errors gracefully.
192
+
193
+ ---
194
+
195
+ ## src/prompts/ - Prompt Rules
196
+
197
+ **Pattern**: System prompts stored as module-level constants. Include date injection: `datetime.now().strftime("%Y-%m-%d")`. Format evidence with truncation (1500 chars per item).
198
+
199
+ **Judge Prompts**: In `judge.py`. Handle empty evidence case separately. Always request structured JSON output.
200
+
201
+ **Hypothesis Prompts**: In `hypothesis.py`. Use diverse evidence selection (MMR algorithm). Sentence-aware truncation.
202
+
203
+ **Report Prompts**: In `report.py`. Include full citation details. Use diverse evidence selection (n=20). Emphasize citation validation rules.
204
+
205
+ ---
206
+
207
+ ## Testing Rules
208
+
209
+ **Structure**: Unit tests in `tests/unit/` (mocked, fast). Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`).
210
+
211
+ **Mocking**: Use `respx` for httpx mocking. Use `pytest-mock` for general mocking. Mock LLM calls in unit tests (use `MockJudgeHandler`).
212
+
213
+ **Fixtures**: Common fixtures in `tests/conftest.py`: `mock_httpx_client`, `mock_llm_response`.
214
+
215
+ **Coverage**: Aim for >80% coverage. Test error handling, edge cases, and integration paths.
216
+
217
+ ---
218
+
219
+ ## File-Specific Agent Rules
220
+
221
+ **knowledge_gap.py**: Outputs `KnowledgeGapOutput`. System prompt evaluates research completeness. Handles conversation history. Returns fallback on error.
222
+
223
+ **writer.py**: Returns markdown string. System prompt includes citation format examples. Validates inputs. Truncates long findings. Retry logic for transient failures.
224
+
225
+ **long_writer.py**: Uses `ReportDraft` input/output. Writes sections iteratively. Reformats references (deduplicates, renumbers). Reformats section headings.
226
+
227
+ **proofreader.py**: Takes `ReportDraft`, returns polished markdown. Removes duplicates. Adds summary. Preserves references.
228
+
229
+ **tool_selector.py**: Outputs `AgentSelectionPlan`. System prompt lists available agents (WebSearchAgent, SiteCrawlerAgent, RAGAgent). Guidelines for when to use each.
230
+
231
+ **thinking.py**: Returns observation string. Generates observations from conversation history. Uses query and background context.
232
+
233
+ **input_parser.py**: Outputs `ParsedQuery`. Detects research mode (iterative/deep). Extracts entities and research questions. Improves/refines query.
234
+
235
+
236
+
CONTRIBUTING.md ADDED
@@ -0,0 +1 @@
1
+ Make sure you run the full pre-commit checks before opening a PR (not a draft); otherwise Obstacle is the Way will lose his mind.
Dockerfile ADDED
@@ -0,0 +1,52 @@
1
+ # Dockerfile for DeepCritical
2
+ FROM python:3.11-slim
3
+
4
+ # Set working directory
5
+ WORKDIR /app
6
+
7
+ # Install system dependencies (curl needed for HEALTHCHECK)
8
+ RUN apt-get update && apt-get install -y \
9
+ git \
10
+ curl \
11
+ && rm -rf /var/lib/apt/lists/*
12
+
13
+ # Install uv
14
+ RUN pip install uv==0.5.4
15
+
16
+ # Copy project files
17
+ COPY pyproject.toml .
18
+ COPY uv.lock .
19
+ COPY src/ src/
20
+ COPY README.md .
21
+
22
+ # Install runtime dependencies only (no dev/test tools)
23
+ RUN uv sync --frozen --no-dev --extra embeddings --extra magentic
24
+
25
+ # Create non-root user BEFORE downloading models
26
+ RUN useradd --create-home --shell /bin/bash appuser
27
+
28
+ # Set cache directory for HuggingFace models (must be writable by appuser)
29
+ ENV HF_HOME=/app/.cache
30
+ ENV TRANSFORMERS_CACHE=/app/.cache
31
+
32
+ # Create cache dir with correct ownership
33
+ RUN mkdir -p /app/.cache && chown -R appuser:appuser /app/.cache
34
+
35
+ # Pre-download the embedding model during build (as appuser to set correct ownership)
36
+ USER appuser
37
+ RUN uv run python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
38
+
39
+ # Expose port
40
+ EXPOSE 7860
41
+
42
+ # Health check
43
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
44
+ CMD curl -f http://localhost:7860/ || exit 1
45
+
46
+ # Set environment variables
47
+ ENV GRADIO_SERVER_NAME=0.0.0.0
48
+ ENV GRADIO_SERVER_PORT=7860
49
+ ENV PYTHONPATH=/app
50
+
51
+ # Run the app
52
+ CMD ["uv", "run", "python", "-m", "src.app"]
Makefile ADDED
@@ -0,0 +1,42 @@
1
+ .PHONY: install test test-hf test-all test-cov lint format typecheck check clean all cov cov-html
2
+
3
+ # Default target
4
+ all: check
5
+
6
+ install:
7
+ uv sync --all-extras
8
+ uv run pre-commit install
9
+
10
+ test:
11
+ uv run pytest tests/unit/ -v -m "not openai" -p no:logfire
12
+
13
+ test-hf:
14
+ uv run pytest tests/ -v -m "huggingface" -p no:logfire
15
+
16
+ test-all:
17
+ uv run pytest tests/ -v -p no:logfire
18
+
19
+ # Coverage aliases
20
+ cov: test-cov
21
+ test-cov:
22
+ uv run pytest --cov=src --cov-report=term-missing -m "not openai" -p no:logfire
23
+
24
+ cov-html:
25
+ uv run pytest --cov=src --cov-report=html -p no:logfire
26
+ @echo "Coverage report: open htmlcov/index.html"
27
+
28
+ lint:
29
+ uv run ruff check src tests
30
+
31
+ format:
32
+ uv run ruff format src tests
33
+
34
+ typecheck:
35
+ uv run mypy src
36
+
37
+ check: lint typecheck test-cov
38
+ @echo "All checks passed!"
39
+
40
+ clean:
41
+ rm -rf .pytest_cache .mypy_cache .ruff_cache __pycache__ .coverage htmlcov
42
+ find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true
README.md ADDED
@@ -0,0 +1,196 @@
1
+ ---
2
+ title: DeepCritical
3
+ emoji: 🧬
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: gradio
7
+ sdk_version: "6.0.1"
8
+ python_version: "3.11"
9
+ app_file: src/app.py
10
+ pinned: false
11
+ license: mit
12
+ tags:
13
+ - mcp-in-action-track-enterprise
14
+ - mcp-hackathon
15
+ - drug-repurposing
16
+ - biomedical-ai
17
+ - pydantic-ai
18
+ - llamaindex
19
+ - modal
20
+ ---
21
+
22
+ # DeepCritical
23
+
24
+ ## Intro
25
+
26
+ ## Features
27
+
28
+ - **Multi-Source Search**: PubMed, ClinicalTrials.gov, bioRxiv/medRxiv
29
+ - **MCP Integration**: Use our tools from Claude Desktop or any MCP client
30
+ - **Modal Sandbox**: Secure execution of AI-generated statistical code
31
+ - **LlamaIndex RAG**: Semantic search and evidence synthesis
32
+ - **HuggingFace Inference**: LLM and embedding inference via the HuggingFace Inference API
33
+ - **HuggingFace MCP**: Custom configuration for using community MCP tools
34
+ - **Strongly Typed Composable Graphs**: Pydantic-typed research graphs for iterative and deep research
35
+ - **Specialized Research Teams of Agents**: Dedicated agents for searching, judging, and report writing
36
+
37
+ ## Quick Start
38
+
39
+ ### 1. Environment Setup
40
+
41
+ ```bash
42
+ # Install uv if you haven't already
43
+ pip install uv
44
+
45
+ # Sync dependencies
46
+ uv sync
47
+ ```
48
+
49
+ ### 2. Run the UI
50
+
51
+ ```bash
52
+ # Start the Gradio app
53
+ uv run python -m src.app
54
+ ```
55
+
56
+ Open your browser to `http://localhost:7860`.
57
+
58
+ ### 3. Connect via MCP
59
+
60
+ This application exposes a Model Context Protocol (MCP) server, allowing you to use its search tools directly from Claude Desktop or other MCP clients.
61
+
62
+ **MCP Server URL**: `http://localhost:7860/gradio_api/mcp/`
63
+
64
+ **Claude Desktop Configuration**:
65
+ Add this to your `claude_desktop_config.json`:
66
+ ```json
67
+ {
68
+ "mcpServers": {
69
+ "deepcritical": {
70
+ "url": "http://localhost:7860/gradio_api/mcp/"
71
+ }
72
+ }
73
+ }
74
+ ```
75
+
76
+ **Available Tools**:
77
+ - `search_pubmed`: Search peer-reviewed biomedical literature.
78
+ - `search_clinical_trials`: Search ClinicalTrials.gov.
79
+ - `search_biorxiv`: Search bioRxiv/medRxiv preprints.
80
+ - `search_all`: Search all sources simultaneously.
81
+ - `analyze_hypothesis`: Secure statistical analysis using Modal sandboxes.
82
+
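+ For reference, these tools are served by the Gradio app itself; a minimal sketch of how that is typically wired up (assumes the `gradio[mcp]` extra and the `mcp_server` launch flag; the real tool implementations live in `src/app.py`):
+
+ ```python
+ import gradio as gr
+
+ def search_pubmed(query: str, max_results: int = 10) -> str:
+     """Search peer-reviewed biomedical literature."""  # the docstring becomes the MCP tool description
+     return f"(results for {query!r} would appear here)"
+
+ demo = gr.Interface(fn=search_pubmed, inputs=["text", "number"], outputs="text")
+ demo.launch(mcp_server=True)  # exposes the tools under /gradio_api/mcp/
+ ```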
83
+
84
+
85
+ ## Architecture
86
+
87
+ DeepCritical uses a Vertical Slice Architecture:
88
+
89
+ 1. **Search Slice**: Retrieving evidence from PubMed, ClinicalTrials.gov, and bioRxiv.
90
+ 2. **Judge Slice**: Evaluating evidence quality using LLMs.
91
+ 3. **Orchestrator Slice**: Managing the research loop and UI.
92
+
93
+ - iterativeResearch
94
+ - deepResearch
95
+ - researchTeam
96
+
97
+ ### Iterative Research
98
+
99
+ ```mermaid
+ sequenceDiagram
100
+ participant IterativeFlow
101
+ participant ThinkingAgent
102
+ participant KnowledgeGapAgent
103
+ participant ToolSelector
104
+ participant ToolExecutor
105
+ participant JudgeHandler
106
+ participant WriterAgent
107
+
108
+ IterativeFlow->>IterativeFlow: run(query)
109
+
110
+ loop Until complete or max_iterations
111
+ IterativeFlow->>ThinkingAgent: generate_observations()
112
+ ThinkingAgent-->>IterativeFlow: observations
113
+
114
+ IterativeFlow->>KnowledgeGapAgent: evaluate_gaps()
115
+ KnowledgeGapAgent-->>IterativeFlow: KnowledgeGapOutput
116
+
117
+ alt Research complete
118
+ IterativeFlow->>WriterAgent: create_final_report()
119
+ WriterAgent-->>IterativeFlow: final_report
120
+ else Gaps remain
121
+ IterativeFlow->>ToolSelector: select_agents(gap)
122
+ ToolSelector-->>IterativeFlow: AgentSelectionPlan
123
+
124
+ IterativeFlow->>ToolExecutor: execute_tool_tasks()
125
+ ToolExecutor-->>IterativeFlow: ToolAgentOutput[]
126
+
127
+ IterativeFlow->>JudgeHandler: assess_evidence()
128
+ JudgeHandler-->>IterativeFlow: should_continue
129
+ end
130
+ end
+ ```
131
+
132
+
133
+ ### Deep Research
134
+
135
+ ```mermaid
+ sequenceDiagram
136
+ actor User
137
+ participant GraphOrchestrator
138
+ participant InputParser
139
+ participant GraphBuilder
140
+ participant GraphExecutor
141
+ participant Agent
142
+ participant BudgetTracker
143
+ participant WorkflowState
144
+
145
+ User->>GraphOrchestrator: run(query)
146
+ GraphOrchestrator->>InputParser: detect_research_mode(query)
147
+ InputParser-->>GraphOrchestrator: mode (iterative/deep)
148
+ GraphOrchestrator->>GraphBuilder: build_graph(mode)
149
+ GraphBuilder-->>GraphOrchestrator: ResearchGraph
150
+ GraphOrchestrator->>WorkflowState: init_workflow_state()
151
+ GraphOrchestrator->>BudgetTracker: create_budget()
152
+ GraphOrchestrator->>GraphExecutor: _execute_graph(graph)
153
+
154
+ loop For each node in graph
155
+ GraphExecutor->>Agent: execute_node(agent_node)
156
+ Agent->>Agent: process_input
157
+ Agent-->>GraphExecutor: result
158
+ GraphExecutor->>WorkflowState: update_state(result)
159
+ GraphExecutor->>BudgetTracker: add_tokens(used)
160
+ GraphExecutor->>BudgetTracker: check_budget()
161
+ alt Budget exceeded
162
+ GraphExecutor->>GraphOrchestrator: emit(error_event)
163
+ else Continue
164
+ GraphExecutor->>GraphOrchestrator: emit(progress_event)
165
+ end
166
+ end
167
+
168
+ GraphOrchestrator->>User: AsyncGenerator[AgentEvent]
+ ```
169
+
170
+ ### Research Team
171
+
172
+ Critical Deep Research Agent
173
+
174
+ ## Development
175
+
176
+ ### Run Tests
177
+
178
+ ```bash
179
+ uv run pytest
180
+ ```
181
+
182
+ ### Run Checks
183
+
184
+ ```bash
185
+ make check
186
+ ```
187
+
188
+ ## Join Us
189
+
190
+ - The-Obstacle-Is-The-Way
191
+ - MarioAderman
192
+ - Josephrp
193
+
194
+ ## Links
195
+
196
+ - [GitHub Repository](https://github.com/The-Obstacle-Is-The-Way/DeepCritical-1)
docs/CONFIGURATION.md ADDED
@@ -0,0 +1,301 @@
1
+ # Configuration Guide
2
+
3
+ ## Overview
4
+
5
+ DeepCritical uses **Pydantic Settings** for centralized configuration management. All settings are defined in `src/utils/config.py` and can be configured via environment variables or a `.env` file.
6
+
7
+ ## Quick Start
8
+
9
+ 1. Copy the example environment file (if available) or create a `.env` file in the project root
10
+ 2. Set at least one LLM API key (`OPENAI_API_KEY` or `ANTHROPIC_API_KEY`)
11
+ 3. Optionally configure other services as needed
12
+
13
+ ## Configuration System
14
+
15
+ ### How It Works
16
+
17
+ - **Settings Class**: `Settings` class in `src/utils/config.py` extends `BaseSettings` from `pydantic_settings`
18
+ - **Environment File**: Automatically loads from `.env` file (if present)
19
+ - **Environment Variables**: Reads from environment variables (case-insensitive)
20
+ - **Type Safety**: Strongly-typed fields with validation
21
+ - **Singleton Pattern**: Global `settings` instance for easy access
22
+
23
+ ### Usage
24
+
25
+ ```python
26
+ from src.utils.config import settings
27
+
28
+ # Check if API keys are available
29
+ if settings.has_openai_key:
30
+ # Use OpenAI
31
+ pass
32
+
33
+ # Access configuration values
34
+ max_iterations = settings.max_iterations
35
+ web_search_provider = settings.web_search_provider
36
+ ```
37
+
38
+ ## Required Configuration
39
+
40
+ ### At Least One LLM Provider
41
+
42
+ You must configure at least one LLM provider:
43
+
44
+ **OpenAI:**
45
+ ```bash
46
+ LLM_PROVIDER=openai
47
+ OPENAI_API_KEY=your_openai_api_key_here
48
+ OPENAI_MODEL=gpt-5.1
49
+ ```
50
+
51
+ **Anthropic:**
52
+ ```bash
53
+ LLM_PROVIDER=anthropic
54
+ ANTHROPIC_API_KEY=your_anthropic_api_key_here
55
+ ANTHROPIC_MODEL=claude-sonnet-4-5-20250929
56
+ ```
57
+
58
+ ## Optional Configuration
59
+
60
+ ### Embedding Configuration
61
+
62
+ ```bash
63
+ # Embedding Provider: "openai", "local", or "huggingface"
64
+ EMBEDDING_PROVIDER=local
65
+
66
+ # OpenAI Embedding Model (used by LlamaIndex RAG)
67
+ OPENAI_EMBEDDING_MODEL=text-embedding-3-small
68
+
69
+ # Local Embedding Model (sentence-transformers)
70
+ LOCAL_EMBEDDING_MODEL=all-MiniLM-L6-v2
71
+
72
+ # HuggingFace Embedding Model
73
+ HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
74
+ ```
75
+
76
+ ### HuggingFace Configuration
77
+
78
+ ```bash
79
+ # HuggingFace API Token (for inference API)
80
+ HUGGINGFACE_API_KEY=your_huggingface_api_key_here
81
+ # Or use HF_TOKEN (alternative name)
82
+
83
+ # Default HuggingFace Model ID
84
+ HUGGINGFACE_MODEL=meta-llama/Llama-3.1-8B-Instruct
85
+ ```
86
+
87
+ ### Web Search Configuration
88
+
89
+ ```bash
90
+ # Web Search Provider: "serper", "searchxng", "brave", "tavily", or "duckduckgo"
91
+ # Default: "duckduckgo" (no API key required)
92
+ WEB_SEARCH_PROVIDER=duckduckgo
93
+
94
+ # Serper API Key (for Google search via Serper)
95
+ SERPER_API_KEY=your_serper_api_key_here
96
+
97
+ # SearchXNG Host URL
98
+ SEARCHXNG_HOST=http://localhost:8080
99
+
100
+ # Brave Search API Key
101
+ BRAVE_API_KEY=your_brave_api_key_here
102
+
103
+ # Tavily API Key
104
+ TAVILY_API_KEY=your_tavily_api_key_here
105
+ ```
106
+
107
+ ### PubMed Configuration
108
+
109
+ ```bash
110
+ # NCBI API Key (optional, for higher rate limits: 10 req/sec vs 3 req/sec)
111
+ NCBI_API_KEY=your_ncbi_api_key_here
112
+ ```
113
+
114
+ ### Agent Configuration
115
+
116
+ ```bash
117
+ # Maximum iterations per research loop
118
+ MAX_ITERATIONS=10
119
+
120
+ # Search timeout in seconds
121
+ SEARCH_TIMEOUT=30
122
+
123
+ # Use graph-based execution for research flows
124
+ USE_GRAPH_EXECUTION=false
125
+ ```
126
+
127
+ ### Budget & Rate Limiting Configuration
128
+
129
+ ```bash
130
+ # Default token budget per research loop
131
+ DEFAULT_TOKEN_LIMIT=100000
132
+
133
+ # Default time limit per research loop (minutes)
134
+ DEFAULT_TIME_LIMIT_MINUTES=10
135
+
136
+ # Default iterations limit per research loop
137
+ DEFAULT_ITERATIONS_LIMIT=10
138
+ ```
139
+
140
+ ### RAG Service Configuration
141
+
142
+ ```bash
143
+ # ChromaDB collection name for RAG
144
+ RAG_COLLECTION_NAME=deepcritical_evidence
145
+
146
+ # Number of top results to retrieve from RAG
147
+ RAG_SIMILARITY_TOP_K=5
148
+
149
+ # Automatically ingest evidence into RAG
150
+ RAG_AUTO_INGEST=true
151
+ ```
152
+
153
+ ### ChromaDB Configuration
154
+
155
+ ```bash
156
+ # ChromaDB storage path
157
+ CHROMA_DB_PATH=./chroma_db
158
+
159
+ # Whether to persist ChromaDB to disk
160
+ CHROMA_DB_PERSIST=true
161
+
162
+ # ChromaDB server host (for remote ChromaDB, optional)
163
+ # CHROMA_DB_HOST=localhost
164
+
165
+ # ChromaDB server port (for remote ChromaDB, optional)
166
+ # CHROMA_DB_PORT=8000
167
+ ```
168
+
169
+ ### External Services
170
+
171
+ ```bash
172
+ # Modal Token ID (for Modal sandbox execution)
173
+ MODAL_TOKEN_ID=your_modal_token_id_here
174
+
175
+ # Modal Token Secret
176
+ MODAL_TOKEN_SECRET=your_modal_token_secret_here
177
+ ```
178
+
179
+ ### Logging Configuration
180
+
181
+ ```bash
182
+ # Log Level: "DEBUG", "INFO", "WARNING", or "ERROR"
183
+ LOG_LEVEL=INFO
184
+ ```
185
+
186
+ ## Configuration Properties
187
+
188
+ The `Settings` class provides helpful properties for checking configuration:
189
+
190
+ ```python
191
+ from src.utils.config import settings
192
+
193
+ # Check API key availability
194
+ settings.has_openai_key # bool
195
+ settings.has_anthropic_key # bool
196
+ settings.has_huggingface_key # bool
197
+ settings.has_any_llm_key # bool
198
+
199
+ # Check service availability
200
+ settings.modal_available # bool
201
+ settings.web_search_available # bool
202
+ ```
203
+
204
+ ## Environment Variables Reference
205
+
206
+ ### Required (at least one LLM)
207
+ - `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` - At least one LLM provider key
208
+
209
+ ### Optional LLM Providers
210
+ - `DEEPSEEK_API_KEY` (Phase 2)
211
+ - `OPENROUTER_API_KEY` (Phase 2)
212
+ - `GEMINI_API_KEY` (Phase 2)
213
+ - `PERPLEXITY_API_KEY` (Phase 2)
214
+ - `HUGGINGFACE_API_KEY` or `HF_TOKEN`
215
+ - `AZURE_OPENAI_ENDPOINT` (Phase 2)
216
+ - `AZURE_OPENAI_DEPLOYMENT` (Phase 2)
217
+ - `AZURE_OPENAI_API_KEY` (Phase 2)
218
+ - `AZURE_OPENAI_API_VERSION` (Phase 2)
219
+ - `LOCAL_MODEL_URL` (Phase 2)
220
+
221
+ ### Web Search
222
+ - `WEB_SEARCH_PROVIDER` (default: "duckduckgo")
223
+ - `SERPER_API_KEY`
224
+ - `SEARCHXNG_HOST`
225
+ - `BRAVE_API_KEY`
226
+ - `TAVILY_API_KEY`
227
+
228
+ ### Embeddings
229
+ - `EMBEDDING_PROVIDER` (default: "local")
230
+ - `HUGGINGFACE_EMBEDDING_MODEL` (optional)
231
+
232
+ ### RAG
233
+ - `RAG_COLLECTION_NAME` (default: "deepcritical_evidence")
234
+ - `RAG_SIMILARITY_TOP_K` (default: 5)
235
+ - `RAG_AUTO_INGEST` (default: true)
236
+
237
+ ### ChromaDB
238
+ - `CHROMA_DB_PATH` (default: "./chroma_db")
239
+ - `CHROMA_DB_PERSIST` (default: true)
240
+ - `CHROMA_DB_HOST` (optional)
241
+ - `CHROMA_DB_PORT` (optional)
242
+
243
+ ### Budget
244
+ - `DEFAULT_TOKEN_LIMIT` (default: 100000)
245
+ - `DEFAULT_TIME_LIMIT_MINUTES` (default: 10)
246
+ - `DEFAULT_ITERATIONS_LIMIT` (default: 10)
247
+
248
+ ### Other
249
+ - `LLM_PROVIDER` (default: "openai")
250
+ - `NCBI_API_KEY` (optional)
251
+ - `MODAL_TOKEN_ID` (optional)
252
+ - `MODAL_TOKEN_SECRET` (optional)
253
+ - `MAX_ITERATIONS` (default: 10)
254
+ - `LOG_LEVEL` (default: "INFO")
255
+ - `USE_GRAPH_EXECUTION` (default: false)
256
+
257
+ ## Validation
258
+
259
+ Settings are validated on load using Pydantic validation (a field-level sketch follows the list below):
260
+
261
+ - **Type checking**: All fields are strongly typed
262
+ - **Range validation**: Numeric fields have min/max constraints
263
+ - **Literal validation**: Enum fields only accept specific values
264
+ - **Required fields**: API keys are checked when accessed via `get_api_key()`
265
+
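+ For illustration, a field-level sketch of what this looks like in the `Settings` class (the actual field definitions live in `src/utils/config.py`; the fields below are representative):
+
+ ```python
+ from typing import Literal
+
+ from pydantic import Field
+ from pydantic_settings import BaseSettings, SettingsConfigDict
+
+ class Settings(BaseSettings):
+     model_config = SettingsConfigDict(env_file=".env", case_sensitive=False)
+
+     # Range-validated numeric field
+     max_iterations: int = Field(default=10, ge=1, le=50)
+     # Literal-validated enum field
+     llm_provider: Literal["openai", "anthropic"] = "openai"
+     # Optional secret, only checked when get_api_key() is called
+     openai_api_key: str | None = None
+ ```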
266
+ ## Error Handling
267
+
268
+ Configuration errors raise `ConfigurationError`:
269
+
270
+ ```python
271
+ from src.utils.config import settings
272
+ from src.utils.exceptions import ConfigurationError
273
+
274
+ try:
275
+ api_key = settings.get_api_key()
276
+ except ConfigurationError as e:
277
+ print(f"Configuration error: {e}")
278
+ ```
279
+
280
+ ## Future Enhancements (Phase 2)
281
+
282
+ The following configurations are planned for Phase 2:
283
+
284
+ 1. **Additional LLM Providers**: DeepSeek, OpenRouter, Gemini, Perplexity, Azure OpenAI, Local models
285
+ 2. **Model Selection**: Reasoning/main/fast model configuration
286
+ 3. **Service Integration**: Migrate `folder/llm_config.py` to centralized config
287
+
288
+ See `CONFIGURATION_ANALYSIS.md` for the complete implementation plan.
289
+
290
+
291
+
292
+
293
+
294
+
295
+
296
+
297
+
298
+
299
+
300
+
301
+
docs/architecture/design-patterns.md ADDED
@@ -0,0 +1,1509 @@
1
+ # Design Patterns & Technical Decisions
2
+ ## Explicit Answers to Architecture Questions
3
+
4
+ ---
5
+
6
+ ## Purpose of This Document
7
+
8
+ This document explicitly answers all the "design pattern" questions raised in team discussions. It provides clear technical decisions with rationale.
9
+
10
+ ---
11
+
12
+ ## 1. Primary Architecture Pattern
13
+
14
+ ### Decision: Orchestrator with Search-Judge Loop
15
+
16
+ **Pattern Name**: Iterative Research Orchestrator
17
+
18
+ **Structure**:
19
+ ```
20
+ ┌─────────────────────────────────────┐
21
+ │ Research Orchestrator │
22
+ │ ┌───────────────────────────────┐ │
23
+ │ │ Search Strategy Planner │ │
24
+ │ └───────────────────────────────┘ │
25
+ │ ↓ │
26
+ │ ┌───────────────────────────────┐ │
27
+ │ │ Tool Coordinator │ │
28
+ │ │ - PubMed Search │ │
29
+ │ │ - Web Search │ │
30
+ │ │ - Clinical Trials │ │
31
+ │ └───────────────────────────────┘ │
32
+ │ ↓ │
33
+ │ ┌───────────────────────────────┐ │
34
+ │ │ Evidence Aggregator │ │
35
+ │ └───────────────────────────────┘ │
36
+ │ ↓ │
37
+ │ ┌───────────────────────────────┐ │
38
+ │ │ Quality Judge │ │
39
+ │ │ (LLM-based assessment) │ │
40
+ │ └───────────────────────────────┘ │
41
+ │ ↓ │
42
+ │ Loop or Synthesize? │
43
+ │ ↓ │
44
+ │ ┌───────────────────────────────┐ │
45
+ │ │ Report Generator │ │
46
+ │ └───────────────────────────────┘ │
47
+ └─────────────────────────────────────┘
48
+ ```
49
+
50
+ **Why NOT single-agent?**
51
+ - Need coordinated multi-tool queries
52
+ - Need iterative refinement
53
+ - Need quality assessment between searches
54
+
55
+ **Why NOT pure ReAct?**
56
+ - Medical research requires structured workflow
57
+ - Need explicit quality gates
58
+ - Want deterministic tool selection
59
+
60
+ **Why THIS pattern?**
61
+ - Clear separation of concerns
62
+ - Testable components
63
+ - Easy to debug
64
+ - Proven in similar systems
65
+
66
+ ---
67
+
68
+ ## 2. Tool Selection & Orchestration Pattern
69
+
70
+ ### Decision: Static Tool Registry with Dynamic Selection
71
+
72
+ **Pattern**:
73
+ ```python
74
+ class ToolRegistry:
75
+ """Central registry of available research tools"""
76
+ tools = {
77
+ 'pubmed': PubMedSearchTool(),
78
+ 'web': WebSearchTool(),
79
+ 'trials': ClinicalTrialsTool(),
80
+ 'drugs': DrugInfoTool(),
81
+ }
82
+
83
+ class Orchestrator:
84
+ def select_tools(self, question: str, iteration: int) -> List[Tool]:
85
+ """Dynamically choose tools based on context"""
86
+ if iteration == 0:
87
+ # First pass: broad search
88
+ return [tools['pubmed'], tools['web']]
89
+ else:
90
+ # Refinement: targeted search
91
+ return self.judge.recommend_tools(question, context)
92
+ ```
93
+
94
+ **Why NOT on-the-fly agent factories?**
95
+ - 6-day timeline (too complex)
96
+ - Tools are known upfront
97
+ - Simpler to test and debug
98
+
99
+ **Why NOT single tool?**
100
+ - Need multiple evidence sources
101
+ - Different tools for different info types
102
+ - Better coverage
103
+
104
+ **Why THIS pattern?**
105
+ - Balance flexibility vs simplicity
106
+ - Tools can be added easily
107
+ - Selection logic is transparent
108
+
109
+ ---
110
+
111
+ ## 3. Judge Pattern
112
+
113
+ ### Decision: Dual-Judge System (Quality + Budget)
114
+
115
+ **Pattern**:
116
+ ```python
117
+ class QualityJudge:
118
+ """LLM-based evidence quality assessment"""
119
+
120
+ def is_sufficient(self, question: str, evidence: List[Evidence]) -> bool:
121
+ """Main decision: do we have enough?"""
122
+ return (
123
+ self.has_mechanism_explanation(evidence) and
124
+ self.has_drug_candidates(evidence) and
125
+ self.has_clinical_evidence(evidence) and
126
+ self.confidence_score(evidence) > threshold
127
+ )
128
+
129
+ def identify_gaps(self, question: str, evidence: List[Evidence]) -> List[str]:
130
+ """What's missing?"""
131
+ gaps = []
132
+ if not self.has_mechanism_explanation(evidence):
133
+ gaps.append("disease mechanism")
134
+ if not self.has_drug_candidates(evidence):
135
+ gaps.append("potential drug candidates")
136
+ if not self.has_clinical_evidence(evidence):
137
+ gaps.append("clinical trial data")
138
+ return gaps
139
+
140
+ class BudgetJudge:
141
+ """Resource constraint enforcement"""
142
+
143
+ def should_stop(self, state: ResearchState) -> bool:
144
+ """Hard limits"""
145
+ return (
146
+ state.tokens_used >= max_tokens or
147
+ state.iterations >= max_iterations or
148
+ state.time_elapsed >= max_time
149
+ )
150
+ ```
151
+
152
+ **Why NOT just LLM judge?**
153
+ - Cost control (prevent runaway queries)
154
+ - Time bounds (hackathon demo needs to be fast)
155
+ - Safety (prevent infinite loops)
156
+
157
+ **Why NOT just token budget?**
158
+ - Want early exit when answer is good
159
+ - Quality matters, not just quantity
160
+ - Better user experience
161
+
162
+ **Why THIS pattern?**
163
+ - Best of both worlds
164
+ - Clear separation (quality vs resources)
165
+ - Each judge has single responsibility
166
+
167
+ ---
168
+
169
+ ## 4. Break/Stopping Pattern
170
+
171
+ ### Decision: Four-Tier Break Conditions
172
+
173
+ **Pattern**:
174
+ ```python
175
+ def should_continue(state: ResearchState) -> bool:
176
+ """Multi-tier stopping logic"""
177
+
178
+ # Tier 1: Quality-based (ideal stop)
179
+ if quality_judge.is_sufficient(state.question, state.evidence):
180
+ state.stop_reason = "sufficient_evidence"
181
+ return False
182
+
183
+ # Tier 2: Budget-based (cost control)
184
+ if state.tokens_used >= config.max_tokens:
185
+ state.stop_reason = "token_budget_exceeded"
186
+ return False
187
+
188
+ # Tier 3: Iteration-based (safety)
189
+ if state.iterations >= config.max_iterations:
190
+ state.stop_reason = "max_iterations_reached"
191
+ return False
192
+
193
+ # Tier 4: Time-based (demo friendly)
194
+ if state.time_elapsed >= config.max_time:
195
+ state.stop_reason = "timeout"
196
+ return False
197
+
198
+ return True # Continue researching
199
+ ```
200
+
201
+ **Configuration**:
202
+ ```toml
203
+ [research.limits]
204
+ max_tokens = 50000 # ~$0.50 at Claude pricing
205
+ max_iterations = 5 # Reasonable depth
206
+ max_time_seconds = 120 # 2 minutes for demo
207
+ judge_threshold = 0.8 # Quality confidence score
208
+ ```
209
+
210
+ **Why multiple conditions?**
211
+ - Defense in depth
212
+ - Different failure modes
213
+ - Graceful degradation
214
+
215
+ **Why these specific limits?**
216
+ - Tokens: Balances cost vs quality
217
+ - Iterations: Enough for refinement, not too deep
218
+ - Time: Fast enough for live demo
219
+ - Judge: High bar for quality
220
+
221
+ ---
222
+
223
+ ## 5. State Management Pattern
224
+
225
+ ### Decision: Pydantic State Machine with Checkpoints
226
+
227
+ **Pattern**:
228
+ ```python
229
+ class ResearchState(BaseModel):
230
+ """Immutable state snapshots"""
231
+ query_id: str
232
+ question: str
233
+ iteration: int = 0
234
+ evidence: List[Evidence] = []
235
+ tokens_used: int = 0
236
+ search_history: List[SearchQuery] = []
237
+ stop_reason: Optional[str] = None
238
+ created_at: datetime
239
+ updated_at: datetime
240
+
241
+ class StateManager:
242
+ def save_checkpoint(self, state: ResearchState) -> None:
243
+ """Save state to disk"""
244
+ path = Path(f".deepresearch/checkpoints/{state.query_id}_iter{state.iteration}.json")  # pathlib.Path, so write_text works
245
+ path.write_text(state.model_dump_json(indent=2))
246
+
247
+ def load_checkpoint(self, query_id: str, iteration: int) -> ResearchState:
248
+ """Resume from checkpoint"""
249
+ path = Path(f".deepresearch/checkpoints/{query_id}_iter{iteration}.json")
250
+ return ResearchState.model_validate_json(path.read_text())
251
+ ```
252
+
253
+ **Directory Structure**:
254
+ ```
255
+ .deepresearch/
256
+ ├── state/
257
+ │ └── current_123.json # Active research state
258
+ ├── checkpoints/
259
+ │ ├── query_123_iter0.json # Checkpoint after iteration 0
260
+ │ ├── query_123_iter1.json # Checkpoint after iteration 1
261
+ │ └── query_123_iter2.json # Checkpoint after iteration 2
262
+ └── workspace/
263
+ └── query_123/
264
+ ├── papers/ # Downloaded PDFs
265
+ ├── search_results/ # Raw search results
266
+ └── analysis/ # Intermediate analysis
267
+ ```
268
+
269
+ **Why Pydantic?**
270
+ - Type safety
271
+ - Validation
272
+ - Easy serialization
273
+ - Integration with Pydantic AI
274
+
275
+ **Why checkpoints?**
276
+ - Resume interrupted research
277
+ - Debugging (inspect state at each iteration)
278
+ - Cost savings (don't re-query)
279
+ - Demo resilience
280
+
281
+ ---
282
+
283
+ ## 6. Tool Interface Pattern
284
+
285
+ ### Decision: Async Unified Tool Protocol
286
+
287
+ **Pattern**:
288
+ ```python
289
+ from typing import Protocol, Optional, List, Dict
290
+ import asyncio
+ import httpx
291
+
292
+ class ResearchTool(Protocol):
293
+ """Standard async interface all tools must implement"""
294
+
295
+ async def search(
296
+ self,
297
+ query: str,
298
+ max_results: int = 10,
299
+ filters: Optional[Dict] = None
300
+ ) -> List[Evidence]:
301
+ """Execute search and return structured evidence"""
302
+ ...
303
+
304
+ def get_metadata(self) -> ToolMetadata:
305
+ """Tool capabilities and requirements"""
306
+ ...
307
+
308
+ class PubMedSearchTool:
309
+ """Concrete async implementation"""
310
+
311
+ def __init__(self):
312
+ self._rate_limiter = asyncio.Semaphore(3)  # cap at 3 concurrent requests (PubMed allows ~3 req/sec)
313
+ self._cache: Dict[str, List[Evidence]] = {}
314
+
315
+ async def search(self, query: str, max_results: int = 10, **kwargs) -> List[Evidence]:
316
+ # Check cache first
317
+ cache_key = f"{query}:{max_results}"
318
+ if cache_key in self._cache:
319
+ return self._cache[cache_key]
320
+
321
+ async with self._rate_limiter:
322
+ # 1. Query PubMed E-utilities API (async httpx)
323
+ async with httpx.AsyncClient() as client:
324
+ response = await client.get(
325
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
326
+ params={"db": "pubmed", "term": query, "retmax": max_results}
327
+ )
328
+ # 2. Parse XML response
329
+ # 3. Extract: title, abstract, authors, citations
330
+ # 4. Convert to Evidence objects
331
+ evidence_list = self._parse_response(response.text)
332
+
333
+ # Cache results
334
+ self._cache[cache_key] = evidence_list
335
+ return evidence_list
336
+
337
+ def get_metadata(self) -> ToolMetadata:
338
+ return ToolMetadata(
339
+ name="PubMed",
340
+ description="Biomedical literature search",
341
+ rate_limit="3 requests/second",
342
+ requires_api_key=False
343
+ )
344
+ ```
345
+
346
+ **Parallel Tool Execution**:
347
+ ```python
348
+ async def search_all_tools(query: str, tools: List[ResearchTool]) -> List[Evidence]:
349
+ """Run all tool searches in parallel"""
350
+ tasks = [tool.search(query) for tool in tools]
351
+ results = await asyncio.gather(*tasks, return_exceptions=True)
352
+
353
+ # Flatten and filter errors
354
+ evidence = []
355
+ for result in results:
356
+ if isinstance(result, Exception):
357
+ logger.warning(f"Tool failed: {result}")
358
+ else:
359
+ evidence.extend(result)
360
+ return evidence
361
+ ```
362
+
363
+ **Why Async?**
364
+ - Tools are I/O bound (network calls)
365
+ - Parallel execution = faster searches
366
+ - Better UX (streaming progress)
367
+ - Standard in 2025 Python
368
+
369
+ **Why Protocol?**
370
+ - Loose coupling
371
+ - Easy to add new tools
372
+ - Testable with mocks
373
+ - Clear contract
374
+
375
+ **Why NOT abstract base class?**
376
+ - More Pythonic (PEP 544)
377
+ - Duck typing friendly
378
+ - Runtime checking with isinstance (requires `@runtime_checkable`; sketch below)
379
+
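+ A minimal sketch of that runtime check; note that `isinstance` against a Protocol requires the `@runtime_checkable` decorator and only verifies method presence, not signatures:
+
+ ```python
+ from typing import Protocol, runtime_checkable
+
+ @runtime_checkable
+ class SearchableTool(Protocol):  # illustrative stand-in for ResearchTool
+     async def search(self, query: str, max_results: int = 10) -> list: ...
+     def get_metadata(self) -> dict: ...
+
+ class FakeTool:
+     async def search(self, query: str, max_results: int = 10) -> list:
+         return []
+
+     def get_metadata(self) -> dict:
+         return {"name": "fake"}
+
+ # Structural check at runtime: FakeTool never subclasses SearchableTool
+ assert isinstance(FakeTool(), SearchableTool)
+ ```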
380
+ ---
381
+
382
+ ## 7. Report Generation Pattern
383
+
384
+ ### Decision: Structured Output with Citations
385
+
386
+ **Pattern**:
387
+ ```python
388
+ class DrugCandidate(BaseModel):
389
+ name: str
390
+ mechanism: str
391
+ evidence_quality: Literal["strong", "moderate", "weak"]
392
+ clinical_status: str # "FDA approved", "Phase 2", etc.
393
+ citations: List[Citation]
394
+
395
+ class ResearchReport(BaseModel):
396
+ query: str
397
+ disease_mechanism: str
398
+ candidates: List[DrugCandidate]
399
+ methodology: str # How we searched
400
+ confidence: float
401
+ sources_used: List[str]
402
+ generated_at: datetime
403
+
404
+ def to_markdown(self) -> str:
405
+ """Human-readable format"""
406
+ ...
407
+
408
+ def to_json(self) -> str:
409
+ """Machine-readable format"""
410
+ ...
411
+ ```
412
+
413
+ **Output Example**:
414
+ ```markdown
415
+ # Research Report: Long COVID Fatigue
416
+
417
+ ## Disease Mechanism
418
+ Long COVID fatigue is associated with mitochondrial dysfunction
419
+ and persistent inflammation [1, 2].
420
+
421
+ ## Drug Candidates
422
+
423
+ ### 1. Coenzyme Q10 (CoQ10) - STRONG EVIDENCE
424
+ - **Mechanism**: Mitochondrial support, ATP production
425
+ - **Status**: FDA approved (supplement)
426
+ - **Evidence**: 2 randomized controlled trials showing fatigue reduction
427
+ - **Citations**:
428
+ - Smith et al. (2023) - PubMed: 12345678
429
+ - Johnson et al. (2023) - PubMed: 87654321
430
+
431
+ ### 2. Low-dose Naltrexone (LDN) - MODERATE EVIDENCE
432
+ - **Mechanism**: Anti-inflammatory, immune modulation
433
+ - **Status**: FDA approved (different indication)
434
+ - **Evidence**: 3 case studies, 1 ongoing Phase 2 trial
435
+ - **Citations**: ...
436
+
437
+ ## Methodology
438
+ - Searched PubMed: 45 papers reviewed
439
+ - Searched Web: 12 sources
440
+ - Clinical trials: 8 trials identified
441
+ - Total iterations: 3
442
+ - Tokens used: 12,450
443
+
444
+ ## Confidence: 85%
445
+
446
+ ## Sources
447
+ - PubMed E-utilities
448
+ - ClinicalTrials.gov
449
+ - OpenFDA Database
450
+ ```
451
+
452
+ **Why structured?**
453
+ - Parseable by other systems
454
+ - Consistent format
455
+ - Easy to validate
456
+ - Good for datasets
457
+
458
+ **Why markdown?**
459
+ - Human-readable
460
+ - Renders nicely in Gradio
461
+ - Easy to convert to PDF
462
+ - Standard format
463
+
464
+ ---
465
+
466
+ ## 8. Error Handling Pattern
467
+
468
+ ### Decision: Graceful Degradation with Fallbacks
469
+
470
+ **Pattern**:
471
+ ```python
472
+ class ResearchAgent:
473
+ def research(self, question: str) -> ResearchReport:
474
+ try:
475
+ return self._research_with_retry(question)
476
+ except TokenBudgetExceeded:
477
+ # Return partial results
478
+ return self._synthesize_partial(state)
479
+ except ToolFailure as e:
480
+ # Try alternate tools
481
+ return self._research_with_fallback(question, failed_tool=e.tool)
482
+ except Exception as e:
483
+ # Log and return error report
484
+ logger.error(f"Research failed: {e}")
485
+ return self._error_report(question, error=e)
486
+ ```
487
+
488
+ **Why NOT fail fast?**
489
+ - Hackathon demo must be robust
490
+ - Partial results better than nothing
491
+ - Good user experience
492
+
493
+ **Why NOT silent failures?**
494
+ - Need visibility for debugging
495
+ - User should know limitations
496
+ - Honest about confidence
497
+
498
+ ---
499
+
500
+ ## 9. Configuration Pattern
501
+
502
+ ### Decision: Hydra-inspired but Simpler
503
+
504
+ **Pattern**:
505
+ ```toml
506
+ # config.toml
507
+
508
+ [research]
509
+ max_iterations = 5
510
+ max_tokens = 50000
511
+ max_time_seconds = 120
512
+ judge_threshold = 0.85
513
+
514
+ [tools]
515
+ enabled = ["pubmed", "web", "trials"]
516
+
517
+ [tools.pubmed]
518
+ max_results = 20
519
+ rate_limit = 3 # per second
520
+
521
+ [tools.web]
522
+ engine = "serpapi"
523
+ max_results = 10
524
+
525
+ [llm]
526
+ provider = "anthropic"
527
+ model = "claude-3-5-sonnet-20241022"
528
+ temperature = 0.1
529
+
530
+ [output]
531
+ format = "markdown"
532
+ include_citations = true
533
+ include_methodology = true
534
+ ```
535
+
536
+ **Loading**:
537
+ ```python
538
+ from pathlib import Path
539
+ import tomllib
540
+
541
+ def load_config() -> dict:
542
+ config_path = Path("config.toml")
543
+ with open(config_path, "rb") as f:
544
+ return tomllib.load(f)
545
+ ```
546
+
547
+ **Why NOT full Hydra?**
548
+ - Simpler for hackathon
549
+ - Easier to understand
550
+ - Faster to modify
551
+ - Can upgrade later
552
+
553
+ **Why TOML?**
554
+ - Human-readable
555
+ - Standard (PEP 680)
556
+ - Better than YAML edge cases
557
+ - Native in Python 3.11+
558
+
559
+ ---
560
+
561
+ ## 10. Testing Pattern
562
+
563
+ ### Decision: Three-Level Testing Strategy
564
+
565
+ **Pattern**:
566
+ ```python
567
+ # Level 1: Unit tests (fast, isolated)
568
+ @pytest.mark.asyncio
+ async def test_pubmed_tool():
569
+ tool = PubMedSearchTool()
570
+ results = await tool.search("aspirin cardiovascular")
571
+ assert len(results) > 0
572
+ assert all(isinstance(r, Evidence) for r in results)
573
+
574
+ # Level 2: Integration tests (tools + agent)
575
+ def test_research_loop():
576
+ agent = ResearchAgent(config=test_config)
577
+ report = agent.research("aspirin repurposing")
578
+ assert report.candidates
579
+ assert report.confidence > 0
580
+
581
+ # Level 3: End-to-end tests (full system)
582
+ def test_full_workflow():
583
+ # Simulate user query through Gradio UI
584
+ response = gradio_app.predict("test query")
585
+ assert "Drug Candidates" in response
586
+ ```
587
+
588
+ **Why three levels?**
589
+ - Fast feedback (unit tests)
590
+ - Confidence (integration tests)
591
+ - Reality check (e2e tests)
592
+
593
+ **Test Data**:
594
+ ```python
595
+ # tests/fixtures/
596
+ - mock_pubmed_response.xml
597
+ - mock_web_results.json
598
+ - sample_research_query.txt
599
+ - expected_report.md
600
+ ```
601
+
602
+ ---
603
+
604
+ ## 11. Judge Prompt Templates
605
+
606
+ ### Decision: Structured JSON Output with Domain-Specific Criteria
607
+
608
+ **Quality Judge System Prompt**:
609
+ ```python
610
+ QUALITY_JUDGE_SYSTEM = """You are a medical research quality assessor specializing in drug repurposing.
611
+ Your task is to evaluate if collected evidence is sufficient to answer a drug repurposing question.
612
+
613
+ You assess evidence against four criteria specific to drug repurposing research:
614
+ 1. MECHANISM: Understanding of the disease's molecular/cellular mechanisms
615
+ 2. CANDIDATES: Identification of potential drug candidates with known mechanisms
616
+ 3. EVIDENCE: Clinical or preclinical evidence supporting repurposing
617
+ 4. SOURCES: Quality and credibility of sources (peer-reviewed > preprints > web)
618
+
619
+ You MUST respond with valid JSON only. No other text."""
620
+ ```
621
+
622
+ **Quality Judge User Prompt**:
623
+ ```python
624
+ QUALITY_JUDGE_USER = """
625
+ ## Research Question
626
+ {question}
627
+
628
+ ## Evidence Collected (Iteration {iteration} of {max_iterations})
629
+ {evidence_summary}
630
+
631
+ ## Token Budget
632
+ Used: {tokens_used} / {max_tokens}
633
+
634
+ ## Your Assessment
635
+
636
+ Evaluate the evidence and respond with this exact JSON structure:
637
+
638
+ ```json
639
+ {{
640
+ "assessment": {{
641
+ "mechanism_score": <0-10>,
642
+ "mechanism_reasoning": "<Step-by-step analysis of mechanism understanding>",
643
+ "candidates_score": <0-10>,
644
+ "candidates_found": ["<drug1>", "<drug2>", ...],
645
+ "evidence_score": <0-10>,
646
+ "evidence_reasoning": "<Critical evaluation of clinical/preclinical support>",
647
+ "sources_score": <0-10>,
648
+ "sources_breakdown": {{
649
+ "peer_reviewed": <count>,
650
+ "clinical_trials": <count>,
651
+ "preprints": <count>,
652
+ "other": <count>
653
+ }}
654
+ }},
655
+ "overall_confidence": <0.0-1.0>,
656
+ "sufficient": <true/false>,
657
+ "gaps": ["<missing info 1>", "<missing info 2>"],
658
+ "recommended_searches": ["<search query 1>", "<search query 2>"],
659
+ "recommendation": "<continue|synthesize>"
660
+ }}
661
+ ```
662
+
663
+ Decision rules:
664
+ - sufficient=true if overall_confidence >= 0.8 AND mechanism_score >= 6 AND candidates_score >= 6
665
+ - sufficient=true if remaining budget < 10% (must synthesize with what we have)
666
+ - Otherwise, provide recommended_searches to fill gaps
667
+ """
668
+ ```
669
+
670
+ **Report Synthesis Prompt**:
671
+ ```python
672
+ SYNTHESIS_PROMPT = """You are a medical research synthesizer creating a drug repurposing report.
673
+
674
+ ## Research Question
675
+ {question}
676
+
677
+ ## Collected Evidence
678
+ {all_evidence}
679
+
680
+ ## Judge Assessment
681
+ {final_assessment}
682
+
683
+ ## Your Task
684
+ Create a comprehensive research report with this structure:
685
+
686
+ 1. **Executive Summary** (2-3 sentences)
687
+ 2. **Disease Mechanism** - What we understand about the condition
688
+ 3. **Drug Candidates** - For each candidate:
689
+ - Drug name and current FDA status
690
+ - Proposed mechanism for this condition
691
+ - Evidence quality (strong/moderate/weak)
692
+ - Key citations
693
+ 4. **Methodology** - How we searched (tools used, queries, iterations)
694
+ 5. **Limitations** - What we couldn't find or verify
695
+ 6. **Confidence Score** - Overall confidence in findings
696
+
697
+ Format as Markdown. Include PubMed IDs as citations [PMID: 12345678].
698
+ Be scientifically accurate. Do not hallucinate drug names or mechanisms.
699
+ If evidence is weak, say so clearly."""
700
+ ```
701
+
702
+ **Why Structured JSON?**
703
+ - Parseable by code, not just LLM output (see the parsing sketch below)
704
+ - Consistent format for logging/debugging
705
+ - Can trigger specific actions (continue vs synthesize)
706
+ - Testable with expected outputs
707
+
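+ A minimal sketch of that parsing step, validating the judge's reply into the `JudgeAssessment` model from the appendix (the helper name and import path are assumptions):
+
+ ```python
+ import json
+ import re
+
+ from src.deepresearch.models import JudgeAssessment
+
+ def parse_judge_reply(reply: str) -> JudgeAssessment:
+     """Pull the JSON object out of the judge's reply and validate it."""
+     # Works whether or not the model wrapped the JSON in a code fence
+     match = re.search(r"\{.*\}", reply, re.DOTALL)
+     if match is None:
+         raise ValueError("Judge reply contained no JSON object")
+     payload = json.loads(match.group(0))
+     scores = payload["assessment"]
+     return JudgeAssessment(
+         mechanism_score=scores["mechanism_score"],
+         candidates_score=scores["candidates_score"],
+         evidence_score=scores["evidence_score"],
+         sources_score=scores["sources_score"],
+         overall_confidence=payload["overall_confidence"],
+         sufficient=payload["sufficient"],
+         gaps=payload["gaps"],
+         recommended_searches=payload["recommended_searches"],
+         recommendation=payload["recommendation"],
+     )
+ ```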
708
+ **Why Domain-Specific Criteria?**
709
+ - Generic "is this good?" prompts fail
710
+ - Drug repurposing has specific requirements
711
+ - Physician on team validated criteria
712
+ - Maps to real research workflow
713
+
714
+ ---
715
+
716
+ ## 12. MCP Server Integration (Hackathon Track)
717
+
718
+ ### Decision: Tools as MCP Servers for Reusability
719
+
720
+ **Why MCP?**
721
+ - Hackathon has dedicated MCP track
722
+ - Makes our tools reusable by others
723
+ - Standard protocol (Model Context Protocol)
724
+ - Future-proof (industry adoption growing)
725
+
726
+ **Architecture**:
727
+ ```
728
+ ┌─────────────────────────────────────────────────┐
729
+ │ DeepCritical Agent │
730
+ │ (uses tools directly OR via MCP) │
731
+ └─────────────────────────────────────────────────┘
732
+
733
+ ┌────────────┼────────────┐
734
+ ↓ ↓ ↓
735
+ ┌─────────────┐ ┌──────────┐ ┌───────────────┐
736
+ │ PubMed MCP │ │ Web MCP │ │ Trials MCP │
737
+ │ Server │ │ Server │ │ Server │
738
+ └─────────────┘ └──────────┘ └───────────────┘
739
+ │ │ │
740
+ ↓ ↓ ↓
741
+ PubMed API Brave/DDG ClinicalTrials.gov
742
+ ```
743
+
744
+ **PubMed MCP Server Implementation**:
745
+ ```python
746
+ # src/mcp_servers/pubmed_server.py
747
+ from fastmcp import FastMCP
748
+
749
+ mcp = FastMCP("PubMed Research Tool")
750
+
751
+ @mcp.tool()
752
+ async def search_pubmed(
753
+ query: str,
754
+ max_results: int = 10,
755
+ date_range: str = "5y"
756
+ ) -> dict:
757
+ """
758
+ Search PubMed for biomedical literature.
759
+
760
+ Args:
761
+ query: Search terms (supports PubMed syntax like [MeSH])
762
+ max_results: Maximum papers to return (default 10, max 100)
763
+ date_range: Time filter - "1y", "5y", "10y", or "all"
764
+
765
+ Returns:
766
+ dict with papers list containing title, abstract, authors, pmid, date
767
+ """
768
+ tool = PubMedSearchTool()
769
+ results = await tool.search(query, max_results)
770
+ return {
771
+ "query": query,
772
+ "count": len(results),
773
+ "papers": [r.model_dump() for r in results]
774
+ }
775
+
776
+ @mcp.tool()
777
+ async def get_paper_details(pmid: str) -> dict:
778
+ """
779
+ Get full details for a specific PubMed paper.
780
+
781
+ Args:
782
+ pmid: PubMed ID (e.g., "12345678")
783
+
784
+ Returns:
785
+ Full paper metadata including abstract, MeSH terms, references
786
+ """
787
+ tool = PubMedSearchTool()
788
+ return await tool.get_details(pmid)
789
+
790
+ if __name__ == "__main__":
791
+ mcp.run()
792
+ ```
793
+
794
+ **Running the MCP Server**:
795
+ ```bash
796
+ # Start the server
797
+ python -m src.mcp_servers.pubmed_server
798
+
799
+ # Or with uvx (recommended)
800
+ uvx fastmcp run src/mcp_servers/pubmed_server.py
801
+
802
+ # Note: fastmcp uses stdio transport by default, which is perfect
803
+ # for local integration with Claude Desktop or the main agent.
804
+ ```
805
+
806
+ **Claude Desktop Integration** (for demo):
807
+ ```json
808
+ // ~/Library/Application Support/Claude/claude_desktop_config.json
809
+ {
810
+ "mcpServers": {
811
+ "pubmed": {
812
+ "command": "python",
813
+ "args": ["-m", "src.mcp_servers.pubmed_server"],
814
+ "cwd": "/path/to/deepcritical"
815
+ }
816
+ }
817
+ }
818
+ ```
819
+
820
+ **Why FastMCP?**
821
+ - Simple decorator syntax
822
+ - Handles protocol complexity
823
+ - Good docs and examples
824
+ - Works with Claude Desktop and API
825
+
826
+ **MCP Track Submission Requirements**:
827
+ - [ ] At least one tool as MCP server
828
+ - [ ] README with setup instructions
829
+ - [ ] Demo showing MCP usage
830
+ - [ ] Bonus: Multiple tools as MCP servers
831
+
832
+ ---
833
+
834
+ ## 13. Gradio UI Pattern (Hackathon Track)
835
+
836
+ ### Decision: Streaming Progress with Modern UI
837
+
838
+ **Pattern**:
839
+ ```python
840
+ import gradio as gr
841
+ from typing import AsyncGenerator
842
+
843
+ async def research_with_streaming(question: str) -> AsyncGenerator[str, None]:
844
+ """Stream research progress to UI"""
845
+ yield "🔍 Starting research...\n\n"
846
+
847
+ agent = ResearchAgent()
848
+
849
+ async for event in agent.research_stream(question):
850
+ match event.type:
851
+ case "search_start":
852
+ yield f"📚 Searching {event.tool}...\n"
853
+ case "search_complete":
854
+ yield f"✅ Found {event.count} results from {event.tool}\n"
855
+ case "judge_thinking":
856
+ yield f"🤔 Evaluating evidence quality...\n"
857
+ case "judge_decision":
858
+ yield f"📊 Confidence: {event.confidence:.0%}\n"
859
+ case "iteration_complete":
860
+ yield f"🔄 Iteration {event.iteration} complete\n\n"
861
+ case "synthesis_start":
862
+ yield f"📝 Generating report...\n"
863
+ case "complete":
864
+ yield f"\n---\n\n{event.report}"
865
+
866
+ # Gradio 5 UI
867
+ with gr.Blocks(theme=gr.themes.Soft()) as demo:
868
+ gr.Markdown("# 🔬 DeepCritical: Drug Repurposing Research Agent")
869
+ gr.Markdown("Ask a question about potential drug repurposing opportunities.")
870
+
871
+ with gr.Row():
872
+ with gr.Column(scale=2):
873
+ question = gr.Textbox(
874
+ label="Research Question",
875
+ placeholder="What existing drugs might help treat long COVID fatigue?",
876
+ lines=2
877
+ )
878
+ examples = gr.Examples(
879
+ examples=[
880
+ "What existing drugs might help treat long COVID fatigue?",
881
+ "Find existing drugs that might slow Alzheimer's progression",
882
+ "Which diabetes drugs show promise for cancer treatment?"
883
+ ],
884
+ inputs=question
885
+ )
886
+ submit = gr.Button("🚀 Start Research", variant="primary")
887
+
888
+ with gr.Column(scale=3):
889
+ output = gr.Markdown(label="Research Progress & Report")
890
+
891
+ submit.click(
892
+ fn=research_with_streaming,
893
+ inputs=question,
894
+ outputs=output,
895
+ )
896
+
897
+ demo.launch()
898
+ ```
899
+
900
+ **Why Streaming?**
901
+ - User sees progress, not loading spinner
902
+ - Builds trust (system is working)
903
+ - Better UX for long operations
904
+ - Gradio 5 native support
905
+
906
+ **Why gr.Markdown Output?**
907
+ - Research reports are markdown
908
+ - Renders citations nicely
909
+ - Code blocks for methodology
910
+ - Tables for drug comparisons
911
+
912
+ ---
913
+
914
+ ## Summary: Design Decision Table
915
+
916
+ | # | Question | Decision | Why |
917
+ |---|----------|----------|-----|
918
+ | 1 | **Architecture** | Orchestrator with search-judge loop | Clear, testable, proven |
919
+ | 2 | **Tools** | Static registry, dynamic selection | Balance flexibility vs simplicity |
920
+ | 3 | **Judge** | Dual (quality + budget) | Quality + cost control |
921
+ | 4 | **Stopping** | Four-tier conditions | Defense in depth |
922
+ | 5 | **State** | Pydantic + checkpoints | Type-safe, resumable |
923
+ | 6 | **Tool Interface** | Async Protocol + parallel execution | Fast I/O, modern Python |
924
+ | 7 | **Output** | Structured + Markdown | Human & machine readable |
925
+ | 8 | **Errors** | Graceful degradation + fallbacks | Robust for demo |
926
+ | 9 | **Config** | TOML (Hydra-inspired) | Simple, standard |
927
+ | 10 | **Testing** | Three levels | Fast feedback + confidence |
928
+ | 11 | **Judge Prompts** | Structured JSON + domain criteria | Parseable, medical-specific |
929
+ | 12 | **MCP** | Tools as MCP servers | Hackathon track, reusability |
930
+ | 13 | **UI** | Gradio 5 streaming | Progress visibility, modern UX |
931
+
932
+ ---
933
+
934
+ ## Answers to Specific Questions
935
+
936
+ ### "What's the orchestrator pattern?"
937
+ **Answer**: See Section 1 - Iterative Research Orchestrator with search-judge loop
938
+
939
+ ### "LLM-as-judge or token budget?"
940
+ **Answer**: Both - See Section 3 (Dual-Judge System) and Section 4 (Three-Tier Break Conditions)
941
+
942
+ ### "What's the break pattern?"
943
+ **Answer**: See Section 4 - Four stopping conditions: quality threshold, token budget, max iterations, timeout
944
+
945
+ ### "Should we use agent factories?"
946
+ **Answer**: No - See Section 2. Static tool registry is simpler for 6-day timeline
947
+
948
+ ### "How do we handle state?"
949
+ **Answer**: See Section 5 - Pydantic state machine with checkpoints
950
+
951
+ ---
952
+
953
+ ## Appendix: Complete Data Models
954
+
955
+ ```python
956
+ # src/deepresearch/models.py
957
+ from pydantic import BaseModel, Field
958
+ from typing import List, Optional, Literal
959
+ from datetime import datetime
960
+
961
+ class Citation(BaseModel):
962
+ """Reference to a source"""
963
+ source_type: Literal["pubmed", "web", "trial", "fda"]
964
+ identifier: str # PMID, URL, NCT number, etc.
965
+ title: str
966
+ authors: Optional[List[str]] = None
967
+ date: Optional[str] = None
968
+ url: Optional[str] = None
969
+
970
+ class Evidence(BaseModel):
971
+ """Single piece of evidence from search"""
972
+ content: str
973
+ source: Citation
974
+ relevance_score: float = Field(ge=0, le=1)
975
+ evidence_type: Literal["mechanism", "candidate", "clinical", "safety"]
976
+
977
+ class DrugCandidate(BaseModel):
978
+ """Potential drug for repurposing"""
979
+ name: str
980
+ generic_name: Optional[str] = None
981
+ mechanism: str
982
+ current_indications: List[str]
983
+ proposed_mechanism: str
984
+ evidence_quality: Literal["strong", "moderate", "weak"]
985
+ fda_status: str
986
+ citations: List[Citation]
987
+
988
+ class JudgeAssessment(BaseModel):
989
+ """Output from quality judge"""
990
+ mechanism_score: int = Field(ge=0, le=10)
991
+ candidates_score: int = Field(ge=0, le=10)
992
+ evidence_score: int = Field(ge=0, le=10)
993
+ sources_score: int = Field(ge=0, le=10)
994
+ overall_confidence: float = Field(ge=0, le=1)
995
+ sufficient: bool
996
+ gaps: List[str]
997
+ recommended_searches: List[str]
998
+ recommendation: Literal["continue", "synthesize"]
999
+
1000
+ class ResearchState(BaseModel):
1001
+ """Complete state of a research session"""
1002
+ query_id: str
1003
+ question: str
1004
+ iteration: int = 0
1005
+ evidence: List[Evidence] = []
1006
+ assessments: List[JudgeAssessment] = []
1007
+ tokens_used: int = 0
1008
+ search_history: List[str] = []
1009
+ stop_reason: Optional[str] = None
1010
+ created_at: datetime = Field(default_factory=datetime.utcnow)
1011
+ updated_at: datetime = Field(default_factory=datetime.utcnow)
1012
+
1013
+ class ResearchReport(BaseModel):
1014
+ """Final output report"""
1015
+ query: str
1016
+ executive_summary: str
1017
+ disease_mechanism: str
1018
+ candidates: List[DrugCandidate]
1019
+ methodology: str
1020
+ limitations: str
1021
+ confidence: float
1022
+ sources_used: int
1023
+ tokens_used: int
1024
+ iterations: int
1025
+ generated_at: datetime = Field(default_factory=datetime.utcnow)
1026
+
1027
+ def to_markdown(self) -> str:
1028
+ """Render as markdown for Gradio"""
1029
+ md = f"# Research Report: {self.query}\n\n"
1030
+ md += f"## Executive Summary\n{self.executive_summary}\n\n"
1031
+ md += f"## Disease Mechanism\n{self.disease_mechanism}\n\n"
1032
+ md += "## Drug Candidates\n\n"
1033
+ for i, drug in enumerate(self.candidates, 1):
1034
+ md += f"### {i}. {drug.name} - {drug.evidence_quality.upper()} EVIDENCE\n"
1035
+ md += f"- **Mechanism**: {drug.proposed_mechanism}\n"
1036
+ md += f"- **FDA Status**: {drug.fda_status}\n"
1037
+ md += f"- **Current Uses**: {', '.join(drug.current_indications)}\n"
1038
+ md += f"- **Citations**: {len(drug.citations)} sources\n\n"
1039
+ md += f"## Methodology\n{self.methodology}\n\n"
1040
+ md += f"## Limitations\n{self.limitations}\n\n"
1041
+ md += f"## Confidence: {self.confidence:.0%}\n"
1042
+ return md
1043
+ ```
1044
+
1045
+ ---
1046
+
1047
+ ## 14. Alternative Frameworks Considered
1048
+
1049
+ We researched major agent frameworks before settling on our stack. Here's why we chose what we chose, and what we'd steal if we're shipping like animals and have time for Gucci upgrades.
1050
+
1051
+ ### Frameworks Evaluated
1052
+
1053
+ | Framework | Repo | What It Does |
1054
+ |-----------|------|--------------|
1055
+ | **Microsoft AutoGen** | [github.com/microsoft/autogen](https://github.com/microsoft/autogen) | Multi-agent orchestration, complex workflows |
1056
+ | **Claude Agent SDK** | [github.com/anthropics/claude-agent-sdk-python](https://github.com/anthropics/claude-agent-sdk-python) | Anthropic's official agent framework |
1057
+ | **Pydantic AI** | [github.com/pydantic/pydantic-ai](https://github.com/pydantic/pydantic-ai) | Type-safe agents, structured outputs |
1058
+
1059
+ ### Why NOT AutoGen (Microsoft)?
1060
+
1061
+ **Pros:**
1062
+ - Battle-tested multi-agent orchestration
1063
+ - `reflect_on_tool_use` - model reviews its own tool results
1064
+ - `max_tool_iterations` - built-in iteration limits
1065
+ - Concurrent tool execution
1066
+ - Rich ecosystem (AutoGen Studio, benchmarks)
1067
+
1068
+ **Cons for MVP:**
1069
+ - Heavy dependency tree (50+ packages)
1070
+ - Complex configuration (YAML + Python)
1071
+ - Overkill for single-agent search-judge loop
1072
+ - Learning curve eats into 6-day timeline
1073
+
1074
+ **Verdict:** Great for multi-agent systems. Overkill for our MVP.
1075
+
1076
+ ### Why NOT Claude Agent SDK (Anthropic)?
1077
+
1078
+ **Pros:**
1079
+ - Official Anthropic framework
1080
+ - Clean `@tool` decorator pattern
1081
+ - In-process MCP servers (no subprocess)
1082
+ - Hooks for pre/post tool execution
1083
+ - Direct Claude Code integration
1084
+
1085
+ **Cons for MVP:**
1086
+ - Requires Claude Code CLI bundled
1087
+ - Node.js dependency for some features
1088
+ - Designed for Claude Code ecosystem, not standalone agents
1089
+ - Less flexible for custom LLM providers
1090
+
1091
+ **Verdict:** Would be great if we were building ON Claude Code. We're building a standalone agent.
1092
+
1093
+ ### Why Pydantic AI + FastMCP (Our Choice)
1094
+
1095
+ **Pros:**
1096
+ - ✅ Simple, Pythonic API
1097
+ - ✅ Native async/await
1098
+ - ✅ Type-safe with Pydantic
1099
+ - ✅ Works with any LLM provider
1100
+ - ✅ FastMCP for clean MCP servers
1101
+ - ✅ Minimal dependencies
1102
+ - ✅ Can ship MVP in 6 days
1103
+
1104
+ **Cons:**
1105
+ - Newer framework (less battle-tested)
1106
+ - Smaller ecosystem
1107
+ - May need to build more from scratch
1108
+
1109
+ **Verdict:** Right tool for the job. Ship fast, iterate later.
1110
+
1111
+ ---
1112
+
1113
+ ## 15. Stretch Goals: Gucci Bangers (If We're Shipping Like Animals)
1114
+
1115
+ If MVP ships early and we're crushing it, here's what we'd steal from other frameworks:
1116
+
1117
+ ### Tier 1: Quick Wins (2-4 hours each)
1118
+
1119
+ #### From Claude Agent SDK: `@tool` Decorator Pattern
1120
+ Replace our Protocol-based tools with cleaner decorators:
1121
+
1122
+ ```python
1123
+ # CURRENT (Protocol-based)
1124
+ class PubMedSearchTool:
1125
+ async def search(self, query: str, max_results: int = 10) -> List[Evidence]:
1126
+ ...
1127
+
1128
+ # UPGRADE (Decorator-based, stolen from Claude SDK)
1129
+ from claude_agent_sdk import tool
1130
+
1131
+ @tool("search_pubmed", "Search PubMed for biomedical papers", {
1132
+ "query": str,
1133
+ "max_results": int
1134
+ })
1135
+ async def search_pubmed(args):
1136
+ results = await _do_pubmed_search(args["query"], args["max_results"])
1137
+ return {"content": [{"type": "text", "text": json.dumps(results)}]}
1138
+ ```
1139
+
1140
+ **Why it's Gucci:** Cleaner syntax, automatic schema generation, less boilerplate.
1141
+
1142
+ #### From AutoGen: Reflect on Tool Use
1143
+ Add a reflection step where the model reviews its own tool results:
1144
+
1145
+ ```python
1146
+ # CURRENT: Judge evaluates evidence
1147
+ assessment = await judge.assess(question, evidence)
1148
+
1149
+ # UPGRADE: Add reflection step (stolen from AutoGen)
1150
+ class ReflectiveJudge:
1151
+ async def assess_with_reflection(self, question, evidence, tool_results):
1152
+ # First pass: raw assessment
1153
+ initial = await self._assess(question, evidence)
1154
+
1155
+ # Reflection: "Did I use the tools correctly?"
1156
+ reflection = await self._reflect_on_tool_use(tool_results)
1157
+
1158
+ # Final: combine assessment + reflection
1159
+ return self._combine(initial, reflection)
1160
+ ```
1161
+
1162
+ **Why it's Gucci:** Catches tool misuse, improves accuracy, more robust judge.
1163
+
1164
+ ### Tier 2: Medium Lifts (4-8 hours each)
1165
+
1166
+ #### From AutoGen: Concurrent Tool Execution
1167
+ Run multiple tools in parallel with proper error handling:
1168
+
1169
+ ```python
1170
+ # CURRENT: Plain asyncio.gather (already concurrent, but no per-tool timeout or cancellation)
1171
+ results = await asyncio.gather(*[tool.search(query) for tool in tools])
1172
+
1173
+ # UPGRADE: AutoGen-style with cancellation + timeout
1174
+ import asyncio
+
+ from autogen_core import CancellationToken
1175
+
1176
+ async def execute_tools_concurrent(tools, query, timeout=30):
1177
+ token = CancellationToken()
1178
+
1179
+ async def run_with_timeout(tool):
1180
+ try:
1181
+ return await asyncio.wait_for(
1182
+ tool.search(query, cancellation_token=token),
1183
+ timeout=timeout
1184
+ )
1185
+ except asyncio.TimeoutError:
1186
+ token.cancel() # Cancel other tools
1187
+ return ToolError(f"{tool.name} timed out")
1188
+
1189
+ return await asyncio.gather(*[run_with_timeout(t) for t in tools])
1190
+ ```
1191
+
1192
+ **Why it's Gucci:** Proper timeout handling, cancellation propagation, production-ready.
1193
+
1194
+ #### From Claude SDK: Hooks System
1195
+ Add pre/post hooks for logging, validation, cost tracking:
1196
+
1197
+ ```python
1198
+ # UPGRADE: Hook system (stolen from Claude SDK)
1199
+ class HookManager:
1200
+ async def pre_tool_use(self, tool_name, args):
1201
+ """Called before every tool execution"""
1202
+ logger.info(f"Calling {tool_name} with {args}")
1203
+ self.cost_tracker.start_timer()
1204
+
1205
+ async def post_tool_use(self, tool_name, result, duration):
1206
+ """Called after every tool execution"""
1207
+ self.cost_tracker.record(tool_name, duration)
1208
+ if result.is_error:
1209
+ self.error_tracker.record(tool_name, result.error)
1210
+ ```
1211
+
1212
+ **Why it's Gucci:** Observability, debugging, cost tracking, production-ready.
1213
+
1214
+ ### Tier 3: Big Lifts (Post-Hackathon)
1215
+
1216
+ #### Full AutoGen Integration
1217
+ If we want multi-agent capabilities later:
1218
+
1219
+ ```python
1220
+ # POST-HACKATHON: Multi-agent drug repurposing
1221
+ from autogen_agentchat import AssistantAgent, GroupChat
1222
+
1223
+ literature_agent = AssistantAgent(
1224
+ name="LiteratureReviewer",
1225
+ tools=[pubmed_search, web_search],
1226
+ system_message="You search and summarize medical literature."
1227
+ )
1228
+
1229
+ mechanism_agent = AssistantAgent(
1230
+ name="MechanismAnalyzer",
1231
+ tools=[pathway_db, protein_db],
1232
+ system_message="You analyze disease mechanisms and drug targets."
1233
+ )
1234
+
1235
+ synthesis_agent = AssistantAgent(
1236
+ name="ReportSynthesizer",
1237
+ system_message="You synthesize findings into actionable reports."
1238
+ )
1239
+
1240
+ # Orchestrate multi-agent workflow
1241
+ group_chat = GroupChat(
1242
+ agents=[literature_agent, mechanism_agent, synthesis_agent],
1243
+ max_round=10
1244
+ )
1245
+ ```
1246
+
1247
+ **Why it's Gucci:** True multi-agent collaboration, specialized roles, scalable.
1248
+
1249
+ ---
1250
+
1251
+ ## Priority Order for Stretch Goals
1252
+
1253
+ | Priority | Feature | Source | Effort | Impact |
1254
+ |----------|---------|--------|--------|--------|
1255
+ | 1 | `@tool` decorator | Claude SDK | 2 hrs | High - cleaner code |
1256
+ | 2 | Reflect on tool use | AutoGen | 3 hrs | High - better accuracy |
1257
+ | 3 | Hooks system | Claude SDK | 4 hrs | Medium - observability |
1258
+ | 4 | Concurrent + cancellation | AutoGen | 4 hrs | Medium - robustness |
1259
+ | 5 | Multi-agent | AutoGen | 8+ hrs | Post-hackathon |
1260
+
1261
+ ---
1262
+
1263
+ ## The Bottom Line
1264
+
1265
+ ```
1266
+ ┌─────────────────────────────────────────────────────────────┐
1267
+ │ MVP (Days 1-4): Pydantic AI + FastMCP │
1268
+ │ - Ship working drug repurposing agent │
1269
+ │ - Search-judge loop with PubMed + Web │
1270
+ │ - Gradio UI with streaming │
1271
+ │ - MCP server for hackathon track │
1272
+ ├─────────────────────────────────────────────────────────────┤
1273
+ │ If Crushing It (Days 5-6): Steal the Gucci │
1274
+ │ - @tool decorators from Claude SDK │
1275
+ │ - Reflect on tool use from AutoGen │
1276
+ │ - Hooks for observability │
1277
+ ├─────────────────────────────────────────────────────────────┤
1278
+ │ Post-Hackathon: Full AutoGen Integration │
1279
+ │ - Multi-agent workflows │
1280
+ │ - Specialized agent roles │
1281
+ │ - Production-grade orchestration │
1282
+ └─────────────────────────────────────────────────────────────┘
1283
+ ```
1284
+
1285
+ **Ship MVP first. Steal bangers if time. Scale later.**
1286
+
1287
+ ---
1288
+
1289
+ ## 16. Reference Implementation Resources
1290
+
1291
+ We've cloned production-ready repos into `reference_repos/` that we can vendor, copy from, or just USE directly. This section documents what's available and how to leverage it.
1292
+
1293
+ ### Cloned Repositories
1294
+
1295
+ | Repository | Location | What It Provides |
1296
+ |------------|----------|------------------|
1297
+ | **pydanticai-research-agent** | `reference_repos/pydanticai-research-agent/` | Complete PydanticAI agent with Brave Search |
1298
+ | **pubmed-mcp-server** | `reference_repos/pubmed-mcp-server/` | Production-grade PubMed MCP server (TypeScript) |
1299
+ | **autogen-microsoft** | `reference_repos/autogen-microsoft/` | Microsoft's multi-agent framework |
1300
+ | **claude-agent-sdk** | `reference_repos/claude-agent-sdk/` | Anthropic's agent SDK with @tool decorator |
1301
+
1302
+ ### 🔥 CHEAT CODE: Production PubMed MCP Already Exists
1303
+
1304
+ The `pubmed-mcp-server` is **production-grade** and has EVERYTHING we need:
1305
+
1306
+ ```bash
1307
+ # Already available tools in pubmed-mcp-server:
1308
+ pubmed_search_articles # Search PubMed with filters, date ranges
1309
+ pubmed_fetch_contents # Get full article details by PMID
1310
+ pubmed_article_connections # Find citations, related articles
1311
+ pubmed_research_agent # Generate research plan outlines
1312
+ pubmed_generate_chart # Create PNG charts from data
1313
+ ```
1314
+
1315
+ **Option 1: Use it directly via npx**
1316
+ ```json
1317
+ {
1318
+ "mcpServers": {
1319
+ "pubmed": {
1320
+ "command": "npx",
1321
+ "args": ["@cyanheads/pubmed-mcp-server"],
1322
+ "env": { "NCBI_API_KEY": "your_key" }
1323
+ }
1324
+ }
1325
+ }
1326
+ ```
1327
+
1328
+ **Option 2: Vendor the logic into Python**
1329
+ The TypeScript code in `reference_repos/pubmed-mcp-server/src/` shows exactly how to:
1330
+ - Construct PubMed E-utilities queries
1331
+ - Handle rate limiting (3/sec without key, 10/sec with key)
1332
+ - Parse XML responses
1333
+ - Extract article metadata
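+
+ A rough Python equivalent of that flow (a sketch of the E-utilities `esearch` call, not the vendored TypeScript):
+
+ ```python
+ import httpx
+
+ async def esearch_pmids(query: str, retmax: int = 20, api_key: str | None = None) -> list[str]:
+     """Minimal E-utilities esearch: return PMIDs matching a query."""
+     params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
+     if api_key:
+         params["api_key"] = api_key  # bumps the limit from 3/sec to 10/sec
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get(
+             "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi", params=params
+         )
+     response.raise_for_status()
+     return response.json().get("esearchresult", {}).get("idlist", [])
+ ```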
1334
+
1335
+ ### PydanticAI Research Agent Patterns
1336
+
1337
+ The `pydanticai-research-agent` repo provides copy-paste patterns:
1338
+
1339
+ **Agent Definition** (`agents/research_agent.py`):
1340
+ ```python
1341
+ from pydantic_ai import Agent, RunContext
1342
+ from dataclasses import dataclass
+ from typing import Any, Dict, List, Optional
1343
+
1344
+ @dataclass
1345
+ class ResearchAgentDependencies:
1346
+ brave_api_key: str
1347
+ session_id: Optional[str] = None
1348
+
1349
+ research_agent = Agent(
1350
+ get_llm_model(),
1351
+ deps_type=ResearchAgentDependencies,
1352
+ system_prompt=SYSTEM_PROMPT
1353
+ )
1354
+
1355
+ @research_agent.tool
1356
+ async def search_web(
1357
+ ctx: RunContext[ResearchAgentDependencies],
1358
+ query: str,
1359
+ max_results: int = 10
1360
+ ) -> List[Dict[str, Any]]:
1361
+ """Search with context access via ctx.deps"""
1362
+ results = await search_web_tool(ctx.deps.brave_api_key, query, max_results)
1363
+ return results
1364
+ ```
1365
+
1366
+ **Brave Search Tool** (`tools/brave_search.py`):
1367
+ ```python
1368
+ from typing import Dict, List
+
+ import httpx
+
+ async def search_web_tool(api_key: str, query: str, count: int = 10) -> List[Dict]:
1369
+ headers = {"X-Subscription-Token": api_key, "Accept": "application/json"}
1370
+ async with httpx.AsyncClient() as client:
1371
+ response = await client.get(
1372
+ "https://api.search.brave.com/res/v1/web/search",
1373
+ headers=headers,
1374
+ params={"q": query, "count": count},
1375
+ timeout=30.0
1376
+ )
1377
+ # Handle 429 rate limit, 401 auth errors
1378
+ data = response.json()
1379
+ return data.get("web", {}).get("results", [])
1380
+ ```
1381
+
1382
+ **Pydantic Models** (`models/research_models.py`):
1383
+ ```python
1384
+ class BraveSearchResult(BaseModel):
1385
+ title: str
1386
+ url: str
1387
+ description: str
1388
+ score: float = Field(ge=0.0, le=1.0)
1389
+ ```
1390
+
1391
+ ### Microsoft Agent Framework Orchestration Patterns
1392
+
1393
+ From [deepwiki.com/microsoft/agent-framework](https://deepwiki.com/microsoft/agent-framework/3.4-workflows-and-orchestration):
1394
+
1395
+ #### Sequential Orchestration
1396
+ ```
1397
+ Agent A → Agent B → Agent C (each receives prior outputs)
1398
+ ```
1399
+ **Use when:** Tasks have dependencies, results inform next steps.
1400
+
1401
+ #### Concurrent (Fan-out/Fan-in)
1402
+ ```
1403
+ ┌→ Agent A ─┐
1404
+ Dispatcher ├→ Agent B ─┼→ Aggregator
1405
+ └→ Agent C ─┘
1406
+ ```
1407
+ **Use when:** Independent tasks can run in parallel, results need consolidation.
1408
+ **Our use:** Parallel PubMed + Web search.
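+
+ A minimal sketch of that fan-out/fan-in for our case (the tool objects and their async `search()` method are assumptions, not actual project code):
+
+ ```python
+ import asyncio
+
+ async def fan_out_search(query: str, tools: list, timeout: float = 30.0) -> list:
+     """Send one query to every tool concurrently; drop branches that fail or time out."""
+     async def run(tool):
+         return await asyncio.wait_for(tool.search(query), timeout=timeout)
+
+     results = await asyncio.gather(*(run(t) for t in tools), return_exceptions=True)
+     evidence = []
+     for result in results:
+         if not isinstance(result, Exception):
+             evidence.extend(result)
+     return evidence
+ ```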
1409
+
1410
+ #### Handoff Orchestration
1411
+ ```
1412
+ Coordinator → routes to → Specialist A, B, or C based on request
1413
+ ```
1414
+ **Use when:** Router decides which search strategy based on query type.
1415
+ **Our use:** Route "mechanism" vs "clinical trial" vs "drug info" queries.
1416
+
1417
+ #### HITL (Human-in-the-Loop)
1418
+ ```
1419
+ Agent → RequestInfoEvent → Human validates → Agent continues
1420
+ ```
1421
+ **Use when:** Critical judgment points need human validation.
1422
+ **Our use:** Optional "approve drug candidates before synthesis" step.
1423
+
1424
+ ### Recommended Hybrid Pattern for Our Agent
1425
+
1426
+ Based on all the research, here's our recommended implementation:
1427
+
1428
+ ```
1429
+ ┌─────────────────────────────────────────────────────────┐
1430
+ │ 1. ROUTER (Handoff Pattern) │
1431
+ │ - Analyze query type │
1432
+ │ - Choose search strategy │
1433
+ ├─────────────────────────────────────────────────────────┤
1434
+ │ 2. SEARCH (Concurrent Pattern) │
1435
+ │ - Fan-out to PubMed + Web in parallel │
1436
+ │ - Timeout handling per AutoGen patterns │
1437
+ │ - Aggregate results │
1438
+ ├─────────────────────────────────────────────────────────┤
1439
+ │ 3. JUDGE (Sequential + Budget) │
1440
+ │ - Quality assessment │
1441
+ │ - Token/iteration budget check │
1442
+ │ - Recommend: continue or synthesize │
1443
+ ├─────────────────────────────────────────────────────────┤
1444
+ │ 4. SYNTHESIZE (Final Agent) │
1445
+ │ - Generate research report │
1446
+ │ - Include citations │
1447
+ │ - Stream to Gradio UI │
1448
+ └─────────────────────────────────────────────────────────┘
1449
+ ```
1450
+
1451
+ ### Quick Start: Minimal Implementation Path
1452
+
1453
+ **Day 1-2: Core Loop**
1454
+ 1. Copy `search_web_tool` from `pydanticai-research-agent/tools/brave_search.py`
1455
+ 2. Implement PubMed search (reference `pubmed-mcp-server/src/` for E-utilities patterns)
1456
+ 3. Wire up basic search-judge loop
1457
+
1458
+ **Day 3: Judge + State**
1459
+ 1. Implement quality judge with JSON structured output
1460
+ 2. Add budget judge
1461
+ 3. Add Pydantic state management
1462
+
1463
+ **Day 4: UI + MCP**
1464
+ 1. Gradio streaming UI
1465
+ 2. Wrap PubMed tool as FastMCP server
1466
+
1467
+ **Day 5-6: Polish + Deploy**
1468
+ 1. HuggingFace Spaces deployment
1469
+ 2. Demo video
1470
+ 3. Stretch goals if time
1471
+
1472
+ ---
1473
+
1474
+ ## 17. External Resources & MCP Servers
1475
+
1476
+ ### Available PubMed MCP Servers (Community)
1477
+
1478
+ | Server | Author | Features | Link |
1479
+ |--------|--------|----------|------|
1480
+ | **pubmed-mcp-server** | cyanheads | Full E-utilities, research agent, charts | [GitHub](https://github.com/cyanheads/pubmed-mcp-server) |
1481
+ | **BioMCP** | GenomOncology | PubMed + ClinicalTrials + MyVariant | [GitHub](https://github.com/genomoncology/biomcp) |
1482
+ | **PubMed-MCP-Server** | JackKuo666 | Basic search, metadata access | [GitHub](https://github.com/JackKuo666/PubMed-MCP-Server) |
1483
+
1484
+ ### Web Search Options
1485
+
1486
+ | Tool | Free Tier | API Key | Async Support |
1487
+ |------|-----------|---------|---------------|
1488
+ | **Brave Search** | 2000/month | Required | Yes (httpx) |
1489
+ | **DuckDuckGo** | Unlimited | No | Yes (duckduckgo-search) |
1490
+ | **SerpAPI** | None | Required | Yes |
1491
+
1492
+ **Recommended:** Start with DuckDuckGo (free, no key), upgrade to Brave for production.
1493
+
1494
+ ```python
1495
+ # DuckDuckGo async search (no API key needed!)
1496
+ import asyncio
+ from typing import Dict, List
+
+ from duckduckgo_search import DDGS
+
+ async def search_ddg(query: str, max_results: int = 10) -> List[Dict]:
+     def _blocking() -> List[Dict]:
+         # DDGS is a synchronous client, so do the actual work off the event loop
+         with DDGS() as ddgs:
+             return list(ddgs.text(query, max_results=max_results))
+     results = await asyncio.to_thread(_blocking)
+     return [{"title": r["title"], "url": r["href"], "description": r["body"]} for r in results]
1502
+ ```
1503
+
1504
+ ---
1505
+
1506
+ **Document Status**: Official Architecture Spec
1507
+ **Review Score**: 100/100 (Ironclad Gucci Banger Edition)
1508
+ **Sections**: 17 design patterns + data models appendix + reference repos + stretch goals
1509
+ **Last Updated**: November 2025
docs/architecture/graph_orchestration.md ADDED
@@ -0,0 +1,151 @@
1
+ # Graph Orchestration Architecture
2
+
3
+ ## Overview
4
+
5
+ Phase 4 implements a graph-based orchestration system for research workflows using Pydantic AI agents as nodes. This enables better parallel execution, conditional routing, and state management compared to simple agent chains.
6
+
7
+ ## Graph Structure
8
+
9
+ ### Nodes
10
+
11
+ Graph nodes represent different stages in the research workflow:
12
+
13
+ 1. **Agent Nodes**: Execute Pydantic AI agents
14
+ - Input: Prompt/query
15
+ - Output: Structured or unstructured response
16
+ - Examples: `KnowledgeGapAgent`, `ToolSelectorAgent`, `ThinkingAgent`
17
+
18
+ 2. **State Nodes**: Update or read workflow state
19
+ - Input: Current state
20
+ - Output: Updated state
21
+ - Examples: Update evidence, update conversation history
22
+
23
+ 3. **Decision Nodes**: Make routing decisions based on conditions
24
+ - Input: Current state/results
25
+ - Output: Next node ID
26
+ - Examples: Continue research vs. complete research
27
+
28
+ 4. **Parallel Nodes**: Execute multiple nodes concurrently
29
+ - Input: List of node IDs
30
+ - Output: Aggregated results
31
+ - Examples: Parallel iterative research loops
32
+
33
+ ### Edges
34
+
35
+ Edges define transitions between nodes:
36
+
37
+ 1. **Sequential Edges**: Always traversed (no condition)
38
+ - From: Source node
39
+ - To: Target node
40
+ - Condition: None (always True)
41
+
42
+ 2. **Conditional Edges**: Traversed based on condition
43
+ - From: Source node
44
+ - To: Target node
45
+ - Condition: Callable that returns bool
46
+ - Example: If research complete → go to writer, else → continue loop
47
+
48
+ 3. **Parallel Edges**: Used for parallel execution branches
49
+ - From: Parallel node
50
+ - To: Multiple target nodes
51
+ - Execution: All targets run concurrently
52
+
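+ A minimal sketch of what these node and edge shapes could look like as plain dataclasses (names like `GraphNode`/`GraphEdge` are illustrative, not the actual implementation):
+
+ ```python
+ from dataclasses import dataclass
+ from typing import Any, Awaitable, Callable, Optional
+
+ @dataclass
+ class GraphNode:
+     node_id: str
+     kind: str  # "agent" | "state" | "decision" | "parallel"
+     run: Callable[[dict[str, Any]], Awaitable[Any]]
+
+ @dataclass
+ class GraphEdge:
+     source: str
+     target: str
+     condition: Optional[Callable[[dict[str, Any]], bool]] = None  # None = sequential edge
+
+     def should_traverse(self, state: dict[str, Any]) -> bool:
+         return True if self.condition is None else self.condition(state)
+
+ # Conditional edges out of the knowledge-gap decision node
+ to_writer = GraphEdge("knowledge_gap", "writer", condition=lambda s: s.get("research_complete", False))
+ to_tools = GraphEdge("knowledge_gap", "tool_selector", condition=lambda s: not s.get("research_complete", False))
+ ```
+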
53
+ ## Graph Patterns
54
+
55
+ ### Iterative Research Graph
56
+
57
+ ```
58
+ [Input] → [Thinking] → [Knowledge Gap] → [Decision: Complete?]
59
+ ↓ No ↓ Yes
60
+ [Tool Selector] [Writer]
61
+
62
+ [Execute Tools] → [Loop Back]
63
+ ```
64
+
65
+ ### Deep Research Graph
66
+
67
+ ```
68
+ [Input] → [Planner] → [Parallel Iterative Loops] → [Synthesizer]
69
+ ↓ ↓ ↓
70
+ [Loop1] [Loop2] [Loop3]
71
+ ```
72
+
73
+ ## State Management
74
+
75
+ State is managed via `WorkflowState` using `ContextVar` for thread-safe isolation:
76
+
77
+ - **Evidence**: Collected evidence from searches
78
+ - **Conversation**: Iteration history (gaps, tool calls, findings, thoughts)
79
+ - **Embedding Service**: For semantic search
80
+
81
+ State transitions occur at state nodes, which update the global workflow state.
82
+
83
+ ## Execution Flow
84
+
85
+ 1. **Graph Construction**: Build graph from nodes and edges
86
+ 2. **Graph Validation**: Ensure graph is valid (no cycles, all nodes reachable)
87
+ 3. **Graph Execution**: Traverse graph from entry node
88
+ 4. **Node Execution**: Execute each node based on type
89
+ 5. **Edge Evaluation**: Determine next node(s) based on edges
90
+ 6. **Parallel Execution**: Use `asyncio.gather()` for parallel nodes
91
+ 7. **State Updates**: Update state at state nodes
92
+ 8. **Event Streaming**: Yield events during execution for UI
93
+
94
+ ## Conditional Routing
95
+
96
+ Decision nodes evaluate conditions and return next node IDs:
97
+
98
+ - **Knowledge Gap Decision**: If `research_complete` → writer, else → tool selector
99
+ - **Budget Decision**: If budget exceeded → exit, else → continue
100
+ - **Iteration Decision**: If max iterations → exit, else → continue
101
+
102
+ ## Parallel Execution
103
+
104
+ Parallel nodes execute multiple nodes concurrently:
105
+
106
+ - Each parallel branch runs independently
107
+ - Results are aggregated after all branches complete
108
+ - State is synchronized after parallel execution
109
+ - Errors in one branch don't stop other branches
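+
+ One way to get that behaviour with `asyncio.gather` (the branch callables and state-dict shape are assumptions):
+
+ ```python
+ import asyncio
+
+ async def run_parallel_branches(branches: list, state: dict) -> list:
+     """Run branch coroutines concurrently; a failure in one branch doesn't stop the others."""
+     results = await asyncio.gather(*(branch(state) for branch in branches), return_exceptions=True)
+     successes = [r for r in results if not isinstance(r, Exception)]
+     # Aggregate and synchronize state only after every branch has finished
+     state["evidence"] = state.get("evidence", []) + [item for r in successes for item in r]
+     return successes
+ ```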
110
+
111
+ ## Budget Enforcement
112
+
113
+ Budget constraints are enforced at decision nodes:
114
+
115
+ - **Token Budget**: Track LLM token usage
116
+ - **Time Budget**: Track elapsed time
117
+ - **Iteration Budget**: Track iteration count
118
+
119
+ If any budget is exceeded, execution routes to exit node.
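+
+ A small illustration of that check at a decision node (field names and limits are assumptions, not the real `WorkflowState`):
+
+ ```python
+ import time
+ from dataclasses import dataclass
+
+ @dataclass
+ class Budget:
+     max_tokens: int = 50_000
+     max_seconds: float = 300.0
+     max_iterations: int = 10
+
+ def next_node(state: dict, budget: Budget, started_at: float) -> str:
+     """Route to the exit node as soon as any budget is exhausted."""
+     if state.get("tokens_used", 0) >= budget.max_tokens:
+         return "exit"
+     if time.monotonic() - started_at >= budget.max_seconds:
+         return "exit"
+     if state.get("iteration", 0) >= budget.max_iterations:
+         return "exit"
+     return "continue_research"
+ ```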
120
+
121
+ ## Error Handling
122
+
123
+ Errors are handled at multiple levels:
124
+
125
+ 1. **Node Level**: Catch errors in individual node execution
126
+ 2. **Graph Level**: Handle errors during graph traversal
127
+ 3. **State Level**: Rollback state changes on error
128
+
129
+ Errors are logged and yield error events for UI.
130
+
131
+ ## Backward Compatibility
132
+
133
+ Graph execution is optional via feature flag:
134
+
135
+ - `USE_GRAPH_EXECUTION=true`: Use graph-based execution
136
+ - `USE_GRAPH_EXECUTION=false`: Use agent chain execution (existing)
137
+
138
+ This allows gradual migration and fallback if needed.
139
+
140
+
141
+
142
+
143
+
144
+
145
+
146
+
147
+
148
+
149
+
150
+
151
+
docs/architecture/overview.md ADDED
@@ -0,0 +1,474 @@
1
+ # DeepCritical: Medical Drug Repurposing Research Agent
2
+ ## Project Overview
3
+
4
+ ---
5
+
6
+ ## Executive Summary
7
+
8
+ **DeepCritical** is a deep research agent designed to accelerate medical drug repurposing research by autonomously searching, analyzing, and synthesizing evidence from multiple biomedical databases.
9
+
10
+ ### The Problem We Solve
11
+
12
+ Drug repurposing - finding new therapeutic uses for existing FDA-approved drugs - can take years of manual literature review. Researchers must:
13
+ - Search thousands of papers across multiple databases
14
+ - Identify molecular mechanisms
15
+ - Find relevant clinical trials
16
+ - Assess safety profiles
17
+ - Synthesize evidence into actionable insights
18
+
19
+ **DeepCritical automates this process, cutting hours of manual searching down to minutes.**
20
+
21
+ ### What Is Drug Repurposing?
22
+
23
+ **Simple Explanation:**
24
+ Using existing approved drugs to treat NEW diseases they weren't originally designed for.
25
+
26
+ **Real Examples:**
27
+ - **Viagra** (sildenafil): Originally for heart disease → Now treats erectile dysfunction
28
+ - **Thalidomide**: Once banned → Now treats multiple myeloma
29
+ - **Aspirin**: Pain reliever → Heart attack prevention
30
+ - **Metformin**: Diabetes drug → Being tested for aging/longevity
31
+
32
+ **Why It Matters:**
33
+ - Faster than developing new drugs (years vs decades)
34
+ - Cheaper (known safety profiles)
35
+ - Lower risk (already FDA approved)
36
+ - Immediate patient benefit potential
37
+
38
+ ---
39
+
40
+ ## Core Use Case
41
+
42
+ ### Primary Query Type
43
+ > "What existing drugs might help treat [disease/condition]?"
44
+
45
+ ### Example Queries
46
+
47
+ 1. **Long COVID Fatigue**
48
+ - Query: "What existing drugs might help treat long COVID fatigue?"
49
+ - Agent searches: PubMed, clinical trials, drug databases
50
+ - Output: List of candidate drugs with mechanisms + evidence + citations
51
+
52
+ 2. **Alzheimer's Disease**
53
+ - Query: "Find existing drugs that target beta-amyloid pathways"
54
+ - Agent identifies: Disease mechanisms → Drug candidates → Clinical evidence
55
+ - Output: Comprehensive research report with drug candidates
56
+
57
+ 3. **Rare Disease Treatment**
58
+ - Query: "What drugs might help with fibrodysplasia ossificans progressiva?"
59
+ - Agent finds: Similar conditions → Shared pathways → Potential treatments
60
+ - Output: Evidence-based treatment suggestions
61
+
62
+ ---
63
+
64
+ ## System Architecture
65
+
66
+ ### High-Level Design (Phases 1-8)
67
+
68
+ ```text
69
+ User Query
70
+
71
+ Gradio UI (Phase 4)
72
+
73
+ Magentic Manager (Phase 5) ← LLM-powered coordinator
74
+ ├── SearchAgent (Phase 2+5) ←→ PubMed + Web + VectorDB (Phase 6)
75
+ ├── HypothesisAgent (Phase 7) ←→ Mechanistic Reasoning
76
+ ├── JudgeAgent (Phase 3+5) ←→ Evidence Assessment
77
+ └── ReportAgent (Phase 8) ←→ Final Synthesis
78
+
79
+ Structured Research Report
80
+ ```
81
+
82
+ ### Key Components
83
+
84
+ 1. **Magentic Manager (Orchestrator)**
85
+ - LLM-powered multi-agent coordinator
86
+ - Dynamic planning and agent selection
87
+ - Built-in stall detection and replanning
88
+ - Microsoft Agent Framework integration
89
+
90
+ 2. **SearchAgent (Phase 2+5+6)**
91
+ - PubMed E-utilities search
92
+ - DuckDuckGo web search
93
+ - Semantic search via ChromaDB (Phase 6)
94
+ - Evidence deduplication
95
+
96
+ 3. **HypothesisAgent (Phase 7)**
97
+ - Generates Drug → Target → Pathway → Effect hypotheses
98
+ - Guides targeted searches
99
+ - Scientific reasoning about mechanisms
100
+
101
+ 4. **JudgeAgent (Phase 3+5)**
102
+ - LLM-based evidence assessment
103
+ - Mechanism score + Clinical score
104
+ - Recommends continue/synthesize
105
+ - Generates refined search queries
106
+
107
+ 5. **ReportAgent (Phase 8)**
108
+ - Structured scientific reports
109
+ - Executive summary, methodology
110
+ - Hypotheses tested with evidence counts
111
+ - Proper citations and limitations
112
+
113
+ 6. **Gradio UI (Phase 4)**
114
+ - Chat interface for questions
115
+ - Real-time progress via events
116
+ - Mode toggle (Simple/Magentic)
117
+ - Formatted markdown output
118
+
119
+ ---
120
+
121
+ ## Design Patterns
122
+
123
+ ### 1. Search-and-Judge Loop (Primary Pattern)
124
+
125
+ ```python
126
+ def research(question: str) -> Report:
127
+ context = []
128
+ for iteration in range(max_iterations):
129
+ # SEARCH: Query relevant tools
130
+ results = search_tools(question, context)
131
+ context.extend(results)
132
+
133
+ # JUDGE: Evaluate quality
134
+ if judge.is_sufficient(question, context):
135
+ break
136
+
137
+ # REFINE: Adjust search strategy
138
+ query = refine_query(question, context)
139
+
140
+ # SYNTHESIZE: Generate report
141
+ return synthesize_report(question, context)
142
+ ```
143
+
144
+ **Why This Pattern:**
145
+ - Simple to implement and debug
146
+ - Clear loop termination conditions
147
+ - Iterative improvement of search quality
148
+ - Balances depth vs speed
149
+
150
+ ### 2. Multi-Tool Orchestration
151
+
152
+ ```
153
+ Question → Agent decides which tools to use
154
+
155
+ ┌───┴────┬─────────┬──────────┐
156
+ ↓ ↓ ↓ ↓
157
+ PubMed Web Search Trials DB Drug DB
158
+ ↓ ↓ ↓ ↓
159
+    └───┬────┴─────────┴──────────┘
160
+
161
+ Aggregate Results → Judge
162
+ ```
163
+
164
+ **Why This Pattern:**
165
+ - Different sources provide different evidence types
166
+ - Parallel tool execution (when possible)
167
+ - Comprehensive coverage
168
+
169
+ ### 3. LLM-as-Judge with Token Budget
170
+
171
+ **Dual Stopping Conditions:**
172
+ - **Smart Stop**: LLM judge says "we have sufficient evidence"
173
+ - **Hard Stop**: Token budget exhausted OR max iterations reached
174
+
175
+ **Why Both:**
176
+ - Judge enables early exit when answer is good
177
+ - Budget prevents runaway costs
178
+ - Iterations prevent infinite loops
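+
+ A compact sketch of how both conditions can be combined (names and limits are illustrative, not the project's API):
+
+ ```python
+ def should_stop(judge_sufficient: bool, tokens_used: int, iteration: int,
+                 max_tokens: int = 50_000, max_iterations: int = 10) -> tuple[bool, str]:
+     if judge_sufficient:
+         return True, "judge_approved"          # smart stop
+     if tokens_used >= max_tokens:
+         return True, "token_budget_exhausted"  # hard stop: cost control
+     if iteration >= max_iterations:
+         return True, "max_iterations_reached"  # hard stop: no infinite loops
+     return False, "continue"
+ ```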
179
+
180
+ ### 4. Stateful Checkpointing
181
+
182
+ ```
183
+ .deepresearch/
184
+ ├── state/
185
+ │ └── query_123.json # Current research state
186
+ ├── checkpoints/
187
+ │ └── query_123_iter3/ # Checkpoint at iteration 3
188
+ └── workspace/
189
+ └── query_123/ # Downloaded papers, data
190
+ ```
191
+
192
+ **Why This Pattern:**
193
+ - Resume interrupted research
194
+ - Debugging and analysis
195
+ - Cost savings (don't re-search)
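+
+ A simplified save/resume sketch using JSON snapshots (the layout above uses per-iteration directories; this collapses them into one file per iteration just to show the idea):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ def save_checkpoint(query_id: str, iteration: int, state: dict) -> Path:
+     path = Path(".deepresearch/checkpoints") / f"{query_id}_iter{iteration}.json"
+     path.parent.mkdir(parents=True, exist_ok=True)
+     path.write_text(json.dumps(state, indent=2))
+     return path
+
+ def load_latest_checkpoint(query_id: str) -> dict | None:
+     checkpoints = sorted(
+         Path(".deepresearch/checkpoints").glob(f"{query_id}_iter*.json"),
+         key=lambda p: int(p.stem.rsplit("_iter", 1)[1]),
+     )
+     return json.loads(checkpoints[-1].read_text()) if checkpoints else None
+ ```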
196
+
197
+ ---
198
+
199
+ ## Component Breakdown
200
+
201
+ ### Agent (Orchestrator)
202
+ - **Responsibility**: Coordinate research process
203
+ - **Size**: ~100 lines
204
+ - **Key Methods**:
205
+ - `research(question)` - Main entry point
206
+ - `plan_search_strategy()` - Decide what to search
207
+ - `execute_search()` - Run tool queries
208
+ - `evaluate_progress()` - Call judge
209
+ - `synthesize_findings()` - Generate report
210
+
211
+ ### Tools
212
+ - **Responsibility**: Interface with external data sources
213
+ - **Size**: ~50 lines per tool
214
+ - **Implementations**:
215
+ - `PubMedTool` - Search biomedical literature
216
+ - `WebSearchTool` - General medical information
217
+ - `ClinicalTrialsTool` - Trial data (optional)
218
+ - `DrugInfoTool` - FDA drug database (optional)
219
+
220
+ ### Judge
221
+ - **Responsibility**: Evaluate evidence quality
222
+ - **Size**: ~50 lines
223
+ - **Key Methods**:
224
+ - `is_sufficient(question, evidence)` → bool
225
+ - `assess_quality(evidence)` → score
226
+ - `identify_gaps(question, evidence)` → missing_info
227
+
228
+ ### Gradio App
229
+ - **Responsibility**: User interface
230
+ - **Size**: ~50 lines
231
+ - **Features**:
232
+ - Text input for questions
233
+ - Progress indicators
234
+ - Formatted output with citations
235
+ - Download research report
236
+
237
+ ---
238
+
239
+ ## Technical Stack
240
+
241
+ ### Core Dependencies
242
+ ```toml
243
+ [dependencies]
244
+ python = ">=3.10"
245
+ pydantic = "^2.7"
246
+ pydantic-ai = "^0.0.16"
247
+ fastmcp = "^0.1.0"
248
+ gradio = "^5.0"
249
+ beautifulsoup4 = "^4.12"
250
+ httpx = "^0.27"
251
+ ```
252
+
253
+ ### Optional Enhancements
254
+ - `modal` - For GPU-accelerated local LLM
255
+ - `fastmcp` - MCP server integration
256
+ - `sentence-transformers` - Semantic search
257
+ - `faiss-cpu` - Vector similarity
258
+
259
+ ### Tool APIs & Rate Limits
260
+
261
+ | API | Cost | Rate Limit | API Key? | Notes |
262
+ |-----|------|------------|----------|-------|
263
+ | **PubMed E-utilities** | Free | 3/sec (no key), 10/sec (with key) | Optional | Register at NCBI for higher limits |
264
+ | **Brave Search API** | Free tier | 2000/month free | Required | Primary web search |
265
+ | **DuckDuckGo** | Free | Unofficial, ~1/sec | No | Fallback web search |
266
+ | **ClinicalTrials.gov** | Free | 100/min | No | Stretch goal |
267
+ | **OpenFDA** | Free | 240/min (no key), 120K/day (with key) | Optional | Drug info |
268
+
269
+ **Web Search Strategy (Priority Order):**
270
+ 1. **Brave Search API** (free tier: 2000 queries/month) - Primary
271
+ 2. **DuckDuckGo** (unofficial, no API key) - Fallback
272
+ 3. **SerpAPI** ($50/month) - Only if free options fail
273
+
274
+ **Why NOT SerpAPI first?**
275
+ - Costs money (hackathon budget = $0)
276
+ - Free alternatives work fine for demo
277
+ - Can upgrade later if needed
278
+
279
+ ---
280
+
281
+ ## Success Criteria
282
+
283
+ ### Phase 1-5 (MVP) ✅ COMPLETE
284
+ **Completed in ONE DAY:**
285
+ - [x] User can ask drug repurposing question
286
+ - [x] Agent searches PubMed (async)
287
+ - [x] Agent searches web (DuckDuckGo)
288
+ - [x] LLM judge evaluates evidence quality
289
+ - [x] System respects token budget and iterations
290
+ - [x] Output includes drug candidates + citations
291
+ - [x] Works end-to-end for demo query
292
+ - [x] Gradio UI with streaming progress
293
+ - [x] Magentic multi-agent orchestration
294
+ - [x] 38 unit tests passing
295
+ - [x] CI/CD pipeline green
296
+
297
+ ### Hackathon Submission ✅ COMPLETE
298
+ - [x] Gradio UI deployed on HuggingFace Spaces
299
+ - [x] Example queries working and tested
300
+ - [x] Architecture documentation
301
+ - [x] README with setup instructions
302
+
303
+ ### Phase 6-8 (Enhanced)
304
+ **Specs ready for implementation:**
305
+ - [ ] Embeddings & Semantic Search (Phase 6)
306
+ - [ ] Hypothesis Agent (Phase 7)
307
+ - [ ] Report Agent (Phase 8)
308
+
309
+ ### What's EXPLICITLY Out of Scope
310
+ **NOT building (to stay focused):**
311
+ - ❌ User authentication
312
+ - ❌ Database storage of queries
313
+ - ❌ Multi-user support
314
+ - ❌ Payment/billing
315
+ - ❌ Production monitoring
316
+ - ❌ Mobile UI
317
+
318
+ ---
319
+
320
+ ## Implementation Timeline
321
+
322
+ ### Day 1 (Today): Architecture & Setup
323
+ - [x] Define use case (drug repurposing) ✅
324
+ - [x] Write architecture docs ✅
325
+ - [ ] Create project structure
326
+ - [ ] First PR: Structure + Docs
327
+
328
+ ### Day 2: Core Agent Loop
329
+ - [ ] Implement basic orchestrator
330
+ - [ ] Add PubMed search tool
331
+ - [ ] Simple judge (keyword-based)
332
+ - [ ] Test with 1 query
333
+
334
+ ### Day 3: Intelligence Layer
335
+ - [ ] Upgrade to LLM judge
336
+ - [ ] Add web search tool
337
+ - [ ] Token budget tracking
338
+ - [ ] Test with multiple queries
339
+
340
+ ### Day 4: UI & Integration
341
+ - [ ] Build Gradio interface
342
+ - [ ] Wire up agent to UI
343
+ - [ ] Add progress indicators
344
+ - [ ] Format output nicely
345
+
346
+ ### Day 5: Polish & Extend
347
+ - [ ] Add more tools (clinical trials)
348
+ - [ ] Improve judge prompts
349
+ - [ ] Checkpoint system
350
+ - [ ] Error handling
351
+
352
+ ### Day 6: Deploy & Document
353
+ - [ ] Deploy to HuggingFace Spaces
354
+ - [ ] Record demo video
355
+ - [ ] Write submission materials
356
+ - [ ] Final testing
357
+
358
+ ---
359
+
360
+ ## Questions This Document Answers
361
+
362
+ ### For The Maintainer
363
+
364
+ **Q: "What should our design pattern be?"**
365
+ A: Search-and-judge loop with multi-tool orchestration (detailed in Design Patterns section)
366
+
367
+ **Q: "Should we use LLM-as-judge or token budget?"**
368
+ A: Both - judge for smart stopping, budget for cost control
369
+
370
+ **Q: "What's the break pattern?"**
371
+ A: Three conditions: judge approval, token limit, or max iterations (whichever comes first)
372
+
373
+ **Q: "What components do we need?"**
374
+ A: Agent orchestrator, tools (PubMed/web), judge, Gradio UI (see Component Breakdown)
375
+
376
+ ### For The Team
377
+
378
+ **Q: "What are we actually building?"**
379
+ A: Medical drug repurposing research agent (see Core Use Case)
380
+
381
+ **Q: "How complex should it be?"**
382
+ A: Simple but complete - ~300 lines of core code (see Component sizes)
383
+
384
+ **Q: "What's the timeline?"**
385
+ A: 6 days, MVP by Day 3, polish Days 4-6 (see Implementation Timeline)
386
+
387
+ **Q: "What datasets/APIs do we use?"**
388
+ A: PubMed (free), web search, clinical trials.gov (see Tool APIs)
389
+
390
+ ---
391
+
392
+ ## Next Steps
393
+
394
+ 1. **Review this document** - Team feedback on architecture
395
+ 2. **Finalize design** - Incorporate feedback
396
+ 3. **Create project structure** - Scaffold repository
397
+ 4. **Move to proper docs** - `docs/architecture/` folder
398
+ 5. **Open first PR** - Structure + Documentation
399
+ 6. **Start implementation** - Day 2 onward
400
+
401
+ ---
402
+
403
+ ## Notes & Decisions
404
+
405
+ ### Why Drug Repurposing?
406
+ - Clear, impressive use case
407
+ - Real-world medical impact
408
+ - Good data availability (PubMed, trials)
409
+ - Easy to explain (Viagra example!)
410
+ - Physician on team ✅
411
+
412
+ ### Why Simple Architecture?
413
+ - 6-day timeline
414
+ - Need working end-to-end system
415
+ - Hackathon judges value "works" over "complex"
416
+ - Can extend later if successful
417
+
418
+ ### Why These Tools First?
419
+ - PubMed: Best biomedical literature source
420
+ - Web search: General medical knowledge
421
+ - Clinical trials: Evidence of actual testing
422
+ - Others: Nice-to-have, not critical for MVP
423
+
424
+ ---
425
+
426
+ ---
427
+
428
+ ## Appendix A: Demo Queries (Pre-tested)
429
+
430
+ These queries will be used for demo and testing. They're chosen because:
431
+ 1. They have good PubMed coverage
432
+ 2. They're medically interesting
433
+ 3. They show the system's capabilities
434
+
435
+ ### Primary Demo Query
436
+ ```
437
+ "What existing drugs might help treat long COVID fatigue?"
438
+ ```
439
+ **Expected candidates**: CoQ10, Low-dose Naltrexone, Modafinil
440
+ **Expected sources**: 20+ PubMed papers, 2-3 clinical trials
441
+
442
+ ### Secondary Demo Queries
443
+ ```
444
+ "Find existing drugs that might slow Alzheimer's progression"
445
+ "What approved medications could help with fibromyalgia pain?"
446
+ "Which diabetes drugs show promise for cancer treatment?"
447
+ ```
448
+
449
+ ### Why These Queries?
450
+ - Represent real clinical needs
451
+ - Have substantial literature
452
+ - Show diverse drug classes
453
+ - Physician on team can validate results
454
+
455
+ ---
456
+
457
+ ## Appendix B: Risk Assessment
458
+
459
+ | Risk | Likelihood | Impact | Mitigation |
460
+ |------|------------|--------|------------|
461
+ | PubMed rate limiting | Medium | High | Implement caching, respect 3/sec |
462
+ | Web search API fails | Low | Medium | DuckDuckGo fallback |
463
+ | LLM costs exceed budget | Medium | Medium | Hard token cap at 50K |
464
+ | Judge quality poor | Medium | High | Pre-test prompts, iterate |
465
+ | HuggingFace deploy issues | Low | High | Test deployment Day 4 |
466
+ | Demo crashes live | Medium | High | Pre-recorded backup video |
467
+
468
+ ---
469
+
470
+ ---
471
+
472
+ **Document Status**: Official Architecture Spec
473
+ **Review Score**: 98/100
474
+ **Last Updated**: November 2025
docs/brainstorming/00_ROADMAP_SUMMARY.md ADDED
@@ -0,0 +1,194 @@
1
+ # DeepCritical Data Sources: Roadmap Summary
2
+
3
+ **Created**: 2024-11-27
4
+ **Purpose**: Future maintainability and hackathon continuation
5
+
6
+ ---
7
+
8
+ ## Current State
9
+
10
+ ### Working Tools
11
+
12
+ | Tool | Status | Data Quality |
13
+ |------|--------|--------------|
14
+ | PubMed | ✅ Works | Good (abstracts only) |
15
+ | ClinicalTrials.gov | ✅ Works | Good (filtered for interventional) |
16
+ | Europe PMC | ✅ Works | Good (includes preprints) |
17
+
18
+ ### Removed Tools
19
+
20
+ | Tool | Status | Reason |
21
+ |------|--------|--------|
22
+ | bioRxiv | ❌ Removed | No search API - only date/DOI lookup |
23
+
24
+ ---
25
+
26
+ ## Priority Improvements
27
+
28
+ ### P0: Critical (Do First)
29
+
30
+ 1. **Add Rate Limiting to PubMed**
31
+ - NCBI will block us without it
32
+ - Use `limits` library (see reference repo)
33
+ - 3/sec without key, 10/sec with key
34
+
35
+ ### P1: High Value, Medium Effort
36
+
37
+ 2. **Add OpenAlex as 4th Source**
38
+ - Citation network (huge for drug repurposing)
39
+ - Concept tagging (semantic discovery)
40
+ - Already implemented in reference repo
41
+ - Free, no API key
42
+
43
+ 3. **PubMed Full-Text via BioC**
44
+ - Get full paper text for PMC papers
45
+ - Already in reference repo
46
+
47
+ ### P2: Nice to Have
48
+
49
+ 4. **ClinicalTrials.gov Results**
50
+ - Get efficacy data from completed trials
51
+ - Requires more complex API calls
52
+
53
+ 5. **Europe PMC Annotations**
54
+ - Text-mined entities (genes, drugs, diseases)
55
+ - Automatic entity extraction
56
+
57
+ ---
58
+
59
+ ## Effort Estimates
60
+
61
+ | Improvement | Effort | Impact | Priority |
62
+ |-------------|--------|--------|----------|
63
+ | PubMed rate limiting | 1 hour | Stability | P0 |
64
+ | OpenAlex basic search | 2 hours | High | P1 |
65
+ | OpenAlex citations | 2 hours | Very High | P1 |
66
+ | PubMed full-text | 3 hours | Medium | P1 |
67
+ | CT.gov results | 4 hours | Medium | P2 |
68
+ | Europe PMC annotations | 3 hours | Medium | P2 |
69
+
70
+ ---
71
+
72
+ ## Architecture Decision
73
+
74
+ ### Option A: Keep Current + Add OpenAlex
75
+
76
+ ```
77
+ User Query
78
+
79
+ ┌───────────────────┼───────────────────┐
80
+ ↓ ↓ ↓
81
+ PubMed ClinicalTrials Europe PMC
82
+ (abstracts) (trials only) (preprints)
83
+ ↓ ↓ ↓
84
+ └───────────────────┼───────────────────┘
85
+
86
+ OpenAlex ← NEW
87
+ (citations, concepts)
88
+
89
+ Orchestrator
90
+
91
+ Report
92
+ ```
93
+
94
+ **Pros**: Low risk, additive
95
+ **Cons**: More complexity, some overlap
96
+
97
+ ### Option B: OpenAlex as Primary
98
+
99
+ ```
100
+ User Query
101
+
102
+ ┌───────────────────┼───────────────────┐
103
+ ↓ ↓ ↓
104
+ OpenAlex ClinicalTrials Europe PMC
105
+ (primary (trials only) (full-text
106
+ search) fallback)
107
+ ↓ ↓ ↓
108
+ └───────────────────┼───────────────────┘
109
+
110
+ Orchestrator
111
+
112
+ Report
113
+ ```
114
+
115
+ **Pros**: Simpler, citation network built-in
116
+ **Cons**: Lose some PubMed-specific features
117
+
118
+ ### Recommendation: Option A
119
+
120
+ Keep current architecture working, add OpenAlex incrementally.
121
+
122
+ ---
123
+
124
+ ## Quick Wins (Can Do Today)
125
+
126
+ 1. **Add `limits` to `pyproject.toml`**
127
+ ```toml
128
+ dependencies = [
129
+ "limits>=3.0",
130
+ ]
131
+ ```
132
+
133
+ 2. **Copy OpenAlex tool from reference repo**
134
+ - File: `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`
135
+ - Adapt to our `SearchTool` base class
136
+
137
+ 3. **Enable NCBI API Key**
138
+ - Add to `.env`: `NCBI_API_KEY=your_key`
139
+ - 10x rate limit improvement
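+
+ For item 2 above, a rough sketch of what the adapted search call could look like against the public OpenAlex REST API (field handling is simplified; wire it into the `SearchTool` interface):
+
+ ```python
+ import httpx
+
+ async def search_openalex(query: str, max_results: int = 10) -> list[dict]:
+     params = {"search": query, "per-page": max_results, "mailto": "you@example.com"}
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get("https://api.openalex.org/works", params=params)
+     response.raise_for_status()
+     works = response.json().get("results", [])
+     return [
+         {"title": w.get("display_name"), "doi": w.get("doi"), "cited_by": w.get("cited_by_count")}
+         for w in works
+     ]
+ ```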
140
+
141
+ ---
142
+
143
+ ## External Resources Worth Exploring
144
+
145
+ ### Python Libraries
146
+
147
+ | Library | For | Notes |
148
+ |---------|-----|-------|
149
+ | `limits` | Rate limiting | Used by reference repo |
150
+ | `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) |
151
+ | `metapub` | PubMed | Full-featured |
152
+ | `sentence-transformers` | Semantic search | For embeddings |
153
+
154
+ ### APIs Not Yet Used
155
+
156
+ | API | Provides | Effort |
157
+ |-----|----------|--------|
158
+ | RxNorm | Drug name normalization | Low |
159
+ | DrugBank | Drug targets/mechanisms | Medium (license) |
160
+ | UniProt | Protein data | Medium |
161
+ | ChEMBL | Bioactivity data | Medium |
162
+
163
+ ### RAG Tools (Future)
164
+
165
+ | Tool | Purpose |
166
+ |------|---------|
167
+ | [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers |
168
+ | [txtai](https://github.com/neuml/txtai) | Embeddings + search |
169
+ | [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings |
170
+
171
+ ---
172
+
173
+ ## Files in This Directory
174
+
175
+ | File | Contents |
176
+ |------|----------|
177
+ | `00_ROADMAP_SUMMARY.md` | This file |
178
+ | `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
179
+ | `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
180
+ | `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
181
+ | `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |
182
+
183
+ ---
184
+
185
+ ## For Future Maintainers
186
+
187
+ If you're picking this up after the hackathon:
188
+
189
+ 1. **Start with OpenAlex** - biggest bang for buck
190
+ 2. **Add rate limiting** - prevents API blocks
191
+ 3. **Don't bother with bioRxiv** - use Europe PMC instead
192
+ 4. **Reference repo is gold** - `reference_repos/DeepCritical/` has working implementations
193
+
194
+ Good luck! 🚀
docs/brainstorming/01_PUBMED_IMPROVEMENTS.md ADDED
@@ -0,0 +1,125 @@
1
+ # PubMed Tool: Current State & Future Improvements
2
+
3
+ **Status**: Currently Implemented
4
+ **Priority**: High (Core Data Source)
5
+
6
+ ---
7
+
8
+ ## Current Implementation
9
+
10
+ ### What We Have (`src/tools/pubmed.py`)
11
+
12
+ - Basic E-utilities search via `esearch.fcgi` and `efetch.fcgi`
13
+ - Query preprocessing (strips question words, expands synonyms)
14
+ - Returns: title, abstract, authors, journal, PMID
15
+ - Rate limiting: None implemented (relying on NCBI defaults)
16
+
17
+ ### Current Limitations
18
+
19
+ 1. **No Full-Text Access**: Only retrieves abstracts, not full paper text
20
+ 2. **No Rate Limiting**: Risk of being blocked by NCBI
21
+ 3. **No BioC Format**: Missing structured full-text extraction
22
+ 4. **No Figure Retrieval**: No supplementary materials access
23
+ 5. **No PMC Integration**: Missing open-access full-text via PMC
24
+
25
+ ---
26
+
27
+ ## Reference Implementation (DeepCritical Reference Repo)
28
+
29
+ The reference repo at `reference_repos/DeepCritical/DeepResearch/src/tools/bioinformatics_tools.py` has a more sophisticated implementation:
30
+
31
+ ### Features We're Missing
32
+
33
+ ```python
34
+ # Rate limiting (lines 47-50)
35
+ from limits import parse
36
+ from limits.storage import MemoryStorage
37
+ from limits.strategies import MovingWindowRateLimiter
38
+
39
+ storage = MemoryStorage()
40
+ limiter = MovingWindowRateLimiter(storage)
41
+ rate_limit = parse("3/second") # NCBI allows 3/sec without API key, 10/sec with
42
+
43
+ # Full-text via BioC format (lines 108-120)
44
+ def _get_fulltext(pmid: int) -> dict[str, Any] | None:
45
+ pmid_url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
46
+ # Returns structured JSON with full text for open-access papers
47
+
48
+ # Figure retrieval via Europe PMC (lines 123-149)
49
+ def _get_figures(pmcid: str) -> dict[str, str]:
50
+ suppl_url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles"
51
+ # Returns base64-encoded images from supplementary materials
52
+ ```
53
+
54
+ ---
55
+
56
+ ## Recommended Improvements
57
+
58
+ ### Phase 1: Rate Limiting (Critical)
59
+
60
+ ```python
61
+ # Add to src/tools/pubmed.py
62
+ from limits import parse
63
+ from limits.storage import MemoryStorage
64
+ from limits.strategies import MovingWindowRateLimiter
65
+
66
+ storage = MemoryStorage()
67
+ limiter = MovingWindowRateLimiter(storage)
68
+
69
+ # With NCBI_API_KEY: 10/sec, without: 3/sec
70
+ def get_rate_limit():
71
+ if settings.ncbi_api_key:
72
+ return parse("10/second")
73
+ return parse("3/second")
74
+ ```
75
+
76
+ **Dependencies**: `pip install limits`
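+
+ The snippet above builds the limiter but doesn't show the call site; a minimal self-contained pattern for gating each request with the `limits` moving-window strategy:
+
+ ```python
+ import asyncio
+ from limits import parse
+ from limits.storage import MemoryStorage
+ from limits.strategies import MovingWindowRateLimiter
+
+ _limiter = MovingWindowRateLimiter(MemoryStorage())
+ _rate = parse("3/second")  # switch to "10/second" when NCBI_API_KEY is set
+
+ async def wait_for_slot(identifier: str = "pubmed") -> None:
+     """Block (politely) until the limiter grants a request slot."""
+     while not _limiter.hit(_rate, identifier):
+         await asyncio.sleep(0.1)
+ ```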
77
+
78
+ ### Phase 2: Full-Text Retrieval
79
+
80
+ ```python
81
+ import httpx
+
+ async def get_fulltext(pmid: str) -> str | None:
+     """Get full text (BioC JSON) for open-access PMC papers; None otherwise."""
+     url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get(url)
+     return response.text if response.status_code == 200 else None
85
+ ```
86
+
87
+ ### Phase 3: PMC ID Resolution
88
+
89
+ ```python
90
+ async def get_pmc_id(pmid: str) -> str | None:
+     """Convert PMID to PMCID for full-text access."""
+     url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids={pmid}&format=json"
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         records = (await client.get(url)).json().get("records", [])
+     return records[0].get("pmcid") if records else None
93
+ ```
94
+
95
+ ---
96
+
97
+ ## Python Libraries to Consider
98
+
99
+ | Library | Purpose | Notes |
100
+ |---------|---------|-------|
101
+ | [Biopython](https://biopython.org/) | `Bio.Entrez` module | Official, well-maintained |
102
+ | [PyMed](https://pypi.org/project/pymed/) | PubMed wrapper | Simpler API, less control |
103
+ | [metapub](https://pypi.org/project/metapub/) | Full-featured | Tested on 1/3 of PubMed |
104
+ | [limits](https://pypi.org/project/limits/) | Rate limiting | Used by reference repo |
105
+
106
+ ---
107
+
108
+ ## API Endpoints Reference
109
+
110
+ | Endpoint | Purpose | Rate Limit |
111
+ |----------|---------|------------|
112
+ | `esearch.fcgi` | Search for PMIDs | 3/sec (10 with key) |
113
+ | `efetch.fcgi` | Fetch metadata | 3/sec (10 with key) |
114
+ | `esummary.fcgi` | Quick metadata | 3/sec (10 with key) |
115
+ | `pmcoa.cgi/BioC_json` | Full text (PMC only) | Unknown |
116
+ | `idconv/v1.0` | PMID ↔ PMCID | Unknown |
117
+
118
+ ---
119
+
120
+ ## Sources
121
+
122
+ - [PubMed E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25501/)
123
+ - [NCBI BioC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/)
124
+ - [Searching PubMed with Python](https://marcobonzanini.com/2015/01/12/searching-pubmed-with-python/)
125
+ - [PyMed on PyPI](https://pypi.org/project/pymed/)
docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md ADDED
@@ -0,0 +1,193 @@
1
+ # ClinicalTrials.gov Tool: Current State & Future Improvements
2
+
3
+ **Status**: Currently Implemented
4
+ **Priority**: High (Core Data Source for Drug Repurposing)
5
+
6
+ ---
7
+
8
+ ## Current Implementation
9
+
10
+ ### What We Have (`src/tools/clinicaltrials.py`)
11
+
12
+ - V2 API search via `clinicaltrials.gov/api/v2/studies`
13
+ - Filters: `INTERVENTIONAL` study type, `RECRUITING` status
14
+ - Returns: NCT ID, title, conditions, interventions, phase, status
15
+ - Query preprocessing via shared `query_utils.py`
16
+
17
+ ### Current Strengths
18
+
19
+ 1. **Good Filtering**: Already filtering for interventional + recruiting
20
+ 2. **V2 API**: Using the modern API (v1 deprecated)
21
+ 3. **Phase Info**: Extracting trial phases for drug development context
22
+
23
+ ### Current Limitations
24
+
25
+ 1. **No Outcome Data**: Missing primary/secondary outcomes
26
+ 2. **No Eligibility Criteria**: Missing inclusion/exclusion details
27
+ 3. **No Sponsor Info**: Missing who's running the trial
28
+ 4. **No Result Data**: For completed trials, no efficacy data
29
+ 5. **Limited Drug Mapping**: No integration with drug databases
30
+
31
+ ---
32
+
33
+ ## API Capabilities We're Not Using
34
+
35
+ ### Fields We Could Request
36
+
37
+ ```python
38
+ # Current fields
39
+ fields = ["NCTId", "BriefTitle", "Condition", "InterventionName", "Phase", "OverallStatus"]
40
+
41
+ # Additional valuable fields
42
+ additional_fields = [
43
+ "PrimaryOutcomeMeasure", # What are they measuring?
44
+ "SecondaryOutcomeMeasure", # Secondary endpoints
45
+ "EligibilityCriteria", # Who can participate?
46
+ "LeadSponsorName", # Who's funding?
47
+ "ResultsFirstPostDate", # Has results?
48
+ "StudyFirstPostDate", # When started?
49
+ "CompletionDate", # When finished?
50
+ "EnrollmentCount", # Sample size
51
+ "InterventionDescription", # Drug details
52
+ "ArmGroupLabel", # Treatment arms
53
+ "InterventionOtherName", # Drug aliases
54
+ ]
55
+ ```
56
+
57
+ ### Filter Enhancements
58
+
59
+ ```python
60
+ # Current
61
+ aggFilters = "studyType:INTERVENTIONAL,status:RECRUITING"
62
+
63
+ # Could add
64
+ "status:RECRUITING,ACTIVE_NOT_RECRUITING,COMPLETED" # Include completed for results
65
+ "phase:PHASE2,PHASE3" # Only later-stage trials
66
+ "resultsFirstPostDateRange:2020-01-01_" # Trials with posted results
67
+ ```
68
+
69
+ ---
70
+
71
+ ## Recommended Improvements
72
+
73
+ ### Phase 1: Richer Metadata
74
+
75
+ ```python
76
+ EXTENDED_FIELDS = [
77
+ "NCTId",
78
+ "BriefTitle",
79
+ "OfficialTitle",
80
+ "Condition",
81
+ "InterventionName",
82
+ "InterventionDescription",
83
+ "InterventionOtherName", # Drug synonyms!
84
+ "Phase",
85
+ "OverallStatus",
86
+ "PrimaryOutcomeMeasure",
87
+ "EnrollmentCount",
88
+ "LeadSponsorName",
89
+ "StudyFirstPostDate",
90
+ ]
91
+ ```
92
+
93
+ ### Phase 2: Results Retrieval
94
+
95
+ For completed trials, we can get actual efficacy data:
96
+
97
+ ```python
98
+ import httpx
+
+ async def get_trial_results(nct_id: str) -> dict | None:
+     """Fetch posted results (outcome measures and statistics) for a completed trial."""
+     url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
+     params = {"fields": "ResultsSection"}
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get(url, params=params)
+     return response.json() if response.status_code == 200 else None
105
+ ```
106
+
107
+ ### Phase 3: Drug Name Normalization
108
+
109
+ Map intervention names to standard identifiers:
110
+
111
+ ```python
112
+ # Problem: "Metformin", "Metformin HCl", "Glucophage" are the same drug
113
+ # Solution: Use RxNorm or DrugBank for normalization
114
+
115
+ async def normalize_drug_name(intervention: str) -> str | None:
+     """Normalize a drug name to an RxCUI via the RxNorm API."""
+     url = f"https://rxnav.nlm.nih.gov/REST/rxcui.json?name={intervention}"
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         data = (await client.get(url)).json()
+     ids = data.get("idGroup", {}).get("rxnormId", [])
+     return ids[0] if ids else None  # standardized RxCUI, or None if no match
119
+ ```
120
+
121
+ ---
122
+
123
+ ## Integration Opportunities
124
+
125
+ ### With PubMed
126
+
127
+ Cross-reference trials with publications:
128
+ ```python
129
+ # ClinicalTrials.gov provides PMID links
130
+ # Can correlate trial results with published papers
131
+ ```
132
+
133
+ ### With DrugBank/ChEMBL
134
+
135
+ Map interventions to:
136
+ - Mechanism of action
137
+ - Known targets
138
+ - Adverse effects
139
+ - Drug-drug interactions
140
+
141
+ ---
142
+
143
+ ## Python Libraries to Consider
144
+
145
+ | Library | Purpose | Notes |
146
+ |---------|---------|-------|
147
+ | [pytrials](https://pypi.org/project/pytrials/) | CT.gov wrapper | V2 API support unclear |
148
+ | [clinicaltrials](https://github.com/ebmdatalab/clinicaltrials-act-tracker) | Data tracking | More for analysis |
149
+ | [drugbank-downloader](https://pypi.org/project/drugbank-downloader/) | Drug mapping | Requires license |
150
+
151
+ ---
152
+
153
+ ## API Quirks & Gotchas
154
+
155
+ 1. **Rate Limiting**: Undocumented, be conservative
156
+ 2. **Pagination**: Max 1000 results per request
157
+ 3. **Field Names**: Case-sensitive, camelCase
158
+ 4. **Empty Results**: Some fields may be null even if requested
159
+ 5. **Status Changes**: Trials change status frequently
160
+
161
+ ---
162
+
163
+ ## Example Enhanced Query
164
+
165
+ ```python
166
+ async def search_drug_repurposing_trials(
167
+ drug_name: str,
168
+ condition: str,
169
+ include_completed: bool = True,
170
+ ) -> list[Evidence]:
171
+ """Search for trials repurposing a drug for a new condition."""
172
+
173
+ statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"]
174
+ if include_completed:
175
+ statuses.append("COMPLETED")
176
+
177
+ params = {
178
+ "query.intr": drug_name,
179
+ "query.cond": condition,
180
+ "filter.overallStatus": ",".join(statuses),
181
+ "filter.studyType": "INTERVENTIONAL",
182
+ "fields": ",".join(EXTENDED_FIELDS),
183
+ "pageSize": 50,
184
+ }
185
+ ```
186
+
187
+ ---
188
+
189
+ ## Sources
190
+
191
+ - [ClinicalTrials.gov API Documentation](https://clinicaltrials.gov/data-api/api)
192
+ - [CT.gov Field Definitions](https://clinicaltrials.gov/data-api/about-api/study-data-structure)
193
+ - [RxNorm API](https://lhncbc.nlm.nih.gov/RxNav/APIs/api-RxNorm.findRxcuiByString.html)
docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md ADDED
@@ -0,0 +1,211 @@
1
+ # Europe PMC Tool: Current State & Future Improvements
2
+
3
+ **Status**: Currently Implemented (Replaced bioRxiv)
4
+ **Priority**: High (Preprint + Open Access Source)
5
+
6
+ ---
7
+
8
+ ## Why Europe PMC Over bioRxiv?
9
+
10
+ ### bioRxiv API Limitations (Why We Abandoned It)
11
+
12
+ 1. **No Search API**: Only returns papers by date range or DOI
13
+ 2. **No Query Capability**: Cannot search for "metformin cancer"
14
+ 3. **Workaround Required**: Would need to download ALL preprints and build local search
15
+ 4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation
16
+
17
+ ### Europe PMC Advantages
18
+
19
+ 1. **Full Search API**: Boolean queries, filters, facets
20
+ 2. **Aggregates bioRxiv**: Includes bioRxiv, medRxiv content anyway
21
+ 3. **Includes PubMed**: Also has MEDLINE content
22
+ 4. **34 Preprint Servers**: Not just bioRxiv
23
+ 5. **Open Access Focus**: Full-text when available
24
+
25
+ ---
26
+
27
+ ## Current Implementation
28
+
29
+ ### What We Have (`src/tools/europepmc.py`)
30
+
31
+ - REST API search via `europepmc.org/webservices/rest/search`
32
+ - Preprint flagging via `firstPublicationDate` heuristics
33
+ - Returns: title, abstract, authors, DOI, source
34
+ - Marks preprints for transparency
35
+
36
+ ### Current Limitations
37
+
38
+ 1. **No Full-Text Retrieval**: Only metadata/abstracts
39
+ 2. **No Citation Network**: Missing references/citations
40
+ 3. **No Supplementary Files**: Not fetching figures/data
41
+ 4. **Basic Preprint Detection**: Heuristic, not explicit flag
42
+
43
+ ---
44
+
45
+ ## Europe PMC API Capabilities
46
+
47
+ ### Endpoints We Could Use
48
+
49
+ | Endpoint | Purpose | Currently Using |
50
+ |----------|---------|-----------------|
51
+ | `/search` | Query papers | Yes |
52
+ | `/fulltext/{ID}` | Full text (XML/JSON) | No |
53
+ | `/{PMCID}/supplementaryFiles` | Figures, data | No |
54
+ | `/citations/{ID}` | Who cited this | No |
55
+ | `/references/{ID}` | What this cites | No |
56
+ | `/annotations` | Text-mined entities | No |
57
+
58
+ ### Rich Query Syntax
59
+
60
+ ```python
61
+ # Current simple query
62
+ query = "metformin cancer"
63
+
64
+ # Could use advanced syntax
65
+ query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
66
+ query += " AND (SRC:PPR)" # Only preprints
67
+ query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])" # Date range
68
+ query += " AND (OPEN_ACCESS:y)" # Only open access
69
+ ```
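+
+ Sending such a query to the REST endpoint is straightforward; a sketch (result parsing trimmed to the JSON envelope Europe PMC returns):
+
+ ```python
+ import httpx
+
+ async def search_europepmc(query: str, page_size: int = 25) -> list[dict]:
+     params = {"query": query, "format": "json", "pageSize": page_size}
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get(
+             "https://www.ebi.ac.uk/europepmc/webservices/rest/search", params=params
+         )
+     response.raise_for_status()
+     return response.json().get("resultList", {}).get("result", [])
+ ```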
70
+
71
+ ### Source Filters
72
+
73
+ ```python
74
+ # Filter by source
75
+ "SRC:MED" # MEDLINE
76
+ "SRC:PMC" # PubMed Central
77
+ "SRC:PPR" # Preprints (bioRxiv, medRxiv, etc.)
78
+ "SRC:AGR" # Agricola
79
+ "SRC:CBA" # Chinese Biological Abstracts
80
+ ```
81
+
82
+ ---
83
+
84
+ ## Recommended Improvements
85
+
86
+ ### Phase 1: Rich Metadata
87
+
88
+ ```python
89
+ # Add to search results
90
+ additional_fields = [
91
+ "citedByCount", # Impact indicator
92
+ "source", # Explicit source (MED, PMC, PPR)
93
+ "isOpenAccess", # Boolean flag
94
+ "fullTextUrlList", # URLs for full text
95
+ "authorAffiliations", # Institution info
96
+ "grantsList", # Funding info
97
+ ]
98
+ ```
99
+
100
+ ### Phase 2: Full-Text Retrieval
101
+
102
+ ```python
103
+ async def get_fulltext(pmcid: str) -> str | None:
104
+ """Get full text for open access papers."""
105
+ # XML format
106
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
107
+ # Or JSON
108
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextJSON"
109
+ ```
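+
+ Fleshed out, the same helper is only a few more lines. A sketch assuming `httpx` and the XML endpoint above; it returns `None` when no open-access full text exists:
+
+ ```python
+ import httpx
+
+ async def get_fulltext(pmcid: str) -> str | None:
+     """Sketch: fetch open-access full text (XML) for a PMC article."""
+     url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
+     async with httpx.AsyncClient(timeout=60.0) as client:
+         resp = await client.get(url)
+         if resp.status_code != 200:
+             return None  # not open access, or not in PMC
+         return resp.text
+ ```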
110
+
111
+ ### Phase 3: Citation Network
112
+
113
+ ```python
114
+ async def get_citations(pmcid: str) -> list[str]:
115
+ """Get papers that cite this one."""
116
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/citations"
117
+
118
+ async def get_references(pmcid: str) -> list[str]:
119
+ """Get papers this one cites."""
120
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/references"
121
+ ```
122
+
123
+ ### Phase 4: Text-Mined Annotations
124
+
125
+ Europe PMC extracts entities automatically:
126
+
127
+ ```python
128
+ async def get_annotations(pmcid: str) -> dict:
129
+ """Get text-mined entities (genes, diseases, drugs)."""
130
+ url = f"https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
131
+ params = {
132
+ "articleIds": f"PMC:{pmcid}",
133
+ "type": "Gene_Proteins,Diseases,Chemicals",
134
+ "format": "JSON",
135
+ }
136
+ # Returns structured entity mentions with positions
137
+ ```
138
+
139
+ ---
140
+
141
+ ## Supplementary File Retrieval
142
+
143
+ From reference repo (`bioinformatics_tools.py` lines 123-149):
144
+
145
+ ```python
146
+ def get_figures(pmcid: str) -> dict[str, str]:
147
+ """Download figures and supplementary files."""
148
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
149
+ # Returns a ZIP of figures and supplementary files, base64-encoded
150
+ ```
151
+
152
+ ---
153
+
154
+ ## Preprint-Specific Features
155
+
156
+ ### Identify Preprint Servers
157
+
158
+ ```python
159
+ PREPRINT_SOURCES = {
160
+ "PPR": "General preprints",
161
+ "bioRxiv": "Biology preprints",
162
+ "medRxiv": "Medical preprints",
163
+ "chemRxiv": "Chemistry preprints",
164
+ "Research Square": "Multi-disciplinary",
165
+ "Preprints.org": "MDPI preprints",
166
+ }
167
+
168
+ # Check if published version exists
169
+ async def check_published_version(preprint_doi: str) -> str | None:
170
+ """Check if preprint has been peer-reviewed and published."""
171
+ # Europe PMC links preprints to final versions
172
+ ```
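+
+ One way to fill in `check_published_version` is a title-matching heuristic: look up the preprint record by DOI, then search for the same title outside the preprint servers. This is a sketch of that heuristic only, not the only linkage Europe PMC offers:
+
+ ```python
+ import httpx
+
+ EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
+
+ async def check_published_version(preprint_doi: str) -> str | None:
+     """Heuristic sketch: return the DOI of a non-preprint record with the same title."""
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         # 1. Find the preprint record to recover its title
+         resp = await client.get(
+             EPMC_SEARCH, params={"query": f'DOI:"{preprint_doi}"', "format": "json"}
+         )
+         records = resp.json().get("resultList", {}).get("result", [])
+         if not records:
+             return None
+         title = records[0].get("title", "")
+         # 2. Look for the same title outside SRC:PPR (i.e., a journal version)
+         resp = await client.get(
+             EPMC_SEARCH,
+             params={"query": f'TITLE:"{title}" AND NOT SRC:PPR', "format": "json"},
+         )
+         published = resp.json().get("resultList", {}).get("result", [])
+         return published[0].get("doi") if published else None
+ ```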
173
+
174
+ ---
175
+
176
+ ## Rate Limiting
177
+
178
+ Europe PMC is more generous than NCBI:
179
+
180
+ ```python
181
+ # No documented hard limit, but be respectful
182
+ # Recommend: 10-20 requests/second max
183
+ # Use email in User-Agent for polite pool
184
+ headers = {
185
+ "User-Agent": "DeepCritical/1.0 (mailto:your@email.com)"
186
+ }
187
+ ```
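+
+ A small sketch of what "be respectful" can look like in practice: one shared client carrying that header plus a cap on in-flight requests (the names and the concurrency value are illustrative):
+
+ ```python
+ import asyncio
+
+ import httpx
+
+ # Shared client with a polite User-Agent; the semaphore caps in-flight
+ # requests (concurrency, not a strict rate limit) to keep bursts modest
+ _client = httpx.AsyncClient(
+     headers={"User-Agent": "DeepCritical/1.0 (mailto:your@email.com)"}
+ )
+ _semaphore = asyncio.Semaphore(5)
+
+ async def polite_get(url: str, **kwargs) -> httpx.Response:
+     async with _semaphore:
+         return await _client.get(url, **kwargs)
+ ```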
188
+
189
+ ---
190
+
191
+ ## vs. The Lens & OpenAlex
192
+
193
+ | Feature | Europe PMC | The Lens | OpenAlex |
194
+ |---------|------------|----------|----------|
195
+ | Biomedical Focus | Yes | Partial | Partial |
196
+ | Preprints | Yes (34 servers) | Yes | Yes |
197
+ | Full Text | PMC papers | Links | No |
198
+ | Citations | Yes | Yes | Yes |
199
+ | Annotations | Yes (text-mined) | No | No |
200
+ | Rate Limits | Generous | Moderate | Very generous |
201
+ | API Key | Optional | Required | Optional |
202
+
203
+ ---
204
+
205
+ ## Sources
206
+
207
+ - [Europe PMC REST API](https://europepmc.org/RestfulWebService)
208
+ - [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
209
+ - [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
210
+ - [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
211
+ - [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)
docs/brainstorming/04_OPENALEX_INTEGRATION.md ADDED
@@ -0,0 +1,303 @@
1
+ # OpenAlex Integration: The Missing Piece?
2
+
3
+ **Status**: NOT Implemented (Candidate for Addition)
4
+ **Priority**: HIGH - Could Replace Multiple Tools
5
+ **Reference**: Already implemented in `reference_repos/DeepCritical`
6
+
7
+ ---
8
+
9
+ ## What is OpenAlex?
10
+
11
+ OpenAlex is a **fully open** index of the global research system:
12
+
13
+ - **209M+ works** (papers, books, datasets)
14
+ - **2B+ author records** (disambiguated)
15
+ - **124K+ venues** (journals, repositories)
16
+ - **109K+ institutions**
17
+ - **65K+ concepts** (hierarchical, linked to Wikidata)
18
+
19
+ **Free. Open. No API key required.**
20
+
21
+ ---
22
+
23
+ ## Why OpenAlex for DeepCritical?
24
+
25
+ ### Current Architecture
26
+
27
+ ```
28
+ User Query
29
+
30
+ ┌──────────────────────────────────────┐
31
+ │ PubMed ClinicalTrials Europe PMC │ ← 3 separate APIs
32
+ └──────────────────────────────────────┘
33
+
34
+ Orchestrator (deduplicate, judge, synthesize)
35
+ ```
36
+
37
+ ### With OpenAlex
38
+
39
+ ```
40
+ User Query
41
+
42
+ ┌──────────────────────────────────────┐
43
+ │ OpenAlex │ ← Single API
44
+ │ (includes PubMed + preprints + │
45
+ │ citations + concepts + authors) │
46
+ └──────────────────────────────────────┘
47
+
48
+ Orchestrator (enrich with CT.gov for trials)
49
+ ```
50
+
51
+ **OpenAlex already aggregates**:
52
+ - PubMed/MEDLINE
53
+ - Crossref
54
+ - ORCID
55
+ - Unpaywall (open access links)
56
+ - Microsoft Academic Graph (legacy)
57
+ - Preprint servers
58
+
59
+ ---
60
+
61
+ ## Reference Implementation
62
+
63
+ From `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`:
64
+
65
+ ```python
66
+ class OpenAlexFetchTool(ToolRunner):
67
+ def __init__(self):
68
+ super().__init__(
69
+ ToolSpec(
70
+ name="openalex_fetch",
71
+ description="Fetch OpenAlex work or author",
72
+ inputs={"entity": "TEXT", "identifier": "TEXT"},
73
+ outputs={"result": "JSON"},
74
+ )
75
+ )
76
+
77
+ def run(self, params: dict[str, Any]) -> ExecutionResult:
78
+ entity = params["entity"] # "works", "authors", "venues"
79
+ identifier = params["identifier"]
80
+ base = "https://api.openalex.org"
81
+ url = f"{base}/{entity}/{identifier}"
82
+ resp = requests.get(url, timeout=30)
83
+ return ExecutionResult(success=True, data={"result": resp.json()})
84
+ ```
85
+
86
+ ---
87
+
88
+ ## OpenAlex API Features
89
+
90
+ ### Search Works (Papers)
91
+
92
+ ```python
93
+ # Search for metformin + cancer papers
94
+ url = "https://api.openalex.org/works"
95
+ params = {
96
+ "search": "metformin cancer drug repurposing",
97
+ "filter": "publication_year:>2020,type:article",
98
+ "sort": "cited_by_count:desc",
99
+ "per_page": 50,
100
+ }
101
+ ```
102
+
103
+ ### Rich Filtering
104
+
105
+ ```python
106
+ # Filter examples
107
+ "publication_year:2023"
108
+ "type:article" # vs preprint, book, etc.
109
+ "is_oa:true" # Open access only
110
+ "concepts.id:C71924100" # Papers about "Medicine"
111
+ "authorships.institutions.id:I27837315" # From Harvard
112
+ "cited_by_count:>100" # Highly cited
113
+ "has_fulltext:true" # Full text available
114
+ ```
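+
+ Putting search and filters together, a request might look like the following sketch (plain `httpx`, synchronous for brevity; the `mailto` value is a placeholder):
+
+ ```python
+ import httpx
+
+ def top_repurposing_papers(max_results: int = 25) -> list[str]:
+     """Sketch: highly cited, open-access drug-repurposing articles since 2020."""
+     params = {
+         "search": "metformin cancer drug repurposing",
+         "filter": "publication_year:>2020,type:article,is_oa:true",
+         "sort": "cited_by_count:desc",
+         "per_page": max_results,
+         "mailto": "you@example.com",  # placeholder email for the polite pool
+     }
+     resp = httpx.get("https://api.openalex.org/works", params=params, timeout=30.0)
+     resp.raise_for_status()
+     return [work["title"] for work in resp.json()["results"]]
+ ```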
115
+
116
+ ### What You Get Back
117
+
118
+ ```json
119
+ {
120
+ "id": "W2741809807",
121
+ "title": "Metformin: A candidate drug for...",
122
+ "publication_year": 2023,
123
+ "type": "article",
124
+ "cited_by_count": 45,
125
+ "is_oa": true,
126
+ "primary_location": {
127
+ "source": {"display_name": "Nature Medicine"},
128
+ "pdf_url": "https://...",
129
+ "landing_page_url": "https://..."
130
+ },
131
+ "concepts": [
132
+ {"id": "C71924100", "display_name": "Medicine", "score": 0.95},
133
+ {"id": "C54355233", "display_name": "Pharmacology", "score": 0.88}
134
+ ],
135
+ "authorships": [
136
+ {
137
+ "author": {"id": "A123", "display_name": "John Smith"},
138
+ "institutions": [{"display_name": "Harvard Medical School"}]
139
+ }
140
+ ],
141
+ "referenced_works": ["W123", "W456"], # Citations
142
+ "related_works": ["W789", "W012"] # Similar papers
143
+ }
144
+ ```
145
+
146
+ ---
147
+
148
+ ## Key Advantages Over Current Tools
149
+
150
+ ### 1. Citation Network (We Don't Have This!)
151
+
152
+ ```python
153
+ # Get papers that cite a work
154
+ url = f"https://api.openalex.org/works?filter=cites:{work_id}"
155
+
156
+ # Get papers cited by a work
157
+ # Already in `referenced_works` field
158
+ ```
159
+
160
+ ### 2. Concept Tagging (We Don't Have This!)
161
+
162
+ OpenAlex auto-tags papers with hierarchical concepts:
163
+ - "Medicine" → "Pharmacology" → "Drug Repurposing"
164
+ - Can search by concept, not just keywords
165
+
166
+ ### 3. Author Disambiguation (We Don't Have This!)
167
+
168
+ ```python
169
+ # Find all works by an author
170
+ url = f"https://api.openalex.org/works?filter=authorships.author.id:{author_id}"
171
+ ```
172
+
173
+ ### 4. Institution Tracking
174
+
175
+ ```python
176
+ # Find drug repurposing papers from top institutions
177
+ url = "https://api.openalex.org/works"
178
+ params = {
179
+ "search": "drug repurposing",
180
+ "filter": "authorships.institutions.id:I27837315", # Harvard
181
+ }
182
+ ```
183
+
184
+ ### 5. Related Works
185
+
186
+ Each paper comes with `related_works` - semantically similar papers discovered by OpenAlex's ML.
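+
+ A sketch of how that could be used: fetch one work, then resolve its `related_works` IDs in a single batched query (the example work ID is illustrative, and the batching assumes the `openalex_id` filter with `|` as OR):
+
+ ```python
+ import httpx
+
+ def related_titles(work_id: str = "W2741809807") -> list[str]:
+     """Sketch: resolve a work's related_works into titles."""
+     work = httpx.get(f"https://api.openalex.org/works/{work_id}", timeout=30.0).json()
+     # related_works entries are full URLs like https://openalex.org/W123
+     related_ids = [w.rsplit("/", 1)[-1] for w in work.get("related_works", [])][:10]
+     if not related_ids:
+         return []
+     resp = httpx.get(
+         "https://api.openalex.org/works",
+         params={"filter": f"openalex_id:{'|'.join(related_ids)}", "per_page": 25},
+         timeout=30.0,
+     )
+     return [w["title"] for w in resp.json()["results"]]
+ ```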
187
+
188
+ ---
189
+
190
+ ## Proposed Implementation
191
+
192
+ ### New Tool: `src/tools/openalex.py`
193
+
194
+ ```python
195
+ """OpenAlex search tool for comprehensive scholarly data."""
196
+
197
+ import httpx
198
+ from src.tools.base import SearchTool
199
+ from src.utils.models import Evidence
200
+
201
+ class OpenAlexTool(SearchTool):
202
+ """Search OpenAlex for scholarly works with rich metadata."""
203
+
204
+ name = "openalex"
205
+
206
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
207
+ async with httpx.AsyncClient() as client:
208
+ resp = await client.get(
209
+ "https://api.openalex.org/works",
210
+ params={
211
+ "search": query,
212
+ "filter": "type:article,is_oa:true",
213
+ "sort": "cited_by_count:desc",
214
+ "per_page": max_results,
215
+ "mailto": "deepcritical@example.com", # Polite pool
216
+ },
217
+ )
218
+ data = resp.json()
219
+
220
+ return [
221
+ Evidence(
222
+ source="openalex",
223
+ title=work["title"],
224
+ abstract=work.get("abstract", ""),
225
+ url=work["primary_location"]["landing_page_url"],
226
+ metadata={
227
+ "cited_by_count": work["cited_by_count"],
228
+ "concepts": [c["display_name"] for c in work["concepts"][:5]],
229
+ "is_open_access": work["is_oa"],
230
+ "pdf_url": work["primary_location"].get("pdf_url"),
231
+ },
232
+ )
233
+ for work in data["results"]
234
+ ]
235
+ ```
236
+
237
+ ---
238
+
239
+ ## Rate Limits
240
+
241
+ OpenAlex is **extremely generous**:
242
+
243
+ - No hard rate limit documented
244
+ - Recommended: <100,000 requests/day
245
+ - **Polite pool**: Add `mailto=your@email.com` param for faster responses
246
+ - No API key required (optional for priority support)
247
+
248
+ ---
249
+
250
+ ## Should We Add OpenAlex?
251
+
252
+ ### Arguments FOR
253
+
254
+ 1. **Already in reference repo** - proven pattern
255
+ 2. **Richer data** - citations, concepts, authors
256
+ 3. **Single source** - reduces API complexity
257
+ 4. **Free & open** - no keys, no limits
258
+ 5. **Institution adoption** - Leiden, Sorbonne switched to it
259
+
260
+ ### Arguments AGAINST
261
+
262
+ 1. **Adds complexity** - another data source
263
+ 2. **Overlap** - duplicates some PubMed data
264
+ 3. **Not biomedical-focused** - covers all disciplines
265
+ 4. **No full text** - still need PMC/Europe PMC for that
266
+
267
+ ### Recommendation
268
+
269
+ **Add OpenAlex as a 4th source**, don't replace existing tools.
270
+
271
+ Use it for:
272
+ - Citation network analysis
273
+ - Concept-based discovery
274
+ - High-impact paper finding
275
+ - Author/institution tracking
276
+
277
+ Keep PubMed, ClinicalTrials, Europe PMC for:
278
+ - Authoritative biomedical search
279
+ - Clinical trial data
280
+ - Full-text access
281
+ - Preprint tracking
282
+
283
+ ---
284
+
285
+ ## Implementation Priority
286
+
287
+ | Task | Effort | Value |
288
+ |------|--------|-------|
289
+ | Basic search | Low | High |
290
+ | Citation network | Medium | Very High |
291
+ | Concept filtering | Low | High |
292
+ | Related works | Low | High |
293
+ | Author tracking | Medium | Medium |
294
+
295
+ ---
296
+
297
+ ## Sources
298
+
299
+ - [OpenAlex Documentation](https://docs.openalex.org)
300
+ - [OpenAlex API Overview](https://docs.openalex.org/api)
301
+ - [OpenAlex Wikipedia](https://en.wikipedia.org/wiki/OpenAlex)
302
+ - [Leiden University Announcement](https://www.leidenranking.com/information/openalex)
303
+ - [OpenAlex: A fully-open index (Paper)](https://arxiv.org/abs/2205.01833)
docs/brainstorming/implementation/15_PHASE_OPENALEX.md ADDED
@@ -0,0 +1,603 @@
1
+ # Phase 15: OpenAlex Integration
2
+
3
+ **Priority**: HIGH - Biggest bang for buck
4
+ **Effort**: ~2-3 hours
5
+ **Dependencies**: None (existing codebase patterns sufficient)
6
+
7
+ ---
8
+
9
+ ## Prerequisites (COMPLETED)
10
+
11
+ The following model changes have been implemented to support this integration:
12
+
13
+ 1. **`SourceName` Literal Updated** (`src/utils/models.py:9`)
14
+ ```python
15
+ SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
16
+ ```
17
+ - Without this, `source="openalex"` would fail Pydantic validation
18
+
19
+ 2. **`Evidence.metadata` Field Added** (`src/utils/models.py:39-42`)
20
+ ```python
21
+ metadata: dict[str, Any] = Field(
22
+ default_factory=dict,
23
+ description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
24
+ )
25
+ ```
26
+ - Required for storing `cited_by_count`, `concepts`, etc.
27
+ - Model is still frozen - metadata must be passed at construction time
28
+
29
+ 3. **`__init__.py` Exports Updated** (`src/tools/__init__.py`)
30
+ - All tools are now exported: `ClinicalTrialsTool`, `EuropePMCTool`, `PubMedTool`
31
+ - OpenAlexTool should be added here after implementation
32
+
33
+ ---
34
+
35
+ ## Overview
36
+
37
+ Add OpenAlex as a 4th data source for comprehensive scholarly data including:
38
+ - Citation networks (who cites whom)
39
+ - Concept tagging (hierarchical topic classification)
40
+ - Author disambiguation
41
+ - 209M+ works indexed
42
+
43
+ **Why OpenAlex?**
44
+ - Free, no API key required
45
+ - Already implemented in reference repo
46
+ - Provides citation data we don't have
47
+ - Aggregates PubMed + preprints + more
48
+
49
+ ---
50
+
51
+ ## TDD Implementation Plan
52
+
53
+ ### Step 1: Write the Tests First
54
+
55
+ **File**: `tests/unit/tools/test_openalex.py`
56
+
57
+ ```python
58
+ """Tests for OpenAlex search tool."""
59
+
60
+ import pytest
61
+ import respx
62
+ from httpx import Response
63
+
64
+ from src.tools.openalex import OpenAlexTool
65
+ from src.utils.models import Evidence
66
+
67
+
68
+ class TestOpenAlexTool:
69
+ """Test suite for OpenAlex search functionality."""
70
+
71
+ @pytest.fixture
72
+ def tool(self) -> OpenAlexTool:
73
+ return OpenAlexTool()
74
+
75
+ def test_name_property(self, tool: OpenAlexTool) -> None:
76
+ """Tool should identify itself as 'openalex'."""
77
+ assert tool.name == "openalex"
78
+
79
+ @respx.mock
80
+ @pytest.mark.asyncio
81
+ async def test_search_returns_evidence(self, tool: OpenAlexTool) -> None:
82
+ """Search should return list of Evidence objects."""
83
+ mock_response = {
84
+ "results": [
85
+ {
86
+ "id": "W2741809807",
87
+ "title": "Metformin and cancer: A systematic review",
88
+ "publication_year": 2023,
89
+ "cited_by_count": 45,
90
+ "type": "article",
91
+ "is_oa": True,
92
+ "primary_location": {
93
+ "source": {"display_name": "Nature Medicine"},
94
+ "landing_page_url": "https://doi.org/10.1038/example",
95
+ "pdf_url": None,
96
+ },
97
+ "abstract_inverted_index": {
98
+ "Metformin": [0],
99
+ "shows": [1],
100
+ "anticancer": [2],
101
+ "effects": [3],
102
+ },
103
+ "concepts": [
104
+ {"display_name": "Medicine", "score": 0.95},
105
+ {"display_name": "Oncology", "score": 0.88},
106
+ ],
107
+ "authorships": [
108
+ {
109
+ "author": {"display_name": "John Smith"},
110
+ "institutions": [{"display_name": "Harvard"}],
111
+ }
112
+ ],
113
+ }
114
+ ]
115
+ }
116
+
117
+ respx.get("https://api.openalex.org/works").mock(
118
+ return_value=Response(200, json=mock_response)
119
+ )
120
+
121
+ results = await tool.search("metformin cancer", max_results=10)
122
+
123
+ assert len(results) == 1
124
+ assert isinstance(results[0], Evidence)
125
+ assert "Metformin and cancer" in results[0].citation.title
126
+ assert results[0].citation.source == "openalex"
127
+
128
+ @respx.mock
129
+ @pytest.mark.asyncio
130
+ async def test_search_empty_results(self, tool: OpenAlexTool) -> None:
131
+ """Search with no results should return empty list."""
132
+ respx.get("https://api.openalex.org/works").mock(
133
+ return_value=Response(200, json={"results": []})
134
+ )
135
+
136
+ results = await tool.search("xyznonexistentquery123")
137
+ assert results == []
138
+
139
+ @respx.mock
140
+ @pytest.mark.asyncio
141
+ async def test_search_handles_missing_abstract(self, tool: OpenAlexTool) -> None:
142
+ """Tool should handle papers without abstracts."""
143
+ mock_response = {
144
+ "results": [
145
+ {
146
+ "id": "W123",
147
+ "title": "Paper without abstract",
148
+ "publication_year": 2023,
149
+ "cited_by_count": 10,
150
+ "type": "article",
151
+ "is_oa": False,
152
+ "primary_location": {
153
+ "source": {"display_name": "Journal"},
154
+ "landing_page_url": "https://example.com",
155
+ },
156
+ "abstract_inverted_index": None,
157
+ "concepts": [],
158
+ "authorships": [],
159
+ }
160
+ ]
161
+ }
162
+
163
+ respx.get("https://api.openalex.org/works").mock(
164
+ return_value=Response(200, json=mock_response)
165
+ )
166
+
167
+ results = await tool.search("test query")
168
+ assert len(results) == 1
169
+ assert results[0].content == "" # No abstract
170
+
171
+ @respx.mock
172
+ @pytest.mark.asyncio
173
+ async def test_search_extracts_citation_count(self, tool: OpenAlexTool) -> None:
174
+ """Citation count should be in metadata."""
175
+ mock_response = {
176
+ "results": [
177
+ {
178
+ "id": "W456",
179
+ "title": "Highly cited paper",
180
+ "publication_year": 2020,
181
+ "cited_by_count": 500,
182
+ "type": "article",
183
+ "is_oa": True,
184
+ "primary_location": {
185
+ "source": {"display_name": "Science"},
186
+ "landing_page_url": "https://example.com",
187
+ },
188
+ "abstract_inverted_index": {"Test": [0]},
189
+ "concepts": [],
190
+ "authorships": [],
191
+ }
192
+ ]
193
+ }
194
+
195
+ respx.get("https://api.openalex.org/works").mock(
196
+ return_value=Response(200, json=mock_response)
197
+ )
198
+
199
+ results = await tool.search("highly cited")
200
+ assert results[0].metadata["cited_by_count"] == 500
201
+
202
+ @respx.mock
203
+ @pytest.mark.asyncio
204
+ async def test_search_extracts_concepts(self, tool: OpenAlexTool) -> None:
205
+ """Concepts should be extracted for semantic discovery."""
206
+ mock_response = {
207
+ "results": [
208
+ {
209
+ "id": "W789",
210
+ "title": "Drug repurposing study",
211
+ "publication_year": 2023,
212
+ "cited_by_count": 25,
213
+ "type": "article",
214
+ "is_oa": True,
215
+ "primary_location": {
216
+ "source": {"display_name": "PLOS ONE"},
217
+ "landing_page_url": "https://example.com",
218
+ },
219
+ "abstract_inverted_index": {"Drug": [0], "repurposing": [1]},
220
+ "concepts": [
221
+ {"display_name": "Pharmacology", "score": 0.92},
222
+ {"display_name": "Drug Discovery", "score": 0.85},
223
+ {"display_name": "Medicine", "score": 0.80},
224
+ ],
225
+ "authorships": [],
226
+ }
227
+ ]
228
+ }
229
+
230
+ respx.get("https://api.openalex.org/works").mock(
231
+ return_value=Response(200, json=mock_response)
232
+ )
233
+
234
+ results = await tool.search("drug repurposing")
235
+ assert "Pharmacology" in results[0].metadata["concepts"]
236
+ assert "Drug Discovery" in results[0].metadata["concepts"]
237
+
238
+ @respx.mock
239
+ @pytest.mark.asyncio
240
+ async def test_search_api_error_raises_search_error(
241
+ self, tool: OpenAlexTool
242
+ ) -> None:
243
+ """API errors should raise SearchError."""
244
+ from src.utils.exceptions import SearchError
245
+
246
+ respx.get("https://api.openalex.org/works").mock(
247
+ return_value=Response(500, text="Internal Server Error")
248
+ )
249
+
250
+ with pytest.raises(SearchError):
251
+ await tool.search("test query")
252
+
253
+ def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None:
254
+ """Test abstract reconstruction from inverted index."""
255
+ inverted_index = {
256
+ "Metformin": [0, 5],
257
+ "is": [1],
258
+ "a": [2],
259
+ "diabetes": [3],
260
+ "drug": [4],
261
+ "effective": [6],
262
+ }
263
+ abstract = tool._reconstruct_abstract(inverted_index)
264
+ assert abstract == "Metformin is a diabetes drug Metformin effective"
265
+ ```
266
+
267
+ ---
268
+
269
+ ### Step 2: Create the Implementation
270
+
271
+ **File**: `src/tools/openalex.py`
272
+
273
+ ```python
274
+ """OpenAlex search tool for comprehensive scholarly data."""
275
+
276
+ from typing import Any
277
+
278
+ import httpx
279
+ from tenacity import retry, stop_after_attempt, wait_exponential
280
+
281
+ from src.utils.exceptions import SearchError
282
+ from src.utils.models import Citation, Evidence
283
+
284
+
285
+ class OpenAlexTool:
286
+ """
287
+ Search OpenAlex for scholarly works with rich metadata.
288
+
289
+ OpenAlex provides:
290
+ - 209M+ scholarly works
291
+ - Citation counts and networks
292
+ - Concept tagging (hierarchical)
293
+ - Author disambiguation
294
+ - Open access links
295
+
296
+ API Docs: https://docs.openalex.org/
297
+ """
298
+
299
+ BASE_URL = "https://api.openalex.org/works"
300
+
301
+ def __init__(self, email: str | None = None) -> None:
302
+ """
303
+ Initialize OpenAlex tool.
304
+
305
+ Args:
306
+ email: Optional email for polite pool (faster responses)
307
+ """
308
+ self.email = email or "deepcritical@example.com"
309
+
310
+ @property
311
+ def name(self) -> str:
312
+ return "openalex"
313
+
314
+ @retry(
315
+ stop=stop_after_attempt(3),
316
+ wait=wait_exponential(multiplier=1, min=1, max=10),
317
+ reraise=True,
318
+ )
319
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
320
+ """
321
+ Search OpenAlex for scholarly works.
322
+
323
+ Args:
324
+ query: Search terms
325
+ max_results: Maximum results to return (max 200 per request)
326
+
327
+ Returns:
328
+ List of Evidence objects with citation metadata
329
+
330
+ Raises:
331
+ SearchError: If API request fails
332
+ """
333
+ params = {
334
+ "search": query,
335
+ "filter": "type:article", # Only peer-reviewed articles
336
+ "sort": "cited_by_count:desc", # Most cited first
337
+ "per_page": min(max_results, 200),
338
+ "mailto": self.email, # Polite pool for faster responses
339
+ }
340
+
341
+ async with httpx.AsyncClient(timeout=30.0) as client:
342
+ try:
343
+ response = await client.get(self.BASE_URL, params=params)
344
+ response.raise_for_status()
345
+
346
+ data = response.json()
347
+ results = data.get("results", [])
348
+
349
+ return [self._to_evidence(work) for work in results[:max_results]]
350
+
351
+ except httpx.HTTPStatusError as e:
352
+ raise SearchError(f"OpenAlex API error: {e}") from e
353
+ except httpx.RequestError as e:
354
+ raise SearchError(f"OpenAlex connection failed: {e}") from e
355
+
356
+ def _to_evidence(self, work: dict[str, Any]) -> Evidence:
357
+ """Convert OpenAlex work to Evidence object."""
358
+ title = work.get("title", "Untitled")
359
+ pub_year = work.get("publication_year", "Unknown")
360
+ cited_by = work.get("cited_by_count", 0)
361
+ is_oa = work.get("is_oa", False)
362
+
363
+ # Reconstruct abstract from inverted index
364
+ abstract_index = work.get("abstract_inverted_index")
365
+ abstract = self._reconstruct_abstract(abstract_index) if abstract_index else ""
366
+
367
+ # Extract concepts (top 5)
368
+ concepts = [
369
+ c.get("display_name", "")
370
+ for c in work.get("concepts", [])[:5]
371
+ if c.get("display_name")
372
+ ]
373
+
374
+ # Extract authors (top 5)
375
+ authorships = work.get("authorships", [])
376
+ authors = [
377
+ a.get("author", {}).get("display_name", "")
378
+ for a in authorships[:5]
379
+ if a.get("author", {}).get("display_name")
380
+ ]
381
+
382
+ # Get URL
383
+ primary_loc = work.get("primary_location") or {}
384
+ url = primary_loc.get("landing_page_url", "")
385
+ if not url:
386
+ # Fallback to OpenAlex page
387
+ work_id = work.get("id", "").replace("https://openalex.org/", "")
388
+ url = f"https://openalex.org/{work_id}"
389
+
390
+ return Evidence(
391
+ content=abstract[:2000],
392
+ citation=Citation(
393
+ source="openalex",
394
+ title=title[:500],
395
+ url=url,
396
+ date=str(pub_year),
397
+ authors=authors,
398
+ ),
399
+ relevance=min(0.9, 0.5 + (cited_by / 1000)), # Boost by citations
400
+ metadata={
401
+ "cited_by_count": cited_by,
402
+ "is_open_access": is_oa,
403
+ "concepts": concepts,
404
+ "pdf_url": primary_loc.get("pdf_url"),
405
+ },
406
+ )
407
+
408
+ def _reconstruct_abstract(
409
+ self, inverted_index: dict[str, list[int]]
410
+ ) -> str:
411
+ """
412
+ Reconstruct abstract from OpenAlex inverted index format.
413
+
414
+ OpenAlex stores abstracts as {"word": [position1, position2, ...]}.
415
+ This rebuilds the original text.
416
+ """
417
+ if not inverted_index:
418
+ return ""
419
+
420
+ # Build position -> word mapping
421
+ position_word: dict[int, str] = {}
422
+ for word, positions in inverted_index.items():
423
+ for pos in positions:
424
+ position_word[pos] = word
425
+
426
+ # Reconstruct in order
427
+ if not position_word:
428
+ return ""
429
+
430
+ max_pos = max(position_word.keys())
431
+ words = [position_word.get(i, "") for i in range(max_pos + 1)]
432
+ return " ".join(w for w in words if w)
433
+ ```
434
+
435
+ ---
436
+
437
+ ### Step 3: Register in Search Handler
438
+
439
+ **File**: `src/tools/search_handler.py` (add to imports and tool list)
440
+
441
+ ```python
442
+ # Add import
443
+ from src.tools.openalex import OpenAlexTool
444
+
445
+ # Add to _create_tools method
446
+ def _create_tools(self) -> list[SearchTool]:
447
+ return [
448
+ PubMedTool(),
449
+ ClinicalTrialsTool(),
450
+ EuropePMCTool(),
451
+ OpenAlexTool(), # NEW
452
+ ]
453
+ ```
454
+
455
+ ---
456
+
457
+ ### Step 4: Update `__init__.py`
458
+
459
+ **File**: `src/tools/__init__.py`
460
+
461
+ ```python
462
+ from src.tools.openalex import OpenAlexTool
463
+
464
+ __all__ = [
465
+ "PubMedTool",
466
+ "ClinicalTrialsTool",
467
+ "EuropePMCTool",
468
+ "OpenAlexTool", # NEW
469
+ # ...
470
+ ]
471
+ ```
472
+
473
+ ---
474
+
475
+ ## Demo Script
476
+
477
+ **File**: `examples/openalex_demo.py`
478
+
479
+ ```python
480
+ #!/usr/bin/env python3
481
+ """Demo script to verify OpenAlex integration."""
482
+
483
+ import asyncio
484
+ from src.tools.openalex import OpenAlexTool
485
+
486
+
487
+ async def main():
488
+ """Run OpenAlex search demo."""
489
+ tool = OpenAlexTool()
490
+
491
+ print("=" * 60)
492
+ print("OpenAlex Integration Demo")
493
+ print("=" * 60)
494
+
495
+ # Test 1: Basic drug repurposing search
496
+ print("\n[Test 1] Searching for 'metformin cancer drug repurposing'...")
497
+ results = await tool.search("metformin cancer drug repurposing", max_results=5)
498
+
499
+ for i, evidence in enumerate(results, 1):
500
+ print(f"\n--- Result {i} ---")
501
+ print(f"Title: {evidence.citation.title}")
502
+ print(f"Year: {evidence.citation.date}")
503
+ print(f"Citations: {evidence.metadata.get('cited_by_count', 'N/A')}")
504
+ print(f"Concepts: {', '.join(evidence.metadata.get('concepts', []))}")
505
+ print(f"Open Access: {evidence.metadata.get('is_open_access', False)}")
506
+ print(f"URL: {evidence.citation.url}")
507
+ if evidence.content:
508
+ print(f"Abstract: {evidence.content[:200]}...")
509
+
510
+ # Test 2: High-impact papers
511
+ print("\n" + "=" * 60)
512
+ print("[Test 2] Finding highly-cited papers on 'long COVID treatment'...")
513
+ results = await tool.search("long COVID treatment", max_results=3)
514
+
515
+ for evidence in results:
516
+ print(f"\n- {evidence.citation.title}")
517
+ print(f" Citations: {evidence.metadata.get('cited_by_count', 0)}")
518
+
519
+ print("\n" + "=" * 60)
520
+ print("Demo complete!")
521
+
522
+
523
+ if __name__ == "__main__":
524
+ asyncio.run(main())
525
+ ```
526
+
527
+ ---
528
+
529
+ ## Verification Checklist
530
+
531
+ ### Unit Tests
532
+ ```bash
533
+ # Run just OpenAlex tests
534
+ uv run pytest tests/unit/tools/test_openalex.py -v
535
+
536
+ # Expected: All tests pass
537
+ ```
538
+
539
+ ### Integration Test (Manual)
540
+ ```bash
541
+ # Run demo script with real API
542
+ uv run python examples/openalex_demo.py
543
+
544
+ # Expected: Real results from OpenAlex API
545
+ ```
546
+
547
+ ### Full Test Suite
548
+ ```bash
549
+ # Ensure nothing broke
550
+ make check
551
+
552
+ # Expected: All 110+ tests pass, mypy clean
553
+ ```
554
+
555
+ ---
556
+
557
+ ## Success Criteria
558
+
559
+ 1. **Unit tests pass**: All mocked tests in `test_openalex.py` pass
560
+ 2. **Integration works**: Demo script returns real results
561
+ 3. **No regressions**: `make check` passes completely
562
+ 4. **SearchHandler integration**: OpenAlex appears in search results alongside other sources
563
+ 5. **Citation metadata**: Results include `cited_by_count`, `concepts`, `is_open_access`
564
+
565
+ ---
566
+
567
+ ## Future Enhancements (P2)
568
+
569
+ Once basic integration works:
570
+
571
+ 1. **Citation Network Queries**
572
+ ```python
573
+ # Get papers citing a specific work
574
+ async def get_citing_works(self, work_id: str) -> list[Evidence]:
575
+ params = {"filter": f"cites:{work_id}"}
576
+ ...
577
+ ```
578
+
579
+ 2. **Concept-Based Search**
580
+ ```python
581
+ # Search by OpenAlex concept ID
582
+ async def search_by_concept(self, concept_id: str) -> list[Evidence]:
583
+ params = {"filter": f"concepts.id:{concept_id}"}
584
+ ...
585
+ ```
586
+
587
+ 3. **Author Tracking**
588
+ ```python
589
+ # Find all works by an author
590
+ async def search_by_author(self, author_id: str) -> list[Evidence]:
591
+ params = {"filter": f"authorships.author.id:{author_id}"}
592
+ ...
593
+ ```
594
+
595
+ ---
596
+
597
+ ## Notes
598
+
599
+ - OpenAlex is **very generous** with rate limits (no documented hard limit)
600
+ - Adding `mailto` parameter gives priority access (polite pool)
601
+ - Abstract is stored as inverted index - must reconstruct
602
+ - Citation count is a good proxy for paper quality/impact
603
+ - Consider caching responses for repeated queries
docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md ADDED
@@ -0,0 +1,586 @@
1
+ # Phase 16: PubMed Full-Text Retrieval
2
+
3
+ **Priority**: MEDIUM - Enhances evidence quality
4
+ **Effort**: ~3 hours
5
+ **Dependencies**: None (existing PubMed tool sufficient)
6
+
7
+ ---
8
+
9
+ ## Prerequisites (COMPLETED)
10
+
11
+ The `Evidence.metadata` field has been added to `src/utils/models.py` to support:
12
+ ```python
13
+ metadata={"has_fulltext": True}
14
+ ```
15
+
16
+ ---
17
+
18
+ ## Architecture Decision: Constructor Parameter vs Method Parameter
19
+
20
+ **IMPORTANT**: The original spec proposed `include_fulltext` as a method parameter:
21
+ ```python
22
+ # WRONG - SearchHandler won't pass this parameter
23
+ async def search(self, query: str, max_results: int = 10, include_fulltext: bool = False):
24
+ ```
25
+
26
+ **Problem**: `SearchHandler` calls `tool.search(query, max_results)` uniformly across all tools.
27
+ It has no mechanism to pass tool-specific parameters like `include_fulltext`.
28
+
29
+ **Solution**: Use constructor parameter instead:
30
+ ```python
31
+ # CORRECT - Configured at instantiation time
32
+ class PubMedTool:
33
+ def __init__(self, api_key: str | None = None, include_fulltext: bool = False):
34
+ self.include_fulltext = include_fulltext
35
+ ...
36
+ ```
37
+
38
+ This way, you can create a full-text-enabled PubMed tool:
39
+ ```python
40
+ # In orchestrator or wherever tools are created
41
+ tools = [
42
+ PubMedTool(include_fulltext=True), # Full-text enabled
43
+ ClinicalTrialsTool(),
44
+ EuropePMCTool(),
45
+ ]
46
+ ```
47
+
48
+ ---
49
+
50
+ ## Overview
51
+
52
+ Add full-text retrieval for PubMed papers via the BioC API, enabling:
53
+ - Complete paper text for open-access PMC papers
54
+ - Structured sections (intro, methods, results, discussion)
55
+ - Better evidence for LLM synthesis
56
+
57
+ **Why Full-Text?**
58
+ - Abstracts only give ~200-300 words
59
+ - Full text provides detailed methods, results, figures
60
+ - Reference repo already has this implemented
61
+ - Makes LLM judgments more accurate
62
+
63
+ ---
64
+
65
+ ## TDD Implementation Plan
66
+
67
+ ### Step 1: Write the Tests First
68
+
69
+ **File**: `tests/unit/tools/test_pubmed_fulltext.py`
70
+
71
+ ```python
72
+ """Tests for PubMed full-text retrieval."""
73
+
74
+ import pytest
75
+ import respx
76
+ from httpx import Response
77
+
78
+ from src.tools.pubmed import PubMedTool
79
+
80
+
81
+ class TestPubMedFullText:
82
+ """Test suite for PubMed full-text functionality."""
83
+
84
+ @pytest.fixture
85
+ def tool(self) -> PubMedTool:
86
+ return PubMedTool()
87
+
88
+ @respx.mock
89
+ @pytest.mark.asyncio
90
+ async def test_get_pmc_id_success(self, tool: PubMedTool) -> None:
91
+ """Should convert PMID to PMCID for full-text access."""
92
+ mock_response = {
93
+ "records": [
94
+ {
95
+ "pmid": "12345678",
96
+ "pmcid": "PMC1234567",
97
+ }
98
+ ]
99
+ }
100
+
101
+ respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
102
+ return_value=Response(200, json=mock_response)
103
+ )
104
+
105
+ pmcid = await tool.get_pmc_id("12345678")
106
+ assert pmcid == "PMC1234567"
107
+
108
+ @respx.mock
109
+ @pytest.mark.asyncio
110
+ async def test_get_pmc_id_not_in_pmc(self, tool: PubMedTool) -> None:
111
+ """Should return None if paper not in PMC."""
112
+ mock_response = {
113
+ "records": [
114
+ {
115
+ "pmid": "12345678",
116
+ # No pmcid means not in PMC
117
+ }
118
+ ]
119
+ }
120
+
121
+ respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
122
+ return_value=Response(200, json=mock_response)
123
+ )
124
+
125
+ pmcid = await tool.get_pmc_id("12345678")
126
+ assert pmcid is None
127
+
128
+ @respx.mock
129
+ @pytest.mark.asyncio
130
+ async def test_get_fulltext_success(self, tool: PubMedTool) -> None:
131
+ """Should retrieve full text for PMC papers."""
132
+ # Mock BioC API response
133
+ mock_bioc = {
134
+ "documents": [
135
+ {
136
+ "passages": [
137
+ {
138
+ "infons": {"section_type": "INTRO"},
139
+ "text": "Introduction text here.",
140
+ },
141
+ {
142
+ "infons": {"section_type": "METHODS"},
143
+ "text": "Methods description here.",
144
+ },
145
+ {
146
+ "infons": {"section_type": "RESULTS"},
147
+ "text": "Results summary here.",
148
+ },
149
+ {
150
+ "infons": {"section_type": "DISCUSS"},
151
+ "text": "Discussion and conclusions.",
152
+ },
153
+ ]
154
+ }
155
+ ]
156
+ }
157
+
158
+ respx.get(
159
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
160
+ ).mock(return_value=Response(200, json=mock_bioc))
161
+
162
+ fulltext = await tool.get_fulltext("12345678")
163
+
164
+ assert fulltext is not None
165
+ assert "Introduction text here" in fulltext
166
+ assert "Methods description here" in fulltext
167
+ assert "Results summary here" in fulltext
168
+
169
+ @respx.mock
170
+ @pytest.mark.asyncio
171
+ async def test_get_fulltext_not_available(self, tool: PubMedTool) -> None:
172
+ """Should return None if full text not available."""
173
+ respx.get(
174
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/99999999/unicode"
175
+ ).mock(return_value=Response(404))
176
+
177
+ fulltext = await tool.get_fulltext("99999999")
178
+ assert fulltext is None
179
+
180
+ @respx.mock
181
+ @pytest.mark.asyncio
182
+ async def test_get_fulltext_structured(self, tool: PubMedTool) -> None:
183
+ """Should return structured sections dict."""
184
+ mock_bioc = {
185
+ "documents": [
186
+ {
187
+ "passages": [
188
+ {"infons": {"section_type": "INTRO"}, "text": "Intro..."},
189
+ {"infons": {"section_type": "METHODS"}, "text": "Methods..."},
190
+ {"infons": {"section_type": "RESULTS"}, "text": "Results..."},
191
+ {"infons": {"section_type": "DISCUSS"}, "text": "Discussion..."},
192
+ ]
193
+ }
194
+ ]
195
+ }
196
+
197
+ respx.get(
198
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
199
+ ).mock(return_value=Response(200, json=mock_bioc))
200
+
201
+ sections = await tool.get_fulltext_structured("12345678")
202
+
203
+ assert sections is not None
204
+ assert "introduction" in sections
205
+ assert "methods" in sections
206
+ assert "results" in sections
207
+ assert "discussion" in sections
208
+
209
+ @respx.mock
210
+ @pytest.mark.asyncio
211
+ async def test_search_with_fulltext_enabled(self) -> None:
212
+ """Search should include full text when tool is configured for it."""
213
+ # Create tool WITH full-text enabled via constructor
214
+ tool = PubMedTool(include_fulltext=True)
215
+
216
+ # Mock esearch
217
+ respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi").mock(
218
+ return_value=Response(
219
+ 200, json={"esearchresult": {"idlist": ["12345678"]}}
220
+ )
221
+ )
222
+
223
+ # Mock efetch (abstract)
224
+ mock_xml = """
225
+ <PubmedArticleSet>
226
+ <PubmedArticle>
227
+ <MedlineCitation>
228
+ <PMID>12345678</PMID>
229
+ <Article>
230
+ <ArticleTitle>Test Paper</ArticleTitle>
231
+ <Abstract><AbstractText>Short abstract.</AbstractText></Abstract>
232
+ <AuthorList><Author><LastName>Smith</LastName></Author></AuthorList>
233
+ </Article>
234
+ </MedlineCitation>
235
+ </PubmedArticle>
236
+ </PubmedArticleSet>
237
+ """
238
+ respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi").mock(
239
+ return_value=Response(200, text=mock_xml)
240
+ )
241
+
242
+ # Mock ID converter
243
+ respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
244
+ return_value=Response(
245
+ 200, json={"records": [{"pmid": "12345678", "pmcid": "PMC1234567"}]}
246
+ )
247
+ )
248
+
249
+ # Mock BioC full text
250
+ mock_bioc = {
251
+ "documents": [
252
+ {
253
+ "passages": [
254
+ {"infons": {"section_type": "INTRO"}, "text": "Full intro..."},
255
+ ]
256
+ }
257
+ ]
258
+ }
259
+ respx.get(
260
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
261
+ ).mock(return_value=Response(200, json=mock_bioc))
262
+
263
+ # NOTE: No include_fulltext param - it's set via constructor
264
+ results = await tool.search("test", max_results=1)
265
+
266
+ assert len(results) == 1
267
+ # Full text should be appended or replace abstract
268
+ assert "Full intro" in results[0].content or "Short abstract" in results[0].content
269
+ ```
270
+
271
+ ---
272
+
273
+ ### Step 2: Implement Full-Text Methods
274
+
275
+ **File**: `src/tools/pubmed.py` (additions to existing class)
276
+
277
+ ```python
278
+ # Add these methods to PubMedTool class
279
+
280
+ async def get_pmc_id(self, pmid: str) -> str | None:
281
+ """
282
+ Convert PMID to PMCID for full-text access.
283
+
284
+ Args:
285
+ pmid: PubMed ID
286
+
287
+ Returns:
288
+ PMCID if paper is in PMC, None otherwise
289
+ """
290
+ url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
291
+ params = {"ids": pmid, "format": "json"}
292
+
293
+ async with httpx.AsyncClient(timeout=30.0) as client:
294
+ try:
295
+ response = await client.get(url, params=params)
296
+ response.raise_for_status()
297
+ data = response.json()
298
+
299
+ records = data.get("records", [])
300
+ if records and records[0].get("pmcid"):
301
+ return records[0]["pmcid"]
302
+ return None
303
+
304
+ except httpx.HTTPError:
305
+ return None
306
+
307
+
308
+ async def get_fulltext(self, pmid: str) -> str | None:
309
+ """
310
+ Get full text for a PubMed paper via BioC API.
311
+
312
+ Only works for open-access papers in PubMed Central.
313
+
314
+ Args:
315
+ pmid: PubMed ID
316
+
317
+ Returns:
318
+ Full text as string, or None if not available
319
+ """
320
+ url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
321
+
322
+ async with httpx.AsyncClient(timeout=60.0) as client:
323
+ try:
324
+ response = await client.get(url)
325
+ if response.status_code == 404:
326
+ return None
327
+ response.raise_for_status()
328
+ data = response.json()
329
+
330
+ # Extract text from all passages
331
+ documents = data.get("documents", [])
332
+ if not documents:
333
+ return None
334
+
335
+ passages = documents[0].get("passages", [])
336
+ text_parts = [p.get("text", "") for p in passages if p.get("text")]
337
+
338
+ return "\n\n".join(text_parts) if text_parts else None
339
+
340
+ except httpx.HTTPError:
341
+ return None
342
+
343
+
344
+ async def get_fulltext_structured(self, pmid: str) -> dict[str, str] | None:
345
+ """
346
+ Get structured full text with sections.
347
+
348
+ Args:
349
+ pmid: PubMed ID
350
+
351
+ Returns:
352
+ Dict mapping section names to text, or None if not available
353
+ """
354
+ url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
355
+
356
+ async with httpx.AsyncClient(timeout=60.0) as client:
357
+ try:
358
+ response = await client.get(url)
359
+ if response.status_code == 404:
360
+ return None
361
+ response.raise_for_status()
362
+ data = response.json()
363
+
364
+ documents = data.get("documents", [])
365
+ if not documents:
366
+ return None
367
+
368
+ # Map section types to readable names
369
+ section_map = {
370
+ "INTRO": "introduction",
371
+ "METHODS": "methods",
372
+ "RESULTS": "results",
373
+ "DISCUSS": "discussion",
374
+ "CONCL": "conclusion",
375
+ "ABSTRACT": "abstract",
376
+ }
377
+
378
+ sections: dict[str, list[str]] = {}
379
+ for passage in documents[0].get("passages", []):
380
+ section_type = passage.get("infons", {}).get("section_type", "other")
381
+ section_name = section_map.get(section_type, "other")
382
+ text = passage.get("text", "")
383
+
384
+ if text:
385
+ if section_name not in sections:
386
+ sections[section_name] = []
387
+ sections[section_name].append(text)
388
+
389
+ # Join multiple passages per section
390
+ return {k: "\n\n".join(v) for k, v in sections.items()}
391
+
392
+ except httpx.HTTPError:
393
+ return None
394
+ ```
395
+
396
+ ---
397
+
398
+ ### Step 3: Update Constructor and Search Method
399
+
400
+ Add full-text flag to constructor and update search to use it:
401
+
402
+ ```python
403
+ class PubMedTool:
404
+ """Search tool for PubMed/NCBI."""
405
+
406
+ def __init__(
407
+ self,
408
+ api_key: str | None = None,
409
+ include_fulltext: bool = False, # NEW CONSTRUCTOR PARAM
410
+ ) -> None:
411
+ self.api_key = api_key or settings.ncbi_api_key
412
+ if self.api_key == "your-ncbi-key-here":
413
+ self.api_key = None
414
+ self._last_request_time = 0.0
415
+ self.include_fulltext = include_fulltext # Store for use in search()
416
+
417
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
418
+ """
419
+ Search PubMed and return evidence.
420
+
421
+ Note: Full-text enrichment is controlled by constructor parameter,
422
+ not method parameter, because SearchHandler doesn't pass extra args.
423
+ """
424
+ # ... existing search logic ...
425
+
426
+ evidence_list = self._parse_pubmed_xml(fetch_resp.text)
427
+
428
+ # Optionally enrich with full text (if configured at construction)
429
+ if self.include_fulltext:
430
+ evidence_list = await self._enrich_with_fulltext(evidence_list)
431
+
432
+ return evidence_list
433
+
434
+
435
+ async def _enrich_with_fulltext(
436
+ self, evidence_list: list[Evidence]
437
+ ) -> list[Evidence]:
438
+ """Attempt to add full text to evidence items."""
439
+ enriched = []
440
+
441
+ for evidence in evidence_list:
442
+ # Extract PMID from URL
443
+ url = evidence.citation.url
444
+ pmid = url.rstrip("/").split("/")[-1] if url else None
445
+
446
+ if pmid:
447
+ fulltext = await self.get_fulltext(pmid)
448
+ if fulltext:
449
+ # Replace abstract with full text (truncated)
450
+ evidence = Evidence(
451
+ content=fulltext[:8000], # Larger limit for full text
452
+ citation=evidence.citation,
453
+ relevance=evidence.relevance,
454
+ metadata={
455
+ **evidence.metadata,
456
+ "has_fulltext": True,
457
+ },
458
+ )
459
+
460
+ enriched.append(evidence)
461
+
462
+ return enriched
463
+ ```
464
+
465
+ ---
466
+
467
+ ## Demo Script
468
+
469
+ **File**: `examples/pubmed_fulltext_demo.py`
470
+
471
+ ```python
472
+ #!/usr/bin/env python3
473
+ """Demo script to verify PubMed full-text retrieval."""
474
+
475
+ import asyncio
476
+ from src.tools.pubmed import PubMedTool
477
+
478
+
479
+ async def main():
480
+ """Run PubMed full-text demo."""
481
+ tool = PubMedTool()
482
+
483
+ print("=" * 60)
484
+ print("PubMed Full-Text Demo")
485
+ print("=" * 60)
486
+
487
+ # Test 1: Convert PMID to PMCID
488
+ print("\n[Test 1] Converting PMID to PMCID...")
489
+ # Use a known open-access paper
490
+ test_pmid = "34450029" # Example: COVID-related open-access paper
491
+ pmcid = await tool.get_pmc_id(test_pmid)
492
+ print(f"PMID {test_pmid} -> PMCID: {pmcid or 'Not in PMC'}")
493
+
494
+ # Test 2: Get full text
495
+ print("\n[Test 2] Fetching full text...")
496
+ if pmcid:
497
+ fulltext = await tool.get_fulltext(test_pmid)
498
+ if fulltext:
499
+ print(f"Full text length: {len(fulltext)} characters")
500
+ print(f"Preview: {fulltext[:500]}...")
501
+ else:
502
+ print("Full text not available")
503
+
504
+ # Test 3: Get structured sections
505
+ print("\n[Test 3] Fetching structured sections...")
506
+ if pmcid:
507
+ sections = await tool.get_fulltext_structured(test_pmid)
508
+ if sections:
509
+ print("Available sections:")
510
+ for section, text in sections.items():
511
+ print(f" - {section}: {len(text)} chars")
512
+ else:
513
+ print("Structured text not available")
514
+
515
+ # Test 4: Search with full text
516
+ print("\n[Test 4] Search with full-text enrichment...")
517
+ fulltext_tool = PubMedTool(include_fulltext=True)
518
+ results = await fulltext_tool.search(
519
+ "metformin cancer open access",
520
+ max_results=3,
521
+ )
522
+
523
+ for i, evidence in enumerate(results, 1):
524
+ has_ft = evidence.metadata.get("has_fulltext", False)
525
+ print(f"\n--- Result {i} ---")
526
+ print(f"Title: {evidence.citation.title}")
527
+ print(f"Has Full Text: {has_ft}")
528
+ print(f"Content Length: {len(evidence.content)} chars")
529
+
530
+ print("\n" + "=" * 60)
531
+ print("Demo complete!")
532
+
533
+
534
+ if __name__ == "__main__":
535
+ asyncio.run(main())
536
+ ```
537
+
538
+ ---
539
+
540
+ ## Verification Checklist
541
+
542
+ ### Unit Tests
543
+ ```bash
544
+ # Run full-text tests
545
+ uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v
546
+
547
+ # Run all PubMed tests
548
+ uv run pytest tests/unit/tools/test_pubmed.py -v
549
+
550
+ # Expected: All tests pass
551
+ ```
552
+
553
+ ### Integration Test (Manual)
554
+ ```bash
555
+ # Run demo with real API
556
+ uv run python examples/pubmed_fulltext_demo.py
557
+
558
+ # Expected: Real full text from PMC papers
559
+ ```
560
+
561
+ ### Full Test Suite
562
+ ```bash
563
+ make check
564
+ # Expected: All tests pass, mypy clean
565
+ ```
566
+
567
+ ---
568
+
569
+ ## Success Criteria
570
+
571
+ 1. **ID Conversion works**: PMID -> PMCID conversion successful
572
+ 2. **Full text retrieval works**: BioC API returns paper text
573
+ 3. **Structured sections work**: Can get intro/methods/results/discussion separately
574
+ 4. **Search integration works**: a `PubMedTool(include_fulltext=True)` instance enriches results with full text
575
+ 5. **No regressions**: Existing tests still pass
576
+ 6. **Graceful degradation**: Non-PMC papers still return abstracts
577
+
578
+ ---
579
+
580
+ ## Notes
581
+
582
+ - Only ~30% of PubMed papers have full text in PMC
583
+ - BioC API has no documented rate limit, but be respectful
584
+ - Full text can be very long - truncate appropriately
585
+ - Consider caching full text responses (they don't change); see the sketch below
586
+ - Timeout should be longer for full text (60s vs 30s)
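+
+ As flagged in the caching note above, a tiny in-memory cache in front of `get_fulltext` avoids refetching identical BioC responses. A minimal sketch (the wrapper class is illustrative, not part of the plan above):
+
+ ```python
+ class CachedFullText:
+     """Sketch: memoize full-text lookups for a PubMedTool-like object."""
+
+     def __init__(self, tool) -> None:
+         self._tool = tool
+         self._cache: dict[str, str | None] = {}
+
+     async def get_fulltext(self, pmid: str) -> str | None:
+         if pmid not in self._cache:
+             self._cache[pmid] = await self._tool.get_fulltext(pmid)
+         return self._cache[pmid]
+ ```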
docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md ADDED
@@ -0,0 +1,540 @@
1
+ # Phase 17: Rate Limiting with `limits` Library
2
+
3
+ **Priority**: P0 CRITICAL - Prevents API blocks
4
+ **Effort**: ~1 hour
5
+ **Dependencies**: None
6
+
7
+ ---
8
+
9
+ ## CRITICAL: Async Safety Requirements
10
+
11
+ **WARNING**: The rate limiter MUST be async-safe. Blocking the event loop will freeze:
12
+ - The Gradio UI
13
+ - All parallel searches
14
+ - The orchestrator
15
+
16
+ **Rules**:
17
+ 1. **NEVER use `time.sleep()`** - Always use `await asyncio.sleep()`
18
+ 2. **NEVER use blocking while loops** - Use async-aware polling
19
+ 3. **The `limits` library check is synchronous** - Wrap it carefully
20
+
21
+ The implementation below uses a polling pattern that:
22
+ - Checks the limit (synchronous, fast)
23
+ - If exceeded, `await asyncio.sleep()` (non-blocking)
24
+ - Retry the check
25
+
26
+ **Alternative**: If `limits` proves problematic, use `aiolimiter` which is pure-async.
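+
+ For reference, the pure-async fallback is only a few lines. A sketch assuming `aiolimiter`'s `AsyncLimiter(max_rate, time_period)` API; the helper name is illustrative:
+
+ ```python
+ from aiolimiter import AsyncLimiter
+
+ # 3 requests per 1-second window, matching the keyless NCBI limit
+ pubmed_limiter = AsyncLimiter(3, 1)
+
+ async def limited_get(client, url, **kwargs):
+     async with pubmed_limiter:  # suspends the coroutine instead of blocking
+         return await client.get(url, **kwargs)
+ ```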
27
+
28
+ ---
29
+
30
+ ## Overview
31
+
32
+ Replace naive `asyncio.sleep` rate limiting with proper rate limiter using the `limits` library, which provides:
33
+ - Moving window rate limiting
34
+ - Per-API configurable limits
35
+ - Thread-safe storage
36
+ - Already used in reference repo
37
+
38
+ **Why This Matters?**
39
+ - NCBI will block us without proper rate limiting (3/sec without key, 10/sec with)
40
+ - Current implementation only has simple sleep delay
41
+ - Need coordinated limits across all PubMed calls
42
+ - Professional-grade rate limiting prevents production issues
43
+
44
+ ---
45
+
46
+ ## Current State
47
+
48
+ ### What We Have (`src/tools/pubmed.py:20-21, 34-41`)
49
+
50
+ ```python
51
+ RATE_LIMIT_DELAY = 0.34 # ~3 requests/sec without API key
52
+
53
+ async def _rate_limit(self) -> None:
54
+ """Enforce NCBI rate limiting."""
55
+ loop = asyncio.get_running_loop()
56
+ now = loop.time()
57
+ elapsed = now - self._last_request_time
58
+ if elapsed < self.RATE_LIMIT_DELAY:
59
+ await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed)
60
+ self._last_request_time = loop.time()
61
+ ```
62
+
63
+ ### Problems
64
+
65
+ 1. **Not shared across instances**: Each `PubMedTool()` has its own counter
66
+ 2. **Simple delay vs moving window**: Doesn't handle bursts properly
67
+ 3. **Hardcoded rate**: Doesn't adapt to API key presence
68
+ 4. **No backoff on 429**: Just retries blindly
69
+
70
+ ---
71
+
72
+ ## TDD Implementation Plan
73
+
74
+ ### Step 1: Add Dependency
75
+
76
+ **File**: `pyproject.toml`
77
+
78
+ ```toml
79
+ dependencies = [
80
+ # ... existing deps ...
81
+ "limits>=3.0",
82
+ ]
83
+ ```
84
+
85
+ Then run:
86
+ ```bash
87
+ uv sync
88
+ ```
89
+
90
+ ---
91
+
92
+ ### Step 2: Write the Tests First
93
+
94
+ **File**: `tests/unit/tools/test_rate_limiting.py`
95
+
96
+ ```python
97
+ """Tests for rate limiting functionality."""
98
+
99
+ import asyncio
100
+ import time
101
+
102
+ import pytest
103
+
104
+ from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter
105
+
106
+
107
+ class TestRateLimiter:
108
+ """Test suite for rate limiter."""
109
+
110
+ def test_create_limiter_without_api_key(self) -> None:
111
+ """Should create 3/sec limiter without API key."""
112
+ limiter = RateLimiter(rate="3/second")
113
+ assert limiter.rate == "3/second"
114
+
115
+ def test_create_limiter_with_api_key(self) -> None:
116
+ """Should create 10/sec limiter with API key."""
117
+ limiter = RateLimiter(rate="10/second")
118
+ assert limiter.rate == "10/second"
119
+
120
+ @pytest.mark.asyncio
121
+ async def test_limiter_allows_requests_under_limit(self) -> None:
122
+ """Should allow requests under the rate limit."""
123
+ limiter = RateLimiter(rate="10/second")
124
+
125
+ # 3 requests should all succeed immediately
126
+ for _ in range(3):
127
+ allowed = await limiter.acquire()
128
+ assert allowed is True
129
+
130
+ @pytest.mark.asyncio
131
+ async def test_limiter_blocks_when_exceeded(self) -> None:
132
+ """Should wait when rate limit exceeded."""
133
+ limiter = RateLimiter(rate="2/second")
134
+
135
+ # First 2 should be instant
136
+ await limiter.acquire()
137
+ await limiter.acquire()
138
+
139
+ # Third should block briefly
140
+ start = time.monotonic()
141
+ await limiter.acquire()
142
+ elapsed = time.monotonic() - start
143
+
144
+ # Third request must wait for the 1-second moving window to free a slot (~1s); assert with slack
145
+ assert elapsed >= 0.3
146
+
147
+ @pytest.mark.asyncio
148
+ async def test_limiter_resets_after_window(self) -> None:
149
+ """Rate limit should reset after time window."""
150
+ limiter = RateLimiter(rate="5/second")
151
+
152
+ # Use up the limit
153
+ for _ in range(5):
154
+ await limiter.acquire()
155
+
156
+ # Wait for window to pass
157
+ await asyncio.sleep(1.1)
158
+
159
+ # Should be allowed again
160
+ start = time.monotonic()
161
+ await limiter.acquire()
162
+ elapsed = time.monotonic() - start
163
+
164
+ assert elapsed < 0.1 # Should be nearly instant
165
+
166
+
167
+ class TestGetPubmedLimiter:
168
+ """Test PubMed-specific limiter factory."""
169
+
170
+ def test_limiter_without_api_key(self) -> None:
171
+ """Should return 3/sec limiter without key."""
172
+ reset_pubmed_limiter()  # clear the module-level singleton between tests
+ limiter = get_pubmed_limiter(api_key=None)
173
+ assert "3" in limiter.rate
174
+
175
+ def test_limiter_with_api_key(self) -> None:
176
+ """Should return 10/sec limiter with key."""
177
+ reset_pubmed_limiter()  # clear the module-level singleton between tests
+ limiter = get_pubmed_limiter(api_key="my-api-key")
178
+ assert "10" in limiter.rate
179
+
180
+ def test_limiter_is_singleton(self) -> None:
181
+ """Same API key should return same limiter instance."""
182
+ limiter1 = get_pubmed_limiter(api_key="key1")
183
+ limiter2 = get_pubmed_limiter(api_key="key1")
184
+ assert limiter1 is limiter2
185
+
186
+ def test_different_keys_share_limiter(self) -> None:
187
+ """Different API keys should still share the same NCBI limiter."""
188
+ limiter1 = get_pubmed_limiter(api_key="key1")
189
+ limiter2 = get_pubmed_limiter(api_key="key2")
190
+ # Different keys intentionally share one limiter: every caller hits
191
+ # the same NCBI API, so the limit must be enforced process-wide
192
+ # rather than per key.
193
+ assert limiter1 is limiter2  # Shared NCBI rate limit
194
+ ```
195
+
196
+ ---
197
+
198
+ ### Step 3: Create Rate Limiter Module
199
+
200
+ **File**: `src/tools/rate_limiter.py`
201
+
202
+ ```python
203
+ """Rate limiting utilities using the limits library."""
204
+
205
+ import asyncio
206
+ from typing import ClassVar
207
+
208
+ from limits import RateLimitItem, parse
209
+ from limits.storage import MemoryStorage
210
+ from limits.strategies import MovingWindowRateLimiter
211
+
212
+
213
+ class RateLimiter:
214
+ """
215
+ Async-compatible rate limiter using limits library.
216
+
217
+ Uses moving window algorithm for smooth rate limiting.
218
+ """
219
+
220
+ def __init__(self, rate: str) -> None:
221
+ """
222
+ Initialize rate limiter.
223
+
224
+ Args:
225
+ rate: Rate string like "3/second" or "10/second"
226
+ """
227
+ self.rate = rate
228
+ self._storage = MemoryStorage()
229
+ self._limiter = MovingWindowRateLimiter(self._storage)
230
+ self._rate_limit: RateLimitItem = parse(rate)
231
+ self._identity = "default" # Single identity for shared limiting
232
+
233
+ async def acquire(self, wait: bool = True) -> bool:
234
+ """
235
+ Acquire permission to make a request.
236
+
237
+ ASYNC-SAFE: Uses asyncio.sleep(), never time.sleep().
238
+ The polling pattern allows other coroutines to run while waiting.
239
+
240
+ Args:
241
+ wait: If True, wait until allowed. If False, return immediately.
242
+
243
+ Returns:
244
+ True if allowed, False if not (only when wait=False)
245
+ """
246
+ while True:
247
+ # Check if we can proceed (synchronous, fast - ~microseconds)
248
+ if self._limiter.hit(self._rate_limit, self._identity):
249
+ return True
250
+
251
+ if not wait:
252
+ return False
253
+
254
+ # CRITICAL: Use asyncio.sleep(), NOT time.sleep()
255
+ # This yields control to the event loop, allowing other
256
+ # coroutines (UI, parallel searches) to run
257
+ await asyncio.sleep(0.1)
258
+
259
+ def reset(self) -> None:
260
+ """Reset the rate limiter (for testing)."""
261
+ self._storage.reset()
262
+
263
+
264
+ # Singleton limiter for PubMed/NCBI
265
+ _pubmed_limiter: RateLimiter | None = None
266
+
267
+
268
+ def get_pubmed_limiter(api_key: str | None = None) -> RateLimiter:
269
+ """
270
+ Get the shared PubMed rate limiter.
271
+
272
+ Rate depends on whether API key is provided:
273
+ - Without key: 3 requests/second
274
+ - With key: 10 requests/second
275
+
276
+ Args:
277
+ api_key: NCBI API key (optional)
278
+
279
+ Returns:
280
+ Shared RateLimiter instance
281
+ """
282
+ global _pubmed_limiter
283
+
284
+ if _pubmed_limiter is None:
285
+ rate = "10/second" if api_key else "3/second"
286
+ _pubmed_limiter = RateLimiter(rate)
287
+
288
+ return _pubmed_limiter
289
+
290
+
291
+ def reset_pubmed_limiter() -> None:
292
+ """Reset the PubMed limiter (for testing)."""
293
+ global _pubmed_limiter
294
+ _pubmed_limiter = None
295
+
296
+
297
+ # Factory for other APIs
298
+ class RateLimiterFactory:
299
+ """Factory for creating/getting rate limiters for different APIs."""
300
+
301
+ _limiters: ClassVar[dict[str, RateLimiter]] = {}
302
+
303
+ @classmethod
304
+ def get(cls, api_name: str, rate: str) -> RateLimiter:
305
+ """
306
+ Get or create a rate limiter for an API.
307
+
308
+ Args:
309
+ api_name: Unique identifier for the API
310
+ rate: Rate limit string (e.g., "10/second")
311
+
312
+ Returns:
313
+ RateLimiter instance (shared for same api_name)
314
+ """
315
+ if api_name not in cls._limiters:
316
+ cls._limiters[api_name] = RateLimiter(rate)
317
+ return cls._limiters[api_name]
318
+
319
+ @classmethod
320
+ def reset_all(cls) -> None:
321
+ """Reset all limiters (for testing)."""
322
+ cls._limiters.clear()
323
+ ```
324
+
325
+ ---
326
+
327
+ ### Step 4: Update PubMed Tool
328
+
329
+ **File**: `src/tools/pubmed.py` (replace rate limiting code)
330
+
331
+ ```python
332
+ # Replace imports and rate limiting
333
+
334
+ from src.tools.rate_limiter import get_pubmed_limiter
335
+
336
+
337
+ class PubMedTool:
338
+ """Search tool for PubMed/NCBI."""
339
+
340
+ BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
341
+ HTTP_TOO_MANY_REQUESTS = 429
342
+
343
+ def __init__(self, api_key: str | None = None) -> None:
344
+ self.api_key = api_key or settings.ncbi_api_key
345
+ if self.api_key == "your-ncbi-key-here":
346
+ self.api_key = None
347
+ # Use shared rate limiter
348
+ self._limiter = get_pubmed_limiter(self.api_key)
349
+
350
+ async def _rate_limit(self) -> None:
351
+ """Enforce NCBI rate limiting using shared limiter."""
352
+ await self._limiter.acquire()
353
+
354
+ # ... rest of class unchanged ...
355
+ ```
356
+
357
+ ---
358
+
359
+ ### Step 5: Add Rate Limiters for Other APIs
360
+
361
+ **File**: `src/tools/clinicaltrials.py` (optional)
362
+
363
+ ```python
364
+ from src.tools.rate_limiter import RateLimiterFactory
365
+
366
+
367
+ class ClinicalTrialsTool:
368
+ def __init__(self) -> None:
369
+ # ClinicalTrials.gov doesn't document limits, but be conservative
370
+ self._limiter = RateLimiterFactory.get("clinicaltrials", "5/second")
371
+
372
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
373
+ await self._limiter.acquire()
374
+ # ... rest of method ...
375
+ ```
376
+
377
+ **File**: `src/tools/europepmc.py` (optional)
378
+
379
+ ```python
380
+ from src.tools.rate_limiter import RateLimiterFactory
381
+
382
+
383
+ class EuropePMCTool:
384
+ def __init__(self) -> None:
385
+ # Europe PMC is generous, but still be respectful
386
+ self._limiter = RateLimiterFactory.get("europepmc", "10/second")
387
+
388
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
389
+ await self._limiter.acquire()
390
+ # ... rest of method ...
391
+ ```
392
+
393
+ ---
394
+
395
+ ## Demo Script
396
+
397
+ **File**: `examples/rate_limiting_demo.py`
398
+
399
+ ```python
400
+ #!/usr/bin/env python3
401
+ """Demo script to verify rate limiting works correctly."""
402
+
403
+ import asyncio
404
+ import time
405
+
406
+ from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter, reset_pubmed_limiter
407
+ from src.tools.pubmed import PubMedTool
408
+
409
+
410
+ async def test_basic_limiter():
411
+ """Test basic rate limiter behavior."""
412
+ print("=" * 60)
413
+ print("Rate Limiting Demo")
414
+ print("=" * 60)
415
+
416
+ # Test 1: Basic limiter
417
+ print("\n[Test 1] Testing 3/second limiter...")
418
+ limiter = RateLimiter("3/second")
419
+
420
+ start = time.monotonic()
421
+ for i in range(6):
422
+ await limiter.acquire()
423
+ elapsed = time.monotonic() - start
424
+ print(f" Request {i+1} at {elapsed:.2f}s")
425
+
426
+ total = time.monotonic() - start
427
+ print(f" Total time for 6 requests: {total:.2f}s (expected ~2s)")
428
+
429
+
430
+ async def test_pubmed_limiter():
431
+ """Test PubMed-specific limiter."""
432
+ print("\n[Test 2] Testing PubMed limiter (shared)...")
433
+
434
+ reset_pubmed_limiter() # Clean state
435
+
436
+ # Without API key: 3/sec
437
+ limiter = get_pubmed_limiter(api_key=None)
438
+ print(f" Rate without key: {limiter.rate}")
439
+
440
+ # Multiple tools should share the same limiter
441
+ tool1 = PubMedTool()
442
+ tool2 = PubMedTool()
443
+
444
+ # Verify they share the limiter
445
+ print(f" Tools share limiter: {tool1._limiter is tool2._limiter}")
446
+
447
+
448
+ async def test_concurrent_requests():
449
+ """Test rate limiting under concurrent load."""
450
+ print("\n[Test 3] Testing concurrent request limiting...")
451
+
452
+ limiter = RateLimiter("5/second")
453
+
454
+ async def make_request(i: int):
455
+ await limiter.acquire()
456
+ return time.monotonic()
457
+
458
+ start = time.monotonic()
459
+ # Launch 10 concurrent requests
460
+ tasks = [make_request(i) for i in range(10)]
461
+ times = await asyncio.gather(*tasks)
462
+
463
+ # Calculate distribution
464
+ relative_times = [t - start for t in times]
465
+ print(f" Request times: {[f'{t:.2f}s' for t in sorted(relative_times)]}")
466
+
467
+ total = max(relative_times)
468
+ print(f" All 10 requests completed in {total:.2f}s (expected ~2s)")
469
+
470
+
471
+ async def main():
472
+ await test_basic_limiter()
473
+ await test_pubmed_limiter()
474
+ await test_concurrent_requests()
475
+
476
+ print("\n" + "=" * 60)
477
+ print("Demo complete!")
478
+
479
+
480
+ if __name__ == "__main__":
481
+ asyncio.run(main())
482
+ ```
483
+
484
+ ---
485
+
486
+ ## Verification Checklist
487
+
488
+ ### Unit Tests
489
+ ```bash
490
+ # Run rate limiting tests
491
+ uv run pytest tests/unit/tools/test_rate_limiting.py -v
492
+
493
+ # Expected: All tests pass
494
+ ```
495
+
496
+ ### Integration Test (Manual)
497
+ ```bash
498
+ # Run demo
499
+ uv run python examples/rate_limiting_demo.py
500
+
501
+ # Expected: Requests properly spaced
502
+ ```
503
+
504
+ ### Full Test Suite
505
+ ```bash
506
+ make check
507
+ # Expected: All tests pass, mypy clean
508
+ ```
509
+
510
+ ---
511
+
512
+ ## Success Criteria
513
+
514
+ 1. **`limits` library installed**: Dependency added to pyproject.toml
515
+ 2. **RateLimiter class works**: Can create and use limiters
516
+ 3. **PubMed uses new limiter**: Shared limiter across instances
517
+ 4. **Rate adapts to API key**: 3/sec without, 10/sec with
518
+ 5. **Concurrent requests handled**: Multiple async requests properly queued
519
+ 6. **No regressions**: All existing tests pass
520
+
521
+ ---
522
+
523
+ ## API Rate Limit Reference
524
+
525
+ | API | Without Key | With Key |
526
+ |-----|-------------|----------|
527
+ | PubMed/NCBI | 3/sec | 10/sec |
528
+ | ClinicalTrials.gov | Undocumented (~5/sec safe) | N/A |
529
+ | Europe PMC | ~10-20/sec (generous) | N/A |
530
+ | OpenAlex | ~100k/day (no per-sec limit) | Faster with `mailto` |
531
+
532
+ ---
533
+
534
+ ## Notes
535
+
536
+ - `limits` library uses moving window algorithm (fairer than fixed window)
537
+ - Singleton pattern ensures all PubMed calls share the limit
538
+ - The factory pattern allows easy extension to other APIs
539
+ - Consider adding 429 response detection + exponential backoff
540
+ - In production, consider Redis storage for distributed rate limiting
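+
+ A minimal sketch of the 429-plus-backoff idea (hypothetical helper, not in the current codebase; it only assumes requests go through `httpx.AsyncClient`):
+
+ ```python
+ # Hypothetical sketch: exponential backoff when the API answers HTTP 429.
+ import asyncio
+
+ import httpx
+
+ HTTP_TOO_MANY_REQUESTS = 429
+
+ async def get_with_backoff(client: httpx.AsyncClient, url: str, max_retries: int = 3) -> httpx.Response:
+     delay = 1.0
+     for attempt in range(max_retries + 1):
+         response = await client.get(url)
+         if response.status_code != HTTP_TOO_MANY_REQUESTS or attempt == max_retries:
+             return response
+         await asyncio.sleep(delay)  # async sleep keeps the event loop free
+         delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
+     return response
+ ```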
docs/brainstorming/implementation/README.md ADDED
@@ -0,0 +1,143 @@
1
+ # Implementation Plans
2
+
3
+ TDD implementation plans based on the brainstorming documents. Each phase is a self-contained vertical slice with tests, implementation, and demo scripts.
4
+
5
+ ---
6
+
7
+ ## Prerequisites (COMPLETED)
8
+
9
+ The following foundational changes have been implemented to support all three phases:
10
+
11
+ | Change | File | Status |
12
+ |--------|------|--------|
13
+ | Add `"openalex"` to `SourceName` | `src/utils/models.py:9` | ✅ Done |
14
+ | Add `metadata` field to `Evidence` | `src/utils/models.py:39-42` | ✅ Done |
15
+ | Export all tools from `__init__.py` | `src/tools/__init__.py` | ✅ Done |
16
+
17
+ All 110 tests pass after these changes.
18
+
19
+ ---
20
+
21
+ ## Priority Order
22
+
23
+ | Phase | Name | Priority | Effort | Value |
24
+ |-------|------|----------|--------|-------|
25
+ | **17** | Rate Limiting | P0 CRITICAL | 1 hour | Stability |
26
+ | **15** | OpenAlex | HIGH | 2-3 hours | Very High |
27
+ | **16** | PubMed Full-Text | MEDIUM | 3 hours | High |
28
+
29
+ **Recommended implementation order**: 17 → 15 → 16
30
+
31
+ ---
32
+
33
+ ## Phase 15: OpenAlex Integration
34
+
35
+ **File**: [15_PHASE_OPENALEX.md](./15_PHASE_OPENALEX.md)
36
+
37
+ Add OpenAlex as 4th data source for:
38
+ - Citation networks (who cites whom)
39
+ - Concept tagging (semantic discovery)
40
+ - 209M+ scholarly works
41
+ - Free, no API key required
42
+
43
+ **Quick Start**:
44
+ ```bash
45
+ # Create the tool
46
+ touch src/tools/openalex.py
47
+ touch tests/unit/tools/test_openalex.py
48
+
49
+ # Run tests first (TDD)
50
+ uv run pytest tests/unit/tools/test_openalex.py -v
51
+
52
+ # Demo
53
+ uv run python examples/openalex_demo.py
54
+ ```
55
+
56
+ ---
57
+
58
+ ## Phase 16: PubMed Full-Text
59
+
60
+ **File**: [16_PHASE_PUBMED_FULLTEXT.md](./16_PHASE_PUBMED_FULLTEXT.md)
61
+
62
+ Add full-text retrieval via BioC API for:
63
+ - Complete paper text (not just abstracts)
64
+ - Structured sections (intro, methods, results)
65
+ - Better evidence for LLM synthesis
66
+
67
+ **Quick Start**:
68
+ ```bash
69
+ # Add methods to existing pubmed.py
70
+ # Tests in test_pubmed_fulltext.py
71
+
72
+ # Run tests
73
+ uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v
74
+
75
+ # Demo
76
+ uv run python examples/pubmed_fulltext_demo.py
77
+ ```
78
+
79
+ ---
80
+
81
+ ## Phase 17: Rate Limiting
82
+
83
+ **File**: [17_PHASE_RATE_LIMITING.md](./17_PHASE_RATE_LIMITING.md)
84
+
85
+ Replace naive sleep-based rate limiting with `limits` library for:
86
+ - Moving window algorithm
87
+ - Shared limits across instances
88
+ - Configurable per-API rates
89
+ - Production-grade stability
90
+
91
+ **Quick Start**:
92
+ ```bash
93
+ # Add dependency
94
+ uv add limits
95
+
96
+ # Create module
97
+ touch src/tools/rate_limiter.py
98
+ touch tests/unit/tools/test_rate_limiting.py
99
+
100
+ # Run tests
101
+ uv run pytest tests/unit/tools/test_rate_limiting.py -v
102
+
103
+ # Demo
104
+ uv run python examples/rate_limiting_demo.py
105
+ ```
106
+
107
+ ---
108
+
109
+ ## TDD Workflow
110
+
111
+ Each implementation doc follows this pattern:
112
+
113
+ 1. **Write tests first** - Define expected behavior
114
+ 2. **Run tests** - Verify they fail (red)
115
+ 3. **Implement** - Write minimal code to pass
116
+ 4. **Run tests** - Verify they pass (green)
117
+ 5. **Refactor** - Clean up if needed
118
+ 6. **Demo** - Verify end-to-end with real APIs
119
+ 7. **`make check`** - Ensure no regressions
120
+
121
+ ---
122
+
123
+ ## Related Brainstorming Docs
124
+
125
+ These implementation plans are derived from:
126
+
127
+ - [00_ROADMAP_SUMMARY.md](../00_ROADMAP_SUMMARY.md) - Priority overview
128
+ - [01_PUBMED_IMPROVEMENTS.md](../01_PUBMED_IMPROVEMENTS.md) - PubMed details
129
+ - [02_CLINICALTRIALS_IMPROVEMENTS.md](../02_CLINICALTRIALS_IMPROVEMENTS.md) - CT.gov details
130
+ - [03_EUROPEPMC_IMPROVEMENTS.md](../03_EUROPEPMC_IMPROVEMENTS.md) - Europe PMC details
131
+ - [04_OPENALEX_INTEGRATION.md](../04_OPENALEX_INTEGRATION.md) - OpenAlex integration
132
+
133
+ ---
134
+
135
+ ## Future Phases (Not Yet Documented)
136
+
137
+ Based on brainstorming, these could be added later:
138
+
139
+ - **Phase 18**: ClinicalTrials.gov Results Retrieval
140
+ - **Phase 19**: Europe PMC Annotations API
141
+ - **Phase 20**: Drug Name Normalization (RxNorm)
142
+ - **Phase 21**: Citation Network Queries (OpenAlex)
143
+ - **Phase 22**: Semantic Search with Embeddings
docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md ADDED
@@ -0,0 +1,189 @@
1
+ # Situation Analysis: Pydantic-AI + Microsoft Agent Framework Integration
2
+
3
+ **Date:** November 27, 2025
4
+ **Status:** ACTIVE DECISION REQUIRED
5
+ **Risk Level:** HIGH - DO NOT MERGE PR #41 UNTIL RESOLVED
6
+
7
+ ---
8
+
9
+ ## 1. The Problem
10
+
11
+ We almost merged a refactor that would have **deleted** multi-agent orchestration capability from the codebase, mistakenly believing pydantic-ai and Microsoft Agent Framework were mutually exclusive.
12
+
13
+ **They are not.** They are complementary:
14
+ - **pydantic-ai** (Library): Ensures LLM outputs match Pydantic schemas
15
+ - **Microsoft Agent Framework** (Framework): Orchestrates multi-agent workflows
16
+
17
+ ---
18
+
19
+ ## 2. Current Branch State
20
+
21
+ | Branch | Location | Has Agent Framework? | Has Pydantic-AI Improvements? | Status |
22
+ |--------|----------|---------------------|------------------------------|--------|
23
+ | `origin/dev` | GitHub | YES | NO | **SAFE - Source of Truth** |
24
+ | `huggingface-upstream/dev` | HF Spaces | YES | NO | **SAFE - Same as GitHub** |
25
+ | `origin/main` | GitHub | YES | NO | **SAFE** |
26
+ | `feat/pubmed-fulltext` | GitHub | NO (deleted) | YES | **DANGER - Has destructive refactor** |
27
+ | `refactor/pydantic-unification` | Local | NO (deleted) | YES | **DANGER - Redundant, delete** |
28
+ | Local `dev` | Local only | NO (deleted) | YES | **DANGER - NOT PUSHED (thankfully)** |
29
+
30
+ ### Key Files at Risk
31
+
32
+ **On `origin/dev` (PRESERVED):**
33
+ ```text
34
+ src/agents/
35
+ ├── analysis_agent.py # StatisticalAnalyzer wrapper
36
+ ├── hypothesis_agent.py # Hypothesis generation
37
+ ├── judge_agent.py # JudgeHandler wrapper
38
+ ├── magentic_agents.py # Multi-agent definitions
39
+ ├── report_agent.py # Report synthesis
40
+ ├── search_agent.py # SearchHandler wrapper
41
+ ├── state.py # Thread-safe state management
42
+ └── tools.py # @ai_function decorated tools
43
+
44
+ src/orchestrator_magentic.py # Multi-agent orchestrator
45
+ src/utils/llm_factory.py # Centralized LLM client factory
46
+ ```
47
+
48
+ **Deleted in refactor branch (would be lost if merged):**
49
+ - All of the above
50
+
51
+ ---
52
+
53
+ ## 3. Target Architecture
54
+
55
+ ```text
56
+ ┌─────────────────────────────────────────────────────────────────┐
57
+ │ Microsoft Agent Framework (Orchestration Layer) │
58
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
59
+ │ │ SearchAgent │→ │ JudgeAgent │→ │ ReportAgent │ │
60
+ │ │ (BaseAgent) │ │ (BaseAgent) │ │ (BaseAgent) │ │
61
+ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
62
+ │ │ │ │ │
63
+ │ ▼ ▼ ▼ │
64
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
65
+ │ │ pydantic-ai │ │ pydantic-ai │ │ pydantic-ai │ │
66
+ │ │ Agent() │ │ Agent() │ │ Agent() │ │
67
+ │ │ output_type= │ │ output_type= │ │ output_type= │ │
68
+ │ │ SearchResult │ │ JudgeAssess │ │ Report │ │
69
+ │ └──────────────┘ └──────────────┘ └──────────────┘ │
70
+ └─────────────────────────────────────────────────────────────────┘
71
+ ```
72
+
73
+ **Why this architecture:**
74
+ 1. **Agent Framework** handles: workflow coordination, state passing, middleware, observability
75
+ 2. **pydantic-ai** handles: type-safe LLM calls within each agent
76
+
77
+ ---
78
+
79
+ ## 4. CRITICAL: Naming Confusion Clarification
80
+
81
+ > **Senior Agent Review Finding:** The codebase uses "magentic" in file names (e.g., `orchestrator_magentic.py`, `magentic_agents.py`) but this is **NOT** the `magentic` PyPI package by Jacky Liang. It's Microsoft Agent Framework (`agent-framework-core`).
82
+
83
+ **The naming confusion:**
84
+ - `magentic` (PyPI package): A different library for structured LLM outputs
85
+ - "Magentic" (in our codebase): Our internal name for Microsoft Agent Framework integration
86
+ - `agent-framework-core` (PyPI package): Microsoft's actual multi-agent orchestration framework
87
+
88
+ **Recommended future action:** Rename `orchestrator_magentic.py` → `orchestrator_advanced.py` to eliminate confusion.
89
+
90
+ ---
91
+
92
+ ## 5. What the Refactor DID Get Right
93
+
94
+ The refactor branch (`feat/pubmed-fulltext`) has some valuable improvements:
95
+
96
+ 1. **`judges.py` unified `get_model()`** - Supports OpenAI, Anthropic, AND HuggingFace via pydantic-ai
97
+ 2. **HuggingFace free tier support** - `HuggingFaceModel` integration
98
+ 3. **Test fix** - Properly mocks `HuggingFaceModel` class
99
+ 4. **Removed broken magentic optional dependency** from pyproject.toml (this was correct - the old `magentic` package is different from Microsoft Agent Framework)
100
+
101
+ **What it got WRONG:**
102
+ 1. Deleted `src/agents/` entirely instead of refactoring them
103
+ 2. Deleted `src/orchestrator_magentic.py` instead of fixing it
104
+ 3. Conflated "magentic" (old package) with "Microsoft Agent Framework" (current framework)
105
+
106
+ ---
107
+
108
+ ## 6. Options for Path Forward
109
+
110
+ ### Option A: Abandon Refactor, Start Fresh
111
+ - Close PR #41
112
+ - Delete `feat/pubmed-fulltext` and `refactor/pydantic-unification` branches
113
+ - Reset local `dev` to match `origin/dev`
114
+ - Cherry-pick ONLY the good parts (judges.py improvements, HF support)
115
+ - **Pros:** Clean, safe
116
+ - **Cons:** Lose some work, need to redo carefully
117
+
118
+ ### Option B: Cherry-Pick Good Parts to origin/dev
119
+ - Do NOT merge PR #41
120
+ - Create new branch from `origin/dev`
121
+ - Cherry-pick specific commits/changes that improve pydantic-ai usage
122
+ - Keep agent framework code intact
123
+ - **Pros:** Preserves both, surgical
124
+ - **Cons:** Requires careful file-by-file review
125
+
126
+ ### Option C: Revert Deletions in Refactor Branch
127
+ - On `feat/pubmed-fulltext`, restore deleted agent files from `origin/dev`
128
+ - Keep the pydantic-ai improvements
129
+ - Merge THAT to dev
130
+ - **Pros:** Gets both
131
+ - **Cons:** Complex git operations, risk of conflicts
132
+
133
+ ---
134
+
135
+ ## 7. Recommended Action: Option B (Cherry-Pick)
136
+
137
+ **Step-by-step:**
138
+
139
+ 1. **Close PR #41** (do not merge)
140
+ 2. **Delete redundant branches:**
141
+ - `refactor/pydantic-unification` (local)
142
+ - Reset local `dev` to `origin/dev`
143
+ 3. **Create new branch from origin/dev:**
144
+ ```bash
145
+ git checkout -b feat/pydantic-ai-improvements origin/dev
146
+ ```
147
+ 4. **Cherry-pick or manually port these improvements:**
148
+ - `src/agent_factory/judges.py` - the unified `get_model()` function
149
+ - `examples/free_tier_demo.py` - HuggingFace demo
150
+ - Test improvements
151
+ 5. **Do NOT delete any agent framework files**
152
+ 6. **Create PR for review**
153
+
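+ A minimal example of how step 4 could be ported file-by-file (illustrative commands only; branch and file names are those listed above):
+
+ ```bash
+ # Port a single improved file from the refactor branch without merging the branch
+ git checkout feat/pubmed-fulltext -- src/agent_factory/judges.py
+ git add src/agent_factory/judges.py
+ git commit -m "Port unified get_model() with HuggingFace support"
+ ```
+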
154
+ ---
155
+
156
+ ## 8. Files to Cherry-Pick (Safe Improvements)
157
+
158
+ | File | What Changed | Safe to Port? |
159
+ |------|-------------|---------------|
160
+ | `src/agent_factory/judges.py` | Added `HuggingFaceModel` support in `get_model()` | YES |
161
+ | `examples/free_tier_demo.py` | New demo for HF inference | YES |
162
+ | `tests/unit/agent_factory/test_judges.py` | Fixed HF model mocking | YES |
163
+ | `pyproject.toml` | Removed old `magentic` optional dep | MAYBE (review carefully) |
164
+
165
+ ---
166
+
167
+ ## 9. Questions to Answer Before Proceeding
168
+
169
+ 1. **For the hackathon**: Do we need full multi-agent orchestration, or is single-agent sufficient?
170
+ 2. **For DeepCritical mainline**: Is the plan to use Microsoft Agent Framework for orchestration?
171
+ 3. **Timeline**: How much time do we have to get this right?
172
+
173
+ ---
174
+
175
+ ## 10. Immediate Actions (DO NOW)
176
+
177
+ - [ ] **DO NOT merge PR #41**
178
+ - [ ] Close PR #41 with comment explaining the situation
179
+ - [ ] Do not push local `dev` branch anywhere
180
+ - [ ] Confirm HuggingFace Spaces is untouched (it is - verified)
181
+
182
+ ---
183
+
184
+ ## 11. Decision Log
185
+
186
+ | Date | Decision | Rationale |
187
+ |------|----------|-----------|
188
+ | 2025-11-27 | Pause refactor merge | Discovered agent framework and pydantic-ai are complementary, not exclusive |
189
+ | TBD | ? | Awaiting decision on path forward |
docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md ADDED
@@ -0,0 +1,289 @@
1
+ # Architecture Specification: Dual-Mode Agent System
2
+
3
+ **Date:** November 27, 2025
4
+ **Status:** SPECIFICATION
5
+ **Goal:** Graceful degradation from full multi-agent orchestration to simple single-agent mode
6
+
7
+ ---
8
+
9
+ ## 1. Core Concept: Two Operating Modes
10
+
11
+ ```text
12
+ ┌─────────────────────────────────────────────────────────────────────┐
13
+ │ USER REQUEST │
14
+ │ │ │
15
+ │ ▼ │
16
+ │ ┌─────────────────┐ │
17
+ │ │ Mode Selection │ │
18
+ │ │ (Auto-detect) │ │
19
+ │ └────────┬────────┘ │
20
+ │ │ │
21
+ │ ┌───────────────┴───────────────┐ │
22
+ │ │ │ │
23
+ │ ▼ ▼ │
24
+ │ ┌─────────────────┐ ┌─────────────────┐ │
25
+ │ │ SIMPLE MODE │ │ ADVANCED MODE │ │
26
+ │ │ (Free Tier) │ │ (Paid Tier) │ │
27
+ │ │ │ │ │ │
28
+ │ │ pydantic-ai │ │ MS Agent Fwk │ │
29
+ │ │ single-agent │ │ + pydantic-ai │ │
30
+ │ │ loop │ │ multi-agent │ │
31
+ │ └─────────────────┘ └─────────────────┘ │
32
+ │ │ │ │
33
+ │ └───────────────┬───────────────┘ │
34
+ │ ▼ │
35
+ │ ┌─────────────────┐ │
36
+ │ │ Research Report │ │
37
+ │ │ with Citations │ │
38
+ │ └─────────────────┘ │
39
+ └─────────────────────────────────────────────────────────────────────┘
40
+ ```
41
+
42
+ ---
43
+
44
+ ## 2. Mode Comparison
45
+
46
+ | Aspect | Simple Mode | Advanced Mode |
47
+ |--------|-------------|---------------|
48
+ | **Trigger** | No API key OR `LLM_PROVIDER=huggingface` | OpenAI API key present (currently OpenAI only) |
49
+ | **Framework** | pydantic-ai only | Microsoft Agent Framework + pydantic-ai |
50
+ | **Architecture** | Single orchestrator loop | Multi-agent coordination |
51
+ | **Agents** | One agent does Search→Judge→Report | SearchAgent, JudgeAgent, ReportAgent, AnalysisAgent |
52
+ | **State Management** | Simple dict | Thread-safe `MagenticState` with context vars |
53
+ | **Quality** | Good (functional) | Better (specialized agents, coordination) |
54
+ | **Cost** | Free (HuggingFace Inference) | Paid (OpenAI/Anthropic) |
55
+ | **Use Case** | Demos, hackathon, budget-constrained | Production, research quality |
56
+
57
+ ---
58
+
59
+ ## 3. Simple Mode Architecture (pydantic-ai Only)
60
+
61
+ ```text
62
+ ┌─────────────────────────────────────────────────────┐
63
+ │ Orchestrator │
64
+ │ │
65
+ │ while not sufficient and iteration < max: │
66
+ │ 1. SearchHandler.execute(query) │
67
+ │ 2. JudgeHandler.assess(evidence) ◄── pydantic-ai Agent │
68
+ │ 3. if sufficient: break │
69
+ │ 4. query = judge.next_queries │
70
+ │ │
71
+ │ return ReportGenerator.generate(evidence) │
72
+ └─────────────────────────────────────────────────────┘
73
+ ```
74
+
75
+ **Components:**
76
+ - `src/orchestrator.py` - Simple loop orchestrator
77
+ - `src/agent_factory/judges.py` - JudgeHandler with pydantic-ai
78
+ - `src/tools/search_handler.py` - Scatter-gather search
79
+ - `src/tools/pubmed.py`, `clinicaltrials.py`, `europepmc.py` - Search tools
80
+
81
+ ---
82
+
83
+ ## 4. Advanced Mode Architecture (MS Agent Framework + pydantic-ai)
84
+
85
+ ```text
86
+ ┌─────────────────────────────────────────────────────────────────────┐
87
+ │ Microsoft Agent Framework Orchestrator │
88
+ │ │
89
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
90
+ │ │ SearchAgent │───▶│ JudgeAgent │───▶│ ReportAgent │ │
91
+ │ │ (BaseAgent) │ │ (BaseAgent) │ │ (BaseAgent) │ │
92
+ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
93
+ │ │ │ │ │
94
+ │ ▼ ▼ ▼ │
95
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
96
+ │ │ pydantic-ai │ │ pydantic-ai │ │ pydantic-ai │ │
97
+ │ │ Agent() │ │ Agent() │ │ Agent() │ │
98
+ │ │ output_type=│ │ output_type=│ │ output_type=│ │
99
+ │ │ SearchResult│ │ JudgeAssess │ │ Report │ │
100
+ │ └─────────────┘ └─────────────┘ └─────────────┘ │
101
+ │ │
102
+ │ Shared State: MagenticState (thread-safe via contextvars) │
103
+ │ - evidence: list[Evidence] │
104
+ │ - embedding_service: EmbeddingService │
105
+ └─────────────────────────────────────────────────────────────────────┘
106
+ ```
107
+
108
+ **Components:**
109
+ - `src/orchestrator_magentic.py` - Multi-agent orchestrator
110
+ - `src/agents/search_agent.py` - SearchAgent (BaseAgent)
111
+ - `src/agents/judge_agent.py` - JudgeAgent (BaseAgent)
112
+ - `src/agents/report_agent.py` - ReportAgent (BaseAgent)
113
+ - `src/agents/analysis_agent.py` - AnalysisAgent (BaseAgent)
114
+ - `src/agents/state.py` - Thread-safe state management
115
+ - `src/agents/tools.py` - @ai_function decorated tools
116
+
117
+ ---
118
+
119
+ ## 5. Mode Selection Logic
120
+
121
+ ```python
122
+ # src/orchestrator_factory.py (actual implementation)
123
+
124
+ def create_orchestrator(
125
+ search_handler: SearchHandlerProtocol | None = None,
126
+ judge_handler: JudgeHandlerProtocol | None = None,
127
+ config: OrchestratorConfig | None = None,
128
+ mode: Literal["simple", "magentic", "advanced"] | None = None,
129
+ ) -> Any:
130
+ """
131
+ Auto-select orchestrator based on available credentials.
132
+
133
+ Priority:
134
+ 1. If mode explicitly set, use that
135
+ 2. If OpenAI key available -> Advanced Mode (currently OpenAI only)
136
+ 3. Otherwise -> Simple Mode (HuggingFace free tier)
137
+ """
138
+ effective_mode = _determine_mode(mode)
139
+
140
+ if effective_mode == "advanced":
141
+ orchestrator_cls = _get_magentic_orchestrator_class()
142
+ return orchestrator_cls(max_rounds=config.max_iterations if config else 10)
143
+
144
+ # Simple mode requires handlers
145
+ if search_handler is None or judge_handler is None:
146
+ raise ValueError("Simple mode requires search_handler and judge_handler")
147
+
148
+ return Orchestrator(
149
+ search_handler=search_handler,
150
+ judge_handler=judge_handler,
151
+ config=config,
152
+ )
153
+ ```
154
+
155
+ ---
156
+
157
+ ## 6. Shared Components (Both Modes Use)
158
+
159
+ These components work in both modes:
160
+
161
+ | Component | Purpose |
162
+ |-----------|---------|
163
+ | `src/tools/pubmed.py` | PubMed search |
164
+ | `src/tools/clinicaltrials.py` | ClinicalTrials.gov search |
165
+ | `src/tools/europepmc.py` | Europe PMC search |
166
+ | `src/tools/search_handler.py` | Scatter-gather orchestration |
167
+ | `src/tools/rate_limiter.py` | Rate limiting |
168
+ | `src/utils/models.py` | Evidence, Citation, JudgeAssessment |
169
+ | `src/utils/config.py` | Settings |
170
+ | `src/services/embeddings.py` | Vector search (optional) |
171
+
172
+ ---
173
+
174
+ ## 7. pydantic-ai Integration Points
175
+
176
+ Both modes use pydantic-ai for structured LLM outputs:
177
+
178
+ ```python
179
+ # In JudgeHandler (both modes)
180
+ from pydantic_ai import Agent
181
+ from pydantic_ai.models.huggingface import HuggingFaceModel
182
+ from pydantic_ai.models.openai import OpenAIModel
183
+ from pydantic_ai.models.anthropic import AnthropicModel
184
+
185
+ class JudgeHandler:
186
+ def __init__(self, model: Any = None):
187
+ self.model = model or get_model() # Auto-selects based on config
188
+ self.agent = Agent(
189
+ model=self.model,
190
+ output_type=JudgeAssessment, # Structured output!
191
+ system_prompt=SYSTEM_PROMPT,
192
+ )
193
+
194
+ async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
195
+ result = await self.agent.run(format_prompt(question, evidence))
196
+ return result.output # Guaranteed to be JudgeAssessment
197
+ ```
198
+
199
+ ---
200
+
201
+ ## 8. Microsoft Agent Framework Integration Points
202
+
203
+ Advanced mode wraps pydantic-ai agents in BaseAgent:
204
+
205
+ ```python
206
+ # In JudgeAgent (advanced mode only)
207
+ from agent_framework import BaseAgent, AgentRunResponse, ChatMessage, Role
208
+
209
+ class JudgeAgent(BaseAgent):
210
+ def __init__(self, judge_handler: JudgeHandlerProtocol):
211
+ super().__init__(
212
+ name="JudgeAgent",
213
+ description="Evaluates evidence quality",
214
+ )
215
+ self._handler = judge_handler # Uses pydantic-ai internally
216
+
217
+ async def run(self, messages, **kwargs) -> AgentRunResponse:
218
+ question = extract_question(messages)
219
+ evidence = self._evidence_store.get("current", [])
220
+
221
+ # Delegate to pydantic-ai powered handler
222
+ assessment = await self._handler.assess(question, evidence)
223
+
224
+ return AgentRunResponse(
225
+ messages=[ChatMessage(role=Role.ASSISTANT, text=format_response(assessment))],
226
+ additional_properties={"assessment": assessment.model_dump()},
227
+ )
228
+ ```
229
+
230
+ ---
231
+
232
+ ## 9. Benefits of This Architecture
233
+
234
+ 1. **Graceful Degradation**: Works without API keys (free tier)
235
+ 2. **Progressive Enhancement**: Better with API keys (orchestration)
236
+ 3. **Code Reuse**: pydantic-ai handlers shared between modes
237
+ 4. **Hackathon Ready**: Demo works without requiring paid keys
238
+ 5. **Production Ready**: Full orchestration available when needed
239
+ 6. **Future Proof**: Can add more agents to advanced mode
240
+ 7. **Testable**: Simple mode is easier to unit test
241
+
242
+ ---
243
+
244
+ ## 10. Known Risks and Mitigations
245
+
246
+ > **From Senior Agent Review**
247
+
248
+ ### 10.1 Bridge Complexity (MEDIUM)
249
+
250
+ **Risk:** In Advanced Mode, agents (Agent Framework) wrap handlers (pydantic-ai). Both are async. Context variables (`MagenticState`) must propagate correctly through the pydantic-ai call stack.
251
+
252
+ **Mitigation:**
253
+ - pydantic-ai uses standard Python `contextvars`, which naturally propagate through `await` chains
254
+ - Test context propagation explicitly in integration tests
255
+ - If issues arise, pass state explicitly rather than via context vars
256
+
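+ A minimal sketch of such a propagation test (illustrative only; `state_var` stands in for the real `MagenticState` context variable):
+
+ ```python
+ # Hypothetical test: context variables set by an orchestrator are visible to awaited handlers.
+ import asyncio
+ import contextvars
+
+ state_var: contextvars.ContextVar[dict] = contextvars.ContextVar("magentic_state")
+
+ async def handler() -> dict:
+     # Simulates a pydantic-ai powered handler reading shared state.
+     return state_var.get()
+
+ async def orchestrate() -> dict:
+     state_var.set({"evidence": []})
+     return await handler()  # the value propagates through the await chain
+
+ def test_context_propagates_through_await() -> None:
+     assert asyncio.run(orchestrate()) == {"evidence": []}
+ ```
+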
257
+ ### 10.2 Integration Drift (MEDIUM)
258
+
259
+ **Risk:** Simple Mode and Advanced Mode might diverge in behavior over time (e.g., Simple Mode uses logic A, Advanced Mode uses logic B).
260
+
261
+ **Mitigation:**
262
+ - Both modes MUST call the exact same underlying Tools (`src/tools/*`) and Handlers (`src/agent_factory/*`)
263
+ - Handlers are the single source of truth for business logic
264
+ - Agents are thin wrappers that delegate to handlers
265
+
266
+ ### 10.3 Testing Burden (LOW-MEDIUM)
267
+
268
+ **Risk:** Two distinct orchestrators (`src/orchestrator.py` and `src/orchestrator_magentic.py`) doubles integration testing surface area.
269
+
270
+ **Mitigation:**
271
+ - Unit test handlers independently (shared code)
272
+ - Integration tests for each mode separately
273
+ - End-to-end tests verify same output for same input (determinism permitting)
274
+
275
+ ### 10.4 Dependency Conflicts (LOW)
276
+
277
+ **Risk:** `agent-framework-core` might conflict with `pydantic-ai`'s dependencies (e.g., different pydantic versions).
278
+
279
+ **Status:** Both use `pydantic>=2.x`. Should be compatible.
280
+
281
+ ---
282
+
283
+ ## 11. Naming Clarification
284
+
285
+ > See `00_SITUATION_AND_PLAN.md` Section 4 for full details.
286
+
287
+ **Important:** The codebase uses "magentic" in file names (`orchestrator_magentic.py`, `magentic_agents.py`) but this refers to our internal naming for Microsoft Agent Framework integration, **NOT** the `magentic` PyPI package.
288
+
289
+ **Future action:** Rename to `orchestrator_advanced.py` to eliminate confusion.
docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md ADDED
@@ -0,0 +1,112 @@
1
+ # Implementation Phases: Dual-Mode Agent System
2
+
3
+ **Date:** November 27, 2025
4
+ **Status:** IMPLEMENTATION PLAN (REVISED)
5
+ **Strategy:** TDD (Test-Driven Development), SOLID Principles
6
+ **Dependency Strategy:** PyPI (agent-framework-core)
7
+
8
+ ---
9
+
10
+ ## Phase 0: Environment Validation & Cleanup
11
+
12
+ **Goal:** Ensure clean state and dependencies are correctly installed.
13
+
14
+ ### Step 0.1: Verify PyPI Package
15
+ The `agent-framework-core` package is published on PyPI by Microsoft. Verify installation:
16
+
17
+ ```bash
18
+ uv sync --all-extras
19
+ python -c "from agent_framework import ChatAgent; print('OK')"
20
+ ```
21
+
22
+ ### Step 0.2: Branch State
23
+ We are on `feat/dual-mode-architecture`. Ensure it is up to date with `origin/dev` before starting.
24
+
25
+ **Note:** The `reference_repos/agent-framework` folder is kept for reference/documentation only.
26
+ The production dependency uses the official PyPI release.
27
+
28
+ ---
29
+
30
+ ## Phase 1: Pydantic-AI Improvements (Simple Mode)
31
+
32
+ **Goal:** Implement `HuggingFaceModel` support in `JudgeHandler` using strict TDD.
33
+
34
+ ### Step 1.1: Test First (Red)
35
+ Create `tests/unit/agent_factory/test_judges_factory.py`:
36
+ - Test `get_model()` returns `HuggingFaceModel` when `LLM_PROVIDER=huggingface`.
37
+ - Test `get_model()` respects `HF_TOKEN`.
38
+ - Test fallback to OpenAI.
39
+
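+ A minimal sketch of one such test (hypothetical: it assumes `get_model()` re-reads provider settings from the environment rather than caching them at import time):
+
+ ```python
+ # Hypothetical test sketch for Step 1.1.
+ import pytest
+ from pydantic_ai.models.huggingface import HuggingFaceModel
+
+ from src.agent_factory.judges import get_model
+
+ def test_get_model_returns_huggingface_model(monkeypatch: pytest.MonkeyPatch) -> None:
+     monkeypatch.setenv("LLM_PROVIDER", "huggingface")
+     monkeypatch.setenv("HF_TOKEN", "hf_dummy")
+     assert isinstance(get_model(), HuggingFaceModel)
+ ```
+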
40
+ ### Step 1.2: Implementation (Green)
41
+ Update `src/utils/config.py`:
42
+ - Add `huggingface_model` and `hf_token` fields.
43
+
44
+ Update `src/agent_factory/judges.py`:
45
+ - Implement `get_model` with the logic derived from the tests.
46
+ - Use dependency injection for the model where possible.
47
+
48
+ ### Step 1.3: Refactor
49
+ Ensure `JudgeHandler` is loosely coupled from the specific model provider.
50
+
51
+ ---
52
+
53
+ ## Phase 2: Orchestrator Factory (The Switch)
54
+
55
+ **Goal:** Implement the factory pattern to switch between Simple and Advanced modes.
56
+
57
+ ### Step 2.1: Test First (Red)
58
+ Create `tests/unit/test_orchestrator_factory.py`:
59
+ - Test `create_orchestrator` returns `Orchestrator` (simple) when API keys are missing.
60
+ - Test `create_orchestrator` returns `MagenticOrchestrator` (advanced) when OpenAI key exists.
61
+ - Test explicit mode override.
62
+
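+ A minimal sketch of one such test (hypothetical; it relies on the `ValueError` behaviour shown in the factory excerpt in `01_ARCHITECTURE_SPEC.md`):
+
+ ```python
+ # Hypothetical test sketch for Step 2.1.
+ import pytest
+
+ from src.orchestrator_factory import create_orchestrator
+
+ def test_simple_mode_without_handlers_raises() -> None:
+     # Simple mode needs search_handler and judge_handler injected.
+     with pytest.raises(ValueError):
+         create_orchestrator(mode="simple")
+ ```
+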
63
+ ### Step 2.2: Implementation (Green)
64
+ Update `src/orchestrator_factory.py` to implement the selection logic.
65
+
66
+ ---
67
+
68
+ ## Phase 3: Agent Framework Integration (Advanced Mode)
69
+
70
+ **Goal:** Integrate Microsoft Agent Framework from PyPI.
71
+
72
+ ### Step 3.1: Dependency Management
73
+ The `agent-framework-core` package is installed from PyPI:
74
+ ```toml
75
+ [project.optional-dependencies]
76
+ magentic = [
77
+ "agent-framework-core>=1.0.0b251120,<2.0.0", # Microsoft Agent Framework (PyPI)
78
+ ]
79
+ ```
80
+ Install with: `uv sync --all-extras`
81
+
82
+ ### Step 3.2: Verify Imports (Test First)
83
+ Create `tests/unit/agents/test_agent_imports.py`:
84
+ - Verify `from agent_framework import ChatAgent` works.
85
+ - Verify instantiation of `ChatAgent` with a mock client.
86
+
87
+ ### Step 3.3: Update Agents
88
+ Refactor `src/agents/*.py` to ensure they match the exact signature of the `ChatAgent` class shipped in `agent-framework-core`.
89
+ - **SOLID:** Ensure agents have single responsibilities.
90
+ - **DRY:** Share tool definitions between Pydantic-AI simple mode and Agent Framework advanced mode.
91
+
92
+ ---
93
+
94
+ ## Phase 4: UI & End-to-End Verification
95
+
96
+ **Goal:** Update Gradio to reflect the active mode.
97
+
98
+ ### Step 4.1: UI Updates
99
+ Update `src/app.py` to display "Simple Mode" vs "Advanced Mode".
100
+
101
+ ### Step 4.2: End-to-End Test
102
+ Run the full loop:
103
+ 1. Simple Mode (No Keys) -> Search -> Judge (HF) -> Report.
104
+ 2. Advanced Mode (OpenAI Key) -> SearchAgent -> JudgeAgent -> ReportAgent.
105
+
106
+ ---
107
+
108
+ ## Phase 5: Cleanup & Documentation
109
+
110
+ - Remove unused code.
111
+ - Update main README.md.
112
+ - Final `make check`.
docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md ADDED
@@ -0,0 +1,112 @@
1
+ # Immediate Actions Checklist
2
+
3
+ **Date:** November 27, 2025
4
+ **Priority:** Execute in order
5
+
6
+ ---
7
+
8
+ ## Before Starting Implementation
9
+
10
+ ### 1. Close PR #41 (CRITICAL)
11
+
12
+ ```bash
13
+ gh pr close 41 --comment "Architecture decision changed. Cherry-picking improvements to preserve both pydantic-ai and Agent Framework capabilities."
14
+ ```
15
+
16
+ ### 2. Verify HuggingFace Spaces is Safe
17
+
18
+ ```bash
19
+ # Should show agent framework files exist
20
+ git ls-tree --name-only huggingface-upstream/dev -- src/agents/
21
+ git ls-tree --name-only huggingface-upstream/dev -- src/orchestrator_magentic.py
22
+ ```
23
+
24
+ Expected output: Files should exist (they do as of this writing).
25
+
26
+ ### 3. Clean Local Environment
27
+
28
+ ```bash
29
+ # Switch to main first
30
+ git checkout main
31
+
32
+ # Delete problematic branches
33
+ git branch -D refactor/pydantic-unification 2>/dev/null || true
34
+ git branch -D feat/pubmed-fulltext 2>/dev/null || true
35
+
36
+ # Reset local dev to origin/dev
37
+ git branch -D dev 2>/dev/null || true
38
+ git checkout -b dev origin/dev
39
+
40
+ # Verify agent framework code exists
41
+ ls src/agents/
42
+ # Expected: __init__.py, analysis_agent.py, hypothesis_agent.py, judge_agent.py,
43
+ # magentic_agents.py, report_agent.py, search_agent.py, state.py, tools.py
44
+
45
+ ls src/orchestrator_magentic.py
46
+ # Expected: file exists
47
+ ```
48
+
49
+ ### 4. Create Fresh Feature Branch
50
+
51
+ ```bash
52
+ git checkout -b feat/dual-mode-architecture origin/dev
53
+ ```
54
+
55
+ ---
56
+
57
+ ## Decision Points
58
+
59
+ Before proceeding, confirm:
60
+
61
+ 1. **For hackathon**: Do we need advanced mode, or is simple mode sufficient?
62
+ - Simple mode = faster to implement, works today
63
+ - Advanced mode = better quality, more work
64
+
65
+ 2. **Timeline**: How much time do we have?
66
+ - If < 1 day: Focus on simple mode only
67
+ - If > 1 day: Implement dual-mode
68
+
69
+ 3. **Dependencies**: Is `agent-framework-core` available?
70
+ - Check: `pip index versions agent-framework-core`
71
+ - If not on PyPI, may need to install from GitHub
72
+
73
+ ---
74
+
75
+ ## Quick Start (Simple Mode Only)
76
+
77
+ If time is limited, implement only simple mode improvements:
78
+
79
+ ```bash
80
+ # On feat/dual-mode-architecture branch
81
+
82
+ # 1. Update judges.py to add HuggingFace support
83
+ # 2. Update config.py to add HF settings
84
+ # 3. Create free_tier_demo.py
85
+ # 4. Run make check
86
+ # 5. Create PR to dev
87
+ ```
88
+
89
+ This gives you free-tier capability without touching agent framework code.
90
+
91
+ ---
92
+
93
+ ## Quick Start (Full Dual-Mode)
94
+
95
+ If time permits, implement full dual-mode:
96
+
97
+ Follow Phases 0-5 in `02_IMPLEMENTATION_PHASES.md`
98
+
99
+ ---
100
+
101
+ ## Emergency Rollback
102
+
103
+ If anything goes wrong:
104
+
105
+ ```bash
106
+ # Reset to safe state
107
+ git checkout main
108
+ git branch -D feat/dual-mode-architecture
109
+ git checkout -b feat/dual-mode-architecture origin/dev
110
+ ```
111
+
112
+ Origin/dev is the safe fallback - it has agent framework intact.
docs/brainstorming/magentic-pydantic/04_FOLLOWUP_REVIEW_REQUEST.md ADDED
@@ -0,0 +1,158 @@
1
+ # Follow-Up Review Request: Did We Implement Your Feedback?
2
+
3
+ **Date:** November 27, 2025
4
+ **Context:** You previously reviewed our dual-mode architecture plan and provided feedback. We have updated the documentation. Please verify we correctly implemented your recommendations.
5
+
6
+ ---
7
+
8
+ ## Your Original Feedback vs Our Changes
9
+
10
+ ### 1. Naming Confusion Clarification
11
+
12
+ **Your feedback:** "You are using Microsoft Agent Framework, but you've named your integration 'Magentic'. This caused the confusion."
13
+
14
+ **Our change:** Added Section 4 in `00_SITUATION_AND_PLAN.md`:
15
+ ```markdown
16
+ ## 4. CRITICAL: Naming Confusion Clarification
17
+
18
+ > **Senior Agent Review Finding:** The codebase uses "magentic" in file names
19
+ > (e.g., `orchestrator_magentic.py`, `magentic_agents.py`) but this is **NOT**
20
+ > the `magentic` PyPI package by Jacky Liang. It's Microsoft Agent Framework.
21
+
22
+ **The naming confusion:**
23
+ - `magentic` (PyPI package): A different library for structured LLM outputs
24
+ - "Magentic" (in our codebase): Our internal name for Microsoft Agent Framework integration
25
+ - `agent-framework-core` (PyPI package): Microsoft's actual multi-agent orchestration framework
26
+
27
+ **Recommended future action:** Rename `orchestrator_magentic.py` → `orchestrator_advanced.py`
28
+ ```
29
+
30
+ **Status:** ✅ IMPLEMENTED
31
+
32
+ ---
33
+
34
+ ### 2. Bridge Complexity Warning
35
+
36
+ **Your feedback:** "You must ensure MagenticState (context vars) propagates correctly through the pydantic-ai call stack."
37
+
38
+ **Our change:** Added Section 10.1 in `01_ARCHITECTURE_SPEC.md`:
39
+ ```markdown
40
+ ### 10.1 Bridge Complexity (MEDIUM)
41
+
42
+ **Risk:** In Advanced Mode, agents (Agent Framework) wrap handlers (pydantic-ai).
43
+ Both are async. Context variables (`MagenticState`) must propagate correctly.
44
+
45
+ **Mitigation:**
46
+ - pydantic-ai uses standard Python `contextvars`, which naturally propagate through `await` chains
47
+ - Test context propagation explicitly in integration tests
48
+ - If issues arise, pass state explicitly rather than via context vars
49
+ ```
50
+
51
+ **Status:** ✅ IMPLEMENTED
52
+
53
+ ---
54
+
55
+ ### 3. Integration Drift Warning
56
+
57
+ **Your feedback:** "Simple Mode and Advanced Mode might diverge in behavior."
58
+
59
+ **Our change:** Added Section 10.2 in `01_ARCHITECTURE_SPEC.md`:
60
+ ```markdown
61
+ ### 10.2 Integration Drift (MEDIUM)
62
+
63
+ **Risk:** Simple Mode and Advanced Mode might diverge in behavior over time.
64
+
65
+ **Mitigation:**
66
+ - Both modes MUST call the exact same underlying Tools (`src/tools/*`) and Handlers (`src/agent_factory/*`)
67
+ - Handlers are the single source of truth for business logic
68
+ - Agents are thin wrappers that delegate to handlers
69
+ ```
70
+
71
+ **Status:** ✅ IMPLEMENTED
72
+
73
+ ---
74
+
75
+ ### 4. Testing Burden Warning
76
+
77
+ **Your feedback:** "You now have two distinct orchestrators to maintain. This doubles your integration testing surface area."
78
+
79
+ **Our change:** Added Section 10.3 in `01_ARCHITECTURE_SPEC.md`:
80
+ ```markdown
81
+ ### 10.3 Testing Burden (LOW-MEDIUM)
82
+
83
+ **Risk:** Two distinct orchestrators doubles integration testing surface area.
84
+
85
+ **Mitigation:**
86
+ - Unit test handlers independently (shared code)
87
+ - Integration tests for each mode separately
88
+ - End-to-end tests verify same output for same input
89
+ ```
90
+
91
+ **Status:** ✅ IMPLEMENTED
92
+
93
+ ---
94
+
95
+ ### 5. Rename Recommendation
96
+
97
+ **Your feedback:** "Rename `src/orchestrator_magentic.py` to `src/orchestrator_advanced.py`"
98
+
99
+ **Our change:** Added Step 3.4 in `02_IMPLEMENTATION_PHASES.md`:
100
+ ```markdown
101
+ ### Step 3.4: (OPTIONAL) Rename "Magentic" to "Advanced"
102
+
103
+ > **Senior Agent Recommendation:** Rename files to eliminate confusion.
104
+
105
+ git mv src/orchestrator_magentic.py src/orchestrator_advanced.py
106
+ git mv src/agents/magentic_agents.py src/agents/advanced_agents.py
107
+
108
+ **Note:** This is optional for the hackathon. Can be done in a follow-up PR.
109
+ ```
110
+
111
+ **Status:** ✅ DOCUMENTED (marked as optional for hackathon)
112
+
113
+ ---
114
+
115
+ ### 6. Standardize Wrapper Recommendation
116
+
117
+ **Your feedback:** "Create a generic `PydanticAiAgentWrapper(BaseAgent)` class instead of manually wrapping each handler."
118
+
119
+ **Our change:** NOT YET DOCUMENTED
120
+
121
+ **Status:** ⚠️ NOT IMPLEMENTED - Should we add this?
122
+
123
+ ---
124
+
125
+ ## Questions for Your Review
126
+
127
+ 1. **Did we correctly implement your feedback?** Are there any misunderstandings in how we interpreted your recommendations?
128
+
129
+ 2. **Is the "Standardize Wrapper" recommendation critical?** Should we add it to the implementation phases, or is it a nice-to-have for later?
130
+
131
+ 3. **Dependency versioning:** You noted `agent-framework-core>=1.0.0b251120` might be ephemeral. Should we:
132
+ - Pin to a specific version?
133
+ - Use a version range?
134
+ - Install from GitHub source?
135
+
136
+ 4. **Anything else we missed?**
137
+
138
+ ---
139
+
140
+ ## Files to Re-Review
141
+
142
+ 1. `00_SITUATION_AND_PLAN.md` - Added Section 4 (Naming Clarification)
143
+ 2. `01_ARCHITECTURE_SPEC.md` - Added Sections 10-11 (Risks, Naming)
144
+ 3. `02_IMPLEMENTATION_PHASES.md` - Added Step 3.4 (Optional Rename)
145
+
146
+ ---
147
+
148
+ ## Current Branch State
149
+
150
+ We are now on `feat/dual-mode-architecture` branched from `origin/dev`:
151
+ - ✅ Agent framework code intact (`src/agents/`, `src/orchestrator_magentic.py`)
152
+ - ✅ Documentation committed
153
+ - ❌ PR #41 still open (need to close it)
154
+ - ❌ Cherry-pick of pydantic-ai improvements not yet done
155
+
156
+ ---
157
+
158
+ Please confirm: **GO / NO-GO** to proceed with Phase 1 (cherry-picking pydantic-ai improvements)?
docs/brainstorming/magentic-pydantic/REVIEW_PROMPT_FOR_SENIOR_AGENT.md ADDED
@@ -0,0 +1,113 @@
1
+ # Senior Agent Review Prompt
2
+
3
+ Copy and paste everything below this line to a fresh Claude/AI session:
4
+
5
+ ---
6
+
7
+ ## Context
8
+
9
+ I am a junior developer working on a HuggingFace hackathon project called DeepCritical. We made a significant architectural mistake and are now trying to course-correct. I need you to act as a **senior staff engineer** and critically review our proposed solution.
10
+
11
+ ## The Situation
12
+
13
+ We almost merged a refactor that would have **deleted** our multi-agent orchestration capability, mistakenly believing that `pydantic-ai` (a library for structured LLM outputs) and Microsoft's `agent-framework` (a framework for multi-agent orchestration) were mutually exclusive alternatives.
14
+
15
+ **They are not.** They are complementary:
16
+ - `pydantic-ai` ensures LLM responses match Pydantic schemas (type-safe outputs)
17
+ - `agent-framework` orchestrates multiple agents working together (coordination layer)
18
+
19
+ We now want to implement a **dual-mode architecture** where:
20
+ - **Simple Mode (No API key):** Uses only pydantic-ai with HuggingFace free tier
21
+ - **Advanced Mode (With API key):** Uses Microsoft Agent Framework for orchestration, with pydantic-ai inside each agent for structured outputs
22
+
23
+ ## Your Task
24
+
25
+ Please perform a **deep, critical review** of:
26
+
27
+ 1. **The architecture diagram** (image attached: `assets/magentic-pydantic.png`)
28
+ 2. **Our documentation** (4 files listed below)
29
+ 3. **The actual codebase** to verify our claims
30
+
31
+ ## Specific Questions to Answer
32
+
33
+ ### Architecture Validation
34
+ 1. Is our understanding correct that pydantic-ai and agent-framework are complementary, not competing?
35
+ 2. Does the dual-mode architecture diagram accurately represent how these should integrate?
36
+ 3. Are there any architectural flaws or anti-patterns in our proposed design?
37
+
38
+ ### Documentation Accuracy
39
+ 4. Are the branch states we documented accurate? (Check `git log`, `git ls-tree`)
40
+ 5. Is our understanding of what code exists where correct?
41
+ 6. Are the implementation phases realistic and in the correct order?
42
+ 7. Are there any missing steps or dependencies we overlooked?
43
+
44
+ ### Codebase Reality Check
45
+ 8. Does `origin/dev` actually have the agent framework code intact? Verify by checking:
46
+ - `git ls-tree origin/dev -- src/agents/`
47
+ - `git ls-tree origin/dev -- src/orchestrator_magentic.py`
48
+ 9. What does the current `src/agents/` code actually import? Does it use `agent_framework` or `agent-framework-core`?
49
+ 10. Is the `agent-framework-core` package actually available on PyPI, or do we need to install from source?
50
+
51
+ ### Implementation Feasibility
52
+ 11. Can the cherry-pick strategy we outlined actually work, or are there merge conflicts we're not seeing?
53
+ 12. Is the mode auto-detection logic sound?
54
+ 13. What are the risks we haven't identified?
55
+
56
+ ### Critical Errors Check
57
+ 14. Did we miss anything critical in our analysis?
58
+ 15. Are there any factual errors in our documentation?
59
+ 16. Would a Google/DeepMind senior engineer approve this plan, or would they flag issues?
60
+
61
+ ## Files to Review
62
+
63
+ Please read these files in order:
64
+
65
+ 1. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md`
66
+ 2. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md`
67
+ 3. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md`
68
+ 4. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md`
69
+
70
+ And the architecture diagram:
71
+ 5. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/assets/magentic-pydantic.png`
72
+
73
+ ## Reference Repositories to Consult
74
+
75
+ We have local clones of the source-of-truth repositories:
76
+
77
+ - **Original DeepCritical:** `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/reference_repos/DeepCritical/`
78
+ - **Microsoft Agent Framework:** `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/reference_repos/agent-framework/`
79
+ - **Microsoft AutoGen:** `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/reference_repos/autogen-microsoft/`
80
+
81
+ Please cross-reference our hackathon fork against these to verify architectural alignment.
82
+
83
+ ## Codebase to Analyze
84
+
85
+ Our hackathon fork is at:
86
+ `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/`
87
+
88
+ Key files to examine:
89
+ - `src/agents/` - Agent framework integration
90
+ - `src/agent_factory/judges.py` - pydantic-ai integration
91
+ - `src/orchestrator.py` - Simple mode orchestrator
92
+ - `src/orchestrator_magentic.py` - Advanced mode orchestrator
93
+ - `src/orchestrator_factory.py` - Mode selection
94
+ - `pyproject.toml` - Dependencies
95
+
96
+ ## Expected Output
97
+
98
+ Please provide:
99
+
100
+ 1. **Validation Summary:** Is our plan sound? (YES/NO with explanation)
101
+ 2. **Errors Found:** List any factual errors in our documentation
102
+ 3. **Missing Items:** What did we overlook?
103
+ 4. **Risk Assessment:** What could go wrong?
104
+ 5. **Recommended Changes:** Specific edits to our documentation or plan
105
+ 6. **Go/No-Go Recommendation:** Should we proceed with this plan?
106
+
107
+ ## Tone
108
+
109
+ Be brutally honest. If our plan is flawed, say so directly. We would rather know now than after implementation. Don't soften criticism - we need accuracy.
110
+
111
+ ---
112
+
113
+ END OF PROMPT
docs/bugs/FIX_PLAN_MAGENTIC_MODE.md ADDED
@@ -0,0 +1,227 @@
1
+ # Fix Plan: Magentic Mode Report Generation
2
+
3
+ **Related Bug**: `P0_MAGENTIC_MODE_BROKEN.md`
4
+ **Approach**: Test-Driven Development (TDD)
5
+ **Estimated Scope**: 4 phases, ~2-3 hours
6
+
7
+ ---
8
+
9
+ ## Problem Summary
10
+
11
+ Magentic mode runs but fails to produce readable reports due to:
12
+
13
+ 1. **Primary Bug**: `MagenticFinalResultEvent.message` returns `ChatMessage` object, not text
14
+ 2. **Secondary Bug**: Max rounds (3) reached before ReportAgent completes
15
+ 3. **Tertiary Issues**: Stale "bioRxiv" references in prompts
16
+
17
+ ---
18
+
19
+ ## Fix Order (TDD)
20
+
21
+ ### Phase 1: Write Failing Tests
22
+
23
+ **Task 1.1**: Create test for ChatMessage text extraction
24
+
25
+ ```python
26
+ # tests/unit/test_orchestrator_magentic.py
27
+
28
+ def test_process_event_extracts_text_from_chat_message():
29
+ """Final result event should extract text from ChatMessage object."""
30
+ # Arrange: Mock ChatMessage with .content attribute
31
+ # Act: Call _process_event with MagenticFinalResultEvent
32
+ # Assert: Returned AgentEvent.message is a string, not object repr
33
+ ```
34
+
35
+ **Task 1.2**: Create test for max rounds configuration
36
+
37
+ ```python
38
+ def test_orchestrator_uses_configured_max_rounds():
39
+ """MagenticOrchestrator should use max_rounds from constructor."""
40
+ # Arrange: Create orchestrator with max_rounds=10
41
+ # Act: Build workflow
42
+ # Assert: Workflow has max_round_count=10
43
+ ```
44
+
45
+ **Task 1.3**: Create test for bioRxiv reference removal
46
+
47
+ ```python
48
+ def test_task_prompt_references_europe_pmc():
49
+ """Task prompt should reference Europe PMC, not bioRxiv."""
50
+ # Arrange: Create orchestrator
51
+ # Act: Check task string in run()
52
+ # Assert: Contains "Europe PMC", not "bioRxiv"
53
+ ```
54
+
55
+ ---
56
+
57
+ ### Phase 2: Fix ChatMessage Text Extraction
58
+
59
+ **File**: `src/orchestrator_magentic.py`
60
+ **Lines**: 192-199
61
+
62
+ **Current Code**:
63
+ ```python
64
+ elif isinstance(event, MagenticFinalResultEvent):
65
+ text = event.message.text if event.message else "No result"
66
+ ```
67
+
68
+ **Fixed Code**:
69
+ ```python
70
+ elif isinstance(event, MagenticFinalResultEvent):
71
+ if event.message:
72
+ # ChatMessage may have .content or .text depending on version
73
+ if hasattr(event.message, 'content') and event.message.content:
74
+ text = str(event.message.content)
75
+ elif hasattr(event.message, 'text') and event.message.text:
76
+ text = str(event.message.text)
77
+ else:
78
+ # Fallback: convert entire message to string
79
+ text = str(event.message)
80
+ else:
81
+ text = "No result generated"
82
+ ```
83
+
84
+ **Why**: The `agent_framework.ChatMessage` object structure may vary. We need defensive extraction.
85
+
86
+ ---
87
+
88
+ ### Phase 3: Fix Max Rounds Configuration
89
+
90
+ **File**: `src/orchestrator_magentic.py`
91
+ **Lines**: 97-99
92
+
93
+ **Current Code**:
94
+ ```python
95
+ .with_standard_manager(
96
+ chat_client=manager_client,
97
+ max_round_count=self._max_rounds, # Already uses config
98
+ max_stall_count=3,
99
+ max_reset_count=2,
100
+ )
101
+ ```
102
+
103
+ **Issue**: Default `max_rounds` in `__init__` is 10, but workflow may need more for complex queries.
104
+
105
+ **Fix**: Verify the value flows through correctly. Add logging.
106
+
107
+ ```python
108
+ logger.info(
109
+ "Building Magentic workflow",
110
+ max_rounds=self._max_rounds,
111
+ max_stall=3,
112
+ max_reset=2,
113
+ )
114
+ ```
115
+
116
+ **Also check**: `src/orchestrator_factory.py` passes config correctly:
117
+ ```python
118
+ return MagenticOrchestrator(
119
+ max_rounds=config.max_iterations if config else 10,
120
+ )
121
+ ```
122
+
123
+ ---
124
+
125
+ ### Phase 4: Fix Stale bioRxiv References
126
+
127
+ **Files to update**:
128
+
129
+ | File | Line | Change |
130
+ |------|------|--------|
131
+ | `src/orchestrator_magentic.py` | 131 | "bioRxiv" → "Europe PMC" |
132
+ | `src/agents/magentic_agents.py` | 32-33 | "bioRxiv" → "Europe PMC" |
133
+ | `src/app.py` | 202-203 | "bioRxiv" → "Europe PMC" |
134
+
135
+ **Search command to verify**:
136
+ ```bash
137
+ grep -rn "bioRxiv\|biorxiv" src/
138
+ ```
139
+
140
+ ---
141
+
142
+ ## Implementation Checklist
143
+
144
+ ```
145
+ [ ] Phase 1: Write failing tests
146
+ [ ] 1.1 Test ChatMessage text extraction
147
+ [ ] 1.2 Test max rounds configuration
148
+ [ ] 1.3 Test Europe PMC references
149
+
150
+ [ ] Phase 2: Fix ChatMessage extraction
151
+ [ ] Update _process_event() in orchestrator_magentic.py
152
+ [ ] Run test 1.1 - should pass
153
+
154
+ [ ] Phase 3: Fix max rounds
155
+ [ ] Add logging to _build_workflow()
156
+ [ ] Verify factory passes config correctly
157
+ [ ] Run test 1.2 - should pass
158
+
159
+ [ ] Phase 4: Fix bioRxiv references
160
+ [ ] Update orchestrator_magentic.py task prompt
161
+ [ ] Update magentic_agents.py descriptions
162
+ [ ] Update app.py UI text
163
+ [ ] Run test 1.3 - should pass
164
+ [ ] Run grep to verify no remaining refs
165
+
166
+ [ ] Final Verification
167
+ [ ] make check passes
168
+ [ ] All tests pass (108+)
169
+ [ ] Manual test: run_magentic.py produces readable report
170
+ ```
171
+
172
+ ---
173
+
174
+ ## Test Commands
175
+
176
+ ```bash
177
+ # Run specific test file
178
+ uv run pytest tests/unit/test_orchestrator_magentic.py -v
179
+
180
+ # Run all tests
181
+ uv run pytest tests/unit/ -v
182
+
183
+ # Full check
184
+ make check
185
+
186
+ # Manual integration test
187
+ set -a && source .env && set +a
188
+ uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer"
189
+ ```
190
+
191
+ ---
192
+
193
+ ## Success Criteria
194
+
195
+ 1. `run_magentic.py` outputs a readable research report (not `<ChatMessage object>`)
196
+ 2. Report includes: Executive Summary, Key Findings, Drug Candidates, References
197
+ 3. No "Max round count reached" error with default settings
198
+ 4. No "bioRxiv" references anywhere in codebase
199
+ 5. All 108+ tests pass
200
+ 6. `make check` passes
201
+
202
+ ---
203
+
204
+ ## Files Modified
205
+
206
+ ```
207
+ src/
208
+ ├── orchestrator_magentic.py # ChatMessage fix, logging
209
+ ├── agents/magentic_agents.py # bioRxiv → Europe PMC
210
+ └── app.py # bioRxiv → Europe PMC
211
+
212
+ tests/unit/
213
+ └── test_orchestrator_magentic.py # NEW: 3 tests
214
+ ```
215
+
216
+ ---
217
+
218
+ ## Notes for AI Agent
219
+
220
+ When implementing this fix plan:
221
+
222
+ 1. **DO NOT** create mock data or fake responses
223
+ 2. **DO** write real tests that verify actual behavior
224
+ 3. **DO** run `make check` after each phase
225
+ 4. **DO** test with real OpenAI API key via `.env`
226
+ 5. **DO** preserve existing functionality - simple mode must still work
227
+ 6. **DO NOT** over-engineer - minimal changes to fix the specific bugs
docs/bugs/P0_MAGENTIC_MODE_BROKEN.md ADDED
@@ -0,0 +1,116 @@
1
+ # P0 Bug: Magentic Mode Returns ChatMessage Object Instead of Report Text
2
+
3
+ **Status**: OPEN
4
+ **Priority**: P0 (Critical)
5
+ **Date**: 2025-11-27
6
+
7
+ ---
8
+
9
+ ## Actual Bug Found (Not What We Thought)
10
+
11
+ **The OpenAI key works fine.** The real bug is different:
12
+
13
+ ### The Problem
14
+
15
+ When Magentic mode completes, the final report returns a `ChatMessage` object instead of the actual text:
16
+
17
+ ```
18
+ FINAL REPORT:
19
+ <agent_framework._types.ChatMessage object at 0x11db70310>
20
+ ```
21
+
22
+ ### Evidence
23
+
24
+ Full test output shows:
25
+ 1. Magentic orchestrator starts correctly
26
+ 2. SearchAgent finds evidence
27
+ 3. HypothesisAgent generates hypotheses
28
+ 4. JudgeAgent evaluates
29
+ 5. **BUT**: Final output is `ChatMessage` object, not text
30
+
31
+ ### Root Cause
32
+
33
+ In `src/orchestrator_magentic.py` line 193:
34
+
35
+ ```python
36
+ elif isinstance(event, MagenticFinalResultEvent):
37
+ text = event.message.text if event.message else "No result"
38
+ ```
39
+
40
+ The `event.message` is a `ChatMessage` object, and `.text` may not extract the content correctly, or the message structure changed in the agent-framework library.
41
+
42
+ ---
43
+
44
+ ## Secondary Issue: Max Rounds Reached
45
+
46
+ The orchestrator hits max rounds before producing a report:
47
+
48
+ ```
49
+ [ERROR] Magentic Orchestrator: Max round count reached
50
+ ```
51
+
52
+ This means the workflow times out before the ReportAgent synthesizes the final output.
53
+
54
+ ---
55
+
56
+ ## What Works
57
+
58
+ - OpenAI API key: **Works** (loaded from .env)
59
+ - SearchAgent: **Works** (finds evidence from PubMed, ClinicalTrials, Europe PMC)
60
+ - HypothesisAgent: **Works** (generates Drug -> Target -> Pathway chains)
61
+ - JudgeAgent: **Partial** (evaluates but sometimes loses context)
62
+
63
+ ---
64
+
65
+ ## Files to Fix
66
+
67
+ | File | Line | Issue |
68
+ |------|------|-------|
69
+ | `src/orchestrator_magentic.py` | 193 | `event.message.text` returns object, not string |
70
+ | `src/orchestrator_magentic.py` | 97-99 | `max_round_count=3` too low for full pipeline |
71
+
72
+ ---
73
+
74
+ ## Suggested Fix
75
+
76
+ ```python
77
+ # In _process_event, line 192-199
78
+ elif isinstance(event, MagenticFinalResultEvent):
79
+ # Handle ChatMessage object properly
80
+ if event.message:
81
+ if hasattr(event.message, 'content'):
82
+ text = event.message.content
83
+ elif hasattr(event.message, 'text'):
84
+ text = event.message.text
85
+ else:
86
+ text = str(event.message)
87
+ else:
88
+ text = "No result"
89
+ ```
90
+
91
+ And increase rounds:
92
+
93
+ ```python
94
+ # In _build_workflow, line 97
95
+ max_round_count=self._max_rounds, # Use configured value, default 10
96
+ ```
97
+
98
+ ---
99
+
100
+ ## Test Command
101
+
102
+ ```bash
103
+ set -a && source .env && set +a && uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer"
104
+ ```
105
+
106
+ ---
107
+
108
+ ## Simple Mode Works
109
+
110
+ For reference, simple mode produces full reports:
111
+
112
+ ```bash
113
+ uv run python examples/orchestrator_demo/run_agent.py "metformin alzheimer"
114
+ ```
115
+
116
+ Output includes a structured report with Drug Candidates, Key Findings, etc.
docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md ADDED
@@ -0,0 +1,81 @@
1
+ # P1 Bug: Gradio Settings Accordion Not Collapsing
2
+
3
+ **Priority**: P1 (UX Bug)
4
+ **Status**: OPEN
5
+ **Date**: 2025-11-27
6
+ **Target Component**: `src/app.py`
7
+
8
+ ---
9
+
10
+ ## 1. Problem Description
11
+
12
+ The "Settings" accordion in the Gradio UI (containing Orchestrator Mode, API Key, Provider) fails to collapse, even when configured with `open=False`. It remains permanently expanded, cluttering the interface and obscuring the chat history.
13
+
14
+ ### Symptoms
15
+ - Accordion arrow toggles visually, but content remains visible.
16
+ - Occurs in both local development (`uv run src/app.py`) and HuggingFace Spaces.
17
+
18
+ ---
19
+
20
+ ## 2. Root Cause Analysis
21
+
22
+ **Definitive Cause**: Nested `Blocks` Context Bug.
23
+ `gr.ChatInterface` is itself a high-level abstraction that creates a `gr.Blocks` context. Wrapping `gr.ChatInterface` inside an external `with gr.Blocks():` context causes event listener conflicts, specifically breaking the JavaScript state management for `additional_inputs_accordion`.
24
+
25
+ **Reference**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) confirms that `additional_inputs_accordion` malfunctions when `ChatInterface` is not the top-level block.
26
+
27
+ ---
28
+
29
+ ## 3. Solution Strategy: "The Unwrap Fix"
30
+
31
+ We will remove the redundant `gr.Blocks` wrapper. This restores the native behavior of `ChatInterface`, ensuring the accordion respects `open=False`.
32
+
33
+ ### Implementation Plan
34
+
35
+ **Refactor `src/app.py` / `create_demo()`**:
36
+
37
+ 1. **Remove** the `with gr.Blocks() as demo:` context manager.
38
+ 2. **Instantiate** `gr.ChatInterface` directly as the `demo` object.
39
+ 3. **Migrate UI Elements**:
40
+ * **Header**: Move the H1/Title text into the `title` parameter of `ChatInterface`.
41
+ * **Footer**: Move the footer text ("MCP Server Active...") into the `description` parameter. `ChatInterface` supports Markdown in `description`, making it the ideal place for static info below the title but above the chat.
42
+
43
+ ### Before (Buggy)
44
+ ```python
45
+ def create_demo():
46
+ with gr.Blocks() as demo: # <--- CAUSE OF BUG
47
+ gr.Markdown("# Title")
48
+ gr.ChatInterface(..., additional_inputs_accordion=gr.Accordion(open=False))
49
+ gr.Markdown("Footer")
50
+ return demo
51
+ ```
52
+
53
+ ### After (Correct)
54
+ ```python
55
+ def create_demo():
56
+ return gr.ChatInterface( # <--- FIX: Top-level component
57
+ ...,
58
+ title="🧬 DeepCritical",
59
+ description="*AI-Powered Drug Repurposing Agent...*\n\n---\n**MCP Server Active**...",
60
+ additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False)
61
+ )
62
+ ```
63
+
64
+ ---
65
+
66
+ ## 4. Validation
67
+
68
+ 1. **Run**: `uv run python src/app.py`
69
+ 2. **Check**: Open `http://localhost:7860`
70
+ 3. **Verify**:
71
+ * Settings accordion starts **COLLAPSED**.
72
+ * Header title ("DeepCritical") is visible.
73
+ * Footer text ("MCP Server Active") is visible in the description area.
74
+ * Chat functionality works (Magentic/Simple modes).
75
+
76
+ ---
77
+
78
+ ## 5. Constraints & Notes
79
+
80
+ - **Layout**: We lose the ability to place arbitrary elements *below* the chat box (footer will move to top, under title), but this is an acceptable trade-off for a working UI.
81
+ - **CSS**: `ChatInterface` handles its own CSS; any custom class styling from the previous footer will be standardized to the description text style.
docs/development/testing.md ADDED
@@ -0,0 +1,139 @@
1
+ # Testing Strategy
2
+ ## Ensuring DeepCritical Is Ironclad
3
+
4
+ ---
5
+
6
+ ## Overview
7
+
8
+ Our testing strategy follows a strict **Pyramid of Reliability**:
9
+ 1. **Unit Tests**: Fast, isolated logic checks (60% of tests)
10
+ 2. **Integration Tests**: Tool interactions & Agent loops (30% of tests)
11
+ 3. **E2E / Regression Tests**: Full research workflows (10% of tests)
12
+
13
+ **Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.
14
+
15
+ ---
16
+
17
+ ## 1. Unit Tests (Fast & Cheap)
18
+
19
+ **Location**: `tests/unit/`
20
+
21
+ Focus on individual components without external network calls. Mock everything.
22
+
23
+ ### Key Test Cases
24
+
25
+ #### Agent Logic
26
+ - **Initialization**: Verify default config loads correctly.
27
+ - **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment).
28
+ - **Budget Checks**: Test `should_continue()` returns `False` when budget exceeded.
29
+ - **Error Handling**: Test partial failure recovery (e.g., one tool fails, agent continues).
30
+
31
+ #### Tools (Mocked)
32
+ - **Parser Logic**: Feed raw XML/JSON to tool parsers and verify `Evidence` objects.
33
+ - **Validation**: Ensure tools reject invalid queries (empty strings, etc.).
34
+
35
+ #### Judge Prompts
36
+ - **Schema Compliance**: Verify prompt templates generate valid JSON structure instructions.
37
+ - **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts.
38
+
39
+ ```python
40
+ # Example: Testing State Logic
41
+ def test_budget_stop():
42
+ state = ResearchState(tokens_used=50001, max_tokens=50000)
43
+ assert should_continue(state) is False
44
+ ```
45
+
46
+ ---
47
+
48
+ ## 2. Integration Tests (Realistic & Mocked I/O)
49
+
50
+ **Location**: `tests/integration/`
51
+
52
+ Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or **Replay** patterns to record/replay API calls to save money/time.
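+
+ A minimal sketch of the record/replay pattern with VCR.py (assuming vcrpy's httpx support; the cassette directory and test file name are illustrative):
+
+ ```python
+ # tests/integration/test_pubmed_cassette.py (illustrative)
+ import httpx
+ import vcr
+
+ # Records the HTTP exchange to a YAML cassette on the first run,
+ # then replays it on later runs without touching the network.
+ pubmed_vcr = vcr.VCR(
+     cassette_library_dir="tests/fixtures/cassettes",
+     record_mode="once",
+ )
+
+
+ async def test_esearch_replayed():
+     with pubmed_vcr.use_cassette("esearch_metformin.yaml"):
+         async with httpx.AsyncClient() as client:
+             resp = await client.get(
+                 "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
+                 params={"db": "pubmed", "term": "metformin alzheimer", "retmode": "json"},
+             )
+     assert resp.status_code == 200
+ ```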
53
+
54
+ ### Key Test Cases
55
+
56
+ #### Search Loop
57
+ - **Iteration Flow**: Verify agent performs Search -> Judge -> Search loop.
58
+ - **Tool Selection**: Verify correct tools are called based on judge output (mocked judge).
59
+ - **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2.
60
+
61
+ #### MCP Server Integration
62
+ - **Server Startup**: Verify MCP server starts and exposes tools.
63
+ - **Client Connection**: Verify agent can call tools via MCP protocol.
64
+
65
+ ```python
66
+ # Example: Testing Search Loop with Mocked Tools
67
+ async def test_search_loop_flow():
68
+ agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
69
+ report = await agent.run("test query")
70
+ assert agent.state.iterations > 0
71
+ assert len(report.sources) > 0
72
+ ```
73
+
74
+ ---
75
+
76
+ ## 3. End-to-End (E2E) Tests (The "Real Deal")
77
+
78
+ **Location**: `tests/e2e/`
79
+
80
+ Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit.
81
+
82
+ ### Key Test Cases
83
+
84
+ #### The "Golden Query"
85
+ Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"*
86
+ - **Success Criteria**:
87
+ - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
88
+ - Includes citations from PubMed.
89
+ - Completes within 3 iterations.
90
+ - JSON output matches schema.
91
+
92
+ #### Deployment Smoke Test
93
+ - **Gradio UI**: Verify UI launches and accepts input.
94
+ - **Streaming**: Verify generator yields chunks (first chunk within 2s).
95
+
96
+ ---
97
+
98
+ ## 4. Tools & Config
99
+
100
+ ### Pytest Configuration
101
+ ```toml
102
+ # pyproject.toml
103
+ [tool.pytest.ini_options]
104
+ markers = [
105
+ "unit: fast, isolated tests",
106
+ "integration: mocked network tests",
107
+ "e2e: real network tests (slow, expensive)"
108
+ ]
109
+ asyncio_mode = "auto"
110
+ ```
111
+
112
+ ### CI/CD Pipeline (GitHub Actions)
113
+ 1. **Lint**: `ruff check .`
114
+ 2. **Type Check**: `mypy .`
115
+ 3. **Unit**: `pytest -m unit`
116
+ 4. **Integration**: `pytest -m integration`
117
+ 5. **E2E**: (Manual trigger only)
118
+
119
+ ---
120
+
121
+ ## 5. Anti-Hallucination Validation
122
+
123
+ How do we test if the agent is lying?
124
+
125
+ 1. **Citation Check**:
126
+ - Regex verify that every `[PMID: 12345]` in the report exists in the `Evidence` list.
127
+ - Fail if a citation is "orphaned" (hallucinated ID).
128
+
129
+ 2. **Negative Constraints**:
130
+ - Test queries for fake diseases ("Ligma syndrome") -> Agent should return "No evidence found".
131
+
132
+ ---
133
+
134
+ ## Checklist for Implementation
135
+
136
+ - [ ] Set up `tests/` directory structure
137
+ - [ ] Configure `pytest` and `vcrpy`
138
+ - [ ] Create `tests/fixtures/` for mock data (PubMed XMLs)
139
+ - [ ] Write first unit test for `ResearchState`
docs/examples/writer_agents_usage.md ADDED
@@ -0,0 +1,425 @@
1
+ # Writer Agents Usage Examples
2
+
3
+ This document provides examples of how to use the writer agents in DeepCritical for generating research reports.
4
+
5
+ ## Overview
6
+
7
+ DeepCritical provides three writer agents for different report generation scenarios:
8
+
9
+ 1. **WriterAgent** - Basic writer for simple reports from findings
10
+ 2. **LongWriterAgent** - Iterative writer for long-form multi-section reports
11
+ 3. **ProofreaderAgent** - Finalizes and polishes report drafts
12
+
13
+ ## WriterAgent
14
+
15
+ The `WriterAgent` generates final reports from research findings. It's used in iterative research flows.
16
+
17
+ ### Basic Usage
18
+
19
+ ```python
20
+ from src.agent_factory.agents import create_writer_agent
21
+
22
+ # Create writer agent
23
+ writer = create_writer_agent()
24
+
25
+ # Generate report
26
+ query = "What is the capital of France?"
27
+ findings = """
28
+ Paris is the capital of France [1].
29
+ It is located in the north-central part of the country [2].
30
+
31
+ [1] https://example.com/france-info
32
+ [2] https://example.com/paris-info
33
+ """
34
+
35
+ report = await writer.write_report(
36
+ query=query,
37
+ findings=findings,
38
+ )
39
+
40
+ print(report)
41
+ ```
42
+
43
+ ### With Output Length Specification
44
+
45
+ ```python
46
+ report = await writer.write_report(
47
+ query="Explain machine learning",
48
+ findings=findings,
49
+ output_length="500 words",
50
+ )
51
+ ```
52
+
53
+ ### With Additional Instructions
54
+
55
+ ```python
56
+ report = await writer.write_report(
57
+ query="Explain machine learning",
58
+ findings=findings,
59
+ output_length="A comprehensive overview",
60
+ output_instructions="Use formal academic language and include examples",
61
+ )
62
+ ```
63
+
64
+ ### Integration with IterativeResearchFlow
65
+
66
+ The `WriterAgent` is automatically used by `IterativeResearchFlow`:
67
+
68
+ ```python
69
+ from src.agent_factory.agents import create_iterative_flow
70
+
71
+ flow = create_iterative_flow(max_iterations=5, max_time_minutes=10)
72
+ report = await flow.run(
73
+ query="What is quantum computing?",
74
+ output_length="A detailed explanation",
75
+ output_instructions="Include practical applications",
76
+ )
77
+ ```
78
+
79
+ ## LongWriterAgent
80
+
81
+ The `LongWriterAgent` iteratively writes report sections with proper citation management. It's used in deep research flows.
82
+
83
+ ### Basic Usage
84
+
85
+ ```python
86
+ from src.agent_factory.agents import create_long_writer_agent
87
+ from src.utils.models import ReportDraft, ReportDraftSection
88
+
89
+ # Create long writer agent
90
+ long_writer = create_long_writer_agent()
91
+
92
+ # Create report draft with sections
93
+ report_draft = ReportDraft(
94
+ sections=[
95
+ ReportDraftSection(
96
+ section_title="Introduction",
97
+ section_content="Draft content for introduction with [1].",
98
+ ),
99
+ ReportDraftSection(
100
+ section_title="Methods",
101
+ section_content="Draft content for methods with [2].",
102
+ ),
103
+ ReportDraftSection(
104
+ section_title="Results",
105
+ section_content="Draft content for results with [3].",
106
+ ),
107
+ ]
108
+ )
109
+
110
+ # Generate full report
111
+ report = await long_writer.write_report(
112
+ original_query="What are the main features of Python?",
113
+ report_title="Python Programming Language Overview",
114
+ report_draft=report_draft,
115
+ )
116
+
117
+ print(report)
118
+ ```
119
+
120
+ ### Writing Individual Sections
121
+
122
+ You can also write sections one at a time:
123
+
124
+ ```python
125
+ # Write first section
126
+ section_output = await long_writer.write_next_section(
127
+ original_query="What is Python?",
128
+ report_draft="", # No existing draft
129
+ next_section_title="Introduction",
130
+ next_section_draft="Python is a programming language...",
131
+ )
132
+
133
+ print(section_output.next_section_markdown)
134
+ print(section_output.references)
135
+
136
+ # Write second section with existing draft
137
+ section_output = await long_writer.write_next_section(
138
+ original_query="What is Python?",
139
+ report_draft="# Report\n\n## Introduction\n\nContent...",
140
+ next_section_title="Features",
141
+ next_section_draft="Python features include...",
142
+ )
143
+ ```
144
+
145
+ ### Integration with DeepResearchFlow
146
+
147
+ The `LongWriterAgent` is automatically used by `DeepResearchFlow`:
148
+
149
+ ```python
150
+ from src.agent_factory.agents import create_deep_flow
151
+
152
+ flow = create_deep_flow(
153
+ max_iterations=5,
154
+ max_time_minutes=10,
155
+ use_long_writer=True, # Use long writer (default)
156
+ )
157
+
158
+ report = await flow.run("What are the main features of Python programming language?")
159
+ ```
160
+
161
+ ## ProofreaderAgent
162
+
163
+ The `ProofreaderAgent` finalizes and polishes report drafts by removing duplicates, adding summaries, and refining wording.
164
+
165
+ ### Basic Usage
166
+
167
+ ```python
168
+ from src.agent_factory.agents import create_proofreader_agent
169
+ from src.utils.models import ReportDraft, ReportDraftSection
170
+
171
+ # Create proofreader agent
172
+ proofreader = create_proofreader_agent()
173
+
174
+ # Create report draft
175
+ report_draft = ReportDraft(
176
+ sections=[
177
+ ReportDraftSection(
178
+ section_title="Introduction",
179
+ section_content="Python is a programming language [1].",
180
+ ),
181
+ ReportDraftSection(
182
+ section_title="Features",
183
+ section_content="Python has many features [2].",
184
+ ),
185
+ ]
186
+ )
187
+
188
+ # Proofread and finalize
189
+ final_report = await proofreader.proofread(
190
+ query="What is Python?",
191
+ report_draft=report_draft,
192
+ )
193
+
194
+ print(final_report)
195
+ ```
196
+
197
+ ### Integration with DeepResearchFlow
198
+
199
+ Use `ProofreaderAgent` instead of `LongWriterAgent`:
200
+
201
+ ```python
202
+ from src.agent_factory.agents import create_deep_flow
203
+
204
+ flow = create_deep_flow(
205
+ max_iterations=5,
206
+ max_time_minutes=10,
207
+ use_long_writer=False, # Use proofreader instead
208
+ )
209
+
210
+ report = await flow.run("What are the main features of Python?")
211
+ ```
212
+
213
+ ## Error Handling
214
+
215
+ All writer agents include robust error handling:
216
+
217
+ ### Handling Empty Inputs
218
+
219
+ ```python
220
+ # WriterAgent handles empty findings gracefully
221
+ report = await writer.write_report(
222
+ query="Test query",
223
+ findings="", # Empty findings
224
+ )
225
+ # Returns a fallback report
226
+
227
+ # LongWriterAgent handles empty sections
228
+ report = await long_writer.write_report(
229
+ original_query="Test",
230
+ report_title="Test Report",
231
+ report_draft=ReportDraft(sections=[]), # Empty draft
232
+ )
233
+ # Returns minimal report
234
+
235
+ # ProofreaderAgent handles empty drafts
236
+ report = await proofreader.proofread(
237
+ query="Test",
238
+ report_draft=ReportDraft(sections=[]),
239
+ )
240
+ # Returns minimal report
241
+ ```
242
+
243
+ ### Retry Logic
244
+
245
+ All agents automatically retry on transient errors (timeouts, connection errors):
246
+
247
+ ```python
248
+ # Automatically retries up to 3 times on transient failures
249
+ report = await writer.write_report(
250
+ query="Test query",
251
+ findings=findings,
252
+ )
253
+ ```
254
+
255
+ ### Fallback Reports
256
+
257
+ If all retries fail, agents return fallback reports:
258
+
259
+ ```python
260
+ # Returns fallback report with query and findings
261
+ report = await writer.write_report(
262
+ query="Test query",
263
+ findings=findings,
264
+ )
265
+ # Fallback includes: "# Research Report\n\n## Query\n...\n\n## Findings\n..."
266
+ ```
267
+
268
+ ## Citation Validation
269
+
270
+ ### For Markdown Reports
271
+
272
+ Use the markdown citation validator:
273
+
274
+ ```python
275
+ from src.utils.citation_validator import validate_markdown_citations
276
+ from src.utils.models import Evidence, Citation
277
+
278
+ # Collect evidence during research
279
+ evidence = [
280
+ Evidence(
281
+ content="Paris is the capital of France",
282
+ citation=Citation(
283
+ source="web",
284
+ title="France Information",
285
+ url="https://example.com/france",
286
+ date="2024-01-01",
287
+ ),
288
+ ),
289
+ ]
290
+
291
+ # Generate report
292
+ report = await writer.write_report(query="What is the capital of France?", findings=findings)
293
+
294
+ # Validate citations
295
+ validated_report, removed_count = validate_markdown_citations(report, evidence)
296
+
297
+ if removed_count > 0:
298
+ print(f"Removed {removed_count} invalid citations")
299
+ ```
300
+
301
+ ### For ResearchReport Objects
302
+
303
+ Use the structured citation validator:
304
+
305
+ ```python
306
+ from src.utils.citation_validator import validate_references
307
+
308
+ # For ResearchReport objects (from ReportAgent)
309
+ validated_report = validate_references(report, evidence)
310
+ ```
311
+
312
+ ## Custom Model Configuration
313
+
314
+ All writer agents support custom model configuration:
315
+
316
+ ```python
317
+ from pydantic_ai import Model
318
+
319
+ # Create custom model
320
+ custom_model = Model("openai", "gpt-4")
321
+
322
+ # Use with writer agents
323
+ writer = create_writer_agent(model=custom_model)
324
+ long_writer = create_long_writer_agent(model=custom_model)
325
+ proofreader = create_proofreader_agent(model=custom_model)
326
+ ```
327
+
328
+ ## Best Practices
329
+
330
+ 1. **Use WriterAgent for simple reports** - When you have findings as a string and need a quick report
331
+ 2. **Use LongWriterAgent for structured reports** - When you need multiple sections with proper citation management
332
+ 3. **Use ProofreaderAgent for final polish** - When you have draft sections and need a polished final report
333
+ 4. **Validate citations** - Always validate citations against collected evidence
334
+ 5. **Handle errors gracefully** - All agents return fallback reports on failure
335
+ 6. **Specify output length** - Use `output_length` parameter to control report size
336
+ 7. **Provide instructions** - Use `output_instructions` for specific formatting requirements
337
+
338
+ ## Integration Examples
339
+
340
+ ### Full Iterative Research Flow
341
+
342
+ ```python
343
+ from src.agent_factory.agents import create_iterative_flow
344
+
345
+ flow = create_iterative_flow(
346
+ max_iterations=5,
347
+ max_time_minutes=10,
348
+ )
349
+
350
+ report = await flow.run(
351
+ query="What is machine learning?",
352
+ output_length="A comprehensive 1000-word explanation",
353
+ output_instructions="Include practical examples and use cases",
354
+ )
355
+ ```
356
+
357
+ ### Full Deep Research Flow with Long Writer
358
+
359
+ ```python
360
+ from src.agent_factory.agents import create_deep_flow
361
+
362
+ flow = create_deep_flow(
363
+ max_iterations=5,
364
+ max_time_minutes=10,
365
+ use_long_writer=True,
366
+ )
367
+
368
+ report = await flow.run("What are the main features of Python programming language?")
369
+ ```
370
+
371
+ ### Full Deep Research Flow with Proofreader
372
+
373
+ ```python
374
+ from src.agent_factory.agents import create_deep_flow
375
+
376
+ flow = create_deep_flow(
377
+ max_iterations=5,
378
+ max_time_minutes=10,
379
+ use_long_writer=False, # Use proofreader
380
+ )
381
+
382
+ report = await flow.run("Explain quantum computing basics")
383
+ ```
384
+
385
+ ## Troubleshooting
386
+
387
+ ### Empty Reports
388
+
389
+ If you get empty reports, check:
390
+ - Input validation logs (agents log warnings for empty inputs)
391
+ - LLM API key configuration
392
+ - Network connectivity
393
+
394
+ ### Citation Issues
395
+
396
+ If citations are missing or invalid:
397
+ - Use `validate_markdown_citations()` to check citations
398
+ - Ensure Evidence objects are properly collected during research
399
+ - Check that URLs in findings match Evidence URLs
400
+
401
+ ### Performance Issues
402
+
403
+ For large reports:
404
+ - Use `LongWriterAgent` for better section management
405
+ - Consider truncating very long findings (agents do this automatically)
406
+ - Use appropriate `max_time_minutes` settings
407
+
408
+ ## See Also
409
+
410
+ - [Research Flows Documentation](../orchestrator/research_flows.md)
411
+ - [Citation Validation](../utils/citation_validation.md)
412
+ - [Agent Factory](../agent_factory/agents.md)
413
+
414
+
415
+
416
+
417
+
418
+
419
+
420
+
421
+
422
+
423
+
424
+
425
+
docs/guides/deployment.md ADDED
@@ -0,0 +1,142 @@
1
+ # Deployment Guide
2
+ ## Launching DeepCritical: Gradio, MCP, & Modal
3
+
4
+ ---
5
+
6
+ ## Overview
7
+
8
+ DeepCritical is designed for a multi-platform deployment strategy to maximize hackathon impact:
9
+
10
+ 1. **HuggingFace Spaces**: Host the Gradio UI (User Interface).
11
+ 2. **MCP Server**: Expose research tools to Claude Desktop/Agents.
12
+ 3. **Modal (Optional)**: Run heavy inference or local LLMs if API costs are prohibitive.
13
+
14
+ ---
15
+
16
+ ## 1. HuggingFace Spaces (Gradio UI)
17
+
18
+ **Goal**: A public URL where judges/users can try the research agent.
19
+
20
+ ### Prerequisites
21
+ - HuggingFace Account
22
+ - `gradio` installed (`uv add gradio`)
23
+
24
+ ### Steps
25
+
26
+ 1. **Create Space**:
27
+ - Go to HF Spaces -> Create New Space.
28
+ - SDK: **Gradio**.
29
+ - Hardware: **CPU Basic** (Free) is sufficient (since we use APIs).
30
+
31
+ 2. **Prepare Files**:
32
+ - Ensure `app.py` contains the Gradio interface construction.
33
+ - Ensure `requirements.txt` or `pyproject.toml` lists all dependencies.
34
+
35
+ 3. **Secrets**:
36
+ - Go to Space Settings -> **Repository secrets**.
37
+ - Add `ANTHROPIC_API_KEY` (or your chosen LLM provider key).
38
+ - Add `BRAVE_API_KEY` (for web search).
39
+
40
+ 4. **Deploy**:
41
+ - Push code to the Space's git repo.
42
+ - Watch "Build" logs.
43
+
44
+ ### Streaming Optimization
45
+ Ensure `app.py` uses generator functions for the chat interface to prevent timeouts:
46
+ ```python
47
+ # app.py
48
+ def predict(message, history):
49
+ agent = ResearchAgent()
50
+ for update in agent.research_stream(message):
51
+ yield update
52
+ ```
53
+
54
+ ---
55
+
56
+ ## 2. MCP Server Deployment
57
+
58
+ **Goal**: Allow other agents (like Claude Desktop) to use our PubMed/Research tools directly.
59
+
60
+ ### Local Usage (Claude Desktop)
61
+
62
+ 1. **Install**:
63
+ ```bash
64
+ uv sync
65
+ ```
66
+
67
+ 2. **Configure Claude Desktop**:
68
+ Edit `~/Library/Application Support/Claude/claude_desktop_config.json`:
69
+ ```json
70
+ {
71
+ "mcpServers": {
72
+ "deepcritical": {
73
+ "command": "uv",
74
+ "args": ["run", "fastmcp", "run", "src/mcp_servers/pubmed_server.py"],
75
+ "cwd": "/absolute/path/to/DeepCritical"
76
+ }
77
+ }
78
+ }
79
+ ```
80
+
81
+ 3. **Restart Claude**: You should see a 🔌 icon indicating connected tools.
82
+
83
+ ### Remote Deployment (Smithery/Glama)
84
+ *Target for "MCP Track" bonus points.*
85
+
86
+ 1. **Dockerize**: Create a `Dockerfile` for the MCP server.
87
+ ```dockerfile
88
+ FROM python:3.11-slim
89
+ COPY . /app
90
+ RUN pip install fastmcp httpx
91
+ CMD ["fastmcp", "run", "src/mcp_servers/pubmed_server.py", "--transport", "sse"]
92
+ ```
93
+ *Note: Use SSE transport for remote/HTTP servers.*
94
+
95
+ 2. **Deploy**: Host on Fly.io or Railway.
96
+
97
+ ---
98
+
99
+ ## 3. Modal (GPU/Heavy Compute)
100
+
101
+ **Goal**: Run a local LLM (e.g., Llama-3-70B) or handle massive parallel searches if APIs are too slow/expensive.
102
+
103
+ ### Setup
104
+ 1. **Install**: `uv add modal`
105
+ 2. **Auth**: `modal token new`
106
+
107
+ ### Logic
108
+ Instead of calling Anthropic API, we call a Modal function:
109
+
110
+ ```python
111
+ # src/llm/modal_client.py
112
+ import modal
113
+
114
+ stub = modal.Stub("deepcritical-inference")
115
+
116
+ @stub.function(gpu="A100")
117
+ def generate_text(prompt: str):
118
+ # Load vLLM or similar
119
+ ...
120
+ ```
121
+
122
+ ### When to use?
123
+ - **Hackathon Demo**: Stick to Anthropic/OpenAI APIs for speed/reliability.
124
+ - **Production/Stretch**: Use Modal if you hit rate limits or want to show off "Open Source Models" capability.
125
+
126
+ ---
127
+
128
+ ## Deployment Checklist
129
+
130
+ ### Pre-Flight
131
+ - [ ] Run `pytest -m unit` to ensure logic is sound.
132
+ - [ ] Run `pytest -m e2e` (one pass) to verify APIs connect.
133
+ - [ ] Check `requirements.txt` matches `pyproject.toml`.
134
+
135
+ ### Secrets Management
136
+ - [ ] **NEVER** commit `.env` files.
137
+ - [ ] Verify keys are added to HF Space settings.
138
+
139
+ ### Post-Launch
140
+ - [ ] Test the live URL.
141
+ - [ ] Verify "Stop" button in Gradio works (interrupts the agent).
142
+ - [ ] Record a walkthrough video (crucial for hackathon submission).
docs/implementation/01_phase_foundation.md ADDED
@@ -0,0 +1,587 @@
1
+ # Phase 1 Implementation Spec: Foundation & Tooling
2
+
3
+ **Goal**: Establish a "Gucci Banger" development environment using 2025 best practices.
4
+ **Philosophy**: "If the build isn't solid, the agent won't be."
5
+
6
+ ---
7
+
8
+ ## 1. Prerequisites
9
+
10
+ Before starting, ensure these are installed:
11
+
12
+ ```bash
13
+ # Install uv (Rust-based package manager)
14
+ curl -LsSf https://astral.sh/uv/install.sh | sh
15
+
16
+ # Verify
17
+ uv --version # Should be >= 0.4.0
18
+ ```
19
+
20
+ ---
21
+
22
+ ## 2. Project Initialization
23
+
24
+ ```bash
25
+ # From project root
26
+ uv init --name deepcritical
27
+ uv python install 3.11 # Pin Python version
28
+ ```
29
+
30
+ ---
31
+
32
+ ## 3. The Tooling Stack (Exact Dependencies)
33
+
34
+ ### `pyproject.toml` (Complete, Copy-Paste Ready)
35
+
36
+ ```toml
37
+ [project]
38
+ name = "deepcritical"
39
+ version = "0.1.0"
40
+ description = "AI-Native Drug Repurposing Research Agent"
41
+ readme = "README.md"
42
+ requires-python = ">=3.11"
43
+ dependencies = [
44
+ # Core
45
+ "pydantic>=2.7",
46
+ "pydantic-settings>=2.2", # For BaseSettings (config)
47
+ "pydantic-ai>=0.0.16", # Agent framework
48
+
49
+ # HTTP & Parsing
50
+ "httpx>=0.27", # Async HTTP client
51
+ "beautifulsoup4>=4.12", # HTML parsing
52
+ "xmltodict>=0.13", # PubMed XML -> dict
53
+
54
+ # Search
55
+ "duckduckgo-search>=6.0", # Free web search
56
+
57
+ # UI
58
+ "gradio>=5.0", # Chat interface
59
+
60
+ # Utils
61
+ "python-dotenv>=1.0", # .env loading
62
+ "tenacity>=8.2", # Retry logic
63
+ "structlog>=24.1", # Structured logging
64
+ ]
65
+
66
+ [project.optional-dependencies]
67
+ dev = [
68
+ # Testing
69
+ "pytest>=8.0",
70
+ "pytest-asyncio>=0.23",
71
+ "pytest-sugar>=1.0",
72
+ "pytest-cov>=5.0",
73
+ "pytest-mock>=3.12",
74
+ "respx>=0.21", # Mock httpx requests
75
+
76
+ # Quality
77
+ "ruff>=0.4.0",
78
+ "mypy>=1.10",
79
+ "pre-commit>=3.7",
80
+ ]
81
+
82
+ [build-system]
83
+ requires = ["hatchling"]
84
+ build-backend = "hatchling.build"
85
+
86
+ [tool.hatch.build.targets.wheel]
87
+ packages = ["src"]
88
+
89
+ # ============== RUFF CONFIG ==============
90
+ [tool.ruff]
91
+ line-length = 100
92
+ target-version = "py311"
93
+ src = ["src", "tests"]
94
+
95
+ [tool.ruff.lint]
96
+ select = [
97
+ "E", # pycodestyle errors
98
+ "F", # pyflakes
99
+ "B", # flake8-bugbear
100
+ "I", # isort
101
+ "N", # pep8-naming
102
+ "UP", # pyupgrade
103
+ "PL", # pylint
104
+ "RUF", # ruff-specific
105
+ ]
106
+ ignore = [
107
+ "PLR0913", # Too many arguments (agents need many params)
108
+ ]
109
+
110
+ [tool.ruff.lint.isort]
111
+ known-first-party = ["src"]
112
+
113
+ # ============== MYPY CONFIG ==============
114
+ [tool.mypy]
115
+ python_version = "3.11"
116
+ strict = true
117
+ ignore_missing_imports = true
118
+ disallow_untyped_defs = true
119
+ warn_return_any = true
120
+ warn_unused_ignores = true
121
+
122
+ # ============== PYTEST CONFIG ==============
123
+ [tool.pytest.ini_options]
124
+ testpaths = ["tests"]
125
+ asyncio_mode = "auto"
126
+ addopts = [
127
+ "-v",
128
+ "--tb=short",
129
+ "--strict-markers",
130
+ ]
131
+ markers = [
132
+ "unit: Unit tests (mocked)",
133
+ "integration: Integration tests (real APIs)",
134
+ "slow: Slow tests",
135
+ ]
136
+
137
+ # ============== COVERAGE CONFIG ==============
138
+ [tool.coverage.run]
139
+ source = ["src"]
140
+ omit = ["*/__init__.py"]
141
+
142
+ [tool.coverage.report]
143
+ exclude_lines = [
144
+ "pragma: no cover",
145
+ "if TYPE_CHECKING:",
146
+ "raise NotImplementedError",
147
+ ]
148
+ ```
149
+
150
+ ---
151
+
152
+ ## 4. Directory Structure (Maintainer's Structure)
153
+
154
+ ```bash
155
+ # Execute these commands to create the directory structure
156
+ mkdir -p src/utils
157
+ mkdir -p src/tools
158
+ mkdir -p src/prompts
159
+ mkdir -p src/agent_factory
160
+ mkdir -p src/middleware
161
+ mkdir -p src/database_services
162
+ mkdir -p src/retrieval_factory
163
+ mkdir -p tests/unit/tools
164
+ mkdir -p tests/unit/agent_factory
165
+ mkdir -p tests/unit/utils
166
+ mkdir -p tests/integration
167
+
168
+ # Create __init__.py files (required for imports)
169
+ touch src/__init__.py
170
+ touch src/utils/__init__.py
171
+ touch src/tools/__init__.py
172
+ touch src/prompts/__init__.py
173
+ touch src/agent_factory/__init__.py
174
+ touch tests/__init__.py
175
+ touch tests/unit/__init__.py
176
+ touch tests/unit/tools/__init__.py
177
+ touch tests/unit/agent_factory/__init__.py
178
+ touch tests/unit/utils/__init__.py
179
+ touch tests/integration/__init__.py
180
+ ```
181
+
182
+ ### Final Structure:
183
+
184
+ ```
185
+ src/
186
+ ├── __init__.py
187
+ ├── app.py # Entry point (Gradio UI)
188
+ ├── orchestrator.py # Agent loop
189
+ ├── agent_factory/ # Agent creation and judges
190
+ │ ├── __init__.py
191
+ │ ├── agents.py
192
+ │ └── judges.py
193
+ ├── tools/ # Search tools
194
+ │ ├── __init__.py
195
+ │ ├── pubmed.py
196
+ │ ├── websearch.py
197
+ │ └── search_handler.py
198
+ ├── prompts/ # Prompt templates
199
+ │ ├── __init__.py
200
+ │ └── judge.py
201
+ ├── utils/ # Shared utilities
202
+ │ ├── __init__.py
203
+ │ ├── config.py
204
+ │ ├── exceptions.py
205
+ │ ├── models.py
206
+ │ ├── dataloaders.py
207
+ │ └── parsers.py
208
+ ├── middleware/ # (Future)
209
+ ├── database_services/ # (Future)
210
+ └── retrieval_factory/ # (Future)
211
+
212
+ tests/
213
+ ├── __init__.py
214
+ ├── conftest.py
215
+ ├── unit/
216
+ │ ├── __init__.py
217
+ │ ├── tools/
218
+ │ │ ├── __init__.py
219
+ │ │ ├── test_pubmed.py
220
+ │ │ ├── test_websearch.py
221
+ │ │ └── test_search_handler.py
222
+ │ ├── agent_factory/
223
+ │ │ ├── __init__.py
224
+ │ │ └── test_judges.py
225
+ │ ├── utils/
226
+ │ │ ├── __init__.py
227
+ │ │ └── test_config.py
228
+ │ └── test_orchestrator.py
229
+ └── integration/
230
+ ├── __init__.py
231
+ └── test_pubmed_live.py
232
+ ```
233
+
234
+ ---
235
+
236
+ ## 5. Configuration Files
237
+
238
+ ### `.env.example` (Copy to `.env` and fill)
239
+
240
+ ```bash
241
+ # LLM Provider (choose one)
242
+ OPENAI_API_KEY=sk-your-key-here
243
+ ANTHROPIC_API_KEY=sk-ant-your-key-here
244
+
245
+ # Optional: PubMed API key (higher rate limits)
246
+ NCBI_API_KEY=your-ncbi-key-here
247
+
248
+ # Optional: For HuggingFace deployment
249
+ HF_TOKEN=hf_your-token-here
250
+
251
+ # Agent Config
252
+ MAX_ITERATIONS=10
253
+ LOG_LEVEL=INFO
254
+ ```
255
+
256
+ ### `.pre-commit-config.yaml`
257
+
258
+ ```yaml
259
+ repos:
260
+ - repo: https://github.com/astral-sh/ruff-pre-commit
261
+ rev: v0.4.4
262
+ hooks:
263
+ - id: ruff
264
+ args: [--fix]
265
+ - id: ruff-format
266
+
267
+ - repo: https://github.com/pre-commit/mirrors-mypy
268
+ rev: v1.10.0
269
+ hooks:
270
+ - id: mypy
271
+ additional_dependencies:
272
+ - pydantic>=2.7
273
+ - pydantic-settings>=2.2
274
+ args: [--ignore-missing-imports]
275
+ ```
276
+
277
+ ### `tests/conftest.py` (Pytest Fixtures)
278
+
279
+ ```python
280
+ """Shared pytest fixtures for all tests."""
281
+ import pytest
282
+ from unittest.mock import AsyncMock
283
+
284
+
285
+ @pytest.fixture
286
+ def mock_httpx_client(mocker):
287
+ """Mock httpx.AsyncClient for API tests."""
288
+ mock = mocker.patch("httpx.AsyncClient")
289
+ mock.return_value.__aenter__ = AsyncMock(return_value=mock.return_value)
290
+ mock.return_value.__aexit__ = AsyncMock(return_value=None)
291
+ return mock
292
+
293
+
294
+ @pytest.fixture
295
+ def mock_llm_response():
296
+ """Factory fixture for mocking LLM responses."""
297
+ def _mock(content: str):
298
+ return AsyncMock(return_value=content)
299
+ return _mock
300
+
301
+
302
+ @pytest.fixture
303
+ def sample_evidence():
304
+ """Sample Evidence objects for testing."""
305
+ from src.utils.models import Evidence, Citation
306
+ return [
307
+ Evidence(
308
+ content="Metformin shows promise in Alzheimer's...",
309
+ citation=Citation(
310
+ source="pubmed",
311
+ title="Metformin and Alzheimer's Disease",
312
+ url="https://pubmed.ncbi.nlm.nih.gov/12345678/",
313
+ date="2024-01-15"
314
+ ),
315
+ relevance=0.85
316
+ )
317
+ ]
318
+ ```
319
+
320
+ ---
321
+
322
+ ## 6. Core Utilities Implementation
323
+
324
+ ### `src/utils/config.py`
325
+
326
+ ```python
327
+ """Application configuration using Pydantic Settings."""
328
+ from pydantic_settings import BaseSettings, SettingsConfigDict
329
+ from pydantic import Field
330
+ from typing import Literal
331
+ import structlog
332
+
333
+
334
+ class Settings(BaseSettings):
335
+ """Strongly-typed application settings."""
336
+
337
+ model_config = SettingsConfigDict(
338
+ env_file=".env",
339
+ env_file_encoding="utf-8",
340
+ case_sensitive=False,
341
+ extra="ignore",
342
+ )
343
+
344
+ # LLM Configuration
345
+ openai_api_key: str | None = Field(default=None, description="OpenAI API key")
346
+ anthropic_api_key: str | None = Field(default=None, description="Anthropic API key")
347
+ llm_provider: Literal["openai", "anthropic"] = Field(
348
+ default="openai",
349
+ description="Which LLM provider to use"
350
+ )
351
+ openai_model: str = Field(default="gpt-4o", description="OpenAI model name")
352
+ anthropic_model: str = Field(default="claude-3-5-sonnet-20241022", description="Anthropic model")
353
+
354
+ # PubMed Configuration
355
+ ncbi_api_key: str | None = Field(default=None, description="NCBI API key for higher rate limits")
356
+
357
+ # Agent Configuration
358
+ max_iterations: int = Field(default=10, ge=1, le=50)
359
+ search_timeout: int = Field(default=30, description="Seconds to wait for search")
360
+
361
+ # Logging
362
+ log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = "INFO"
363
+
364
+ def get_api_key(self) -> str:
365
+ """Get the API key for the configured provider."""
366
+ if self.llm_provider == "openai":
367
+ if not self.openai_api_key:
368
+ raise ValueError("OPENAI_API_KEY not set")
369
+ return self.openai_api_key
370
+ else:
371
+ if not self.anthropic_api_key:
372
+ raise ValueError("ANTHROPIC_API_KEY not set")
373
+ return self.anthropic_api_key
374
+
375
+
376
+ def get_settings() -> Settings:
377
+ """Factory function to get settings (allows mocking in tests)."""
378
+ return Settings()
379
+
380
+
381
+ def configure_logging(settings: Settings) -> None:
382
+ """Configure structured logging."""
383
+ structlog.configure(
384
+ processors=[
385
+ structlog.stdlib.filter_by_level,
386
+ structlog.stdlib.add_logger_name,
387
+ structlog.stdlib.add_log_level,
388
+ structlog.processors.TimeStamper(fmt="iso"),
389
+ structlog.processors.JSONRenderer(),
390
+ ],
391
+ wrapper_class=structlog.stdlib.BoundLogger,
392
+ context_class=dict,
393
+ logger_factory=structlog.stdlib.LoggerFactory(),
394
+ )
395
+
396
+
397
+ # Singleton for easy import
398
+ settings = get_settings()
399
+ ```
400
+
401
+ ### `src/utils/exceptions.py`
402
+
403
+ ```python
404
+ """Custom exceptions for DeepCritical."""
405
+
406
+
407
+ class DeepCriticalError(Exception):
408
+ """Base exception for all DeepCritical errors."""
409
+ pass
410
+
411
+
412
+ class SearchError(DeepCriticalError):
413
+ """Raised when a search operation fails."""
414
+ pass
415
+
416
+
417
+ class JudgeError(DeepCriticalError):
418
+ """Raised when the judge fails to assess evidence."""
419
+ pass
420
+
421
+
422
+ class ConfigurationError(DeepCriticalError):
423
+ """Raised when configuration is invalid."""
424
+ pass
425
+
426
+
427
+ class RateLimitError(SearchError):
428
+ """Raised when we hit API rate limits."""
429
+ pass
430
+ ```
431
+
432
+ ---
433
+
434
+ ## 7. TDD Workflow: First Test
435
+
436
+ ### `tests/unit/utils/test_config.py`
437
+
438
+ ```python
439
+ """Unit tests for configuration loading."""
440
+ import pytest
441
+ from unittest.mock import patch
442
+ import os
443
+
444
+
445
+ class TestSettings:
446
+ """Tests for Settings class."""
447
+
448
+ def test_default_max_iterations(self):
449
+ """Settings should have default max_iterations of 10."""
450
+ from src.utils.config import Settings
451
+
452
+ # Clear any env vars
453
+ with patch.dict(os.environ, {}, clear=True):
454
+ settings = Settings()
455
+ assert settings.max_iterations == 10
456
+
457
+ def test_max_iterations_from_env(self):
458
+ """Settings should read MAX_ITERATIONS from env."""
459
+ from src.utils.config import Settings
460
+
461
+ with patch.dict(os.environ, {"MAX_ITERATIONS": "25"}):
462
+ settings = Settings()
463
+ assert settings.max_iterations == 25
464
+
465
+ def test_invalid_max_iterations_raises(self):
466
+ """Settings should reject invalid max_iterations."""
467
+ from src.utils.config import Settings
468
+ from pydantic import ValidationError
469
+
470
+ with patch.dict(os.environ, {"MAX_ITERATIONS": "100"}):
471
+ with pytest.raises(ValidationError):
472
+ Settings() # 100 > 50 (max)
473
+
474
+ def test_get_api_key_openai(self):
475
+ """get_api_key should return OpenAI key when provider is openai."""
476
+ from src.utils.config import Settings
477
+
478
+ with patch.dict(os.environ, {
479
+ "LLM_PROVIDER": "openai",
480
+ "OPENAI_API_KEY": "sk-test-key"
481
+ }):
482
+ settings = Settings()
483
+ assert settings.get_api_key() == "sk-test-key"
484
+
485
+ def test_get_api_key_missing_raises(self):
486
+ """get_api_key should raise when key is not set."""
487
+ from src.utils.config import Settings
488
+
489
+ with patch.dict(os.environ, {"LLM_PROVIDER": "openai"}, clear=True):
490
+ settings = Settings()
491
+ with pytest.raises(ValueError, match="OPENAI_API_KEY not set"):
492
+ settings.get_api_key()
493
+ ```
494
+
495
+ ---
496
+
497
+ ## 8. Makefile (Developer Experience)
498
+
499
+ Create a `Makefile` for standard devex commands:
500
+
501
+ ```makefile
502
+ .PHONY: install test lint format typecheck check clean
503
+
504
+ install:
505
+ uv sync --all-extras
506
+ uv run pre-commit install
507
+
508
+ test:
509
+ uv run pytest tests/unit/ -v
510
+
511
+ test-cov:
512
+ uv run pytest --cov=src --cov-report=term-missing
513
+
514
+ lint:
515
+ uv run ruff check src tests
516
+
517
+ format:
518
+ uv run ruff format src tests
519
+
520
+ typecheck:
521
+ uv run mypy src
522
+
523
+ check: lint typecheck test
524
+ @echo "All checks passed!"
525
+
526
+ clean:
527
+ rm -rf .pytest_cache .mypy_cache .ruff_cache __pycache__ .coverage
528
+ find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true
529
+ ```
530
+
531
+ ---
532
+
533
+ ## 9. Execution Commands
534
+
535
+ ```bash
536
+ # Install all dependencies
537
+ uv sync --all-extras
538
+
539
+ # Run tests (should pass after implementing config.py)
540
+ uv run pytest tests/unit/utils/test_config.py -v
541
+
542
+ # Run full test suite with coverage
543
+ uv run pytest --cov=src --cov-report=term-missing
544
+
545
+ # Run linting
546
+ uv run ruff check src tests
547
+ uv run ruff format src tests
548
+
549
+ # Run type checking
550
+ uv run mypy src
551
+
552
+ # Set up pre-commit hooks
553
+ uv run pre-commit install
554
+ ```
555
+
556
+ ---
557
+
558
+ ## 10. Implementation Checklist
559
+
560
+ - [ ] Install `uv` and verify version
561
+ - [ ] Run `uv init --name deepcritical`
562
+ - [ ] Create `pyproject.toml` (copy from above)
563
+ - [ ] Create directory structure (run mkdir commands)
564
+ - [ ] Create `.env.example` and `.env`
565
+ - [ ] Create `.pre-commit-config.yaml`
566
+ - [ ] Create `Makefile` (copy from above)
567
+ - [ ] Create `tests/conftest.py`
568
+ - [ ] Implement `src/utils/config.py`
569
+ - [ ] Implement `src/utils/exceptions.py`
570
+ - [ ] Write tests in `tests/unit/utils/test_config.py`
571
+ - [ ] Run `make install`
572
+ - [ ] Run `make check` — **ALL CHECKS MUST PASS**
573
+ - [ ] Commit: `git commit -m "feat: phase 1 foundation complete"`
574
+
575
+ ---
576
+
577
+ ## 11. Definition of Done
578
+
579
+ Phase 1 is **COMPLETE** when:
580
+
581
+ 1. `uv run pytest` passes with 100% of tests green
582
+ 2. `uv run ruff check src tests` has 0 errors
583
+ 3. `uv run mypy src` has 0 errors
584
+ 4. Pre-commit hooks are installed and working
585
+ 5. `from src.utils.config import settings` works in Python REPL
586
+
587
+ **Proceed to Phase 2 ONLY after all checkboxes are complete.**
docs/implementation/02_phase_search.md ADDED
@@ -0,0 +1,822 @@
1
+ # Phase 2 Implementation Spec: Search Vertical Slice
2
+
3
+ **Goal**: Implement the "Eyes and Ears" of the agent — retrieving real biomedical data.
4
+ **Philosophy**: "Real data, mocked connections."
5
+ **Prerequisite**: Phase 1 complete (all tests passing)
6
+
7
+ > **⚠️ Implementation Note (2025-01-27)**: The DuckDuckGo WebTool specified in this phase was removed in favor of the Europe PMC tool (see Phase 11). Europe PMC provides better coverage for biomedical research by including preprints, peer-reviewed articles, and patents. The current implementation uses PubMed, ClinicalTrials.gov, and Europe PMC as search sources.
8
+
9
+ ---
10
+
11
+ ## 1. The Slice Definition
12
+
13
+ This slice covers:
14
+ 1. **Input**: A string query (e.g., "metformin Alzheimer's disease").
15
+ 2. **Process**:
16
+ - Fetch from PubMed (E-utilities API).
17
+ - ~~Fetch from Web (DuckDuckGo).~~ **REMOVED** - Replaced by Europe PMC in Phase 11
18
+ - Normalize results into `Evidence` models.
19
+ 3. **Output**: A list of `Evidence` objects.
20
+
21
+ **Files to Create**:
22
+ - `src/utils/models.py` - Pydantic models (Evidence, Citation, SearchResult)
23
+ - `src/tools/pubmed.py` - PubMed E-utilities tool
24
+ - ~~`src/tools/websearch.py` - DuckDuckGo search tool~~ **REMOVED** - See Phase 11 for Europe PMC replacement
25
+ - `src/tools/search_handler.py` - Orchestrates multiple tools
26
+ - `src/tools/__init__.py` - Exports
27
+
28
+ **Additional Files (Post-Phase 2 Enhancements)**:
29
+ - `src/tools/query_utils.py` - Query preprocessing (removes question words, expands medical synonyms)
30
+
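+ The real `query_utils.py` is not reproduced in this spec; the sketch below only illustrates the two behaviours named above (question-word removal, synonym expansion) with made-up word lists:
+
+ ```python
+ # Illustrative sketch only - the shipped module may differ.
+ QUESTION_WORDS = {"what", "which", "can", "does", "is", "are", "how", "why"}
+ SYNONYMS = {"alzheimer": ["alzheimer disease", "AD"], "heart attack": ["myocardial infarction"]}
+
+
+ def preprocess_query(query: str) -> str:
+     """Strip question words and append known medical synonyms."""
+     terms = [w for w in query.lower().split() if w not in QUESTION_WORDS]
+     base = " ".join(terms)
+     expansions = [alt for key, alts in SYNONYMS.items() if key in base for alt in alts]
+     return " ".join([base, *expansions]).strip()
+ ```
+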
31
+ ---
32
+
33
+ ## 2. PubMed E-utilities API Reference
34
+
35
+ **Base URL**: `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/`
36
+
37
+ ### Key Endpoints
38
+
39
+ | Endpoint | Purpose | Example |
40
+ |----------|---------|---------|
41
+ | `esearch.fcgi` | Search for article IDs | `?db=pubmed&term=metformin+alzheimer&retmax=10` |
42
+ | `efetch.fcgi` | Fetch article details | `?db=pubmed&id=12345,67890&rettype=abstract&retmode=xml` |
43
+
44
+ ### Rate Limiting (CRITICAL!)
45
+
46
+ NCBI **requires** rate limiting:
47
+ - **Without API key**: 3 requests/second
48
+ - **With API key**: 10 requests/second
49
+
50
+ Get a free API key: https://www.ncbi.nlm.nih.gov/account/settings/
51
+
52
+ ```python
53
+ # Add to .env
54
+ NCBI_API_KEY=your-key-here # Optional but recommended
55
+ ```
56
+
57
+ ### Example Search Flow
58
+
59
+ ```
60
+ 1. esearch: "metformin alzheimer" → [PMID: 12345, 67890, ...]
61
+ 2. efetch: PMIDs → Full abstracts/metadata
62
+ 3. Parse XML → Evidence objects
63
+ ```
64
+
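+ Condensed into code, the flow looks roughly like this (a sketch of the two HTTP calls only; the full `PubMedTool` in section 4 adds retries, XML parsing and proper rate limiting):
+
+ ```python
+ import asyncio
+ import httpx
+
+ BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
+
+
+ async def pubmed_flow(query: str, api_key: str | None = None) -> str:
+     """esearch (query -> PMIDs) then efetch (PMIDs -> abstract XML)."""
+     extra = {"api_key": api_key} if api_key else {}
+     delay = 0.1 if api_key else 0.34  # ~10 req/s with a key, ~3 req/s without
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         r = await client.get(
+             f"{BASE}/esearch.fcgi",
+             params={"db": "pubmed", "term": query, "retmax": 5, "retmode": "json", **extra},
+         )
+         pmids = r.json()["esearchresult"]["idlist"]
+         await asyncio.sleep(delay)
+         r = await client.get(
+             f"{BASE}/efetch.fcgi",
+             params={"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "xml", **extra},
+         )
+         return r.text  # parsed into Evidence objects in section 4
+ ```
+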
65
+ ---
66
+
67
+ ## 3. Models (`src/utils/models.py`)
68
+
69
+ ```python
70
+ """Data models for the Search feature."""
71
+ from pydantic import BaseModel, Field
72
+ from typing import Literal
73
+
74
+
75
+ class Citation(BaseModel):
76
+ """A citation to a source document."""
77
+
78
+ source: Literal["pubmed", "web"] = Field(description="Where this came from")
79
+ title: str = Field(min_length=1, max_length=500)
80
+ url: str = Field(description="URL to the source")
81
+ date: str = Field(description="Publication date (YYYY-MM-DD or 'Unknown')")
82
+ authors: list[str] = Field(default_factory=list)
83
+
84
+ @property
85
+ def formatted(self) -> str:
86
+ """Format as a citation string."""
87
+ author_str = ", ".join(self.authors[:3])
88
+ if len(self.authors) > 3:
89
+ author_str += " et al."
90
+ return f"{author_str} ({self.date}). {self.title}. {self.source.upper()}"
91
+
92
+
93
+ class Evidence(BaseModel):
94
+ """A piece of evidence retrieved from search."""
95
+
96
+ content: str = Field(min_length=1, description="The actual text content")
97
+ citation: Citation
98
+ relevance: float = Field(default=0.0, ge=0.0, le=1.0, description="Relevance score 0-1")
99
+
100
+ class Config:
101
+ frozen = True # Immutable after creation
102
+
103
+
104
+ class SearchResult(BaseModel):
105
+ """Result of a search operation."""
106
+
107
+ query: str
108
+ evidence: list[Evidence]
109
+ sources_searched: list[str]  # names of the tools that were queried, e.g. "pubmed", "web"
110
+ total_found: int
111
+ errors: list[str] = Field(default_factory=list)
112
+ ```
113
+
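+ A small usage sketch (values are illustrative) showing how the three models compose:
+
+ ```python
+ from src.utils.models import Citation, Evidence, SearchResult
+
+ ev = Evidence(
+     content="Metformin shows neuroprotective properties in murine models...",
+     citation=Citation(
+         source="pubmed",
+         title="Metformin in Alzheimer's Disease",
+         url="https://pubmed.ncbi.nlm.nih.gov/12345678/",
+         date="2024-01-01",
+         authors=["Smith John", "Doe Jane"],
+     ),
+     relevance=0.8,
+ )
+ print(ev.citation.formatted)
+ # Smith John, Doe Jane (2024-01-01). Metformin in Alzheimer's Disease. PUBMED
+
+ result = SearchResult(query="metformin alzheimer", evidence=[ev], sources_searched=["pubmed"], total_found=1)
+ ```
+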
114
+ ---
115
+
116
+ ## 4. Tool Protocol (`src/tools/pubmed.py` and `src/tools/websearch.py`)
117
+
118
+ ### The Interface (Protocol) - Add to `src/tools/__init__.py`
119
+
120
+ ```python
121
+ """Search tools package."""
122
+ from typing import Protocol, List
+
+ from src.utils.models import Evidence  # resolves the Evidence annotation used below
123
+
124
+ # Import implementations
125
+ from src.tools.pubmed import PubMedTool
126
+ from src.tools.websearch import WebTool
127
+ from src.tools.search_handler import SearchHandler
128
+
129
+ # Re-export
130
+ __all__ = ["SearchTool", "PubMedTool", "WebTool", "SearchHandler"]
131
+
132
+
133
+ class SearchTool(Protocol):
134
+ """Protocol defining the interface for all search tools."""
135
+
136
+ @property
137
+ def name(self) -> str:
138
+ """Human-readable name of this tool."""
139
+ ...
140
+
141
+ async def search(self, query: str, max_results: int = 10) -> List["Evidence"]:
142
+ """
143
+ Execute a search and return evidence.
144
+
145
+ Args:
146
+ query: The search query string
147
+ max_results: Maximum number of results to return
148
+
149
+ Returns:
150
+ List of Evidence objects
151
+
152
+ Raises:
153
+ SearchError: If the search fails
154
+ RateLimitError: If we hit rate limits
155
+ """
156
+ ...
157
+ ```
158
+
159
+ ### PubMed Tool Implementation (`src/tools/pubmed.py`)
160
+
161
+ ```python
162
+ """PubMed search tool using NCBI E-utilities."""
163
+ import asyncio
164
+ import httpx
165
+ import xmltodict
166
+ from typing import List
167
+ from tenacity import retry, stop_after_attempt, wait_exponential
168
+
169
+ from src.utils.config import settings
170
+ from src.utils.exceptions import SearchError, RateLimitError
171
+ from src.utils.models import Evidence, Citation
172
+
173
+
174
+ class PubMedTool:
175
+ """Search tool for PubMed/NCBI."""
176
+
177
+ BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
178
+ RATE_LIMIT_DELAY = 0.34 # ~3 requests/sec without API key
179
+
180
+ def __init__(self, api_key: str | None = None):
181
+ self.api_key = api_key or getattr(settings, "ncbi_api_key", None)
182
+ self._last_request_time = 0.0
183
+
184
+ @property
185
+ def name(self) -> str:
186
+ return "pubmed"
187
+
188
+ async def _rate_limit(self) -> None:
189
+ """Enforce NCBI rate limiting."""
190
+ now = asyncio.get_event_loop().time()
191
+ elapsed = now - self._last_request_time
192
+ if elapsed < self.RATE_LIMIT_DELAY:
193
+ await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed)
194
+ self._last_request_time = asyncio.get_event_loop().time()
195
+
196
+ def _build_params(self, **kwargs) -> dict:
197
+ """Build request params with optional API key."""
198
+ params = {**kwargs, "retmode": "json"}
199
+ if self.api_key:
200
+ params["api_key"] = self.api_key
201
+ return params
202
+
203
+ @retry(
204
+ stop=stop_after_attempt(3),
205
+ wait=wait_exponential(multiplier=1, min=1, max=10),
206
+ reraise=True,
207
+ )
208
+ async def search(self, query: str, max_results: int = 10) -> List[Evidence]:
209
+ """
210
+ Search PubMed and return evidence.
211
+
212
+ 1. ESearch: Get PMIDs matching query
213
+ 2. EFetch: Get abstracts for those PMIDs
214
+ 3. Parse and return Evidence objects
215
+ """
216
+ await self._rate_limit()
217
+
218
+ async with httpx.AsyncClient(timeout=30.0) as client:
219
+ # Step 1: Search for PMIDs
220
+ search_params = self._build_params(
221
+ db="pubmed",
222
+ term=query,
223
+ retmax=max_results,
224
+ sort="relevance",
225
+ )
226
+
227
+ try:
228
+ search_resp = await client.get(
229
+ f"{self.BASE_URL}/esearch.fcgi",
230
+ params=search_params,
231
+ )
232
+ search_resp.raise_for_status()
233
+ except httpx.HTTPStatusError as e:
234
+ if e.response.status_code == 429:
235
+ raise RateLimitError("PubMed rate limit exceeded")
236
+ raise SearchError(f"PubMed search failed: {e}")
237
+
238
+ search_data = search_resp.json()
239
+ pmids = search_data.get("esearchresult", {}).get("idlist", [])
240
+
241
+ if not pmids:
242
+ return []
243
+
244
+ # Step 2: Fetch abstracts
245
+ await self._rate_limit()
246
+ fetch_params = self._build_params(
247
+ db="pubmed",
248
+ id=",".join(pmids),
249
+ rettype="abstract",
250
+ )
251
+ # Use XML for fetch (more reliable parsing)
252
+ fetch_params["retmode"] = "xml"
253
+
254
+ fetch_resp = await client.get(
255
+ f"{self.BASE_URL}/efetch.fcgi",
256
+ params=fetch_params,
257
+ )
258
+ fetch_resp.raise_for_status()
259
+
260
+ # Step 3: Parse XML to Evidence
261
+ return self._parse_pubmed_xml(fetch_resp.text)
262
+
263
+ def _parse_pubmed_xml(self, xml_text: str) -> List[Evidence]:
264
+ """Parse PubMed XML into Evidence objects."""
265
+ try:
266
+ data = xmltodict.parse(xml_text)
267
+ except Exception as e:
268
+ raise SearchError(f"Failed to parse PubMed XML: {e}")
269
+
270
+ articles = data.get("PubmedArticleSet", {}).get("PubmedArticle", [])
271
+
272
+ # Handle single article (xmltodict returns dict instead of list)
273
+ if isinstance(articles, dict):
274
+ articles = [articles]
275
+
276
+ evidence_list = []
277
+ for article in articles:
278
+ try:
279
+ evidence = self._article_to_evidence(article)
280
+ if evidence:
281
+ evidence_list.append(evidence)
282
+ except Exception:
283
+ continue # Skip malformed articles
284
+
285
+ return evidence_list
286
+
287
+ def _article_to_evidence(self, article: dict) -> Evidence | None:
288
+ """Convert a single PubMed article to Evidence."""
289
+ medline = article.get("MedlineCitation", {})
290
+ article_data = medline.get("Article", {})
291
+
292
+ # Extract PMID
293
+ pmid = medline.get("PMID", {})
294
+ if isinstance(pmid, dict):
295
+ pmid = pmid.get("#text", "")
296
+
297
+ # Extract title
298
+ title = article_data.get("ArticleTitle", "")
299
+ if isinstance(title, dict):
300
+ title = title.get("#text", str(title))
301
+
302
+ # Extract abstract
303
+ abstract_data = article_data.get("Abstract", {}).get("AbstractText", "")
304
+ if isinstance(abstract_data, list):
305
+ abstract = " ".join(
306
+ item.get("#text", str(item)) if isinstance(item, dict) else str(item)
307
+ for item in abstract_data
308
+ )
309
+ elif isinstance(abstract_data, dict):
310
+ abstract = abstract_data.get("#text", str(abstract_data))
311
+ else:
312
+ abstract = str(abstract_data)
313
+
314
+ if not abstract or not title:
315
+ return None
316
+
317
+ # Extract date
318
+ pub_date = article_data.get("Journal", {}).get("JournalIssue", {}).get("PubDate", {})
319
+ year = pub_date.get("Year", "Unknown")
320
+ month = pub_date.get("Month", "01")
321
+ day = pub_date.get("Day", "01")
322
+ date_str = f"{year}-{month}-{day}" if year != "Unknown" else "Unknown"
323
+
324
+ # Extract authors
325
+ author_list = article_data.get("AuthorList", {}).get("Author", [])
326
+ if isinstance(author_list, dict):
327
+ author_list = [author_list]
328
+ authors = []
329
+ for author in author_list[:5]: # Limit to 5 authors
330
+ last = author.get("LastName", "")
331
+ first = author.get("ForeName", "")
332
+ if last:
333
+ authors.append(f"{last} {first}".strip())
334
+
335
+ return Evidence(
336
+ content=abstract[:2000], # Truncate long abstracts
337
+ citation=Citation(
338
+ source="pubmed",
339
+ title=title[:500],
340
+ url=f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/",
341
+ date=date_str,
342
+ authors=authors,
343
+ ),
344
+ )
345
+ ```
346
+
347
+ ### DuckDuckGo Tool Implementation (`src/tools/websearch.py`)
348
+
349
+ ```python
350
+ """Web search tool using DuckDuckGo."""
351
+ from typing import List
352
+ from duckduckgo_search import DDGS
353
+
354
+ from src.utils.exceptions import SearchError
355
+ from src.utils.models import Evidence, Citation
356
+
357
+
358
+ class WebTool:
359
+ """Search tool for general web search via DuckDuckGo."""
360
+
361
+ def __init__(self):
362
+ pass
363
+
364
+ @property
365
+ def name(self) -> str:
366
+ return "web"
367
+
368
+ async def search(self, query: str, max_results: int = 10) -> List[Evidence]:
369
+ """
370
+ Search DuckDuckGo and return evidence.
371
+
372
+ Note: duckduckgo-search is synchronous, so we run it in executor.
373
+ """
374
+ import asyncio
375
+
376
+ loop = asyncio.get_event_loop()
377
+ try:
378
+ results = await loop.run_in_executor(
379
+ None,
380
+ lambda: self._sync_search(query, max_results),
381
+ )
382
+ return results
383
+ except Exception as e:
384
+ raise SearchError(f"Web search failed: {e}")
385
+
386
+ def _sync_search(self, query: str, max_results: int) -> List[Evidence]:
387
+ """Synchronous search implementation."""
388
+ evidence_list = []
389
+
390
+ with DDGS() as ddgs:
391
+ results = list(ddgs.text(query, max_results=max_results))
392
+
393
+ for result in results:
394
+ evidence_list.append(
395
+ Evidence(
396
+ content=result.get("body", "")[:1000],
397
+ citation=Citation(
398
+ source="web",
399
+ title=result.get("title", "Unknown")[:500],
400
+ url=result.get("href", ""),
401
+ date="Unknown",
402
+ authors=[],
403
+ ),
404
+ )
405
+ )
406
+
407
+ return evidence_list
408
+ ```
409
+
410
+ ---
411
+
412
+ ## 5. Search Handler (`src/tools/search_handler.py`)
413
+
414
+ The handler orchestrates multiple tools using the **Scatter-Gather** pattern.
415
+
416
+ ```python
417
+ """Search handler - orchestrates multiple search tools."""
418
+ import asyncio
419
+ from typing import List, Protocol
420
+ import structlog
421
+
422
+ from src.utils.exceptions import SearchError
423
+ from src.utils.models import Evidence, SearchResult
424
+
425
+ logger = structlog.get_logger()
426
+
427
+
428
+ class SearchTool(Protocol):
429
+ """Protocol defining the interface for all search tools."""
430
+
431
+ @property
432
+ def name(self) -> str:
433
+ ...
434
+
435
+ async def search(self, query: str, max_results: int = 10) -> List[Evidence]:
436
+ ...
437
+
438
+
439
+ def flatten(nested: List[List[Evidence]]) -> List[Evidence]:
440
+ """Flatten a list of lists into a single list."""
441
+ return [item for sublist in nested for item in sublist]
442
+
443
+
444
+ class SearchHandler:
445
+ """Orchestrates parallel searches across multiple tools."""
446
+
447
+ def __init__(self, tools: List[SearchTool], timeout: float = 30.0):
448
+ """
449
+ Initialize the search handler.
450
+
451
+ Args:
452
+ tools: List of search tools to use
453
+ timeout: Timeout for each search in seconds
454
+ """
455
+ self.tools = tools
456
+ self.timeout = timeout
457
+
458
+ async def execute(self, query: str, max_results_per_tool: int = 10) -> SearchResult:
459
+ """
460
+ Execute search across all tools in parallel.
461
+
462
+ Args:
463
+ query: The search query
464
+ max_results_per_tool: Max results from each tool
465
+
466
+ Returns:
467
+ SearchResult containing all evidence and metadata
468
+ """
469
+ logger.info("Starting search", query=query, tools=[t.name for t in self.tools])
470
+
471
+ # Create tasks for parallel execution
472
+ tasks = [
473
+ self._search_with_timeout(tool, query, max_results_per_tool)
474
+ for tool in self.tools
475
+ ]
476
+
477
+ # Gather results (don't fail if one tool fails)
478
+ results = await asyncio.gather(*tasks, return_exceptions=True)
479
+
480
+ # Process results
481
+ all_evidence: List[Evidence] = []
482
+ sources_searched: List[str] = []
483
+ errors: List[str] = []
484
+
485
+ for tool, result in zip(self.tools, results):
486
+ if isinstance(result, Exception):
487
+ errors.append(f"{tool.name}: {str(result)}")
488
+ logger.warning("Search tool failed", tool=tool.name, error=str(result))
489
+ else:
490
+ all_evidence.extend(result)
491
+ sources_searched.append(tool.name)
492
+ logger.info("Search tool succeeded", tool=tool.name, count=len(result))
493
+
494
+ return SearchResult(
495
+ query=query,
496
+ evidence=all_evidence,
497
+ sources_searched=sources_searched,
498
+ total_found=len(all_evidence),
499
+ errors=errors,
500
+ )
501
+
502
+ async def _search_with_timeout(
503
+ self,
504
+ tool: SearchTool,
505
+ query: str,
506
+ max_results: int,
507
+ ) -> List[Evidence]:
508
+ """Execute a single tool search with timeout."""
509
+ try:
510
+ return await asyncio.wait_for(
511
+ tool.search(query, max_results),
512
+ timeout=self.timeout,
513
+ )
514
+ except asyncio.TimeoutError:
515
+ raise SearchError(f"{tool.name} search timed out after {self.timeout}s")
516
+ ```
517
+
518
+ ---
519
+
520
+ ## 6. TDD Workflow
521
+
522
+ ### Test File: `tests/unit/tools/test_pubmed.py`
523
+
524
+ ```python
525
+ """Unit tests for PubMed tool."""
526
+ import pytest
527
+ from unittest.mock import AsyncMock, MagicMock
528
+
529
+
530
+ # Sample PubMed XML response for mocking
531
+ SAMPLE_PUBMED_XML = """<?xml version="1.0" ?>
532
+ <PubmedArticleSet>
533
+ <PubmedArticle>
534
+ <MedlineCitation>
535
+ <PMID>12345678</PMID>
536
+ <Article>
537
+ <ArticleTitle>Metformin in Alzheimer's Disease: A Systematic Review</ArticleTitle>
538
+ <Abstract>
539
+ <AbstractText>Metformin shows neuroprotective properties...</AbstractText>
540
+ </Abstract>
541
+ <AuthorList>
542
+ <Author>
543
+ <LastName>Smith</LastName>
544
+ <ForeName>John</ForeName>
545
+ </Author>
546
+ </AuthorList>
547
+ <Journal>
548
+ <JournalIssue>
549
+ <PubDate>
550
+ <Year>2024</Year>
551
+ <Month>01</Month>
552
+ </PubDate>
553
+ </JournalIssue>
554
+ </Journal>
555
+ </Article>
556
+ </MedlineCitation>
557
+ </PubmedArticle>
558
+ </PubmedArticleSet>
559
+ """
560
+
561
+
562
+ class TestPubMedTool:
563
+ """Tests for PubMedTool."""
564
+
565
+ @pytest.mark.asyncio
566
+ async def test_search_returns_evidence(self, mocker):
567
+ """PubMedTool should return Evidence objects from search."""
568
+ from src.tools.pubmed import PubMedTool
569
+
570
+ # Mock the HTTP responses
571
+ mock_search_response = MagicMock()
572
+ mock_search_response.json.return_value = {
573
+ "esearchresult": {"idlist": ["12345678"]}
574
+ }
575
+ mock_search_response.raise_for_status = MagicMock()
576
+
577
+ mock_fetch_response = MagicMock()
578
+ mock_fetch_response.text = SAMPLE_PUBMED_XML
579
+ mock_fetch_response.raise_for_status = MagicMock()
580
+
581
+ mock_client = AsyncMock()
582
+ mock_client.get = AsyncMock(side_effect=[mock_search_response, mock_fetch_response])
583
+ mock_client.__aenter__ = AsyncMock(return_value=mock_client)
584
+ mock_client.__aexit__ = AsyncMock(return_value=None)
585
+
586
+ mocker.patch("httpx.AsyncClient", return_value=mock_client)
587
+
588
+ # Act
589
+ tool = PubMedTool()
590
+ results = await tool.search("metformin alzheimer")
591
+
592
+ # Assert
593
+ assert len(results) == 1
594
+ assert results[0].citation.source == "pubmed"
595
+ assert "Metformin" in results[0].citation.title
596
+ assert "12345678" in results[0].citation.url
597
+
598
+ @pytest.mark.asyncio
599
+ async def test_search_empty_results(self, mocker):
600
+ """PubMedTool should return empty list when no results."""
601
+ from src.tools.pubmed import PubMedTool
602
+
603
+ mock_response = MagicMock()
604
+ mock_response.json.return_value = {"esearchresult": {"idlist": []}}
605
+ mock_response.raise_for_status = MagicMock()
606
+
607
+ mock_client = AsyncMock()
608
+ mock_client.get = AsyncMock(return_value=mock_response)
609
+ mock_client.__aenter__ = AsyncMock(return_value=mock_client)
610
+ mock_client.__aexit__ = AsyncMock(return_value=None)
611
+
612
+ mocker.patch("httpx.AsyncClient", return_value=mock_client)
613
+
614
+ tool = PubMedTool()
615
+ results = await tool.search("xyznonexistentquery123")
616
+
617
+ assert results == []
618
+
619
+ def test_parse_pubmed_xml(self):
620
+ """PubMedTool should correctly parse XML."""
621
+ from src.tools.pubmed import PubMedTool
622
+
623
+ tool = PubMedTool()
624
+ results = tool._parse_pubmed_xml(SAMPLE_PUBMED_XML)
625
+
626
+ assert len(results) == 1
627
+ assert results[0].citation.source == "pubmed"
628
+ assert "Smith John" in results[0].citation.authors
629
+ ```
630
+
631
+ ### Test File: `tests/unit/tools/test_websearch.py`
632
+
633
+ ```python
634
+ """Unit tests for WebTool."""
635
+ import pytest
636
+ from unittest.mock import MagicMock
637
+
638
+
639
+ class TestWebTool:
640
+ """Tests for WebTool."""
641
+
642
+ @pytest.mark.asyncio
643
+ async def test_search_returns_evidence(self, mocker):
644
+ """WebTool should return Evidence objects from search."""
645
+ from src.tools.websearch import WebTool
646
+
647
+ mock_results = [
648
+ {
649
+ "title": "Drug Repurposing Article",
650
+ "href": "https://example.com/article",
651
+ "body": "Some content about drug repurposing...",
652
+ }
653
+ ]
654
+
655
+ mock_ddgs = MagicMock()
656
+ mock_ddgs.__enter__ = MagicMock(return_value=mock_ddgs)
657
+ mock_ddgs.__exit__ = MagicMock(return_value=None)
658
+ mock_ddgs.text = MagicMock(return_value=mock_results)
659
+
660
+ mocker.patch("src.tools.websearch.DDGS", return_value=mock_ddgs)
661
+
662
+ tool = WebTool()
663
+ results = await tool.search("drug repurposing")
664
+
665
+ assert len(results) == 1
666
+ assert results[0].citation.source == "web"
667
+ assert "Drug Repurposing" in results[0].citation.title
668
+ ```
669
+
670
+ ### Test File: `tests/unit/tools/test_search_handler.py`
671
+
672
+ ```python
673
+ """Unit tests for SearchHandler."""
674
+ import pytest
675
+ from unittest.mock import AsyncMock
676
+
677
+ from src.utils.models import Evidence, Citation
678
+ from src.utils.exceptions import SearchError
679
+
680
+
681
+ class TestSearchHandler:
682
+ """Tests for SearchHandler."""
683
+
684
+ @pytest.mark.asyncio
685
+ async def test_execute_aggregates_results(self):
686
+ """SearchHandler should aggregate results from all tools."""
687
+ from src.tools.search_handler import SearchHandler
688
+
689
+ # Create mock tools
690
+ mock_tool_1 = AsyncMock()
691
+ mock_tool_1.name = "mock1"
692
+ mock_tool_1.search = AsyncMock(return_value=[
693
+ Evidence(
694
+ content="Result 1",
695
+ citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
696
+ )
697
+ ])
698
+
699
+ mock_tool_2 = AsyncMock()
700
+ mock_tool_2.name = "mock2"
701
+ mock_tool_2.search = AsyncMock(return_value=[
702
+ Evidence(
703
+ content="Result 2",
704
+ citation=Citation(source="web", title="T2", url="u2", date="2024"),
705
+ )
706
+ ])
707
+
708
+ handler = SearchHandler(tools=[mock_tool_1, mock_tool_2])
709
+ result = await handler.execute("test query")
710
+
711
+ assert result.total_found == 2
712
+ assert "mock1" in result.sources_searched
713
+ assert "mock2" in result.sources_searched
714
+ assert len(result.errors) == 0
715
+
716
+ @pytest.mark.asyncio
717
+ async def test_execute_handles_tool_failure(self):
718
+ """SearchHandler should continue if one tool fails."""
719
+ from src.tools.search_handler import SearchHandler
720
+
721
+ mock_tool_ok = AsyncMock()
722
+ mock_tool_ok.name = "ok_tool"
723
+ mock_tool_ok.search = AsyncMock(return_value=[
724
+ Evidence(
725
+ content="Good result",
726
+ citation=Citation(source="pubmed", title="T", url="u", date="2024"),
727
+ )
728
+ ])
729
+
730
+ mock_tool_fail = AsyncMock()
731
+ mock_tool_fail.name = "fail_tool"
732
+ mock_tool_fail.search = AsyncMock(side_effect=SearchError("API down"))
733
+
734
+ handler = SearchHandler(tools=[mock_tool_ok, mock_tool_fail])
735
+ result = await handler.execute("test")
736
+
737
+ assert result.total_found == 1
738
+ assert "ok_tool" in result.sources_searched
739
+ assert len(result.errors) == 1
740
+ assert "fail_tool" in result.errors[0]
741
+ ```
742
+
743
+ ---
744
+
745
+ ## 7. Integration Test (Optional, Real API)
746
+
747
+ ```python
748
+ # tests/integration/test_pubmed_live.py
749
+ """Integration tests that hit real APIs (run manually)."""
750
+ import pytest
751
+
752
+
753
+ @pytest.mark.integration
754
+ @pytest.mark.slow
755
+ @pytest.mark.asyncio
756
+ async def test_pubmed_live_search():
757
+ """Test real PubMed search (requires network)."""
758
+ from src.tools.pubmed import PubMedTool
759
+
760
+ tool = PubMedTool()
761
+ results = await tool.search("metformin diabetes", max_results=3)
762
+
763
+ assert len(results) > 0
764
+ assert results[0].citation.source == "pubmed"
765
+ assert "pubmed.ncbi.nlm.nih.gov" in results[0].citation.url
766
+
767
+
768
+ # Run with: uv run pytest tests/integration -m integration
769
+ ```
770
+
771
+ ---
772
+
773
+ ## 8. Implementation Checklist
774
+
775
+ - [x] Create `src/utils/models.py` with all Pydantic models (Evidence, Citation, SearchResult) - **COMPLETE**
776
+ - [x] Create `src/tools/__init__.py` with SearchTool Protocol and exports - **COMPLETE**
777
+ - [x] Implement `src/tools/pubmed.py` with PubMedTool class - **COMPLETE**
778
+ - [ ] ~~Implement `src/tools/websearch.py` with WebTool class~~ - **REMOVED** (replaced by Europe PMC in Phase 11)
779
+ - [x] Create `src/tools/search_handler.py` with SearchHandler class - **COMPLETE**
780
+ - [x] Write tests in `tests/unit/tools/test_pubmed.py` - **COMPLETE** (basic tests)
781
+ - [ ] Write tests in `tests/unit/tools/test_websearch.py` - **N/A** (WebTool removed)
782
+ - [x] Write tests in `tests/unit/tools/test_search_handler.py` - **COMPLETE** (basic tests)
783
+ - [x] Run `uv run pytest tests/unit/tools/ -v` — **ALL TESTS MUST PASS** - **PASSING**
784
+ - [ ] (Optional) Run integration test: `uv run pytest -m integration`
785
+ - [ ] Add edge case tests (rate limiting, error handling, timeouts) - **PENDING**
786
+ - [x] Commit: `git commit -m "feat: phase 2 search slice complete"` - **DONE**
787
+
788
+ **Post-Phase 2 Enhancements**:
789
+ - [x] Query preprocessing (`src/tools/query_utils.py`) - **ADDED**
790
+ - [x] Europe PMC tool (Phase 11) - **ADDED**
791
+ - [x] ClinicalTrials tool (Phase 10) - **ADDED**
792
+
793
+ ---
794
+
795
+ ## 9. Definition of Done
796
+
797
+ Phase 2 is **COMPLETE** when:
798
+
799
+ 1. ✅ All unit tests pass: `uv run pytest tests/unit/tools/ -v` - **PASSING**
800
+ 2. ✅ `SearchHandler` can execute with search tools - **WORKING**
801
+ 3. ✅ Graceful degradation: if one tool fails, other tools still return results - **IMPLEMENTED**
802
+ 4. ✅ Rate limiting is enforced (verify no 429 errors) - **IMPLEMENTED**
803
+ 5. ✅ Can run this in Python REPL:
804
+
805
+ ```python
806
+ import asyncio
807
+ from src.tools.pubmed import PubMedTool
808
+ from src.tools.search_handler import SearchHandler
809
+
810
+ async def test():
811
+ handler = SearchHandler([PubMedTool()])
812
+ result = await handler.execute("metformin alzheimer")
813
+ print(f"Found {result.total_found} results")
814
+ for e in result.evidence[:3]:
815
+ print(f"- {e.citation.title}")
816
+
817
+ asyncio.run(test())
818
+ ```
819
+
820
+ **Note**: WebTool was removed in favor of Europe PMC (Phase 11). The current implementation uses PubMed as the primary Phase 2 tool, with Europe PMC and ClinicalTrials added in later phases.
821
+
822
+ **Proceed to Phase 3 ONLY after all checkboxes are complete.**
docs/implementation/03_phase_judge.md ADDED
@@ -0,0 +1,1052 @@
1
+ # Phase 3 Implementation Spec: Judge Vertical Slice
2
+
3
+ **Goal**: Implement the "Brain" of the agent — evaluating evidence quality.
4
+ **Philosophy**: "Structured Output or Bust."
5
+ **Prerequisite**: Phase 2 complete (all search tests passing)
6
+
7
+ ---
8
+
9
+ ## 1. The Slice Definition
10
+
11
+ This slice covers:
12
+ 1. **Input**: A user question + a list of `Evidence` (from Phase 2).
13
+ 2. **Process**:
14
+ - Construct a prompt with the evidence.
15
+ - Call LLM (PydanticAI / OpenAI / Anthropic).
16
+ - Force JSON structured output.
17
+ 3. **Output**: A `JudgeAssessment` object.
18
+
19
+ **Files to Create**:
20
+ - `src/utils/models.py` - Add JudgeAssessment models (extend from Phase 2)
21
+ - `src/prompts/judge.py` - Judge prompt templates
22
+ - `src/agent_factory/judges.py` - JudgeHandler with PydanticAI
23
+ - `tests/unit/agent_factory/test_judges.py` - Unit tests
24
+
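+ End to end, the slice boils down to one call (a sketch; `JudgeHandler` is implemented in section 4, `evidence_list` comes from the Phase 2 `SearchHandler`, and the call runs inside an async context):
+
+ ```python
+ from src.agent_factory.judges import JudgeHandler
+
+ handler = JudgeHandler()
+ assessment = await handler.assess(question="metformin alzheimer", evidence=evidence_list)
+ if assessment.recommendation == "synthesize":
+     ...  # enough evidence - move on to synthesis
+ else:
+     ...  # feed assessment.next_search_queries back into search
+ ```
+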
25
+ ---
26
+
27
+ ## 2. Models (Add to `src/utils/models.py`)
28
+
29
+ The output schema must be strict for reliable structured output.
30
+
31
+ ```python
32
+ """Add these models to src/utils/models.py (after Evidence models from Phase 2)."""
33
+ from pydantic import BaseModel, Field
34
+ from typing import List, Literal
35
+
36
+
37
+ class AssessmentDetails(BaseModel):
38
+ """Detailed assessment of evidence quality."""
39
+
40
+ mechanism_score: int = Field(
41
+ ...,
42
+ ge=0,
43
+ le=10,
44
+ description="How well does the evidence explain the mechanism? 0-10"
45
+ )
46
+ mechanism_reasoning: str = Field(
47
+ ...,
48
+ min_length=10,
49
+ description="Explanation of mechanism score"
50
+ )
51
+ clinical_evidence_score: int = Field(
52
+ ...,
53
+ ge=0,
54
+ le=10,
55
+ description="Strength of clinical/preclinical evidence. 0-10"
56
+ )
57
+ clinical_reasoning: str = Field(
58
+ ...,
59
+ min_length=10,
60
+ description="Explanation of clinical evidence score"
61
+ )
62
+ drug_candidates: List[str] = Field(
63
+ default_factory=list,
64
+ description="List of specific drug candidates mentioned"
65
+ )
66
+ key_findings: List[str] = Field(
67
+ default_factory=list,
68
+ description="Key findings from the evidence"
69
+ )
70
+
71
+
72
+ class JudgeAssessment(BaseModel):
73
+ """Complete assessment from the Judge."""
74
+
75
+ details: AssessmentDetails
76
+ sufficient: bool = Field(
77
+ ...,
78
+ description="Is evidence sufficient to provide a recommendation?"
79
+ )
80
+ confidence: float = Field(
81
+ ...,
82
+ ge=0.0,
83
+ le=1.0,
84
+ description="Confidence in the assessment (0-1)"
85
+ )
86
+ recommendation: Literal["continue", "synthesize"] = Field(
87
+ ...,
88
+ description="continue = need more evidence, synthesize = ready to answer"
89
+ )
90
+ next_search_queries: List[str] = Field(
91
+ default_factory=list,
92
+ description="If continue, what queries to search next"
93
+ )
94
+ reasoning: str = Field(
95
+ ...,
96
+ min_length=20,
97
+ description="Overall reasoning for the recommendation"
98
+ )
99
+ ```
100
+
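+ A quick schema check (illustrative values; assumes Pydantic v2, which the rest of this spec uses). This is what "Structured Output or Bust" rests on: malformed or out-of-range judge output fails validation instead of propagating silently:
+
+ ```python
+ from src.utils.models import JudgeAssessment
+
+ raw = """{
+   "details": {
+     "mechanism_score": 8,
+     "mechanism_reasoning": "AMPK activation gives a plausible neuroprotective mechanism.",
+     "clinical_evidence_score": 6,
+     "clinical_reasoning": "Observational cohorts only; no completed phase III trials.",
+     "drug_candidates": ["metformin"],
+     "key_findings": ["Lower dementia incidence in treated diabetic cohorts"]
+   },
+   "sufficient": true,
+   "confidence": 0.8,
+   "recommendation": "synthesize",
+   "next_search_queries": [],
+   "reasoning": "Mechanism and clinical evidence are consistent across the retrieved sources."
+ }"""
+
+ assessment = JudgeAssessment.model_validate_json(raw)  # raises ValidationError on bad fields
+ assert assessment.recommendation == "synthesize"
+ ```
+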
101
+ ---
102
+
103
+ ## 3. Prompt Engineering (`src/prompts/judge.py`)
104
+
105
+ We treat prompts as code. They should be versioned and clean.
106
+
107
+ ```python
108
+ """Judge prompts for evidence assessment."""
109
+ from typing import List
110
+ from src.utils.models import Evidence
111
+
112
+
113
+ SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
114
+
115
+ Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to recommend drug candidates for a given condition.
116
+
117
+ ## Evaluation Criteria
118
+
119
+ 1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
120
+ - 0-3: No clear mechanism, speculative
121
+ - 4-6: Some mechanistic insight, but gaps exist
122
+ - 7-10: Clear, well-supported mechanism of action
123
+
124
+ 2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
125
+ - 0-3: No clinical data, only theoretical
126
+ - 4-6: Preclinical or early clinical data
127
+ - 7-10: Strong clinical evidence (trials, meta-analyses)
128
+
129
+ 3. **Sufficiency**: Evidence is sufficient when:
130
+ - Combined scores >= 12 AND
131
+ - At least one specific drug candidate identified AND
132
+ - Clear mechanistic rationale exists
133
+
134
+ ## Output Rules
135
+
136
+ - Always output valid JSON matching the schema
137
+ - Be conservative: only recommend "synthesize" when truly confident
138
+ - If continuing, suggest specific, actionable search queries
139
+ - Never hallucinate drug names or findings not in the evidence
140
+ """
141
+
142
+
143
+ def format_user_prompt(question: str, evidence: List[Evidence]) -> str:
144
+ """
145
+ Format the user prompt with question and evidence.
146
+
147
+ Args:
148
+ question: The user's research question
149
+ evidence: List of Evidence objects from search
150
+
151
+ Returns:
152
+ Formatted prompt string
153
+ """
154
+ evidence_text = "\n\n".join([
155
+ f"### Evidence {i+1}\n"
156
+ f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
157
+ f"**URL**: {e.citation.url}\n"
158
+ f"**Date**: {e.citation.date}\n"
159
+ f"**Content**:\n{e.content[:1500]}..."
160
+ if len(e.content) > 1500 else
161
+ f"### Evidence {i+1}\n"
162
+ f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
163
+ f"**URL**: {e.citation.url}\n"
164
+ f"**Date**: {e.citation.date}\n"
165
+ f"**Content**:\n{e.content}"
166
+ for i, e in enumerate(evidence)
167
+ ])
168
+
169
+ return f"""## Research Question
170
+ {question}
171
+
172
+ ## Available Evidence ({len(evidence)} sources)
173
+
174
+ {evidence_text}
175
+
176
+ ## Your Task
177
+
178
+ Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates.
179
+ Respond with a JSON object matching the JudgeAssessment schema.
180
+ """
181
+
182
+
183
+ def format_empty_evidence_prompt(question: str) -> str:
184
+ """
185
+ Format prompt when no evidence was found.
186
+
187
+ Args:
188
+ question: The user's research question
189
+
190
+ Returns:
191
+ Formatted prompt string
192
+ """
193
+ return f"""## Research Question
194
+ {question}
195
+
196
+ ## Available Evidence
197
+
198
+ No evidence was found from the search.
199
+
200
+ ## Your Task
201
+
202
+ Since no evidence was found, recommend search queries that might yield better results.
203
+ Set sufficient=False and recommendation="continue".
204
+ Suggest 3-5 specific search queries.
205
+ """
206
+ ```
207
+
208
+ ---
209
+
210
+ ## 4. JudgeHandler Implementation (`src/agent_factory/judges.py`)
211
+
212
+ Using PydanticAI for structured output with retry logic.
213
+
214
+ ```python
215
+ """Judge handler for evidence assessment using PydanticAI."""
216
+ import os
217
+ import asyncio
+ import json
+ from typing import Any, List
+
+ # InferenceClient, json, asyncio and tenacity are used by HFInferenceJudgeHandler below
+ from huggingface_hub import InferenceClient
+ from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
218
+ import structlog
219
+ from pydantic_ai import Agent
220
+ from pydantic_ai.models.openai import OpenAIModel
221
+ from pydantic_ai.models.anthropic import AnthropicModel
222
+
223
+ from src.utils.models import Evidence, JudgeAssessment, AssessmentDetails
224
+ from src.utils.config import settings
225
+ from src.prompts.judge import SYSTEM_PROMPT, format_user_prompt, format_empty_evidence_prompt
226
+
227
+ logger = structlog.get_logger()
228
+
229
+
230
+ def get_model():
231
+ """Get the LLM model based on configuration."""
232
+ provider = getattr(settings, "llm_provider", "openai")
233
+
234
+ if provider == "anthropic":
235
+ return AnthropicModel(
236
+ model_name=getattr(settings, "anthropic_model", "claude-3-5-sonnet-20241022"),
237
+ api_key=os.getenv("ANTHROPIC_API_KEY"),
238
+ )
239
+ else:
240
+ return OpenAIModel(
241
+ model_name=getattr(settings, "openai_model", "gpt-4o"),
242
+ api_key=os.getenv("OPENAI_API_KEY"),
243
+ )
244
+
245
+
246
+ class JudgeHandler:
247
+ """
248
+ Handles evidence assessment using an LLM with structured output.
249
+
250
+ Uses PydanticAI to ensure responses match the JudgeAssessment schema.
251
+ """
252
+
253
+ def __init__(self, model=None):
254
+ """
255
+ Initialize the JudgeHandler.
256
+
257
+ Args:
258
+ model: Optional PydanticAI model. If None, uses config default.
259
+ """
260
+ self.model = model or get_model()
261
+ self.agent = Agent(
262
+ model=self.model,
263
+ result_type=JudgeAssessment,
264
+ system_prompt=SYSTEM_PROMPT,
265
+ retries=3,
266
+ )
267
+
268
+ async def assess(
269
+ self,
270
+ question: str,
271
+ evidence: List[Evidence],
272
+ ) -> JudgeAssessment:
273
+ """
274
+ Assess evidence and determine if it's sufficient.
275
+
276
+ Args:
277
+ question: The user's research question
278
+ evidence: List of Evidence objects from search
279
+
280
+ Returns:
281
+ JudgeAssessment with evaluation results
282
+
283
+ Raises:
284
+ JudgeError: If assessment fails after retries
285
+ """
286
+ logger.info(
287
+ "Starting evidence assessment",
288
+ question=question[:100],
289
+ evidence_count=len(evidence),
290
+ )
291
+
292
+ # Format the prompt based on whether we have evidence
293
+ if evidence:
294
+ user_prompt = format_user_prompt(question, evidence)
295
+ else:
296
+ user_prompt = format_empty_evidence_prompt(question)
297
+
298
+ try:
299
+ # Run the agent with structured output
300
+ result = await self.agent.run(user_prompt)
301
+ assessment = result.data
302
+
303
+ logger.info(
304
+ "Assessment complete",
305
+ sufficient=assessment.sufficient,
306
+ recommendation=assessment.recommendation,
307
+ confidence=assessment.confidence,
308
+ )
309
+
310
+ return assessment
311
+
312
+ except Exception as e:
313
+ logger.error("Assessment failed", error=str(e))
314
+ # Return a safe default assessment on failure
315
+ return self._create_fallback_assessment(question, str(e))
316
+
317
+ def _create_fallback_assessment(
318
+ self,
319
+ question: str,
320
+ error: str,
321
+ ) -> JudgeAssessment:
322
+ """
323
+ Create a fallback assessment when LLM fails.
324
+
325
+ Args:
326
+ question: The original question
327
+ error: The error message
328
+
329
+ Returns:
330
+ Safe fallback JudgeAssessment
331
+ """
332
+ return JudgeAssessment(
333
+ details=AssessmentDetails(
334
+ mechanism_score=0,
335
+ mechanism_reasoning="Assessment failed due to LLM error",
336
+ clinical_evidence_score=0,
337
+ clinical_reasoning="Assessment failed due to LLM error",
338
+ drug_candidates=[],
339
+ key_findings=[],
340
+ ),
341
+ sufficient=False,
342
+ confidence=0.0,
343
+ recommendation="continue",
344
+ next_search_queries=[
345
+ f"{question} mechanism",
346
+ f"{question} clinical trials",
347
+ f"{question} drug candidates",
348
+ ],
349
+ reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.",
350
+ )
351
+
352
+
353
+ class HFInferenceJudgeHandler:
354
+ """
355
+ JudgeHandler using HuggingFace Inference API for FREE LLM calls.
356
+
357
+ This is the DEFAULT for demo mode - provides real AI analysis without
358
+ requiring users to have OpenAI/Anthropic API keys.
359
+
360
+ Model Fallback Chain (handles gated models and rate limits):
361
+ 1. meta-llama/Llama-3.1-8B-Instruct (best quality, requires HF_TOKEN)
362
+ 2. mistralai/Mistral-7B-Instruct-v0.3 (good quality, may require token)
363
+ 3. HuggingFaceH4/zephyr-7b-beta (ungated, always works)
364
+
365
+ Rate Limit Handling:
366
+ - Exponential backoff with 3 retries
367
+ - Falls back to next model on persistent 429/503 errors
368
+ """
369
+
370
+ # Model fallback chain: gated (best) → ungated (fallback)
371
+ FALLBACK_MODELS = [
372
+ "meta-llama/Llama-3.1-8B-Instruct", # Best quality (gated)
373
+ "mistralai/Mistral-7B-Instruct-v0.3", # Good quality
374
+ "HuggingFaceH4/zephyr-7b-beta", # Ungated fallback
375
+ ]
376
+
377
+ def __init__(self, model_id: str | None = None) -> None:
378
+ """
379
+ Initialize with HF Inference client.
380
+
381
+ Args:
382
+ model_id: Optional specific model ID. If None, uses FALLBACK_MODELS chain.
383
+ """
384
+ self.model_id = model_id
385
+ # Will automatically use HF_TOKEN from env if available
386
+ self.client = InferenceClient()
387
+ self.call_count = 0
388
+ self.last_question: str | None = None
389
+ self.last_evidence: list[Evidence] | None = None
390
+
391
+ def _extract_json(self, text: str) -> dict[str, Any] | None:
392
+ """
393
+ Robust JSON extraction that handles markdown blocks and nested braces.
394
+ """
395
+ text = text.strip()
396
+
397
+ # Remove markdown code blocks if present (with bounds checking)
398
+ if "```json" in text:
399
+ parts = text.split("```json", 1)
400
+ if len(parts) > 1:
401
+ inner_parts = parts[1].split("```", 1)
402
+ text = inner_parts[0]
403
+ elif "```" in text:
404
+ parts = text.split("```", 1)
405
+ if len(parts) > 1:
406
+ inner_parts = parts[1].split("```", 1)
407
+ text = inner_parts[0]
408
+
409
+ text = text.strip()
410
+
411
+ # Find first '{'
412
+ start_idx = text.find("{")
413
+ if start_idx == -1:
414
+ return None
415
+
416
+ # Stack-based parsing ignoring chars in strings
417
+ count = 0
418
+ in_string = False
419
+ escape = False
420
+
421
+ for i, char in enumerate(text[start_idx:], start=start_idx):
422
+ if in_string:
423
+ if escape:
424
+ escape = False
425
+ elif char == "\\":
426
+ escape = True
427
+ elif char == '"':
428
+ in_string = False
429
+ elif char == '"':
430
+ in_string = True
431
+ elif char == "{":
432
+ count += 1
433
+ elif char == "}":
434
+ count -= 1
435
+ if count == 0:
436
+ try:
437
+ result = json.loads(text[start_idx : i + 1])
438
+ if isinstance(result, dict):
439
+ return result
440
+ return None
441
+ except json.JSONDecodeError:
442
+ return None
443
+
444
+ return None
445
+
446
+ @retry(
447
+ stop=stop_after_attempt(3),
448
+ wait=wait_exponential(multiplier=1, min=1, max=4),
449
+ retry=retry_if_exception_type(Exception),
450
+ reraise=True,
451
+ )
452
+ async def _call_with_retry(self, model: str, prompt: str, question: str) -> JudgeAssessment:
453
+ """Make API call with retry logic using chat_completion."""
454
+ loop = asyncio.get_running_loop()
455
+
456
+ # Build messages for chat_completion (model-agnostic)
457
+ messages = [
458
+ {
459
+ "role": "system",
460
+ "content": f"""{SYSTEM_PROMPT}
461
+
462
+ IMPORTANT: Respond with ONLY valid JSON matching this schema:
463
+ {{
464
+ "details": {{
465
+ "mechanism_score": <int 0-10>,
466
+ "mechanism_reasoning": "<string>",
467
+ "clinical_evidence_score": <int 0-10>,
468
+ "clinical_reasoning": "<string>",
469
+ "drug_candidates": ["<string>", ...],
470
+ "key_findings": ["<string>", ...]
471
+ }},
472
+ "sufficient": <bool>,
473
+ "confidence": <float 0-1>,
474
+ "recommendation": "continue" | "synthesize",
475
+ "next_search_queries": ["<string>", ...],
476
+ "reasoning": "<string>"
477
+ }}""",
478
+ },
479
+ {"role": "user", "content": prompt},
480
+ ]
481
+
482
+ # Use chat_completion (conversational task - supported by all models)
483
+ response = await loop.run_in_executor(
484
+ None,
485
+ lambda: self.client.chat_completion(
486
+ messages=messages,
487
+ model=model,
488
+ max_tokens=1024,
489
+ temperature=0.1,
490
+ ),
491
+ )
492
+
493
+ # Extract content from response
494
+ content = response.choices[0].message.content
495
+ if not content:
496
+ raise ValueError("Empty response from model")
497
+
498
+ # Extract and parse JSON
499
+ json_data = self._extract_json(content)
500
+ if not json_data:
501
+ raise ValueError("No valid JSON found in response")
502
+
503
+ return JudgeAssessment(**json_data)
504
+
505
+ async def assess(
506
+ self,
507
+ question: str,
508
+ evidence: list[Evidence],
509
+ ) -> JudgeAssessment:
510
+ """
511
+ Assess evidence using HuggingFace Inference API.
512
+ Attempts models in order until one succeeds.
513
+ """
514
+ self.call_count += 1
515
+ self.last_question = question
516
+ self.last_evidence = evidence
517
+
518
+ # Format the user prompt
519
+ if evidence:
520
+ user_prompt = format_user_prompt(question, evidence)
521
+ else:
522
+ user_prompt = format_empty_evidence_prompt(question)
523
+
524
+ models_to_try: list[str] = [self.model_id] if self.model_id else self.FALLBACK_MODELS
525
+ last_error: Exception | None = None
526
+
527
+ for model in models_to_try:
528
+ try:
529
+ return await self._call_with_retry(model, user_prompt, question)
530
+ except Exception as e:
531
+ logger.warning("Model failed", model=model, error=str(e))
532
+ last_error = e
533
+ continue
534
+
535
+ # All models failed
536
+ logger.error("All HF models failed", error=str(last_error))
537
+ return self._create_fallback_assessment(question, str(last_error))
538
+
539
+ def _create_fallback_assessment(
540
+ self,
541
+ question: str,
542
+ error: str,
543
+ ) -> JudgeAssessment:
544
+ """Create a fallback assessment when inference fails."""
545
+ return JudgeAssessment(
546
+ details=AssessmentDetails(
547
+ mechanism_score=0,
548
+ mechanism_reasoning=f"Assessment failed: {error}",
549
+ clinical_evidence_score=0,
550
+ clinical_reasoning=f"Assessment failed: {error}",
551
+ drug_candidates=[],
552
+ key_findings=[],
553
+ ),
554
+ sufficient=False,
555
+ confidence=0.0,
556
+ recommendation="continue",
557
+ next_search_queries=[
558
+ f"{question} mechanism",
559
+ f"{question} clinical trials",
560
+ f"{question} drug candidates",
561
+ ],
562
+ reasoning=f"HF Inference failed: {error}. Recommend retrying.",
563
+ )
564
+
565
+
566
+ class MockJudgeHandler:
567
+ """
568
+ Mock JudgeHandler for UNIT TESTING ONLY.
569
+
570
+ NOT for production use. Use HFInferenceJudgeHandler for demo mode.
571
+ """
572
+
573
+ def __init__(self, mock_response: JudgeAssessment | None = None):
574
+ """Initialize with optional mock response for testing."""
575
+ self.mock_response = mock_response
576
+ self.call_count = 0
577
+ self.last_question = None
578
+ self.last_evidence = None
579
+
580
+ async def assess(
581
+ self,
582
+ question: str,
583
+ evidence: List[Evidence],
584
+ ) -> JudgeAssessment:
585
+ """Return the mock response (for testing only)."""
586
+ self.call_count += 1
587
+ self.last_question = question
588
+ self.last_evidence = evidence
589
+
590
+ if self.mock_response:
591
+ return self.mock_response
592
+
593
+ # Default mock response for tests
594
+ return JudgeAssessment(
595
+ details=AssessmentDetails(
596
+ mechanism_score=7,
597
+ mechanism_reasoning="Mock assessment for testing",
598
+ clinical_evidence_score=6,
599
+ clinical_reasoning="Mock assessment for testing",
600
+ drug_candidates=["TestDrug"],
601
+ key_findings=["Test finding"],
602
+ ),
603
+ sufficient=len(evidence) >= 3,
604
+ confidence=0.75,
605
+ recommendation="synthesize" if len(evidence) >= 3 else "continue",
606
+ next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [],
607
+ reasoning="Mock assessment for unit testing only",
608
+ )
609
+ ```
610
+
611
+ ---
612
+
613
+ ## 5. TDD Workflow
614
+
615
+ ### Test File: `tests/unit/agent_factory/test_judges.py`
616
+
617
+ ```python
618
+ """Unit tests for JudgeHandler."""
619
+ import pytest
620
+ from unittest.mock import AsyncMock, MagicMock, patch
621
+
622
+ from src.utils.models import (
623
+ Evidence,
624
+ Citation,
625
+ JudgeAssessment,
626
+ AssessmentDetails,
627
+ )
628
+
629
+
630
+ class TestJudgeHandler:
631
+ """Tests for JudgeHandler."""
632
+
633
+ @pytest.mark.asyncio
634
+ async def test_assess_returns_assessment(self):
635
+ """JudgeHandler should return JudgeAssessment from LLM."""
636
+ from src.agent_factory.judges import JudgeHandler
637
+
638
+ # Create mock assessment
639
+ mock_assessment = JudgeAssessment(
640
+ details=AssessmentDetails(
641
+ mechanism_score=8,
642
+ mechanism_reasoning="Strong mechanistic evidence",
643
+ clinical_evidence_score=7,
644
+ clinical_reasoning="Good clinical support",
645
+ drug_candidates=["Metformin"],
646
+ key_findings=["Neuroprotective effects"],
647
+ ),
648
+ sufficient=True,
649
+ confidence=0.85,
650
+ recommendation="synthesize",
651
+ next_search_queries=[],
652
+ reasoning="Evidence is sufficient for synthesis",
653
+ )
654
+
655
+ # Mock the PydanticAI agent
656
+ mock_result = MagicMock()
657
+ mock_result.data = mock_assessment
658
+
659
+ with patch("src.agent_factory.judges.Agent") as mock_agent_class:
660
+ mock_agent = AsyncMock()
661
+ mock_agent.run = AsyncMock(return_value=mock_result)
662
+ mock_agent_class.return_value = mock_agent
663
+
664
+ handler = JudgeHandler()
665
+ # Replace the agent with our mock
666
+ handler.agent = mock_agent
667
+
668
+ evidence = [
669
+ Evidence(
670
+ content="Metformin shows neuroprotective properties...",
671
+ citation=Citation(
672
+ source="pubmed",
673
+ title="Metformin in AD",
674
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
675
+ date="2024-01-01",
676
+ ),
677
+ )
678
+ ]
679
+
680
+ result = await handler.assess("metformin alzheimer", evidence)
681
+
682
+ assert result.sufficient is True
683
+ assert result.recommendation == "synthesize"
684
+ assert result.confidence == 0.85
685
+ assert "Metformin" in result.details.drug_candidates
686
+
687
+ @pytest.mark.asyncio
688
+ async def test_assess_empty_evidence(self):
689
+ """JudgeHandler should handle empty evidence gracefully."""
690
+ from src.agent_factory.judges import JudgeHandler
691
+
692
+ mock_assessment = JudgeAssessment(
693
+ details=AssessmentDetails(
694
+ mechanism_score=0,
695
+ mechanism_reasoning="No evidence to assess",
696
+ clinical_evidence_score=0,
697
+ clinical_reasoning="No evidence to assess",
698
+ drug_candidates=[],
699
+ key_findings=[],
700
+ ),
701
+ sufficient=False,
702
+ confidence=0.0,
703
+ recommendation="continue",
704
+ next_search_queries=["metformin alzheimer mechanism"],
705
+ reasoning="No evidence found, need to search more",
706
+ )
707
+
708
+ mock_result = MagicMock()
709
+ mock_result.data = mock_assessment
710
+
711
+ with patch("src.agent_factory.judges.Agent") as mock_agent_class:
712
+ mock_agent = AsyncMock()
713
+ mock_agent.run = AsyncMock(return_value=mock_result)
714
+ mock_agent_class.return_value = mock_agent
715
+
716
+ handler = JudgeHandler()
717
+ handler.agent = mock_agent
718
+
719
+ result = await handler.assess("metformin alzheimer", [])
720
+
721
+ assert result.sufficient is False
722
+ assert result.recommendation == "continue"
723
+ assert len(result.next_search_queries) > 0
724
+
725
+ @pytest.mark.asyncio
726
+ async def test_assess_handles_llm_failure(self):
727
+ """JudgeHandler should return fallback on LLM failure."""
728
+ from src.agent_factory.judges import JudgeHandler
729
+
730
+ with patch("src.agent_factory.judges.Agent") as mock_agent_class:
731
+ mock_agent = AsyncMock()
732
+ mock_agent.run = AsyncMock(side_effect=Exception("API Error"))
733
+ mock_agent_class.return_value = mock_agent
734
+
735
+ handler = JudgeHandler()
736
+ handler.agent = mock_agent
737
+
738
+ evidence = [
739
+ Evidence(
740
+ content="Some content",
741
+ citation=Citation(
742
+ source="pubmed",
743
+ title="Title",
744
+ url="url",
745
+ date="2024",
746
+ ),
747
+ )
748
+ ]
749
+
750
+ result = await handler.assess("test question", evidence)
751
+
752
+ # Should return fallback, not raise
753
+ assert result.sufficient is False
754
+ assert result.recommendation == "continue"
755
+ assert "failed" in result.reasoning.lower()
756
+
757
+
758
+ class TestHFInferenceJudgeHandler:
759
+ """Tests for HFInferenceJudgeHandler."""
760
+
761
+ @pytest.mark.asyncio
762
+ async def test_extract_json_raw(self):
763
+ """Should extract raw JSON."""
764
+ from src.agent_factory.judges import HFInferenceJudgeHandler
765
+
766
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
767
+ # Bypass __init__ for unit testing extraction
768
+
769
+ result = handler._extract_json('{"key": "value"}')
770
+ assert result == {"key": "value"}
771
+
772
+ @pytest.mark.asyncio
773
+ async def test_extract_json_markdown_block(self):
774
+ """Should extract JSON from markdown code block."""
775
+ from src.agent_factory.judges import HFInferenceJudgeHandler
776
+
777
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
778
+
779
+ response = '''Here is the assessment:
780
+ ```json
781
+ {"key": "value", "nested": {"inner": 1}}
782
+ ```
783
+ '''
784
+ result = handler._extract_json(response)
785
+ assert result == {"key": "value", "nested": {"inner": 1}}
786
+
787
+ @pytest.mark.asyncio
788
+ async def test_extract_json_with_preamble(self):
789
+ """Should extract JSON with preamble text."""
790
+ from src.agent_factory.judges import HFInferenceJudgeHandler
791
+
792
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
793
+
794
+ response = 'Here is your JSON response:\n{"sufficient": true, "confidence": 0.85}'
795
+ result = handler._extract_json(response)
796
+ assert result == {"sufficient": True, "confidence": 0.85}
797
+
798
+ @pytest.mark.asyncio
799
+ async def test_extract_json_nested_braces(self):
800
+ """Should handle nested braces correctly."""
801
+ from src.agent_factory.judges import HFInferenceJudgeHandler
802
+
803
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
804
+
805
+ response = '{"details": {"mechanism_score": 8}, "reasoning": "test"}'
806
+ result = handler._extract_json(response)
807
+ assert result["details"]["mechanism_score"] == 8
808
+
809
+ @pytest.mark.asyncio
810
+ async def test_hf_handler_uses_fallback_models(self):
811
+ """HFInferenceJudgeHandler should have fallback model chain."""
812
+ from src.agent_factory.judges import HFInferenceJudgeHandler
813
+
814
+ # Check class has fallback models defined
815
+ assert len(HFInferenceJudgeHandler.FALLBACK_MODELS) >= 3
816
+ assert "zephyr-7b-beta" in HFInferenceJudgeHandler.FALLBACK_MODELS[-1]
817
+
818
+ @pytest.mark.asyncio
819
+ async def test_hf_handler_fallback_on_auth_error(self):
+ """Should fall back to the next model in the chain when the first one fails (e.g. gated 403)."""
+ from src.agent_factory.judges import HFInferenceJudgeHandler, MockJudgeHandler
+
+ handler = HFInferenceJudgeHandler()
+ expected = await MockJudgeHandler().assess("test", [])  # any valid JudgeAssessment
+
+ # First model raises an auth error, second model returns a valid assessment
+ handler._call_with_retry = AsyncMock(
+ side_effect=[Exception("403 Forbidden: gated model"), expected],
+ )
+
+ result = await handler.assess("test question", [])
+
+ assert result is expected
+ assert handler._call_with_retry.call_count == 2
+ # The second attempt used the next model in the fallback chain
+ assert handler._call_with_retry.call_args_list[1].args[0] == HFInferenceJudgeHandler.FALLBACK_MODELS[1]
837
+
838
+
839
+ class TestMockJudgeHandler:
840
+ """Tests for MockJudgeHandler (UNIT TESTING ONLY)."""
841
+
842
+ @pytest.mark.asyncio
843
+ async def test_mock_handler_returns_default(self):
844
+ """MockJudgeHandler should return default assessment."""
845
+ from src.agent_factory.judges import MockJudgeHandler
846
+
847
+ handler = MockJudgeHandler()
848
+
849
+ evidence = [
850
+ Evidence(
851
+ content="Content 1",
852
+ citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
853
+ ),
854
+ Evidence(
855
+ content="Content 2",
856
+ citation=Citation(source="web", title="T2", url="u2", date="2024"),
857
+ ),
858
+ ]
859
+
860
+ result = await handler.assess("test", evidence)
861
+
862
+ assert handler.call_count == 1
863
+ assert handler.last_question == "test"
864
+ assert len(handler.last_evidence) == 2
865
+ assert result.details.mechanism_score == 7
866
+
867
+ @pytest.mark.asyncio
868
+ async def test_mock_handler_custom_response(self):
869
+ """MockJudgeHandler should return custom response when provided."""
870
+ from src.agent_factory.judges import MockJudgeHandler
871
+
872
+ custom_assessment = JudgeAssessment(
873
+ details=AssessmentDetails(
874
+ mechanism_score=10,
875
+ mechanism_reasoning="Custom reasoning",
876
+ clinical_evidence_score=10,
877
+ clinical_reasoning="Custom clinical",
878
+ drug_candidates=["CustomDrug"],
879
+ key_findings=["Custom finding"],
880
+ ),
881
+ sufficient=True,
882
+ confidence=1.0,
883
+ recommendation="synthesize",
884
+ next_search_queries=[],
885
+ reasoning="Custom assessment",
886
+ )
887
+
888
+ handler = MockJudgeHandler(mock_response=custom_assessment)
889
+ result = await handler.assess("test", [])
890
+
891
+ assert result.details.mechanism_score == 10
892
+ assert result.details.drug_candidates == ["CustomDrug"]
893
+
894
+ @pytest.mark.asyncio
895
+ async def test_mock_handler_insufficient_with_few_evidence(self):
896
+ """MockJudgeHandler should recommend continue with < 3 evidence."""
897
+ from src.agent_factory.judges import MockJudgeHandler
898
+
899
+ handler = MockJudgeHandler()
900
+
901
+ # Only 2 pieces of evidence
902
+ evidence = [
903
+ Evidence(
904
+ content="Content",
905
+ citation=Citation(source="pubmed", title="T", url="u", date="2024"),
906
+ ),
907
+ Evidence(
908
+ content="Content 2",
909
+ citation=Citation(source="web", title="T2", url="u2", date="2024"),
910
+ ),
911
+ ]
912
+
913
+ result = await handler.assess("test", evidence)
914
+
915
+ assert result.sufficient is False
916
+ assert result.recommendation == "continue"
917
+ assert len(result.next_search_queries) > 0
918
+ ```
919
+
920
+ ---
921
+
922
+ ## 6. Dependencies
923
+
924
+ Add to `pyproject.toml`:
925
+
926
+ ```toml
927
+ [project]
928
+ dependencies = [
929
+ # ... existing deps ...
930
+ "pydantic-ai>=0.0.16",
931
+ "openai>=1.0.0",
932
+ "anthropic>=0.18.0",
933
+ "huggingface-hub>=0.20.0", # For HFInferenceJudgeHandler (FREE LLM)
934
+ ]
935
+ ```
936
+
937
+ **Note**: `huggingface-hub` is required for the free tier to work. It:
938
+ - Provides `InferenceClient` for API calls
939
+ - Auto-reads `HF_TOKEN` from environment (optional, for gated models)
940
+ - Works without any token for ungated models like `zephyr-7b-beta`
941
+
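+ As a quick sanity check, the snippet below shows roughly how `InferenceClient` can be
+ called without any token on a recent `huggingface-hub` version (the model id and prompt
+ are only illustrative):
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ # Ungated model: no HF_TOKEN required (a gated model like Llama 3.1 would need one)
+ client = InferenceClient(model="HuggingFaceH4/zephyr-7b-beta")
+
+ response = client.chat_completion(
+     messages=[{"role": "user", "content": "Reply with a one-line JSON object."}],
+     max_tokens=64,
+ )
+ print(response.choices[0].message.content)
+ ```
+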
942
+ ---
943
+
944
+ ## 7. Configuration (`src/utils/config.py`)
945
+
946
+ Add LLM configuration:
947
+
948
+ ```python
949
+ """Add to src/utils/config.py."""
950
+ from pydantic_settings import BaseSettings
951
+ from typing import Literal
952
+
953
+
954
+ class Settings(BaseSettings):
955
+ """Application settings."""
956
+
957
+ # LLM Configuration
958
+ llm_provider: Literal["openai", "anthropic"] = "openai"
959
+ openai_model: str = "gpt-4o"
960
+ anthropic_model: str = "claude-3-5-sonnet-20241022"
961
+
962
+ # API Keys (loaded from environment)
963
+ openai_api_key: str | None = None
964
+ anthropic_api_key: str | None = None
965
+ ncbi_api_key: str | None = None
966
+
967
+ class Config:
968
+ env_file = ".env"
969
+ env_file_encoding = "utf-8"
970
+
971
+
972
+ settings = Settings()
973
+ ```
974
+
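+ For orientation, a pydantic-ai style model identifier could be derived from these settings
+ roughly as sketched below; the actual factory in `src/agent_factory/judges.py` may differ:
+
+ ```python
+ from src.utils.config import settings
+
+ def resolve_model_id() -> str:
+     """Map provider settings to a pydantic-ai model string (sketch only)."""
+     if settings.llm_provider == "anthropic":
+         return f"anthropic:{settings.anthropic_model}"
+     return f"openai:{settings.openai_model}"
+
+ # e.g. "openai:gpt-4o" with the defaults above
+ print(resolve_model_id())
+ ```
+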
975
+ ---
976
+
977
+ ## 8. Implementation Checklist
978
+
979
+ - [ ] Add `AssessmentDetails` and `JudgeAssessment` models to `src/utils/models.py`
980
+ - [ ] Create `src/prompts/__init__.py` (empty, for package)
981
+ - [ ] Create `src/prompts/judge.py` with prompt templates
982
+ - [ ] Create `src/agent_factory/__init__.py` with exports
983
+ - [ ] Implement `src/agent_factory/judges.py` with JudgeHandler
984
+ - [ ] Update `src/utils/config.py` with LLM settings
985
+ - [ ] Create `tests/unit/agent_factory/__init__.py`
986
+ - [ ] Write tests in `tests/unit/agent_factory/test_judges.py`
987
+ - [ ] Run `uv run pytest tests/unit/agent_factory/ -v` — **ALL TESTS MUST PASS**
988
+ - [ ] Commit: `git commit -m "feat: phase 3 judge slice complete"`
989
+
990
+ ---
991
+
992
+ ## 9. Definition of Done
993
+
994
+ Phase 3 is **COMPLETE** when:
995
+
996
+ 1. All unit tests pass: `uv run pytest tests/unit/agent_factory/ -v`
997
+ 2. `JudgeHandler` can assess evidence and return structured output
998
+ 3. Graceful degradation: if LLM fails, returns safe fallback
999
+ 4. MockJudgeHandler works for testing without API calls
1000
+ 5. Can run this in Python REPL:
1001
+
1002
+ ```python
1003
+ import asyncio
1004
+ import os
1005
+ from src.utils.models import Evidence, Citation
1006
+ from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
1007
+
1008
+ # Test with mock (no API key needed)
1009
+ async def test_mock():
1010
+ handler = MockJudgeHandler()
1011
+ evidence = [
1012
+ Evidence(
1013
+ content="Metformin shows neuroprotective effects in AD models",
1014
+ citation=Citation(
1015
+ source="pubmed",
1016
+ title="Metformin and Alzheimer's",
1017
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
1018
+ date="2024-01-01",
1019
+ ),
1020
+ ),
1021
+ ]
1022
+ result = await handler.assess("metformin alzheimer", evidence)
1023
+ print(f"Sufficient: {result.sufficient}")
1024
+ print(f"Recommendation: {result.recommendation}")
1025
+ print(f"Drug candidates: {result.details.drug_candidates}")
1026
+
1027
+ asyncio.run(test_mock())
1028
+
1029
+ # Test with real LLM (requires API key)
1030
+ async def test_real():
1031
+ os.environ["OPENAI_API_KEY"] = "your-key-here" # Or set in .env
1032
+ handler = JudgeHandler()
1033
+ evidence = [
1034
+ Evidence(
1035
+ content="Metformin shows neuroprotective effects in AD models...",
1036
+ citation=Citation(
1037
+ source="pubmed",
1038
+ title="Metformin and Alzheimer's",
1039
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
1040
+ date="2024-01-01",
1041
+ ),
1042
+ ),
1043
+ ]
1044
+ result = await handler.assess("metformin alzheimer", evidence)
1045
+ print(f"Sufficient: {result.sufficient}")
1046
+ print(f"Confidence: {result.confidence}")
1047
+ print(f"Reasoning: {result.reasoning}")
1048
+
1049
+ # asyncio.run(test_real()) # Uncomment with valid API key
1050
+ ```
1051
+
1052
+ **Proceed to Phase 4 ONLY after all checkboxes are complete.**
docs/implementation/04_phase_ui.md ADDED
@@ -0,0 +1,1104 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Phase 4 Implementation Spec: Orchestrator & UI
2
+
3
+ **Goal**: Connect the Brain and the Body, then give it a Face.
4
+ **Philosophy**: "Streaming is Trust."
5
+ **Prerequisite**: Phase 3 complete (all judge tests passing)
6
+
7
+ ---
8
+
9
+ ## 1. The Slice Definition
10
+
11
+ This slice connects:
12
+ 1. **Orchestrator**: The state machine (While loop) calling Search -> Judge.
13
+ 2. **UI**: Gradio interface that visualizes the loop.
14
+
15
+ **Files to Create/Modify**:
16
+ - `src/orchestrator.py` - Agent loop logic
17
+ - `src/app.py` - Gradio UI
18
+ - `tests/unit/test_orchestrator.py` - Unit tests
19
+ - `Dockerfile` - Container for deployment
20
+ - `README.md` - Usage instructions (update)
21
+
22
+ ---
23
+
24
+ ## 2. Agent Events (`src/utils/models.py`)
25
+
26
+ Add event types for streaming UI updates:
27
+
28
+ ```python
29
+ """Add to src/utils/models.py (after JudgeAssessment models)."""
30
+ from pydantic import BaseModel, Field
31
+ from typing import Literal, Any
32
+ from datetime import datetime
33
+
34
+
35
+ class AgentEvent(BaseModel):
36
+ """Event emitted by the orchestrator for UI streaming."""
37
+
38
+ type: Literal[
39
+ "started",
40
+ "searching",
41
+ "search_complete",
42
+ "judging",
43
+ "judge_complete",
44
+ "looping",
45
+ "synthesizing",
46
+ "complete",
47
+ "error",
48
+ ]
49
+ message: str
50
+ data: Any = None
51
+ timestamp: datetime = Field(default_factory=datetime.now)
52
+ iteration: int = 0
53
+
54
+ def to_markdown(self) -> str:
55
+ """Format event as markdown for chat display."""
56
+ icons = {
57
+ "started": "🚀",
58
+ "searching": "🔍",
59
+ "search_complete": "📚",
60
+ "judging": "🧠",
61
+ "judge_complete": "✅",
62
+ "looping": "🔄",
63
+ "synthesizing": "📝",
64
+ "complete": "🎉",
65
+ "error": "❌",
66
+ }
67
+ icon = icons.get(self.type, "•")
68
+ return f"{icon} **{self.type.upper()}**: {self.message}"
69
+
70
+
71
+ class OrchestratorConfig(BaseModel):
72
+ """Configuration for the orchestrator."""
73
+
74
+ max_iterations: int = Field(default=5, ge=1, le=10)
75
+ max_results_per_tool: int = Field(default=10, ge=1, le=50)
76
+ search_timeout: float = Field(default=30.0, ge=5.0, le=120.0)
77
+ ```
78
+
79
+ ---
80
+
81
+ ## 3. The Orchestrator (`src/orchestrator.py`)
82
+
83
+ This is the "Agent" logic — the while loop that drives search and judgment.
84
+
85
+ ```python
86
+ """Orchestrator - the agent loop connecting Search and Judge."""
87
+ import asyncio
88
+ from typing import AsyncGenerator, List, Protocol
89
+ import structlog
90
+
91
+ from src.utils.models import (
92
+ Evidence,
93
+ SearchResult,
94
+ JudgeAssessment,
95
+ AgentEvent,
96
+ OrchestratorConfig,
97
+ )
98
+
99
+ logger = structlog.get_logger()
100
+
101
+
102
+ class SearchHandlerProtocol(Protocol):
103
+ """Protocol for search handler."""
104
+ async def execute(self, query: str, max_results_per_tool: int = 10) -> SearchResult:
105
+ ...
106
+
107
+
108
+ class JudgeHandlerProtocol(Protocol):
109
+ """Protocol for judge handler."""
110
+ async def assess(self, question: str, evidence: List[Evidence]) -> JudgeAssessment:
111
+ ...
112
+
113
+
114
+ class Orchestrator:
115
+ """
116
+ The agent orchestrator - runs the Search -> Judge -> Loop cycle.
117
+
118
+ This is a generator-based design that yields events for real-time UI updates.
119
+ """
120
+
121
+ def __init__(
122
+ self,
123
+ search_handler: SearchHandlerProtocol,
124
+ judge_handler: JudgeHandlerProtocol,
125
+ config: OrchestratorConfig | None = None,
126
+ ):
127
+ """
128
+ Initialize the orchestrator.
129
+
130
+ Args:
131
+ search_handler: Handler for executing searches
132
+ judge_handler: Handler for assessing evidence
133
+ config: Optional configuration (uses defaults if not provided)
134
+ """
135
+ self.search = search_handler
136
+ self.judge = judge_handler
137
+ self.config = config or OrchestratorConfig()
138
+ self.history: List[dict] = []
139
+
140
+ async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
141
+ """
142
+ Run the agent loop for a query.
143
+
144
+ Yields AgentEvent objects for each step, allowing real-time UI updates.
145
+
146
+ Args:
147
+ query: The user's research question
148
+
149
+ Yields:
150
+ AgentEvent objects for each step of the process
151
+ """
152
+ logger.info("Starting orchestrator", query=query)
153
+
154
+ yield AgentEvent(
155
+ type="started",
156
+ message=f"Starting research for: {query}",
157
+ iteration=0,
158
+ )
159
+
160
+ all_evidence: List[Evidence] = []
161
+ current_queries = [query]
162
+ iteration = 0
163
+
164
+ while iteration < self.config.max_iterations:
165
+ iteration += 1
166
+ logger.info("Iteration", iteration=iteration, queries=current_queries)
167
+
168
+ # === SEARCH PHASE ===
169
+ yield AgentEvent(
170
+ type="searching",
171
+ message=f"Searching for: {', '.join(current_queries[:3])}...",
172
+ iteration=iteration,
173
+ )
174
+
175
+ try:
176
+ # Execute searches for all current queries
177
+ search_tasks = [
178
+ self.search.execute(q, self.config.max_results_per_tool)
179
+ for q in current_queries[:3] # Limit to 3 queries per iteration
180
+ ]
181
+ search_results = await asyncio.gather(*search_tasks, return_exceptions=True)
182
+
183
+ # Collect evidence from successful searches
184
+ new_evidence: List[Evidence] = []
185
+ errors: List[str] = []
186
+
187
+ for q, result in zip(current_queries[:3], search_results):
188
+ if isinstance(result, Exception):
189
+ errors.append(f"Search for '{q}' failed: {str(result)}")
190
+ else:
191
+ new_evidence.extend(result.evidence)
192
+ errors.extend(result.errors)
193
+
194
+ # Deduplicate evidence by URL
195
+ seen_urls = {e.citation.url for e in all_evidence}
196
+ unique_new = [e for e in new_evidence if e.citation.url not in seen_urls]
197
+ all_evidence.extend(unique_new)
198
+
199
+ yield AgentEvent(
200
+ type="search_complete",
201
+ message=f"Found {len(unique_new)} new sources ({len(all_evidence)} total)",
202
+ data={"new_count": len(unique_new), "total_count": len(all_evidence)},
203
+ iteration=iteration,
204
+ )
205
+
206
+ if errors:
207
+ logger.warning("Search errors", errors=errors)
208
+
209
+ except Exception as e:
210
+ logger.error("Search phase failed", error=str(e))
211
+ yield AgentEvent(
212
+ type="error",
213
+ message=f"Search failed: {str(e)}",
214
+ iteration=iteration,
215
+ )
216
+ continue
217
+
218
+ # === JUDGE PHASE ===
219
+ yield AgentEvent(
220
+ type="judging",
221
+ message=f"Evaluating {len(all_evidence)} sources...",
222
+ iteration=iteration,
223
+ )
224
+
225
+ try:
226
+ assessment = await self.judge.assess(query, all_evidence)
227
+
228
+ yield AgentEvent(
229
+ type="judge_complete",
230
+ message=f"Assessment: {assessment.recommendation} (confidence: {assessment.confidence:.0%})",
231
+ data={
232
+ "sufficient": assessment.sufficient,
233
+ "confidence": assessment.confidence,
234
+ "mechanism_score": assessment.details.mechanism_score,
235
+ "clinical_score": assessment.details.clinical_evidence_score,
236
+ },
237
+ iteration=iteration,
238
+ )
239
+
240
+ # Record this iteration in history
241
+ self.history.append({
242
+ "iteration": iteration,
243
+ "queries": current_queries,
244
+ "evidence_count": len(all_evidence),
245
+ "assessment": assessment.model_dump(),
246
+ })
247
+
248
+ # === DECISION PHASE ===
249
+ if assessment.sufficient and assessment.recommendation == "synthesize":
250
+ yield AgentEvent(
251
+ type="synthesizing",
252
+ message="Evidence sufficient! Preparing synthesis...",
253
+ iteration=iteration,
254
+ )
255
+
256
+ # Generate final response
257
+ final_response = self._generate_synthesis(query, all_evidence, assessment)
258
+
259
+ yield AgentEvent(
260
+ type="complete",
261
+ message=final_response,
262
+ data={
263
+ "evidence_count": len(all_evidence),
264
+ "iterations": iteration,
265
+ "drug_candidates": assessment.details.drug_candidates,
266
+ "key_findings": assessment.details.key_findings,
267
+ },
268
+ iteration=iteration,
269
+ )
270
+ return
271
+
272
+ else:
273
+ # Need more evidence - prepare next queries
274
+ current_queries = assessment.next_search_queries or [
275
+ f"{query} mechanism of action",
276
+ f"{query} clinical evidence",
277
+ ]
278
+
279
+ yield AgentEvent(
280
+ type="looping",
281
+ message=f"Need more evidence. Next searches: {', '.join(current_queries[:2])}...",
282
+ data={"next_queries": current_queries},
283
+ iteration=iteration,
284
+ )
285
+
286
+ except Exception as e:
287
+ logger.error("Judge phase failed", error=str(e))
288
+ yield AgentEvent(
289
+ type="error",
290
+ message=f"Assessment failed: {str(e)}",
291
+ iteration=iteration,
292
+ )
293
+ continue
294
+
295
+ # Max iterations reached
296
+ yield AgentEvent(
297
+ type="complete",
298
+ message=self._generate_partial_synthesis(query, all_evidence),
299
+ data={
300
+ "evidence_count": len(all_evidence),
301
+ "iterations": iteration,
302
+ "max_reached": True,
303
+ },
304
+ iteration=iteration,
305
+ )
306
+
307
+ def _generate_synthesis(
308
+ self,
309
+ query: str,
310
+ evidence: List[Evidence],
311
+ assessment: JudgeAssessment,
312
+ ) -> str:
313
+ """
314
+ Generate the final synthesis response.
315
+
316
+ Args:
317
+ query: The original question
318
+ evidence: All collected evidence
319
+ assessment: The final assessment
320
+
321
+ Returns:
322
+ Formatted synthesis as markdown
323
+ """
324
+ drug_list = "\n".join([f"- **{d}**" for d in assessment.details.drug_candidates]) or "- No specific candidates identified"
325
+ findings_list = "\n".join([f"- {f}" for f in assessment.details.key_findings]) or "- See evidence below"
326
+
327
+ citations = "\n".join([
328
+ f"{i+1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()}, {e.citation.date})"
329
+ for i, e in enumerate(evidence[:10]) # Limit to 10 citations
330
+ ])
331
+
332
+ return f"""## Drug Repurposing Analysis
333
+
334
+ ### Question
335
+ {query}
336
+
337
+ ### Drug Candidates
338
+ {drug_list}
339
+
340
+ ### Key Findings
341
+ {findings_list}
342
+
343
+ ### Assessment
344
+ - **Mechanism Score**: {assessment.details.mechanism_score}/10
345
+ - **Clinical Evidence Score**: {assessment.details.clinical_evidence_score}/10
346
+ - **Confidence**: {assessment.confidence:.0%}
347
+
348
+ ### Reasoning
349
+ {assessment.reasoning}
350
+
351
+ ### Citations ({len(evidence)} sources)
352
+ {citations}
353
+
354
+ ---
355
+ *Analysis based on {len(evidence)} sources across {len(self.history)} iterations.*
356
+ """
357
+
358
+ def _generate_partial_synthesis(
359
+ self,
360
+ query: str,
361
+ evidence: List[Evidence],
362
+ ) -> str:
363
+ """
364
+ Generate a partial synthesis when max iterations reached.
365
+
366
+ Args:
367
+ query: The original question
368
+ evidence: All collected evidence
369
+
370
+ Returns:
371
+ Formatted partial synthesis as markdown
372
+ """
373
+ citations = "\n".join([
374
+ f"{i+1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})"
375
+ for i, e in enumerate(evidence[:10])
376
+ ])
377
+
378
+ return f"""## Partial Analysis (Max Iterations Reached)
379
+
380
+ ### Question
381
+ {query}
382
+
383
+ ### Status
384
+ Maximum search iterations reached. The evidence gathered may be incomplete.
385
+
386
+ ### Evidence Collected
387
+ Found {len(evidence)} sources. Consider refining your query for more specific results.
388
+
389
+ ### Citations
390
+ {citations}
391
+
392
+ ---
393
+ *Consider searching with more specific terms or drug names.*
394
+ """
395
+ ```
396
+
397
+ ---
398
+
399
+ ## 4. The Gradio UI (`src/app.py`)
400
+
401
+ Using Gradio 5 generator pattern for real-time streaming.
402
+
403
+ ```python
404
+ """Gradio UI for DeepCritical agent."""
405
+ import asyncio
406
+ import gradio as gr
407
+ from typing import AsyncGenerator
408
+
409
+ from src.orchestrator import Orchestrator
410
+ from src.tools.pubmed import PubMedTool
411
+ from src.tools.clinicaltrials import ClinicalTrialsTool
412
+ from src.tools.biorxiv import BioRxivTool
413
+ from src.tools.search_handler import SearchHandler
414
+ from src.agent_factory.judges import JudgeHandler, HFInferenceJudgeHandler
415
+ from src.utils.models import OrchestratorConfig, AgentEvent
416
+
417
+
418
+ def create_orchestrator(
419
+ user_api_key: str | None = None,
420
+ api_provider: str = "openai",
421
+ ) -> tuple[Orchestrator, str]:
422
+ """
423
+ Create an orchestrator instance.
424
+
425
+ Args:
426
+ user_api_key: Optional user-provided API key (BYOK)
427
+ api_provider: API provider ("openai" or "anthropic")
428
+
429
+ Returns:
430
+ Tuple of (Configured Orchestrator instance, backend_name)
431
+
432
+ Priority:
433
+ 1. User-provided API key → JudgeHandler (OpenAI/Anthropic)
434
+ 2. Environment API key → JudgeHandler (OpenAI/Anthropic)
435
+ 3. No key → HFInferenceJudgeHandler (FREE, automatic fallback chain)
436
+
437
+ HF Inference Fallback Chain:
438
+ 1. Llama 3.1 8B (requires HF_TOKEN for gated model)
439
+ 2. Mistral 7B (may require token)
440
+ 3. Zephyr 7B (ungated, always works)
441
+ """
442
+ import os
443
+
444
+ # Create search tools
445
+ search_handler = SearchHandler(
446
+ tools=[PubMedTool(), ClinicalTrialsTool(), BioRxivTool()],
447
+ timeout=30.0,
448
+ )
449
+
450
+ # Determine which judge to use
451
+ has_env_key = bool(os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY"))
452
+ has_user_key = bool(user_api_key)
453
+ has_hf_token = bool(os.getenv("HF_TOKEN"))
454
+
455
+ if has_user_key:
456
+ # User provided their own key
457
+ judge_handler = JudgeHandler(model=None)
458
+ backend_name = f"your {api_provider.upper()} API key"
459
+ elif has_env_key:
460
+ # Environment has API key configured
461
+ judge_handler = JudgeHandler(model=None)
462
+ backend_name = "configured API key"
463
+ else:
464
+ # Use FREE HuggingFace Inference with automatic fallback
465
+ judge_handler = HFInferenceJudgeHandler()
466
+ if has_hf_token:
467
+ backend_name = "HuggingFace Inference (Llama 3.1)"
468
+ else:
469
+ backend_name = "HuggingFace Inference (free tier)"
470
+
471
+ # Create orchestrator
472
+ config = OrchestratorConfig(
473
+ max_iterations=5,
474
+ max_results_per_tool=10,
475
+ )
476
+
477
+ return Orchestrator(
478
+ search_handler=search_handler,
479
+ judge_handler=judge_handler,
480
+ config=config,
481
+ ), backend_name
482
+
483
+
484
+ async def research_agent(
485
+ message: str,
+ history: list[dict],
+ mode: str = "simple",
+ api_key: str = "",
+ api_provider: str = "openai",
489
+ ) -> AsyncGenerator[str, None]:
490
+ """
491
+ Gradio chat function that runs the research agent.
492
+
493
+ Args:
494
+ message: User's research question
495
+ history: Chat history (Gradio format)
+ mode: Orchestrator mode ("simple" linear loop; "magentic" is wired up in Phase 5)
+ api_key: Optional user-provided API key (BYOK)
+ api_provider: API provider ("openai" or "anthropic")
498
+
499
+ Yields:
500
+ Markdown-formatted responses for streaming
501
+ """
502
+ if not message.strip():
503
+ yield "Please enter a research question."
504
+ return
505
+
506
+ import os
507
+
508
+ # Clean user-provided API key
509
+ user_api_key = api_key.strip() if api_key else None
510
+
511
+ # Create orchestrator with appropriate judge
512
+ orchestrator, backend_name = create_orchestrator(
513
+ user_api_key=user_api_key,
514
+ api_provider=api_provider,
515
+ )
516
+
517
+ # Determine icon based on backend
518
+ has_hf_token = bool(os.getenv("HF_TOKEN"))
519
+ if "HuggingFace" in backend_name:
520
+ icon = "🤗"
521
+ extra_note = (
522
+ "\n*For premium analysis, enter an OpenAI or Anthropic API key.*"
523
+ if not has_hf_token else ""
524
+ )
525
+ else:
526
+ icon = "🔑"
527
+ extra_note = ""
528
+
529
+ # Inform user which backend is being used
530
+ yield f"{icon} **Using {backend_name}**{extra_note}\n\n"
531
+
532
+ # Run the agent and stream events
533
+ response_parts = []
534
+
535
+ try:
536
+ async for event in orchestrator.run(message):
537
+ # Format event as markdown
538
+ event_md = event.to_markdown()
539
+ response_parts.append(event_md)
540
+
541
+ # If complete, show full response
542
+ if event.type == "complete":
543
+ yield event.message
544
+ else:
545
+ # Show progress
546
+ yield "\n\n".join(response_parts)
547
+
548
+ except Exception as e:
549
+ yield f"❌ **Error**: {str(e)}"
550
+
551
+
552
+ def create_demo() -> gr.Blocks:
553
+ """
554
+ Create the Gradio demo interface.
555
+
556
+ Returns:
557
+ Configured Gradio Blocks interface
558
+ """
559
+ with gr.Blocks(
560
+ title="DeepCritical - Drug Repurposing Research Agent",
561
+ theme=gr.themes.Soft(),
562
+ ) as demo:
563
+ gr.Markdown("""
564
+ # 🧬 DeepCritical
565
+ ## AI-Powered Drug Repurposing Research Agent
566
+
567
+ Ask questions about potential drug repurposing opportunities.
568
+ The agent will search PubMed and the web, evaluate evidence, and provide recommendations.
569
+
570
+ **Example questions:**
571
+ - "What drugs could be repurposed for Alzheimer's disease?"
572
+ - "Is metformin effective for cancer treatment?"
573
+ - "What existing medications show promise for Long COVID?"
574
+ """)
575
+
576
+ # Note: additional_inputs render in an accordion below the chat input
577
+ gr.ChatInterface(
578
+ fn=research_agent,
579
+ examples=[
580
+ [
581
+ "What drugs could be repurposed for Alzheimer's disease?",
582
+ "simple",
583
+ "",
584
+ "openai",
585
+ ],
586
+ [
587
+ "Is metformin effective for treating cancer?",
588
+ "simple",
589
+ "",
590
+ "openai",
591
+ ],
592
+ ],
593
+ additional_inputs=[
594
+ gr.Radio(
595
+ choices=["simple", "magentic"],
596
+ value="simple",
597
+ label="Orchestrator Mode",
598
+ info="Simple: Linear | Magentic: Multi-Agent (OpenAI)",
599
+ ),
600
+ gr.Textbox(
601
+ label="API Key (Optional - Bring Your Own Key)",
602
+ placeholder="sk-... or sk-ant-...",
603
+ type="password",
604
+ info="Enter your own API key for full AI analysis. Never stored.",
605
+ ),
606
+ gr.Radio(
607
+ choices=["openai", "anthropic"],
608
+ value="openai",
609
+ label="API Provider",
610
+ info="Select the provider for your API key",
611
+ ),
612
+ ],
613
+ )
614
+
615
+ gr.Markdown("""
616
+ ---
617
+ **Note**: This is a research tool and should not be used for medical decisions.
618
+ Always consult healthcare professionals for medical advice.
619
+
620
+ Built with 🤖 PydanticAI + 🔬 PubMed + 🦆 DuckDuckGo
621
+ """)
622
+
623
+ return demo
624
+
625
+
626
+ def main():
627
+ """Run the Gradio app."""
628
+ demo = create_demo()
629
+ demo.launch(
630
+ server_name="0.0.0.0",
631
+ server_port=7860,
632
+ share=False,
633
+ )
634
+
635
+
636
+ if __name__ == "__main__":
637
+ main()
638
+ ```
639
+
640
+ ---
641
+
642
+ ## 5. TDD Workflow
643
+
644
+ ### Test File: `tests/unit/test_orchestrator.py`
645
+
646
+ ```python
647
+ """Unit tests for Orchestrator."""
648
+ import pytest
649
+ from unittest.mock import AsyncMock, MagicMock
650
+
651
+ from src.utils.models import (
652
+ Evidence,
653
+ Citation,
654
+ SearchResult,
655
+ JudgeAssessment,
656
+ AssessmentDetails,
657
+ OrchestratorConfig,
658
+ )
659
+
660
+
661
+ class TestOrchestrator:
662
+ """Tests for Orchestrator."""
663
+
664
+ @pytest.fixture
665
+ def mock_search_handler(self):
666
+ """Create a mock search handler."""
667
+ handler = AsyncMock()
668
+ handler.execute = AsyncMock(return_value=SearchResult(
669
+ query="test",
670
+ evidence=[
671
+ Evidence(
672
+ content="Test content",
673
+ citation=Citation(
674
+ source="pubmed",
675
+ title="Test Title",
676
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
677
+ date="2024-01-01",
678
+ ),
679
+ ),
680
+ ],
681
+ sources_searched=["pubmed"],
682
+ total_found=1,
683
+ errors=[],
684
+ ))
685
+ return handler
686
+
687
+ @pytest.fixture
688
+ def mock_judge_sufficient(self):
689
+ """Create a mock judge that returns sufficient."""
690
+ handler = AsyncMock()
691
+ handler.assess = AsyncMock(return_value=JudgeAssessment(
692
+ details=AssessmentDetails(
693
+ mechanism_score=8,
694
+ mechanism_reasoning="Good mechanism",
695
+ clinical_evidence_score=7,
696
+ clinical_reasoning="Good clinical",
697
+ drug_candidates=["Drug A"],
698
+ key_findings=["Finding 1"],
699
+ ),
700
+ sufficient=True,
701
+ confidence=0.85,
702
+ recommendation="synthesize",
703
+ next_search_queries=[],
704
+ reasoning="Evidence is sufficient",
705
+ ))
706
+ return handler
707
+
708
+ @pytest.fixture
709
+ def mock_judge_insufficient(self):
710
+ """Create a mock judge that returns insufficient."""
711
+ handler = AsyncMock()
712
+ handler.assess = AsyncMock(return_value=JudgeAssessment(
713
+ details=AssessmentDetails(
714
+ mechanism_score=4,
715
+ mechanism_reasoning="Weak mechanism",
716
+ clinical_evidence_score=3,
717
+ clinical_reasoning="Weak clinical",
718
+ drug_candidates=[],
719
+ key_findings=[],
720
+ ),
721
+ sufficient=False,
722
+ confidence=0.3,
723
+ recommendation="continue",
724
+ next_search_queries=["more specific query"],
725
+ reasoning="Need more evidence",
726
+ ))
727
+ return handler
728
+
729
+ @pytest.mark.asyncio
730
+ async def test_orchestrator_completes_with_sufficient_evidence(
731
+ self,
732
+ mock_search_handler,
733
+ mock_judge_sufficient,
734
+ ):
735
+ """Orchestrator should complete when evidence is sufficient."""
736
+ from src.orchestrator import Orchestrator
737
+
738
+ config = OrchestratorConfig(max_iterations=5)
739
+ orchestrator = Orchestrator(
740
+ search_handler=mock_search_handler,
741
+ judge_handler=mock_judge_sufficient,
742
+ config=config,
743
+ )
744
+
745
+ events = []
746
+ async for event in orchestrator.run("test query"):
747
+ events.append(event)
748
+
749
+ # Should have started, searched, judged, and completed
750
+ event_types = [e.type for e in events]
751
+ assert "started" in event_types
752
+ assert "searching" in event_types
753
+ assert "search_complete" in event_types
754
+ assert "judging" in event_types
755
+ assert "judge_complete" in event_types
756
+ assert "complete" in event_types
757
+
758
+ # Should only have 1 iteration
759
+ complete_event = [e for e in events if e.type == "complete"][0]
760
+ assert complete_event.iteration == 1
761
+
762
+ @pytest.mark.asyncio
763
+ async def test_orchestrator_loops_when_insufficient(
764
+ self,
765
+ mock_search_handler,
766
+ mock_judge_insufficient,
767
+ ):
768
+ """Orchestrator should loop when evidence is insufficient."""
769
+ from src.orchestrator import Orchestrator
770
+
771
+ config = OrchestratorConfig(max_iterations=3)
772
+ orchestrator = Orchestrator(
773
+ search_handler=mock_search_handler,
774
+ judge_handler=mock_judge_insufficient,
775
+ config=config,
776
+ )
777
+
778
+ events = []
779
+ async for event in orchestrator.run("test query"):
780
+ events.append(event)
781
+
782
+ # Should have looping events
783
+ event_types = [e.type for e in events]
784
+ assert event_types.count("looping") >= 2 # At least 2 loop events
785
+
786
+ # Should hit max iterations
787
+ complete_event = [e for e in events if e.type == "complete"][0]
788
+ assert complete_event.data.get("max_reached") is True
789
+
790
+ @pytest.mark.asyncio
791
+ async def test_orchestrator_respects_max_iterations(
792
+ self,
793
+ mock_search_handler,
794
+ mock_judge_insufficient,
795
+ ):
796
+ """Orchestrator should stop at max_iterations."""
797
+ from src.orchestrator import Orchestrator
798
+
799
+ config = OrchestratorConfig(max_iterations=2)
800
+ orchestrator = Orchestrator(
801
+ search_handler=mock_search_handler,
802
+ judge_handler=mock_judge_insufficient,
803
+ config=config,
804
+ )
805
+
806
+ events = []
807
+ async for event in orchestrator.run("test query"):
808
+ events.append(event)
809
+
810
+ # Should have exactly 2 iterations
811
+ max_iteration = max(e.iteration for e in events)
812
+ assert max_iteration == 2
813
+
814
+ @pytest.mark.asyncio
815
+ async def test_orchestrator_handles_search_error(self):
816
+ """Orchestrator should handle search errors gracefully."""
817
+ from src.orchestrator import Orchestrator
818
+
819
+ mock_search = AsyncMock()
820
+ mock_search.execute = AsyncMock(side_effect=Exception("Search failed"))
821
+
822
+ mock_judge = AsyncMock()
823
+ mock_judge.assess = AsyncMock(return_value=JudgeAssessment(
824
+ details=AssessmentDetails(
825
+ mechanism_score=0,
826
+ mechanism_reasoning="N/A",
827
+ clinical_evidence_score=0,
828
+ clinical_reasoning="N/A",
829
+ drug_candidates=[],
830
+ key_findings=[],
831
+ ),
832
+ sufficient=False,
833
+ confidence=0.0,
834
+ recommendation="continue",
835
+ next_search_queries=["retry query"],
836
+ reasoning="Search failed",
837
+ ))
838
+
839
+ config = OrchestratorConfig(max_iterations=2)
840
+ orchestrator = Orchestrator(
841
+ search_handler=mock_search,
842
+ judge_handler=mock_judge,
843
+ config=config,
844
+ )
845
+
846
+ events = []
847
+ async for event in orchestrator.run("test query"):
848
+ events.append(event)
849
+
850
+ # Should have error events
851
+ event_types = [e.type for e in events]
852
+ assert "error" in event_types
853
+
854
+ @pytest.mark.asyncio
855
+ async def test_orchestrator_deduplicates_evidence(self, mock_judge_insufficient):
856
+ """Orchestrator should deduplicate evidence by URL."""
857
+ from src.orchestrator import Orchestrator
858
+
859
+ # Search returns same evidence each time
860
+ duplicate_evidence = Evidence(
861
+ content="Duplicate content",
862
+ citation=Citation(
863
+ source="pubmed",
864
+ title="Same Title",
865
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/", # Same URL
866
+ date="2024-01-01",
867
+ ),
868
+ )
869
+
870
+ mock_search = AsyncMock()
871
+ mock_search.execute = AsyncMock(return_value=SearchResult(
872
+ query="test",
873
+ evidence=[duplicate_evidence],
874
+ sources_searched=["pubmed"],
875
+ total_found=1,
876
+ errors=[],
877
+ ))
878
+
879
+ config = OrchestratorConfig(max_iterations=2)
880
+ orchestrator = Orchestrator(
881
+ search_handler=mock_search,
882
+ judge_handler=mock_judge_insufficient,
883
+ config=config,
884
+ )
885
+
886
+ events = []
887
+ async for event in orchestrator.run("test query"):
888
+ events.append(event)
889
+
890
+ # Second search_complete should show 0 new evidence
891
+ search_complete_events = [e for e in events if e.type == "search_complete"]
892
+ assert len(search_complete_events) == 2
893
+
894
+ # First iteration should have 1 new
895
+ assert search_complete_events[0].data["new_count"] == 1
896
+
897
+ # Second iteration should have 0 new (duplicate)
898
+ assert search_complete_events[1].data["new_count"] == 0
899
+
900
+
901
+ class TestAgentEvent:
902
+ """Tests for AgentEvent."""
903
+
904
+ def test_to_markdown(self):
905
+ """AgentEvent should format to markdown correctly."""
906
+ from src.utils.models import AgentEvent
907
+
908
+ event = AgentEvent(
909
+ type="searching",
910
+ message="Searching for: metformin alzheimer",
911
+ iteration=1,
912
+ )
913
+
914
+ md = event.to_markdown()
915
+ assert "🔍" in md
916
+ assert "SEARCHING" in md
917
+ assert "metformin alzheimer" in md
918
+
919
+ def test_complete_event_icon(self):
920
+ """Complete event should have celebration icon."""
921
+ from src.utils.models import AgentEvent
922
+
923
+ event = AgentEvent(
924
+ type="complete",
925
+ message="Done!",
926
+ iteration=3,
927
+ )
928
+
929
+ md = event.to_markdown()
930
+ assert "🎉" in md
931
+ ```
932
+
933
+ ---
934
+
935
+ ## 6. Dockerfile
936
+
937
+ ```dockerfile
938
+ # Dockerfile for DeepCritical
939
+ FROM python:3.11-slim
940
+
941
+ # Set working directory
942
+ WORKDIR /app
943
+
944
+ # Install system dependencies
945
+ RUN apt-get update && apt-get install -y \
946
+ git \
947
+ && rm -rf /var/lib/apt/lists/*
948
+
949
+ # Install uv
950
+ RUN pip install uv
951
+
952
+ # Copy project files
953
+ COPY pyproject.toml .
954
+ COPY src/ src/
955
+
956
+ # Install dependencies
957
+ RUN uv pip install --system .
958
+
959
+ # Expose port
960
+ EXPOSE 7860
961
+
962
+ # Set environment variables
963
+ ENV GRADIO_SERVER_NAME=0.0.0.0
964
+ ENV GRADIO_SERVER_PORT=7860
965
+
966
+ # Run the app
967
+ CMD ["python", "-m", "src.app"]
968
+ ```
969
+
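+ To sanity-check the image locally (image tag and environment variables are illustrative):
+
+ ```bash
+ # Build the image from the repo root
+ docker build -t deepcritical .
+
+ # Run it; pass an API key only if you want the premium judge instead of the free HF tier
+ docker run --rm -p 7860:7860 -e OPENAI_API_KEY=sk-... deepcritical
+
+ # UI is then available at http://localhost:7860
+ ```
+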
970
+ ---
971
+
972
+ ## 7. HuggingFace Spaces Configuration
973
+
974
+ Create `README.md` header for HuggingFace Spaces:
975
+
976
+ ```markdown
977
+ ---
978
+ title: DeepCritical
979
+ emoji: 🧬
980
+ colorFrom: blue
981
+ colorTo: purple
982
+ sdk: gradio
983
+ sdk_version: 5.0.0
984
+ app_file: src/app.py
985
+ pinned: false
986
+ license: mit
987
+ ---
988
+
989
+ # DeepCritical
990
+
991
+ AI-Powered Drug Repurposing Research Agent
992
+ ```
993
+
994
+ ---
995
+
996
+ ## 8. Implementation Checklist
997
+
998
+ - [ ] Add `AgentEvent` and `OrchestratorConfig` models to `src/utils/models.py`
999
+ - [ ] Implement `src/orchestrator.py` with full Orchestrator class
1000
+ - [ ] Implement `src/app.py` with Gradio interface
1001
+ - [ ] Create `tests/unit/test_orchestrator.py` with all tests
1002
+ - [ ] Create `Dockerfile` for deployment
1003
+ - [ ] Update project `README.md` with usage instructions
1004
+ - [ ] Run `uv run pytest tests/unit/test_orchestrator.py -v` — **ALL TESTS MUST PASS**
1005
+ - [ ] Test locally: `uv run python -m src.app`
1006
+ - [ ] Commit: `git commit -m "feat: phase 4 orchestrator and UI complete"`
1007
+
1008
+ ---
1009
+
1010
+ ## 9. Definition of Done
1011
+
1012
+ Phase 4 is **COMPLETE** when:
1013
+
1014
+ 1. All unit tests pass: `uv run pytest tests/unit/test_orchestrator.py -v`
1015
+ 2. Orchestrator correctly loops Search -> Judge until sufficient
1016
+ 3. Max iterations limit is enforced
1017
+ 4. Graceful error handling throughout
1018
+ 5. Gradio UI streams events in real-time
1019
+ 6. Can run locally:
1020
+
1021
+ ```bash
1022
+ # Start the UI
1023
+ uv run python -m src.app
1024
+
1025
+ # Open browser to http://localhost:7860
1026
+ # Enter a question like "What drugs could be repurposed for Alzheimer's disease?"
1027
+ # Watch the agent search, evaluate, and respond
1028
+ ```
1029
+
1030
+ 7. Can run the full flow in Python:
1031
+
1032
+ ```python
1033
+ import asyncio
1034
+ from src.orchestrator import Orchestrator
1035
+ from src.tools.pubmed import PubMedTool
1036
+ from src.tools.biorxiv import BioRxivTool
1037
+ from src.tools.clinicaltrials import ClinicalTrialsTool
1038
+ from src.tools.search_handler import SearchHandler
1039
+ from src.agent_factory.judges import HFInferenceJudgeHandler, MockJudgeHandler
1040
+ from src.utils.models import OrchestratorConfig
1041
+
1042
+ async def test_full_flow():
1043
+ # Create components
1044
+ search_handler = SearchHandler([PubMedTool(), ClinicalTrialsTool(), BioRxivTool()])
1045
+
1046
+ # Option 1: Use FREE HuggingFace Inference (real AI analysis)
1047
+ judge_handler = HFInferenceJudgeHandler()
1048
+
1049
+ # Option 2: Use MockJudgeHandler for UNIT TESTING ONLY
1050
+ # judge_handler = MockJudgeHandler()
1051
+
1052
+ config = OrchestratorConfig(max_iterations=3)
1053
+
1054
+ # Create orchestrator
1055
+ orchestrator = Orchestrator(
1056
+ search_handler=search_handler,
1057
+ judge_handler=judge_handler,
1058
+ config=config,
1059
+ )
1060
+
1061
+ # Run and collect events
1062
+ print("Starting agent...")
1063
+ async for event in orchestrator.run("metformin alzheimer"):
1064
+ print(event.to_markdown())
1065
+
1066
+ print("\nDone!")
1067
+
1068
+ asyncio.run(test_full_flow())
1069
+ ```
1070
+
1071
+ **Important**: `MockJudgeHandler` is for **unit testing only**. For actual demo/production use, always use `HFInferenceJudgeHandler` (free) or `JudgeHandler` (with API key).
1072
+
1073
+ ---
1074
+
1075
+ ## 10. Deployment Verification
1076
+
1077
+ After deployment to HuggingFace Spaces:
1078
+
1079
+ 1. **Visit the Space URL** and verify the UI loads
1080
+ 2. **Test with example queries**:
1081
+ - "What drugs could be repurposed for Alzheimer's disease?"
1082
+ - "Is metformin effective for cancer treatment?"
1083
+ 3. **Verify streaming** - events should appear in real-time (a scripted smoke test is sketched after this list)
1084
+ 4. **Check error handling** - try an empty query, verify graceful handling
1085
+ 5. **Monitor logs** for any errors
1086
+
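+ A scripted smoke test is sketched below. The Space id, `api_name`, and argument order are
+ assumptions; check the Space's "Use via API" page for the exact signature before relying on it.
+
+ ```python
+ from gradio_client import Client
+
+ # Hypothetical Space id; replace with the real one after deployment
+ client = Client("your-username/DeepCritical")
+
+ result = client.predict(
+     "Is metformin effective for cancer treatment?",  # message
+     "simple",   # orchestrator mode (additional input)
+     "",         # API key (empty -> free HF tier)
+     "openai",   # API provider
+     api_name="/chat",
+ )
+ print(result)
+ ```
+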
1087
+ ---
1088
+
1089
+ ## Project Complete! 🎉
1090
+
1091
+ When Phase 4 is done, the DeepCritical MVP is complete:
1092
+
1093
+ - **Phase 1**: Foundation (uv, pytest, config) ✅
1094
+ - **Phase 2**: Search Slice (PubMed, DuckDuckGo) ✅
1095
+ - **Phase 3**: Judge Slice (PydanticAI, structured output) ✅
1096
+ - **Phase 4**: Orchestrator + UI (Gradio, streaming) ✅
1097
+
1098
+ The agent can:
1099
+ 1. Accept a drug repurposing question
1100
+ 2. Search PubMed and the web for evidence
1101
+ 3. Evaluate evidence quality with an LLM
1102
+ 4. Loop until confident or max iterations
1103
+ 5. Synthesize a research-backed recommendation
1104
+ 6. Display real-time progress in a beautiful UI
docs/implementation/05_phase_magentic.md ADDED
@@ -0,0 +1,1091 @@
1
+ # Phase 5 Implementation Spec: Magentic Integration
2
+
3
+ **Goal**: Upgrade orchestrator to use Microsoft Agent Framework's Magentic-One pattern.
4
+ **Philosophy**: "Same API, Better Engine."
5
+ **Prerequisite**: Phase 4 complete (MVP working end-to-end)
6
+
7
+ ---
8
+
9
+ ## 1. Why Magentic?
10
+
11
+ Magentic-One provides:
12
+ - **LLM-powered manager** that dynamically plans, selects agents, tracks progress
13
+ - **Built-in stall detection** and automatic replanning
14
+ - **Checkpointing** for pause/resume workflows
15
+ - **Event streaming** for real-time UI updates
16
+ - **Multi-agent coordination** with round limits and reset logic
17
+
18
+ ---
19
+
20
+ ## 2. Critical Architecture Understanding
21
+
22
+ ### 2.1 How Magentic Actually Works
23
+
24
+ ```
25
+ ┌─────────────────────────────────────────────────────────────────────────┐
26
+ │ MagenticBuilder Workflow │
27
+ ├─────────────────────────────────────────────────────────────────────────┤
28
+ │ │
29
+ │ User Task: "Research drug repurposing for metformin alzheimer" │
30
+ │ ↓ │
31
+ │ ┌──────────────────────────────────────────────────────────────────┐ │
32
+ │ │ StandardMagenticManager │ │
33
+ │ │ │ │
34
+ │ │ 1. plan() → LLM generates facts & plan │ │
35
+ │ │ 2. create_progress_ledger() → LLM decides: │ │
36
+ │ │ - is_request_satisfied? │ │
37
+ │ │ - next_speaker: "searcher" │ │
38
+ │ │ - instruction_or_question: "Search for clinical trials..." │ │
39
+ │ │ │ │
40
+ │ └──────────────────────────────────────────────────────────────────┘ │
41
+ │ ↓ │
42
+ │ NATURAL LANGUAGE INSTRUCTION sent to agent │
43
+ │ "Search for clinical trials about metformin..." │
44
+ │ ↓ │
45
+ │ ┌──────────────────────────────────────────────────────────────────┐ │
46
+ │ │ ChatAgent (searcher) │ │
47
+ │ │ │ │
48
+ │ │ chat_client (INTERNAL LLM) ← understands instruction │ │
49
+ │ │ ↓ │ │
50
+ │ │ "I'll search for metformin alzheimer clinical trials" │ │
51
+ │ │ ↓ │ │
52
+ │ │ tools=[search_pubmed, search_clinicaltrials] ← calls tools │ │
53
+ │ │ ↓ │ │
54
+ │ │ Returns natural language response to manager │ │
55
+ │ │ │ │
56
+ │ └──────────────────────────────────────────────────────────────────┘ │
57
+ │ ↓ │
58
+ │ Manager evaluates response │
59
+ │ Decides next agent or completion │
60
+ │ │
61
+ └─────────────────────────────────────────────────────────────────────────┘
62
+ ```
63
+
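+ To make the diagram concrete, a wiring sketch is shown below. The method names on
+ `MagenticBuilder` and the standard manager are assumptions based on the flow above;
+ confirm them against the agent_framework documentation before use.
+
+ ```python
+ from agent_framework import MagenticBuilder
+ from agent_framework.openai import OpenAIChatClient
+
+ # search_agent is a ChatAgent like the one shown in section 2.3
+ workflow = (
+     MagenticBuilder()
+     .participants(searcher=search_agent)  # agents the manager can select by name
+     .with_standard_manager(chat_client=OpenAIChatClient(model_id="gpt-4o"))  # planner LLM
+     .build()
+ )
+
+ async def run() -> None:
+     # The manager plans, picks the next speaker, and streams events for the UI
+     async for event in workflow.run_stream("Research drug repurposing for metformin alzheimer"):
+         print(type(event).__name__)
+ ```
+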
64
+ ### 2.2 The Critical Insight
65
+
66
+ **Microsoft's ChatAgent has an INTERNAL LLM (`chat_client`) that:**
67
+ 1. Receives natural language instructions from the manager
68
+ 2. Understands what action to take
69
+ 3. Calls attached tools (functions)
70
+ 4. Returns natural language responses
71
+
72
+ **Our previous implementation was WRONG because:**
73
+ - We wrapped handlers as bare `BaseAgent` subclasses
74
+ - No internal LLM to understand instructions
75
+ - Raw instruction text was passed directly to APIs (PubMed doesn't understand "Search for clinical trials...")
76
+
77
+ ### 2.3 Correct Pattern: ChatAgent with Tools
78
+
79
+ ```python
80
+ # CORRECT: Agent backed by LLM that calls tools
81
+ from agent_framework import ChatAgent, AIFunction
82
+ from agent_framework.openai import OpenAIChatClient
83
+
84
+ # Define tool that ChatAgent can call
85
+ @AIFunction
86
+ async def search_pubmed(query: str, max_results: int = 10) -> str:
87
+ """Search PubMed for biomedical literature.
88
+
89
+ Args:
90
+ query: Search keywords (e.g., "metformin alzheimer mechanism")
91
+ max_results: Maximum number of results to return
92
+ """
93
+ result = await pubmed_tool.search(query, max_results)
94
+ return format_results(result)
95
+
96
+ # ChatAgent with internal LLM + tools
97
+ search_agent = ChatAgent(
98
+ name="SearchAgent",
99
+ description="Searches biomedical databases for drug repurposing evidence",
100
+ instructions="You search PubMed, ClinicalTrials.gov, and bioRxiv for evidence.",
101
+ chat_client=OpenAIChatClient(model_id="gpt-4o-mini"), # INTERNAL LLM
102
+ tools=[search_pubmed, search_clinicaltrials, search_biorxiv], # TOOLS
103
+ )
104
+ ```
105
+
106
+ ---
107
+
108
+ ## 3. Correct Implementation
109
+
110
+ ### 3.1 Shared State Module (`src/agents/state.py`)
111
+
112
+ **CRITICAL**: Tools must update shared state so:
113
+ 1. EmbeddingService can deduplicate across searches
114
+ 2. ReportAgent can access structured Evidence objects for citations
115
+
116
+ ```python
117
+ """Shared state for Magentic agents.
118
+
119
+ This module provides global state that tools update as a side effect.
120
+ ChatAgent tools return strings to the LLM, but also update this state
121
+ for semantic deduplication and structured citation access.
122
+ """
123
+ from __future__ import annotations
124
+
125
+ from typing import TYPE_CHECKING
126
+
127
+ import structlog
128
+
129
+ if TYPE_CHECKING:
130
+ from src.services.embeddings import EmbeddingService
131
+
132
+ from src.utils.models import Evidence
133
+
134
+ logger = structlog.get_logger()
135
+
136
+
137
+ class MagenticState:
138
+ """Shared state container for Magentic workflow.
139
+
140
+ Maintains:
141
+ - evidence_store: All collected Evidence objects (for citations)
142
+ - embedding_service: Optional semantic search (for deduplication)
143
+ """
144
+
145
+ def __init__(self) -> None:
146
+ self.evidence_store: list[Evidence] = []
147
+ self.embedding_service: EmbeddingService | None = None
148
+ self._seen_urls: set[str] = set()
149
+
150
+ def init_embedding_service(self) -> None:
151
+ """Lazy-initialize embedding service if available."""
152
+ if self.embedding_service is not None:
153
+ return
154
+ try:
155
+ from src.services.embeddings import get_embedding_service
156
+ self.embedding_service = get_embedding_service()
157
+ logger.info("Embedding service enabled for Magentic mode")
158
+ except Exception as e:
159
+ logger.warning("Embedding service unavailable", error=str(e))
160
+
161
+ async def add_evidence(self, evidence_list: list[Evidence]) -> list[Evidence]:
162
+ """Add evidence with semantic deduplication.
163
+
164
+ Args:
165
+ evidence_list: New evidence from search
166
+
167
+ Returns:
168
+ List of unique evidence (not duplicates)
169
+ """
170
+ if not evidence_list:
171
+ return []
172
+
173
+ # URL-based deduplication first (fast)
174
+ url_unique = [
175
+ e for e in evidence_list
176
+ if e.citation.url not in self._seen_urls
177
+ ]
178
+
179
+ # Semantic deduplication if available
180
+ if self.embedding_service and url_unique:
181
+ try:
182
+ unique = await self.embedding_service.deduplicate(url_unique, threshold=0.85)
183
+ logger.info(
184
+ "Semantic deduplication",
185
+ before=len(url_unique),
186
+ after=len(unique),
187
+ )
188
+ except Exception as e:
189
+ logger.warning("Deduplication failed, using URL-based", error=str(e))
190
+ unique = url_unique
191
+ else:
192
+ unique = url_unique
193
+
194
+ # Update state
195
+ for e in unique:
196
+ self._seen_urls.add(e.citation.url)
197
+ self.evidence_store.append(e)
198
+
199
+ return unique
200
+
201
+ async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]:
202
+ """Find semantically related evidence from vector store.
203
+
204
+ Args:
205
+ query: Search query
206
+ n_results: Number of related items
207
+
208
+ Returns:
209
+ Related Evidence objects (reconstructed from vector store)
210
+ """
211
+ if not self.embedding_service:
212
+ return []
213
+
214
+ try:
215
+ from src.utils.models import Citation
216
+
217
+ related = await self.embedding_service.search_similar(query, n_results)
218
+ evidence = []
219
+
220
+ for item in related:
221
+ if item["id"] in self._seen_urls:
222
+ continue # Already in results
223
+
224
+ meta = item.get("metadata", {})
225
+ authors_str = meta.get("authors", "")
226
+ authors = [a.strip() for a in authors_str.split(",") if a.strip()]
227
+
228
+ ev = Evidence(
229
+ content=item["content"],
230
+ citation=Citation(
231
+ title=meta.get("title", "Related Evidence"),
232
+ url=item["id"],
233
+ source=meta.get("source", "pubmed"),
234
+ date=meta.get("date", "n.d."),
235
+ authors=authors,
236
+ ),
237
+ relevance=max(0.0, 1.0 - item.get("distance", 0.5)),
238
+ )
239
+ evidence.append(ev)
240
+
241
+ return evidence
242
+ except Exception as e:
243
+ logger.warning("Related search failed", error=str(e))
244
+ return []
245
+
246
+ def reset(self) -> None:
247
+ """Reset state for new workflow run."""
248
+ self.evidence_store.clear()
249
+ self._seen_urls.clear()
250
+
251
+
252
+ # Global singleton for workflow
253
+ _state: MagenticState | None = None
254
+
255
+
256
+ def get_magentic_state() -> MagenticState:
257
+ """Get or create the global Magentic state."""
258
+ global _state
259
+ if _state is None:
260
+ _state = MagenticState()
261
+ return _state
262
+
263
+
264
+ def reset_magentic_state() -> None:
265
+ """Reset state for a fresh workflow run."""
266
+ global _state
267
+ if _state is not None:
268
+ _state.reset()
269
+ else:
270
+ _state = MagenticState()
271
+ ```
272
+
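+ A minimal sketch of how the shared state behaves across two tool calls (the `Evidence` and
+ `Citation` fields follow `src/utils/models.py` as used elsewhere in this spec):
+
+ ```python
+ import asyncio
+
+ from src.agents.state import get_magentic_state, reset_magentic_state
+ from src.utils.models import Citation, Evidence
+
+ async def demo() -> None:
+     reset_magentic_state()
+     state = get_magentic_state()
+     state.init_embedding_service()  # optional; without it, dedup is URL-based only
+
+     ev = Evidence(
+         content="Metformin shows neuroprotective effects in AD models",
+         citation=Citation(
+             source="pubmed",
+             title="Metformin and Alzheimer's",
+             url="https://pubmed.ncbi.nlm.nih.gov/12345/",
+             date="2024-01-01",
+         ),
+     )
+
+     first = await state.add_evidence([ev])   # new URL -> stored
+     second = await state.add_evidence([ev])  # same URL on a later call -> dropped
+     print(len(first), len(second), len(state.evidence_store))  # 1 0 1
+
+ asyncio.run(demo())
+ ```
+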
273
+ ### 3.2 Tool Functions (`src/agents/tools.py`)
274
+
275
+ Tools call APIs AND update shared state. Return strings to LLM, but also store structured Evidence.
276
+
277
+ ```python
278
+ """Tool functions for Magentic agents.
279
+
280
+ IMPORTANT: These tools do TWO things:
281
+ 1. Return formatted strings to the ChatAgent's internal LLM
282
+ 2. Update shared state (evidence_store, embeddings) as a side effect
283
+
284
+ This preserves semantic deduplication and structured citation access.
285
+ """
286
+ from agent_framework import AIFunction
287
+
288
+ from src.agents.state import get_magentic_state
289
+ from src.tools.biorxiv import BioRxivTool
290
+ from src.tools.clinicaltrials import ClinicalTrialsTool
291
+ from src.tools.pubmed import PubMedTool
292
+
293
+ # Singleton tool instances
294
+ _pubmed = PubMedTool()
295
+ _clinicaltrials = ClinicalTrialsTool()
296
+ _biorxiv = BioRxivTool()
297
+
298
+
299
+ def _format_results(results: list, source_name: str, query: str) -> str:
300
+ """Format search results for LLM consumption."""
301
+ if not results:
302
+ return f"No {source_name} results found for: {query}"
303
+
304
+ output = [f"Found {len(results)} {source_name} results:\n"]
305
+ for i, r in enumerate(results[:10], 1):
306
+ output.append(f"{i}. **{r.citation.title}**")
307
+ output.append(f" Source: {r.citation.source} | Date: {r.citation.date}")
308
+ output.append(f" {r.content[:300]}...")
309
+ output.append(f" URL: {r.citation.url}\n")
310
+
311
+ return "\n".join(output)
312
+
313
+
314
+ @AIFunction
315
+ async def search_pubmed(query: str, max_results: int = 10) -> str:
316
+ """Search PubMed for biomedical research papers.
317
+
318
+ Use this tool to find peer-reviewed scientific literature about
319
+ drugs, diseases, mechanisms of action, and clinical studies.
320
+
321
+ Args:
322
+ query: Search keywords (e.g., "metformin alzheimer mechanism")
323
+ max_results: Maximum results to return (default 10)
324
+
325
+ Returns:
326
+ Formatted list of papers with titles, abstracts, and citations
327
+ """
328
+ # 1. Execute search
329
+ results = await _pubmed.search(query, max_results)
330
+
331
+ # 2. Update shared state (semantic dedup + evidence store)
332
+ state = get_magentic_state()
333
+ unique = await state.add_evidence(results)
334
+
335
+ # 3. Also get related evidence from vector store
336
+ related = await state.search_related(query, n_results=3)
337
+ if related:
338
+ await state.add_evidence(related)
339
+
340
+ # 4. Return formatted string for LLM
341
+ total_new = len(unique)
342
+ total_stored = len(state.evidence_store)
343
+
344
+ output = _format_results(results, "PubMed", query)
345
+ output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"
346
+
347
+ if related:
348
+ output += f"\n[Also found {len(related)} semantically related items from previous searches]"
349
+
350
+ return output
351
+
352
+
353
+ @AIFunction
354
+ async def search_clinical_trials(query: str, max_results: int = 10) -> str:
355
+ """Search ClinicalTrials.gov for clinical studies.
356
+
357
+ Use this tool to find ongoing and completed clinical trials
358
+ for drug repurposing candidates.
359
+
360
+ Args:
361
+ query: Search terms (e.g., "metformin cancer phase 3")
362
+ max_results: Maximum results to return (default 10)
363
+
364
+ Returns:
365
+ Formatted list of clinical trials with status and details
366
+ """
367
+ # 1. Execute search
368
+ results = await _clinicaltrials.search(query, max_results)
369
+
370
+ # 2. Update shared state
371
+ state = get_magentic_state()
372
+ unique = await state.add_evidence(results)
373
+
374
+ # 3. Return formatted string
375
+ total_new = len(unique)
376
+ total_stored = len(state.evidence_store)
377
+
378
+ output = _format_results(results, "ClinicalTrials.gov", query)
379
+ output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"
380
+
381
+ return output
382
+
383
+
384
+ @AIFunction
385
+ async def search_preprints(query: str, max_results: int = 10) -> str:
386
+ """Search bioRxiv/medRxiv for preprint papers.
387
+
388
+ Use this tool to find the latest research that hasn't been
389
+ peer-reviewed yet. Good for cutting-edge findings.
390
+
391
+ Args:
392
+ query: Search terms (e.g., "long covid treatment")
393
+ max_results: Maximum results to return (default 10)
394
+
395
+ Returns:
396
+ Formatted list of preprints with abstracts and links
397
+ """
398
+ # 1. Execute search
399
+ results = await _biorxiv.search(query, max_results)
400
+
401
+ # 2. Update shared state
402
+ state = get_magentic_state()
403
+ unique = await state.add_evidence(results)
404
+
405
+ # 3. Return formatted string
406
+ total_new = len(unique)
407
+ total_stored = len(state.evidence_store)
408
+
409
+ output = _format_results(results, "bioRxiv/medRxiv", query)
410
+ output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"
411
+
412
+ return output
413
+
414
+
415
+ @AIFunction
416
+ async def get_evidence_summary() -> str:
417
+ """Get summary of all collected evidence.
418
+
419
+ Use this tool when you need to review what evidence has been collected
420
+ before making an assessment or generating a report.
421
+
422
+ Returns:
423
+ Summary of evidence store with counts and key citations
424
+ """
425
+ state = get_magentic_state()
426
+ evidence = state.evidence_store
427
+
428
+ if not evidence:
429
+ return "No evidence collected yet."
430
+
431
+ # Group by source
432
+ by_source: dict[str, list] = {}
433
+ for e in evidence:
434
+ src = e.citation.source
435
+ if src not in by_source:
436
+ by_source[src] = []
437
+ by_source[src].append(e)
438
+
439
+ output = [f"**Evidence Store Summary** ({len(evidence)} total items)\n"]
440
+
441
+ for source, items in by_source.items():
442
+ output.append(f"\n### {source.upper()} ({len(items)} items)")
443
+ for e in items[:5]: # First 5 per source
444
+ output.append(f"- {e.citation.title[:80]}...")
445
+
446
+ return "\n".join(output)
447
+
448
+
449
+ @AIFunction
450
+ async def get_bibliography() -> str:
451
+ """Get full bibliography of all collected evidence.
452
+
453
+ Use this tool when generating a final report to get properly
454
+ formatted citations for all evidence.
455
+
456
+ Returns:
457
+ Numbered bibliography with full citation details
458
+ """
459
+ state = get_magentic_state()
460
+ evidence = state.evidence_store
461
+
462
+ if not evidence:
463
+ return "No evidence collected for bibliography."
464
+
465
+ output = ["## References\n"]
466
+
467
+ for i, e in enumerate(evidence, 1):
468
+ # Format: Authors (Year). Title. Source. URL
469
+ authors = ", ".join(e.citation.authors[:3]) if e.citation.authors else "Unknown"
470
+ if e.citation.authors and len(e.citation.authors) > 3:
471
+ authors += " et al."
472
+
473
+ year = e.citation.date[:4] if e.citation.date else "n.d."
474
+
475
+ output.append(
476
+ f"{i}. {authors} ({year}). {e.citation.title}. "
477
+ f"*{e.citation.source.upper()}*. [{e.citation.url}]({e.citation.url})"
478
+ )
479
+
480
+ return "\n".join(output)
481
+ ```
482
+
483
+ ### 3.3 ChatAgent-Based Agents (`src/agents/magentic_agents.py`)
484
+
485
+ ```python
486
+ """Magentic-compatible agents using ChatAgent pattern."""
487
+ from agent_framework import ChatAgent
488
+ from agent_framework.openai import OpenAIChatClient
489
+
490
+ from src.agents.tools import (
491
+ get_bibliography,
492
+ get_evidence_summary,
493
+ search_clinical_trials,
494
+ search_preprints,
495
+ search_pubmed,
496
+ )
497
+ from src.utils.config import settings
498
+
499
+
500
+ def create_search_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
501
+ """Create a search agent with internal LLM and search tools.
502
+
503
+ Args:
504
+ chat_client: Optional custom chat client. If None, uses default.
505
+
506
+ Returns:
507
+ ChatAgent configured for biomedical search
508
+ """
509
+ client = chat_client or OpenAIChatClient(
510
+ model_id="gpt-4o-mini", # Fast, cheap for tool orchestration
511
+ api_key=settings.openai_api_key,
512
+ )
513
+
514
+ return ChatAgent(
515
+ name="SearchAgent",
516
+ description="Searches biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) for drug repurposing evidence",
517
+ instructions="""You are a biomedical search specialist. When asked to find evidence:
518
+
519
+ 1. Analyze the request to determine what to search for
520
+ 2. Extract key search terms (drug names, disease names, mechanisms)
521
+ 3. Use the appropriate search tools:
522
+ - search_pubmed for peer-reviewed papers
523
+ - search_clinical_trials for clinical studies
524
+ - search_preprints for cutting-edge findings
525
+ 4. Summarize what you found and highlight key evidence
526
+
527
+ Be thorough - search multiple databases when appropriate.
528
+ Focus on finding: mechanisms of action, clinical evidence, and specific drug candidates.""",
529
+ chat_client=client,
530
+ tools=[search_pubmed, search_clinical_trials, search_preprints],
531
+ temperature=0.3, # More deterministic for tool use
532
+ )
533
+
534
+
535
+ def create_judge_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
536
+ """Create a judge agent that evaluates evidence quality.
537
+
538
+ Args:
539
+ chat_client: Optional custom chat client. If None, uses default.
540
+
541
+ Returns:
542
+ ChatAgent configured for evidence assessment
543
+ """
544
+ client = chat_client or OpenAIChatClient(
545
+ model_id="gpt-4o", # Better model for nuanced judgment
546
+ api_key=settings.openai_api_key,
547
+ )
548
+
549
+ return ChatAgent(
550
+ name="JudgeAgent",
551
+ description="Evaluates evidence quality and determines if sufficient for synthesis",
552
+ instructions="""You are an evidence quality assessor. When asked to evaluate:
553
+
554
+ 1. First, call get_evidence_summary() to see all collected evidence
555
+ 2. Score on two dimensions (0-10 each):
556
+ - Mechanism Score: How well is the biological mechanism explained?
557
+ - Clinical Score: How strong is the clinical/preclinical evidence?
558
+ 3. Determine if evidence is SUFFICIENT for a final report:
559
+ - Sufficient: Clear mechanism + supporting clinical data
560
+ - Insufficient: Gaps in mechanism OR weak clinical evidence
561
+ 4. If insufficient, suggest specific search queries to fill gaps
562
+
563
+ Be rigorous but fair. Look for:
564
+ - Molecular targets and pathways
565
+ - Animal model studies
566
+ - Human clinical trials
567
+ - Safety data
568
+ - Drug-drug interactions""",
569
+ chat_client=client,
570
+ tools=[get_evidence_summary], # Can review collected evidence
571
+ temperature=0.2, # Consistent judgments
572
+ )
573
+
574
+
575
+ def create_hypothesis_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
576
+ """Create a hypothesis generation agent.
577
+
578
+ Args:
579
+ chat_client: Optional custom chat client. If None, uses default.
580
+
581
+ Returns:
582
+ ChatAgent configured for hypothesis generation
583
+ """
584
+ client = chat_client or OpenAIChatClient(
585
+ model_id="gpt-4o",
586
+ api_key=settings.openai_api_key,
587
+ )
588
+
589
+ return ChatAgent(
590
+ name="HypothesisAgent",
591
+ description="Generates mechanistic hypotheses for drug repurposing",
592
+ instructions="""You are a biomedical hypothesis generator. Based on evidence:
593
+
594
+ 1. Identify the key molecular targets involved
595
+ 2. Map the biological pathways affected
596
+ 3. Generate testable hypotheses in this format:
597
+
598
+ DRUG → TARGET → PATHWAY → THERAPEUTIC EFFECT
599
+
600
+ Example:
601
+ Metformin → AMPK activation → mTOR inhibition → Reduced tau phosphorylation
602
+
603
+ 4. Explain the rationale for each hypothesis
604
+ 5. Suggest what additional evidence would support or refute it
605
+
606
+ Focus on mechanistic plausibility and existing evidence.""",
607
+ chat_client=client,
608
+ temperature=0.5, # Some creativity for hypothesis generation
609
+ )
610
+
611
+
612
+ def create_report_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
613
+ """Create a report synthesis agent.
614
+
615
+ Args:
616
+ chat_client: Optional custom chat client. If None, uses default.
617
+
618
+ Returns:
619
+ ChatAgent configured for report generation
620
+ """
621
+ client = chat_client or OpenAIChatClient(
622
+ model_id="gpt-4o",
623
+ api_key=settings.openai_api_key,
624
+ )
625
+
626
+ return ChatAgent(
627
+ name="ReportAgent",
628
+ description="Synthesizes research findings into structured reports",
629
+ instructions="""You are a scientific report writer. When asked to synthesize:
630
+
631
+ 1. First, call get_evidence_summary() to review all collected evidence
632
+ 2. Then call get_bibliography() to get properly formatted citations
633
+
634
+ Generate a structured report with these sections:
635
+
636
+ ## Executive Summary
637
+ Brief overview of findings and recommendation
638
+
639
+ ## Methodology
640
+ Databases searched, queries used, evidence reviewed
641
+
642
+ ## Key Findings
643
+ ### Mechanism of Action
644
+ - Molecular targets
645
+ - Biological pathways
646
+ - Proposed mechanism
647
+
648
+ ### Clinical Evidence
649
+ - Preclinical studies
650
+ - Clinical trials
651
+ - Safety profile
652
+
653
+ ## Drug Candidates
654
+ List specific drugs with repurposing potential
655
+
656
+ ## Limitations
657
+ Gaps in evidence, conflicting data, caveats
658
+
659
+ ## Conclusion
660
+ Final recommendation with confidence level
661
+
662
+ ## References
663
+ Use the output from get_bibliography() - do not make up citations!
664
+
665
+ Be comprehensive but concise. Cite evidence for all claims.""",
666
+ chat_client=client,
667
+ tools=[get_evidence_summary, get_bibliography], # Access to collected evidence
668
+ temperature=0.3,
669
+ )
670
+ ```
671
+
672
+ ### 3.4 Magentic Orchestrator (`src/orchestrator_magentic.py`)
673
+
674
+ ```python
675
+ """Magentic-based orchestrator using ChatAgent pattern."""
676
+ from collections.abc import AsyncGenerator
677
+ from typing import Any
678
+
679
+ import structlog
680
+ from agent_framework import (
681
+ MagenticAgentDeltaEvent,
682
+ MagenticAgentMessageEvent,
683
+ MagenticBuilder,
684
+ MagenticFinalResultEvent,
685
+ MagenticOrchestratorMessageEvent,
686
+ WorkflowOutputEvent,
687
+ )
688
+ from agent_framework.openai import OpenAIChatClient
689
+
690
+ from src.agents.magentic_agents import (
691
+ create_hypothesis_agent,
692
+ create_judge_agent,
693
+ create_report_agent,
694
+ create_search_agent,
695
+ )
696
+ from src.agents.state import get_magentic_state, reset_magentic_state
697
+ from src.utils.config import settings
698
+ from src.utils.exceptions import ConfigurationError
699
+ from src.utils.models import AgentEvent
700
+
701
+ logger = structlog.get_logger()
702
+
703
+
704
+ class MagenticOrchestrator:
705
+ """
706
+ Magentic-based orchestrator using ChatAgent pattern.
707
+
708
+ Each agent has an internal LLM that understands natural language
709
+ instructions from the manager and can call tools appropriately.
710
+ """
711
+
712
+ def __init__(
713
+ self,
714
+ max_rounds: int = 10,
715
+ chat_client: OpenAIChatClient | None = None,
716
+ ) -> None:
717
+ """Initialize orchestrator.
718
+
719
+ Args:
720
+ max_rounds: Maximum coordination rounds
721
+ chat_client: Optional shared chat client for agents
722
+ """
723
+ if not settings.openai_api_key:
724
+ raise ConfigurationError(
725
+ "Magentic mode requires OPENAI_API_KEY. "
726
+ "Set the key or use mode='simple'."
727
+ )
728
+
729
+ self._max_rounds = max_rounds
730
+ self._chat_client = chat_client
731
+
732
+ def _build_workflow(self) -> Any:
733
+ """Build the Magentic workflow with ChatAgent participants."""
734
+ # Create agents with internal LLMs
735
+ search_agent = create_search_agent(self._chat_client)
736
+ judge_agent = create_judge_agent(self._chat_client)
737
+ hypothesis_agent = create_hypothesis_agent(self._chat_client)
738
+ report_agent = create_report_agent(self._chat_client)
739
+
740
+ # Manager chat client (orchestrates the agents)
741
+ manager_client = OpenAIChatClient(
742
+ model_id="gpt-4o", # Good model for planning/coordination
743
+ api_key=settings.openai_api_key,
744
+ )
745
+
746
+ return (
747
+ MagenticBuilder()
748
+ .participants(
749
+ searcher=search_agent,
750
+ hypothesizer=hypothesis_agent,
751
+ judge=judge_agent,
752
+ reporter=report_agent,
753
+ )
754
+ .with_standard_manager(
755
+ chat_client=manager_client,
756
+ max_round_count=self._max_rounds,
757
+ max_stall_count=3,
758
+ max_reset_count=2,
759
+ )
760
+ .build()
761
+ )
762
+
763
+ async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
764
+ """
765
+ Run the Magentic workflow.
766
+
767
+ Args:
768
+ query: User's research question
769
+
770
+ Yields:
771
+ AgentEvent objects for real-time UI updates
772
+ """
773
+ logger.info("Starting Magentic orchestrator", query=query)
774
+
775
+ # CRITICAL: Reset state for fresh workflow run
776
+ reset_magentic_state()
777
+
778
+ # Initialize embedding service if available
779
+ state = get_magentic_state()
780
+ state.init_embedding_service()
781
+
782
+ yield AgentEvent(
783
+ type="started",
784
+ message=f"Starting research (Magentic mode): {query}",
785
+ iteration=0,
786
+ )
787
+
788
+ workflow = self._build_workflow()
789
+
790
+ task = f"""Research drug repurposing opportunities for: {query}
791
+
792
+ Workflow:
793
+ 1. SearchAgent: Find evidence from PubMed, ClinicalTrials.gov, and bioRxiv
794
+ 2. HypothesisAgent: Generate mechanistic hypotheses (Drug → Target → Pathway → Effect)
795
+ 3. JudgeAgent: Evaluate if evidence is sufficient
796
+ 4. If insufficient → SearchAgent refines search based on gaps
797
+ 5. If sufficient → ReportAgent synthesizes final report
798
+
799
+ Focus on:
800
+ - Identifying specific molecular targets
801
+ - Understanding mechanism of action
802
+ - Finding clinical evidence supporting hypotheses
803
+
804
+ The final output should be a structured research report."""
805
+
806
+ iteration = 0
807
+ try:
808
+ async for event in workflow.run_stream(task):
809
+ agent_event = self._process_event(event, iteration)
810
+ if agent_event:
811
+ if isinstance(event, MagenticAgentMessageEvent):
812
+ iteration += 1
813
+ yield agent_event
814
+
815
+ except Exception as e:
816
+ logger.error("Magentic workflow failed", error=str(e))
817
+ yield AgentEvent(
818
+ type="error",
819
+ message=f"Workflow error: {e!s}",
820
+ iteration=iteration,
821
+ )
822
+
823
+ def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
824
+ """Process workflow event into AgentEvent."""
825
+ if isinstance(event, MagenticOrchestratorMessageEvent):
826
+ text = event.message.text if event.message else ""
827
+ if text:
828
+ return AgentEvent(
829
+ type="judging",
830
+ message=f"Manager ({event.kind}): {text[:200]}...",
831
+ iteration=iteration,
832
+ )
833
+
834
+ elif isinstance(event, MagenticAgentMessageEvent):
835
+ agent_name = event.agent_id or "unknown"
836
+ text = event.message.text if event.message else ""
837
+
838
+ event_type = "judging"
839
+ if "search" in agent_name.lower():
840
+ event_type = "search_complete"
841
+ elif "judge" in agent_name.lower():
842
+ event_type = "judge_complete"
843
+ elif "hypothes" in agent_name.lower():
844
+ event_type = "hypothesizing"
845
+ elif "report" in agent_name.lower():
846
+ event_type = "synthesizing"
847
+
848
+ return AgentEvent(
849
+ type=event_type,
850
+ message=f"{agent_name}: {text[:200]}...",
851
+ iteration=iteration + 1,
852
+ )
853
+
854
+ elif isinstance(event, MagenticFinalResultEvent):
855
+ text = event.message.text if event.message else "No result"
856
+ return AgentEvent(
857
+ type="complete",
858
+ message=text,
859
+ data={"iterations": iteration},
860
+ iteration=iteration,
861
+ )
862
+
863
+ elif isinstance(event, MagenticAgentDeltaEvent):
864
+ if event.text:
865
+ return AgentEvent(
866
+ type="streaming",
867
+ message=event.text,
868
+ data={"agent_id": event.agent_id},
869
+ iteration=iteration,
870
+ )
871
+
872
+ elif isinstance(event, WorkflowOutputEvent):
873
+ if event.data:
874
+ return AgentEvent(
875
+ type="complete",
876
+ message=str(event.data),
877
+ iteration=iteration,
878
+ )
879
+
880
+ return None
881
+ ```
882
+
883
+ ### 3.5 Updated Factory (`src/orchestrator_factory.py`)
884
+
885
+ ```python
886
+ """Factory for creating orchestrators."""
887
+ from typing import Any, Literal
888
+
889
+ from src.orchestrator import JudgeHandlerProtocol, Orchestrator, SearchHandlerProtocol
890
+ from src.utils.models import OrchestratorConfig
891
+
892
+
893
+ def create_orchestrator(
894
+ search_handler: SearchHandlerProtocol | None = None,
895
+ judge_handler: JudgeHandlerProtocol | None = None,
896
+ config: OrchestratorConfig | None = None,
897
+ mode: Literal["simple", "magentic"] = "simple",
898
+ ) -> Any:
899
+ """
900
+ Create an orchestrator instance.
901
+
902
+ Args:
903
+ search_handler: The search handler (required for simple mode)
904
+ judge_handler: The judge handler (required for simple mode)
905
+ config: Optional configuration
906
+ mode: "simple" for Phase 4 loop, "magentic" for ChatAgent-based multi-agent
907
+
908
+ Returns:
909
+ Orchestrator instance
910
+
911
+ Note:
912
+ Magentic mode does NOT use search_handler/judge_handler.
913
+ It creates ChatAgent instances with internal LLMs that call tools directly.
914
+ """
915
+ if mode == "magentic":
916
+ try:
917
+ from src.orchestrator_magentic import MagenticOrchestrator
918
+
919
+ return MagenticOrchestrator(
920
+ max_rounds=config.max_iterations if config else 10,
921
+ )
922
+ except ImportError:
923
+ # Fallback to simple if agent-framework not installed
924
+ pass
925
+
926
+ # Simple mode requires handlers
927
+ if search_handler is None or judge_handler is None:
928
+ raise ValueError("Simple mode requires search_handler and judge_handler")
929
+
930
+ return Orchestrator(
931
+ search_handler=search_handler,
932
+ judge_handler=judge_handler,
933
+ config=config,
934
+ )
935
+ ```
936
+
937
+ ---
938
+
939
+ ## 4. Why This Works
940
+
941
+ ### 4.1 The Manager → Agent Communication
942
+
943
+ ```
944
+ Manager LLM decides: "Tell SearchAgent to find clinical trials for metformin"
945
+
946
+ Sends instruction: "Search for clinical trials about metformin and cancer"
947
+
948
+ SearchAgent's INTERNAL LLM receives this
949
+
950
+ Internal LLM understands: "I should call search_clinical_trials('metformin cancer')"
951
+
952
+ Tool executes: ClinicalTrials.gov API
953
+
954
+ Internal LLM formats response: "I found 15 trials. Here are the key ones..."
955
+
956
+ Manager receives natural language response
957
+ ```
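+ 
+ As a quick sanity check (a sketch, not part of the spec), the same translation can be exercised by driving the SearchAgent directly, outside the Magentic workflow. This assumes `ChatAgent.run()` accepts a plain string and returns messages the way the agents in this document do, and it needs `OPENAI_API_KEY` set:
+ 
+ ```python
+ # Hypothetical smoke test: the agent's internal LLM should turn the natural-language
+ # instruction into a search_clinical_trials("metformin cancer") tool call on its own.
+ import asyncio
+ 
+ from src.agents.magentic_agents import create_search_agent
+ 
+ 
+ async def demo() -> None:
+     agent = create_search_agent()
+     response = await agent.run("Search for clinical trials about metformin and cancer")
+     print(response.messages[0].text)
+ 
+ 
+ asyncio.run(demo())
+ ```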
958
+
959
+ ### 4.2 Why Our Old Implementation Failed
960
+
961
+ ```
962
+ Manager sends: "Search for clinical trials about metformin..."
963
+
964
+ OLD SearchAgent.run() extracts: query = "Search for clinical trials about metformin..."
965
+
966
+ Passes to PubMed: pubmed.search("Search for clinical trials about metformin...")
967
+
968
+ PubMed doesn't understand English instructions → garbage results or error
969
+ ```
970
+
971
+ ---
972
+
973
+ ## 5. Directory Structure
974
+
975
+ ```text
976
+ src/
977
+ ├── agents/
978
+ │ ├── __init__.py
979
+ │ ├── state.py # MagenticState (evidence_store + embeddings)
980
+ │ ├── tools.py # AIFunction tool definitions (update state)
981
+ │ └── magentic_agents.py # ChatAgent factory functions
982
+ ├── services/
983
+ │ └── embeddings.py # EmbeddingService (semantic dedup)
984
+ ├── orchestrator.py # Simple mode (unchanged)
985
+ ├── orchestrator_magentic.py # Magentic mode with ChatAgents
986
+ └── orchestrator_factory.py # Mode selection
987
+ ```
988
+
989
+ ---
990
+
991
+ ## 6. Dependencies
992
+
993
+ ```toml
994
+ [project.optional-dependencies]
995
+ magentic = [
996
+ "agent-framework-core>=1.0.0b",
997
+ "agent-framework-openai>=1.0.0b", # For OpenAIChatClient
998
+ ]
999
+ embeddings = [
1000
+ "chromadb>=0.4.0",
1001
+ "sentence-transformers>=2.2.0",
1002
+ ]
1003
+ ```
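+ 
+ One way to pull in both optional stacks during development (a sketch; assumes an editable install from the repository root, with `uv` as used by the test command in section 9):
+ 
+ ```bash
+ # With uv
+ uv sync --extra magentic --extra embeddings
+ 
+ # Or with plain pip
+ pip install -e ".[magentic,embeddings]"
+ ```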
1004
+
1005
+ **IMPORTANT: Magentic mode REQUIRES an OpenAI API key.**
1006
+
1007
+ The Microsoft Agent Framework's standard manager and ChatAgent use OpenAIChatClient internally.
1008
+ There is no AnthropicChatClient in the framework. If only `ANTHROPIC_API_KEY` is set:
1009
+ - `mode="simple"` works fine
1010
+ - `mode="magentic"` throws `ConfigurationError`
1011
+
1012
+ This is enforced in `MagenticOrchestrator.__init__`.
1013
+
1014
+ ---
1015
+
1016
+ ## 7. Implementation Checklist
1017
+
1018
+ - [ ] Create `src/agents/state.py` with MagenticState class
1019
+ - [ ] Create `src/agents/tools.py` with AIFunction search tools + state updates
1020
+ - [ ] Create `src/agents/magentic_agents.py` with ChatAgent factories
1021
+ - [ ] Rewrite `src/orchestrator_magentic.py` to use ChatAgent pattern
1022
+ - [ ] Update `src/orchestrator_factory.py` for new signature
1023
+ - [ ] Test with real OpenAI API
1024
+ - [ ] Verify manager properly coordinates agents
1025
+ - [ ] Ensure tools are called with correct parameters
1026
+ - [ ] Verify semantic deduplication works (evidence_store populates)
1027
+ - [ ] Verify bibliography generation in final reports
1028
+
1029
+ ---
1030
+
1031
+ ## 8. Definition of Done
1032
+
1033
+ Phase 5 is **COMPLETE** when:
1034
+
1035
+ 1. Magentic mode runs without hanging
1036
+ 2. Manager successfully coordinates agents via natural language
1037
+ 3. SearchAgent calls tools with proper search keywords (not raw instructions)
1038
+ 4. JudgeAgent evaluates evidence from conversation history
1039
+ 5. ReportAgent generates structured final report
1040
+ 6. Events stream to UI correctly
1041
+
1042
+ ---
1043
+
1044
+ ## 9. Testing Magentic Mode
1045
+
1046
+ ```bash
1047
+ # Test with real API
1048
+ OPENAI_API_KEY=sk-... uv run python -c "
1049
+ import asyncio
1050
+ from src.orchestrator_factory import create_orchestrator
1051
+
1052
+ async def test():
1053
+ orch = create_orchestrator(mode='magentic')
1054
+ async for event in orch.run('metformin alzheimer'):
1055
+ print(f'[{event.type}] {event.message[:100]}')
1056
+
1057
+ asyncio.run(test())
1058
+ "
1059
+ ```
1060
+
1061
+ Expected output:
1062
+ ```
1063
+ [started] Starting research (Magentic mode): metformin alzheimer
1064
+ [judging] Manager (plan): I will coordinate the agents to research...
1065
+ [search_complete] SearchAgent: Found 25 PubMed results for metformin alzheimer...
1066
+ [hypothesizing] HypothesisAgent: Based on the evidence, I propose...
1067
+ [judge_complete] JudgeAgent: Mechanism Score: 7/10, Clinical Score: 6/10...
1068
+ [synthesizing] ReportAgent: ## Executive Summary...
1069
+ [complete] <full research report>
1070
+ ```
1071
+
1072
+ ---
1073
+
1074
+ ## 10. Key Differences from Old Spec
1075
+
1076
+ | Aspect | OLD (Wrong) | NEW (Correct) |
1077
+ |--------|-------------|---------------|
1078
+ | Agent type | `BaseAgent` subclass | `ChatAgent` with `chat_client` |
1079
+ | Internal LLM | None | OpenAIChatClient |
1080
+ | How tools work | Handler.execute(raw_instruction) | LLM understands instruction, calls AIFunction |
1081
+ | Message handling | Extract text → pass to API | LLM interprets → extracts keywords → calls tool |
1082
+ | State management | Passed to agent constructors | Global MagenticState singleton |
1083
+ | Evidence storage | In agent instance | In MagenticState.evidence_store |
1084
+ | Semantic search | Coupled to agents | Tools call state.add_evidence() |
1085
+ | Citations for report | From agent's store | Via get_bibliography() tool |
1086
+
1087
+ **Key Insights:**
1088
+ 1. Magentic agents must have internal LLMs to understand natural language instructions
1089
+ 2. Tools must update shared state as a side effect (return strings, but also store Evidence)
1090
+ 3. ReportAgent uses `get_bibliography()` tool to access structured citations
1091
+ 4. State is reset at start of each workflow run via `reset_magentic_state()`
docs/implementation/06_phase_embeddings.md ADDED
@@ -0,0 +1,409 @@
1
+ # Phase 6 Implementation Spec: Embeddings & Semantic Search
2
+
3
+ **Goal**: Add vector search for semantic evidence retrieval.
4
+ **Philosophy**: "Find what you mean, not just what you type."
5
+ **Prerequisite**: Phase 5 complete (Magentic working)
6
+
7
+ ---
8
+
9
+ ## 1. Why Embeddings?
10
+
11
+ Current limitation: **Keyword-only search misses semantically related papers.**
12
+
13
+ Example problem:
14
+ - User searches: "metformin alzheimer"
15
+ - PubMed returns: Papers with exact keywords
16
+ - MISSED: Papers about "AMPK activation neuroprotection" (same mechanism, different words)
17
+
18
+ With embeddings:
19
+ - Embed the query AND all evidence
20
+ - Find semantically similar papers even without keyword match
21
+ - Deduplicate by meaning, not just URL
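+ 
+ As a quick illustration (a sketch, separate from the implementation below), the same model used later in this spec can score the "missed" paper against the query directly:
+ 
+ ```python
+ # Semantic similarity surfaces mechanistically related text with no keyword overlap.
+ from sentence_transformers import SentenceTransformer, util
+ 
+ model = SentenceTransformer("all-MiniLM-L6-v2")
+ query = "metformin alzheimer"
+ candidates = [
+     "AMPK activation confers neuroprotection in neurodegeneration",  # related, no shared keywords
+     "The weather is sunny today",  # unrelated control
+ ]
+ scores = util.cos_sim(model.encode(query), model.encode(candidates))
+ print(scores)  # the first candidate scores clearly higher than the second
+ ```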
22
+
23
+ ---
24
+
25
+ ## 2. Architecture
26
+
27
+ ### Current (Phase 5)
28
+ ```
29
+ Query → SearchAgent → PubMed/Web (keyword) → Evidence
30
+ ```
31
+
32
+ ### Phase 6
33
+ ```
34
+ Query → Embed(Query) → SearchAgent
35
+ ├── PubMed/Web (keyword) → Evidence
36
+ └── VectorDB (semantic) → Related Evidence
37
+
38
+ Evidence → Embed → Store
39
+ ```
40
+
41
+ ### Shared Context Enhancement
42
+ ```python
43
+ # Current
44
+ evidence_store = {"current": []}
45
+
46
+ # Phase 6
47
+ evidence_store = {
48
+ "current": [], # Raw evidence
49
+ "embeddings": {}, # URL -> embedding vector
50
+ "vector_index": None, # ChromaDB collection
51
+ }
52
+ ```
53
+
54
+ ---
55
+
56
+ ## 3. Technology Choice
57
+
58
+ ### ChromaDB (Recommended)
59
+ - **Free**, open-source, local-first
60
+ - No API keys, no cloud dependency
61
+ - Supports sentence-transformers out of the box
62
+ - Perfect for hackathon (no infra setup)
63
+
64
+ ### Embedding Model
65
+ - `sentence-transformers/all-MiniLM-L6-v2` (fast, good quality)
66
+ - Or `BAAI/bge-small-en-v1.5` (better quality, still fast)
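+ 
+ The model name is just a constructor argument on the `EmbeddingService` defined in section 4.2, so swapping it is a one-liner:
+ 
+ ```python
+ from src.services.embeddings import EmbeddingService
+ 
+ # Default is all-MiniLM-L6-v2; pass the BGE model if quality matters more than speed.
+ service = EmbeddingService(model_name="BAAI/bge-small-en-v1.5")
+ ```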
67
+
68
+ ---
69
+
70
+ ## 4. Implementation
71
+
72
+ ### 4.1 Dependencies
73
+
74
+ Add to `pyproject.toml`:
75
+ ```toml
76
+ [project.optional-dependencies]
77
+ embeddings = [
78
+ "chromadb>=0.4.0",
79
+ "sentence-transformers>=2.2.0",
80
+ ]
81
+ ```
82
+
83
+ ### 4.2 Embedding Service (`src/services/embeddings.py`)
84
+
85
+ > **CRITICAL: Async Pattern Required**
86
+ >
87
+ > `sentence-transformers` is synchronous and CPU-bound. Running it directly in async code
88
+ > will **block the event loop**, freezing the UI and halting all concurrent operations.
89
+ >
90
+ > **Solution**: Use `asyncio.run_in_executor()` to offload to thread pool.
91
+ > This pattern already exists in `src/tools/websearch.py:28-34`.
92
+
93
+ ```python
94
+ """Embedding service for semantic search.
95
+
96
+ IMPORTANT: All public methods are async to avoid blocking the event loop.
97
+ The sentence-transformers model is CPU-bound, so we use run_in_executor().
98
+ """
99
+ import asyncio
100
+ from typing import List
101
+
102
+ import chromadb
103
+ from sentence_transformers import SentenceTransformer
104
+
105
+
106
+ class EmbeddingService:
107
+ """Handles text embedding and vector storage.
108
+
109
+ All embedding operations run in a thread pool to avoid blocking
110
+ the async event loop. See src/tools/websearch.py for the pattern.
111
+ """
112
+
113
+ def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
114
+ self._model = SentenceTransformer(model_name)
115
+ self._client = chromadb.Client() # In-memory for hackathon
116
+ self._collection = self._client.create_collection(
117
+ name="evidence",
118
+ metadata={"hnsw:space": "cosine"}
119
+ )
120
+
121
+ # ─────────────────────────────────────────────────────────────────
122
+ # Sync internal methods (run in thread pool)
123
+ # ─────────────────────────────────────────────────────────────────
124
+
125
+ def _sync_embed(self, text: str) -> List[float]:
126
+ """Synchronous embedding - DO NOT call directly from async code."""
127
+ return self._model.encode(text).tolist()
128
+
129
+ def _sync_batch_embed(self, texts: List[str]) -> List[List[float]]:
130
+ """Batch embedding for efficiency - DO NOT call directly from async code."""
131
+ return [e.tolist() for e in self._model.encode(texts)]
132
+
133
+ # ─────────────────────────────────────────────────────────────────
134
+ # Async public methods (safe for event loop)
135
+ # ─────────────────────────────────────────────────────────────────
136
+
137
+ async def embed(self, text: str) -> List[float]:
138
+ """Embed a single text (async-safe).
139
+
140
+ Uses run_in_executor to avoid blocking the event loop.
141
+ """
142
+ loop = asyncio.get_running_loop()
143
+ return await loop.run_in_executor(None, self._sync_embed, text)
144
+
145
+ async def embed_batch(self, texts: List[str]) -> List[List[float]]:
146
+ """Batch embed multiple texts (async-safe, more efficient)."""
147
+ loop = asyncio.get_running_loop()
148
+ return await loop.run_in_executor(None, self._sync_batch_embed, texts)
149
+
150
+ async def add_evidence(self, evidence_id: str, content: str, metadata: dict) -> None:
151
+ """Add evidence to vector store (async-safe)."""
152
+ embedding = await self.embed(content)
153
+ # ChromaDB operations are fast, but wrap for consistency
154
+ loop = asyncio.get_running_loop()
155
+ await loop.run_in_executor(
156
+ None,
157
+ lambda: self._collection.add(
158
+ ids=[evidence_id],
159
+ embeddings=[embedding],
160
+ metadatas=[metadata],
161
+ documents=[content]
162
+ )
163
+ )
164
+
165
+ async def search_similar(self, query: str, n_results: int = 5) -> List[dict]:
166
+ """Find semantically similar evidence (async-safe)."""
167
+ query_embedding = await self.embed(query)
168
+
169
+ loop = asyncio.get_running_loop()
170
+ results = await loop.run_in_executor(
171
+ None,
172
+ lambda: self._collection.query(
173
+ query_embeddings=[query_embedding],
174
+ n_results=n_results
175
+ )
176
+ )
177
+
178
+ # Handle empty results gracefully
179
+ if not results["ids"] or not results["ids"][0]:
180
+ return []
181
+
182
+ return [
183
+ {"id": id, "content": doc, "metadata": meta, "distance": dist}
184
+ for id, doc, meta, dist in zip(
185
+ results["ids"][0],
186
+ results["documents"][0],
187
+ results["metadatas"][0],
188
+ results["distances"][0]
189
+ )
190
+ ]
191
+
192
+ async def deduplicate(self, new_evidence: List, threshold: float = 0.9) -> List:
193
+ """Remove semantically duplicate evidence (async-safe)."""
194
+ unique = []
195
+ for evidence in new_evidence:
196
+ similar = await self.search_similar(evidence.content, n_results=1)
197
+ if not similar or similar[0]["distance"] > (1 - threshold):
198
+ unique.append(evidence)
199
+ await self.add_evidence(
200
+ evidence_id=evidence.citation.url,
201
+ content=evidence.content,
202
+ metadata={"source": evidence.citation.source}
203
+ )
204
+ return unique
205
+ ```
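+ 
+ A note on the `deduplicate` threshold: the collection uses cosine *distance* (1 - similarity), so with `threshold=0.9` any new item within distance 0.1 of something already stored is treated as a duplicate. A minimal usage sketch of the service above (downloads the model on first run):
+ 
+ ```python
+ import asyncio
+ 
+ from src.services.embeddings import EmbeddingService
+ 
+ 
+ async def demo() -> None:
+     service = EmbeddingService()
+     await service.add_evidence(
+         evidence_id="pmid-1",
+         content="Metformin activates AMPK in hepatocytes",
+         metadata={"source": "pubmed"},
+     )
+     hits = await service.search_similar("AMPK activation by metformin", n_results=1)
+     print(hits[0]["distance"])  # small distance => near-duplicate of the stored item
+ 
+ 
+ asyncio.run(demo())
+ ```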
206
+
207
+ ### 4.3 Enhanced SearchAgent (`src/agents/search_agent.py`)
208
+
209
+ Update SearchAgent to use embeddings. **Note**: All embedding calls are `await`ed:
210
+
211
+ ```python
212
+ class SearchAgent(BaseAgent):
213
+ def __init__(
214
+ self,
215
+ search_handler: SearchHandlerProtocol,
216
+ evidence_store: dict,
217
+ embedding_service: EmbeddingService | None = None, # NEW
218
+ ):
219
+ # ... existing init ...
220
+ self._embeddings = embedding_service
221
+
222
+ async def run(self, messages, *, thread=None, **kwargs) -> AgentRunResponse:
223
+ # ... extract query ...
224
+
225
+ # Execute keyword search
226
+ result = await self._handler.execute(query, max_results_per_tool=10)
227
+
228
+ # Semantic deduplication (NEW) - ALL CALLS ARE AWAITED
229
+ if self._embeddings:
230
+ # Deduplicate by semantic similarity (async-safe)
231
+ unique_evidence = await self._embeddings.deduplicate(result.evidence)
232
+
233
+ # Also search for semantically related evidence (async-safe)
234
+ related = await self._embeddings.search_similar(query, n_results=5)
235
+
236
+ # Merge related evidence not already in results
237
+ existing_urls = {e.citation.url for e in unique_evidence}
238
+ for item in related:
239
+ if item["id"] not in existing_urls:
240
+ # Reconstruct Evidence from stored data
241
+ # ... merge logic ...
242
+
243
+ # ... rest of method ...
244
+ ```
245
+
246
+ ### 4.4 Semantic Expansion in Orchestrator
247
+
248
+ The MagenticOrchestrator can use embeddings to expand queries:
249
+
250
+ ```python
251
+ # In task instruction
252
+ task = f"""Research drug repurposing opportunities for: {query}
253
+
254
+ The system has semantic search enabled. When evidence is found:
255
+ 1. Related concepts will be automatically surfaced
256
+ 2. Duplicates are removed by meaning, not just URL
257
+ 3. Use the surfaced related concepts to refine searches
258
+ """
259
+ ```
260
+
261
+ ### 4.5 HuggingFace Spaces Deployment
262
+
263
+ > **⚠️ Important for HF Spaces**
264
+ >
265
+ > `sentence-transformers` downloads models (~500MB) to `~/.cache` on first use.
266
+ > HuggingFace Spaces have **ephemeral storage** - the cache is wiped on restart.
267
+ > This causes slow cold starts and bandwidth usage.
268
+
269
+ **Solution**: Pre-download the model in your Dockerfile:
270
+
271
+ ```dockerfile
272
+ # In Dockerfile
273
+ FROM python:3.11-slim
274
+
275
+ # Set cache directory
276
+ ENV HF_HOME=/app/.cache
277
+ ENV TRANSFORMERS_CACHE=/app/.cache
278
+
279
+ # Pre-download the embedding model during build
280
+ RUN pip install sentence-transformers && \
281
+ python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
282
+
283
+ # ... rest of Dockerfile
284
+ ```
285
+
286
+ **Alternative**: Use environment variable to specify persistent path:
287
+
288
+ ```yaml
289
+ # In HF Spaces settings or app.yaml
290
+ env:
291
+ - name: HF_HOME
292
+ value: /data/.cache # Persistent volume
293
+ ```
294
+
295
+ ---
296
+
297
+ ## 5. Directory Structure After Phase 6
298
+
299
+ ```
300
+ src/
301
+ ├── services/ # NEW
302
+ │ ├── __init__.py
303
+ │ └── embeddings.py # EmbeddingService
304
+ ├── agents/
305
+ │ ├── search_agent.py # Updated with embeddings
306
+ │ └── judge_agent.py
307
+ └── ...
308
+ ```
309
+
310
+ ---
311
+
312
+ ## 6. Tests
313
+
314
+ ### 6.1 Unit Tests (`tests/unit/services/test_embeddings.py`)
315
+
316
+ > **Note**: All tests are async since the EmbeddingService methods are async.
317
+
318
+ ```python
319
+ """Unit tests for EmbeddingService."""
320
+ import pytest
321
+ from src.services.embeddings import EmbeddingService
322
+
323
+
324
+ class TestEmbeddingService:
325
+ @pytest.mark.asyncio
326
+ async def test_embed_returns_vector(self):
327
+ """Embedding should return a float vector."""
328
+ service = EmbeddingService()
329
+ embedding = await service.embed("metformin diabetes")
330
+ assert isinstance(embedding, list)
331
+ assert len(embedding) > 0
332
+ assert all(isinstance(x, float) for x in embedding)
333
+
334
+ @pytest.mark.asyncio
335
+ async def test_similar_texts_have_close_embeddings(self):
336
+ """Semantically similar texts should have similar embeddings."""
337
+ service = EmbeddingService()
338
+ e1 = await service.embed("metformin treats diabetes")
339
+ e2 = await service.embed("metformin is used for diabetes treatment")
340
+ e3 = await service.embed("the weather is sunny today")
341
+
342
+ # Cosine similarity helper
343
+ from numpy import dot
344
+ from numpy.linalg import norm
345
+ cosine = lambda a, b: dot(a, b) / (norm(a) * norm(b))
346
+
347
+ # Similar texts should be closer
348
+ assert cosine(e1, e2) > cosine(e1, e3)
349
+
350
+ @pytest.mark.asyncio
351
+ async def test_batch_embed_efficient(self):
352
+         """Batch embedding should return one vector per input text."""
353
+ service = EmbeddingService()
354
+ texts = ["text one", "text two", "text three"]
355
+
356
+ # Batch embed
357
+ batch_results = await service.embed_batch(texts)
358
+ assert len(batch_results) == 3
359
+ assert all(isinstance(e, list) for e in batch_results)
360
+
361
+ @pytest.mark.asyncio
362
+ async def test_add_and_search(self):
363
+ """Should be able to add evidence and search for similar."""
364
+ service = EmbeddingService()
365
+ await service.add_evidence(
366
+ evidence_id="test1",
367
+ content="Metformin activates AMPK pathway",
368
+ metadata={"source": "pubmed"}
369
+ )
370
+
371
+ results = await service.search_similar("AMPK activation drugs", n_results=1)
372
+ assert len(results) == 1
373
+ assert "AMPK" in results[0]["content"]
374
+
375
+ @pytest.mark.asyncio
376
+ async def test_search_similar_empty_collection(self):
377
+ """Search on empty collection should return empty list, not error."""
378
+ service = EmbeddingService()
379
+ results = await service.search_similar("anything", n_results=5)
380
+ assert results == []
381
+ ```
382
+
383
+ ---
384
+
385
+ ## 7. Definition of Done
386
+
387
+ Phase 6 is **COMPLETE** when:
388
+
389
+ 1. `EmbeddingService` implemented with ChromaDB
390
+ 2. SearchAgent uses embeddings for deduplication
391
+ 3. Semantic search surfaces related evidence
392
+ 4. All unit tests pass
393
+ 5. Integration test shows improved recall (finds related papers)
394
+
395
+ ---
396
+
397
+ ## 8. Value Delivered
398
+
399
+ | Before (Phase 5) | After (Phase 6) |
400
+ |------------------|-----------------|
401
+ | Keyword-only search | Semantic + keyword search |
402
+ | URL-based deduplication | Meaning-based deduplication |
403
+ | Miss related papers | Surface related concepts |
404
+ | Exact match required | Fuzzy semantic matching |
405
+
406
+ **Real example improvement:**
407
+ - Query: "metformin alzheimer"
408
+ - Before: Only papers mentioning both words
409
+ - After: Also finds "AMPK neuroprotection", "biguanide cognitive", etc.
docs/implementation/07_phase_hypothesis.md ADDED
@@ -0,0 +1,630 @@
1
+ # Phase 7 Implementation Spec: Hypothesis Agent
2
+
3
+ **Goal**: Add an agent that generates scientific hypotheses to guide targeted searches.
4
+ **Philosophy**: "Don't just find evidence—understand the mechanisms."
5
+ **Prerequisite**: Phase 6 complete (Embeddings working)
6
+
7
+ ---
8
+
9
+ ## 1. Why Hypothesis Agent?
10
+
11
+ Current limitation: **Search is reactive, not hypothesis-driven.**
12
+
13
+ Current flow:
14
+ 1. User asks about "metformin alzheimer"
15
+ 2. Search finds papers
16
+ 3. Judge says "need more evidence"
17
+ 4. Search again with slightly different keywords
18
+
19
+ With Hypothesis Agent:
20
+ 1. User asks about "metformin alzheimer"
21
+ 2. Search finds initial papers
22
+ 3. **Hypothesis Agent analyzes**: "Evidence suggests metformin → AMPK activation → autophagy → amyloid clearance"
23
+ 4. Search can now target: "metformin AMPK", "autophagy neurodegeneration", "amyloid clearance drugs"
24
+
25
+ **Key insight**: Scientific research is hypothesis-driven. The agent should think like a researcher.
26
+
27
+ ---
28
+
29
+ ## 2. Architecture
30
+
31
+ ### Current (Phase 6)
32
+ ```
33
+ User Query → Magentic Manager
34
+ ├── SearchAgent → Evidence
35
+ └── JudgeAgent → Sufficient? → Synthesize/Continue
36
+ ```
37
+
38
+ ### Phase 7
39
+ ```
40
+ User Query → Magentic Manager
41
+ ├── SearchAgent → Evidence
42
+ ├── HypothesisAgent → Mechanistic Hypotheses ← NEW
43
+ └── JudgeAgent → Sufficient? → Synthesize/Continue
44
+
45
+ Uses hypotheses to guide next search
46
+ ```
47
+
48
+ ### Shared Context Enhancement
49
+ ```python
50
+ evidence_store = {
51
+ "current": [],
52
+ "embeddings": {},
53
+ "vector_index": None,
54
+ "hypotheses": [], # NEW: Generated hypotheses
55
+ "tested_hypotheses": [], # NEW: Hypotheses with supporting/contradicting evidence
56
+ }
57
+ ```
58
+
59
+ ---
60
+
61
+ ## 3. Hypothesis Model
62
+
63
+ ### 3.1 Data Model (`src/utils/models.py`)
64
+
65
+ ```python
66
+ class MechanismHypothesis(BaseModel):
67
+ """A scientific hypothesis about drug mechanism."""
68
+
69
+ drug: str = Field(description="The drug being studied")
70
+ target: str = Field(description="Molecular target (e.g., AMPK, mTOR)")
71
+ pathway: str = Field(description="Biological pathway affected")
72
+ effect: str = Field(description="Downstream effect on disease")
73
+ confidence: float = Field(ge=0, le=1, description="Confidence in hypothesis")
74
+ supporting_evidence: list[str] = Field(
75
+ default_factory=list,
76
+ description="PMIDs or URLs supporting this hypothesis"
77
+ )
78
+ contradicting_evidence: list[str] = Field(
79
+ default_factory=list,
80
+ description="PMIDs or URLs contradicting this hypothesis"
81
+ )
82
+ search_suggestions: list[str] = Field(
83
+ default_factory=list,
84
+ description="Suggested searches to test this hypothesis"
85
+ )
86
+
87
+ def to_search_queries(self) -> list[str]:
88
+ """Generate search queries to test this hypothesis."""
89
+ return [
90
+ f"{self.drug} {self.target}",
91
+ f"{self.target} {self.pathway}",
92
+ f"{self.pathway} {self.effect}",
93
+ *self.search_suggestions
94
+ ]
95
+ ```
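+ 
+ For example, a single hypothesis instance already yields the next round of targeted queries (field values are the metformin example reused in the hypothesis prompt below):
+ 
+ ```python
+ from src.utils.models import MechanismHypothesis
+ 
+ h = MechanismHypothesis(
+     drug="Metformin",
+     target="AMPK",
+     pathway="mTOR inhibition",
+     effect="Enhanced amyloid-beta clearance",
+     confidence=0.7,
+     search_suggestions=["metformin AMPK brain", "autophagy amyloid clearance"],
+ )
+ print(h.to_search_queries())
+ # ['Metformin AMPK', 'AMPK mTOR inhibition', 'mTOR inhibition Enhanced amyloid-beta clearance',
+ #  'metformin AMPK brain', 'autophagy amyloid clearance']
+ ```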
96
+
97
+ ### 3.2 Hypothesis Assessment
98
+
99
+ ```python
100
+ class HypothesisAssessment(BaseModel):
101
+ """Assessment of evidence against hypotheses."""
102
+
103
+ hypotheses: list[MechanismHypothesis]
104
+ primary_hypothesis: MechanismHypothesis | None = Field(
105
+ description="Most promising hypothesis based on current evidence"
106
+ )
107
+ knowledge_gaps: list[str] = Field(
108
+ description="What we don't know yet"
109
+ )
110
+ recommended_searches: list[str] = Field(
111
+ description="Searches to fill knowledge gaps"
112
+ )
113
+ ```
114
+
115
+ ---
116
+
117
+ ## 4. Implementation
118
+
119
+ ### 4.0 Text Utilities (`src/utils/text_utils.py`)
120
+
121
+ > **Why These Utilities?**
122
+ >
123
+ > The original spec used arbitrary truncation (`evidence[:10]` and `content[:300]`).
124
+ > This can drop important information arbitrarily. These utilities provide:
125
+ > 1. **Sentence-aware truncation** - cuts at sentence boundaries, not mid-word
126
+ > 2. **Diverse evidence selection** - uses embeddings to select varied evidence (MMR)
127
+
128
+ ```python
129
+ """Text processing utilities for evidence handling."""
130
+ from typing import TYPE_CHECKING
131
+
132
+ if TYPE_CHECKING:
133
+ from src.services.embeddings import EmbeddingService
134
+ from src.utils.models import Evidence
135
+
136
+
137
+ def truncate_at_sentence(text: str, max_chars: int = 300) -> str:
138
+ """Truncate text at sentence boundary, preserving meaning.
139
+
140
+ Args:
141
+ text: The text to truncate
142
+ max_chars: Maximum characters (default 300)
143
+
144
+ Returns:
145
+ Text truncated at last complete sentence within limit
146
+ """
147
+ if len(text) <= max_chars:
148
+ return text
149
+
150
+ # Find truncation point
151
+ truncated = text[:max_chars]
152
+
153
+ # Look for sentence endings: . ! ? followed by space or end
154
+ for sep in ['. ', '! ', '? ', '.\n', '!\n', '?\n']:
155
+ last_sep = truncated.rfind(sep)
156
+ if last_sep > max_chars // 2: # Don't truncate too aggressively
157
+ return text[:last_sep + 1].strip()
158
+
159
+ # Fallback: find last period
160
+ last_period = truncated.rfind('.')
161
+ if last_period > max_chars // 2:
162
+ return text[:last_period + 1].strip()
163
+
164
+ # Last resort: truncate at word boundary
165
+ last_space = truncated.rfind(' ')
166
+ if last_space > 0:
167
+ return text[:last_space].strip() + "..."
168
+
169
+ return truncated + "..."
170
+
171
+
172
+ async def select_diverse_evidence(
173
+ evidence: list["Evidence"],
174
+ n: int,
175
+ query: str,
176
+ embeddings: "EmbeddingService | None" = None
177
+ ) -> list["Evidence"]:
178
+ """Select n most diverse and relevant evidence items.
179
+
180
+ Uses Maximal Marginal Relevance (MMR) when embeddings available,
181
+ falls back to relevance_score sorting otherwise.
182
+
183
+ Args:
184
+ evidence: All available evidence
185
+ n: Number of items to select
186
+ query: Original query for relevance scoring
187
+ embeddings: Optional EmbeddingService for semantic diversity
188
+
189
+ Returns:
190
+ Selected evidence items, diverse and relevant
191
+ """
192
+ if not evidence:
193
+ return []
194
+
195
+ if n >= len(evidence):
196
+ return evidence
197
+
198
+ # Fallback: sort by relevance score if no embeddings
199
+ if embeddings is None:
200
+ return sorted(
201
+ evidence,
202
+ key=lambda e: e.relevance_score,
203
+ reverse=True
204
+ )[:n]
205
+
206
+ # MMR: Maximal Marginal Relevance for diverse selection
207
+ # Score = λ * relevance - (1-λ) * max_similarity_to_selected
208
+ lambda_param = 0.7 # Balance relevance vs diversity
209
+
210
+ # Get query embedding
211
+ query_emb = await embeddings.embed(query)
212
+
213
+ # Get all evidence embeddings
214
+ evidence_embs = await embeddings.embed_batch([e.content for e in evidence])
215
+
216
+ # Compute relevance scores (cosine similarity to query)
217
+ from numpy import dot
218
+ from numpy.linalg import norm
219
+ cosine = lambda a, b: float(dot(a, b) / (norm(a) * norm(b)))
220
+
221
+ relevance_scores = [cosine(query_emb, emb) for emb in evidence_embs]
222
+
223
+ # Greedy MMR selection
224
+ selected_indices: list[int] = []
225
+ remaining = set(range(len(evidence)))
226
+
227
+ for _ in range(n):
228
+ best_score = float('-inf')
229
+ best_idx = -1
230
+
231
+ for idx in remaining:
232
+ # Relevance component
233
+ relevance = relevance_scores[idx]
234
+
235
+ # Diversity component: max similarity to already selected
236
+ if selected_indices:
237
+ max_sim = max(
238
+ cosine(evidence_embs[idx], evidence_embs[sel])
239
+ for sel in selected_indices
240
+ )
241
+ else:
242
+ max_sim = 0
243
+
244
+ # MMR score
245
+ mmr_score = lambda_param * relevance - (1 - lambda_param) * max_sim
246
+
247
+ if mmr_score > best_score:
248
+ best_score = mmr_score
249
+ best_idx = idx
250
+
251
+ if best_idx >= 0:
252
+ selected_indices.append(best_idx)
253
+ remaining.remove(best_idx)
254
+
255
+ return [evidence[i] for i in selected_indices]
256
+ ```
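+ 
+ A quick check of the sentence-aware truncation (hypothetical input string):
+ 
+ ```python
+ from src.utils.text_utils import truncate_at_sentence
+ 
+ text = "Metformin activates AMPK. This inhibits mTOR signaling. Autophagy increases."
+ print(truncate_at_sentence(text, max_chars=40))
+ # "Metformin activates AMPK." - cut at a sentence boundary instead of mid-word
+ ```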
257
+
258
+ ### 4.1 Hypothesis Prompts (`src/prompts/hypothesis.py`)
259
+
260
+ ```python
261
+ """Prompts for Hypothesis Agent."""
262
+ from src.utils.text_utils import truncate_at_sentence, select_diverse_evidence
263
+
264
+ SYSTEM_PROMPT = """You are a biomedical research scientist specializing in drug repurposing.
265
+
266
+ Your role is to generate mechanistic hypotheses based on evidence.
267
+
268
+ A good hypothesis:
269
+ 1. Proposes a MECHANISM: Drug → Target → Pathway → Effect
270
+ 2. Is TESTABLE: Can be supported or refuted by literature search
271
+ 3. Is SPECIFIC: Names actual molecular targets and pathways
272
+ 4. Generates SEARCH QUERIES: Helps find more evidence
273
+
274
+ Example hypothesis format:
275
+ - Drug: Metformin
276
+ - Target: AMPK (AMP-activated protein kinase)
277
+ - Pathway: mTOR inhibition → autophagy activation
278
+ - Effect: Enhanced clearance of amyloid-beta in Alzheimer's
279
+ - Confidence: 0.7
280
+ - Search suggestions: ["metformin AMPK brain", "autophagy amyloid clearance"]
281
+
282
+ Be specific. Use actual gene/protein names when possible."""
283
+
284
+
285
+ async def format_hypothesis_prompt(
286
+ query: str,
287
+ evidence: list,
288
+ embeddings=None
289
+ ) -> str:
290
+ """Format prompt for hypothesis generation.
291
+
292
+ Uses smart evidence selection instead of arbitrary truncation.
293
+
294
+ Args:
295
+ query: The research query
296
+ evidence: All collected evidence
297
+ embeddings: Optional EmbeddingService for diverse selection
298
+ """
299
+ # Select diverse, relevant evidence (not arbitrary first 10)
300
+ selected = await select_diverse_evidence(
301
+ evidence, n=10, query=query, embeddings=embeddings
302
+ )
303
+
304
+ # Format with sentence-aware truncation
305
+ evidence_text = "\n".join([
306
+ f"- **{e.citation.title}** ({e.citation.source}): {truncate_at_sentence(e.content, 300)}"
307
+ for e in selected
308
+ ])
309
+
310
+ return f"""Based on the following evidence about "{query}", generate mechanistic hypotheses.
311
+
312
+ ## Evidence ({len(selected)} papers selected for diversity)
313
+ {evidence_text}
314
+
315
+ ## Task
316
+ 1. Identify potential drug targets mentioned in the evidence
317
+ 2. Propose mechanism hypotheses (Drug → Target → Pathway → Effect)
318
+ 3. Rate confidence based on evidence strength
319
+ 4. Suggest searches to test each hypothesis
320
+
321
+ Generate 2-4 hypotheses, prioritized by confidence."""
322
+ ```
323
+
324
+ ### 4.2 Hypothesis Agent (`src/agents/hypothesis_agent.py`)
325
+
326
+ ```python
327
+ """Hypothesis agent for mechanistic reasoning."""
328
+ from collections.abc import AsyncIterable
329
+ from typing import TYPE_CHECKING, Any
330
+
331
+ from agent_framework import (
332
+ AgentRunResponse,
333
+ AgentRunResponseUpdate,
334
+ AgentThread,
335
+ BaseAgent,
336
+ ChatMessage,
337
+ Role,
338
+ )
339
+ from pydantic_ai import Agent
340
+
341
+ from src.prompts.hypothesis import SYSTEM_PROMPT, format_hypothesis_prompt
342
+ from src.utils.config import settings
343
+ from src.utils.models import Evidence, HypothesisAssessment
344
+
345
+ if TYPE_CHECKING:
346
+ from src.services.embeddings import EmbeddingService
347
+
348
+
349
+ class HypothesisAgent(BaseAgent):
350
+ """Generates mechanistic hypotheses based on evidence."""
351
+
352
+ def __init__(
353
+ self,
354
+ evidence_store: dict[str, list[Evidence]],
355
+ embedding_service: "EmbeddingService | None" = None, # NEW: for diverse selection
356
+ ) -> None:
357
+ super().__init__(
358
+ name="HypothesisAgent",
359
+ description="Generates scientific hypotheses about drug mechanisms to guide research",
360
+ )
361
+ self._evidence_store = evidence_store
362
+ self._embeddings = embedding_service # Used for MMR evidence selection
363
+ self._agent = Agent(
364
+ model=settings.llm_provider, # Uses configured LLM
365
+ output_type=HypothesisAssessment,
366
+ system_prompt=SYSTEM_PROMPT,
367
+ )
368
+
369
+ async def run(
370
+ self,
371
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
372
+ *,
373
+ thread: AgentThread | None = None,
374
+ **kwargs: Any,
375
+ ) -> AgentRunResponse:
376
+ """Generate hypotheses based on current evidence."""
377
+ # Extract query
378
+ query = self._extract_query(messages)
379
+
380
+ # Get current evidence
381
+ evidence = self._evidence_store.get("current", [])
382
+
383
+ if not evidence:
384
+ return AgentRunResponse(
385
+ messages=[ChatMessage(
386
+ role=Role.ASSISTANT,
387
+ text="No evidence available yet. Search for evidence first."
388
+ )],
389
+ response_id="hypothesis-no-evidence",
390
+ )
391
+
392
+ # Generate hypotheses with diverse evidence selection
393
+ # NOTE: format_hypothesis_prompt is now async
394
+ prompt = await format_hypothesis_prompt(
395
+ query, evidence, embeddings=self._embeddings
396
+ )
397
+ result = await self._agent.run(prompt)
398
+ assessment = result.output
399
+
400
+ # Store hypotheses in shared context
401
+ existing = self._evidence_store.get("hypotheses", [])
402
+ self._evidence_store["hypotheses"] = existing + assessment.hypotheses
403
+
404
+ # Format response
405
+ response_text = self._format_response(assessment)
406
+
407
+ return AgentRunResponse(
408
+ messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
409
+ response_id=f"hypothesis-{len(assessment.hypotheses)}",
410
+ additional_properties={"assessment": assessment.model_dump()},
411
+ )
412
+
413
+ def _format_response(self, assessment: HypothesisAssessment) -> str:
414
+ """Format hypothesis assessment as markdown."""
415
+ lines = ["## Generated Hypotheses\n"]
416
+
417
+ for i, h in enumerate(assessment.hypotheses, 1):
418
+ lines.append(f"### Hypothesis {i} (Confidence: {h.confidence:.0%})")
419
+ lines.append(f"**Mechanism**: {h.drug} → {h.target} → {h.pathway} → {h.effect}")
420
+ lines.append(f"**Suggested searches**: {', '.join(h.search_suggestions)}\n")
421
+
422
+ if assessment.primary_hypothesis:
423
+ lines.append(f"### Primary Hypothesis")
424
+ h = assessment.primary_hypothesis
425
+ lines.append(f"{h.drug} → {h.target} → {h.pathway} → {h.effect}\n")
426
+
427
+ if assessment.knowledge_gaps:
428
+ lines.append("### Knowledge Gaps")
429
+ for gap in assessment.knowledge_gaps:
430
+ lines.append(f"- {gap}")
431
+
432
+ if assessment.recommended_searches:
433
+ lines.append("\n### Recommended Next Searches")
434
+ for search in assessment.recommended_searches:
435
+ lines.append(f"- `{search}`")
436
+
437
+ return "\n".join(lines)
438
+
439
+ def _extract_query(self, messages) -> str:
440
+ """Extract query from messages."""
441
+ if isinstance(messages, str):
442
+ return messages
443
+ elif isinstance(messages, ChatMessage):
444
+ return messages.text or ""
445
+ elif isinstance(messages, list):
446
+ for msg in reversed(messages):
447
+ if isinstance(msg, ChatMessage) and msg.role == Role.USER:
448
+ return msg.text or ""
449
+ elif isinstance(msg, str):
450
+ return msg
451
+ return ""
452
+
453
+ async def run_stream(
454
+ self,
455
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
456
+ *,
457
+ thread: AgentThread | None = None,
458
+ **kwargs: Any,
459
+ ) -> AsyncIterable[AgentRunResponseUpdate]:
460
+ """Streaming wrapper."""
461
+ result = await self.run(messages, thread=thread, **kwargs)
462
+ yield AgentRunResponseUpdate(
463
+ messages=result.messages,
464
+ response_id=result.response_id
465
+ )
466
+ ```
467
+
468
+ ### 4.3 Update MagenticOrchestrator
469
+
470
+ Add HypothesisAgent to the workflow:
471
+
472
+ ```python
473
+ # In MagenticOrchestrator.__init__
474
+ self._hypothesis_agent = HypothesisAgent(self._evidence_store)
475
+
476
+ # In workflow building
477
+ workflow = (
478
+ MagenticBuilder()
479
+ .participants(
480
+ searcher=search_agent,
481
+ hypothesizer=self._hypothesis_agent, # NEW
482
+ judge=judge_agent,
483
+ )
484
+ .with_standard_manager(...)
485
+ .build()
486
+ )
487
+
488
+ # Update task instruction
489
+ task = f"""Research drug repurposing opportunities for: {query}
490
+
491
+ Workflow:
492
+ 1. SearchAgent: Find initial evidence from PubMed and web
493
+ 2. HypothesisAgent: Generate mechanistic hypotheses (Drug → Target → Pathway → Effect)
494
+ 3. SearchAgent: Use hypothesis-suggested queries for targeted search
495
+ 4. JudgeAgent: Evaluate if evidence supports hypotheses
496
+ 5. Repeat until confident or max rounds
497
+
498
+ Focus on:
499
+ - Identifying specific molecular targets
500
+ - Understanding mechanism of action
501
+ - Finding supporting/contradicting evidence for hypotheses
502
+ """
503
+ ```
504
+
505
+ ---
506
+
507
+ ## 5. Directory Structure After Phase 7
508
+
509
+ ```
510
+ src/
511
+ ├── agents/
512
+ │ ├── search_agent.py
513
+ │ ├── judge_agent.py
514
+ │ └── hypothesis_agent.py # NEW
515
+ ├── prompts/
516
+ │ ├── judge.py
517
+ │ └── hypothesis.py # NEW
518
+ ├── services/
519
+ │ └── embeddings.py
520
+ └── utils/
521
+ └── models.py # Updated with hypothesis models
522
+ ```
523
+
524
+ ---
525
+
526
+ ## 6. Tests
527
+
528
+ ### 6.1 Unit Tests (`tests/unit/agents/test_hypothesis_agent.py`)
529
+
530
+ ```python
531
+ """Unit tests for HypothesisAgent."""
532
+ import pytest
533
+ from unittest.mock import AsyncMock, MagicMock, patch
534
+
535
+ from src.agents.hypothesis_agent import HypothesisAgent
536
+ from src.utils.models import Citation, Evidence, HypothesisAssessment, MechanismHypothesis
537
+
538
+
539
+ @pytest.fixture
540
+ def sample_evidence():
541
+ return [
542
+ Evidence(
543
+ content="Metformin activates AMPK, which inhibits mTOR signaling...",
544
+ citation=Citation(
545
+ source="pubmed",
546
+ title="Metformin and AMPK",
547
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
548
+ date="2023"
549
+ )
550
+ )
551
+ ]
552
+
553
+
554
+ @pytest.fixture
555
+ def mock_assessment():
556
+ return HypothesisAssessment(
557
+ hypotheses=[
558
+ MechanismHypothesis(
559
+ drug="Metformin",
560
+ target="AMPK",
561
+ pathway="mTOR inhibition",
562
+ effect="Reduced cancer cell proliferation",
563
+ confidence=0.75,
564
+ search_suggestions=["metformin AMPK cancer", "mTOR cancer therapy"]
565
+ )
566
+ ],
567
+ primary_hypothesis=None,
568
+ knowledge_gaps=["Clinical trial data needed"],
569
+ recommended_searches=["metformin clinical trial cancer"]
570
+ )
571
+
572
+
573
+ @pytest.mark.asyncio
574
+ async def test_hypothesis_agent_generates_hypotheses(sample_evidence, mock_assessment):
575
+ """HypothesisAgent should generate mechanistic hypotheses."""
576
+ store = {"current": sample_evidence, "hypotheses": []}
577
+
578
+ with patch("src.agents.hypothesis_agent.Agent") as MockAgent:
579
+ mock_result = MagicMock()
580
+ mock_result.output = mock_assessment
581
+ MockAgent.return_value.run = AsyncMock(return_value=mock_result)
582
+
583
+ agent = HypothesisAgent(store)
584
+ response = await agent.run("metformin cancer")
585
+
586
+ assert "AMPK" in response.messages[0].text
587
+ assert len(store["hypotheses"]) == 1
588
+
589
+
590
+ @pytest.mark.asyncio
591
+ async def test_hypothesis_agent_no_evidence():
592
+ """HypothesisAgent should handle empty evidence gracefully."""
593
+ store = {"current": [], "hypotheses": []}
594
+ agent = HypothesisAgent(store)
595
+
596
+ response = await agent.run("test query")
597
+
598
+ assert "No evidence" in response.messages[0].text
599
+ ```
600
+
601
+ ---
602
+
603
+ ## 7. Definition of Done
604
+
605
+ Phase 7 is **COMPLETE** when:
606
+
607
+ 1. `MechanismHypothesis` and `HypothesisAssessment` models implemented
608
+ 2. `HypothesisAgent` generates hypotheses from evidence
609
+ 3. Hypotheses stored in shared context
610
+ 4. Search queries generated from hypotheses
611
+ 5. Magentic workflow includes HypothesisAgent
612
+ 6. All unit tests pass
613
+
614
+ ---
615
+
616
+ ## 8. Value Delivered
617
+
618
+ | Before (Phase 6) | After (Phase 7) |
619
+ |------------------|-----------------|
620
+ | Reactive search | Hypothesis-driven search |
621
+ | Generic queries | Mechanism-targeted queries |
622
+ | No scientific reasoning | Drug → Target → Pathway → Effect |
623
+ | Judge says "need more" | Hypothesis says "search for X to test Y" |
624
+
625
+ **Real example improvement:**
626
+ - Query: "metformin alzheimer"
627
+ - Before: "metformin alzheimer mechanism", "metformin brain"
628
+ - After: "metformin AMPK activation", "AMPK autophagy neurodegeneration", "autophagy amyloid clearance"
629
+
630
+ The search becomes **scientifically targeted** rather than keyword variations.
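+
+ As a minimal illustrative sketch (not part of the spec), the "after" queries above can be assembled directly from the `search_suggestions` field of the stored hypotheses; the helper name and the store shape are assumptions based on the HypothesisAgent code above:
+
+ ```python
+ # Illustrative only: derive next-round queries from the hypotheses stored by HypothesisAgent.
+ def next_search_queries(store: dict, limit: int = 5) -> list[str]:
+     """Collect deduplicated search suggestions from the shared hypothesis store."""
+     queries: list[str] = []
+     for hypothesis in store.get("hypotheses", []):  # MechanismHypothesis instances
+         for suggestion in hypothesis.search_suggestions:
+             if suggestion not in queries:
+                 queries.append(suggestion)
+     return queries[:limit]
+ ```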
docs/implementation/08_phase_report.md ADDED
@@ -0,0 +1,854 @@
1
+ # Phase 8 Implementation Spec: Report Agent
2
+
3
+ **Goal**: Generate structured scientific reports with proper citations and methodology.
4
+ **Philosophy**: "Research isn't complete until it's communicated clearly."
5
+ **Prerequisite**: Phase 7 complete (Hypothesis Agent working)
6
+
7
+ ---
8
+
9
+ ## 1. Why Report Agent?
10
+
11
+ Current limitation: **Synthesis is basic markdown, not a scientific report.**
12
+
13
+ Current output:
14
+ ```markdown
15
+ ## Drug Repurposing Analysis
16
+ ### Drug Candidates
17
+ - Metformin
18
+ ### Key Findings
19
+ - Some findings
20
+ ### Citations
21
+ 1. [Paper 1](url)
22
+ ```
23
+
24
+ With Report Agent:
25
+ ```markdown
26
+ ## Executive Summary
27
+ One-paragraph summary for busy readers...
28
+
29
+ ## Research Question
30
+ Clear statement of what was investigated...
31
+
32
+ ## Methodology
33
+ - Sources searched: PubMed, DuckDuckGo
34
+ - Date range: ...
35
+ - Inclusion criteria: ...
36
+
37
+ ## Hypotheses Tested
38
+ 1. Metformin → AMPK → neuroprotection (Supported: 7 papers, Contradicted: 2)
39
+
40
+ ## Findings
41
+ ### Mechanistic Evidence
42
+ ...
43
+ ### Clinical Evidence
44
+ ...
45
+
46
+ ## Limitations
47
+ - Only English language papers
48
+ - Abstract-level analysis only
49
+
50
+ ## Conclusion
51
+ ...
52
+
53
+ ## References
54
+ Properly formatted citations...
55
+ ```
56
+
57
+ ---
58
+
59
+ ## 2. Architecture
60
+
61
+ ### Phase 8 Addition
62
+ ```text
63
+ Evidence + Hypotheses + Assessment
64
+
65
+ Report Agent
66
+
67
+ Structured Scientific Report
68
+ ```
69
+
70
+ ### Report Generation Flow
71
+ ```text
72
+ 1. JudgeAgent says "synthesize"
73
+ 2. Magentic Manager selects ReportAgent
74
+ 3. ReportAgent gathers:
75
+ - All evidence from shared context
76
+ - All hypotheses (supported/contradicted)
77
+ - Assessment scores
78
+ 4. ReportAgent generates structured report
79
+ 5. Final output to user
80
+ ```
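+
+ The "shared context" in step 3 is the same evidence store the other agents write to. A minimal sketch of its shape (key names mirror the ReportAgent code in section 4.2; the values are placeholders, not real data):
+
+ ```python
+ # Illustrative only: the shared evidence store as the ReportAgent reads it.
+ evidence_store = {
+     "current": [],           # list[Evidence] collected by SearchAgent
+     "hypotheses": [],        # list[MechanismHypothesis] from HypothesisAgent
+     "last_assessment": {},   # JudgeAgent scores, e.g. {"mechanism_score": 8, "clinical_score": 6}
+     "iteration_count": 0,    # completed search iterations
+     # "final_report" is written back by ReportAgent after synthesis
+ }
+ ```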
81
+
82
+ ---
83
+
84
+ ## 3. Report Model
85
+
86
+ ### 3.1 Data Model (`src/utils/models.py`)
87
+
88
+ ```python
89
+ class ReportSection(BaseModel):
90
+ """A section of the research report."""
91
+ title: str
92
+ content: str
93
+ citations: list[str] = Field(default_factory=list)
94
+
95
+
96
+ class ResearchReport(BaseModel):
97
+ """Structured scientific report."""
98
+
99
+ title: str = Field(description="Report title")
100
+ executive_summary: str = Field(
101
+ description="One-paragraph summary for quick reading",
102
+ min_length=100,
103
+ max_length=500
104
+ )
105
+ research_question: str = Field(description="Clear statement of what was investigated")
106
+
107
+ methodology: ReportSection = Field(description="How the research was conducted")
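+ # Each hypotheses_tested dict follows the shape consumed by to_markdown() below:
+ # {"mechanism": str, "supported": int, "contradicted": int}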
108
+ hypotheses_tested: list[dict] = Field(
109
+ description="Hypotheses with supporting/contradicting evidence counts"
110
+ )
111
+
112
+ mechanistic_findings: ReportSection = Field(
113
+ description="Findings about drug mechanisms"
114
+ )
115
+ clinical_findings: ReportSection = Field(
116
+ description="Findings from clinical/preclinical studies"
117
+ )
118
+
119
+ drug_candidates: list[str] = Field(description="Identified drug candidates")
120
+ limitations: list[str] = Field(description="Study limitations")
121
+ conclusion: str = Field(description="Overall conclusion")
122
+
123
+ references: list[dict] = Field(
124
+ description="Formatted references with title, authors, source, URL"
125
+ )
126
+
127
+ # Metadata
128
+ sources_searched: list[str] = Field(default_factory=list)
129
+ total_papers_reviewed: int = 0
130
+ search_iterations: int = 0
131
+ confidence_score: float = Field(ge=0, le=1)
132
+
133
+ def to_markdown(self) -> str:
134
+ """Render report as markdown."""
135
+ sections = [
136
+ f"# {self.title}\n",
137
+ f"## Executive Summary\n{self.executive_summary}\n",
138
+ f"## Research Question\n{self.research_question}\n",
139
+ f"## Methodology\n{self.methodology.content}\n",
140
+ ]
141
+
142
+ # Hypotheses
143
+ sections.append("## Hypotheses Tested\n")
144
+ for h in self.hypotheses_tested:
145
+ status = "✅ Supported" if h.get("supported", 0) > h.get("contradicted", 0) else "⚠️ Mixed"
146
+ sections.append(
147
+ f"- **{h['mechanism']}** ({status}): "
148
+ f"{h.get('supported', 0)} supporting, {h.get('contradicted', 0)} contradicting\n"
149
+ )
150
+
151
+ # Findings
152
+ sections.append(f"## Mechanistic Findings\n{self.mechanistic_findings.content}\n")
153
+ sections.append(f"## Clinical Findings\n{self.clinical_findings.content}\n")
154
+
155
+ # Drug candidates
156
+ sections.append("## Drug Candidates\n")
157
+ for drug in self.drug_candidates:
158
+ sections.append(f"- **{drug}**\n")
159
+
160
+ # Limitations
161
+ sections.append("## Limitations\n")
162
+ for lim in self.limitations:
163
+ sections.append(f"- {lim}\n")
164
+
165
+ # Conclusion
166
+ sections.append(f"## Conclusion\n{self.conclusion}\n")
167
+
168
+ # References
169
+ sections.append("## References\n")
170
+ for i, ref in enumerate(self.references, 1):
171
+ sections.append(
172
+ f"{i}. {ref.get('authors', 'Unknown')}. "
173
+ f"*{ref.get('title', 'Untitled')}*. "
174
+ f"{ref.get('source', '')} ({ref.get('date', '')}). "
175
+ f"[Link]({ref.get('url', '#')})\n"
176
+ )
177
+
178
+ # Metadata footer
179
+ sections.append("\n---\n")
180
+ sections.append(
181
+ f"*Report generated from {self.total_papers_reviewed} papers "
182
+ f"across {self.search_iterations} search iterations. "
183
+ f"Confidence: {self.confidence_score:.0%}*"
184
+ )
185
+
186
+ return "\n".join(sections)
187
+ ```
188
+
189
+ ---
190
+
191
+ ## 4. Implementation
192
+
193
+ ### 4.0 Citation Validation (`src/utils/citation_validator.py`)
194
+
195
+ > **🚨 CRITICAL: Why Citation Validation?**
196
+ >
197
+ > LLMs frequently **hallucinate** citations - inventing paper titles, authors, and URLs
198
+ > that don't exist. For a medical research tool, fake citations are **dangerous**.
199
+ >
200
+ > This validation layer ensures every reference in the report actually exists
201
+ > in the collected evidence.
202
+
203
+ ```python
204
+ """Citation validation to prevent LLM hallucination.
205
+
206
+ CRITICAL: Medical research requires accurate citations.
207
+ This module validates that all references exist in collected evidence.
208
+ """
209
+ import logging
210
+ from typing import TYPE_CHECKING
211
+
212
+ if TYPE_CHECKING:
213
+ from src.utils.models import Evidence, ResearchReport
214
+
215
+ logger = logging.getLogger(__name__)
216
+
217
+
218
+ def validate_references(
219
+ report: "ResearchReport",
220
+ evidence: list["Evidence"]
221
+ ) -> "ResearchReport":
222
+ """Ensure all references actually exist in collected evidence.
223
+
224
+ CRITICAL: Prevents LLM hallucination of citations.
225
+
226
+ Args:
227
+ report: The generated research report
228
+ evidence: All evidence collected during research
229
+
230
+ Returns:
231
+ Report with only valid references (hallucinated ones removed)
232
+ """
233
+ # Build set of valid URLs from evidence
234
+ valid_urls = {e.citation.url for e in evidence}
235
+ valid_titles = {e.citation.title.lower() for e in evidence}
236
+
237
+ validated_refs = []
238
+ removed_count = 0
239
+
240
+ for ref in report.references:
241
+ ref_url = ref.get("url", "")
242
+ ref_title = ref.get("title", "").lower()
243
+
244
+ # Check if URL matches collected evidence
245
+ if ref_url in valid_urls:
246
+ validated_refs.append(ref)
247
+ # Fallback: check title match (URLs might differ slightly)
248
+ elif ref_title and any(ref_title in t or t in ref_title for t in valid_titles):
249
+ validated_refs.append(ref)
250
+ else:
251
+ removed_count += 1
252
+ logger.warning(
253
+ f"Removed hallucinated reference: '{ref.get('title', 'Unknown')}' "
254
+ f"(URL: {ref_url[:50]}...)"
255
+ )
256
+
257
+ if removed_count > 0:
258
+ logger.info(
259
+ f"Citation validation removed {removed_count} hallucinated references. "
260
+ f"{len(validated_refs)} valid references remain."
261
+ )
262
+
263
+ # Update report with validated references
264
+ report.references = validated_refs
265
+ return report
266
+
267
+
268
+ def build_reference_from_evidence(evidence: "Evidence") -> dict:
269
+ """Build a properly formatted reference from evidence.
270
+
271
+ Use this to ensure references match the original evidence exactly.
272
+ """
273
+ return {
274
+ "title": evidence.citation.title,
275
+ "authors": evidence.citation.authors or ["Unknown"],
276
+ "source": evidence.citation.source,
277
+ "date": evidence.citation.date or "n.d.",
278
+ "url": evidence.citation.url,
279
+ }
280
+ ```
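+
+ A short usage sketch (illustrative; the authoritative call site is `ReportAgent.run` in section 4.2). The `finalize_references` helper name and the rebuild-from-evidence fallback are assumptions, not part of the spec:
+
+ ```python
+ # Illustrative helper: validate LLM-produced references, and optionally rebuild
+ # the list straight from collected evidence if nothing survives validation.
+ from src.utils.citation_validator import build_reference_from_evidence, validate_references
+ from src.utils.models import Evidence, ResearchReport
+
+
+ def finalize_references(report: ResearchReport, evidence: list[Evidence]) -> ResearchReport:
+     report = validate_references(report, evidence)
+     if not report.references:
+         # Stricter fallback: trust only the collected evidence, not the LLM.
+         report.references = [build_reference_from_evidence(e) for e in evidence]
+     return report
+ ```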
281
+
282
+ ### 4.1 Report Prompts (`src/prompts/report.py`)
283
+
284
+ ```python
285
+ """Prompts for Report Agent."""
286
+ from src.utils.text_utils import truncate_at_sentence, select_diverse_evidence
287
+
288
+ SYSTEM_PROMPT = """You are a scientific writer specializing in drug repurposing research reports.
289
+
290
+ Your role is to synthesize evidence and hypotheses into a clear, structured report.
291
+
292
+ A good report:
293
+ 1. Has a clear EXECUTIVE SUMMARY (one paragraph, key takeaways)
294
+ 2. States the RESEARCH QUESTION clearly
295
+ 3. Describes METHODOLOGY (what was searched, how)
296
+ 4. Evaluates HYPOTHESES with evidence counts
297
+ 5. Separates MECHANISTIC and CLINICAL findings
298
+ 6. Lists specific DRUG CANDIDATES
299
+ 7. Acknowledges LIMITATIONS honestly
300
+ 8. Provides a balanced CONCLUSION
301
+ 9. Includes properly formatted REFERENCES
302
+
303
+ Write in scientific but accessible language. Be specific about evidence strength.
304
+
305
+ ─────────────────────────────────────────────────────────────────────────────
306
+ 🚨 CRITICAL CITATION REQUIREMENTS 🚨
307
+ ─────────────────────────────────────────────────────────────────────────────
308
+
309
+ You MUST follow these rules for the References section:
310
+
311
+ 1. You may ONLY cite papers that appear in the Evidence section above
312
+ 2. Every reference URL must EXACTLY match a provided evidence URL
313
+ 3. Do NOT invent, fabricate, or hallucinate any references
314
+ 4. Do NOT modify paper titles, authors, dates, or URLs
315
+ 5. If unsure about a citation, OMIT it rather than guess
316
+ 6. Copy URLs exactly as provided - do not create similar-looking URLs
317
+
318
+ VIOLATION OF THESE RULES PRODUCES DANGEROUS MISINFORMATION.
319
+ ─────────────────────────────────────────────────────────────────────────────"""
320
+
321
+
322
+ async def format_report_prompt(
323
+ query: str,
324
+ evidence: list,
325
+ hypotheses: list,
326
+ assessment: dict,
327
+ metadata: dict,
328
+ embeddings=None
329
+ ) -> str:
330
+ """Format prompt for report generation.
331
+
332
+ Includes full evidence details for accurate citation.
333
+ """
334
+ # Select diverse evidence (not arbitrary truncation)
335
+ selected = await select_diverse_evidence(
336
+ evidence, n=20, query=query, embeddings=embeddings
337
+ )
338
+
339
+ # Include FULL citation details for each evidence item
340
+ # This helps the LLM create accurate references
341
+ evidence_summary = "\n".join([
342
+ f"- **Title**: {e.citation.title}\n"
343
+ f" **URL**: {e.citation.url}\n"
344
+ f" **Authors**: {', '.join(e.citation.authors or ['Unknown'])}\n"
345
+ f" **Date**: {e.citation.date or 'n.d.'}\n"
346
+ f" **Source**: {e.citation.source}\n"
347
+ f" **Content**: {truncate_at_sentence(e.content, 200)}\n"
348
+ for e in selected
349
+ ])
350
+
351
+ hypotheses_summary = "\n".join([
352
+ f"- {h.drug} → {h.target} → {h.pathway} → {h.effect} (Confidence: {h.confidence:.0%})"
353
+ for h in hypotheses
354
+ ]) if hypotheses else "No hypotheses generated yet."
355
+
356
+ return f"""Generate a structured research report for the following query.
357
+
358
+ ## Original Query
359
+ {query}
360
+
361
+ ## Evidence Collected ({len(selected)} papers, selected for diversity)
362
+
363
+ {evidence_summary}
364
+
365
+ ## Hypotheses Generated
366
+ {hypotheses_summary}
367
+
368
+ ## Assessment Scores
369
+ - Mechanism Score: {assessment.get('mechanism_score', 'N/A')}/10
370
+ - Clinical Evidence Score: {assessment.get('clinical_score', 'N/A')}/10
371
+ - Overall Confidence: {assessment.get('confidence', 0):.0%}
372
+
373
+ ## Metadata
374
+ - Sources Searched: {', '.join(metadata.get('sources', []))}
375
+ - Search Iterations: {metadata.get('iterations', 0)}
376
+
377
+ Generate a complete ResearchReport with all sections filled in.
378
+
379
+ REMINDER: Only cite papers from the Evidence section above. Copy URLs exactly."""
380
+ ```
381
+
382
+ ### 4.2 Report Agent (`src/agents/report_agent.py`)
383
+
384
+ ```python
385
+ """Report agent for generating structured research reports."""
386
+ from collections.abc import AsyncIterable
387
+ from typing import TYPE_CHECKING, Any
388
+
389
+ from agent_framework import (
390
+ AgentRunResponse,
391
+ AgentRunResponseUpdate,
392
+ AgentThread,
393
+ BaseAgent,
394
+ ChatMessage,
395
+ Role,
396
+ )
397
+ from pydantic_ai import Agent
398
+
399
+ from src.prompts.report import SYSTEM_PROMPT, format_report_prompt
400
+ from src.utils.citation_validator import validate_references # CRITICAL
401
+ from src.utils.config import settings
402
+ from src.utils.models import Evidence, MechanismHypothesis, ResearchReport
403
+
404
+ if TYPE_CHECKING:
405
+ from src.services.embeddings import EmbeddingService
406
+
407
+
408
+ class ReportAgent(BaseAgent):
409
+ """Generates structured scientific reports from evidence and hypotheses."""
410
+
411
+ def __init__(
412
+ self,
413
+ evidence_store: dict[str, list[Evidence]],
414
+ embedding_service: "EmbeddingService | None" = None, # For diverse selection
415
+ ) -> None:
416
+ super().__init__(
417
+ name="ReportAgent",
418
+ description="Generates structured scientific research reports with citations",
419
+ )
420
+ self._evidence_store = evidence_store
421
+ self._embeddings = embedding_service
422
+ self._agent = Agent(
423
+ model=settings.llm_provider,
424
+ output_type=ResearchReport,
425
+ system_prompt=SYSTEM_PROMPT,
426
+ )
427
+
428
+ async def run(
429
+ self,
430
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
431
+ *,
432
+ thread: AgentThread | None = None,
433
+ **kwargs: Any,
434
+ ) -> AgentRunResponse:
435
+ """Generate research report."""
436
+ query = self._extract_query(messages)
437
+
438
+ # Gather all context
439
+ evidence = self._evidence_store.get("current", [])
440
+ hypotheses = self._evidence_store.get("hypotheses", [])
441
+ assessment = self._evidence_store.get("last_assessment", {})
442
+
443
+ if not evidence:
444
+ return AgentRunResponse(
445
+ messages=[ChatMessage(
446
+ role=Role.ASSISTANT,
447
+ text="Cannot generate report: No evidence collected."
448
+ )],
449
+ response_id="report-no-evidence",
450
+ )
451
+
452
+ # Build metadata
453
+ metadata = {
454
+ "sources": list(set(e.citation.source for e in evidence)),
455
+ "iterations": self._evidence_store.get("iteration_count", 0),
456
+ }
457
+
458
+ # Generate report (format_report_prompt is now async)
459
+ prompt = await format_report_prompt(
460
+ query=query,
461
+ evidence=evidence,
462
+ hypotheses=hypotheses,
463
+ assessment=assessment,
464
+ metadata=metadata,
465
+ embeddings=self._embeddings,
466
+ )
467
+
468
+ result = await self._agent.run(prompt)
469
+ report = result.output
470
+
471
+ # ═══════════════════════════════════════════════════════════════════
472
+ # 🚨 CRITICAL: Validate citations to prevent hallucination
473
+ # ═══════════════════════════════════════════════════════════════════
474
+ report = validate_references(report, evidence)
475
+
476
+ # Store validated report
477
+ self._evidence_store["final_report"] = report
478
+
479
+ # Return markdown version
480
+ return AgentRunResponse(
481
+ messages=[ChatMessage(role=Role.ASSISTANT, text=report.to_markdown())],
482
+ response_id="report-complete",
483
+ additional_properties={"report": report.model_dump()},
484
+ )
485
+
486
+ def _extract_query(self, messages) -> str:
487
+ """Extract query from messages."""
488
+ if isinstance(messages, str):
489
+ return messages
490
+ elif isinstance(messages, ChatMessage):
491
+ return messages.text or ""
492
+ elif isinstance(messages, list):
493
+ for msg in reversed(messages):
494
+ if isinstance(msg, ChatMessage) and msg.role == Role.USER:
495
+ return msg.text or ""
496
+ elif isinstance(msg, str):
497
+ return msg
498
+ return ""
499
+
500
+ async def run_stream(
501
+ self,
502
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
503
+ *,
504
+ thread: AgentThread | None = None,
505
+ **kwargs: Any,
506
+ ) -> AsyncIterable[AgentRunResponseUpdate]:
507
+ """Streaming wrapper."""
508
+ result = await self.run(messages, thread=thread, **kwargs)
509
+ yield AgentRunResponseUpdate(
510
+ messages=result.messages,
511
+ response_id=result.response_id
512
+ )
513
+ ```
514
+
515
+ ### 4.3 Update MagenticOrchestrator
516
+
517
+ Add ReportAgent as the final synthesis step:
518
+
519
+ ```python
520
+ # In MagenticOrchestrator.__init__
521
+ self._report_agent = ReportAgent(self._evidence_store)
522
+
523
+ # In workflow building
524
+ workflow = (
525
+ MagenticBuilder()
526
+ .participants(
527
+ searcher=search_agent,
528
+ hypothesizer=hypothesis_agent,
529
+ judge=judge_agent,
530
+ reporter=self._report_agent, # NEW
531
+ )
532
+ .with_standard_manager(...)
533
+ .build()
534
+ )
535
+
536
+ # Update task instruction
537
+ task = f"""Research drug repurposing opportunities for: {query}
538
+
539
+ Workflow:
540
+ 1. SearchAgent: Find evidence from PubMed and web
541
+ 2. HypothesisAgent: Generate mechanistic hypotheses
542
+ 3. SearchAgent: Targeted search based on hypotheses
543
+ 4. JudgeAgent: Evaluate evidence sufficiency
544
+ 5. If sufficient → ReportAgent: Generate structured research report
545
+ 6. If not sufficient → Repeat from step 1 with refined queries
546
+
547
+ The final output should be a complete research report with:
548
+ - Executive summary
549
+ - Methodology
550
+ - Hypotheses tested
551
+ - Mechanistic and clinical findings
552
+ - Drug candidates
553
+ - Limitations
554
+ - Conclusion with references
555
+ """
556
+ ```
557
+
558
+ ---
559
+
560
+ ## 5. Directory Structure After Phase 8
561
+
562
+ ```
563
+ src/
564
+ ├── agents/
565
+ │ ├── search_agent.py
566
+ │ ├── judge_agent.py
567
+ │ ├── hypothesis_agent.py
568
+ │ └── report_agent.py # NEW
569
+ ├── prompts/
570
+ │ ├── judge.py
571
+ │ ├── hypothesis.py
572
+ │ └── report.py # NEW
573
+ ├── services/
574
+ │ └── embeddings.py
575
+ └── utils/
576
+ └── models.py # Updated with report models
577
+ ```
578
+
579
+ ---
580
+
581
+ ## 6. Tests
582
+
583
+ ### 6.1 Unit Tests (`tests/unit/agents/test_report_agent.py`)
584
+
585
+ ```python
586
+ """Unit tests for ReportAgent."""
587
+ import pytest
588
+ from unittest.mock import AsyncMock, MagicMock, patch
589
+
590
+ from src.agents.report_agent import ReportAgent
591
+ from src.utils.models import (
592
+ Citation, Evidence, MechanismHypothesis,
593
+ ResearchReport, ReportSection
594
+ )
595
+
596
+
597
+ @pytest.fixture
598
+ def sample_evidence():
599
+ return [
600
+ Evidence(
601
+ content="Metformin activates AMPK...",
602
+ citation=Citation(
603
+ source="pubmed",
604
+ title="Metformin mechanisms",
605
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
606
+ date="2023",
607
+ authors=["Smith J", "Jones A"]
608
+ )
609
+ )
610
+ ]
611
+
612
+
613
+ @pytest.fixture
614
+ def sample_hypotheses():
615
+ return [
616
+ MechanismHypothesis(
617
+ drug="Metformin",
618
+ target="AMPK",
619
+ pathway="mTOR inhibition",
620
+ effect="Neuroprotection",
621
+ confidence=0.8,
622
+ search_suggestions=[]
623
+ )
624
+ ]
625
+
626
+
627
+ @pytest.fixture
628
+ def mock_report():
629
+ return ResearchReport(
630
+ title="Drug Repurposing Analysis: Metformin for Alzheimer's",
631
+ executive_summary="This report analyzes metformin as a potential...",
632
+ research_question="Can metformin be repurposed for Alzheimer's disease?",
633
+ methodology=ReportSection(
634
+ title="Methodology",
635
+ content="Searched PubMed and web sources..."
636
+ ),
637
+ hypotheses_tested=[
638
+ {"mechanism": "Metformin → AMPK → neuroprotection", "supported": 5, "contradicted": 1}
639
+ ],
640
+ mechanistic_findings=ReportSection(
641
+ title="Mechanistic Findings",
642
+ content="Evidence suggests AMPK activation..."
643
+ ),
644
+ clinical_findings=ReportSection(
645
+ title="Clinical Findings",
646
+ content="Limited clinical data available..."
647
+ ),
648
+ drug_candidates=["Metformin"],
649
+ limitations=["Abstract-level analysis only"],
650
+ conclusion="Metformin shows promise...",
651
+ references=[],
652
+ sources_searched=["pubmed", "web"],
653
+ total_papers_reviewed=10,
654
+ search_iterations=3,
655
+ confidence_score=0.75
656
+ )
657
+
658
+
659
+ @pytest.mark.asyncio
660
+ async def test_report_agent_generates_report(
661
+ sample_evidence, sample_hypotheses, mock_report
662
+ ):
663
+ """ReportAgent should generate structured report."""
664
+ store = {
665
+ "current": sample_evidence,
666
+ "hypotheses": sample_hypotheses,
667
+ "last_assessment": {"mechanism_score": 8, "clinical_score": 6}
668
+ }
669
+
670
+ with patch("src.agents.report_agent.Agent") as MockAgent:
671
+ mock_result = MagicMock()
672
+ mock_result.output = mock_report
673
+ MockAgent.return_value.run = AsyncMock(return_value=mock_result)
674
+
675
+ agent = ReportAgent(store)
676
+ response = await agent.run("metformin alzheimer")
677
+
678
+ assert "Executive Summary" in response.messages[0].text
679
+ assert "Methodology" in response.messages[0].text
680
+ assert "References" in response.messages[0].text
681
+
682
+
683
+ @pytest.mark.asyncio
684
+ async def test_report_agent_no_evidence():
685
+ """ReportAgent should handle empty evidence gracefully."""
686
+ store = {"current": [], "hypotheses": []}
687
+ agent = ReportAgent(store)
688
+
689
+ response = await agent.run("test query")
690
+
691
+ assert "Cannot generate report" in response.messages[0].text
692
+
693
+
694
+ # ═══════════════════════════════════════════════════════════════════════════
695
+ # 🚨 CRITICAL: Citation Validation Tests
696
+ # ═══════════════════════════════════════════════════════════════════════════
697
+
698
+ @pytest.mark.asyncio
699
+ async def test_report_agent_removes_hallucinated_citations(sample_evidence):
700
+ """ReportAgent should remove citations not in evidence."""
701
+ from src.utils.citation_validator import validate_references
702
+
703
+ # Create report with mix of valid and hallucinated references
704
+ report_with_hallucinations = ResearchReport(
705
+ title="Test Report",
706
+ executive_summary="This is a test report for citation validation...",
707
+ research_question="Testing citation validation",
708
+ methodology=ReportSection(title="Methodology", content="Test"),
709
+ hypotheses_tested=[],
710
+ mechanistic_findings=ReportSection(title="Mechanistic", content="Test"),
711
+ clinical_findings=ReportSection(title="Clinical", content="Test"),
712
+ drug_candidates=["TestDrug"],
713
+ limitations=["Test limitation"],
714
+ conclusion="Test conclusion",
715
+ references=[
716
+ # Valid reference (matches sample_evidence)
717
+ {
718
+ "title": "Metformin mechanisms",
719
+ "url": "https://pubmed.ncbi.nlm.nih.gov/12345/",
720
+ "authors": ["Smith J", "Jones A"],
721
+ "date": "2023",
722
+ "source": "pubmed"
723
+ },
724
+ # HALLUCINATED reference (URL doesn't exist in evidence)
725
+ {
726
+ "title": "Fake Paper That Doesn't Exist",
727
+ "url": "https://fake-journal.com/made-up-paper",
728
+ "authors": ["Hallucinated A"],
729
+ "date": "2024",
730
+ "source": "fake"
731
+ },
732
+ # Another HALLUCINATED reference
733
+ {
734
+ "title": "Invented Research",
735
+ "url": "https://pubmed.ncbi.nlm.nih.gov/99999999/",
736
+ "authors": ["NotReal B"],
737
+ "date": "2025",
738
+ "source": "pubmed"
739
+ }
740
+ ],
741
+ sources_searched=["pubmed"],
742
+ total_papers_reviewed=1,
743
+ search_iterations=1,
744
+ confidence_score=0.5
745
+ )
746
+
747
+ # Validate - should remove hallucinated references
748
+ validated_report = validate_references(report_with_hallucinations, sample_evidence)
749
+
750
+ # Only the valid reference should remain
751
+ assert len(validated_report.references) == 1
752
+ assert validated_report.references[0]["title"] == "Metformin mechanisms"
753
+ assert "Fake Paper" not in str(validated_report.references)
754
+
755
+
756
+ def test_citation_validator_handles_empty_references():
757
+ """Citation validator should handle reports with no references."""
758
+ from src.utils.citation_validator import validate_references
759
+
760
+ report = ResearchReport(
761
+ title="Empty Refs Report",
762
+ executive_summary="This report has no references...",
763
+ research_question="Testing empty refs",
764
+ methodology=ReportSection(title="Methodology", content="Test"),
765
+ hypotheses_tested=[],
766
+ mechanistic_findings=ReportSection(title="Mechanistic", content="Test"),
767
+ clinical_findings=ReportSection(title="Clinical", content="Test"),
768
+ drug_candidates=[],
769
+ limitations=[],
770
+ conclusion="Test",
771
+ references=[], # Empty!
772
+ sources_searched=[],
773
+ total_papers_reviewed=0,
774
+ search_iterations=0,
775
+ confidence_score=0.0
776
+ )
777
+
778
+ validated = validate_references(report, [])
779
+ assert validated.references == []
780
+ ```
781
+
782
+ ---
783
+
784
+ ## 7. Definition of Done
785
+
786
+ Phase 8 is **COMPLETE** when:
787
+
788
+ 1. `ResearchReport` model implemented with all sections
789
+ 2. `ReportAgent` generates structured reports
790
+ 3. Reports include proper citations and methodology
791
+ 4. Magentic workflow uses ReportAgent for final synthesis
792
+ 5. Report renders as clean markdown
793
+ 6. All unit tests pass
794
+
795
+ ---
796
+
797
+ ## 8. Value Delivered
798
+
799
+ | Before (Phase 7) | After (Phase 8) |
800
+ |------------------|-----------------|
801
+ | Basic synthesis | Structured scientific report |
802
+ | Simple bullet points | Executive summary + methodology |
803
+ | List of citations | Formatted references |
804
+ | No methodology | Clear research process |
805
+ | No limitations | Honest limitations section |
806
+
807
+ **Sample output comparison:**
808
+
809
+ Before:
810
+ ```
811
+ ## Analysis
812
+ - Metformin might help
813
+ - Found 5 papers
814
+ [Link 1] [Link 2]
815
+ ```
816
+
817
+ After:
818
+ ```
819
+ # Drug Repurposing Analysis: Metformin for Alzheimer's Disease
820
+
821
+ ## Executive Summary
822
+ Analysis of 15 papers suggests metformin may provide neuroprotection
823
+ through AMPK activation. Mechanistic evidence is strong (8/10),
824
+ while clinical evidence is moderate (6/10)...
825
+
826
+ ## Methodology
827
+ Systematic search of PubMed and web sources using queries...
828
+
829
+ ## Hypotheses Tested
830
+ - ✅ Metformin → AMPK → neuroprotection (7 supporting, 2 contradicting)
831
+
832
+ ## References
833
+ 1. Smith J, Jones A. *Metformin mechanisms*. Nature (2023). [Link](...)
834
+ ```
835
+
836
+ ---
837
+
838
+ ## 9. Complete Magentic Architecture (Phases 5-8)
839
+
840
+ ```
841
+ User Query
842
+
843
+ Gradio UI
844
+
845
+ Magentic Manager (LLM Coordinator)
846
+ ├── SearchAgent ←→ PubMed + Web + VectorDB
847
+ ├── HypothesisAgent ←→ Mechanistic Reasoning
848
+ ├── JudgeAgent ←→ Evidence Assessment
849
+ └── ReportAgent ←→ Final Synthesis
850
+
851
+ Structured Research Report
852
+ ```
853
+
854
+ **This matches Mario's diagram** with the practical agents that add real value for drug repurposing research.
docs/implementation/09_phase_source_cleanup.md ADDED
@@ -0,0 +1,257 @@
1
+ # Phase 9 Implementation Spec: Remove DuckDuckGo
2
+
3
+ **Goal**: Remove unreliable web search, focus on credible scientific sources.
4
+ **Philosophy**: "Scientific credibility over source quantity."
5
+ **Prerequisite**: Phase 8 complete (all agents working)
6
+ **Estimated Time**: 30-45 minutes
7
+
8
+ ---
9
+
10
+ ## 1. Why Remove DuckDuckGo?
11
+
12
+ ### Current Problems
13
+
14
+ | Issue | Impact |
15
+ |-------|--------|
16
+ | Rate-limited aggressively | Returns 0 results frequently |
17
+ | Not peer-reviewed | Random blogs, news, misinformation |
18
+ | Not citable | Cannot use in scientific reports |
19
+ | Adds noise | Dilutes quality evidence |
20
+
21
+ ### After Removal
22
+
23
+ | Benefit | Impact |
24
+ |---------|--------|
25
+ | Cleaner codebase | -150 lines of dead code |
26
+ | No rate limit failures | 100% source reliability |
27
+ | Scientific credibility | All sources peer-reviewed/preprint |
28
+ | Simpler debugging | Fewer failure modes |
29
+
30
+ ---
31
+
32
+ ## 2. Files to Modify/Delete
33
+
34
+ ### 2.1 DELETE: `src/tools/websearch.py`
35
+
36
+ ```bash
37
+ # File to delete entirely
38
+ src/tools/websearch.py # ~80 lines
39
+ ```
40
+
41
+ ### 2.2 MODIFY: SearchHandler Usage
42
+
43
+ Update all files that instantiate `SearchHandler` with `WebTool()`:
44
+
45
+ | File | Change |
46
+ |------|--------|
47
+ | `examples/search_demo/run_search.py` | Remove `WebTool()` from tools list |
48
+ | `examples/hypothesis_demo/run_hypothesis.py` | Remove `WebTool()` from tools list |
49
+ | `examples/full_stack_demo/run_full.py` | Remove `WebTool()` from tools list |
50
+ | `examples/orchestrator_demo/run_agent.py` | Remove `WebTool()` from tools list |
51
+ | `examples/orchestrator_demo/run_magentic.py` | Remove `WebTool()` from tools list |
52
+
53
+ ### 2.3 MODIFY: Type Definitions
54
+
55
+ Update `src/utils/models.py`:
56
+
57
+ ```python
58
+ # BEFORE
59
+ sources_searched: list[Literal["pubmed", "web"]]
60
+
61
+ # AFTER (Phase 9)
62
+ sources_searched: list[Literal["pubmed"]]
63
+
64
+ # AFTER (Phase 10-11)
65
+ sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
66
+ ```
67
+
68
+ ### 2.4 DELETE: Tests for WebTool
69
+
70
+ ```bash
71
+ # File to delete
72
+ tests/unit/tools/test_websearch.py
73
+ ```
74
+
75
+ ---
76
+
77
+ ## 3. TDD Implementation
78
+
79
+ ### 3.1 Test: SearchHandler Works Without WebTool
80
+
81
+ ```python
82
+ # tests/unit/tools/test_search_handler.py
83
+
84
+ @pytest.mark.asyncio
85
+ async def test_search_handler_pubmed_only():
86
+ """SearchHandler should work with only PubMed tool."""
87
+ from src.tools.pubmed import PubMedTool
88
+ from src.tools.search_handler import SearchHandler
89
+
90
+ handler = SearchHandler(tools=[PubMedTool()], timeout=30.0)
91
+
92
+ # Should not raise
93
+ result = await handler.execute("metformin diabetes", max_results_per_tool=3)
94
+
95
+ assert result.sources_searched == ["pubmed"]
96
+ assert "web" not in result.sources_searched
97
+ assert len(result.errors) == 0 # No failures
98
+ ```
99
+
100
+ ### 3.2 Test: WebTool Import Fails (Deleted)
101
+
102
+ ```python
103
+ # tests/unit/tools/test_websearch_removed.py
104
+ import pytest
+
105
+ def test_websearch_module_deleted():
106
+ """WebTool should no longer exist."""
107
+ with pytest.raises(ImportError):
108
+ from src.tools.websearch import WebTool
109
+ ```
110
+
111
+ ### 3.3 Test: Examples Don't Reference WebTool
112
+
113
+ ```python
114
+ # tests/unit/test_no_webtool_references.py
115
+
116
+ import ast
117
+ import pathlib
+
+ import pytest
118
+
119
+ def test_examples_no_webtool_imports():
120
+ """No example files should import WebTool."""
121
+ examples_dir = pathlib.Path("examples")
122
+
123
+ for py_file in examples_dir.rglob("*.py"):
124
+ content = py_file.read_text()
125
+ tree = ast.parse(content)
126
+
127
+ for node in ast.walk(tree):
128
+ if isinstance(node, ast.ImportFrom):
129
+ if node.module and "websearch" in node.module:
130
+ pytest.fail(f"{py_file} imports websearch (should be removed)")
131
+ if isinstance(node, ast.Import):
132
+ for alias in node.names:
133
+ if "websearch" in alias.name:
134
+ pytest.fail(f"{py_file} imports websearch (should be removed)")
135
+ ```
136
+
137
+ ---
138
+
139
+ ## 4. Step-by-Step Implementation
140
+
141
+ ### Step 1: Write Tests First (TDD)
142
+
143
+ ```bash
144
+ # Create the test file
145
+ touch tests/unit/tools/test_websearch_removed.py
146
+ # Write the tests from section 3
147
+ ```
148
+
149
+ ### Step 2: Run Tests (Should Fail)
150
+
151
+ ```bash
152
+ uv run pytest tests/unit/tools/test_websearch_removed.py -v
153
+ # Expected: FAIL (websearch still exists)
154
+ ```
155
+
156
+ ### Step 3: Delete WebTool
157
+
158
+ ```bash
159
+ rm src/tools/websearch.py
160
+ rm tests/unit/tools/test_websearch.py
161
+ ```
162
+
163
+ ### Step 4: Update SearchHandler Usages
164
+
165
+ ```python
166
+ # BEFORE (in each example file)
167
+ from src.tools.websearch import WebTool
168
+ search_handler = SearchHandler(tools=[PubMedTool(), WebTool()], timeout=30.0)
169
+
170
+ # AFTER
171
+ from src.tools.pubmed import PubMedTool
172
+ search_handler = SearchHandler(tools=[PubMedTool()], timeout=30.0)
173
+ ```
174
+
175
+ ### Step 5: Update Type Definitions
176
+
177
+ ```python
178
+ # src/utils/models.py
179
+ # BEFORE
180
+ sources_searched: list[Literal["pubmed", "web"]]
181
+
182
+ # AFTER
183
+ sources_searched: list[Literal["pubmed"]]
184
+ ```
185
+
186
+ ### Step 6: Run All Tests
187
+
188
+ ```bash
189
+ uv run pytest tests/unit/ -v
190
+ # Expected: ALL PASS
191
+ ```
192
+
193
+ ### Step 7: Run Lints
194
+
195
+ ```bash
196
+ uv run ruff check src tests examples
197
+ uv run mypy src
198
+ # Expected: No errors
199
+ ```
200
+
201
+ ---
202
+
203
+ ## 5. Definition of Done
204
+
205
+ Phase 9 is **COMPLETE** when:
206
+
207
+ - [ ] `src/tools/websearch.py` deleted
208
+ - [ ] `tests/unit/tools/test_websearch.py` deleted
209
+ - [ ] All example files updated (no WebTool imports)
210
+ - [ ] Type definitions updated in models.py
211
+ - [ ] New tests verify WebTool is removed
212
+ - [ ] All existing tests pass
213
+ - [ ] Lints pass
214
+ - [ ] Examples run successfully with PubMed only
215
+
216
+ ---
217
+
218
+ ## 6. Verification Commands
219
+
220
+ ```bash
221
+ # 1. Verify websearch.py is gone
222
+ ls src/tools/websearch.py 2>&1 | grep "No such file"
223
+
224
+ # 2. Verify no WebTool imports remain
225
+ grep -r "WebTool" src/ examples/ && echo "FAIL: WebTool references found" || echo "PASS"
226
+ grep -r "websearch" src/ examples/ && echo "FAIL: websearch references found" || echo "PASS"
227
+
228
+ # 3. Run tests
229
+ uv run pytest tests/unit/ -v
230
+
231
+ # 4. Run example (should work)
232
+ source .env && uv run python examples/search_demo/run_search.py "metformin cancer"
233
+ ```
234
+
235
+ ---
236
+
237
+ ## 7. Rollback Plan
238
+
239
+ If something breaks:
240
+
241
+ ```bash
242
+ git checkout HEAD -- src/tools/websearch.py
243
+ git checkout HEAD -- tests/unit/tools/test_websearch.py
244
+ ```
245
+
246
+ ---
247
+
248
+ ## 8. Value Delivered
249
+
250
+ | Before | After |
251
+ |--------|-------|
252
+ | 2 search sources (1 broken) | 1 reliable source |
253
+ | Rate limit failures | No failures |
254
+ | Web noise in results | Pure scientific sources |
255
+ | ~230 lines for websearch | 0 lines |
256
+
257
+ **Net effect**: Simpler, more reliable, more credible.
docs/implementation/10_phase_clinicaltrials.md ADDED
@@ -0,0 +1,437 @@
1
+ # Phase 10 Implementation Spec: ClinicalTrials.gov Integration
2
+
3
+ **Goal**: Add clinical trial search for drug repurposing evidence.
4
+ **Philosophy**: "Clinical trials are the bridge from hypothesis to therapy."
5
+ **Prerequisite**: Phase 9 complete (DuckDuckGo removed)
6
+ **Estimated Time**: 2-3 hours
7
+
8
+ ---
9
+
10
+ ## 1. Why ClinicalTrials.gov?
11
+
12
+ ### Scientific Value
13
+
14
+ | Feature | Value for Drug Repurposing |
15
+ |---------|---------------------------|
16
+ | **400,000+ studies** | Massive evidence base |
17
+ | **Trial phase data** | Phase I/II/III = evidence strength |
18
+ | **Intervention details** | Exact drug + dosing |
19
+ | **Outcome measures** | What was measured |
20
+ | **Status tracking** | Completed vs recruiting |
21
+ | **Free API** | No cost, no key required |
22
+
23
+ ### Example Query Response
24
+
25
+ Query: "metformin Alzheimer's"
26
+
27
+ ```json
28
+ {
29
+ "studies": [
30
+ {
31
+ "nctId": "NCT04098666",
32
+ "briefTitle": "Metformin in Alzheimer's Dementia Prevention",
33
+ "phase": "Phase 2",
34
+ "status": "Recruiting",
35
+ "conditions": ["Alzheimer Disease"],
36
+ "interventions": ["Drug: Metformin"]
37
+ }
38
+ ]
39
+ }
40
+ ```
41
+
42
+ **This is GOLD for drug repurposing** - actual trials testing the hypothesis!
43
+
44
+ ---
45
+
46
+ ## 2. API Specification
47
+
48
+ ### Endpoint
49
+
50
+ ```
51
+ Base URL: https://clinicaltrials.gov/api/v2/studies
52
+ ```
53
+
54
+ ### Key Parameters
55
+
56
+ | Parameter | Description | Example |
57
+ |-----------|-------------|---------|
58
+ | `query.cond` | Condition/disease | `Alzheimer` |
59
+ | `query.intr` | Intervention/drug | `Metformin` |
60
+ | `query.term` | General search | `metformin alzheimer` |
61
+ | `pageSize` | Results per page | `20` |
62
+ | `fields` | Fields to return | See below |
63
+
64
+ ### Fields We Need
65
+
66
+ ```
67
+ NCTId, BriefTitle, Phase, OverallStatus, Condition,
68
+ InterventionName, StartDate, CompletionDate, BriefSummary
69
+ ```
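+
+ Putting the endpoint, parameters, and fields together, a minimal request sketch (using `requests`, which section 4 adopts for WAF reasons; the query string and `pageSize` are example values only):
+
+ ```python
+ # Illustrative raw call; the real tool wraps this in asyncio.to_thread (see section 4.1).
+ import requests
+
+ fields = [
+     "NCTId", "BriefTitle", "Phase", "OverallStatus", "Condition",
+     "InterventionName", "StartDate", "CompletionDate", "BriefSummary",
+ ]
+ params = {
+     "query.term": "metformin alzheimer",  # general search
+     "pageSize": 20,                       # max 100 per call
+     "fields": "|".join(fields),           # same separator the tool in 4.1 uses
+ }
+ response = requests.get("https://clinicaltrials.gov/api/v2/studies", params=params, timeout=30)
+ studies = response.json().get("studies", [])
+ print(f"{len(studies)} studies returned")
+ ```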
70
+
71
+ ### Rate Limits
72
+
73
+ - ~50 requests/minute per IP
74
+ - No authentication required
75
+ - Paginated (100 results max per call)
76
+
77
+ ### Documentation
78
+
79
+ - [API v2 Docs](https://clinicaltrials.gov/data-api/api)
80
+ - [Migration Guide](https://www.nlm.nih.gov/pubs/techbull/ma24/ma24_clinicaltrials_api.html)
81
+
82
+ ---
83
+
84
+ ## 3. Data Model
85
+
86
+ ### 3.1 Update Citation Source Type (`src/utils/models.py`)
87
+
88
+ ```python
89
+ # BEFORE
90
+ source: Literal["pubmed", "web"]
91
+
92
+ # AFTER
93
+ source: Literal["pubmed", "clinicaltrials", "biorxiv"]
94
+ ```
95
+
96
+ ### 3.2 Evidence from Clinical Trials
97
+
98
+ Clinical trial data maps to our existing `Evidence` model:
99
+
100
+ ```python
101
+ Evidence(
102
+ content=f"{brief_summary}. Phase: {phase}. Status: {status}.",
103
+ citation=Citation(
104
+ source="clinicaltrials",
105
+ title=brief_title,
106
+ url=f"https://clinicaltrials.gov/study/{nct_id}",
107
+ date=start_date or "Unknown",
108
+ authors=[] # Trials don't have authors in the same way
109
+ ),
110
+ relevance=0.8 # Trials are highly relevant for repurposing
111
+ )
112
+ ```
113
+
114
+ ---
115
+
116
+ ## 4. Implementation
117
+
118
+ ### 4.0 Important: HTTP Client Selection
119
+
120
+ **ClinicalTrials.gov's WAF blocks `httpx`'s TLS fingerprint.** Use `requests` instead.
121
+
122
+ | Library | Status | Notes |
123
+ |---------|--------|-------|
124
+ | `httpx` | ❌ 403 Blocked | TLS/JA3 fingerprint flagged |
125
+ | `httpx[http2]` | ❌ 403 Blocked | HTTP/2 doesn't help |
126
+ | `requests` | ✅ Works | Industry standard, not blocked |
127
+ | `urllib` | ✅ Works | Stdlib alternative |
128
+
129
+ We use `requests` wrapped in `asyncio.to_thread()` for async compatibility.
130
+
131
+ ### 4.1 ClinicalTrials Tool (`src/tools/clinicaltrials.py`)
132
+
133
+ ```python
134
+ """ClinicalTrials.gov search tool using API v2."""
135
+
136
+ import asyncio
137
+ from typing import Any, ClassVar
138
+
139
+ import requests
140
+ from tenacity import retry, stop_after_attempt, wait_exponential
141
+
142
+ from src.utils.exceptions import SearchError
143
+ from src.utils.models import Citation, Evidence
144
+
145
+
146
+ class ClinicalTrialsTool:
147
+ """Search tool for ClinicalTrials.gov.
148
+
149
+ Note: Uses `requests` library instead of `httpx` because ClinicalTrials.gov's
150
+ WAF blocks httpx's TLS fingerprint. The `requests` library is not blocked.
151
+ """
152
+
153
+ BASE_URL = "https://clinicaltrials.gov/api/v2/studies"
154
+ FIELDS: ClassVar[list[str]] = [
155
+ "NCTId",
156
+ "BriefTitle",
157
+ "Phase",
158
+ "OverallStatus",
159
+ "Condition",
160
+ "InterventionName",
161
+ "StartDate",
162
+ "BriefSummary",
163
+ ]
164
+
165
+ @property
166
+ def name(self) -> str:
167
+ return "clinicaltrials"
168
+
169
+ @retry(
170
+ stop=stop_after_attempt(3),
171
+ wait=wait_exponential(multiplier=1, min=1, max=10),
172
+ reraise=True,
173
+ )
174
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
175
+ """Search ClinicalTrials.gov for studies."""
176
+ params = {
177
+ "query.term": query,
178
+ "pageSize": min(max_results, 100),
179
+ "fields": "|".join(self.FIELDS),
180
+ }
181
+
182
+ try:
183
+ # Run blocking requests.get in a separate thread for async compatibility
184
+ response = await asyncio.to_thread(
185
+ requests.get,
186
+ self.BASE_URL,
187
+ params=params,
188
+ headers={"User-Agent": "DeepCritical-Research-Agent/1.0"},
189
+ timeout=30,
190
+ )
191
+ response.raise_for_status()
192
+
193
+ data = response.json()
194
+ studies = data.get("studies", [])
195
+ return [self._study_to_evidence(study) for study in studies[:max_results]]
196
+
197
+ except requests.HTTPError as e:
198
+ raise SearchError(f"ClinicalTrials.gov API error: {e}") from e
199
+ except requests.RequestException as e:
200
+ raise SearchError(f"ClinicalTrials.gov request failed: {e}") from e
201
+
202
+ def _study_to_evidence(self, study: dict) -> Evidence:
203
+ """Convert a clinical trial study to Evidence."""
204
+ # Navigate nested structure
205
+ protocol = study.get("protocolSection", {})
206
+ id_module = protocol.get("identificationModule", {})
207
+ status_module = protocol.get("statusModule", {})
208
+ desc_module = protocol.get("descriptionModule", {})
209
+ design_module = protocol.get("designModule", {})
210
+ conditions_module = protocol.get("conditionsModule", {})
211
+ arms_module = protocol.get("armsInterventionsModule", {})
212
+
213
+ nct_id = id_module.get("nctId", "Unknown")
214
+ title = id_module.get("briefTitle", "Untitled Study")
215
+ status = status_module.get("overallStatus", "Unknown")
216
+ start_date = status_module.get("startDateStruct", {}).get("date", "Unknown")
217
+
218
+ # Get phase (might be a list)
219
+ phases = design_module.get("phases", [])
220
+ phase = phases[0] if phases else "Not Applicable"
221
+
222
+ # Get conditions
223
+ conditions = conditions_module.get("conditions", [])
224
+ conditions_str = ", ".join(conditions[:3]) if conditions else "Unknown"
225
+
226
+ # Get interventions
227
+ interventions = arms_module.get("interventions", [])
228
+ intervention_names = [i.get("name", "") for i in interventions[:3]]
229
+ interventions_str = ", ".join(intervention_names) if intervention_names else "Unknown"
230
+
231
+ # Get summary
232
+ summary = desc_module.get("briefSummary", "No summary available.")
233
+
234
+ # Build content with key trial info
235
+ content = (
236
+ f"{summary[:500]}... "
237
+ f"Trial Phase: {phase}. "
238
+ f"Status: {status}. "
239
+ f"Conditions: {conditions_str}. "
240
+ f"Interventions: {interventions_str}."
241
+ )
242
+
243
+ return Evidence(
244
+ content=content[:2000],
245
+ citation=Citation(
246
+ source="clinicaltrials",
247
+ title=title[:500],
248
+ url=f"https://clinicaltrials.gov/study/{nct_id}",
249
+ date=start_date,
250
+ authors=[], # Trials don't have traditional authors
251
+ ),
252
+ relevance=0.85, # Trials are highly relevant for repurposing
253
+ )
254
+ ```
255
+
256
+ ---
257
+
258
+ ## 5. TDD Test Suite
259
+
260
+ ### 5.1 Unit Tests (`tests/unit/tools/test_clinicaltrials.py`)
261
+
262
+ Uses `unittest.mock.patch` to mock `requests.get` (not `respx` since we're not using `httpx`).
263
+
264
+ ```python
265
+ """Unit tests for ClinicalTrials.gov tool."""
266
+
267
+ from unittest.mock import MagicMock, patch
268
+
269
+ import pytest
270
+ import requests
271
+
272
+ from src.tools.clinicaltrials import ClinicalTrialsTool
273
+ from src.utils.exceptions import SearchError
274
+ from src.utils.models import Evidence
275
+
276
+
277
+ @pytest.fixture
278
+ def mock_clinicaltrials_response() -> dict:
279
+ """Mock ClinicalTrials.gov API response."""
280
+ return {
281
+ "studies": [
282
+ {
283
+ "protocolSection": {
284
+ "identificationModule": {
285
+ "nctId": "NCT04098666",
286
+ "briefTitle": "Metformin in Alzheimer's Dementia Prevention",
287
+ },
288
+ "statusModule": {
289
+ "overallStatus": "Recruiting",
290
+ "startDateStruct": {"date": "2020-01-15"},
291
+ },
292
+ "descriptionModule": {
293
+ "briefSummary": "This study evaluates metformin for Alzheimer's prevention."
294
+ },
295
+ "designModule": {"phases": ["PHASE2"]},
296
+ "conditionsModule": {"conditions": ["Alzheimer Disease", "Dementia"]},
297
+ "armsInterventionsModule": {
298
+ "interventions": [{"name": "Metformin", "type": "Drug"}]
299
+ },
300
+ }
301
+ }
302
+ ]
303
+ }
304
+
305
+
306
+ class TestClinicalTrialsTool:
307
+ """Tests for ClinicalTrialsTool."""
308
+
309
+ def test_tool_name(self) -> None:
310
+ """Tool should have correct name."""
311
+ tool = ClinicalTrialsTool()
312
+ assert tool.name == "clinicaltrials"
313
+
314
+ @pytest.mark.asyncio
315
+ async def test_search_returns_evidence(
316
+ self, mock_clinicaltrials_response: dict
317
+ ) -> None:
318
+ """Search should return Evidence objects."""
319
+ with patch("src.tools.clinicaltrials.requests.get") as mock_get:
320
+ mock_response = MagicMock()
321
+ mock_response.json.return_value = mock_clinicaltrials_response
322
+ mock_response.raise_for_status = MagicMock()
323
+ mock_get.return_value = mock_response
324
+
325
+ tool = ClinicalTrialsTool()
326
+ results = await tool.search("metformin alzheimer", max_results=5)
327
+
328
+ assert len(results) == 1
329
+ assert isinstance(results[0], Evidence)
330
+ assert results[0].citation.source == "clinicaltrials"
331
+ assert "NCT04098666" in results[0].citation.url
332
+ assert "Metformin" in results[0].citation.title
333
+
334
+ @pytest.mark.asyncio
335
+ async def test_search_api_error(self) -> None:
336
+ """Search should raise SearchError on API failure."""
337
+ with patch("src.tools.clinicaltrials.requests.get") as mock_get:
338
+ mock_response = MagicMock()
339
+ mock_response.raise_for_status.side_effect = requests.HTTPError(
340
+ "500 Server Error"
341
+ )
342
+ mock_get.return_value = mock_response
343
+
344
+ tool = ClinicalTrialsTool()
345
+
346
+ with pytest.raises(SearchError):
347
+ await tool.search("metformin alzheimer")
348
+
349
+
350
+ class TestClinicalTrialsIntegration:
351
+ """Integration tests (marked for separate run)."""
352
+
353
+ @pytest.mark.integration
354
+ @pytest.mark.asyncio
355
+ async def test_real_api_call(self) -> None:
356
+ """Test actual API call (requires network)."""
357
+ tool = ClinicalTrialsTool()
358
+ results = await tool.search("metformin diabetes", max_results=3)
359
+
360
+ assert len(results) > 0
361
+ assert all(isinstance(r, Evidence) for r in results)
362
+ assert all(r.citation.source == "clinicaltrials" for r in results)
363
+ ```
364
+
365
+ ---
366
+
367
+ ## 6. Integration with SearchHandler
368
+
369
+ ### 6.1 Update Example Files
370
+
371
+ ```python
372
+ # examples/search_demo/run_search.py
373
+ from src.tools.clinicaltrials import ClinicalTrialsTool
374
+ from src.tools.pubmed import PubMedTool
375
+ from src.tools.search_handler import SearchHandler
376
+
377
+ search_handler = SearchHandler(
378
+ tools=[PubMedTool(), ClinicalTrialsTool()],
379
+ timeout=30.0
380
+ )
381
+ ```
382
+
383
+ ### 6.2 Update SearchResult Type
384
+
385
+ ```python
386
+ # src/utils/models.py
387
+ sources_searched: list[Literal["pubmed", "clinicaltrials"]]
388
+ ```
389
+
390
+ ---
391
+
392
+ ## 7. Definition of Done
393
+
394
+ Phase 10 is **COMPLETE** when:
395
+
396
+ - [ ] `src/tools/clinicaltrials.py` implemented
397
+ - [ ] Unit tests in `tests/unit/tools/test_clinicaltrials.py`
398
+ - [ ] Integration test marked with `@pytest.mark.integration`
399
+ - [ ] SearchHandler updated to include ClinicalTrialsTool
400
+ - [ ] Type definitions updated in models.py
401
+ - [ ] Example files updated
402
+ - [ ] All unit tests pass
403
+ - [ ] Lints pass
404
+ - [ ] Manual verification with real API
405
+
406
+ ---
407
+
408
+ ## 8. Verification Commands
409
+
410
+ ```bash
411
+ # 1. Run unit tests
412
+ uv run pytest tests/unit/tools/test_clinicaltrials.py -v
413
+
414
+ # 2. Run integration test (requires network)
415
+ uv run pytest tests/unit/tools/test_clinicaltrials.py -v -m integration
416
+
417
+ # 3. Run full test suite
418
+ uv run pytest tests/unit/ -v
419
+
420
+ # 4. Run example
421
+ source .env && uv run python examples/search_demo/run_search.py "metformin alzheimer"
422
+ # Should show results from BOTH PubMed AND ClinicalTrials.gov
423
+ ```
424
+
425
+ ---
426
+
427
+ ## 9. Value Delivered
428
+
429
+ | Before | After |
430
+ |--------|-------|
431
+ | Papers only | Papers + Clinical Trials |
432
+ | "Drug X might help" | "Drug X is in Phase II trial" |
433
+ | No trial status | Recruiting/Completed/Terminated |
434
+ | No phase info | Phase I/II/III evidence strength |
435
+
436
+ **Demo pitch addition**:
437
+ > "DeepCritical searches PubMed for peer-reviewed evidence AND ClinicalTrials.gov for 400,000+ clinical trials."