Joseph Pollack committed
Commit 4a653e3 · unverified · 0 parents

Initial commit - Independent repository - Breaking fork relationship

Files changed (50). This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full list.
  1. .cursorrules +240 -0
  2. .env.example +48 -0
  3. .gitattributes +35 -0
  4. .github/README.md +203 -0
  5. .github/workflows/ci.yml +67 -0
  6. .gitignore +77 -0
  7. .pre-commit-config.yaml +64 -0
  8. .pre-commit-hooks/run_pytest.ps1 +14 -0
  9. .pre-commit-hooks/run_pytest.sh +15 -0
  10. .python-version +1 -0
  11. AGENTS.txt +236 -0
  12. CONTRIBUTING.md +1 -0
  13. Dockerfile +52 -0
  14. Makefile +42 -0
  15. README.md +196 -0
  16. docs/CONFIGURATION.md +301 -0
  17. docs/architecture/design-patterns.md +1509 -0
  18. docs/architecture/graph_orchestration.md +151 -0
  19. docs/architecture/overview.md +474 -0
  20. docs/brainstorming/00_ROADMAP_SUMMARY.md +194 -0
  21. docs/brainstorming/01_PUBMED_IMPROVEMENTS.md +125 -0
  22. docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md +193 -0
  23. docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md +211 -0
  24. docs/brainstorming/04_OPENALEX_INTEGRATION.md +303 -0
  25. docs/brainstorming/implementation/15_PHASE_OPENALEX.md +603 -0
  26. docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md +586 -0
  27. docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md +540 -0
  28. docs/brainstorming/implementation/README.md +143 -0
  29. docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md +189 -0
  30. docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md +289 -0
  31. docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md +112 -0
  32. docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md +112 -0
  33. docs/brainstorming/magentic-pydantic/04_FOLLOWUP_REVIEW_REQUEST.md +158 -0
  34. docs/brainstorming/magentic-pydantic/REVIEW_PROMPT_FOR_SENIOR_AGENT.md +113 -0
  35. docs/bugs/FIX_PLAN_MAGENTIC_MODE.md +227 -0
  36. docs/bugs/P0_MAGENTIC_MODE_BROKEN.md +116 -0
  37. docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md +81 -0
  38. docs/development/testing.md +139 -0
  39. docs/examples/writer_agents_usage.md +425 -0
  40. docs/guides/deployment.md +142 -0
  41. docs/implementation/01_phase_foundation.md +587 -0
  42. docs/implementation/02_phase_search.md +822 -0
  43. docs/implementation/03_phase_judge.md +1052 -0
  44. docs/implementation/04_phase_ui.md +1104 -0
  45. docs/implementation/05_phase_magentic.md +1091 -0
  46. docs/implementation/06_phase_embeddings.md +409 -0
  47. docs/implementation/07_phase_hypothesis.md +630 -0
  48. docs/implementation/08_phase_report.md +854 -0
  49. docs/implementation/09_phase_source_cleanup.md +257 -0
  50. docs/implementation/10_phase_clinicaltrials.md +437 -0
.cursorrules ADDED
@@ -0,0 +1,240 @@
1
+ # DeepCritical Project - Cursor Rules
2
+
3
+ ## Project-Wide Rules
4
+
5
+ **Architecture**: Multi-agent research system using Pydantic AI for agent orchestration, supporting iterative and deep research patterns. Uses middleware for state management, budget tracking, and workflow coordination.
6
+
7
+ **Type Safety**: ALWAYS use complete type hints. All functions must have parameter and return type annotations. Use `mypy --strict` compliance. Use `TYPE_CHECKING` imports for circular dependencies: `from typing import TYPE_CHECKING; if TYPE_CHECKING: from src.services.embeddings import EmbeddingService`
8
+
9
+ **Async Patterns**: ALL I/O operations must be async (`async def`, `await`). Use `asyncio.gather()` for parallel operations. CPU-bound work must use `run_in_executor()`: `loop = asyncio.get_running_loop(); result = await loop.run_in_executor(None, cpu_bound_function, args)`. Never block the event loop.
10
+
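+ A minimal sketch of this async pattern (the tool objects and `rank_by_relevance` helper are illustrative, not actual module members):
+ ```python
+ import asyncio
+
+ async def gather_and_rank(query: str) -> list["Evidence"]:
+     # I/O-bound searches run concurrently instead of sequentially.
+     pubmed_hits, trial_hits = await asyncio.gather(
+         pubmed_tool.search(query, max_results=10),
+         trials_tool.search(query, max_results=10),
+     )
+     # CPU-bound ranking is pushed to a worker thread so the event loop stays free.
+     loop = asyncio.get_running_loop()
+     return await loop.run_in_executor(None, rank_by_relevance, pubmed_hits + trial_hits)
+ ```
+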
11
+ **Error Handling**: Use custom exceptions from `src/utils/exceptions.py`: `DeepCriticalError`, `SearchError`, `RateLimitError`, `JudgeError`, `ConfigurationError`. Always chain exceptions: `raise SearchError(...) from e`. Log with structlog: `logger.error("Operation failed", error=str(e), context=value)`.
12
+
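+ A sketch of the chaining-plus-structlog convention, assuming the exception module named above (the endpoint and parameters are illustrative):
+ ```python
+ import httpx
+ import structlog
+
+ from src.utils.exceptions import SearchError
+
+ logger = structlog.get_logger()
+
+ async def fetch(client: httpx.AsyncClient, url: str, query: str) -> httpx.Response:
+     try:
+         response = await client.get(url, params={"term": query})
+         response.raise_for_status()
+         return response
+     except httpx.HTTPError as e:
+         # Structured log first, then re-raise as a domain error with the cause chained.
+         logger.error("search_request_failed", error=str(e), query=query, url=url)
+         raise SearchError(f"Search request failed: {e}") from e
+ ```
+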
13
+ **Logging**: Use `structlog` for ALL logging (NOT `print` or `logging`). Import: `import structlog; logger = structlog.get_logger()`. Log with structured data: `logger.info("event", key=value)`. Use appropriate levels: DEBUG, INFO, WARNING, ERROR.
14
+
15
+ **Pydantic Models**: All data exchange uses Pydantic models from `src/utils/models.py`. Models are frozen (`model_config = {"frozen": True}`) for immutability. Use `Field()` with descriptions. Validate with `ge=`, `le=`, `min_length=`, `max_length=` constraints.
16
+
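+ For illustration, a frozen model with `Field()` constraints in this style (the field names are hypothetical, not the actual `Evidence` schema):
+ ```python
+ from pydantic import BaseModel, Field
+
+ class EvidenceItem(BaseModel):
+     model_config = {"frozen": True}  # immutable after construction
+
+     title: str = Field(..., min_length=1, description="Title of the source document")
+     url: str = Field(..., description="Canonical URL of the source")
+     relevance: float = Field(0.5, ge=0.0, le=1.0, description="Relevance score in [0, 1]")
+ ```
+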
17
+ **Code Style**: Ruff with 100-char line length. Ignore rules: `PLR0913` (too many arguments), `PLR0912` (too many branches), `PLR0911` (too many returns), `PLR2004` (magic values), `PLW0603` (global statement), `PLC0415` (lazy imports).
18
+
19
+ **Docstrings**: Google-style docstrings for all public functions. Include Args, Returns, Raises sections. Use type hints in docstrings only if needed for clarity.
20
+
21
+ **Testing**: Unit tests in `tests/unit/` (mocked, fast). Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`). Use `respx` for httpx mocking, `pytest-mock` for general mocking.
22
+
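+ A unit-test sketch using `respx` (the tool class, its empty-result behaviour, and the `pytest-asyncio` marker are assumptions for illustration):
+ ```python
+ import httpx
+ import pytest
+ import respx
+
+ @pytest.mark.asyncio
+ @respx.mock
+ async def test_pubmed_search_handles_empty_results() -> None:
+     respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi").mock(
+         return_value=httpx.Response(200, json={"esearchresult": {"idlist": []}})
+     )
+     tool = PubMedTool()  # hypothetical import from src.tools.pubmed
+     results = await tool.search("aspirin repurposing", max_results=5)
+     assert results == []
+ ```
+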
23
+ **State Management**: Use `ContextVar` in middleware for thread-safe isolation. Never use global mutable state (except singletons via `@lru_cache`). Use `WorkflowState` from `src/middleware/state_machine.py` for workflow state.
24
+
25
+ **Citation Validation**: ALWAYS validate references before returning reports. Use `validate_references()` from `src/utils/citation_validator.py`. Remove hallucinated citations. Log warnings for removed citations.
26
+
27
+ ---
28
+
29
+ ## src/agents/ - Agent Implementation Rules
30
+
31
+ **Pattern**: All agents use Pydantic AI `Agent` class. Agents have structured output types (Pydantic models) or return strings. Use factory functions in `src/agent_factory/agents.py` for creation.
32
+
33
+ **Agent Structure** (a minimal sketch follows this list):
34
+ - System prompt as module-level constant (with date injection: `datetime.now().strftime("%Y-%m-%d")`)
35
+ - Agent class with `__init__(model: Any | None = None)`
36
+ - Main method (e.g., `async def evaluate()`, `async def write_report()`)
37
+ - Factory function: `def create_agent_name(model: Any | None = None) -> AgentName`
38
+
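+ A minimal sketch of that structure (kwarg and attribute names such as `output_type`, `retries`, and `.output` differ between pydantic-ai releases; treat this as the shape, not the exact API):
+ ```python
+ from datetime import datetime
+ from typing import Any
+
+ from pydantic_ai import Agent
+
+ from src.agent_factory.judges import get_model
+ from src.utils.models import KnowledgeGapOutput
+
+ SYSTEM_PROMPT = (
+     "Evaluate whether the research so far is complete. "
+     f"Today's date is {datetime.now().strftime('%Y-%m-%d')}."
+ )
+
+ class KnowledgeGapAgent:
+     def __init__(self, model: Any | None = None) -> None:
+         self._agent = Agent(
+             model or get_model(),
+             output_type=KnowledgeGapOutput,
+             system_prompt=SYSTEM_PROMPT,
+             retries=3,
+         )
+
+     async def evaluate(self, findings: str) -> KnowledgeGapOutput:
+         result = await self._agent.run(findings)
+         return result.output  # `.data` on older pydantic-ai versions
+
+ def create_knowledge_gap_agent(model: Any | None = None) -> KnowledgeGapAgent:
+     return KnowledgeGapAgent(model)
+ ```
+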
39
+ **Model Initialization**: Use `get_model()` from `src/agent_factory/judges.py` if no model provided. Support OpenAI/Anthropic/HF Inference via settings.
40
+
41
+ **Error Handling**: Return fallback values (e.g., `KnowledgeGapOutput(research_complete=False, outstanding_gaps=[...])`) on failure. Log errors with context. Use retry logic (3 retries) in Pydantic AI Agent initialization.
42
+
43
+ **Input Validation**: Validate query/inputs are not empty. Truncate very long inputs with warnings. Handle None values gracefully.
44
+
45
+ **Output Types**: Use structured output types from `src/utils/models.py` (e.g., `KnowledgeGapOutput`, `AgentSelectionPlan`, `ReportDraft`). For text output (writer agents), return `str` directly.
46
+
47
+ **Agent-Specific Rules**:
48
+ - `knowledge_gap.py`: Outputs `KnowledgeGapOutput`. Evaluates research completeness.
49
+ - `tool_selector.py`: Outputs `AgentSelectionPlan`. Selects tools (RAG/web/database).
50
+ - `writer.py`: Returns markdown string. Includes citations in numbered format.
51
+ - `long_writer.py`: Uses `ReportDraft` input/output. Handles section-by-section writing.
52
+ - `proofreader.py`: Takes `ReportDraft`, returns polished markdown.
53
+ - `thinking.py`: Returns observation string from conversation history.
54
+ - `input_parser.py`: Outputs `ParsedQuery` with research mode detection.
55
+
56
+ ---
57
+
58
+ ## src/tools/ - Search Tool Rules
59
+
60
+ **Protocol**: All tools implement `SearchTool` protocol from `src/tools/base.py`: `name` property and `async def search(query, max_results) -> list[Evidence]`.
61
+
62
+ **Rate Limiting**: Use `@retry` decorator from tenacity: `@retry(stop=stop_after_attempt(3), wait=wait_exponential(...))`. Implement `_rate_limit()` method for APIs with limits. Use shared rate limiters from `src/tools/rate_limiter.py`.
63
+
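+ A sketch of the retry-plus-rate-limit pattern (class internals are illustrative; the 0.34 s spacing mirrors the PubMed rule below):
+ ```python
+ import asyncio
+ import time
+
+ from tenacity import retry, stop_after_attempt, wait_exponential
+
+ class PubMedTool:
+     name = "pubmed"
+     _min_interval = 0.34  # seconds between NCBI requests
+     _last_request = 0.0
+
+     async def _rate_limit(self) -> None:
+         elapsed = time.monotonic() - self._last_request
+         if elapsed < self._min_interval:
+             await asyncio.sleep(self._min_interval - elapsed)
+         self._last_request = time.monotonic()
+
+     @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
+     async def search(self, query: str, max_results: int = 10) -> list["Evidence"]:
+         await self._rate_limit()
+         ...  # ESearch -> EFetch, then convert responses to Evidence objects
+ ```
+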
64
+ **Error Handling**: Raise `SearchError` or `RateLimitError` on failures. Handle HTTP errors (429, 500, timeout). Return empty list on non-critical errors (log warning).
65
+
66
+ **Query Preprocessing**: Use `preprocess_query()` from `src/tools/query_utils.py` to remove noise and expand synonyms.
67
+
68
+ **Evidence Conversion**: Convert API responses to `Evidence` objects with `Citation`. Extract metadata (title, url, date, authors). Set relevance scores (0.0-1.0). Handle missing fields gracefully.
69
+
70
+ **Tool-Specific Rules**:
71
+ - `pubmed.py`: Use NCBI E-utilities (ESearch → EFetch). Rate limit: 0.34s between requests. Parse XML with `xmltodict`. Handle single vs. multiple articles.
72
+ - `clinicaltrials.py`: Use `requests` library (NOT httpx - WAF blocks httpx). Run in thread pool: `await asyncio.to_thread(requests.get, ...)`. Filter: Only interventional studies, active/completed.
73
+ - `europepmc.py`: Handle preprint markers: `[PREPRINT - Not peer-reviewed]`. Build URLs from DOI or PMID.
74
+ - `rag_tool.py`: Wraps `LlamaIndexRAGService`. Returns Evidence from RAG results. Handles ingestion.
75
+ - `search_handler.py`: Orchestrates parallel searches across multiple tools. Uses `asyncio.gather()` with `return_exceptions=True`. Aggregates results into `SearchResult`. A sketch of this aggregation follows the list.
76
+
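+ The parallel-search aggregation, sketched under the assumption that each tool satisfies the `SearchTool` protocol (handler internals are simplified):
+ ```python
+ import asyncio
+ import structlog
+
+ logger = structlog.get_logger()
+
+ async def run_searches(tools: list["SearchTool"], query: str, max_results: int) -> list["Evidence"]:
+     results = await asyncio.gather(
+         *(tool.search(query, max_results) for tool in tools),
+         return_exceptions=True,
+     )
+     evidence: list["Evidence"] = []
+     for tool, result in zip(tools, results):
+         if isinstance(result, Exception):
+             # One failing tool should not sink the whole search.
+             logger.warning("search_tool_failed", tool=tool.name, error=str(result))
+             continue
+         evidence.extend(result)
+     return evidence
+ ```
+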
77
+ ---
78
+
79
+ ## src/middleware/ - Middleware Rules
80
+
81
+ **State Management**: Use `ContextVar` for thread-safe isolation. `WorkflowState` uses `ContextVar[WorkflowState | None]`. Initialize with `init_workflow_state(embedding_service)`. Access with `get_workflow_state()` (auto-initializes if missing).
82
+
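+ A sketch of the `ContextVar` accessor pattern (the `WorkflowState` constructor arguments here are assumptions):
+ ```python
+ from contextvars import ContextVar
+
+ _workflow_state: ContextVar["WorkflowState | None"] = ContextVar("workflow_state", default=None)
+
+ def init_workflow_state(embedding_service: object | None = None) -> "WorkflowState":
+     state = WorkflowState(evidence=[], conversation=Conversation(), embedding_service=embedding_service)
+     _workflow_state.set(state)
+     return state
+
+ def get_workflow_state() -> "WorkflowState":
+     state = _workflow_state.get()
+     if state is None:  # auto-initialize when accessed before explicit init
+         state = init_workflow_state()
+     return state
+ ```
+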
83
+ **WorkflowState**: Tracks `evidence: list[Evidence]`, `conversation: Conversation`, `embedding_service: Any`. Methods: `add_evidence()` (deduplicates by URL), `async search_related()` (semantic search).
84
+
85
+ **WorkflowManager**: Manages parallel research loops. Methods: `add_loop()`, `run_loops_parallel()`, `update_loop_status()`, `sync_loop_evidence_to_state()`. Uses `asyncio.gather()` for parallel execution. Handles errors per loop (don't fail all if one fails).
86
+
87
+ **BudgetTracker**: Tracks tokens, time, iterations per loop and globally. Methods: `create_budget()`, `add_tokens()`, `start_timer()`, `update_timer()`, `increment_iteration()`, `check_budget()`, `can_continue()`. Token estimation: `estimate_tokens(text)` (~4 chars per token), `estimate_llm_call_tokens(prompt, response)`.
88
+
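+ The token heuristic written out as a rough sketch (the real tracker may round or cap differently):
+ ```python
+ def estimate_tokens(text: str) -> int:
+     # Heuristic from the rule above: roughly 4 characters per token.
+     return max(1, len(text) // 4)
+
+ def estimate_llm_call_tokens(prompt: str, response: str) -> int:
+     return estimate_tokens(prompt) + estimate_tokens(response)
+ ```
+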
89
+ **Models**: All middleware models in `src/utils/models.py`. `IterationData`, `Conversation`, `ResearchLoop`, `BudgetStatus` are used by middleware.
90
+
91
+ ---
92
+
93
+ ## src/orchestrator/ - Orchestration Rules
94
+
95
+ **Research Flows**: Two patterns: `IterativeResearchFlow` (single loop) and `DeepResearchFlow` (plan → parallel loops → synthesis). Both support agent chains (`use_graph=False`) and graph execution (`use_graph=True`).
96
+
97
+ **IterativeResearchFlow**: Pattern: Generate observations → Evaluate gaps → Select tools → Execute → Judge → Continue/Complete. Uses `KnowledgeGapAgent`, `ToolSelectorAgent`, `ThinkingAgent`, `WriterAgent`, `JudgeHandler`. Tracks iterations, time, budget.
98
+
99
+ **DeepResearchFlow**: Pattern: Planner → Parallel iterative loops per section → Synthesizer. Uses `PlannerAgent`, `IterativeResearchFlow` (per section), `LongWriterAgent` or `ProofreaderAgent`. Uses `WorkflowManager` for parallel execution.
100
+
101
+ **Graph Orchestrator**: Uses Pydantic AI Graphs (when available) or agent chains (fallback). Routes based on research mode (iterative/deep/auto). Streams `AgentEvent` objects for UI.
102
+
103
+ **State Initialization**: Always call `init_workflow_state()` before running flows. Initialize `BudgetTracker` per loop. Use `WorkflowManager` for parallel coordination.
104
+
105
+ **Event Streaming**: Yield `AgentEvent` objects during execution. Event types: "started", "search_complete", "judge_complete", "hypothesizing", "synthesizing", "complete", "error". Include iteration numbers and data payloads.
106
+
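+ A sketch of the event-streaming shape (the `AgentEvent` field names and helper methods are assumptions):
+ ```python
+ from collections.abc import AsyncGenerator
+
+ async def run(self, query: str) -> AsyncGenerator["AgentEvent", None]:
+     # Sketch of a method on an orchestrator/flow class.
+     yield AgentEvent(type="started", iteration=0, data={"query": query})
+     iteration = 0
+     for iteration in range(1, self.max_iterations + 1):
+         evidence = await self._search(query)
+         yield AgentEvent(type="search_complete", iteration=iteration, data={"evidence_count": len(evidence)})
+         if await self._judge_says_done(evidence):
+             break
+     yield AgentEvent(type="complete", iteration=iteration, data={})
+ ```
+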
107
+ ---
108
+
109
+ ## src/services/ - Service Rules
110
+
111
+ **EmbeddingService**: Local sentence-transformers (NO API key required). All operations async-safe via `run_in_executor()`. ChromaDB for vector storage. Deduplication threshold: 0.85 (85% similarity = duplicate).
112
+
113
+ **LlamaIndexRAGService**: Uses OpenAI embeddings (requires `OPENAI_API_KEY`). Methods: `ingest_evidence()`, `retrieve()`, `query()`. Returns documents with metadata (source, title, url, date, authors). Lazy initialization with graceful fallback.
114
+
115
+ **StatisticalAnalyzer**: Generates Python code via LLM. Executes in Modal sandbox (secure, isolated). Library versions pinned in `SANDBOX_LIBRARIES` dict. Returns `AnalysisResult` with verdict (SUPPORTED/REFUTED/INCONCLUSIVE).
116
+
117
+ **Singleton Pattern**: Use `@lru_cache(maxsize=1)` for singletons: `@lru_cache(maxsize=1); def get_service() -> Service: return Service()`. Lazy initialization to avoid requiring dependencies at import time.
118
+
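+ The singleton pattern as a sketch, with the lazy import that keeps heavy dependencies optional:
+ ```python
+ from functools import lru_cache
+
+ @lru_cache(maxsize=1)
+ def get_embedding_service() -> "EmbeddingService":
+     # Import inside the function so sentence-transformers/ChromaDB are only
+     # required when the service is actually used.
+     from src.services.embeddings import EmbeddingService
+     return EmbeddingService()
+ ```
+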
119
+ ---
120
+
121
+ ## src/utils/ - Utility Rules
122
+
123
+ **Models**: All Pydantic models in `src/utils/models.py`. Use frozen models (`model_config = {"frozen": True}`) except where mutation needed. Use `Field()` with descriptions. Validate with constraints.
124
+
125
+ **Config**: Settings via Pydantic Settings (`src/utils/config.py`). Load from `.env` automatically. Use `settings` singleton: `from src.utils.config import settings`. Validate API keys with properties: `has_openai_key`, `has_anthropic_key`.
126
+
127
+ **Exceptions**: Custom exception hierarchy in `src/utils/exceptions.py`. Base: `DeepCriticalError`. Specific: `SearchError`, `RateLimitError`, `JudgeError`, `ConfigurationError`. Always chain exceptions.
128
+
129
+ **LLM Factory**: Centralized LLM model creation in `src/utils/llm_factory.py`. Supports OpenAI, Anthropic, HF Inference. Use `get_model()` or factory functions. Check requirements before initialization.
130
+
131
+ **Citation Validator**: Use `validate_references()` from `src/utils/citation_validator.py`. Removes hallucinated citations (URLs not in evidence). Logs warnings. Returns validated report string.
132
+
133
+ ---
134
+
135
+ ## src/orchestrator_factory.py Rules
136
+
137
+ **Purpose**: Factory for creating orchestrators. Supports "simple" (legacy) and "advanced" (magentic) modes. Auto-detects mode based on API key availability.
138
+
139
+ **Pattern**: Lazy import for optional dependencies (`_get_magentic_orchestrator_class()`). Handles `ImportError` gracefully with clear error messages.
140
+
141
+ **Mode Detection**: `_determine_mode()` checks explicit mode or auto-detects: "advanced" if `settings.has_openai_key`, else "simple". Maps "magentic" → "advanced".
142
+
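+ A sketch of the detection logic described above (exact argument handling may differ):
+ ```python
+ from src.utils.config import settings
+
+ def _determine_mode(mode: str | None) -> str:
+     if mode == "magentic":
+         mode = "advanced"
+     if mode in ("simple", "advanced"):
+         return mode
+     # Auto-detect: advanced (magentic) needs an OpenAI key, otherwise fall back to simple.
+     return "advanced" if settings.has_openai_key else "simple"
+ ```
+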
143
+ **Function Signature**: `create_orchestrator(search_handler, judge_handler, config, mode) -> Any`. Simple mode requires handlers. Advanced mode uses MagenticOrchestrator.
144
+
145
+ **Error Handling**: Raise `ValueError` with clear messages if requirements not met. Log mode selection with structlog.
146
+
147
+ ---
148
+
149
+ ## src/orchestrator_hierarchical.py Rules
150
+
151
+ **Purpose**: Hierarchical orchestrator using middleware and sub-teams. Adapts Magentic ChatAgent to SubIterationTeam protocol.
152
+
153
+ **Pattern**: Uses `SubIterationMiddleware` with `ResearchTeam` and `LLMSubIterationJudge`. Event-driven via callback queue.
154
+
155
+ **State Initialization**: Initialize embedding service with graceful fallback. Use `init_magentic_state()` (deprecated, but kept for compatibility).
156
+
157
+ **Event Streaming**: Uses `asyncio.Queue` for event coordination. Yields `AgentEvent` objects. Handles event callback pattern with `asyncio.wait()`.
158
+
159
+ **Error Handling**: Log errors with context. Yield error events. Process remaining events after task completion.
160
+
161
+ ---
162
+
163
+ ## src/orchestrator_magentic.py Rules
164
+
165
+ **Purpose**: Magentic-based orchestrator using ChatAgent pattern. Each agent has internal LLM. Manager orchestrates agents.
166
+
167
+ **Pattern**: Uses `MagenticBuilder` with participants (searcher, hypothesizer, judge, reporter). Manager uses `OpenAIChatClient`. Workflow built in `_build_workflow()`.
168
+
169
+ **Event Processing**: `_process_event()` converts Magentic events to `AgentEvent`. Handles: `MagenticOrchestratorMessageEvent`, `MagenticAgentMessageEvent`, `MagenticFinalResultEvent`, `MagenticAgentDeltaEvent`, `WorkflowOutputEvent`.
170
+
171
+ **Text Extraction**: `_extract_text()` defensively extracts text from messages. Priority: `.content` → `.text` → `str(message)`. Handles buggy message objects.
172
+
173
+ **State Initialization**: Initialize embedding service with graceful fallback. Use `init_magentic_state()` (deprecated).
174
+
175
+ **Requirements**: Must call `check_magentic_requirements()` in `__init__`. Requires `agent-framework-core` and OpenAI API key.
176
+
177
+ **Event Types**: Maps agent names to event types: "search" → "search_complete", "judge" → "judge_complete", "hypothes" → "hypothesizing", "report" → "synthesizing".
178
+
179
+ ---
180
+
181
+ ## src/agent_factory/ - Factory Rules
182
+
183
+ **Pattern**: Factory functions for creating agents and handlers. Lazy initialization for optional dependencies. Support OpenAI/Anthropic/HF Inference.
184
+
185
+ **Judges**: `create_judge_handler()` creates `JudgeHandler` with structured output (`JudgeAssessment`). Supports `MockJudgeHandler`, `HFInferenceJudgeHandler` as fallbacks.
186
+
187
+ **Agents**: Factory functions in `agents.py` for all Pydantic AI agents. Pattern: `create_agent_name(model: Any | None = None) -> AgentName`. Use `get_model()` if model not provided.
188
+
189
+ **Graph Builder**: `graph_builder.py` contains utilities for building research graphs. Supports iterative and deep research graph construction.
190
+
191
+ **Error Handling**: Raise `ConfigurationError` if required API keys missing. Log agent creation. Handle import errors gracefully.
192
+
193
+ ---
194
+
195
+ ## src/prompts/ - Prompt Rules
196
+
197
+ **Pattern**: System prompts stored as module-level constants. Include date injection: `datetime.now().strftime("%Y-%m-%d")`. Format evidence with truncation (1500 chars per item).
198
+
199
+ **Judge Prompts**: In `judge.py`. Handle empty evidence case separately. Always request structured JSON output.
200
+
201
+ **Hypothesis Prompts**: In `hypothesis.py`. Use diverse evidence selection (MMR algorithm). Sentence-aware truncation.
202
+
203
+ **Report Prompts**: In `report.py`. Include full citation details. Use diverse evidence selection (n=20). Emphasize citation validation rules.
204
+
205
+ ---
206
+
207
+ ## Testing Rules
208
+
209
+ **Structure**: Unit tests in `tests/unit/` (mocked, fast). Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`).
210
+
211
+ **Mocking**: Use `respx` for httpx mocking. Use `pytest-mock` for general mocking. Mock LLM calls in unit tests (use `MockJudgeHandler`).
212
+
213
+ **Fixtures**: Common fixtures in `tests/conftest.py`: `mock_httpx_client`, `mock_llm_response`.
214
+
215
+ **Coverage**: Aim for >80% coverage. Test error handling, edge cases, and integration paths.
216
+
217
+ ---
218
+
219
+ ## File-Specific Agent Rules
220
+
221
+ **knowledge_gap.py**: Outputs `KnowledgeGapOutput`. System prompt evaluates research completeness. Handles conversation history. Returns fallback on error.
222
+
223
+ **writer.py**: Returns markdown string. System prompt includes citation format examples. Validates inputs. Truncates long findings. Retry logic for transient failures.
224
+
225
+ **long_writer.py**: Uses `ReportDraft` input/output. Writes sections iteratively. Reformats references (deduplicates, renumbers). Reformats section headings.
226
+
227
+ **proofreader.py**: Takes `ReportDraft`, returns polished markdown. Removes duplicates. Adds summary. Preserves references.
228
+
229
+ **tool_selector.py**: Outputs `AgentSelectionPlan`. System prompt lists available agents (WebSearchAgent, SiteCrawlerAgent, RAGAgent). Guidelines for when to use each.
230
+
231
+ **thinking.py**: Returns observation string. Generates observations from conversation history. Uses query and background context.
232
+
233
+ **input_parser.py**: Outputs `ParsedQuery`. Detects research mode (iterative/deep). Extracts entities and research questions. Improves/refines query.
234
+
235
+
236
+
237
+
238
+
239
+
240
+
.env.example ADDED
@@ -0,0 +1,48 @@
1
+ # ============== LLM CONFIGURATION ==============
2
+
3
+ # Provider: "openai" or "anthropic"
4
+ LLM_PROVIDER=openai
5
+
6
+ # API Keys (at least one required for full LLM analysis)
7
+ OPENAI_API_KEY=sk-your-key-here
8
+ ANTHROPIC_API_KEY=sk-ant-your-key-here
9
+
10
+ # Model names (optional - sensible defaults set in config.py)
11
+ # ANTHROPIC_MODEL=claude-sonnet-4-5-20250929
12
+ # OPENAI_MODEL=gpt-5.1
13
+
14
+ # ============== EMBEDDINGS ==============
15
+
16
+ # OpenAI Embedding Model (used if LLM_PROVIDER is openai and performing RAG/Embeddings)
17
+ OPENAI_EMBEDDING_MODEL=text-embedding-3-small
18
+
19
+ # Local Embedding Model (used for local/offline embeddings)
20
+ LOCAL_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
21
+
22
+ # ============== HUGGINGFACE (FREE TIER) ==============
23
+
24
+ # HuggingFace Token - enables Llama 3.1 (best quality free model)
25
+ # Get yours at: https://huggingface.co/settings/tokens
26
+ #
27
+ # WITHOUT HF_TOKEN: Falls back to ungated models (zephyr-7b-beta)
28
+ # WITH HF_TOKEN: Uses Llama 3.1 8B Instruct (requires accepting license)
29
+ #
30
+ # For HuggingFace Spaces deployment:
31
+ # Set this as a "Secret" in Space Settings -> Variables and secrets
32
+ # Users/judges don't need their own token - the Space secret is used
33
+ #
34
+ HF_TOKEN=hf_your-token-here
35
+
36
+ # ============== AGENT CONFIGURATION ==============
37
+
38
+ MAX_ITERATIONS=10
39
+ SEARCH_TIMEOUT=30
40
+ LOG_LEVEL=INFO
41
+
42
+ # ============== EXTERNAL SERVICES ==============
43
+
44
+ # PubMed (optional - higher rate limits)
45
+ NCBI_API_KEY=your-ncbi-key-here
46
+
47
+ # Vector Database (optional - for LlamaIndex RAG)
48
+ CHROMA_DB_PATH=./chroma_db
.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.github/README.md ADDED
@@ -0,0 +1,203 @@
1
+ ---
2
+ title: DeepCritical
3
+ emoji: 🧬
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: gradio
7
+ sdk_version: "6.0.1"
8
+ python_version: "3.11"
9
+ app_file: src/app.py
10
+ pinned: false
11
+ license: mit
12
+ tags:
13
+ - mcp-in-action-track-enterprise
14
+ - mcp-hackathon
15
+ - drug-repurposing
16
+ - biomedical-ai
17
+ - pydantic-ai
18
+ - llamaindex
19
+ - modal
20
+ ---
21
+
22
+ # DeepCritical
23
+
24
+ ## Intro
25
+
+ DeepCritical is a multi-agent deep-research system for biomedical questions such as drug repurposing. It searches PubMed, ClinicalTrials.gov, and bioRxiv/medRxiv, judges the retrieved evidence with LLMs, and synthesizes cited research reports, available through a Gradio UI and as MCP tools.
+
26
+ ## Features
27
+
28
+ - **Multi-Source Search**: PubMed, ClinicalTrials.gov, bioRxiv/medRxiv
29
+ - **MCP Integration**: Use our tools from Claude Desktop or any MCP client
30
+ - **Modal Sandbox**: Secure execution of AI-generated statistical code
31
+ - **LlamaIndex RAG**: Semantic search and evidence synthesis
32
+ - **Hugging Face Inference**: Free-tier LLM support (Llama 3.1 8B Instruct with `HF_TOKEN`, ungated fallback models without)
33
+ - **Hugging Face MCP Custom Config**: Use community tools through a custom Hugging Face MCP configuration
34
+ - **Strongly Typed Composable Graphs**: Research workflows composed as strongly typed graphs (Pydantic AI Graphs, with agent-chain fallback)
35
+ - **Specialized Research Teams of Agents**: Hierarchical orchestration with dedicated searcher, hypothesizer, judge, and reporter agents
36
+
37
+ ## Quick Start
38
+
39
+ ### 1. Environment Setup
40
+
41
+ ```bash
42
+ # Install uv if you haven't already
43
+ pip install uv
44
+
45
+ # Sync dependencies
46
+ uv sync
47
+ ```
48
+
49
+ ### 2. Run the UI
50
+
51
+ ```bash
52
+ # Start the Gradio app
53
+ uv run python -m src.app
54
+ ```
55
+
56
+ Open your browser to `http://localhost:7860`.
57
+
58
+ ### 3. Connect via MCP
59
+
60
+ This application exposes a Model Context Protocol (MCP) server, allowing you to use its search tools directly from Claude Desktop or other MCP clients.
61
+
62
+ **MCP Server URL**: `http://localhost:7860/gradio_api/mcp/`
63
+
64
+ **Claude Desktop Configuration**:
65
+ Add this to your `claude_desktop_config.json`:
66
+ ```json
67
+ {
68
+ "mcpServers": {
69
+ "deepcritical": {
70
+ "url": "http://localhost:7860/gradio_api/mcp/"
71
+ }
72
+ }
73
+ }
74
+ ```
75
+
76
+ **Available Tools**:
77
+ - `search_pubmed`: Search peer-reviewed biomedical literature.
78
+ - `search_clinical_trials`: Search ClinicalTrials.gov.
79
+ - `search_biorxiv`: Search bioRxiv/medRxiv preprints.
80
+ - `search_all`: Search all sources simultaneously.
81
+ - `analyze_hypothesis`: Secure statistical analysis using Modal sandboxes.
82
+
83
+
84
+ ## Deep Research Flows
85
+
86
+ - iterativeResearch
87
+ - deepResearch
88
+ - researchTeam
89
+
90
+ ### Iterative Research
91
+
92
+ ```mermaid
+ sequenceDiagram
93
+ participant IterativeFlow
94
+ participant ThinkingAgent
95
+ participant KnowledgeGapAgent
96
+ participant ToolSelector
97
+ participant ToolExecutor
98
+ participant JudgeHandler
99
+ participant WriterAgent
100
+
101
+ IterativeFlow->>IterativeFlow: run(query)
102
+
103
+ loop Until complete or max_iterations
104
+ IterativeFlow->>ThinkingAgent: generate_observations()
105
+ ThinkingAgent-->>IterativeFlow: observations
106
+
107
+ IterativeFlow->>KnowledgeGapAgent: evaluate_gaps()
108
+ KnowledgeGapAgent-->>IterativeFlow: KnowledgeGapOutput
109
+
110
+ alt Research complete
111
+ IterativeFlow->>WriterAgent: create_final_report()
112
+ WriterAgent-->>IterativeFlow: final_report
113
+ else Gaps remain
114
+ IterativeFlow->>ToolSelector: select_agents(gap)
115
+ ToolSelector-->>IterativeFlow: AgentSelectionPlan
116
+
117
+ IterativeFlow->>ToolExecutor: execute_tool_tasks()
118
+ ToolExecutor-->>IterativeFlow: ToolAgentOutput[]
119
+
120
+ IterativeFlow->>JudgeHandler: assess_evidence()
121
+ JudgeHandler-->>IterativeFlow: should_continue
122
+ end
123
+ end
+ ```
124
+
125
+
126
+ ### Deep Research
127
+
128
+ ```mermaid
+ sequenceDiagram
129
+ actor User
130
+ participant GraphOrchestrator
131
+ participant InputParser
132
+ participant GraphBuilder
133
+ participant GraphExecutor
134
+ participant Agent
135
+ participant BudgetTracker
136
+ participant WorkflowState
137
+
138
+ User->>GraphOrchestrator: run(query)
139
+ GraphOrchestrator->>InputParser: detect_research_mode(query)
140
+ InputParser-->>GraphOrchestrator: mode (iterative/deep)
141
+ GraphOrchestrator->>GraphBuilder: build_graph(mode)
142
+ GraphBuilder-->>GraphOrchestrator: ResearchGraph
143
+ GraphOrchestrator->>WorkflowState: init_workflow_state()
144
+ GraphOrchestrator->>BudgetTracker: create_budget()
145
+ GraphOrchestrator->>GraphExecutor: _execute_graph(graph)
146
+
147
+ loop For each node in graph
148
+ GraphExecutor->>Agent: execute_node(agent_node)
149
+ Agent->>Agent: process_input
150
+ Agent-->>GraphExecutor: result
151
+ GraphExecutor->>WorkflowState: update_state(result)
152
+ GraphExecutor->>BudgetTracker: add_tokens(used)
153
+ GraphExecutor->>BudgetTracker: check_budget()
154
+ alt Budget exceeded
155
+ GraphExecutor->>GraphOrchestrator: emit(error_event)
156
+ else Continue
157
+ GraphExecutor->>GraphOrchestrator: emit(progress_event)
158
+ end
159
+ end
160
+
161
+ GraphOrchestrator->>User: AsyncGenerator[AgentEvent]
+ ```
162
+
163
+ ### Research Team
164
+ Critical Deep Research Agent
165
+
166
+ ## Development
167
+
168
+ ### Run Tests
169
+
170
+ ```bash
171
+ uv run pytest
172
+ ```
173
+
174
+ ### Run Checks
175
+
176
+ ```bash
177
+ make check
178
+ ```
179
+
180
+ ## Architecture
181
+
182
+ DeepCritical uses a Vertical Slice Architecture:
183
+
184
+ 1. **Search Slice**: Retrieving evidence from PubMed, ClinicalTrials.gov, and bioRxiv.
185
+ 2. **Judge Slice**: Evaluating evidence quality using LLMs.
186
+ 3. **Orchestrator Slice**: Managing the research loop and UI.
187
+
188
+ Built with:
189
+ - **PydanticAI**: For robust agent interactions.
190
+ - **Gradio**: For the streaming user interface.
191
+ - **PubMed, ClinicalTrials.gov, bioRxiv**: For biomedical data.
192
+ - **MCP**: For universal tool access.
193
+ - **Modal**: For secure code execution.
194
+
195
+ ## Team
196
+
197
+ - The-Obstacle-Is-The-Way
198
+ - MarioAderman
199
+ - Josephrp
200
+
201
+ ## Links
202
+
203
+ - [GitHub Repository](https://github.com/The-Obstacle-Is-The-Way/DeepCritical-1)
.github/workflows/ci.yml ADDED
@@ -0,0 +1,67 @@
1
+ name: CI
2
+
3
+ on:
4
+ push:
5
+ branches: [main, develop]
6
+ pull_request:
7
+ branches: [main, develop]
8
+
9
+ jobs:
10
+ test:
11
+ runs-on: ubuntu-latest
12
+ strategy:
13
+ matrix:
14
+ python-version: ["3.11"]
15
+
16
+ steps:
17
+ - uses: actions/checkout@v4
18
+
19
+ - name: Set up Python ${{ matrix.python-version }}
20
+ uses: actions/setup-python@v5
21
+ with:
22
+ python-version: ${{ matrix.python-version }}
23
+
24
+ - name: Install dependencies
25
+ run: |
26
+ python -m pip install --upgrade pip
27
+ pip install -e ".[dev]"
28
+
29
+ - name: Lint with ruff
30
+ run: |
31
+ ruff check . --exclude tests
32
+ ruff format --check . --exclude tests
33
+
34
+ - name: Type check with mypy
35
+ run: |
36
+ mypy src
37
+
38
+ - name: Install embedding dependencies
39
+ run: |
40
+ pip install -e ".[embeddings]"
41
+
42
+ - name: Run unit tests (excluding OpenAI and embedding providers)
43
+ env:
44
+ HF_TOKEN: ${{ secrets.HF_TOKEN }}
45
+ run: |
46
+ pytest tests/unit/ -v -m "not openai and not embedding_provider" --tb=short -p no:logfire
47
+
48
+ - name: Run local embeddings tests
49
+ env:
50
+ HF_TOKEN: ${{ secrets.HF_TOKEN }}
51
+ run: |
52
+ pytest tests/ -v -m "local_embeddings" --tb=short -p no:logfire || true
53
+ continue-on-error: true # Allow failures if dependencies not available
54
+
55
+ - name: Run HuggingFace integration tests
56
+ env:
57
+ HF_TOKEN: ${{ secrets.HF_TOKEN }}
58
+ run: |
59
+ pytest tests/integration/ -v -m "huggingface and not embedding_provider" --tb=short -p no:logfire || true
60
+ continue-on-error: true # Allow failures if HF_TOKEN not set
61
+
62
+ - name: Run non-OpenAI integration tests (excluding embedding providers)
63
+ env:
64
+ HF_TOKEN: ${{ secrets.HF_TOKEN }}
65
+ run: |
66
+ pytest tests/integration/ -v -m "integration and not openai and not embedding_provider" --tb=short -p no:logfire || true
67
+ continue-on-error: true # Allow failures if dependencies not available
.gitignore ADDED
@@ -0,0 +1,77 @@
1
+ folder/
2
+ .cursor/
3
+ .ruff_cache/
4
+ # Python
5
+ __pycache__/
6
+ *.py[cod]
7
+ *$py.class
8
+ *.so
9
+ .Python
10
+ build/
11
+ develop-eggs/
12
+ dist/
13
+ downloads/
14
+ eggs/
15
+ .eggs/
16
+ lib/
17
+ lib64/
18
+ parts/
19
+ sdist/
20
+ var/
21
+ wheels/
22
+ *.egg-info/
23
+ .installed.cfg
24
+ *.egg
25
+
26
+ # Virtual environments
27
+ .venv/
28
+ venv/
29
+ ENV/
30
+ env/
31
+
32
+ # IDE
33
+ .vscode/
34
+ .idea/
35
+ *.swp
36
+ *.swo
37
+
38
+ # Environment
39
+ .env
40
+ .env.local
41
+ *.local
42
+
43
+ # Claude
44
+ .claude/
45
+
46
+ # Burner docs (working drafts, not for commit)
47
+ burner_docs/
48
+
49
+ # Reference repos (clone locally, don't commit)
50
+ reference_repos/autogen-microsoft/
51
+ reference_repos/claude-agent-sdk/
52
+ reference_repos/pydanticai-research-agent/
53
+ reference_repos/pubmed-mcp-server/
54
+ reference_repos/DeepCritical/
55
+
56
+ # Keep the README in reference_repos
57
+ !reference_repos/README.md
58
+
59
+ # OS
60
+ .DS_Store
61
+ Thumbs.db
62
+
63
+ # Logs
64
+ *.log
65
+ logs/
66
+
67
+ # Testing
68
+ .pytest_cache/
69
+ .mypy_cache/
70
+ .coverage
71
+ htmlcov/
72
+
73
+ # Database files
74
+ chroma_db/
75
+ *.sqlite3
76
+
77
+ # Trigger rebuild Wed Nov 26 17:51:41 EST 2025
.pre-commit-config.yaml ADDED
@@ -0,0 +1,64 @@
1
+ repos:
2
+ - repo: https://github.com/astral-sh/ruff-pre-commit
3
+ rev: v0.4.4
4
+ hooks:
5
+ - id: ruff
6
+ args: [--fix, --exclude, tests]
7
+ exclude: ^reference_repos/
8
+ - id: ruff-format
9
+ args: [--exclude, tests]
10
+ exclude: ^reference_repos/
11
+
12
+ - repo: https://github.com/pre-commit/mirrors-mypy
13
+ rev: v1.10.0
14
+ hooks:
15
+ - id: mypy
16
+ files: ^src/
17
+ exclude: ^folder
18
+ additional_dependencies:
19
+ - pydantic>=2.7
20
+ - pydantic-settings>=2.2
21
+ - tenacity>=8.2
22
+ - pydantic-ai>=0.0.16
23
+ args: [--ignore-missing-imports]
24
+
25
+ - repo: local
26
+ hooks:
27
+ - id: pytest-unit
28
+ name: pytest unit tests (no OpenAI)
29
+ entry: uv
30
+ language: system
31
+ types: [python]
32
+ args: [
33
+ "run",
34
+ "pytest",
35
+ "tests/unit/",
36
+ "-v",
37
+ "-m",
38
+ "not openai and not embedding_provider",
39
+ "--tb=short",
40
+ "-p",
41
+ "no:logfire",
42
+ ]
43
+ pass_filenames: false
44
+ always_run: true
45
+ require_serial: false
46
+ - id: pytest-local-embeddings
47
+ name: pytest local embeddings tests
48
+ entry: uv
49
+ language: system
50
+ types: [python]
51
+ args: [
52
+ "run",
53
+ "pytest",
54
+ "tests/",
55
+ "-v",
56
+ "-m",
57
+ "local_embeddings",
58
+ "--tb=short",
59
+ "-p",
60
+ "no:logfire",
61
+ ]
62
+ pass_filenames: false
63
+ always_run: true
64
+ require_serial: false
.pre-commit-hooks/run_pytest.ps1 ADDED
@@ -0,0 +1,14 @@
1
+ # PowerShell pytest runner for pre-commit (Windows)
2
+ # Uses uv if available, otherwise falls back to pytest
3
+
4
+ if (Get-Command uv -ErrorAction SilentlyContinue) {
5
+ uv run pytest $args
6
+ } else {
7
+ Write-Warning "uv not found, using system pytest (may have missing dependencies)"
8
+ pytest $args
9
+ }
10
+
11
+
12
+
13
+
14
+
.pre-commit-hooks/run_pytest.sh ADDED
@@ -0,0 +1,15 @@
1
+ #!/bin/bash
2
+ # Cross-platform pytest runner for pre-commit
3
+ # Uses uv if available, otherwise falls back to pytest
4
+
5
+ if command -v uv >/dev/null 2>&1; then
6
+ uv run pytest "$@"
7
+ else
8
+ echo "Warning: uv not found, using system pytest (may have missing dependencies)"
9
+ pytest "$@"
10
+ fi
11
+
12
+
13
+
14
+
15
+
.python-version ADDED
@@ -0,0 +1 @@
1
+ 3.11
AGENTS.txt ADDED
@@ -0,0 +1,236 @@
1
+ # DeepCritical Project - Rules
2
+
3
+ ## Project-Wide Rules
4
+
5
+ **Architecture**: Multi-agent research system using Pydantic AI for agent orchestration, supporting iterative and deep research patterns. Uses middleware for state management, budget tracking, and workflow coordination.
6
+
7
+ **Type Safety**: ALWAYS use complete type hints. All functions must have parameter and return type annotations. Use `mypy --strict` compliance. Use `TYPE_CHECKING` imports for circular dependencies: `from typing import TYPE_CHECKING; if TYPE_CHECKING: from src.services.embeddings import EmbeddingService`
8
+
9
+ **Async Patterns**: ALL I/O operations must be async (`async def`, `await`). Use `asyncio.gather()` for parallel operations. CPU-bound work must use `run_in_executor()`: `loop = asyncio.get_running_loop(); result = await loop.run_in_executor(None, cpu_bound_function, args)`. Never block the event loop.
10
+
11
+ **Error Handling**: Use custom exceptions from `src/utils/exceptions.py`: `DeepCriticalError`, `SearchError`, `RateLimitError`, `JudgeError`, `ConfigurationError`. Always chain exceptions: `raise SearchError(...) from e`. Log with structlog: `logger.error("Operation failed", error=str(e), context=value)`.
12
+
13
+ **Logging**: Use `structlog` for ALL logging (NOT `print` or `logging`). Import: `import structlog; logger = structlog.get_logger()`. Log with structured data: `logger.info("event", key=value)`. Use appropriate levels: DEBUG, INFO, WARNING, ERROR.
14
+
15
+ **Pydantic Models**: All data exchange uses Pydantic models from `src/utils/models.py`. Models are frozen (`model_config = {"frozen": True}`) for immutability. Use `Field()` with descriptions. Validate with `ge=`, `le=`, `min_length=`, `max_length=` constraints.
16
+
17
+ **Code Style**: Ruff with 100-char line length. Ignore rules: `PLR0913` (too many arguments), `PLR0912` (too many branches), `PLR0911` (too many returns), `PLR2004` (magic values), `PLW0603` (global statement), `PLC0415` (lazy imports).
18
+
19
+ **Docstrings**: Google-style docstrings for all public functions. Include Args, Returns, Raises sections. Use type hints in docstrings only if needed for clarity.
20
+
21
+ **Testing**: Unit tests in `tests/unit/` (mocked, fast). Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`). Use `respx` for httpx mocking, `pytest-mock` for general mocking.
22
+
23
+ **State Management**: Use `ContextVar` in middleware for thread-safe isolation. Never use global mutable state (except singletons via `@lru_cache`). Use `WorkflowState` from `src/middleware/state_machine.py` for workflow state.
24
+
25
+ **Citation Validation**: ALWAYS validate references before returning reports. Use `validate_references()` from `src/utils/citation_validator.py`. Remove hallucinated citations. Log warnings for removed citations.
26
+
27
+ ---
28
+
29
+ ## src/agents/ - Agent Implementation Rules
30
+
31
+ **Pattern**: All agents use Pydantic AI `Agent` class. Agents have structured output types (Pydantic models) or return strings. Use factory functions in `src/agent_factory/agents.py` for creation.
32
+
33
+ **Agent Structure**:
34
+ - System prompt as module-level constant (with date injection: `datetime.now().strftime("%Y-%m-%d")`)
35
+ - Agent class with `__init__(model: Any | None = None)`
36
+ - Main method (e.g., `async def evaluate()`, `async def write_report()`)
37
+ - Factory function: `def create_agent_name(model: Any | None = None) -> AgentName`
38
+
39
+ **Model Initialization**: Use `get_model()` from `src/agent_factory/judges.py` if no model provided. Support OpenAI/Anthropic/HF Inference via settings.
40
+
41
+ **Error Handling**: Return fallback values (e.g., `KnowledgeGapOutput(research_complete=False, outstanding_gaps=[...])`) on failure. Log errors with context. Use retry logic (3 retries) in Pydantic AI Agent initialization.
42
+
43
+ **Input Validation**: Validate query/inputs are not empty. Truncate very long inputs with warnings. Handle None values gracefully.
44
+
45
+ **Output Types**: Use structured output types from `src/utils/models.py` (e.g., `KnowledgeGapOutput`, `AgentSelectionPlan`, `ReportDraft`). For text output (writer agents), return `str` directly.
46
+
47
+ **Agent-Specific Rules**:
48
+ - `knowledge_gap.py`: Outputs `KnowledgeGapOutput`. Evaluates research completeness.
49
+ - `tool_selector.py`: Outputs `AgentSelectionPlan`. Selects tools (RAG/web/database).
50
+ - `writer.py`: Returns markdown string. Includes citations in numbered format.
51
+ - `long_writer.py`: Uses `ReportDraft` input/output. Handles section-by-section writing.
52
+ - `proofreader.py`: Takes `ReportDraft`, returns polished markdown.
53
+ - `thinking.py`: Returns observation string from conversation history.
54
+ - `input_parser.py`: Outputs `ParsedQuery` with research mode detection.
55
+
56
+ ---
57
+
58
+ ## src/tools/ - Search Tool Rules
59
+
60
+ **Protocol**: All tools implement `SearchTool` protocol from `src/tools/base.py`: `name` property and `async def search(query, max_results) -> list[Evidence]`.
61
+
62
+ **Rate Limiting**: Use `@retry` decorator from tenacity: `@retry(stop=stop_after_attempt(3), wait=wait_exponential(...))`. Implement `_rate_limit()` method for APIs with limits. Use shared rate limiters from `src/tools/rate_limiter.py`.
63
+
64
+ **Error Handling**: Raise `SearchError` or `RateLimitError` on failures. Handle HTTP errors (429, 500, timeout). Return empty list on non-critical errors (log warning).
65
+
66
+ **Query Preprocessing**: Use `preprocess_query()` from `src/tools/query_utils.py` to remove noise and expand synonyms.
67
+
68
+ **Evidence Conversion**: Convert API responses to `Evidence` objects with `Citation`. Extract metadata (title, url, date, authors). Set relevance scores (0.0-1.0). Handle missing fields gracefully.
69
+
70
+ **Tool-Specific Rules**:
71
+ - `pubmed.py`: Use NCBI E-utilities (ESearch → EFetch). Rate limit: 0.34s between requests. Parse XML with `xmltodict`. Handle single vs. multiple articles.
72
+ - `clinicaltrials.py`: Use `requests` library (NOT httpx - WAF blocks httpx). Run in thread pool: `await asyncio.to_thread(requests.get, ...)`. Filter: Only interventional studies, active/completed.
73
+ - `europepmc.py`: Handle preprint markers: `[PREPRINT - Not peer-reviewed]`. Build URLs from DOI or PMID.
74
+ - `rag_tool.py`: Wraps `LlamaIndexRAGService`. Returns Evidence from RAG results. Handles ingestion.
75
+ - `search_handler.py`: Orchestrates parallel searches across multiple tools. Uses `asyncio.gather()` with `return_exceptions=True`. Aggregates results into `SearchResult`.
76
+
77
+ ---
78
+
79
+ ## src/middleware/ - Middleware Rules
80
+
81
+ **State Management**: Use `ContextVar` for thread-safe isolation. `WorkflowState` uses `ContextVar[WorkflowState | None]`. Initialize with `init_workflow_state(embedding_service)`. Access with `get_workflow_state()` (auto-initializes if missing).
82
+
83
+ **WorkflowState**: Tracks `evidence: list[Evidence]`, `conversation: Conversation`, `embedding_service: Any`. Methods: `add_evidence()` (deduplicates by URL), `async search_related()` (semantic search).
84
+
85
+ **WorkflowManager**: Manages parallel research loops. Methods: `add_loop()`, `run_loops_parallel()`, `update_loop_status()`, `sync_loop_evidence_to_state()`. Uses `asyncio.gather()` for parallel execution. Handles errors per loop (don't fail all if one fails).
86
+
87
+ **BudgetTracker**: Tracks tokens, time, iterations per loop and globally. Methods: `create_budget()`, `add_tokens()`, `start_timer()`, `update_timer()`, `increment_iteration()`, `check_budget()`, `can_continue()`. Token estimation: `estimate_tokens(text)` (~4 chars per token), `estimate_llm_call_tokens(prompt, response)`.
88
+
89
+ **Models**: All middleware models in `src/utils/models.py`. `IterationData`, `Conversation`, `ResearchLoop`, `BudgetStatus` are used by middleware.
90
+
91
+ ---
92
+
93
+ ## src/orchestrator/ - Orchestration Rules
94
+
95
+ **Research Flows**: Two patterns: `IterativeResearchFlow` (single loop) and `DeepResearchFlow` (plan → parallel loops → synthesis). Both support agent chains (`use_graph=False`) and graph execution (`use_graph=True`).
96
+
97
+ **IterativeResearchFlow**: Pattern: Generate observations → Evaluate gaps → Select tools → Execute → Judge → Continue/Complete. Uses `KnowledgeGapAgent`, `ToolSelectorAgent`, `ThinkingAgent`, `WriterAgent`, `JudgeHandler`. Tracks iterations, time, budget.
98
+
99
+ **DeepResearchFlow**: Pattern: Planner → Parallel iterative loops per section → Synthesizer. Uses `PlannerAgent`, `IterativeResearchFlow` (per section), `LongWriterAgent` or `ProofreaderAgent`. Uses `WorkflowManager` for parallel execution.
100
+
101
+ **Graph Orchestrator**: Uses Pydantic AI Graphs (when available) or agent chains (fallback). Routes based on research mode (iterative/deep/auto). Streams `AgentEvent` objects for UI.
102
+
103
+ **State Initialization**: Always call `init_workflow_state()` before running flows. Initialize `BudgetTracker` per loop. Use `WorkflowManager` for parallel coordination.
104
+
105
+ **Event Streaming**: Yield `AgentEvent` objects during execution. Event types: "started", "search_complete", "judge_complete", "hypothesizing", "synthesizing", "complete", "error". Include iteration numbers and data payloads.
106
+
107
+ ---
108
+
109
+ ## src/services/ - Service Rules
110
+
111
+ **EmbeddingService**: Local sentence-transformers (NO API key required). All operations async-safe via `run_in_executor()`. ChromaDB for vector storage. Deduplication threshold: 0.85 (85% similarity = duplicate).
112
+
113
+ **LlamaIndexRAGService**: Uses OpenAI embeddings (requires `OPENAI_API_KEY`). Methods: `ingest_evidence()`, `retrieve()`, `query()`. Returns documents with metadata (source, title, url, date, authors). Lazy initialization with graceful fallback.
114
+
115
+ **StatisticalAnalyzer**: Generates Python code via LLM. Executes in Modal sandbox (secure, isolated). Library versions pinned in `SANDBOX_LIBRARIES` dict. Returns `AnalysisResult` with verdict (SUPPORTED/REFUTED/INCONCLUSIVE).
116
+
117
+ **Singleton Pattern**: Use `@lru_cache(maxsize=1)` for singletons: `@lru_cache(maxsize=1); def get_service() -> Service: return Service()`. Lazy initialization to avoid requiring dependencies at import time.
118
+
119
+ ---
120
+
121
+ ## src/utils/ - Utility Rules
122
+
123
+ **Models**: All Pydantic models in `src/utils/models.py`. Use frozen models (`model_config = {"frozen": True}`) except where mutation needed. Use `Field()` with descriptions. Validate with constraints.
124
+
125
+ **Config**: Settings via Pydantic Settings (`src/utils/config.py`). Load from `.env` automatically. Use `settings` singleton: `from src.utils.config import settings`. Validate API keys with properties: `has_openai_key`, `has_anthropic_key`.
126
+
127
+ **Exceptions**: Custom exception hierarchy in `src/utils/exceptions.py`. Base: `DeepCriticalError`. Specific: `SearchError`, `RateLimitError`, `JudgeError`, `ConfigurationError`. Always chain exceptions.
128
+
129
+ **LLM Factory**: Centralized LLM model creation in `src/utils/llm_factory.py`. Supports OpenAI, Anthropic, HF Inference. Use `get_model()` or factory functions. Check requirements before initialization.
130
+
131
+ **Citation Validator**: Use `validate_references()` from `src/utils/citation_validator.py`. Removes hallucinated citations (URLs not in evidence). Logs warnings. Returns validated report string.
132
+
133
+ ---
134
+
135
+ ## src/orchestrator_factory.py Rules
136
+
137
+ **Purpose**: Factory for creating orchestrators. Supports "simple" (legacy) and "advanced" (magentic) modes. Auto-detects mode based on API key availability.
138
+
139
+ **Pattern**: Lazy import for optional dependencies (`_get_magentic_orchestrator_class()`). Handles `ImportError` gracefully with clear error messages.
140
+
141
+ **Mode Detection**: `_determine_mode()` checks explicit mode or auto-detects: "advanced" if `settings.has_openai_key`, else "simple". Maps "magentic" → "advanced".
142
+
143
+ **Function Signature**: `create_orchestrator(search_handler, judge_handler, config, mode) -> Any`. Simple mode requires handlers. Advanced mode uses MagenticOrchestrator.
144
+
145
+ **Error Handling**: Raise `ValueError` with clear messages if requirements not met. Log mode selection with structlog.
146
+
147
+ ---
148
+
149
+ ## src/orchestrator_hierarchical.py Rules
150
+
151
+ **Purpose**: Hierarchical orchestrator using middleware and sub-teams. Adapts Magentic ChatAgent to SubIterationTeam protocol.
152
+
153
+ **Pattern**: Uses `SubIterationMiddleware` with `ResearchTeam` and `LLMSubIterationJudge`. Event-driven via callback queue.
154
+
155
+ **State Initialization**: Initialize embedding service with graceful fallback. Use `init_magentic_state()` (deprecated, but kept for compatibility).
156
+
157
+ **Event Streaming**: Uses `asyncio.Queue` for event coordination. Yields `AgentEvent` objects. Handles event callback pattern with `asyncio.wait()`.
158
+
159
+ **Error Handling**: Log errors with context. Yield error events. Process remaining events after task completion.
160
+
161
+ ---
162
+
163
+ ## src/orchestrator_magentic.py Rules
164
+
165
+ **Purpose**: Magentic-based orchestrator using ChatAgent pattern. Each agent has internal LLM. Manager orchestrates agents.
166
+
167
+ **Pattern**: Uses `MagenticBuilder` with participants (searcher, hypothesizer, judge, reporter). Manager uses `OpenAIChatClient`. Workflow built in `_build_workflow()`.
168
+
169
+ **Event Processing**: `_process_event()` converts Magentic events to `AgentEvent`. Handles: `MagenticOrchestratorMessageEvent`, `MagenticAgentMessageEvent`, `MagenticFinalResultEvent`, `MagenticAgentDeltaEvent`, `WorkflowOutputEvent`.
170
+
171
+ **Text Extraction**: `_extract_text()` defensively extracts text from messages. Priority: `.content` → `.text` → `str(message)`. Handles buggy message objects.
172
+
173
+ **State Initialization**: Initialize embedding service with graceful fallback. Use `init_magentic_state()` (deprecated).
174
+
175
+ **Requirements**: Must call `check_magentic_requirements()` in `__init__`. Requires `agent-framework-core` and OpenAI API key.
176
+
177
+ **Event Types**: Maps agent names to event types: "search" → "search_complete", "judge" → "judge_complete", "hypothes" → "hypothesizing", "report" → "synthesizing".
178
+
179
+ ---
180
+
181
+ ## src/agent_factory/ - Factory Rules
182
+
183
+ **Pattern**: Factory functions for creating agents and handlers. Lazy initialization for optional dependencies. Support OpenAI/Anthropic/HF Inference.
184
+
185
+ **Judges**: `create_judge_handler()` creates `JudgeHandler` with structured output (`JudgeAssessment`). Supports `MockJudgeHandler`, `HFInferenceJudgeHandler` as fallbacks.
186
+
187
+ **Agents**: Factory functions in `agents.py` for all Pydantic AI agents. Pattern: `create_agent_name(model: Any | None = None) -> AgentName`. Use `get_model()` if model not provided.
188
+
189
+ **Graph Builder**: `graph_builder.py` contains utilities for building research graphs. Supports iterative and deep research graph construction.
190
+
191
+ **Error Handling**: Raise `ConfigurationError` if required API keys missing. Log agent creation. Handle import errors gracefully.
192
+
193
+ ---
194
+
195
+ ## src/prompts/ - Prompt Rules
196
+
197
+ **Pattern**: System prompts stored as module-level constants. Include date injection: `datetime.now().strftime("%Y-%m-%d")`. Format evidence with truncation (1500 chars per item).
198
+
199
+ **Judge Prompts**: In `judge.py`. Handle empty evidence case separately. Always request structured JSON output.
200
+
201
+ **Hypothesis Prompts**: In `hypothesis.py`. Use diverse evidence selection (MMR algorithm). Sentence-aware truncation.
202
+
203
+ **Report Prompts**: In `report.py`. Include full citation details. Use diverse evidence selection (n=20). Emphasize citation validation rules.
204
+
205
+ ---
206
+
207
+ ## Testing Rules
208
+
209
+ **Structure**: Unit tests in `tests/unit/` (mocked, fast). Integration tests in `tests/integration/` (real APIs, marked `@pytest.mark.integration`).
210
+
211
+ **Mocking**: Use `respx` for httpx mocking. Use `pytest-mock` for general mocking. Mock LLM calls in unit tests (use `MockJudgeHandler`).
212
+
213
+ **Fixtures**: Common fixtures in `tests/conftest.py`: `mock_httpx_client`, `mock_llm_response`.
214
+
215
+ **Coverage**: Aim for >80% coverage. Test error handling, edge cases, and integration paths.
216
+
217
+ ---
218
+
219
+ ## File-Specific Agent Rules
220
+
221
+ **knowledge_gap.py**: Outputs `KnowledgeGapOutput`. System prompt evaluates research completeness. Handles conversation history. Returns fallback on error.
222
+
223
+ **writer.py**: Returns markdown string. System prompt includes citation format examples. Validates inputs. Truncates long findings. Retry logic for transient failures.
224
+
225
+ **long_writer.py**: Uses `ReportDraft` input/output. Writes sections iteratively. Reformats references (deduplicates, renumbers). Reformats section headings.
226
+
227
+ **proofreader.py**: Takes `ReportDraft`, returns polished markdown. Removes duplicates. Adds summary. Preserves references.
228
+
229
+ **tool_selector.py**: Outputs `AgentSelectionPlan`. System prompt lists available agents (WebSearchAgent, SiteCrawlerAgent, RAGAgent). Guidelines for when to use each.
230
+
231
+ **thinking.py**: Returns observation string. Generates observations from conversation history. Uses query and background context.
232
+
233
+ **input_parser.py**: Outputs `ParsedQuery`. Detects research mode (iterative/deep). Extracts entities and research questions. Improves/refines query.
234
+
235
+
236
+
CONTRIBUTING.md ADDED
@@ -0,0 +1 @@
1
+ Make sure you run the full pre-commit checks before opening a PR (not a draft); otherwise Obstacle is the Way will lose his mind.
Dockerfile ADDED
@@ -0,0 +1,52 @@
1
+ # Dockerfile for DeepCritical
2
+ FROM python:3.11-slim
3
+
4
+ # Set working directory
5
+ WORKDIR /app
6
+
7
+ # Install system dependencies (curl needed for HEALTHCHECK)
8
+ RUN apt-get update && apt-get install -y \
9
+ git \
10
+ curl \
11
+ && rm -rf /var/lib/apt/lists/*
12
+
13
+ # Install uv
14
+ RUN pip install uv==0.5.4
15
+
16
+ # Copy project files
17
+ COPY pyproject.toml .
18
+ COPY uv.lock .
19
+ COPY src/ src/
20
+ COPY README.md .
21
+
22
+ # Install runtime dependencies only (no dev/test tools)
23
+ RUN uv sync --frozen --no-dev --extra embeddings --extra magentic
24
+
25
+ # Create non-root user BEFORE downloading models
26
+ RUN useradd --create-home --shell /bin/bash appuser
27
+
28
+ # Set cache directory for HuggingFace models (must be writable by appuser)
29
+ ENV HF_HOME=/app/.cache
30
+ ENV TRANSFORMERS_CACHE=/app/.cache
31
+
32
+ # Create cache dir with correct ownership
33
+ RUN mkdir -p /app/.cache && chown -R appuser:appuser /app/.cache
34
+
35
+ # Pre-download the embedding model during build (as appuser to set correct ownership)
36
+ USER appuser
37
+ RUN uv run python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
38
+
39
+ # Expose port
40
+ EXPOSE 7860
41
+
42
+ # Health check
43
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
44
+ CMD curl -f http://localhost:7860/ || exit 1
45
+
46
+ # Set environment variables
47
+ ENV GRADIO_SERVER_NAME=0.0.0.0
48
+ ENV GRADIO_SERVER_PORT=7860
49
+ ENV PYTHONPATH=/app
50
+
51
+ # Run the app
52
+ CMD ["uv", "run", "python", "-m", "src.app"]
Makefile ADDED
@@ -0,0 +1,42 @@
1
+ .PHONY: install test test-hf test-all test-cov lint format typecheck check clean all cov cov-html
2
+
3
+ # Default target
4
+ all: check
5
+
6
+ install:
7
+ uv sync --all-extras
8
+ uv run pre-commit install
9
+
10
+ test:
11
+ uv run pytest tests/unit/ -v -m "not openai" -p no:logfire
12
+
13
+ test-hf:
14
+ uv run pytest tests/ -v -m "huggingface" -p no:logfire
15
+
16
+ test-all:
17
+ uv run pytest tests/ -v -p no:logfire
18
+
19
+ # Coverage aliases
20
+ cov: test-cov
21
+ test-cov:
22
+ uv run pytest --cov=src --cov-report=term-missing -m "not openai" -p no:logfire
23
+
24
+ cov-html:
25
+ uv run pytest --cov=src --cov-report=html -p no:logfire
26
+ @echo "Coverage report: open htmlcov/index.html"
27
+
28
+ lint:
29
+ uv run ruff check src tests
30
+
31
+ format:
32
+ uv run ruff format src tests
33
+
34
+ typecheck:
35
+ uv run mypy src
36
+
37
+ check: lint typecheck test-cov
38
+ @echo "All checks passed!"
39
+
40
+ clean:
41
+ rm -rf .pytest_cache .mypy_cache .ruff_cache __pycache__ .coverage htmlcov
42
+ find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true
README.md ADDED
@@ -0,0 +1,196 @@
1
+ ---
2
+ title: DeepCritical
3
+ emoji: 🧬
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: gradio
7
+ sdk_version: "6.0.1"
8
+ python_version: "3.11"
9
+ app_file: src/app.py
10
+ pinned: false
11
+ license: mit
12
+ tags:
13
+ - mcp-in-action-track-enterprise
14
+ - mcp-hackathon
15
+ - drug-repurposing
16
+ - biomedical-ai
17
+ - pydantic-ai
18
+ - llamaindex
19
+ - modal
20
+ ---
21
+
22
+ # DeepCritical
23
+
24
+ ## Intro
25
+
26
+ ## Features
27
+
28
+ - **Multi-Source Search**: PubMed, ClinicalTrials.gov, bioRxiv/medRxiv
29
+ - **MCP Integration**: Use our tools from Claude Desktop or any MCP client
30
+ - **Modal Sandbox**: Secure execution of AI-generated statistical code
31
+ - **LlamaIndex RAG**: Semantic search and evidence synthesis
32
+ - **HuggingFace Inference**: LLM and embedding inference via the HuggingFace Inference API
33
+ - **HuggingFace MCP**: Custom configuration for using community MCP tools
34
+ - **Strongly Typed Composable Graphs**: Pydantic-typed research graphs for iterative and deep research
35
+ - **Specialized Research Teams of Agents**: Dedicated agents for searching, judging, and report writing
36
+
37
+ ## Quick Start
38
+
39
+ ### 1. Environment Setup
40
+
41
+ ```bash
42
+ # Install uv if you haven't already
43
+ pip install uv
44
+
45
+ # Sync dependencies
46
+ uv sync
47
+ ```
48
+
49
+ ### 2. Run the UI
50
+
51
+ ```bash
52
+ # Start the Gradio app
53
+ uv run python -m src.app
54
+ ```
55
+
56
+ Open your browser to `http://localhost:7860`.
57
+
58
+ ### 3. Connect via MCP
59
+
60
+ This application exposes a Model Context Protocol (MCP) server, allowing you to use its search tools directly from Claude Desktop or other MCP clients.
61
+
62
+ **MCP Server URL**: `http://localhost:7860/gradio_api/mcp/`
63
+
64
+ **Claude Desktop Configuration**:
65
+ Add this to your `claude_desktop_config.json`:
66
+ ```json
67
+ {
68
+ "mcpServers": {
69
+ "deepcritical": {
70
+ "url": "http://localhost:7860/gradio_api/mcp/"
71
+ }
72
+ }
73
+ }
74
+ ```
75
+
76
+ **Available Tools**:
77
+ - `search_pubmed`: Search peer-reviewed biomedical literature.
78
+ - `search_clinical_trials`: Search ClinicalTrials.gov.
79
+ - `search_biorxiv`: Search bioRxiv/medRxiv preprints.
80
+ - `search_all`: Search all sources simultaneously.
81
+ - `analyze_hypothesis`: Secure statistical analysis using Modal sandboxes.
82
+
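+ For reference, these tools are served by the Gradio app itself; a minimal sketch of how that is typically wired up (assumes the `gradio[mcp]` extra and the `mcp_server` launch flag; the real tool implementations live in `src/app.py`):
+
+ ```python
+ import gradio as gr
+
+ def search_pubmed(query: str, max_results: int = 10) -> str:
+     """Search peer-reviewed biomedical literature."""  # the docstring becomes the MCP tool description
+     return f"(results for {query!r} would appear here)"
+
+ demo = gr.Interface(fn=search_pubmed, inputs=["text", "number"], outputs="text")
+ demo.launch(mcp_server=True)  # exposes the tools under /gradio_api/mcp/
+ ```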
83
+
84
+
85
+ ## Architecture
86
+
87
+ DeepCritical uses a Vertical Slice Architecture:
88
+
89
+ 1. **Search Slice**: Retrieving evidence from PubMed, ClinicalTrials.gov, and bioRxiv.
90
+ 2. **Judge Slice**: Evaluating evidence quality using LLMs.
91
+ 3. **Orchestrator Slice**: Managing the research loop and UI.
92
+
93
+ - iterativeResearch
94
+ - deepResearch
95
+ - researchTeam
96
+
97
+ ### Iterative Research
98
+
99
+ ```mermaid
+ sequenceDiagram
100
+ participant IterativeFlow
101
+ participant ThinkingAgent
102
+ participant KnowledgeGapAgent
103
+ participant ToolSelector
104
+ participant ToolExecutor
105
+ participant JudgeHandler
106
+ participant WriterAgent
107
+
108
+ IterativeFlow->>IterativeFlow: run(query)
109
+
110
+ loop Until complete or max_iterations
111
+ IterativeFlow->>ThinkingAgent: generate_observations()
112
+ ThinkingAgent-->>IterativeFlow: observations
113
+
114
+ IterativeFlow->>KnowledgeGapAgent: evaluate_gaps()
115
+ KnowledgeGapAgent-->>IterativeFlow: KnowledgeGapOutput
116
+
117
+ alt Research complete
118
+ IterativeFlow->>WriterAgent: create_final_report()
119
+ WriterAgent-->>IterativeFlow: final_report
120
+ else Gaps remain
121
+ IterativeFlow->>ToolSelector: select_agents(gap)
122
+ ToolSelector-->>IterativeFlow: AgentSelectionPlan
123
+
124
+ IterativeFlow->>ToolExecutor: execute_tool_tasks()
125
+ ToolExecutor-->>IterativeFlow: ToolAgentOutput[]
126
+
127
+ IterativeFlow->>JudgeHandler: assess_evidence()
128
+ JudgeHandler-->>IterativeFlow: should_continue
129
+ end
130
+ end
+ ```
131
+
132
+
133
+ ### Deep Research
134
+
135
+ ```mermaid
+ sequenceDiagram
136
+ actor User
137
+ participant GraphOrchestrator
138
+ participant InputParser
139
+ participant GraphBuilder
140
+ participant GraphExecutor
141
+ participant Agent
142
+ participant BudgetTracker
143
+ participant WorkflowState
144
+
145
+ User->>GraphOrchestrator: run(query)
146
+ GraphOrchestrator->>InputParser: detect_research_mode(query)
147
+ InputParser-->>GraphOrchestrator: mode (iterative/deep)
148
+ GraphOrchestrator->>GraphBuilder: build_graph(mode)
149
+ GraphBuilder-->>GraphOrchestrator: ResearchGraph
150
+ GraphOrchestrator->>WorkflowState: init_workflow_state()
151
+ GraphOrchestrator->>BudgetTracker: create_budget()
152
+ GraphOrchestrator->>GraphExecutor: _execute_graph(graph)
153
+
154
+ loop For each node in graph
155
+ GraphExecutor->>Agent: execute_node(agent_node)
156
+ Agent->>Agent: process_input
157
+ Agent-->>GraphExecutor: result
158
+ GraphExecutor->>WorkflowState: update_state(result)
159
+ GraphExecutor->>BudgetTracker: add_tokens(used)
160
+ GraphExecutor->>BudgetTracker: check_budget()
161
+ alt Budget exceeded
162
+ GraphExecutor->>GraphOrchestrator: emit(error_event)
163
+ else Continue
164
+ GraphExecutor->>GraphOrchestrator: emit(progress_event)
165
+ end
166
+ end
167
+
168
+ GraphOrchestrator->>User: AsyncGenerator[AgentEvent]
+ ```
169
+
170
+ ### Research Team
171
+
172
+ Critical Deep Research Agent
173
+
174
+ ## Development
175
+
176
+ ### Run Tests
177
+
178
+ ```bash
179
+ uv run pytest
180
+ ```
181
+
182
+ ### Run Checks
183
+
184
+ ```bash
185
+ make check
186
+ ```
187
+
188
+ ## Join Us
189
+
190
+ - The-Obstacle-Is-The-Way
191
+ - MarioAderman
192
+ - Josephrp
193
+
194
+ ## Links
195
+
196
+ - [GitHub Repository](https://github.com/The-Obstacle-Is-The-Way/DeepCritical-1)
docs/CONFIGURATION.md ADDED
@@ -0,0 +1,301 @@
1
+ # Configuration Guide
2
+
3
+ ## Overview
4
+
5
+ DeepCritical uses **Pydantic Settings** for centralized configuration management. All settings are defined in `src/utils/config.py` and can be configured via environment variables or a `.env` file.
6
+
7
+ ## Quick Start
8
+
9
+ 1. Copy the example environment file (if available) or create a `.env` file in the project root
10
+ 2. Set at least one LLM API key (`OPENAI_API_KEY` or `ANTHROPIC_API_KEY`)
11
+ 3. Optionally configure other services as needed
12
+
13
+ ## Configuration System
14
+
15
+ ### How It Works
16
+
17
+ - **Settings Class**: `Settings` class in `src/utils/config.py` extends `BaseSettings` from `pydantic_settings`
18
+ - **Environment File**: Automatically loads from `.env` file (if present)
19
+ - **Environment Variables**: Reads from environment variables (case-insensitive)
20
+ - **Type Safety**: Strongly-typed fields with validation
21
+ - **Singleton Pattern**: Global `settings` instance for easy access
22
+
23
+ ### Usage
24
+
25
+ ```python
26
+ from src.utils.config import settings
27
+
28
+ # Check if API keys are available
29
+ if settings.has_openai_key:
30
+ # Use OpenAI
31
+ pass
32
+
33
+ # Access configuration values
34
+ max_iterations = settings.max_iterations
35
+ web_search_provider = settings.web_search_provider
36
+ ```
37
+
38
+ ## Required Configuration
39
+
40
+ ### At Least One LLM Provider
41
+
42
+ You must configure at least one LLM provider:
43
+
44
+ **OpenAI:**
45
+ ```bash
46
+ LLM_PROVIDER=openai
47
+ OPENAI_API_KEY=your_openai_api_key_here
48
+ OPENAI_MODEL=gpt-5.1
49
+ ```
50
+
51
+ **Anthropic:**
52
+ ```bash
53
+ LLM_PROVIDER=anthropic
54
+ ANTHROPIC_API_KEY=your_anthropic_api_key_here
55
+ ANTHROPIC_MODEL=claude-sonnet-4-5-20250929
56
+ ```
57
+
58
+ ## Optional Configuration
59
+
60
+ ### Embedding Configuration
61
+
62
+ ```bash
63
+ # Embedding Provider: "openai", "local", or "huggingface"
64
+ EMBEDDING_PROVIDER=local
65
+
66
+ # OpenAI Embedding Model (used by LlamaIndex RAG)
67
+ OPENAI_EMBEDDING_MODEL=text-embedding-3-small
68
+
69
+ # Local Embedding Model (sentence-transformers)
70
+ LOCAL_EMBEDDING_MODEL=all-MiniLM-L6-v2
71
+
72
+ # HuggingFace Embedding Model
73
+ HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
74
+ ```
75
+
76
+ ### HuggingFace Configuration
77
+
78
+ ```bash
79
+ # HuggingFace API Token (for inference API)
80
+ HUGGINGFACE_API_KEY=your_huggingface_api_key_here
81
+ # Or use HF_TOKEN (alternative name)
82
+
83
+ # Default HuggingFace Model ID
84
+ HUGGINGFACE_MODEL=meta-llama/Llama-3.1-8B-Instruct
85
+ ```
86
+
87
+ ### Web Search Configuration
88
+
89
+ ```bash
90
+ # Web Search Provider: "serper", "searchxng", "brave", "tavily", or "duckduckgo"
91
+ # Default: "duckduckgo" (no API key required)
92
+ WEB_SEARCH_PROVIDER=duckduckgo
93
+
94
+ # Serper API Key (for Google search via Serper)
95
+ SERPER_API_KEY=your_serper_api_key_here
96
+
97
+ # SearchXNG Host URL
98
+ SEARCHXNG_HOST=http://localhost:8080
99
+
100
+ # Brave Search API Key
101
+ BRAVE_API_KEY=your_brave_api_key_here
102
+
103
+ # Tavily API Key
104
+ TAVILY_API_KEY=your_tavily_api_key_here
105
+ ```
106
+
107
+ ### PubMed Configuration
108
+
109
+ ```bash
110
+ # NCBI API Key (optional, for higher rate limits: 10 req/sec vs 3 req/sec)
111
+ NCBI_API_KEY=your_ncbi_api_key_here
112
+ ```
113
+
114
+ ### Agent Configuration
115
+
116
+ ```bash
117
+ # Maximum iterations per research loop
118
+ MAX_ITERATIONS=10
119
+
120
+ # Search timeout in seconds
121
+ SEARCH_TIMEOUT=30
122
+
123
+ # Use graph-based execution for research flows
124
+ USE_GRAPH_EXECUTION=false
125
+ ```
126
+
127
+ ### Budget & Rate Limiting Configuration
128
+
129
+ ```bash
130
+ # Default token budget per research loop
131
+ DEFAULT_TOKEN_LIMIT=100000
132
+
133
+ # Default time limit per research loop (minutes)
134
+ DEFAULT_TIME_LIMIT_MINUTES=10
135
+
136
+ # Default iterations limit per research loop
137
+ DEFAULT_ITERATIONS_LIMIT=10
138
+ ```
139
+
140
+ ### RAG Service Configuration
141
+
142
+ ```bash
143
+ # ChromaDB collection name for RAG
144
+ RAG_COLLECTION_NAME=deepcritical_evidence
145
+
146
+ # Number of top results to retrieve from RAG
147
+ RAG_SIMILARITY_TOP_K=5
148
+
149
+ # Automatically ingest evidence into RAG
150
+ RAG_AUTO_INGEST=true
151
+ ```
152
+
153
+ ### ChromaDB Configuration
154
+
155
+ ```bash
156
+ # ChromaDB storage path
157
+ CHROMA_DB_PATH=./chroma_db
158
+
159
+ # Whether to persist ChromaDB to disk
160
+ CHROMA_DB_PERSIST=true
161
+
162
+ # ChromaDB server host (for remote ChromaDB, optional)
163
+ # CHROMA_DB_HOST=localhost
164
+
165
+ # ChromaDB server port (for remote ChromaDB, optional)
166
+ # CHROMA_DB_PORT=8000
167
+ ```
168
+
169
+ ### External Services
170
+
171
+ ```bash
172
+ # Modal Token ID (for Modal sandbox execution)
173
+ MODAL_TOKEN_ID=your_modal_token_id_here
174
+
175
+ # Modal Token Secret
176
+ MODAL_TOKEN_SECRET=your_modal_token_secret_here
177
+ ```
178
+
179
+ ### Logging Configuration
180
+
181
+ ```bash
182
+ # Log Level: "DEBUG", "INFO", "WARNING", or "ERROR"
183
+ LOG_LEVEL=INFO
184
+ ```
185
+
186
+ ## Configuration Properties
187
+
188
+ The `Settings` class provides helpful properties for checking configuration:
189
+
190
+ ```python
191
+ from src.utils.config import settings
192
+
193
+ # Check API key availability
194
+ settings.has_openai_key # bool
195
+ settings.has_anthropic_key # bool
196
+ settings.has_huggingface_key # bool
197
+ settings.has_any_llm_key # bool
198
+
199
+ # Check service availability
200
+ settings.modal_available # bool
201
+ settings.web_search_available # bool
202
+ ```
203
+
204
+ ## Environment Variables Reference
205
+
206
+ ### Required (at least one LLM)
207
+ - `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` - At least one LLM provider key
208
+
209
+ ### Optional LLM Providers
210
+ - `DEEPSEEK_API_KEY` (Phase 2)
211
+ - `OPENROUTER_API_KEY` (Phase 2)
212
+ - `GEMINI_API_KEY` (Phase 2)
213
+ - `PERPLEXITY_API_KEY` (Phase 2)
214
+ - `HUGGINGFACE_API_KEY` or `HF_TOKEN`
215
+ - `AZURE_OPENAI_ENDPOINT` (Phase 2)
216
+ - `AZURE_OPENAI_DEPLOYMENT` (Phase 2)
217
+ - `AZURE_OPENAI_API_KEY` (Phase 2)
218
+ - `AZURE_OPENAI_API_VERSION` (Phase 2)
219
+ - `LOCAL_MODEL_URL` (Phase 2)
220
+
221
+ ### Web Search
222
+ - `WEB_SEARCH_PROVIDER` (default: "duckduckgo")
223
+ - `SERPER_API_KEY`
224
+ - `SEARCHXNG_HOST`
225
+ - `BRAVE_API_KEY`
226
+ - `TAVILY_API_KEY`
227
+
228
+ ### Embeddings
229
+ - `EMBEDDING_PROVIDER` (default: "local")
230
+ - `HUGGINGFACE_EMBEDDING_MODEL` (optional)
231
+
232
+ ### RAG
233
+ - `RAG_COLLECTION_NAME` (default: "deepcritical_evidence")
234
+ - `RAG_SIMILARITY_TOP_K` (default: 5)
235
+ - `RAG_AUTO_INGEST` (default: true)
236
+
237
+ ### ChromaDB
238
+ - `CHROMA_DB_PATH` (default: "./chroma_db")
239
+ - `CHROMA_DB_PERSIST` (default: true)
240
+ - `CHROMA_DB_HOST` (optional)
241
+ - `CHROMA_DB_PORT` (optional)
242
+
243
+ ### Budget
244
+ - `DEFAULT_TOKEN_LIMIT` (default: 100000)
245
+ - `DEFAULT_TIME_LIMIT_MINUTES` (default: 10)
246
+ - `DEFAULT_ITERATIONS_LIMIT` (default: 10)
247
+
248
+ ### Other
249
+ - `LLM_PROVIDER` (default: "openai")
250
+ - `NCBI_API_KEY` (optional)
251
+ - `MODAL_TOKEN_ID` (optional)
252
+ - `MODAL_TOKEN_SECRET` (optional)
253
+ - `MAX_ITERATIONS` (default: 10)
254
+ - `LOG_LEVEL` (default: "INFO")
255
+ - `USE_GRAPH_EXECUTION` (default: false)
256
+
257
+ ## Validation
258
+
259
+ Settings are validated on load using Pydantic validation (a field-level sketch follows the list below):
260
+
261
+ - **Type checking**: All fields are strongly typed
262
+ - **Range validation**: Numeric fields have min/max constraints
263
+ - **Literal validation**: Enum fields only accept specific values
264
+ - **Required fields**: API keys are checked when accessed via `get_api_key()`
265
+
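+ For illustration, a field-level sketch of what this looks like in the `Settings` class (the actual field definitions live in `src/utils/config.py`; the fields below are representative):
+
+ ```python
+ from typing import Literal
+
+ from pydantic import Field
+ from pydantic_settings import BaseSettings, SettingsConfigDict
+
+ class Settings(BaseSettings):
+     model_config = SettingsConfigDict(env_file=".env", case_sensitive=False)
+
+     # Range-validated numeric field
+     max_iterations: int = Field(default=10, ge=1, le=50)
+     # Literal-validated enum field
+     llm_provider: Literal["openai", "anthropic"] = "openai"
+     # Optional secret, only checked when get_api_key() is called
+     openai_api_key: str | None = None
+ ```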
266
+ ## Error Handling
267
+
268
+ Configuration errors raise `ConfigurationError`:
269
+
270
+ ```python
271
+ from src.utils.config import settings
272
+ from src.utils.exceptions import ConfigurationError
273
+
274
+ try:
275
+ api_key = settings.get_api_key()
276
+ except ConfigurationError as e:
277
+ print(f"Configuration error: {e}")
278
+ ```
279
+
280
+ ## Future Enhancements (Phase 2)
281
+
282
+ The following configurations are planned for Phase 2:
283
+
284
+ 1. **Additional LLM Providers**: DeepSeek, OpenRouter, Gemini, Perplexity, Azure OpenAI, Local models
285
+ 2. **Model Selection**: Reasoning/main/fast model configuration
286
+ 3. **Service Integration**: Migrate `folder/llm_config.py` to centralized config
287
+
288
+ See `CONFIGURATION_ANALYSIS.md` for the complete implementation plan.
289
+
290
+
291
+
292
+
293
+
294
+
295
+
296
+
297
+
298
+
299
+
300
+
301
+
docs/architecture/design-patterns.md ADDED
@@ -0,0 +1,1509 @@
1
+ # Design Patterns & Technical Decisions
2
+ ## Explicit Answers to Architecture Questions
3
+
4
+ ---
5
+
6
+ ## Purpose of This Document
7
+
8
+ This document explicitly answers all the "design pattern" questions raised in team discussions. It provides clear technical decisions with rationale.
9
+
10
+ ---
11
+
12
+ ## 1. Primary Architecture Pattern
13
+
14
+ ### Decision: Orchestrator with Search-Judge Loop
15
+
16
+ **Pattern Name**: Iterative Research Orchestrator
17
+
18
+ **Structure**:
19
+ ```
20
+ ┌─────────────────────────────────────┐
21
+ │ Research Orchestrator │
22
+ │ ┌───────────────────────────────┐ │
23
+ │ │ Search Strategy Planner │ │
24
+ │ └───────────────────────────────┘ │
25
+ │ ↓ │
26
+ │ ┌───────────────────────────────┐ │
27
+ │ │ Tool Coordinator │ │
28
+ │ │ - PubMed Search │ │
29
+ │ │ - Web Search │ │
30
+ │ │ - Clinical Trials │ │
31
+ │ └───────────────────────────────┘ │
32
+ │ ↓ │
33
+ │ ┌───────────────────────────────┐ │
34
+ │ │ Evidence Aggregator │ │
35
+ │ └───────────────────────────────┘ │
36
+ │ ↓ │
37
+ │ ┌───────────────────────────────┐ │
38
+ │ │ Quality Judge │ │
39
+ │ │ (LLM-based assessment) │ │
40
+ │ └───────────────────────────────┘ │
41
+ │ ↓ │
42
+ │ Loop or Synthesize? │
43
+ │ ↓ │
44
+ │ ┌───────────────────────────────┐ │
45
+ │ │ Report Generator │ │
46
+ │ └───────────────────────────────┘ │
47
+ └─────────────────────────────────────┘
48
+ ```
49
+
50
+ **Why NOT single-agent?**
51
+ - Need coordinated multi-tool queries
52
+ - Need iterative refinement
53
+ - Need quality assessment between searches
54
+
55
+ **Why NOT pure ReAct?**
56
+ - Medical research requires structured workflow
57
+ - Need explicit quality gates
58
+ - Want deterministic tool selection
59
+
60
+ **Why THIS pattern?**
61
+ - Clear separation of concerns
62
+ - Testable components
63
+ - Easy to debug
64
+ - Proven in similar systems
65
+
66
+ ---
67
+
68
+ ## 2. Tool Selection & Orchestration Pattern
69
+
70
+ ### Decision: Static Tool Registry with Dynamic Selection
71
+
72
+ **Pattern**:
73
+ ```python
74
+ class ToolRegistry:
75
+ """Central registry of available research tools"""
76
+ tools = {
77
+ 'pubmed': PubMedSearchTool(),
78
+ 'web': WebSearchTool(),
79
+ 'trials': ClinicalTrialsTool(),
80
+ 'drugs': DrugInfoTool(),
81
+ }
82
+
83
+ class Orchestrator:
84
+ def select_tools(self, question: str, iteration: int) -> List[Tool]:
85
+ """Dynamically choose tools based on context"""
86
+ if iteration == 0:
87
+ # First pass: broad search
88
+ return [tools['pubmed'], tools['web']]
89
+ else:
90
+ # Refinement: targeted search
91
+ return self.judge.recommend_tools(question, context)
92
+ ```
93
+
94
+ **Why NOT on-the-fly agent factories?**
95
+ - 6-day timeline (too complex)
96
+ - Tools are known upfront
97
+ - Simpler to test and debug
98
+
99
+ **Why NOT single tool?**
100
+ - Need multiple evidence sources
101
+ - Different tools for different info types
102
+ - Better coverage
103
+
104
+ **Why THIS pattern?**
105
+ - Balance flexibility vs simplicity
106
+ - Tools can be added easily
107
+ - Selection logic is transparent
108
+
109
+ ---
110
+
111
+ ## 3. Judge Pattern
112
+
113
+ ### Decision: Dual-Judge System (Quality + Budget)
114
+
115
+ **Pattern**:
116
+ ```python
117
+ class QualityJudge:
118
+ """LLM-based evidence quality assessment"""
119
+
120
+ def is_sufficient(self, question: str, evidence: List[Evidence]) -> bool:
121
+ """Main decision: do we have enough?"""
122
+ return (
123
+ self.has_mechanism_explanation(evidence) and
124
+ self.has_drug_candidates(evidence) and
125
+ self.has_clinical_evidence(evidence) and
126
+ self.confidence_score(evidence) > threshold
127
+ )
128
+
129
+ def identify_gaps(self, question: str, evidence: List[Evidence]) -> List[str]:
130
+ """What's missing?"""
131
+ gaps = []
132
+ if not self.has_mechanism_explanation(evidence):
133
+ gaps.append("disease mechanism")
134
+ if not self.has_drug_candidates(evidence):
135
+ gaps.append("potential drug candidates")
136
+ if not self.has_clinical_evidence(evidence):
137
+ gaps.append("clinical trial data")
138
+ return gaps
139
+
140
+ class BudgetJudge:
141
+ """Resource constraint enforcement"""
142
+
143
+ def should_stop(self, state: ResearchState) -> bool:
144
+ """Hard limits"""
145
+ return (
146
+ state.tokens_used >= max_tokens or
147
+ state.iterations >= max_iterations or
148
+ state.time_elapsed >= max_time
149
+ )
150
+ ```
151
+
152
+ **Why NOT just LLM judge?**
153
+ - Cost control (prevent runaway queries)
154
+ - Time bounds (hackathon demo needs to be fast)
155
+ - Safety (prevent infinite loops)
156
+
157
+ **Why NOT just token budget?**
158
+ - Want early exit when answer is good
159
+ - Quality matters, not just quantity
160
+ - Better user experience
161
+
162
+ **Why THIS pattern?**
163
+ - Best of both worlds
164
+ - Clear separation (quality vs resources)
165
+ - Each judge has single responsibility
166
+
167
+ ---
168
+
169
+ ## 4. Break/Stopping Pattern
170
+
171
+ ### Decision: Four-Tier Break Conditions
172
+
173
+ **Pattern**:
174
+ ```python
175
+ def should_continue(state: ResearchState) -> bool:
176
+ """Multi-tier stopping logic"""
177
+
178
+ # Tier 1: Quality-based (ideal stop)
179
+ if quality_judge.is_sufficient(state.question, state.evidence):
180
+ state.stop_reason = "sufficient_evidence"
181
+ return False
182
+
183
+ # Tier 2: Budget-based (cost control)
184
+ if state.tokens_used >= config.max_tokens:
185
+ state.stop_reason = "token_budget_exceeded"
186
+ return False
187
+
188
+ # Tier 3: Iteration-based (safety)
189
+ if state.iterations >= config.max_iterations:
190
+ state.stop_reason = "max_iterations_reached"
191
+ return False
192
+
193
+ # Tier 4: Time-based (demo friendly)
194
+ if state.time_elapsed >= config.max_time:
195
+ state.stop_reason = "timeout"
196
+ return False
197
+
198
+ return True # Continue researching
199
+ ```
200
+
201
+ **Configuration**:
202
+ ```toml
203
+ [research.limits]
204
+ max_tokens = 50000 # ~$0.50 at Claude pricing
205
+ max_iterations = 5 # Reasonable depth
206
+ max_time_seconds = 120 # 2 minutes for demo
207
+ judge_threshold = 0.8 # Quality confidence score
208
+ ```
209
+
210
+ **Why multiple conditions?**
211
+ - Defense in depth
212
+ - Different failure modes
213
+ - Graceful degradation
214
+
215
+ **Why these specific limits?**
216
+ - Tokens: Balances cost vs quality
217
+ - Iterations: Enough for refinement, not too deep
218
+ - Time: Fast enough for live demo
219
+ - Judge: High bar for quality
220
+
221
+ ---
222
+
223
+ ## 5. State Management Pattern
224
+
225
+ ### Decision: Pydantic State Machine with Checkpoints
226
+
227
+ **Pattern**:
228
+ ```python
229
+ class ResearchState(BaseModel):
230
+ """Immutable state snapshots"""
231
+ query_id: str
232
+ question: str
233
+ iteration: int = 0
234
+ evidence: List[Evidence] = []
235
+ tokens_used: int = 0
236
+ search_history: List[SearchQuery] = []
237
+ stop_reason: Optional[str] = None
238
+ created_at: datetime
239
+ updated_at: datetime
240
+
241
+ class StateManager:
242
+ def save_checkpoint(self, state: ResearchState) -> None:
243
+ """Save state to disk"""
244
+ path = Path(f".deepresearch/checkpoints/{state.query_id}_iter{state.iteration}.json")  # pathlib.Path, so write_text works
245
+ path.write_text(state.model_dump_json(indent=2))
246
+
247
+ def load_checkpoint(self, query_id: str, iteration: int) -> ResearchState:
248
+ """Resume from checkpoint"""
249
+ path = Path(f".deepresearch/checkpoints/{query_id}_iter{iteration}.json")
250
+ return ResearchState.model_validate_json(path.read_text())
251
+ ```
252
+
253
+ **Directory Structure**:
254
+ ```
255
+ .deepresearch/
256
+ ├── state/
257
+ │ └── current_123.json # Active research state
258
+ ├── checkpoints/
259
+ │ ├── query_123_iter0.json # Checkpoint after iteration 0
260
+ │ ├── query_123_iter1.json # Checkpoint after iteration 1
261
+ │ └── query_123_iter2.json # Checkpoint after iteration 2
262
+ └── workspace/
263
+ └── query_123/
264
+ ├── papers/ # Downloaded PDFs
265
+ ├── search_results/ # Raw search results
266
+ └── analysis/ # Intermediate analysis
267
+ ```
268
+
269
+ **Why Pydantic?**
270
+ - Type safety
271
+ - Validation
272
+ - Easy serialization
273
+ - Integration with Pydantic AI
274
+
275
+ **Why checkpoints?**
276
+ - Resume interrupted research
277
+ - Debugging (inspect state at each iteration)
278
+ - Cost savings (don't re-query)
279
+ - Demo resilience
280
+
281
+ ---
282
+
283
+ ## 6. Tool Interface Pattern
284
+
285
+ ### Decision: Async Unified Tool Protocol
286
+
287
+ **Pattern**:
288
+ ```python
289
+ from typing import Protocol, Optional, List, Dict
290
+ import asyncio
+ import httpx
291
+
292
+ class ResearchTool(Protocol):
293
+ """Standard async interface all tools must implement"""
294
+
295
+ async def search(
296
+ self,
297
+ query: str,
298
+ max_results: int = 10,
299
+ filters: Optional[Dict] = None
300
+ ) -> List[Evidence]:
301
+ """Execute search and return structured evidence"""
302
+ ...
303
+
304
+ def get_metadata(self) -> ToolMetadata:
305
+ """Tool capabilities and requirements"""
306
+ ...
307
+
308
+ class PubMedSearchTool:
309
+ """Concrete async implementation"""
310
+
311
+ def __init__(self):
312
+ self._rate_limiter = asyncio.Semaphore(3)  # cap at 3 concurrent requests (PubMed allows ~3 req/sec)
313
+ self._cache: Dict[str, List[Evidence]] = {}
314
+
315
+ async def search(self, query: str, max_results: int = 10, **kwargs) -> List[Evidence]:
316
+ # Check cache first
317
+ cache_key = f"{query}:{max_results}"
318
+ if cache_key in self._cache:
319
+ return self._cache[cache_key]
320
+
321
+ async with self._rate_limiter:
322
+ # 1. Query PubMed E-utilities API (async httpx)
323
+ async with httpx.AsyncClient() as client:
324
+ response = await client.get(
325
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
326
+ params={"db": "pubmed", "term": query, "retmax": max_results}
327
+ )
328
+ # 2. Parse XML response
329
+ # 3. Extract: title, abstract, authors, citations
330
+ # 4. Convert to Evidence objects
331
+ evidence_list = self._parse_response(response.text)
332
+
333
+ # Cache results
334
+ self._cache[cache_key] = evidence_list
335
+ return evidence_list
336
+
337
+ def get_metadata(self) -> ToolMetadata:
338
+ return ToolMetadata(
339
+ name="PubMed",
340
+ description="Biomedical literature search",
341
+ rate_limit="3 requests/second",
342
+ requires_api_key=False
343
+ )
344
+ ```
345
+
346
+ **Parallel Tool Execution**:
347
+ ```python
348
+ async def search_all_tools(query: str, tools: List[ResearchTool]) -> List[Evidence]:
349
+ """Run all tool searches in parallel"""
350
+ tasks = [tool.search(query) for tool in tools]
351
+ results = await asyncio.gather(*tasks, return_exceptions=True)
352
+
353
+ # Flatten and filter errors
354
+ evidence = []
355
+ for result in results:
356
+ if isinstance(result, Exception):
357
+ logger.warning(f"Tool failed: {result}")
358
+ else:
359
+ evidence.extend(result)
360
+ return evidence
361
+ ```
362
+
363
+ **Why Async?**
364
+ - Tools are I/O bound (network calls)
365
+ - Parallel execution = faster searches
366
+ - Better UX (streaming progress)
367
+ - Standard in 2025 Python
368
+
369
+ **Why Protocol?**
370
+ - Loose coupling
371
+ - Easy to add new tools
372
+ - Testable with mocks
373
+ - Clear contract
374
+
375
+ **Why NOT abstract base class?**
376
+ - More Pythonic (PEP 544)
377
+ - Duck typing friendly
378
+ - Runtime checking with isinstance (requires `@runtime_checkable`; sketch below)
379
+
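+ A minimal sketch of that runtime check; note that `isinstance` against a Protocol requires the `@runtime_checkable` decorator and only verifies method presence, not signatures:
+
+ ```python
+ from typing import Protocol, runtime_checkable
+
+ @runtime_checkable
+ class SearchableTool(Protocol):  # illustrative stand-in for ResearchTool
+     async def search(self, query: str, max_results: int = 10) -> list: ...
+     def get_metadata(self) -> dict: ...
+
+ class FakeTool:
+     async def search(self, query: str, max_results: int = 10) -> list:
+         return []
+
+     def get_metadata(self) -> dict:
+         return {"name": "fake"}
+
+ # Structural check at runtime: FakeTool never subclasses SearchableTool
+ assert isinstance(FakeTool(), SearchableTool)
+ ```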
380
+ ---
381
+
382
+ ## 7. Report Generation Pattern
383
+
384
+ ### Decision: Structured Output with Citations
385
+
386
+ **Pattern**:
387
+ ```python
388
+ class DrugCandidate(BaseModel):
389
+ name: str
390
+ mechanism: str
391
+ evidence_quality: Literal["strong", "moderate", "weak"]
392
+ clinical_status: str # "FDA approved", "Phase 2", etc.
393
+ citations: List[Citation]
394
+
395
+ class ResearchReport(BaseModel):
396
+ query: str
397
+ disease_mechanism: str
398
+ candidates: List[DrugCandidate]
399
+ methodology: str # How we searched
400
+ confidence: float
401
+ sources_used: List[str]
402
+ generated_at: datetime
403
+
404
+ def to_markdown(self) -> str:
405
+ """Human-readable format"""
406
+ ...
407
+
408
+ def to_json(self) -> str:
409
+ """Machine-readable format"""
410
+ ...
411
+ ```
412
+
413
+ **Output Example**:
414
+ ```markdown
415
+ # Research Report: Long COVID Fatigue
416
+
417
+ ## Disease Mechanism
418
+ Long COVID fatigue is associated with mitochondrial dysfunction
419
+ and persistent inflammation [1, 2].
420
+
421
+ ## Drug Candidates
422
+
423
+ ### 1. Coenzyme Q10 (CoQ10) - STRONG EVIDENCE
424
+ - **Mechanism**: Mitochondrial support, ATP production
425
+ - **Status**: FDA approved (supplement)
426
+ - **Evidence**: 2 randomized controlled trials showing fatigue reduction
427
+ - **Citations**:
428
+ - Smith et al. (2023) - PubMed: 12345678
429
+ - Johnson et al. (2023) - PubMed: 87654321
430
+
431
+ ### 2. Low-dose Naltrexone (LDN) - MODERATE EVIDENCE
432
+ - **Mechanism**: Anti-inflammatory, immune modulation
433
+ - **Status**: FDA approved (different indication)
434
+ - **Evidence**: 3 case studies, 1 ongoing Phase 2 trial
435
+ - **Citations**: ...
436
+
437
+ ## Methodology
438
+ - Searched PubMed: 45 papers reviewed
439
+ - Searched Web: 12 sources
440
+ - Clinical trials: 8 trials identified
441
+ - Total iterations: 3
442
+ - Tokens used: 12,450
443
+
444
+ ## Confidence: 85%
445
+
446
+ ## Sources
447
+ - PubMed E-utilities
448
+ - ClinicalTrials.gov
449
+ - OpenFDA Database
450
+ ```
451
+
452
+ **Why structured?**
453
+ - Parseable by other systems
454
+ - Consistent format
455
+ - Easy to validate
456
+ - Good for datasets
457
+
458
+ **Why markdown?**
459
+ - Human-readable
460
+ - Renders nicely in Gradio
461
+ - Easy to convert to PDF
462
+ - Standard format
463
+
464
+ ---
465
+
466
+ ## 8. Error Handling Pattern
467
+
468
+ ### Decision: Graceful Degradation with Fallbacks
469
+
470
+ **Pattern**:
471
+ ```python
472
+ class ResearchAgent:
473
+ def research(self, question: str) -> ResearchReport:
474
+ try:
475
+ return self._research_with_retry(question)
476
+ except TokenBudgetExceeded:
477
+ # Return partial results
478
+ return self._synthesize_partial(state)
479
+ except ToolFailure as e:
480
+ # Try alternate tools
481
+ return self._research_with_fallback(question, failed_tool=e.tool)
482
+ except Exception as e:
483
+ # Log and return error report
484
+ logger.error(f"Research failed: {e}")
485
+ return self._error_report(question, error=e)
486
+ ```
487
+
488
+ **Why NOT fail fast?**
489
+ - Hackathon demo must be robust
490
+ - Partial results better than nothing
491
+ - Good user experience
492
+
493
+ **Why NOT silent failures?**
494
+ - Need visibility for debugging
495
+ - User should know limitations
496
+ - Honest about confidence
497
+
498
+ ---
499
+
500
+ ## 9. Configuration Pattern
501
+
502
+ ### Decision: Hydra-inspired but Simpler
503
+
504
+ **Pattern**:
505
+ ```toml
506
+ # config.toml
507
+
508
+ [research]
509
+ max_iterations = 5
510
+ max_tokens = 50000
511
+ max_time_seconds = 120
512
+ judge_threshold = 0.85
513
+
514
+ [tools]
515
+ enabled = ["pubmed", "web", "trials"]
516
+
517
+ [tools.pubmed]
518
+ max_results = 20
519
+ rate_limit = 3 # per second
520
+
521
+ [tools.web]
522
+ engine = "serpapi"
523
+ max_results = 10
524
+
525
+ [llm]
526
+ provider = "anthropic"
527
+ model = "claude-3-5-sonnet-20241022"
528
+ temperature = 0.1
529
+
530
+ [output]
531
+ format = "markdown"
532
+ include_citations = true
533
+ include_methodology = true
534
+ ```
535
+
536
+ **Loading**:
537
+ ```python
538
+ from pathlib import Path
539
+ import tomllib
540
+
541
+ def load_config() -> dict:
542
+ config_path = Path("config.toml")
543
+ with open(config_path, "rb") as f:
544
+ return tomllib.load(f)
545
+ ```
546
+
547
+ **Why NOT full Hydra?**
548
+ - Simpler for hackathon
549
+ - Easier to understand
550
+ - Faster to modify
551
+ - Can upgrade later
552
+
553
+ **Why TOML?**
554
+ - Human-readable
555
+ - Standard (PEP 680)
556
+ - Better than YAML edge cases
557
+ - Native in Python 3.11+
558
+
559
+ ---
560
+
561
+ ## 10. Testing Pattern
562
+
563
+ ### Decision: Three-Level Testing Strategy
564
+
565
+ **Pattern**:
566
+ ```python
567
+ # Level 1: Unit tests (fast, isolated)
568
+ @pytest.mark.asyncio
+ async def test_pubmed_tool():
569
+ tool = PubMedSearchTool()
570
+ results = await tool.search("aspirin cardiovascular")
571
+ assert len(results) > 0
572
+ assert all(isinstance(r, Evidence) for r in results)
573
+
574
+ # Level 2: Integration tests (tools + agent)
575
+ def test_research_loop():
576
+ agent = ResearchAgent(config=test_config)
577
+ report = agent.research("aspirin repurposing")
578
+ assert report.candidates
579
+ assert report.confidence > 0
580
+
581
+ # Level 3: End-to-end tests (full system)
582
+ def test_full_workflow():
583
+ # Simulate user query through Gradio UI
584
+ response = gradio_app.predict("test query")
585
+ assert "Drug Candidates" in response
586
+ ```
587
+
588
+ **Why three levels?**
589
+ - Fast feedback (unit tests)
590
+ - Confidence (integration tests)
591
+ - Reality check (e2e tests)
592
+
593
+ **Test Data**:
594
+ ```python
595
+ # tests/fixtures/
596
+ - mock_pubmed_response.xml
597
+ - mock_web_results.json
598
+ - sample_research_query.txt
599
+ - expected_report.md
600
+ ```
601
+
602
+ ---
603
+
604
+ ## 11. Judge Prompt Templates
605
+
606
+ ### Decision: Structured JSON Output with Domain-Specific Criteria
607
+
608
+ **Quality Judge System Prompt**:
609
+ ```python
610
+ QUALITY_JUDGE_SYSTEM = """You are a medical research quality assessor specializing in drug repurposing.
611
+ Your task is to evaluate if collected evidence is sufficient to answer a drug repurposing question.
612
+
613
+ You assess evidence against four criteria specific to drug repurposing research:
614
+ 1. MECHANISM: Understanding of the disease's molecular/cellular mechanisms
615
+ 2. CANDIDATES: Identification of potential drug candidates with known mechanisms
616
+ 3. EVIDENCE: Clinical or preclinical evidence supporting repurposing
617
+ 4. SOURCES: Quality and credibility of sources (peer-reviewed > preprints > web)
618
+
619
+ You MUST respond with valid JSON only. No other text."""
620
+ ```
621
+
622
+ **Quality Judge User Prompt**:
623
+ ```python
624
+ QUALITY_JUDGE_USER = """
625
+ ## Research Question
626
+ {question}
627
+
628
+ ## Evidence Collected (Iteration {iteration} of {max_iterations})
629
+ {evidence_summary}
630
+
631
+ ## Token Budget
632
+ Used: {tokens_used} / {max_tokens}
633
+
634
+ ## Your Assessment
635
+
636
+ Evaluate the evidence and respond with this exact JSON structure:
637
+
638
+ ```json
639
+ {{
640
+ "assessment": {{
641
+ "mechanism_score": <0-10>,
642
+ "mechanism_reasoning": "<Step-by-step analysis of mechanism understanding>",
643
+ "candidates_score": <0-10>,
644
+ "candidates_found": ["<drug1>", "<drug2>", ...],
645
+ "evidence_score": <0-10>,
646
+ "evidence_reasoning": "<Critical evaluation of clinical/preclinical support>",
647
+ "sources_score": <0-10>,
648
+ "sources_breakdown": {{
649
+ "peer_reviewed": <count>,
650
+ "clinical_trials": <count>,
651
+ "preprints": <count>,
652
+ "other": <count>
653
+ }}
654
+ }},
655
+ "overall_confidence": <0.0-1.0>,
656
+ "sufficient": <true/false>,
657
+ "gaps": ["<missing info 1>", "<missing info 2>"],
658
+ "recommended_searches": ["<search query 1>", "<search query 2>"],
659
+ "recommendation": "<continue|synthesize>"
660
+ }}
661
+ ```
662
+
663
+ Decision rules:
664
+ - sufficient=true if overall_confidence >= 0.8 AND mechanism_score >= 6 AND candidates_score >= 6
665
+ - sufficient=true if remaining budget < 10% (must synthesize with what we have)
666
+ - Otherwise, provide recommended_searches to fill gaps
667
+ """
668
+ ```
669
+
670
+ **Report Synthesis Prompt**:
671
+ ```python
672
+ SYNTHESIS_PROMPT = """You are a medical research synthesizer creating a drug repurposing report.
673
+
674
+ ## Research Question
675
+ {question}
676
+
677
+ ## Collected Evidence
678
+ {all_evidence}
679
+
680
+ ## Judge Assessment
681
+ {final_assessment}
682
+
683
+ ## Your Task
684
+ Create a comprehensive research report with this structure:
685
+
686
+ 1. **Executive Summary** (2-3 sentences)
687
+ 2. **Disease Mechanism** - What we understand about the condition
688
+ 3. **Drug Candidates** - For each candidate:
689
+ - Drug name and current FDA status
690
+ - Proposed mechanism for this condition
691
+ - Evidence quality (strong/moderate/weak)
692
+ - Key citations
693
+ 4. **Methodology** - How we searched (tools used, queries, iterations)
694
+ 5. **Limitations** - What we couldn't find or verify
695
+ 6. **Confidence Score** - Overall confidence in findings
696
+
697
+ Format as Markdown. Include PubMed IDs as citations [PMID: 12345678].
698
+ Be scientifically accurate. Do not hallucinate drug names or mechanisms.
699
+ If evidence is weak, say so clearly."""
700
+ ```
701
+
702
+ **Why Structured JSON?**
703
+ - Parseable by code, not just LLM output (see the parsing sketch below)
704
+ - Consistent format for logging/debugging
705
+ - Can trigger specific actions (continue vs synthesize)
706
+ - Testable with expected outputs
707
+
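+ A minimal sketch of that parsing step, validating the judge's reply into the `JudgeAssessment` model from the appendix (the helper name and import path are assumptions):
+
+ ```python
+ import json
+ import re
+
+ from src.deepresearch.models import JudgeAssessment
+
+ def parse_judge_reply(reply: str) -> JudgeAssessment:
+     """Pull the JSON object out of the judge's reply and validate it."""
+     # Works whether or not the model wrapped the JSON in a code fence
+     match = re.search(r"\{.*\}", reply, re.DOTALL)
+     if match is None:
+         raise ValueError("Judge reply contained no JSON object")
+     payload = json.loads(match.group(0))
+     scores = payload["assessment"]
+     return JudgeAssessment(
+         mechanism_score=scores["mechanism_score"],
+         candidates_score=scores["candidates_score"],
+         evidence_score=scores["evidence_score"],
+         sources_score=scores["sources_score"],
+         overall_confidence=payload["overall_confidence"],
+         sufficient=payload["sufficient"],
+         gaps=payload["gaps"],
+         recommended_searches=payload["recommended_searches"],
+         recommendation=payload["recommendation"],
+     )
+ ```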
708
+ **Why Domain-Specific Criteria?**
709
+ - Generic "is this good?" prompts fail
710
+ - Drug repurposing has specific requirements
711
+ - Physician on team validated criteria
712
+ - Maps to real research workflow
713
+
714
+ ---
715
+
716
+ ## 12. MCP Server Integration (Hackathon Track)
717
+
718
+ ### Decision: Tools as MCP Servers for Reusability
719
+
720
+ **Why MCP?**
721
+ - Hackathon has dedicated MCP track
722
+ - Makes our tools reusable by others
723
+ - Standard protocol (Model Context Protocol)
724
+ - Future-proof (industry adoption growing)
725
+
726
+ **Architecture**:
727
+ ```
728
+ ┌─────────────────────────────────────────────────┐
729
+ │ DeepCritical Agent │
730
+ │ (uses tools directly OR via MCP) │
731
+ └─────────────────────────────────────────────────┘
732
+
733
+ ┌────────────┼────────────┐
734
+ ↓ ↓ ↓
735
+ ┌─────────────┐ ┌──────────┐ ┌───────────────┐
736
+ │ PubMed MCP │ │ Web MCP │ │ Trials MCP │
737
+ │ Server │ │ Server │ │ Server │
738
+ └─────────────┘ └──────────┘ └───────────────┘
739
+ │ │ │
740
+ ↓ ↓ ↓
741
+ PubMed API Brave/DDG ClinicalTrials.gov
742
+ ```
743
+
744
+ **PubMed MCP Server Implementation**:
745
+ ```python
746
+ # src/mcp_servers/pubmed_server.py
747
+ from fastmcp import FastMCP
748
+
749
+ mcp = FastMCP("PubMed Research Tool")
750
+
751
+ @mcp.tool()
752
+ async def search_pubmed(
753
+ query: str,
754
+ max_results: int = 10,
755
+ date_range: str = "5y"
756
+ ) -> dict:
757
+ """
758
+ Search PubMed for biomedical literature.
759
+
760
+ Args:
761
+ query: Search terms (supports PubMed syntax like [MeSH])
762
+ max_results: Maximum papers to return (default 10, max 100)
763
+ date_range: Time filter - "1y", "5y", "10y", or "all"
764
+
765
+ Returns:
766
+ dict with papers list containing title, abstract, authors, pmid, date
767
+ """
768
+ tool = PubMedSearchTool()
769
+ results = await tool.search(query, max_results)
770
+ return {
771
+ "query": query,
772
+ "count": len(results),
773
+ "papers": [r.model_dump() for r in results]
774
+ }
775
+
776
+ @mcp.tool()
777
+ async def get_paper_details(pmid: str) -> dict:
778
+ """
779
+ Get full details for a specific PubMed paper.
780
+
781
+ Args:
782
+ pmid: PubMed ID (e.g., "12345678")
783
+
784
+ Returns:
785
+ Full paper metadata including abstract, MeSH terms, references
786
+ """
787
+ tool = PubMedSearchTool()
788
+ return await tool.get_details(pmid)
789
+
790
+ if __name__ == "__main__":
791
+ mcp.run()
792
+ ```
793
+
794
+ **Running the MCP Server**:
795
+ ```bash
796
+ # Start the server
797
+ python -m src.mcp_servers.pubmed_server
798
+
799
+ # Or with uvx (recommended)
800
+ uvx fastmcp run src/mcp_servers/pubmed_server.py
801
+
802
+ # Note: fastmcp uses stdio transport by default, which is perfect
803
+ # for local integration with Claude Desktop or the main agent.
804
+ ```
805
+
806
+ **Claude Desktop Integration** (for demo):
807
+ ```json
808
+ // ~/Library/Application Support/Claude/claude_desktop_config.json
809
+ {
810
+ "mcpServers": {
811
+ "pubmed": {
812
+ "command": "python",
813
+ "args": ["-m", "src.mcp_servers.pubmed_server"],
814
+ "cwd": "/path/to/deepcritical"
815
+ }
816
+ }
817
+ }
818
+ ```
819
+
820
+ **Why FastMCP?**
821
+ - Simple decorator syntax
822
+ - Handles protocol complexity
823
+ - Good docs and examples
824
+ - Works with Claude Desktop and API
825
+
826
+ **MCP Track Submission Requirements**:
827
+ - [ ] At least one tool as MCP server
828
+ - [ ] README with setup instructions
829
+ - [ ] Demo showing MCP usage
830
+ - [ ] Bonus: Multiple tools as MCP servers
831
+
832
+ ---
833
+
834
+ ## 13. Gradio UI Pattern (Hackathon Track)
835
+
836
+ ### Decision: Streaming Progress with Modern UI
837
+
838
+ **Pattern**:
839
+ ```python
840
+ import gradio as gr
841
+ from typing import AsyncGenerator
842
+
843
+ async def research_with_streaming(question: str) -> AsyncGenerator[str, None]:
844
+ """Stream research progress to UI"""
845
+ yield "🔍 Starting research...\n\n"
846
+
847
+ agent = ResearchAgent()
848
+
849
+ async for event in agent.research_stream(question):
850
+ match event.type:
851
+ case "search_start":
852
+ yield f"📚 Searching {event.tool}...\n"
853
+ case "search_complete":
854
+ yield f"✅ Found {event.count} results from {event.tool}\n"
855
+ case "judge_thinking":
856
+ yield f"🤔 Evaluating evidence quality...\n"
857
+ case "judge_decision":
858
+ yield f"📊 Confidence: {event.confidence:.0%}\n"
859
+ case "iteration_complete":
860
+ yield f"🔄 Iteration {event.iteration} complete\n\n"
861
+ case "synthesis_start":
862
+ yield f"📝 Generating report...\n"
863
+ case "complete":
864
+ yield f"\n---\n\n{event.report}"
865
+
866
+ # Gradio 5 UI
867
+ with gr.Blocks(theme=gr.themes.Soft()) as demo:
868
+ gr.Markdown("# 🔬 DeepCritical: Drug Repurposing Research Agent")
869
+ gr.Markdown("Ask a question about potential drug repurposing opportunities.")
870
+
871
+ with gr.Row():
872
+ with gr.Column(scale=2):
873
+ question = gr.Textbox(
874
+ label="Research Question",
875
+ placeholder="What existing drugs might help treat long COVID fatigue?",
876
+ lines=2
877
+ )
878
+ examples = gr.Examples(
879
+ examples=[
880
+ "What existing drugs might help treat long COVID fatigue?",
881
+ "Find existing drugs that might slow Alzheimer's progression",
882
+ "Which diabetes drugs show promise for cancer treatment?"
883
+ ],
884
+ inputs=question
885
+ )
886
+ submit = gr.Button("🚀 Start Research", variant="primary")
887
+
888
+ with gr.Column(scale=3):
889
+ output = gr.Markdown(label="Research Progress & Report")
890
+
891
+ submit.click(
892
+ fn=research_with_streaming,
893
+ inputs=question,
894
+ outputs=output,
895
+ )
896
+
897
+ demo.launch()
898
+ ```
899
+
900
+ **Why Streaming?**
901
+ - User sees progress, not loading spinner
902
+ - Builds trust (system is working)
903
+ - Better UX for long operations
904
+ - Gradio 5 native support
905
+
906
+ **Why gr.Markdown Output?**
907
+ - Research reports are markdown
908
+ - Renders citations nicely
909
+ - Code blocks for methodology
910
+ - Tables for drug comparisons
911
+
912
+ ---
913
+
914
+ ## Summary: Design Decision Table
915
+
916
+ | # | Question | Decision | Why |
917
+ |---|----------|----------|-----|
918
+ | 1 | **Architecture** | Orchestrator with search-judge loop | Clear, testable, proven |
919
+ | 2 | **Tools** | Static registry, dynamic selection | Balance flexibility vs simplicity |
920
+ | 3 | **Judge** | Dual (quality + budget) | Quality + cost control |
921
+ | 4 | **Stopping** | Four-tier conditions | Defense in depth |
922
+ | 5 | **State** | Pydantic + checkpoints | Type-safe, resumable |
923
+ | 6 | **Tool Interface** | Async Protocol + parallel execution | Fast I/O, modern Python |
924
+ | 7 | **Output** | Structured + Markdown | Human & machine readable |
925
+ | 8 | **Errors** | Graceful degradation + fallbacks | Robust for demo |
926
+ | 9 | **Config** | TOML (Hydra-inspired) | Simple, standard |
927
+ | 10 | **Testing** | Three levels | Fast feedback + confidence |
928
+ | 11 | **Judge Prompts** | Structured JSON + domain criteria | Parseable, medical-specific |
929
+ | 12 | **MCP** | Tools as MCP servers | Hackathon track, reusability |
930
+ | 13 | **UI** | Gradio 5 streaming | Progress visibility, modern UX |
931
+
932
+ ---
933
+
934
+ ## Answers to Specific Questions
935
+
936
+ ### "What's the orchestrator pattern?"
937
+ **Answer**: See Section 1 - Iterative Research Orchestrator with search-judge loop
938
+
939
+ ### "LLM-as-judge or token budget?"
940
+ **Answer**: Both - See Section 3 (Dual-Judge System) and Section 4 (Three-Tier Break Conditions)
941
+
942
+ ### "What's the break pattern?"
943
+ **Answer**: See Section 4 - Four stopping conditions: quality threshold, token budget, max iterations, timeout
944
+
945
+ ### "Should we use agent factories?"
946
+ **Answer**: No - See Section 2. Static tool registry is simpler for 6-day timeline
947
+
948
+ ### "How do we handle state?"
949
+ **Answer**: See Section 5 - Pydantic state machine with checkpoints
950
+
951
+ ---
952
+
953
+ ## Appendix: Complete Data Models
954
+
955
+ ```python
956
+ # src/deepresearch/models.py
957
+ from pydantic import BaseModel, Field
958
+ from typing import List, Optional, Literal
959
+ from datetime import datetime
960
+
961
+ class Citation(BaseModel):
962
+ """Reference to a source"""
963
+ source_type: Literal["pubmed", "web", "trial", "fda"]
964
+ identifier: str # PMID, URL, NCT number, etc.
965
+ title: str
966
+ authors: Optional[List[str]] = None
967
+ date: Optional[str] = None
968
+ url: Optional[str] = None
969
+
970
+ class Evidence(BaseModel):
971
+ """Single piece of evidence from search"""
972
+ content: str
973
+ source: Citation
974
+ relevance_score: float = Field(ge=0, le=1)
975
+ evidence_type: Literal["mechanism", "candidate", "clinical", "safety"]
976
+
977
+ class DrugCandidate(BaseModel):
978
+ """Potential drug for repurposing"""
979
+ name: str
980
+ generic_name: Optional[str] = None
981
+ mechanism: str
982
+ current_indications: List[str]
983
+ proposed_mechanism: str
984
+ evidence_quality: Literal["strong", "moderate", "weak"]
985
+ fda_status: str
986
+ citations: List[Citation]
987
+
988
+ class JudgeAssessment(BaseModel):
989
+ """Output from quality judge"""
990
+ mechanism_score: int = Field(ge=0, le=10)
991
+ candidates_score: int = Field(ge=0, le=10)
992
+ evidence_score: int = Field(ge=0, le=10)
993
+ sources_score: int = Field(ge=0, le=10)
994
+ overall_confidence: float = Field(ge=0, le=1)
995
+ sufficient: bool
996
+ gaps: List[str]
997
+ recommended_searches: List[str]
998
+ recommendation: Literal["continue", "synthesize"]
999
+
1000
+ class ResearchState(BaseModel):
1001
+ """Complete state of a research session"""
1002
+ query_id: str
1003
+ question: str
1004
+ iteration: int = 0
1005
+ evidence: List[Evidence] = []
1006
+ assessments: List[JudgeAssessment] = []
1007
+ tokens_used: int = 0
1008
+ search_history: List[str] = []
1009
+ stop_reason: Optional[str] = None
1010
+ created_at: datetime = Field(default_factory=datetime.utcnow)
1011
+ updated_at: datetime = Field(default_factory=datetime.utcnow)
1012
+
1013
+ class ResearchReport(BaseModel):
1014
+ """Final output report"""
1015
+ query: str
1016
+ executive_summary: str
1017
+ disease_mechanism: str
1018
+ candidates: List[DrugCandidate]
1019
+ methodology: str
1020
+ limitations: str
1021
+ confidence: float
1022
+ sources_used: int
1023
+ tokens_used: int
1024
+ iterations: int
1025
+ generated_at: datetime = Field(default_factory=datetime.utcnow)
1026
+
1027
+ def to_markdown(self) -> str:
1028
+ """Render as markdown for Gradio"""
1029
+ md = f"# Research Report: {self.query}\n\n"
1030
+ md += f"## Executive Summary\n{self.executive_summary}\n\n"
1031
+ md += f"## Disease Mechanism\n{self.disease_mechanism}\n\n"
1032
+ md += "## Drug Candidates\n\n"
1033
+ for i, drug in enumerate(self.candidates, 1):
1034
+ md += f"### {i}. {drug.name} - {drug.evidence_quality.upper()} EVIDENCE\n"
1035
+ md += f"- **Mechanism**: {drug.proposed_mechanism}\n"
1036
+ md += f"- **FDA Status**: {drug.fda_status}\n"
1037
+ md += f"- **Current Uses**: {', '.join(drug.current_indications)}\n"
1038
+ md += f"- **Citations**: {len(drug.citations)} sources\n\n"
1039
+ md += f"## Methodology\n{self.methodology}\n\n"
1040
+ md += f"## Limitations\n{self.limitations}\n\n"
1041
+ md += f"## Confidence: {self.confidence:.0%}\n"
1042
+ return md
1043
+ ```
1044
+
1045
+ ---
1046
+
1047
+ ## 14. Alternative Frameworks Considered
1048
+
1049
+ We researched major agent frameworks before settling on our stack. Here's why we chose what we chose, and what we'd steal if we're shipping like animals and have time for Gucci upgrades.
1050
+
1051
+ ### Frameworks Evaluated
1052
+
1053
+ | Framework | Repo | What It Does |
1054
+ |-----------|------|--------------|
1055
+ | **Microsoft AutoGen** | [github.com/microsoft/autogen](https://github.com/microsoft/autogen) | Multi-agent orchestration, complex workflows |
1056
+ | **Claude Agent SDK** | [github.com/anthropics/claude-agent-sdk-python](https://github.com/anthropics/claude-agent-sdk-python) | Anthropic's official agent framework |
1057
+ | **Pydantic AI** | [github.com/pydantic/pydantic-ai](https://github.com/pydantic/pydantic-ai) | Type-safe agents, structured outputs |
1058
+
1059
+ ### Why NOT AutoGen (Microsoft)?
1060
+
1061
+ **Pros:**
1062
+ - Battle-tested multi-agent orchestration
1063
+ - `reflect_on_tool_use` - model reviews its own tool results
1064
+ - `max_tool_iterations` - built-in iteration limits
1065
+ - Concurrent tool execution
1066
+ - Rich ecosystem (AutoGen Studio, benchmarks)
1067
+
1068
+ **Cons for MVP:**
1069
+ - Heavy dependency tree (50+ packages)
1070
+ - Complex configuration (YAML + Python)
1071
+ - Overkill for single-agent search-judge loop
1072
+ - Learning curve eats into 6-day timeline
1073
+
1074
+ **Verdict:** Great for multi-agent systems. Overkill for our MVP.
1075
+
1076
+ ### Why NOT Claude Agent SDK (Anthropic)?
1077
+
1078
+ **Pros:**
1079
+ - Official Anthropic framework
1080
+ - Clean `@tool` decorator pattern
1081
+ - In-process MCP servers (no subprocess)
1082
+ - Hooks for pre/post tool execution
1083
+ - Direct Claude Code integration
1084
+
1085
+ **Cons for MVP:**
1086
+ - Requires Claude Code CLI bundled
1087
+ - Node.js dependency for some features
1088
+ - Designed for Claude Code ecosystem, not standalone agents
1089
+ - Less flexible for custom LLM providers
1090
+
1091
+ **Verdict:** Would be great if we were building ON Claude Code. We're building a standalone agent.
1092
+
1093
+ ### Why Pydantic AI + FastMCP (Our Choice)
1094
+
1095
+ **Pros:**
1096
+ - ✅ Simple, Pythonic API
1097
+ - ✅ Native async/await
1098
+ - ✅ Type-safe with Pydantic
1099
+ - ✅ Works with any LLM provider
1100
+ - ✅ FastMCP for clean MCP servers
1101
+ - ✅ Minimal dependencies
1102
+ - ✅ Can ship MVP in 6 days
1103
+
1104
+ **Cons:**
1105
+ - Newer framework (less battle-tested)
1106
+ - Smaller ecosystem
1107
+ - May need to build more from scratch
1108
+
1109
+ **Verdict:** Right tool for the job. Ship fast, iterate later.
1110
+
1111
+ ---
1112
+
1113
+ ## 15. Stretch Goals: Gucci Bangers (If We're Shipping Like Animals)
1114
+
1115
+ If MVP ships early and we're crushing it, here's what we'd steal from other frameworks:
1116
+
1117
+ ### Tier 1: Quick Wins (2-4 hours each)
1118
+
1119
+ #### From Claude Agent SDK: `@tool` Decorator Pattern
1120
+ Replace our Protocol-based tools with cleaner decorators:
1121
+
1122
+ ```python
1123
+ # CURRENT (Protocol-based)
1124
+ class PubMedSearchTool:
1125
+ async def search(self, query: str, max_results: int = 10) -> List[Evidence]:
1126
+ ...
1127
+
1128
+ # UPGRADE (Decorator-based, stolen from Claude SDK)
1129
+ from claude_agent_sdk import tool
1130
+
1131
+ @tool("search_pubmed", "Search PubMed for biomedical papers", {
1132
+ "query": str,
1133
+ "max_results": int
1134
+ })
1135
+ async def search_pubmed(args):
1136
+ results = await _do_pubmed_search(args["query"], args["max_results"])
1137
+ return {"content": [{"type": "text", "text": json.dumps(results)}]}
1138
+ ```
1139
+
1140
+ **Why it's Gucci:** Cleaner syntax, automatic schema generation, less boilerplate.
1141
+
1142
+ #### From AutoGen: Reflect on Tool Use
1143
+ Add a reflection step where the model reviews its own tool results:
1144
+
1145
+ ```python
1146
+ # CURRENT: Judge evaluates evidence
1147
+ assessment = await judge.assess(question, evidence)
1148
+
1149
+ # UPGRADE: Add reflection step (stolen from AutoGen)
1150
+ class ReflectiveJudge:
1151
+ async def assess_with_reflection(self, question, evidence, tool_results):
1152
+ # First pass: raw assessment
1153
+ initial = await self._assess(question, evidence)
1154
+
1155
+ # Reflection: "Did I use the tools correctly?"
1156
+ reflection = await self._reflect_on_tool_use(tool_results)
1157
+
1158
+ # Final: combine assessment + reflection
1159
+ return self._combine(initial, reflection)
1160
+ ```
1161
+
1162
+ **Why it's Gucci:** Catches tool misuse, improves accuracy, more robust judge.
1163
+
1164
+ ### Tier 2: Medium Lifts (4-8 hours each)
1165
+
1166
+ #### From AutoGen: Concurrent Tool Execution
1167
+ Run multiple tools in parallel with proper error handling:
1168
+
1169
+ ```python
1170
+ # CURRENT: Plain asyncio.gather (already concurrent, but no per-tool timeout or cancellation)
1171
+ results = await asyncio.gather(*[tool.search(query) for tool in tools])
1172
+
1173
+ # UPGRADE: AutoGen-style with cancellation + timeout
1174
+ import asyncio
+
+ from autogen_core import CancellationToken
1175
+
1176
+ async def execute_tools_concurrent(tools, query, timeout=30):
1177
+ token = CancellationToken()
1178
+
1179
+ async def run_with_timeout(tool):
1180
+ try:
1181
+ return await asyncio.wait_for(
1182
+ tool.search(query, cancellation_token=token),
1183
+ timeout=timeout
1184
+ )
1185
+ except asyncio.TimeoutError:
1186
+ token.cancel() # Cancel other tools
1187
+ return ToolError(f"{tool.name} timed out")
1188
+
1189
+ return await asyncio.gather(*[run_with_timeout(t) for t in tools])
1190
+ ```
1191
+
1192
+ **Why it's Gucci:** Proper timeout handling, cancellation propagation, production-ready.
1193
+
1194
+ #### From Claude SDK: Hooks System
1195
+ Add pre/post hooks for logging, validation, cost tracking:
1196
+
1197
+ ```python
1198
+ # UPGRADE: Hook system (stolen from Claude SDK)
1199
+ class HookManager:
1200
+ async def pre_tool_use(self, tool_name, args):
1201
+ """Called before every tool execution"""
1202
+ logger.info(f"Calling {tool_name} with {args}")
1203
+ self.cost_tracker.start_timer()
1204
+
1205
+ async def post_tool_use(self, tool_name, result, duration):
1206
+ """Called after every tool execution"""
1207
+ self.cost_tracker.record(tool_name, duration)
1208
+ if result.is_error:
1209
+ self.error_tracker.record(tool_name, result.error)
1210
+ ```
1211
+
1212
+ **Why it's Gucci:** Observability, debugging, cost tracking, production-ready.
1213
+
1214
+ ### Tier 3: Big Lifts (Post-Hackathon)
1215
+
1216
+ #### Full AutoGen Integration
1217
+ If we want multi-agent capabilities later:
1218
+
1219
+ ```python
1220
+ # POST-HACKATHON: Multi-agent drug repurposing
1221
+ from autogen_agentchat import AssistantAgent, GroupChat
1222
+
1223
+ literature_agent = AssistantAgent(
1224
+ name="LiteratureReviewer",
1225
+ tools=[pubmed_search, web_search],
1226
+ system_message="You search and summarize medical literature."
1227
+ )
1228
+
1229
+ mechanism_agent = AssistantAgent(
1230
+ name="MechanismAnalyzer",
1231
+ tools=[pathway_db, protein_db],
1232
+ system_message="You analyze disease mechanisms and drug targets."
1233
+ )
1234
+
1235
+ synthesis_agent = AssistantAgent(
1236
+ name="ReportSynthesizer",
1237
+ system_message="You synthesize findings into actionable reports."
1238
+ )
1239
+
1240
+ # Orchestrate multi-agent workflow
1241
+ group_chat = GroupChat(
1242
+ agents=[literature_agent, mechanism_agent, synthesis_agent],
1243
+ max_round=10
1244
+ )
1245
+ ```
1246
+
1247
+ **Why it's Gucci:** True multi-agent collaboration, specialized roles, scalable.
1248
+
1249
+ ---
1250
+
1251
+ ## Priority Order for Stretch Goals
1252
+
1253
+ | Priority | Feature | Source | Effort | Impact |
1254
+ |----------|---------|--------|--------|--------|
1255
+ | 1 | `@tool` decorator | Claude SDK | 2 hrs | High - cleaner code |
1256
+ | 2 | Reflect on tool use | AutoGen | 3 hrs | High - better accuracy |
1257
+ | 3 | Hooks system | Claude SDK | 4 hrs | Medium - observability |
1258
+ | 4 | Concurrent + cancellation | AutoGen | 4 hrs | Medium - robustness |
1259
+ | 5 | Multi-agent | AutoGen | 8+ hrs | Post-hackathon |
1260
+
1261
+ ---
1262
+
1263
+ ## The Bottom Line
1264
+
1265
+ ```
1266
+ ┌─────────────────────────────────────────────────────────────┐
1267
+ │ MVP (Days 1-4): Pydantic AI + FastMCP │
1268
+ │ - Ship working drug repurposing agent │
1269
+ │ - Search-judge loop with PubMed + Web │
1270
+ │ - Gradio UI with streaming │
1271
+ │ - MCP server for hackathon track │
1272
+ ├─────────────────────────────────────────────────────────────┤
1273
+ │ If Crushing It (Days 5-6): Steal the Gucci │
1274
+ │ - @tool decorators from Claude SDK │
1275
+ │ - Reflect on tool use from AutoGen │
1276
+ │ - Hooks for observability │
1277
+ ├─────────────────────────────────────────────────────────────┤
1278
+ │ Post-Hackathon: Full AutoGen Integration │
1279
+ │ - Multi-agent workflows │
1280
+ │ - Specialized agent roles │
1281
+ │ - Production-grade orchestration │
1282
+ └─────────────────────────────────────────────────────────────┘
1283
+ ```
1284
+
1285
+ **Ship MVP first. Steal bangers if time. Scale later.**
1286
+
1287
+ ---
1288
+
1289
+ ## 16. Reference Implementation Resources
1290
+
1291
+ We've cloned production-ready repos into `reference_repos/` that we can vendor, copy from, or just USE directly. This section documents what's available and how to leverage it.
1292
+
1293
+ ### Cloned Repositories
1294
+
1295
+ | Repository | Location | What It Provides |
1296
+ |------------|----------|------------------|
1297
+ | **pydanticai-research-agent** | `reference_repos/pydanticai-research-agent/` | Complete PydanticAI agent with Brave Search |
1298
+ | **pubmed-mcp-server** | `reference_repos/pubmed-mcp-server/` | Production-grade PubMed MCP server (TypeScript) |
1299
+ | **autogen-microsoft** | `reference_repos/autogen-microsoft/` | Microsoft's multi-agent framework |
1300
+ | **claude-agent-sdk** | `reference_repos/claude-agent-sdk/` | Anthropic's agent SDK with @tool decorator |
1301
+
1302
+ ### 🔥 CHEAT CODE: Production PubMed MCP Already Exists
1303
+
1304
+ The `pubmed-mcp-server` is **production-grade** and has EVERYTHING we need:
1305
+
1306
+ ```bash
1307
+ # Already available tools in pubmed-mcp-server:
1308
+ pubmed_search_articles # Search PubMed with filters, date ranges
1309
+ pubmed_fetch_contents # Get full article details by PMID
1310
+ pubmed_article_connections # Find citations, related articles
1311
+ pubmed_research_agent # Generate research plan outlines
1312
+ pubmed_generate_chart # Create PNG charts from data
1313
+ ```
1314
+
1315
+ **Option 1: Use it directly via npx**
1316
+ ```json
1317
+ {
1318
+ "mcpServers": {
1319
+ "pubmed": {
1320
+ "command": "npx",
1321
+ "args": ["@cyanheads/pubmed-mcp-server"],
1322
+ "env": { "NCBI_API_KEY": "your_key" }
1323
+ }
1324
+ }
1325
+ }
1326
+ ```
1327
+
1328
+ **Option 2: Vendor the logic into Python**
1329
+ The TypeScript code in `reference_repos/pubmed-mcp-server/src/` shows exactly how to:
1330
+ - Construct PubMed E-utilities queries
1331
+ - Handle rate limiting (3/sec without key, 10/sec with key)
1332
+ - Parse XML responses
1333
+ - Extract article metadata
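+
+ A rough Python equivalent of that flow (a sketch of the E-utilities `esearch` call, not the vendored TypeScript):
+
+ ```python
+ import httpx
+
+ async def esearch_pmids(query: str, retmax: int = 20, api_key: str | None = None) -> list[str]:
+     """Minimal E-utilities esearch: return PMIDs matching a query."""
+     params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
+     if api_key:
+         params["api_key"] = api_key  # bumps the limit from 3/sec to 10/sec
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get(
+             "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi", params=params
+         )
+     response.raise_for_status()
+     return response.json().get("esearchresult", {}).get("idlist", [])
+ ```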
1334
+
1335
+ ### PydanticAI Research Agent Patterns
1336
+
1337
+ The `pydanticai-research-agent` repo provides copy-paste patterns:
1338
+
1339
+ **Agent Definition** (`agents/research_agent.py`):
1340
+ ```python
1341
+ from pydantic_ai import Agent, RunContext
1342
+ from dataclasses import dataclass
+ from typing import Any, Dict, List, Optional
1343
+
1344
+ @dataclass
1345
+ class ResearchAgentDependencies:
1346
+ brave_api_key: str
1347
+ session_id: Optional[str] = None
1348
+
1349
+ research_agent = Agent(
1350
+ get_llm_model(),
1351
+ deps_type=ResearchAgentDependencies,
1352
+ system_prompt=SYSTEM_PROMPT
1353
+ )
1354
+
1355
+ @research_agent.tool
1356
+ async def search_web(
1357
+ ctx: RunContext[ResearchAgentDependencies],
1358
+ query: str,
1359
+ max_results: int = 10
1360
+ ) -> List[Dict[str, Any]]:
1361
+ """Search with context access via ctx.deps"""
1362
+ results = await search_web_tool(ctx.deps.brave_api_key, query, max_results)
1363
+ return results
1364
+ ```
1365
+
1366
+ **Brave Search Tool** (`tools/brave_search.py`):
1367
+ ```python
1368
+ from typing import Dict, List
+
+ import httpx
+
+ async def search_web_tool(api_key: str, query: str, count: int = 10) -> List[Dict]:
1369
+ headers = {"X-Subscription-Token": api_key, "Accept": "application/json"}
1370
+ async with httpx.AsyncClient() as client:
1371
+ response = await client.get(
1372
+ "https://api.search.brave.com/res/v1/web/search",
1373
+ headers=headers,
1374
+ params={"q": query, "count": count},
1375
+ timeout=30.0
1376
+ )
1377
+ # Handle 429 rate limit, 401 auth errors
1378
+ data = response.json()
1379
+ return data.get("web", {}).get("results", [])
1380
+ ```
1381
+
1382
+ **Pydantic Models** (`models/research_models.py`):
1383
+ ```python
1384
+ class BraveSearchResult(BaseModel):
1385
+ title: str
1386
+ url: str
1387
+ description: str
1388
+ score: float = Field(ge=0.0, le=1.0)
1389
+ ```
1390
+
1391
+ ### Microsoft Agent Framework Orchestration Patterns
1392
+
1393
+ From [deepwiki.com/microsoft/agent-framework](https://deepwiki.com/microsoft/agent-framework/3.4-workflows-and-orchestration):
1394
+
1395
+ #### Sequential Orchestration
1396
+ ```
1397
+ Agent A → Agent B → Agent C (each receives prior outputs)
1398
+ ```
1399
+ **Use when:** Tasks have dependencies, results inform next steps.
1400
+
1401
+ #### Concurrent (Fan-out/Fan-in)
1402
+ ```
1403
+ ┌→ Agent A ─┐
1404
+ Dispatcher ├→ Agent B ─┼→ Aggregator
1405
+ └→ Agent C ─┘
1406
+ ```
1407
+ **Use when:** Independent tasks can run in parallel, results need consolidation.
1408
+ **Our use:** Parallel PubMed + Web search.
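+
+ A minimal sketch of that fan-out/fan-in for our case (the tool objects and their async `search()` method are assumptions, not actual project code):
+
+ ```python
+ import asyncio
+
+ async def fan_out_search(query: str, tools: list, timeout: float = 30.0) -> list:
+     """Send one query to every tool concurrently; drop branches that fail or time out."""
+     async def run(tool):
+         return await asyncio.wait_for(tool.search(query), timeout=timeout)
+
+     results = await asyncio.gather(*(run(t) for t in tools), return_exceptions=True)
+     evidence = []
+     for result in results:
+         if not isinstance(result, Exception):
+             evidence.extend(result)
+     return evidence
+ ```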
1409
+
1410
+ #### Handoff Orchestration
1411
+ ```
1412
+ Coordinator → routes to → Specialist A, B, or C based on request
1413
+ ```
1414
+ **Use when:** Router decides which search strategy based on query type.
1415
+ **Our use:** Route "mechanism" vs "clinical trial" vs "drug info" queries.
1416
+
1417
+ #### HITL (Human-in-the-Loop)
1418
+ ```
1419
+ Agent → RequestInfoEvent → Human validates → Agent continues
1420
+ ```
1421
+ **Use when:** Critical judgment points need human validation.
1422
+ **Our use:** Optional "approve drug candidates before synthesis" step.
1423
+
1424
+ ### Recommended Hybrid Pattern for Our Agent
1425
+
1426
+ Based on all the research, here's our recommended implementation:
1427
+
1428
+ ```
1429
+ ┌─────────────────────────────────────────────────────────┐
1430
+ │ 1. ROUTER (Handoff Pattern) │
1431
+ │ - Analyze query type │
1432
+ │ - Choose search strategy │
1433
+ ├─────────────────────────────────────────────────────────┤
1434
+ │ 2. SEARCH (Concurrent Pattern) │
1435
+ │ - Fan-out to PubMed + Web in parallel │
1436
+ │ - Timeout handling per AutoGen patterns │
1437
+ │ - Aggregate results │
1438
+ ├─────────────────────────────────────────────────────────┤
1439
+ │ 3. JUDGE (Sequential + Budget) │
1440
+ │ - Quality assessment │
1441
+ │ - Token/iteration budget check │
1442
+ │ - Recommend: continue or synthesize │
1443
+ ├─────────────────────────────────────────────────────────┤
1444
+ │ 4. SYNTHESIZE (Final Agent) │
1445
+ │ - Generate research report │
1446
+ │ - Include citations │
1447
+ │ - Stream to Gradio UI │
1448
+ └─────────────────────────────────────────────────────────┘
1449
+ ```
1450
+
1451
+ ### Quick Start: Minimal Implementation Path
1452
+
1453
+ **Day 1-2: Core Loop**
1454
+ 1. Copy `search_web_tool` from `pydanticai-research-agent/tools/brave_search.py`
1455
+ 2. Implement PubMed search (reference `pubmed-mcp-server/src/` for E-utilities patterns)
1456
+ 3. Wire up basic search-judge loop
1457
+
1458
+ **Day 3: Judge + State**
1459
+ 1. Implement quality judge with JSON structured output
1460
+ 2. Add budget judge
1461
+ 3. Add Pydantic state management
1462
+
1463
+ **Day 4: UI + MCP**
1464
+ 1. Gradio streaming UI
1465
+ 2. Wrap PubMed tool as FastMCP server
1466
+
1467
+ **Day 5-6: Polish + Deploy**
1468
+ 1. HuggingFace Spaces deployment
1469
+ 2. Demo video
1470
+ 3. Stretch goals if time
1471
+
1472
+ ---
1473
+
1474
+ ## 17. External Resources & MCP Servers
1475
+
1476
+ ### Available PubMed MCP Servers (Community)
1477
+
1478
+ | Server | Author | Features | Link |
1479
+ |--------|--------|----------|------|
1480
+ | **pubmed-mcp-server** | cyanheads | Full E-utilities, research agent, charts | [GitHub](https://github.com/cyanheads/pubmed-mcp-server) |
1481
+ | **BioMCP** | GenomOncology | PubMed + ClinicalTrials + MyVariant | [GitHub](https://github.com/genomoncology/biomcp) |
1482
+ | **PubMed-MCP-Server** | JackKuo666 | Basic search, metadata access | [GitHub](https://github.com/JackKuo666/PubMed-MCP-Server) |
1483
+
1484
+ ### Web Search Options
1485
+
1486
+ | Tool | Free Tier | API Key | Async Support |
1487
+ |------|-----------|---------|---------------|
1488
+ | **Brave Search** | 2000/month | Required | Yes (httpx) |
1489
+ | **DuckDuckGo** | Unlimited | No | Yes (duckduckgo-search) |
1490
+ | **SerpAPI** | None | Required | Yes |
1491
+
1492
+ **Recommended:** Start with DuckDuckGo (free, no key), upgrade to Brave for production.
1493
+
1494
+ ```python
1495
+ # DuckDuckGo async search (no API key needed!)
1496
+ import asyncio
+ from typing import Dict, List
+
+ from duckduckgo_search import DDGS
+
+ async def search_ddg(query: str, max_results: int = 10) -> List[Dict]:
+     def _blocking() -> List[Dict]:
+         # DDGS is a synchronous client, so do the actual work off the event loop
+         with DDGS() as ddgs:
+             return list(ddgs.text(query, max_results=max_results))
+     results = await asyncio.to_thread(_blocking)
+     return [{"title": r["title"], "url": r["href"], "description": r["body"]} for r in results]
1502
+ ```
1503
+
1504
+ ---
1505
+
1506
+ **Document Status**: Official Architecture Spec
1507
+ **Review Score**: 100/100 (Ironclad Gucci Banger Edition)
1508
+ **Sections**: 17 design patterns + data models appendix + reference repos + stretch goals
1509
+ **Last Updated**: November 2025
docs/architecture/graph_orchestration.md ADDED
@@ -0,0 +1,151 @@
1
+ # Graph Orchestration Architecture
2
+
3
+ ## Overview
4
+
5
+ Phase 4 implements a graph-based orchestration system for research workflows using Pydantic AI agents as nodes. This enables better parallel execution, conditional routing, and state management compared to simple agent chains.
6
+
7
+ ## Graph Structure
8
+
9
+ ### Nodes
10
+
11
+ Graph nodes represent different stages in the research workflow:
12
+
13
+ 1. **Agent Nodes**: Execute Pydantic AI agents
14
+ - Input: Prompt/query
15
+ - Output: Structured or unstructured response
16
+ - Examples: `KnowledgeGapAgent`, `ToolSelectorAgent`, `ThinkingAgent`
17
+
18
+ 2. **State Nodes**: Update or read workflow state
19
+ - Input: Current state
20
+ - Output: Updated state
21
+ - Examples: Update evidence, update conversation history
22
+
23
+ 3. **Decision Nodes**: Make routing decisions based on conditions
24
+ - Input: Current state/results
25
+ - Output: Next node ID
26
+ - Examples: Continue research vs. complete research
27
+
28
+ 4. **Parallel Nodes**: Execute multiple nodes concurrently
29
+ - Input: List of node IDs
30
+ - Output: Aggregated results
31
+ - Examples: Parallel iterative research loops
32
+
33
+ ### Edges
34
+
35
+ Edges define transitions between nodes:
36
+
37
+ 1. **Sequential Edges**: Always traversed (no condition)
38
+ - From: Source node
39
+ - To: Target node
40
+ - Condition: None (always True)
41
+
42
+ 2. **Conditional Edges**: Traversed based on condition
43
+ - From: Source node
44
+ - To: Target node
45
+ - Condition: Callable that returns bool
46
+ - Example: If research complete → go to writer, else → continue loop
47
+
48
+ 3. **Parallel Edges**: Used for parallel execution branches
49
+ - From: Parallel node
50
+ - To: Multiple target nodes
51
+ - Execution: All targets run concurrently
52
+
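+ A minimal sketch of what these node and edge shapes could look like as plain dataclasses (names like `GraphNode`/`GraphEdge` are illustrative, not the actual implementation):
+
+ ```python
+ from dataclasses import dataclass
+ from typing import Any, Awaitable, Callable, Optional
+
+ @dataclass
+ class GraphNode:
+     node_id: str
+     kind: str  # "agent" | "state" | "decision" | "parallel"
+     run: Callable[[dict[str, Any]], Awaitable[Any]]
+
+ @dataclass
+ class GraphEdge:
+     source: str
+     target: str
+     condition: Optional[Callable[[dict[str, Any]], bool]] = None  # None = sequential edge
+
+     def should_traverse(self, state: dict[str, Any]) -> bool:
+         return True if self.condition is None else self.condition(state)
+
+ # Conditional edges out of the knowledge-gap decision node
+ to_writer = GraphEdge("knowledge_gap", "writer", condition=lambda s: s.get("research_complete", False))
+ to_tools = GraphEdge("knowledge_gap", "tool_selector", condition=lambda s: not s.get("research_complete", False))
+ ```
+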
53
+ ## Graph Patterns
54
+
55
+ ### Iterative Research Graph
56
+
57
+ ```
58
+ [Input] → [Thinking] → [Knowledge Gap] → [Decision: Complete?]
59
+ ↓ No ↓ Yes
60
+ [Tool Selector] [Writer]
61
+
62
+ [Execute Tools] → [Loop Back]
63
+ ```
64
+
65
+ ### Deep Research Graph
66
+
67
+ ```
68
+ [Input] → [Planner] → [Parallel Iterative Loops] → [Synthesizer]
69
+ ↓ ↓ ↓
70
+ [Loop1] [Loop2] [Loop3]
71
+ ```
72
+
73
+ ## State Management
74
+
75
+ State is managed via `WorkflowState` using `ContextVar` for thread-safe isolation:
76
+
77
+ - **Evidence**: Collected evidence from searches
78
+ - **Conversation**: Iteration history (gaps, tool calls, findings, thoughts)
79
+ - **Embedding Service**: For semantic search
80
+
81
+ State transitions occur at state nodes, which update the global workflow state.
82
+
83
+ ## Execution Flow
84
+
85
+ 1. **Graph Construction**: Build graph from nodes and edges
86
+ 2. **Graph Validation**: Ensure graph is valid (no cycles, all nodes reachable)
87
+ 3. **Graph Execution**: Traverse graph from entry node
88
+ 4. **Node Execution**: Execute each node based on type
89
+ 5. **Edge Evaluation**: Determine next node(s) based on edges
90
+ 6. **Parallel Execution**: Use `asyncio.gather()` for parallel nodes
91
+ 7. **State Updates**: Update state at state nodes
92
+ 8. **Event Streaming**: Yield events during execution for UI
93
+
94
+ ## Conditional Routing
95
+
96
+ Decision nodes evaluate conditions and return next node IDs:
97
+
98
+ - **Knowledge Gap Decision**: If `research_complete` → writer, else → tool selector
99
+ - **Budget Decision**: If budget exceeded → exit, else → continue
100
+ - **Iteration Decision**: If max iterations → exit, else → continue
101
+
102
+ ## Parallel Execution
103
+
104
+ Parallel nodes execute multiple nodes concurrently:
105
+
106
+ - Each parallel branch runs independently
107
+ - Results are aggregated after all branches complete
108
+ - State is synchronized after parallel execution
109
+ - Errors in one branch don't stop other branches
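+
+ One way to get that behaviour with `asyncio.gather` (the branch callables and state-dict shape are assumptions):
+
+ ```python
+ import asyncio
+
+ async def run_parallel_branches(branches: list, state: dict) -> list:
+     """Run branch coroutines concurrently; a failure in one branch doesn't stop the others."""
+     results = await asyncio.gather(*(branch(state) for branch in branches), return_exceptions=True)
+     successes = [r for r in results if not isinstance(r, Exception)]
+     # Aggregate and synchronize state only after every branch has finished
+     state["evidence"] = state.get("evidence", []) + [item for r in successes for item in r]
+     return successes
+ ```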
110
+
111
+ ## Budget Enforcement
112
+
113
+ Budget constraints are enforced at decision nodes:
114
+
115
+ - **Token Budget**: Track LLM token usage
116
+ - **Time Budget**: Track elapsed time
117
+ - **Iteration Budget**: Track iteration count
118
+
119
+ If any budget is exceeded, execution routes to exit node.
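+
+ A small illustration of that check at a decision node (field names and limits are assumptions, not the real `WorkflowState`):
+
+ ```python
+ import time
+ from dataclasses import dataclass
+
+ @dataclass
+ class Budget:
+     max_tokens: int = 50_000
+     max_seconds: float = 300.0
+     max_iterations: int = 10
+
+ def next_node(state: dict, budget: Budget, started_at: float) -> str:
+     """Route to the exit node as soon as any budget is exhausted."""
+     if state.get("tokens_used", 0) >= budget.max_tokens:
+         return "exit"
+     if time.monotonic() - started_at >= budget.max_seconds:
+         return "exit"
+     if state.get("iteration", 0) >= budget.max_iterations:
+         return "exit"
+     return "continue_research"
+ ```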
120
+
121
+ ## Error Handling
122
+
123
+ Errors are handled at multiple levels:
124
+
125
+ 1. **Node Level**: Catch errors in individual node execution
126
+ 2. **Graph Level**: Handle errors during graph traversal
127
+ 3. **State Level**: Rollback state changes on error
128
+
129
+ Errors are logged and yield error events for UI.
130
+
131
+ ## Backward Compatibility
132
+
133
+ Graph execution is optional via feature flag:
134
+
135
+ - `USE_GRAPH_EXECUTION=true`: Use graph-based execution
136
+ - `USE_GRAPH_EXECUTION=false`: Use agent chain execution (existing)
137
+
138
+ This allows gradual migration and fallback if needed.
139
+
140
+
141
+
142
+
143
+
144
+
145
+
146
+
147
+
148
+
149
+
150
+
151
+
docs/architecture/overview.md ADDED
@@ -0,0 +1,474 @@
1
+ # DeepCritical: Medical Drug Repurposing Research Agent
2
+ ## Project Overview
3
+
4
+ ---
5
+
6
+ ## Executive Summary
7
+
8
+ **DeepCritical** is a deep research agent designed to accelerate medical drug repurposing research by autonomously searching, analyzing, and synthesizing evidence from multiple biomedical databases.
9
+
10
+ ### The Problem We Solve
11
+
12
+ Drug repurposing - finding new therapeutic uses for existing FDA-approved drugs - can take years of manual literature review. Researchers must:
13
+ - Search thousands of papers across multiple databases
14
+ - Identify molecular mechanisms
15
+ - Find relevant clinical trials
16
+ - Assess safety profiles
17
+ - Synthesize evidence into actionable insights
18
+
19
+ **DeepCritical automates this process, cutting hours of manual searching down to minutes.**
20
+
21
+ ### What Is Drug Repurposing?
22
+
23
+ **Simple Explanation:**
24
+ Using existing approved drugs to treat NEW diseases they weren't originally designed for.
25
+
26
+ **Real Examples:**
27
+ - **Viagra** (sildenafil): Originally for heart disease → Now treats erectile dysfunction
28
+ - **Thalidomide**: Once banned → Now treats multiple myeloma
29
+ - **Aspirin**: Pain reliever → Heart attack prevention
30
+ - **Metformin**: Diabetes drug → Being tested for aging/longevity
31
+
32
+ **Why It Matters:**
33
+ - Faster than developing new drugs (years vs decades)
34
+ - Cheaper (known safety profiles)
35
+ - Lower risk (already FDA approved)
36
+ - Immediate patient benefit potential
37
+
38
+ ---
39
+
40
+ ## Core Use Case
41
+
42
+ ### Primary Query Type
43
+ > "What existing drugs might help treat [disease/condition]?"
44
+
45
+ ### Example Queries
46
+
47
+ 1. **Long COVID Fatigue**
48
+ - Query: "What existing drugs might help treat long COVID fatigue?"
49
+ - Agent searches: PubMed, clinical trials, drug databases
50
+ - Output: List of candidate drugs with mechanisms + evidence + citations
51
+
52
+ 2. **Alzheimer's Disease**
53
+ - Query: "Find existing drugs that target beta-amyloid pathways"
54
+ - Agent identifies: Disease mechanisms → Drug candidates → Clinical evidence
55
+ - Output: Comprehensive research report with drug candidates
56
+
57
+ 3. **Rare Disease Treatment**
58
+ - Query: "What drugs might help with fibrodysplasia ossificans progressiva?"
59
+ - Agent finds: Similar conditions → Shared pathways → Potential treatments
60
+ - Output: Evidence-based treatment suggestions
61
+
62
+ ---
63
+
64
+ ## System Architecture
65
+
66
+ ### High-Level Design (Phases 1-8)
67
+
68
+ ```text
69
+ User Query
70
+
71
+ Gradio UI (Phase 4)
72
+
73
+ Magentic Manager (Phase 5) ← LLM-powered coordinator
74
+ ├── SearchAgent (Phase 2+5) ←→ PubMed + Web + VectorDB (Phase 6)
75
+ ├── HypothesisAgent (Phase 7) ←→ Mechanistic Reasoning
76
+ ├── JudgeAgent (Phase 3+5) ←→ Evidence Assessment
77
+ └── ReportAgent (Phase 8) ←→ Final Synthesis
78
+
79
+ Structured Research Report
80
+ ```
81
+
82
+ ### Key Components
83
+
84
+ 1. **Magentic Manager (Orchestrator)**
85
+ - LLM-powered multi-agent coordinator
86
+ - Dynamic planning and agent selection
87
+ - Built-in stall detection and replanning
88
+ - Microsoft Agent Framework integration
89
+
90
+ 2. **SearchAgent (Phase 2+5+6)**
91
+ - PubMed E-utilities search
92
+ - DuckDuckGo web search
93
+ - Semantic search via ChromaDB (Phase 6)
94
+ - Evidence deduplication
95
+
96
+ 3. **HypothesisAgent (Phase 7)**
97
+ - Generates Drug → Target → Pathway → Effect hypotheses
98
+ - Guides targeted searches
99
+ - Scientific reasoning about mechanisms
100
+
101
+ 4. **JudgeAgent (Phase 3+5)**
102
+ - LLM-based evidence assessment
103
+ - Mechanism score + Clinical score
104
+ - Recommends continue/synthesize
105
+ - Generates refined search queries
106
+
107
+ 5. **ReportAgent (Phase 8)**
108
+ - Structured scientific reports
109
+ - Executive summary, methodology
110
+ - Hypotheses tested with evidence counts
111
+ - Proper citations and limitations
112
+
113
+ 6. **Gradio UI (Phase 4)**
114
+ - Chat interface for questions
115
+ - Real-time progress via events
116
+ - Mode toggle (Simple/Magentic)
117
+ - Formatted markdown output
118
+
119
+ ---
120
+
121
+ ## Design Patterns
122
+
123
+ ### 1. Search-and-Judge Loop (Primary Pattern)
124
+
125
+ ```python
126
+ def research(question: str) -> Report:
127
+ context = []
128
+ for iteration in range(max_iterations):
129
+ # SEARCH: Query relevant tools
130
+ results = search_tools(question, context)
131
+ context.extend(results)
132
+
133
+ # JUDGE: Evaluate quality
134
+ if judge.is_sufficient(question, context):
135
+ break
136
+
137
+ # REFINE: Adjust search strategy
138
+ query = refine_query(question, context)
139
+
140
+ # SYNTHESIZE: Generate report
141
+ return synthesize_report(question, context)
142
+ ```
143
+
144
+ **Why This Pattern:**
145
+ - Simple to implement and debug
146
+ - Clear loop termination conditions
147
+ - Iterative improvement of search quality
148
+ - Balances depth vs speed
149
+
150
+ ### 2. Multi-Tool Orchestration
151
+
152
+ ```
153
+ Question → Agent decides which tools to use
154
+
155
+ ┌───┴────┬─────────┬──────────┐
156
+ ↓ ↓ ↓ ↓
157
+ PubMed Web Search Trials DB Drug DB
158
+ ↓ ↓ ↓ ↓
159
+    └───┬────┴─────────┴──────────┘
160
+
161
+ Aggregate Results → Judge
162
+ ```
163
+
164
+ **Why This Pattern:**
165
+ - Different sources provide different evidence types
166
+ - Parallel tool execution (when possible)
167
+ - Comprehensive coverage
168
+
169
+ ### 3. LLM-as-Judge with Token Budget
170
+
171
+ **Dual Stopping Conditions:**
172
+ - **Smart Stop**: LLM judge says "we have sufficient evidence"
173
+ - **Hard Stop**: Token budget exhausted OR max iterations reached
174
+
175
+ **Why Both:**
176
+ - Judge enables early exit when answer is good
177
+ - Budget prevents runaway costs
178
+ - Iterations prevent infinite loops
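+
+ A compact sketch of how both conditions can be combined (names and limits are illustrative, not the project's API):
+
+ ```python
+ def should_stop(judge_sufficient: bool, tokens_used: int, iteration: int,
+                 max_tokens: int = 50_000, max_iterations: int = 10) -> tuple[bool, str]:
+     if judge_sufficient:
+         return True, "judge_approved"          # smart stop
+     if tokens_used >= max_tokens:
+         return True, "token_budget_exhausted"  # hard stop: cost control
+     if iteration >= max_iterations:
+         return True, "max_iterations_reached"  # hard stop: no infinite loops
+     return False, "continue"
+ ```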
179
+
180
+ ### 4. Stateful Checkpointing
181
+
182
+ ```
183
+ .deepresearch/
184
+ ├── state/
185
+ │ └── query_123.json # Current research state
186
+ ├── checkpoints/
187
+ │ └── query_123_iter3/ # Checkpoint at iteration 3
188
+ └── workspace/
189
+ └── query_123/ # Downloaded papers, data
190
+ ```
191
+
192
+ **Why This Pattern:**
193
+ - Resume interrupted research
194
+ - Debugging and analysis
195
+ - Cost savings (don't re-search)
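+
+ A simplified save/resume sketch using JSON snapshots (the layout above uses per-iteration directories; this collapses them into one file per iteration just to show the idea):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ def save_checkpoint(query_id: str, iteration: int, state: dict) -> Path:
+     path = Path(".deepresearch/checkpoints") / f"{query_id}_iter{iteration}.json"
+     path.parent.mkdir(parents=True, exist_ok=True)
+     path.write_text(json.dumps(state, indent=2))
+     return path
+
+ def load_latest_checkpoint(query_id: str) -> dict | None:
+     checkpoints = sorted(
+         Path(".deepresearch/checkpoints").glob(f"{query_id}_iter*.json"),
+         key=lambda p: int(p.stem.rsplit("_iter", 1)[1]),
+     )
+     return json.loads(checkpoints[-1].read_text()) if checkpoints else None
+ ```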
196
+
197
+ ---
198
+
199
+ ## Component Breakdown
200
+
201
+ ### Agent (Orchestrator)
202
+ - **Responsibility**: Coordinate research process
203
+ - **Size**: ~100 lines
204
+ - **Key Methods**:
205
+ - `research(question)` - Main entry point
206
+ - `plan_search_strategy()` - Decide what to search
207
+ - `execute_search()` - Run tool queries
208
+ - `evaluate_progress()` - Call judge
209
+ - `synthesize_findings()` - Generate report
210
+
211
+ ### Tools
212
+ - **Responsibility**: Interface with external data sources
213
+ - **Size**: ~50 lines per tool
214
+ - **Implementations**:
215
+ - `PubMedTool` - Search biomedical literature
216
+ - `WebSearchTool` - General medical information
217
+ - `ClinicalTrialsTool` - Trial data (optional)
218
+ - `DrugInfoTool` - FDA drug database (optional)
219
+
220
+ ### Judge
221
+ - **Responsibility**: Evaluate evidence quality
222
+ - **Size**: ~50 lines
223
+ - **Key Methods**:
224
+ - `is_sufficient(question, evidence)` → bool
225
+ - `assess_quality(evidence)` → score
226
+ - `identify_gaps(question, evidence)` → missing_info
227
+
228
+ ### Gradio App
229
+ - **Responsibility**: User interface
230
+ - **Size**: ~50 lines
231
+ - **Features**:
232
+ - Text input for questions
233
+ - Progress indicators
234
+ - Formatted output with citations
235
+ - Download research report
236
+
237
+ ---
238
+
239
+ ## Technical Stack
240
+
241
+ ### Core Dependencies
242
+ ```toml
243
+ [dependencies]
244
+ python = ">=3.10"
245
+ pydantic = "^2.7"
246
+ pydantic-ai = "^0.0.16"
247
+ fastmcp = "^0.1.0"
248
+ gradio = "^5.0"
249
+ beautifulsoup4 = "^4.12"
250
+ httpx = "^0.27"
251
+ ```
252
+
253
+ ### Optional Enhancements
254
+ - `modal` - For GPU-accelerated local LLM
255
+ - `fastmcp` - MCP server integration
256
+ - `sentence-transformers` - Semantic search
257
+ - `faiss-cpu` - Vector similarity
258
+
259
+ ### Tool APIs & Rate Limits
260
+
261
+ | API | Cost | Rate Limit | API Key? | Notes |
262
+ |-----|------|------------|----------|-------|
263
+ | **PubMed E-utilities** | Free | 3/sec (no key), 10/sec (with key) | Optional | Register at NCBI for higher limits |
264
+ | **Brave Search API** | Free tier | 2000/month free | Required | Primary web search |
265
+ | **DuckDuckGo** | Free | Unofficial, ~1/sec | No | Fallback web search |
266
+ | **ClinicalTrials.gov** | Free | 100/min | No | Stretch goal |
267
+ | **OpenFDA** | Free | 240/min (no key), 120K/day (with key) | Optional | Drug info |
268
+
269
+ **Web Search Strategy (Priority Order):**
270
+ 1. **Brave Search API** (free tier: 2000 queries/month) - Primary
271
+ 2. **DuckDuckGo** (unofficial, no API key) - Fallback
272
+ 3. **SerpAPI** ($50/month) - Only if free options fail
273
+
274
+ **Why NOT SerpAPI first?**
275
+ - Costs money (hackathon budget = $0)
276
+ - Free alternatives work fine for demo
277
+ - Can upgrade later if needed
278
+
279
+ ---
280
+
281
+ ## Success Criteria
282
+
283
+ ### Phase 1-5 (MVP) ✅ COMPLETE
284
+ **Completed in ONE DAY:**
285
+ - [x] User can ask drug repurposing question
286
+ - [x] Agent searches PubMed (async)
287
+ - [x] Agent searches web (DuckDuckGo)
288
+ - [x] LLM judge evaluates evidence quality
289
+ - [x] System respects token budget and iterations
290
+ - [x] Output includes drug candidates + citations
291
+ - [x] Works end-to-end for demo query
292
+ - [x] Gradio UI with streaming progress
293
+ - [x] Magentic multi-agent orchestration
294
+ - [x] 38 unit tests passing
295
+ - [x] CI/CD pipeline green
296
+
297
+ ### Hackathon Submission ✅ COMPLETE
298
+ - [x] Gradio UI deployed on HuggingFace Spaces
299
+ - [x] Example queries working and tested
300
+ - [x] Architecture documentation
301
+ - [x] README with setup instructions
302
+
303
+ ### Phase 6-8 (Enhanced)
304
+ **Specs ready for implementation:**
305
+ - [ ] Embeddings & Semantic Search (Phase 6)
306
+ - [ ] Hypothesis Agent (Phase 7)
307
+ - [ ] Report Agent (Phase 8)
308
+
309
+ ### What's EXPLICITLY Out of Scope
310
+ **NOT building (to stay focused):**
311
+ - ❌ User authentication
312
+ - ❌ Database storage of queries
313
+ - ❌ Multi-user support
314
+ - ❌ Payment/billing
315
+ - ❌ Production monitoring
316
+ - ❌ Mobile UI
317
+
318
+ ---
319
+
320
+ ## Implementation Timeline
321
+
322
+ ### Day 1 (Today): Architecture & Setup
323
+ - [x] Define use case (drug repurposing) ✅
324
+ - [x] Write architecture docs ✅
325
+ - [ ] Create project structure
326
+ - [ ] First PR: Structure + Docs
327
+
328
+ ### Day 2: Core Agent Loop
329
+ - [ ] Implement basic orchestrator
330
+ - [ ] Add PubMed search tool
331
+ - [ ] Simple judge (keyword-based)
332
+ - [ ] Test with 1 query
333
+
334
+ ### Day 3: Intelligence Layer
335
+ - [ ] Upgrade to LLM judge
336
+ - [ ] Add web search tool
337
+ - [ ] Token budget tracking
338
+ - [ ] Test with multiple queries
339
+
340
+ ### Day 4: UI & Integration
341
+ - [ ] Build Gradio interface
342
+ - [ ] Wire up agent to UI
343
+ - [ ] Add progress indicators
344
+ - [ ] Format output nicely
345
+
346
+ ### Day 5: Polish & Extend
347
+ - [ ] Add more tools (clinical trials)
348
+ - [ ] Improve judge prompts
349
+ - [ ] Checkpoint system
350
+ - [ ] Error handling
351
+
352
+ ### Day 6: Deploy & Document
353
+ - [ ] Deploy to HuggingFace Spaces
354
+ - [ ] Record demo video
355
+ - [ ] Write submission materials
356
+ - [ ] Final testing
357
+
358
+ ---
359
+
360
+ ## Questions This Document Answers
361
+
362
+ ### For The Maintainer
363
+
364
+ **Q: "What should our design pattern be?"**
365
+ A: Search-and-judge loop with multi-tool orchestration (detailed in Design Patterns section)
366
+
367
+ **Q: "Should we use LLM-as-judge or token budget?"**
368
+ A: Both - judge for smart stopping, budget for cost control
369
+
370
+ **Q: "What's the break pattern?"**
371
+ A: Three conditions: judge approval, token limit, or max iterations (whichever comes first)
372
+
373
+ **Q: "What components do we need?"**
374
+ A: Agent orchestrator, tools (PubMed/web), judge, Gradio UI (see Component Breakdown)
375
+
376
+ ### For The Team
377
+
378
+ **Q: "What are we actually building?"**
379
+ A: Medical drug repurposing research agent (see Core Use Case)
380
+
381
+ **Q: "How complex should it be?"**
382
+ A: Simple but complete - ~300 lines of core code (see Component sizes)
383
+
384
+ **Q: "What's the timeline?"**
385
+ A: 6 days, MVP by Day 3, polish Days 4-6 (see Implementation Timeline)
386
+
387
+ **Q: "What datasets/APIs do we use?"**
388
+ A: PubMed (free), web search, clinical trials.gov (see Tool APIs)
389
+
390
+ ---
391
+
392
+ ## Next Steps
393
+
394
+ 1. **Review this document** - Team feedback on architecture
395
+ 2. **Finalize design** - Incorporate feedback
396
+ 3. **Create project structure** - Scaffold repository
397
+ 4. **Move to proper docs** - `docs/architecture/` folder
398
+ 5. **Open first PR** - Structure + Documentation
399
+ 6. **Start implementation** - Day 2 onward
400
+
401
+ ---
402
+
403
+ ## Notes & Decisions
404
+
405
+ ### Why Drug Repurposing?
406
+ - Clear, impressive use case
407
+ - Real-world medical impact
408
+ - Good data availability (PubMed, trials)
409
+ - Easy to explain (Viagra example!)
410
+ - Physician on team ✅
411
+
412
+ ### Why Simple Architecture?
413
+ - 6-day timeline
414
+ - Need working end-to-end system
415
+ - Hackathon judges value "works" over "complex"
416
+ - Can extend later if successful
417
+
418
+ ### Why These Tools First?
419
+ - PubMed: Best biomedical literature source
420
+ - Web search: General medical knowledge
421
+ - Clinical trials: Evidence of actual testing
422
+ - Others: Nice-to-have, not critical for MVP
423
+
424
+ ---
425
+
426
+ ---
427
+
428
+ ## Appendix A: Demo Queries (Pre-tested)
429
+
430
+ These queries will be used for demo and testing. They're chosen because:
431
+ 1. They have good PubMed coverage
432
+ 2. They're medically interesting
433
+ 3. They show the system's capabilities
434
+
435
+ ### Primary Demo Query
436
+ ```
437
+ "What existing drugs might help treat long COVID fatigue?"
438
+ ```
439
+ **Expected candidates**: CoQ10, Low-dose Naltrexone, Modafinil
440
+ **Expected sources**: 20+ PubMed papers, 2-3 clinical trials
441
+
442
+ ### Secondary Demo Queries
443
+ ```
444
+ "Find existing drugs that might slow Alzheimer's progression"
445
+ "What approved medications could help with fibromyalgia pain?"
446
+ "Which diabetes drugs show promise for cancer treatment?"
447
+ ```
448
+
449
+ ### Why These Queries?
450
+ - Represent real clinical needs
451
+ - Have substantial literature
452
+ - Show diverse drug classes
453
+ - Physician on team can validate results
454
+
455
+ ---
456
+
457
+ ## Appendix B: Risk Assessment
458
+
459
+ | Risk | Likelihood | Impact | Mitigation |
460
+ |------|------------|--------|------------|
461
+ | PubMed rate limiting | Medium | High | Implement caching, respect 3/sec |
462
+ | Web search API fails | Low | Medium | DuckDuckGo fallback |
463
+ | LLM costs exceed budget | Medium | Medium | Hard token cap at 50K |
464
+ | Judge quality poor | Medium | High | Pre-test prompts, iterate |
465
+ | HuggingFace deploy issues | Low | High | Test deployment Day 4 |
466
+ | Demo crashes live | Medium | High | Pre-recorded backup video |
467
+
468
+ ---
469
+
470
+ ---
471
+
472
+ **Document Status**: Official Architecture Spec
473
+ **Review Score**: 98/100
474
+ **Last Updated**: November 2025
docs/brainstorming/00_ROADMAP_SUMMARY.md ADDED
@@ -0,0 +1,194 @@
1
+ # DeepCritical Data Sources: Roadmap Summary
2
+
3
+ **Created**: 2024-11-27
4
+ **Purpose**: Future maintainability and hackathon continuation
5
+
6
+ ---
7
+
8
+ ## Current State
9
+
10
+ ### Working Tools
11
+
12
+ | Tool | Status | Data Quality |
13
+ |------|--------|--------------|
14
+ | PubMed | ✅ Works | Good (abstracts only) |
15
+ | ClinicalTrials.gov | ✅ Works | Good (filtered for interventional) |
16
+ | Europe PMC | ✅ Works | Good (includes preprints) |
17
+
18
+ ### Removed Tools
19
+
20
+ | Tool | Status | Reason |
21
+ |------|--------|--------|
22
+ | bioRxiv | ❌ Removed | No search API - only date/DOI lookup |
23
+
24
+ ---
25
+
26
+ ## Priority Improvements
27
+
28
+ ### P0: Critical (Do First)
29
+
30
+ 1. **Add Rate Limiting to PubMed**
31
+ - NCBI will block us without it
32
+ - Use `limits` library (see reference repo)
33
+ - 3/sec without key, 10/sec with key
34
+
35
+ ### P1: High Value, Medium Effort
36
+
37
+ 2. **Add OpenAlex as 4th Source**
38
+ - Citation network (huge for drug repurposing)
39
+ - Concept tagging (semantic discovery)
40
+ - Already implemented in reference repo
41
+ - Free, no API key
42
+
43
+ 3. **PubMed Full-Text via BioC**
44
+ - Get full paper text for PMC papers
45
+ - Already in reference repo
46
+
47
+ ### P2: Nice to Have
48
+
49
+ 4. **ClinicalTrials.gov Results**
50
+ - Get efficacy data from completed trials
51
+ - Requires more complex API calls
52
+
53
+ 5. **Europe PMC Annotations**
54
+ - Text-mined entities (genes, drugs, diseases)
55
+ - Automatic entity extraction
56
+
57
+ ---
58
+
59
+ ## Effort Estimates
60
+
61
+ | Improvement | Effort | Impact | Priority |
62
+ |-------------|--------|--------|----------|
63
+ | PubMed rate limiting | 1 hour | Stability | P0 |
64
+ | OpenAlex basic search | 2 hours | High | P1 |
65
+ | OpenAlex citations | 2 hours | Very High | P1 |
66
+ | PubMed full-text | 3 hours | Medium | P1 |
67
+ | CT.gov results | 4 hours | Medium | P2 |
68
+ | Europe PMC annotations | 3 hours | Medium | P2 |
69
+
70
+ ---
71
+
72
+ ## Architecture Decision
73
+
74
+ ### Option A: Keep Current + Add OpenAlex
75
+
76
+ ```
77
+ User Query
78
+
79
+ ┌───────────────────┼───────────────────┐
80
+ ↓ ↓ ↓
81
+ PubMed ClinicalTrials Europe PMC
82
+ (abstracts) (trials only) (preprints)
83
+ ↓ ↓ ↓
84
+ └───────────────────┼───────────────────┘
85
+
86
+ OpenAlex ← NEW
87
+ (citations, concepts)
88
+
89
+ Orchestrator
90
+
91
+ Report
92
+ ```
93
+
94
+ **Pros**: Low risk, additive
95
+ **Cons**: More complexity, some overlap
96
+
97
+ ### Option B: OpenAlex as Primary
98
+
99
+ ```
100
+ User Query
101
+
102
+ ┌───────────────────┼───────────────────┐
103
+ ↓ ↓ ↓
104
+ OpenAlex ClinicalTrials Europe PMC
105
+ (primary (trials only) (full-text
106
+ search) fallback)
107
+ ↓ ↓ ↓
108
+ └───────────────────┼───────────────────┘
109
+
110
+ Orchestrator
111
+
112
+ Report
113
+ ```
114
+
115
+ **Pros**: Simpler, citation network built-in
116
+ **Cons**: Lose some PubMed-specific features
117
+
118
+ ### Recommendation: Option A
119
+
120
+ Keep current architecture working, add OpenAlex incrementally.
121
+
122
+ ---
123
+
124
+ ## Quick Wins (Can Do Today)
125
+
126
+ 1. **Add `limits` to `pyproject.toml`**
127
+ ```toml
128
+ dependencies = [
129
+ "limits>=3.0",
130
+ ]
131
+ ```
132
+
133
+ 2. **Copy OpenAlex tool from reference repo**
134
+ - File: `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`
135
+ - Adapt to our `SearchTool` base class
136
+
137
+ 3. **Enable NCBI API Key**
138
+ - Add to `.env`: `NCBI_API_KEY=your_key`
139
+ - 10x rate limit improvement
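+
+ For item 2 above, a rough sketch of what the adapted search call could look like against the public OpenAlex REST API (field handling is simplified; wire it into the `SearchTool` interface):
+
+ ```python
+ import httpx
+
+ async def search_openalex(query: str, max_results: int = 10) -> list[dict]:
+     params = {"search": query, "per-page": max_results, "mailto": "you@example.com"}
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get("https://api.openalex.org/works", params=params)
+     response.raise_for_status()
+     works = response.json().get("results", [])
+     return [
+         {"title": w.get("display_name"), "doi": w.get("doi"), "cited_by": w.get("cited_by_count")}
+         for w in works
+     ]
+ ```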
140
+
141
+ ---
142
+
143
+ ## External Resources Worth Exploring
144
+
145
+ ### Python Libraries
146
+
147
+ | Library | For | Notes |
148
+ |---------|-----|-------|
149
+ | `limits` | Rate limiting | Used by reference repo |
150
+ | `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) |
151
+ | `metapub` | PubMed | Full-featured |
152
+ | `sentence-transformers` | Semantic search | For embeddings |
153
+
154
+ ### APIs Not Yet Used
155
+
156
+ | API | Provides | Effort |
157
+ |-----|----------|--------|
158
+ | RxNorm | Drug name normalization | Low |
159
+ | DrugBank | Drug targets/mechanisms | Medium (license) |
160
+ | UniProt | Protein data | Medium |
161
+ | ChEMBL | Bioactivity data | Medium |
162
+
163
+ ### RAG Tools (Future)
164
+
165
+ | Tool | Purpose |
166
+ |------|---------|
167
+ | [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers |
168
+ | [txtai](https://github.com/neuml/txtai) | Embeddings + search |
169
+ | [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings |
170
+
171
+ ---
172
+
173
+ ## Files in This Directory
174
+
175
+ | File | Contents |
176
+ |------|----------|
177
+ | `00_ROADMAP_SUMMARY.md` | This file |
178
+ | `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
179
+ | `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
180
+ | `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
181
+ | `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |
182
+
183
+ ---
184
+
185
+ ## For Future Maintainers
186
+
187
+ If you're picking this up after the hackathon:
188
+
189
+ 1. **Start with OpenAlex** - biggest bang for buck
190
+ 2. **Add rate limiting** - prevents API blocks
191
+ 3. **Don't bother with bioRxiv** - use Europe PMC instead
192
+ 4. **Reference repo is gold** - `reference_repos/DeepCritical/` has working implementations
193
+
194
+ Good luck! 🚀
docs/brainstorming/01_PUBMED_IMPROVEMENTS.md ADDED
@@ -0,0 +1,125 @@
1
+ # PubMed Tool: Current State & Future Improvements
2
+
3
+ **Status**: Currently Implemented
4
+ **Priority**: High (Core Data Source)
5
+
6
+ ---
7
+
8
+ ## Current Implementation
9
+
10
+ ### What We Have (`src/tools/pubmed.py`)
11
+
12
+ - Basic E-utilities search via `esearch.fcgi` and `efetch.fcgi`
13
+ - Query preprocessing (strips question words, expands synonyms)
14
+ - Returns: title, abstract, authors, journal, PMID
15
+ - Rate limiting: None implemented (relying on NCBI defaults)
16
+
17
+ ### Current Limitations
18
+
19
+ 1. **No Full-Text Access**: Only retrieves abstracts, not full paper text
20
+ 2. **No Rate Limiting**: Risk of being blocked by NCBI
21
+ 3. **No BioC Format**: Missing structured full-text extraction
22
+ 4. **No Figure Retrieval**: No supplementary materials access
23
+ 5. **No PMC Integration**: Missing open-access full-text via PMC
24
+
25
+ ---
26
+
27
+ ## Reference Implementation (DeepCritical Reference Repo)
28
+
29
+ The reference repo at `reference_repos/DeepCritical/DeepResearch/src/tools/bioinformatics_tools.py` has a more sophisticated implementation:
30
+
31
+ ### Features We're Missing
32
+
33
+ ```python
34
+ # Rate limiting (lines 47-50)
35
+ from limits import parse
36
+ from limits.storage import MemoryStorage
37
+ from limits.strategies import MovingWindowRateLimiter
38
+
39
+ storage = MemoryStorage()
40
+ limiter = MovingWindowRateLimiter(storage)
41
+ rate_limit = parse("3/second") # NCBI allows 3/sec without API key, 10/sec with
42
+
43
+ # Full-text via BioC format (lines 108-120)
44
+ def _get_fulltext(pmid: int) -> dict[str, Any] | None:
45
+ pmid_url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
46
+ # Returns structured JSON with full text for open-access papers
47
+
48
+ # Figure retrieval via Europe PMC (lines 123-149)
49
+ def _get_figures(pmcid: str) -> dict[str, str]:
50
+ suppl_url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles"
51
+ # Returns base64-encoded images from supplementary materials
52
+ ```
53
+
54
+ ---
55
+
56
+ ## Recommended Improvements
57
+
58
+ ### Phase 1: Rate Limiting (Critical)
59
+
60
+ ```python
61
+ # Add to src/tools/pubmed.py
62
+ from limits import parse
63
+ from limits.storage import MemoryStorage
64
+ from limits.strategies import MovingWindowRateLimiter
65
+
66
+ storage = MemoryStorage()
67
+ limiter = MovingWindowRateLimiter(storage)
68
+
69
+ # With NCBI_API_KEY: 10/sec, without: 3/sec
70
+ def get_rate_limit():
71
+ if settings.ncbi_api_key:
72
+ return parse("10/second")
73
+ return parse("3/second")
74
+ ```
75
+
76
+ **Dependencies**: `pip install limits`
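+
+ The snippet above builds the limiter but doesn't show the call site; a minimal self-contained pattern for gating each request with the `limits` moving-window strategy:
+
+ ```python
+ import asyncio
+ from limits import parse
+ from limits.storage import MemoryStorage
+ from limits.strategies import MovingWindowRateLimiter
+
+ _limiter = MovingWindowRateLimiter(MemoryStorage())
+ _rate = parse("3/second")  # switch to "10/second" when NCBI_API_KEY is set
+
+ async def wait_for_slot(identifier: str = "pubmed") -> None:
+     """Block (politely) until the limiter grants a request slot."""
+     while not _limiter.hit(_rate, identifier):
+         await asyncio.sleep(0.1)
+ ```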
77
+
78
+ ### Phase 2: Full-Text Retrieval
79
+
80
+ ```python
81
+ import httpx
+
+ async def get_fulltext(pmid: str) -> str | None:
+     """Get full text (BioC JSON) for open-access PMC papers; None otherwise."""
+     url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get(url)
+     return response.text if response.status_code == 200 else None
85
+ ```
86
+
87
+ ### Phase 3: PMC ID Resolution
88
+
89
+ ```python
90
+ async def get_pmc_id(pmid: str) -> str | None:
+     """Convert PMID to PMCID for full-text access."""
+     url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids={pmid}&format=json"
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         records = (await client.get(url)).json().get("records", [])
+     return records[0].get("pmcid") if records else None
93
+ ```
94
+
95
+ ---
96
+
97
+ ## Python Libraries to Consider
98
+
99
+ | Library | Purpose | Notes |
100
+ |---------|---------|-------|
101
+ | [Biopython](https://biopython.org/) | `Bio.Entrez` module | Official, well-maintained |
102
+ | [PyMed](https://pypi.org/project/pymed/) | PubMed wrapper | Simpler API, less control |
103
+ | [metapub](https://pypi.org/project/metapub/) | Full-featured | Tested on 1/3 of PubMed |
104
+ | [limits](https://pypi.org/project/limits/) | Rate limiting | Used by reference repo |
105
+
106
+ ---
107
+
108
+ ## API Endpoints Reference
109
+
110
+ | Endpoint | Purpose | Rate Limit |
111
+ |----------|---------|------------|
112
+ | `esearch.fcgi` | Search for PMIDs | 3/sec (10 with key) |
113
+ | `efetch.fcgi` | Fetch metadata | 3/sec (10 with key) |
114
+ | `esummary.fcgi` | Quick metadata | 3/sec (10 with key) |
115
+ | `pmcoa.cgi/BioC_json` | Full text (PMC only) | Unknown |
116
+ | `idconv/v1.0` | PMID ↔ PMCID | Unknown |
117
+
118
+ ---
119
+
120
+ ## Sources
121
+
122
+ - [PubMed E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25501/)
123
+ - [NCBI BioC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/)
124
+ - [Searching PubMed with Python](https://marcobonzanini.com/2015/01/12/searching-pubmed-with-python/)
125
+ - [PyMed on PyPI](https://pypi.org/project/pymed/)
docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md ADDED
@@ -0,0 +1,193 @@
1
+ # ClinicalTrials.gov Tool: Current State & Future Improvements
2
+
3
+ **Status**: Currently Implemented
4
+ **Priority**: High (Core Data Source for Drug Repurposing)
5
+
6
+ ---
7
+
8
+ ## Current Implementation
9
+
10
+ ### What We Have (`src/tools/clinicaltrials.py`)
11
+
12
+ - V2 API search via `clinicaltrials.gov/api/v2/studies`
13
+ - Filters: `INTERVENTIONAL` study type, `RECRUITING` status
14
+ - Returns: NCT ID, title, conditions, interventions, phase, status
15
+ - Query preprocessing via shared `query_utils.py`
16
+
17
+ ### Current Strengths
18
+
19
+ 1. **Good Filtering**: Already filtering for interventional + recruiting
20
+ 2. **V2 API**: Using the modern API (v1 deprecated)
21
+ 3. **Phase Info**: Extracting trial phases for drug development context
22
+
23
+ ### Current Limitations
24
+
25
+ 1. **No Outcome Data**: Missing primary/secondary outcomes
26
+ 2. **No Eligibility Criteria**: Missing inclusion/exclusion details
27
+ 3. **No Sponsor Info**: Missing who's running the trial
28
+ 4. **No Result Data**: For completed trials, no efficacy data
29
+ 5. **Limited Drug Mapping**: No integration with drug databases
30
+
31
+ ---
32
+
33
+ ## API Capabilities We're Not Using
34
+
35
+ ### Fields We Could Request
36
+
37
+ ```python
38
+ # Current fields
39
+ fields = ["NCTId", "BriefTitle", "Condition", "InterventionName", "Phase", "OverallStatus"]
40
+
41
+ # Additional valuable fields
42
+ additional_fields = [
43
+ "PrimaryOutcomeMeasure", # What are they measuring?
44
+ "SecondaryOutcomeMeasure", # Secondary endpoints
45
+ "EligibilityCriteria", # Who can participate?
46
+ "LeadSponsorName", # Who's funding?
47
+ "ResultsFirstPostDate", # Has results?
48
+ "StudyFirstPostDate", # When started?
49
+ "CompletionDate", # When finished?
50
+ "EnrollmentCount", # Sample size
51
+ "InterventionDescription", # Drug details
52
+ "ArmGroupLabel", # Treatment arms
53
+ "InterventionOtherName", # Drug aliases
54
+ ]
55
+ ```
56
+
57
+ ### Filter Enhancements
58
+
59
+ ```python
60
+ # Current
61
+ aggFilters = "studyType:INTERVENTIONAL,status:RECRUITING"
62
+
63
+ # Could add
64
+ "status:RECRUITING,ACTIVE_NOT_RECRUITING,COMPLETED" # Include completed for results
65
+ "phase:PHASE2,PHASE3" # Only later-stage trials
66
+ "resultsFirstPostDateRange:2020-01-01_" # Trials with posted results
67
+ ```
68
+
69
+ ---
70
+
71
+ ## Recommended Improvements
72
+
73
+ ### Phase 1: Richer Metadata
74
+
75
+ ```python
76
+ EXTENDED_FIELDS = [
77
+ "NCTId",
78
+ "BriefTitle",
79
+ "OfficialTitle",
80
+ "Condition",
81
+ "InterventionName",
82
+ "InterventionDescription",
83
+ "InterventionOtherName", # Drug synonyms!
84
+ "Phase",
85
+ "OverallStatus",
86
+ "PrimaryOutcomeMeasure",
87
+ "EnrollmentCount",
88
+ "LeadSponsorName",
89
+ "StudyFirstPostDate",
90
+ ]
91
+ ```
92
+
93
+ ### Phase 2: Results Retrieval
94
+
95
+ For completed trials, we can get actual efficacy data:
96
+
97
+ ```python
98
+ import httpx
+
+ async def get_trial_results(nct_id: str) -> dict | None:
+     """Fetch posted results (outcome measures and statistics) for a completed trial."""
+     url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
+     params = {"fields": "ResultsSection"}
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get(url, params=params)
+     return response.json() if response.status_code == 200 else None
105
+ ```
106
+
107
+ ### Phase 3: Drug Name Normalization
108
+
109
+ Map intervention names to standard identifiers:
110
+
111
+ ```python
112
+ # Problem: "Metformin", "Metformin HCl", "Glucophage" are the same drug
113
+ # Solution: Use RxNorm or DrugBank for normalization
114
+
115
+ async def normalize_drug_name(intervention: str) -> str | None:
+     """Normalize a drug name to an RxCUI via the RxNorm API."""
+     url = f"https://rxnav.nlm.nih.gov/REST/rxcui.json?name={intervention}"
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         data = (await client.get(url)).json()
+     ids = data.get("idGroup", {}).get("rxnormId", [])
+     return ids[0] if ids else None  # standardized RxCUI, or None if no match
119
+ ```
120
+
121
+ ---
122
+
123
+ ## Integration Opportunities
124
+
125
+ ### With PubMed
126
+
127
+ Cross-reference trials with publications:
128
+ ```python
129
+ # ClinicalTrials.gov provides PMID links
130
+ # Can correlate trial results with published papers
131
+ ```
132
+
133
+ ### With DrugBank/ChEMBL
134
+
135
+ Map interventions to:
136
+ - Mechanism of action
137
+ - Known targets
138
+ - Adverse effects
139
+ - Drug-drug interactions
140
+
141
+ ---
142
+
143
+ ## Python Libraries to Consider
144
+
145
+ | Library | Purpose | Notes |
146
+ |---------|---------|-------|
147
+ | [pytrials](https://pypi.org/project/pytrials/) | CT.gov wrapper | V2 API support unclear |
148
+ | [clinicaltrials](https://github.com/ebmdatalab/clinicaltrials-act-tracker) | Data tracking | More for analysis |
149
+ | [drugbank-downloader](https://pypi.org/project/drugbank-downloader/) | Drug mapping | Requires license |
150
+
151
+ ---
152
+
153
+ ## API Quirks & Gotchas
154
+
155
+ 1. **Rate Limiting**: Undocumented, be conservative
156
+ 2. **Pagination**: Max 1000 results per request
157
+ 3. **Field Names**: Case-sensitive, camelCase
158
+ 4. **Empty Results**: Some fields may be null even if requested
159
+ 5. **Status Changes**: Trials change status frequently
160
+
161
+ ---
162
+
163
+ ## Example Enhanced Query
164
+
165
+ ```python
166
+ async def search_drug_repurposing_trials(
167
+ drug_name: str,
168
+ condition: str,
169
+ include_completed: bool = True,
170
+ ) -> list[Evidence]:
171
+ """Search for trials repurposing a drug for a new condition."""
172
+
173
+ statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"]
174
+ if include_completed:
175
+ statuses.append("COMPLETED")
176
+
177
+ params = {
178
+ "query.intr": drug_name,
179
+ "query.cond": condition,
180
+ "filter.overallStatus": ",".join(statuses),
181
+ "filter.studyType": "INTERVENTIONAL",
182
+ "fields": ",".join(EXTENDED_FIELDS),
183
+ "pageSize": 50,
184
+ }
185
+ ```
186
+
187
+ ---
188
+
189
+ ## Sources
190
+
191
+ - [ClinicalTrials.gov API Documentation](https://clinicaltrials.gov/data-api/api)
192
+ - [CT.gov Field Definitions](https://clinicaltrials.gov/data-api/about-api/study-data-structure)
193
+ - [RxNorm API](https://lhncbc.nlm.nih.gov/RxNav/APIs/api-RxNorm.findRxcuiByString.html)
docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md ADDED
@@ -0,0 +1,211 @@
1
+ # Europe PMC Tool: Current State & Future Improvements
2
+
3
+ **Status**: Currently Implemented (Replaced bioRxiv)
4
+ **Priority**: High (Preprint + Open Access Source)
5
+
6
+ ---
7
+
8
+ ## Why Europe PMC Over bioRxiv?
9
+
10
+ ### bioRxiv API Limitations (Why We Abandoned It)
11
+
12
+ 1. **No Search API**: Only returns papers by date range or DOI
13
+ 2. **No Query Capability**: Cannot search for "metformin cancer"
14
+ 3. **Workaround Required**: Would need to download ALL preprints and build local search
15
+ 4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation
16
+
17
+ ### Europe PMC Advantages
18
+
19
+ 1. **Full Search API**: Boolean queries, filters, facets
20
+ 2. **Aggregates bioRxiv**: Includes bioRxiv, medRxiv content anyway
21
+ 3. **Includes PubMed**: Also has MEDLINE content
22
+ 4. **34 Preprint Servers**: Not just bioRxiv
23
+ 5. **Open Access Focus**: Full-text when available
24
+
25
+ ---
26
+
27
+ ## Current Implementation
28
+
29
+ ### What We Have (`src/tools/europepmc.py`)
30
+
31
+ - REST API search via `europepmc.org/webservices/rest/search`
32
+ - Preprint flagging via `firstPublicationDate` heuristics
33
+ - Returns: title, abstract, authors, DOI, source
34
+ - Marks preprints for transparency
35
+
36
+ ### Current Limitations
37
+
38
+ 1. **No Full-Text Retrieval**: Only metadata/abstracts
39
+ 2. **No Citation Network**: Missing references/citations
40
+ 3. **No Supplementary Files**: Not fetching figures/data
41
+ 4. **Basic Preprint Detection**: Heuristic, not explicit flag
42
+
43
+ ---
44
+
45
+ ## Europe PMC API Capabilities
46
+
47
+ ### Endpoints We Could Use
48
+
49
+ | Endpoint | Purpose | Currently Using |
50
+ |----------|---------|-----------------|
51
+ | `/search` | Query papers | Yes |
52
+ | `/fulltext/{ID}` | Full text (XML/JSON) | No |
53
+ | `/{PMCID}/supplementaryFiles` | Figures, data | No |
54
+ | `/citations/{ID}` | Who cited this | No |
55
+ | `/references/{ID}` | What this cites | No |
56
+ | `/annotations` | Text-mined entities | No |
57
+
58
+ ### Rich Query Syntax
59
+
60
+ ```python
61
+ # Current simple query
62
+ query = "metformin cancer"
63
+
64
+ # Could use advanced syntax
65
+ query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
66
+ query += " AND (SRC:PPR)" # Only preprints
67
+ query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])" # Date range
68
+ query += " AND (OPEN_ACCESS:y)" # Only open access
69
+ ```
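+
+ Sending such a query to the REST endpoint is straightforward; a sketch (result parsing trimmed to the JSON envelope Europe PMC returns):
+
+ ```python
+ import httpx
+
+ async def search_europepmc(query: str, page_size: int = 25) -> list[dict]:
+     params = {"query": query, "format": "json", "pageSize": page_size}
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         response = await client.get(
+             "https://www.ebi.ac.uk/europepmc/webservices/rest/search", params=params
+         )
+     response.raise_for_status()
+     return response.json().get("resultList", {}).get("result", [])
+ ```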
70
+
71
+ ### Source Filters
72
+
73
+ ```python
74
+ # Filter by source
75
+ "SRC:MED" # MEDLINE
76
+ "SRC:PMC" # PubMed Central
77
+ "SRC:PPR" # Preprints (bioRxiv, medRxiv, etc.)
78
+ "SRC:AGR" # Agricola
79
+ "SRC:CBA" # Chinese Biological Abstracts
80
+ ```
81
+
82
+ ---
83
+
84
+ ## Recommended Improvements
85
+
86
+ ### Phase 1: Rich Metadata
87
+
88
+ ```python
89
+ # Add to search results
90
+ additional_fields = [
91
+ "citedByCount", # Impact indicator
92
+ "source", # Explicit source (MED, PMC, PPR)
93
+ "isOpenAccess", # Boolean flag
94
+ "fullTextUrlList", # URLs for full text
95
+ "authorAffiliations", # Institution info
96
+ "grantsList", # Funding info
97
+ ]
98
+ ```
99
+
100
+ ### Phase 2: Full-Text Retrieval
101
+
102
+ ```python
103
+ async def get_fulltext(pmcid: str) -> str | None:
104
+ """Get full text for open access papers."""
105
+ # XML format
106
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
107
+ # Or JSON
108
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextJSON"
109
+ ```
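+
+ Fleshed out, the same helper is only a few more lines. A sketch assuming `httpx` and the XML endpoint above; it returns `None` when no open-access full text exists:
+
+ ```python
+ import httpx
+
+ async def get_fulltext(pmcid: str) -> str | None:
+     """Sketch: fetch open-access full text (XML) for a PMC article."""
+     url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
+     async with httpx.AsyncClient(timeout=60.0) as client:
+         resp = await client.get(url)
+         if resp.status_code != 200:
+             return None  # not open access, or not in PMC
+         return resp.text
+ ```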
110
+
111
+ ### Phase 3: Citation Network
112
+
113
+ ```python
114
+ async def get_citations(pmcid: str) -> list[str]:
115
+ """Get papers that cite this one."""
116
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/citations"
117
+
118
+ async def get_references(pmcid: str) -> list[str]:
119
+ """Get papers this one cites."""
120
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/references"
121
+ ```
122
+
123
+ ### Phase 4: Text-Mined Annotations
124
+
125
+ Europe PMC extracts entities automatically:
126
+
127
+ ```python
128
+ async def get_annotations(pmcid: str) -> dict:
129
+ """Get text-mined entities (genes, diseases, drugs)."""
130
+ url = f"https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
131
+ params = {
132
+ "articleIds": f"PMC:{pmcid}",
133
+ "type": "Gene_Proteins,Diseases,Chemicals",
134
+ "format": "JSON",
135
+ }
136
+ # Returns structured entity mentions with positions
137
+ ```
138
+
139
+ ---
140
+
141
+ ## Supplementary File Retrieval
142
+
143
+ From reference repo (`bioinformatics_tools.py` lines 123-149):
144
+
145
+ ```python
146
+ def get_figures(pmcid: str) -> dict[str, str]:
147
+ """Download figures and supplementary files."""
148
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
149
+ # Returns a ZIP of figures and supplementary files, base64-encoded
150
+ ```
151
+
152
+ ---
153
+
154
+ ## Preprint-Specific Features
155
+
156
+ ### Identify Preprint Servers
157
+
158
+ ```python
159
+ PREPRINT_SOURCES = {
160
+ "PPR": "General preprints",
161
+ "bioRxiv": "Biology preprints",
162
+ "medRxiv": "Medical preprints",
163
+ "chemRxiv": "Chemistry preprints",
164
+ "Research Square": "Multi-disciplinary",
165
+ "Preprints.org": "MDPI preprints",
166
+ }
167
+
168
+ # Check if published version exists
169
+ async def check_published_version(preprint_doi: str) -> str | None:
170
+ """Check if preprint has been peer-reviewed and published."""
171
+ # Europe PMC links preprints to final versions
172
+ ```
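+
+ One way to fill in `check_published_version` is a title-matching heuristic: look up the preprint record by DOI, then search for the same title outside the preprint servers. This is a sketch of that heuristic only, not the only linkage Europe PMC offers:
+
+ ```python
+ import httpx
+
+ EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
+
+ async def check_published_version(preprint_doi: str) -> str | None:
+     """Heuristic sketch: return the DOI of a non-preprint record with the same title."""
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         # 1. Find the preprint record to recover its title
+         resp = await client.get(
+             EPMC_SEARCH, params={"query": f'DOI:"{preprint_doi}"', "format": "json"}
+         )
+         records = resp.json().get("resultList", {}).get("result", [])
+         if not records:
+             return None
+         title = records[0].get("title", "")
+         # 2. Look for the same title outside SRC:PPR (i.e., a journal version)
+         resp = await client.get(
+             EPMC_SEARCH,
+             params={"query": f'TITLE:"{title}" AND NOT SRC:PPR', "format": "json"},
+         )
+         published = resp.json().get("resultList", {}).get("result", [])
+         return published[0].get("doi") if published else None
+ ```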
173
+
174
+ ---
175
+
176
+ ## Rate Limiting
177
+
178
+ Europe PMC is more generous than NCBI:
179
+
180
+ ```python
181
+ # No documented hard limit, but be respectful
182
+ # Recommend: 10-20 requests/second max
183
+ # Use email in User-Agent for polite pool
184
+ headers = {
185
+ "User-Agent": "DeepCritical/1.0 (mailto:your@email.com)"
186
+ }
187
+ ```
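+
+ A small sketch of what "be respectful" can look like in practice: one shared client carrying that header plus a cap on in-flight requests (the names and the concurrency value are illustrative):
+
+ ```python
+ import asyncio
+
+ import httpx
+
+ # Shared client with a polite User-Agent; the semaphore caps in-flight
+ # requests (concurrency, not a strict rate limit) to keep bursts modest
+ _client = httpx.AsyncClient(
+     headers={"User-Agent": "DeepCritical/1.0 (mailto:your@email.com)"}
+ )
+ _semaphore = asyncio.Semaphore(5)
+
+ async def polite_get(url: str, **kwargs) -> httpx.Response:
+     async with _semaphore:
+         return await _client.get(url, **kwargs)
+ ```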
188
+
189
+ ---
190
+
191
+ ## vs. The Lens & OpenAlex
192
+
193
+ | Feature | Europe PMC | The Lens | OpenAlex |
194
+ |---------|------------|----------|----------|
195
+ | Biomedical Focus | Yes | Partial | Partial |
196
+ | Preprints | Yes (34 servers) | Yes | Yes |
197
+ | Full Text | PMC papers | Links | No |
198
+ | Citations | Yes | Yes | Yes |
199
+ | Annotations | Yes (text-mined) | No | No |
200
+ | Rate Limits | Generous | Moderate | Very generous |
201
+ | API Key | Optional | Required | Optional |
202
+
203
+ ---
204
+
205
+ ## Sources
206
+
207
+ - [Europe PMC REST API](https://europepmc.org/RestfulWebService)
208
+ - [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
209
+ - [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
210
+ - [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
211
+ - [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)
docs/brainstorming/04_OPENALEX_INTEGRATION.md ADDED
@@ -0,0 +1,303 @@
1
+ # OpenAlex Integration: The Missing Piece?
2
+
3
+ **Status**: NOT Implemented (Candidate for Addition)
4
+ **Priority**: HIGH - Could Replace Multiple Tools
5
+ **Reference**: Already implemented in `reference_repos/DeepCritical`
6
+
7
+ ---
8
+
9
+ ## What is OpenAlex?
10
+
11
+ OpenAlex is a **fully open** index of the global research system:
12
+
13
+ - **209M+ works** (papers, books, datasets)
14
+ - **2B+ author records** (disambiguated)
15
+ - **124K+ venues** (journals, repositories)
16
+ - **109K+ institutions**
17
+ - **65K+ concepts** (hierarchical, linked to Wikidata)
18
+
19
+ **Free. Open. No API key required.**
20
+
21
+ ---
22
+
23
+ ## Why OpenAlex for DeepCritical?
24
+
25
+ ### Current Architecture
26
+
27
+ ```
28
+ User Query
29
+
30
+ ┌──────────────────────────────────────┐
31
+ │ PubMed ClinicalTrials Europe PMC │ ← 3 separate APIs
32
+ └──────────────────────────────────────┘
33
+
34
+ Orchestrator (deduplicate, judge, synthesize)
35
+ ```
36
+
37
+ ### With OpenAlex
38
+
39
+ ```
40
+ User Query
41
+
42
+ ┌──────────────────────────────────────┐
43
+ │ OpenAlex │ ← Single API
44
+ │ (includes PubMed + preprints + │
45
+ │ citations + concepts + authors) │
46
+ └──────────────────────────────────────┘
47
+
48
+ Orchestrator (enrich with CT.gov for trials)
49
+ ```
50
+
51
+ **OpenAlex already aggregates**:
52
+ - PubMed/MEDLINE
53
+ - Crossref
54
+ - ORCID
55
+ - Unpaywall (open access links)
56
+ - Microsoft Academic Graph (legacy)
57
+ - Preprint servers
58
+
59
+ ---
60
+
61
+ ## Reference Implementation
62
+
63
+ From `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`:
64
+
65
+ ```python
66
+ class OpenAlexFetchTool(ToolRunner):
67
+ def __init__(self):
68
+ super().__init__(
69
+ ToolSpec(
70
+ name="openalex_fetch",
71
+ description="Fetch OpenAlex work or author",
72
+ inputs={"entity": "TEXT", "identifier": "TEXT"},
73
+ outputs={"result": "JSON"},
74
+ )
75
+ )
76
+
77
+ def run(self, params: dict[str, Any]) -> ExecutionResult:
78
+ entity = params["entity"] # "works", "authors", "venues"
79
+ identifier = params["identifier"]
80
+ base = "https://api.openalex.org"
81
+ url = f"{base}/{entity}/{identifier}"
82
+ resp = requests.get(url, timeout=30)
83
+ return ExecutionResult(success=True, data={"result": resp.json()})
84
+ ```
85
+
86
+ ---
87
+
88
+ ## OpenAlex API Features
89
+
90
+ ### Search Works (Papers)
91
+
92
+ ```python
93
+ # Search for metformin + cancer papers
94
+ url = "https://api.openalex.org/works"
95
+ params = {
96
+ "search": "metformin cancer drug repurposing",
97
+ "filter": "publication_year:>2020,type:article",
98
+ "sort": "cited_by_count:desc",
99
+ "per_page": 50,
100
+ }
101
+ ```
102
+
103
+ ### Rich Filtering
104
+
105
+ ```python
106
+ # Filter examples
107
+ "publication_year:2023"
108
+ "type:article" # vs preprint, book, etc.
109
+ "is_oa:true" # Open access only
110
+ "concepts.id:C71924100" # Papers about "Medicine"
111
+ "authorships.institutions.id:I27837315" # From Harvard
112
+ "cited_by_count:>100" # Highly cited
113
+ "has_fulltext:true" # Full text available
114
+ ```
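+
+ Putting search and filters together, a request might look like the following sketch (plain `httpx`, synchronous for brevity; the `mailto` value is a placeholder):
+
+ ```python
+ import httpx
+
+ def top_repurposing_papers(max_results: int = 25) -> list[str]:
+     """Sketch: highly cited, open-access drug-repurposing articles since 2020."""
+     params = {
+         "search": "metformin cancer drug repurposing",
+         "filter": "publication_year:>2020,type:article,is_oa:true",
+         "sort": "cited_by_count:desc",
+         "per_page": max_results,
+         "mailto": "you@example.com",  # placeholder email for the polite pool
+     }
+     resp = httpx.get("https://api.openalex.org/works", params=params, timeout=30.0)
+     resp.raise_for_status()
+     return [work["title"] for work in resp.json()["results"]]
+ ```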
115
+
116
+ ### What You Get Back
117
+
118
+ ```json
119
+ {
120
+ "id": "W2741809807",
121
+ "title": "Metformin: A candidate drug for...",
122
+ "publication_year": 2023,
123
+ "type": "article",
124
+ "cited_by_count": 45,
125
+ "is_oa": true,
126
+ "primary_location": {
127
+ "source": {"display_name": "Nature Medicine"},
128
+ "pdf_url": "https://...",
129
+ "landing_page_url": "https://..."
130
+ },
131
+ "concepts": [
132
+ {"id": "C71924100", "display_name": "Medicine", "score": 0.95},
133
+ {"id": "C54355233", "display_name": "Pharmacology", "score": 0.88}
134
+ ],
135
+ "authorships": [
136
+ {
137
+ "author": {"id": "A123", "display_name": "John Smith"},
138
+ "institutions": [{"display_name": "Harvard Medical School"}]
139
+ }
140
+ ],
141
+ "referenced_works": ["W123", "W456"], # Citations
142
+ "related_works": ["W789", "W012"] # Similar papers
143
+ }
144
+ ```
145
+
146
+ ---
147
+
148
+ ## Key Advantages Over Current Tools
149
+
150
+ ### 1. Citation Network (We Don't Have This!)
151
+
152
+ ```python
153
+ # Get papers that cite a work
154
+ url = f"https://api.openalex.org/works?filter=cites:{work_id}"
155
+
156
+ # Get papers cited by a work
157
+ # Already in `referenced_works` field
158
+ ```
159
+
160
+ ### 2. Concept Tagging (We Don't Have This!)
161
+
162
+ OpenAlex auto-tags papers with hierarchical concepts:
163
+ - "Medicine" → "Pharmacology" → "Drug Repurposing"
164
+ - Can search by concept, not just keywords
165
+
166
+ ### 3. Author Disambiguation (We Don't Have This!)
167
+
168
+ ```python
169
+ # Find all works by an author
170
+ url = f"https://api.openalex.org/works?filter=authorships.author.id:{author_id}"
171
+ ```
172
+
173
+ ### 4. Institution Tracking
174
+
175
+ ```python
176
+ # Find drug repurposing papers from top institutions
177
+ url = "https://api.openalex.org/works"
178
+ params = {
179
+ "search": "drug repurposing",
180
+ "filter": "authorships.institutions.id:I27837315", # Harvard
181
+ }
182
+ ```
183
+
184
+ ### 5. Related Works
185
+
186
+ Each paper comes with `related_works` - semantically similar papers discovered by OpenAlex's ML.
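+
+ A sketch of how that could be used: fetch one work, then resolve its `related_works` IDs in a single batched query (the example work ID is illustrative, and the batching assumes the `openalex_id` filter with `|` as OR):
+
+ ```python
+ import httpx
+
+ def related_titles(work_id: str = "W2741809807") -> list[str]:
+     """Sketch: resolve a work's related_works into titles."""
+     work = httpx.get(f"https://api.openalex.org/works/{work_id}", timeout=30.0).json()
+     # related_works entries are full URLs like https://openalex.org/W123
+     related_ids = [w.rsplit("/", 1)[-1] for w in work.get("related_works", [])][:10]
+     if not related_ids:
+         return []
+     resp = httpx.get(
+         "https://api.openalex.org/works",
+         params={"filter": f"openalex_id:{'|'.join(related_ids)}", "per_page": 25},
+         timeout=30.0,
+     )
+     return [w["title"] for w in resp.json()["results"]]
+ ```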
187
+
188
+ ---
189
+
190
+ ## Proposed Implementation
191
+
192
+ ### New Tool: `src/tools/openalex.py`
193
+
194
+ ```python
195
+ """OpenAlex search tool for comprehensive scholarly data."""
196
+
197
+ import httpx
198
+ from src.tools.base import SearchTool
199
+ from src.utils.models import Evidence
200
+
201
+ class OpenAlexTool(SearchTool):
202
+ """Search OpenAlex for scholarly works with rich metadata."""
203
+
204
+ name = "openalex"
205
+
206
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
207
+ async with httpx.AsyncClient() as client:
208
+ resp = await client.get(
209
+ "https://api.openalex.org/works",
210
+ params={
211
+ "search": query,
212
+ "filter": "type:article,is_oa:true",
213
+ "sort": "cited_by_count:desc",
214
+ "per_page": max_results,
215
+ "mailto": "deepcritical@example.com", # Polite pool
216
+ },
217
+ )
218
+ data = resp.json()
219
+
220
+ return [
221
+ Evidence(
222
+ source="openalex",
223
+ title=work["title"],
224
+ abstract=work.get("abstract", ""),
225
+ url=work["primary_location"]["landing_page_url"],
226
+ metadata={
227
+ "cited_by_count": work["cited_by_count"],
228
+ "concepts": [c["display_name"] for c in work["concepts"][:5]],
229
+ "is_open_access": work["is_oa"],
230
+ "pdf_url": work["primary_location"].get("pdf_url"),
231
+ },
232
+ )
233
+ for work in data["results"]
234
+ ]
235
+ ```
236
+
237
+ ---
238
+
239
+ ## Rate Limits
240
+
241
+ OpenAlex is **extremely generous**:
242
+
243
+ - No hard rate limit documented
244
+ - Recommended: <100,000 requests/day
245
+ - **Polite pool**: Add `mailto=your@email.com` param for faster responses
246
+ - No API key required (optional for priority support)
247
+
248
+ ---
249
+
250
+ ## Should We Add OpenAlex?
251
+
252
+ ### Arguments FOR
253
+
254
+ 1. **Already in reference repo** - proven pattern
255
+ 2. **Richer data** - citations, concepts, authors
256
+ 3. **Single source** - reduces API complexity
257
+ 4. **Free & open** - no keys, no limits
258
+ 5. **Institution adoption** - Leiden, Sorbonne switched to it
259
+
260
+ ### Arguments AGAINST
261
+
262
+ 1. **Adds complexity** - another data source
263
+ 2. **Overlap** - duplicates some PubMed data
264
+ 3. **Not biomedical-focused** - covers all disciplines
265
+ 4. **No full text** - still need PMC/Europe PMC for that
266
+
267
+ ### Recommendation
268
+
269
+ **Add OpenAlex as a 4th source**, don't replace existing tools.
270
+
271
+ Use it for:
272
+ - Citation network analysis
273
+ - Concept-based discovery
274
+ - High-impact paper finding
275
+ - Author/institution tracking
276
+
277
+ Keep PubMed, ClinicalTrials, Europe PMC for:
278
+ - Authoritative biomedical search
279
+ - Clinical trial data
280
+ - Full-text access
281
+ - Preprint tracking
282
+
283
+ ---
284
+
285
+ ## Implementation Priority
286
+
287
+ | Task | Effort | Value |
288
+ |------|--------|-------|
289
+ | Basic search | Low | High |
290
+ | Citation network | Medium | Very High |
291
+ | Concept filtering | Low | High |
292
+ | Related works | Low | High |
293
+ | Author tracking | Medium | Medium |
294
+
295
+ ---
296
+
297
+ ## Sources
298
+
299
+ - [OpenAlex Documentation](https://docs.openalex.org)
300
+ - [OpenAlex API Overview](https://docs.openalex.org/api)
301
+ - [OpenAlex Wikipedia](https://en.wikipedia.org/wiki/OpenAlex)
302
+ - [Leiden University Announcement](https://www.leidenranking.com/information/openalex)
303
+ - [OpenAlex: A fully-open index (Paper)](https://arxiv.org/abs/2205.01833)
docs/brainstorming/implementation/15_PHASE_OPENALEX.md ADDED
@@ -0,0 +1,603 @@
1
+ # Phase 15: OpenAlex Integration
2
+
3
+ **Priority**: HIGH - Biggest bang for buck
4
+ **Effort**: ~2-3 hours
5
+ **Dependencies**: None (existing codebase patterns sufficient)
6
+
7
+ ---
8
+
9
+ ## Prerequisites (COMPLETED)
10
+
11
+ The following model changes have been implemented to support this integration:
12
+
13
+ 1. **`SourceName` Literal Updated** (`src/utils/models.py:9`)
14
+ ```python
15
+ SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
16
+ ```
17
+ - Without this, `source="openalex"` would fail Pydantic validation
18
+
19
+ 2. **`Evidence.metadata` Field Added** (`src/utils/models.py:39-42`)
20
+ ```python
21
+ metadata: dict[str, Any] = Field(
22
+ default_factory=dict,
23
+ description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
24
+ )
25
+ ```
26
+ - Required for storing `cited_by_count`, `concepts`, etc.
27
+ - Model is still frozen - metadata must be passed at construction time
28
+
29
+ 3. **`__init__.py` Exports Updated** (`src/tools/__init__.py`)
30
+ - All tools are now exported: `ClinicalTrialsTool`, `EuropePMCTool`, `PubMedTool`
31
+ - OpenAlexTool should be added here after implementation
32
+
33
+ ---
34
+
35
+ ## Overview
36
+
37
+ Add OpenAlex as a 4th data source for comprehensive scholarly data including:
38
+ - Citation networks (who cites whom)
39
+ - Concept tagging (hierarchical topic classification)
40
+ - Author disambiguation
41
+ - 209M+ works indexed
42
+
43
+ **Why OpenAlex?**
44
+ - Free, no API key required
45
+ - Already implemented in reference repo
46
+ - Provides citation data we don't have
47
+ - Aggregates PubMed + preprints + more
48
+
49
+ ---
50
+
51
+ ## TDD Implementation Plan
52
+
53
+ ### Step 1: Write the Tests First
54
+
55
+ **File**: `tests/unit/tools/test_openalex.py`
56
+
57
+ ```python
58
+ """Tests for OpenAlex search tool."""
59
+
60
+ import pytest
61
+ import respx
62
+ from httpx import Response
63
+
64
+ from src.tools.openalex import OpenAlexTool
65
+ from src.utils.models import Evidence
66
+
67
+
68
+ class TestOpenAlexTool:
69
+ """Test suite for OpenAlex search functionality."""
70
+
71
+ @pytest.fixture
72
+ def tool(self) -> OpenAlexTool:
73
+ return OpenAlexTool()
74
+
75
+ def test_name_property(self, tool: OpenAlexTool) -> None:
76
+ """Tool should identify itself as 'openalex'."""
77
+ assert tool.name == "openalex"
78
+
79
+ @respx.mock
80
+ @pytest.mark.asyncio
81
+ async def test_search_returns_evidence(self, tool: OpenAlexTool) -> None:
82
+ """Search should return list of Evidence objects."""
83
+ mock_response = {
84
+ "results": [
85
+ {
86
+ "id": "W2741809807",
87
+ "title": "Metformin and cancer: A systematic review",
88
+ "publication_year": 2023,
89
+ "cited_by_count": 45,
90
+ "type": "article",
91
+ "is_oa": True,
92
+ "primary_location": {
93
+ "source": {"display_name": "Nature Medicine"},
94
+ "landing_page_url": "https://doi.org/10.1038/example",
95
+ "pdf_url": None,
96
+ },
97
+ "abstract_inverted_index": {
98
+ "Metformin": [0],
99
+ "shows": [1],
100
+ "anticancer": [2],
101
+ "effects": [3],
102
+ },
103
+ "concepts": [
104
+ {"display_name": "Medicine", "score": 0.95},
105
+ {"display_name": "Oncology", "score": 0.88},
106
+ ],
107
+ "authorships": [
108
+ {
109
+ "author": {"display_name": "John Smith"},
110
+ "institutions": [{"display_name": "Harvard"}],
111
+ }
112
+ ],
113
+ }
114
+ ]
115
+ }
116
+
117
+ respx.get("https://api.openalex.org/works").mock(
118
+ return_value=Response(200, json=mock_response)
119
+ )
120
+
121
+ results = await tool.search("metformin cancer", max_results=10)
122
+
123
+ assert len(results) == 1
124
+ assert isinstance(results[0], Evidence)
125
+ assert "Metformin and cancer" in results[0].citation.title
126
+ assert results[0].citation.source == "openalex"
127
+
128
+ @respx.mock
129
+ @pytest.mark.asyncio
130
+ async def test_search_empty_results(self, tool: OpenAlexTool) -> None:
131
+ """Search with no results should return empty list."""
132
+ respx.get("https://api.openalex.org/works").mock(
133
+ return_value=Response(200, json={"results": []})
134
+ )
135
+
136
+ results = await tool.search("xyznonexistentquery123")
137
+ assert results == []
138
+
139
+ @respx.mock
140
+ @pytest.mark.asyncio
141
+ async def test_search_handles_missing_abstract(self, tool: OpenAlexTool) -> None:
142
+ """Tool should handle papers without abstracts."""
143
+ mock_response = {
144
+ "results": [
145
+ {
146
+ "id": "W123",
147
+ "title": "Paper without abstract",
148
+ "publication_year": 2023,
149
+ "cited_by_count": 10,
150
+ "type": "article",
151
+ "is_oa": False,
152
+ "primary_location": {
153
+ "source": {"display_name": "Journal"},
154
+ "landing_page_url": "https://example.com",
155
+ },
156
+ "abstract_inverted_index": None,
157
+ "concepts": [],
158
+ "authorships": [],
159
+ }
160
+ ]
161
+ }
162
+
163
+ respx.get("https://api.openalex.org/works").mock(
164
+ return_value=Response(200, json=mock_response)
165
+ )
166
+
167
+ results = await tool.search("test query")
168
+ assert len(results) == 1
169
+ assert results[0].content == "" # No abstract
170
+
171
+ @respx.mock
172
+ @pytest.mark.asyncio
173
+ async def test_search_extracts_citation_count(self, tool: OpenAlexTool) -> None:
174
+ """Citation count should be in metadata."""
175
+ mock_response = {
176
+ "results": [
177
+ {
178
+ "id": "W456",
179
+ "title": "Highly cited paper",
180
+ "publication_year": 2020,
181
+ "cited_by_count": 500,
182
+ "type": "article",
183
+ "is_oa": True,
184
+ "primary_location": {
185
+ "source": {"display_name": "Science"},
186
+ "landing_page_url": "https://example.com",
187
+ },
188
+ "abstract_inverted_index": {"Test": [0]},
189
+ "concepts": [],
190
+ "authorships": [],
191
+ }
192
+ ]
193
+ }
194
+
195
+ respx.get("https://api.openalex.org/works").mock(
196
+ return_value=Response(200, json=mock_response)
197
+ )
198
+
199
+ results = await tool.search("highly cited")
200
+ assert results[0].metadata["cited_by_count"] == 500
201
+
202
+ @respx.mock
203
+ @pytest.mark.asyncio
204
+ async def test_search_extracts_concepts(self, tool: OpenAlexTool) -> None:
205
+ """Concepts should be extracted for semantic discovery."""
206
+ mock_response = {
207
+ "results": [
208
+ {
209
+ "id": "W789",
210
+ "title": "Drug repurposing study",
211
+ "publication_year": 2023,
212
+ "cited_by_count": 25,
213
+ "type": "article",
214
+ "is_oa": True,
215
+ "primary_location": {
216
+ "source": {"display_name": "PLOS ONE"},
217
+ "landing_page_url": "https://example.com",
218
+ },
219
+ "abstract_inverted_index": {"Drug": [0], "repurposing": [1]},
220
+ "concepts": [
221
+ {"display_name": "Pharmacology", "score": 0.92},
222
+ {"display_name": "Drug Discovery", "score": 0.85},
223
+ {"display_name": "Medicine", "score": 0.80},
224
+ ],
225
+ "authorships": [],
226
+ }
227
+ ]
228
+ }
229
+
230
+ respx.get("https://api.openalex.org/works").mock(
231
+ return_value=Response(200, json=mock_response)
232
+ )
233
+
234
+ results = await tool.search("drug repurposing")
235
+ assert "Pharmacology" in results[0].metadata["concepts"]
236
+ assert "Drug Discovery" in results[0].metadata["concepts"]
237
+
238
+ @respx.mock
239
+ @pytest.mark.asyncio
240
+ async def test_search_api_error_raises_search_error(
241
+ self, tool: OpenAlexTool
242
+ ) -> None:
243
+ """API errors should raise SearchError."""
244
+ from src.utils.exceptions import SearchError
245
+
246
+ respx.get("https://api.openalex.org/works").mock(
247
+ return_value=Response(500, text="Internal Server Error")
248
+ )
249
+
250
+ with pytest.raises(SearchError):
251
+ await tool.search("test query")
252
+
253
+ def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None:
254
+ """Test abstract reconstruction from inverted index."""
255
+ inverted_index = {
256
+ "Metformin": [0, 5],
257
+ "is": [1],
258
+ "a": [2],
259
+ "diabetes": [3],
260
+ "drug": [4],
261
+ "effective": [6],
262
+ }
263
+ abstract = tool._reconstruct_abstract(inverted_index)
264
+ assert abstract == "Metformin is a diabetes drug Metformin effective"
265
+ ```
266
+
267
+ ---
268
+
269
+ ### Step 2: Create the Implementation
270
+
271
+ **File**: `src/tools/openalex.py`
272
+
273
+ ```python
274
+ """OpenAlex search tool for comprehensive scholarly data."""
275
+
276
+ from typing import Any
277
+
278
+ import httpx
279
+ from tenacity import retry, stop_after_attempt, wait_exponential
280
+
281
+ from src.utils.exceptions import SearchError
282
+ from src.utils.models import Citation, Evidence
283
+
284
+
285
+ class OpenAlexTool:
286
+ """
287
+ Search OpenAlex for scholarly works with rich metadata.
288
+
289
+ OpenAlex provides:
290
+ - 209M+ scholarly works
291
+ - Citation counts and networks
292
+ - Concept tagging (hierarchical)
293
+ - Author disambiguation
294
+ - Open access links
295
+
296
+ API Docs: https://docs.openalex.org/
297
+ """
298
+
299
+ BASE_URL = "https://api.openalex.org/works"
300
+
301
+ def __init__(self, email: str | None = None) -> None:
302
+ """
303
+ Initialize OpenAlex tool.
304
+
305
+ Args:
306
+ email: Optional email for polite pool (faster responses)
307
+ """
308
+ self.email = email or "deepcritical@example.com"
309
+
310
+ @property
311
+ def name(self) -> str:
312
+ return "openalex"
313
+
314
+ @retry(
315
+ stop=stop_after_attempt(3),
316
+ wait=wait_exponential(multiplier=1, min=1, max=10),
317
+ reraise=True,
318
+ )
319
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
320
+ """
321
+ Search OpenAlex for scholarly works.
322
+
323
+ Args:
324
+ query: Search terms
325
+ max_results: Maximum results to return (max 200 per request)
326
+
327
+ Returns:
328
+ List of Evidence objects with citation metadata
329
+
330
+ Raises:
331
+ SearchError: If API request fails
332
+ """
333
+ params = {
334
+ "search": query,
335
+ "filter": "type:article", # Only peer-reviewed articles
336
+ "sort": "cited_by_count:desc", # Most cited first
337
+ "per_page": min(max_results, 200),
338
+ "mailto": self.email, # Polite pool for faster responses
339
+ }
340
+
341
+ async with httpx.AsyncClient(timeout=30.0) as client:
342
+ try:
343
+ response = await client.get(self.BASE_URL, params=params)
344
+ response.raise_for_status()
345
+
346
+ data = response.json()
347
+ results = data.get("results", [])
348
+
349
+ return [self._to_evidence(work) for work in results[:max_results]]
350
+
351
+ except httpx.HTTPStatusError as e:
352
+ raise SearchError(f"OpenAlex API error: {e}") from e
353
+ except httpx.RequestError as e:
354
+ raise SearchError(f"OpenAlex connection failed: {e}") from e
355
+
356
+ def _to_evidence(self, work: dict[str, Any]) -> Evidence:
357
+ """Convert OpenAlex work to Evidence object."""
358
+ title = work.get("title", "Untitled")
359
+ pub_year = work.get("publication_year", "Unknown")
360
+ cited_by = work.get("cited_by_count", 0)
361
+ is_oa = work.get("is_oa", False)
362
+
363
+ # Reconstruct abstract from inverted index
364
+ abstract_index = work.get("abstract_inverted_index")
365
+ abstract = self._reconstruct_abstract(abstract_index) if abstract_index else ""
366
+
367
+ # Extract concepts (top 5)
368
+ concepts = [
369
+ c.get("display_name", "")
370
+ for c in work.get("concepts", [])[:5]
371
+ if c.get("display_name")
372
+ ]
373
+
374
+ # Extract authors (top 5)
375
+ authorships = work.get("authorships", [])
376
+ authors = [
377
+ a.get("author", {}).get("display_name", "")
378
+ for a in authorships[:5]
379
+ if a.get("author", {}).get("display_name")
380
+ ]
381
+
382
+ # Get URL
383
+ primary_loc = work.get("primary_location") or {}
384
+ url = primary_loc.get("landing_page_url", "")
385
+ if not url:
386
+ # Fallback to OpenAlex page
387
+ work_id = work.get("id", "").replace("https://openalex.org/", "")
388
+ url = f"https://openalex.org/{work_id}"
389
+
390
+ return Evidence(
391
+ content=abstract[:2000],
392
+ citation=Citation(
393
+ source="openalex",
394
+ title=title[:500],
395
+ url=url,
396
+ date=str(pub_year),
397
+ authors=authors,
398
+ ),
399
+ relevance=min(0.9, 0.5 + (cited_by / 1000)), # Boost by citations
400
+ metadata={
401
+ "cited_by_count": cited_by,
402
+ "is_open_access": is_oa,
403
+ "concepts": concepts,
404
+ "pdf_url": primary_loc.get("pdf_url"),
405
+ },
406
+ )
407
+
408
+ def _reconstruct_abstract(
409
+ self, inverted_index: dict[str, list[int]]
410
+ ) -> str:
411
+ """
412
+ Reconstruct abstract from OpenAlex inverted index format.
413
+
414
+ OpenAlex stores abstracts as {"word": [position1, position2, ...]}.
415
+ This rebuilds the original text.
416
+ """
417
+ if not inverted_index:
418
+ return ""
419
+
420
+ # Build position -> word mapping
421
+ position_word: dict[int, str] = {}
422
+ for word, positions in inverted_index.items():
423
+ for pos in positions:
424
+ position_word[pos] = word
425
+
426
+ # Reconstruct in order
427
+ if not position_word:
428
+ return ""
429
+
430
+ max_pos = max(position_word.keys())
431
+ words = [position_word.get(i, "") for i in range(max_pos + 1)]
432
+ return " ".join(w for w in words if w)
433
+ ```
434
+
435
+ ---
436
+
437
+ ### Step 3: Register in Search Handler
438
+
439
+ **File**: `src/tools/search_handler.py` (add to imports and tool list)
440
+
441
+ ```python
442
+ # Add import
443
+ from src.tools.openalex import OpenAlexTool
444
+
445
+ # Add to _create_tools method
446
+ def _create_tools(self) -> list[SearchTool]:
447
+ return [
448
+ PubMedTool(),
449
+ ClinicalTrialsTool(),
450
+ EuropePMCTool(),
451
+ OpenAlexTool(), # NEW
452
+ ]
453
+ ```
454
+
455
+ ---
456
+
457
+ ### Step 4: Update `__init__.py`
458
+
459
+ **File**: `src/tools/__init__.py`
460
+
461
+ ```python
462
+ from src.tools.openalex import OpenAlexTool
463
+
464
+ __all__ = [
465
+ "PubMedTool",
466
+ "ClinicalTrialsTool",
467
+ "EuropePMCTool",
468
+ "OpenAlexTool", # NEW
469
+ # ...
470
+ ]
471
+ ```
472
+
473
+ ---
474
+
475
+ ## Demo Script
476
+
477
+ **File**: `examples/openalex_demo.py`
478
+
479
+ ```python
480
+ #!/usr/bin/env python3
481
+ """Demo script to verify OpenAlex integration."""
482
+
483
+ import asyncio
484
+ from src.tools.openalex import OpenAlexTool
485
+
486
+
487
+ async def main():
488
+ """Run OpenAlex search demo."""
489
+ tool = OpenAlexTool()
490
+
491
+ print("=" * 60)
492
+ print("OpenAlex Integration Demo")
493
+ print("=" * 60)
494
+
495
+ # Test 1: Basic drug repurposing search
496
+ print("\n[Test 1] Searching for 'metformin cancer drug repurposing'...")
497
+ results = await tool.search("metformin cancer drug repurposing", max_results=5)
498
+
499
+ for i, evidence in enumerate(results, 1):
500
+ print(f"\n--- Result {i} ---")
501
+ print(f"Title: {evidence.citation.title}")
502
+ print(f"Year: {evidence.citation.date}")
503
+ print(f"Citations: {evidence.metadata.get('cited_by_count', 'N/A')}")
504
+ print(f"Concepts: {', '.join(evidence.metadata.get('concepts', []))}")
505
+ print(f"Open Access: {evidence.metadata.get('is_open_access', False)}")
506
+ print(f"URL: {evidence.citation.url}")
507
+ if evidence.content:
508
+ print(f"Abstract: {evidence.content[:200]}...")
509
+
510
+ # Test 2: High-impact papers
511
+ print("\n" + "=" * 60)
512
+ print("[Test 2] Finding highly-cited papers on 'long COVID treatment'...")
513
+ results = await tool.search("long COVID treatment", max_results=3)
514
+
515
+ for evidence in results:
516
+ print(f"\n- {evidence.citation.title}")
517
+ print(f" Citations: {evidence.metadata.get('cited_by_count', 0)}")
518
+
519
+ print("\n" + "=" * 60)
520
+ print("Demo complete!")
521
+
522
+
523
+ if __name__ == "__main__":
524
+ asyncio.run(main())
525
+ ```
526
+
527
+ ---
528
+
529
+ ## Verification Checklist
530
+
531
+ ### Unit Tests
532
+ ```bash
533
+ # Run just OpenAlex tests
534
+ uv run pytest tests/unit/tools/test_openalex.py -v
535
+
536
+ # Expected: All tests pass
537
+ ```
538
+
539
+ ### Integration Test (Manual)
540
+ ```bash
541
+ # Run demo script with real API
542
+ uv run python examples/openalex_demo.py
543
+
544
+ # Expected: Real results from OpenAlex API
545
+ ```
546
+
547
+ ### Full Test Suite
548
+ ```bash
549
+ # Ensure nothing broke
550
+ make check
551
+
552
+ # Expected: All 110+ tests pass, mypy clean
553
+ ```
554
+
555
+ ---
556
+
557
+ ## Success Criteria
558
+
559
+ 1. **Unit tests pass**: All mocked tests in `test_openalex.py` pass
560
+ 2. **Integration works**: Demo script returns real results
561
+ 3. **No regressions**: `make check` passes completely
562
+ 4. **SearchHandler integration**: OpenAlex appears in search results alongside other sources
563
+ 5. **Citation metadata**: Results include `cited_by_count`, `concepts`, `is_open_access`
564
+
565
+ ---
566
+
567
+ ## Future Enhancements (P2)
568
+
569
+ Once basic integration works:
570
+
571
+ 1. **Citation Network Queries**
572
+ ```python
573
+ # Get papers citing a specific work
574
+ async def get_citing_works(self, work_id: str) -> list[Evidence]:
575
+ params = {"filter": f"cites:{work_id}"}
576
+ ...
577
+ ```
578
+
579
+ 2. **Concept-Based Search**
580
+ ```python
581
+ # Search by OpenAlex concept ID
582
+ async def search_by_concept(self, concept_id: str) -> list[Evidence]:
583
+ params = {"filter": f"concepts.id:{concept_id}"}
584
+ ...
585
+ ```
586
+
587
+ 3. **Author Tracking**
588
+ ```python
589
+ # Find all works by an author
590
+ async def search_by_author(self, author_id: str) -> list[Evidence]:
591
+ params = {"filter": f"authorships.author.id:{author_id}"}
592
+ ...
593
+ ```
594
+
595
+ ---
596
+
597
+ ## Notes
598
+
599
+ - OpenAlex is **very generous** with rate limits (no documented hard limit)
600
+ - Adding `mailto` parameter gives priority access (polite pool)
601
+ - Abstract is stored as inverted index - must reconstruct
602
+ - Citation count is a good proxy for paper quality/impact
603
+ - Consider caching responses for repeated queries
docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md ADDED
@@ -0,0 +1,586 @@
1
+ # Phase 16: PubMed Full-Text Retrieval
2
+
3
+ **Priority**: MEDIUM - Enhances evidence quality
4
+ **Effort**: ~3 hours
5
+ **Dependencies**: None (existing PubMed tool sufficient)
6
+
7
+ ---
8
+
9
+ ## Prerequisites (COMPLETED)
10
+
11
+ The `Evidence.metadata` field has been added to `src/utils/models.py` to support:
12
+ ```python
13
+ metadata={"has_fulltext": True}
14
+ ```
15
+
16
+ ---
17
+
18
+ ## Architecture Decision: Constructor Parameter vs Method Parameter
19
+
20
+ **IMPORTANT**: The original spec proposed `include_fulltext` as a method parameter:
21
+ ```python
22
+ # WRONG - SearchHandler won't pass this parameter
23
+ async def search(self, query: str, max_results: int = 10, include_fulltext: bool = False):
24
+ ```
25
+
26
+ **Problem**: `SearchHandler` calls `tool.search(query, max_results)` uniformly across all tools.
27
+ It has no mechanism to pass tool-specific parameters like `include_fulltext`.
28
+
29
+ **Solution**: Use constructor parameter instead:
30
+ ```python
31
+ # CORRECT - Configured at instantiation time
32
+ class PubMedTool:
33
+ def __init__(self, api_key: str | None = None, include_fulltext: bool = False):
34
+ self.include_fulltext = include_fulltext
35
+ ...
36
+ ```
37
+
38
+ This way, you can create a full-text-enabled PubMed tool:
39
+ ```python
40
+ # In orchestrator or wherever tools are created
41
+ tools = [
42
+ PubMedTool(include_fulltext=True), # Full-text enabled
43
+ ClinicalTrialsTool(),
44
+ EuropePMCTool(),
45
+ ]
46
+ ```
47
+
48
+ ---
49
+
50
+ ## Overview
51
+
52
+ Add full-text retrieval for PubMed papers via the BioC API, enabling:
53
+ - Complete paper text for open-access PMC papers
54
+ - Structured sections (intro, methods, results, discussion)
55
+ - Better evidence for LLM synthesis
56
+
57
+ **Why Full-Text?**
58
+ - Abstracts only give ~200-300 words
59
+ - Full text provides detailed methods, results, figures
60
+ - Reference repo already has this implemented
61
+ - Makes LLM judgments more accurate
62
+
63
+ ---
64
+
65
+ ## TDD Implementation Plan
66
+
67
+ ### Step 1: Write the Tests First
68
+
69
+ **File**: `tests/unit/tools/test_pubmed_fulltext.py`
70
+
71
+ ```python
72
+ """Tests for PubMed full-text retrieval."""
73
+
74
+ import pytest
75
+ import respx
76
+ from httpx import Response
77
+
78
+ from src.tools.pubmed import PubMedTool
79
+
80
+
81
+ class TestPubMedFullText:
82
+ """Test suite for PubMed full-text functionality."""
83
+
84
+ @pytest.fixture
85
+ def tool(self) -> PubMedTool:
86
+ return PubMedTool()
87
+
88
+ @respx.mock
89
+ @pytest.mark.asyncio
90
+ async def test_get_pmc_id_success(self, tool: PubMedTool) -> None:
91
+ """Should convert PMID to PMCID for full-text access."""
92
+ mock_response = {
93
+ "records": [
94
+ {
95
+ "pmid": "12345678",
96
+ "pmcid": "PMC1234567",
97
+ }
98
+ ]
99
+ }
100
+
101
+ respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
102
+ return_value=Response(200, json=mock_response)
103
+ )
104
+
105
+ pmcid = await tool.get_pmc_id("12345678")
106
+ assert pmcid == "PMC1234567"
107
+
108
+ @respx.mock
109
+ @pytest.mark.asyncio
110
+ async def test_get_pmc_id_not_in_pmc(self, tool: PubMedTool) -> None:
111
+ """Should return None if paper not in PMC."""
112
+ mock_response = {
113
+ "records": [
114
+ {
115
+ "pmid": "12345678",
116
+ # No pmcid means not in PMC
117
+ }
118
+ ]
119
+ }
120
+
121
+ respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
122
+ return_value=Response(200, json=mock_response)
123
+ )
124
+
125
+ pmcid = await tool.get_pmc_id("12345678")
126
+ assert pmcid is None
127
+
128
+ @respx.mock
129
+ @pytest.mark.asyncio
130
+ async def test_get_fulltext_success(self, tool: PubMedTool) -> None:
131
+ """Should retrieve full text for PMC papers."""
132
+ # Mock BioC API response
133
+ mock_bioc = {
134
+ "documents": [
135
+ {
136
+ "passages": [
137
+ {
138
+ "infons": {"section_type": "INTRO"},
139
+ "text": "Introduction text here.",
140
+ },
141
+ {
142
+ "infons": {"section_type": "METHODS"},
143
+ "text": "Methods description here.",
144
+ },
145
+ {
146
+ "infons": {"section_type": "RESULTS"},
147
+ "text": "Results summary here.",
148
+ },
149
+ {
150
+ "infons": {"section_type": "DISCUSS"},
151
+ "text": "Discussion and conclusions.",
152
+ },
153
+ ]
154
+ }
155
+ ]
156
+ }
157
+
158
+ respx.get(
159
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
160
+ ).mock(return_value=Response(200, json=mock_bioc))
161
+
162
+ fulltext = await tool.get_fulltext("12345678")
163
+
164
+ assert fulltext is not None
165
+ assert "Introduction text here" in fulltext
166
+ assert "Methods description here" in fulltext
167
+ assert "Results summary here" in fulltext
168
+
169
+ @respx.mock
170
+ @pytest.mark.asyncio
171
+ async def test_get_fulltext_not_available(self, tool: PubMedTool) -> None:
172
+ """Should return None if full text not available."""
173
+ respx.get(
174
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/99999999/unicode"
175
+ ).mock(return_value=Response(404))
176
+
177
+ fulltext = await tool.get_fulltext("99999999")
178
+ assert fulltext is None
179
+
180
+ @respx.mock
181
+ @pytest.mark.asyncio
182
+ async def test_get_fulltext_structured(self, tool: PubMedTool) -> None:
183
+ """Should return structured sections dict."""
184
+ mock_bioc = {
185
+ "documents": [
186
+ {
187
+ "passages": [
188
+ {"infons": {"section_type": "INTRO"}, "text": "Intro..."},
189
+ {"infons": {"section_type": "METHODS"}, "text": "Methods..."},
190
+ {"infons": {"section_type": "RESULTS"}, "text": "Results..."},
191
+ {"infons": {"section_type": "DISCUSS"}, "text": "Discussion..."},
192
+ ]
193
+ }
194
+ ]
195
+ }
196
+
197
+ respx.get(
198
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
199
+ ).mock(return_value=Response(200, json=mock_bioc))
200
+
201
+ sections = await tool.get_fulltext_structured("12345678")
202
+
203
+ assert sections is not None
204
+ assert "introduction" in sections
205
+ assert "methods" in sections
206
+ assert "results" in sections
207
+ assert "discussion" in sections
208
+
209
+ @respx.mock
210
+ @pytest.mark.asyncio
211
+ async def test_search_with_fulltext_enabled(self) -> None:
212
+ """Search should include full text when tool is configured for it."""
213
+ # Create tool WITH full-text enabled via constructor
214
+ tool = PubMedTool(include_fulltext=True)
215
+
216
+ # Mock esearch
217
+ respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi").mock(
218
+ return_value=Response(
219
+ 200, json={"esearchresult": {"idlist": ["12345678"]}}
220
+ )
221
+ )
222
+
223
+ # Mock efetch (abstract)
224
+ mock_xml = """
225
+ <PubmedArticleSet>
226
+ <PubmedArticle>
227
+ <MedlineCitation>
228
+ <PMID>12345678</PMID>
229
+ <Article>
230
+ <ArticleTitle>Test Paper</ArticleTitle>
231
+ <Abstract><AbstractText>Short abstract.</AbstractText></Abstract>
232
+ <AuthorList><Author><LastName>Smith</LastName></Author></AuthorList>
233
+ </Article>
234
+ </MedlineCitation>
235
+ </PubmedArticle>
236
+ </PubmedArticleSet>
237
+ """
238
+ respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi").mock(
239
+ return_value=Response(200, text=mock_xml)
240
+ )
241
+
242
+ # Mock ID converter
243
+ respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
244
+ return_value=Response(
245
+ 200, json={"records": [{"pmid": "12345678", "pmcid": "PMC1234567"}]}
246
+ )
247
+ )
248
+
249
+ # Mock BioC full text
250
+ mock_bioc = {
251
+ "documents": [
252
+ {
253
+ "passages": [
254
+ {"infons": {"section_type": "INTRO"}, "text": "Full intro..."},
255
+ ]
256
+ }
257
+ ]
258
+ }
259
+ respx.get(
260
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
261
+ ).mock(return_value=Response(200, json=mock_bioc))
262
+
263
+ # NOTE: No include_fulltext param - it's set via constructor
264
+ results = await tool.search("test", max_results=1)
265
+
266
+ assert len(results) == 1
267
+ # Full text should be appended or replace abstract
268
+ assert "Full intro" in results[0].content or "Short abstract" in results[0].content
269
+ ```
270
+
271
+ ---
272
+
273
+ ### Step 2: Implement Full-Text Methods
274
+
275
+ **File**: `src/tools/pubmed.py` (additions to existing class)
276
+
277
+ ```python
278
+ # Add these methods to PubMedTool class
279
+
280
+ async def get_pmc_id(self, pmid: str) -> str | None:
281
+ """
282
+ Convert PMID to PMCID for full-text access.
283
+
284
+ Args:
285
+ pmid: PubMed ID
286
+
287
+ Returns:
288
+ PMCID if paper is in PMC, None otherwise
289
+ """
290
+ url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
291
+ params = {"ids": pmid, "format": "json"}
292
+
293
+ async with httpx.AsyncClient(timeout=30.0) as client:
294
+ try:
295
+ response = await client.get(url, params=params)
296
+ response.raise_for_status()
297
+ data = response.json()
298
+
299
+ records = data.get("records", [])
300
+ if records and records[0].get("pmcid"):
301
+ return records[0]["pmcid"]
302
+ return None
303
+
304
+ except httpx.HTTPError:
305
+ return None
306
+
307
+
308
+ async def get_fulltext(self, pmid: str) -> str | None:
309
+ """
310
+ Get full text for a PubMed paper via BioC API.
311
+
312
+ Only works for open-access papers in PubMed Central.
313
+
314
+ Args:
315
+ pmid: PubMed ID
316
+
317
+ Returns:
318
+ Full text as string, or None if not available
319
+ """
320
+ url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
321
+
322
+ async with httpx.AsyncClient(timeout=60.0) as client:
323
+ try:
324
+ response = await client.get(url)
325
+ if response.status_code == 404:
326
+ return None
327
+ response.raise_for_status()
328
+ data = response.json()
329
+
330
+ # Extract text from all passages
331
+ documents = data.get("documents", [])
332
+ if not documents:
333
+ return None
334
+
335
+ passages = documents[0].get("passages", [])
336
+ text_parts = [p.get("text", "") for p in passages if p.get("text")]
337
+
338
+ return "\n\n".join(text_parts) if text_parts else None
339
+
340
+ except httpx.HTTPError:
341
+ return None
342
+
343
+
344
+ async def get_fulltext_structured(self, pmid: str) -> dict[str, str] | None:
345
+ """
346
+ Get structured full text with sections.
347
+
348
+ Args:
349
+ pmid: PubMed ID
350
+
351
+ Returns:
352
+ Dict mapping section names to text, or None if not available
353
+ """
354
+ url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
355
+
356
+ async with httpx.AsyncClient(timeout=60.0) as client:
357
+ try:
358
+ response = await client.get(url)
359
+ if response.status_code == 404:
360
+ return None
361
+ response.raise_for_status()
362
+ data = response.json()
363
+
364
+ documents = data.get("documents", [])
365
+ if not documents:
366
+ return None
367
+
368
+ # Map section types to readable names
369
+ section_map = {
370
+ "INTRO": "introduction",
371
+ "METHODS": "methods",
372
+ "RESULTS": "results",
373
+ "DISCUSS": "discussion",
374
+ "CONCL": "conclusion",
375
+ "ABSTRACT": "abstract",
376
+ }
377
+
378
+ sections: dict[str, list[str]] = {}
379
+ for passage in documents[0].get("passages", []):
380
+ section_type = passage.get("infons", {}).get("section_type", "other")
381
+ section_name = section_map.get(section_type, "other")
382
+ text = passage.get("text", "")
383
+
384
+ if text:
385
+ if section_name not in sections:
386
+ sections[section_name] = []
387
+ sections[section_name].append(text)
388
+
389
+ # Join multiple passages per section
390
+ return {k: "\n\n".join(v) for k, v in sections.items()}
391
+
392
+ except httpx.HTTPError:
393
+ return None
394
+ ```
395
+
396
+ ---
397
+
398
+ ### Step 3: Update Constructor and Search Method
399
+
400
+ Add full-text flag to constructor and update search to use it:
401
+
402
+ ```python
403
+ class PubMedTool:
404
+ """Search tool for PubMed/NCBI."""
405
+
406
+ def __init__(
407
+ self,
408
+ api_key: str | None = None,
409
+ include_fulltext: bool = False, # NEW CONSTRUCTOR PARAM
410
+ ) -> None:
411
+ self.api_key = api_key or settings.ncbi_api_key
412
+ if self.api_key == "your-ncbi-key-here":
413
+ self.api_key = None
414
+ self._last_request_time = 0.0
415
+ self.include_fulltext = include_fulltext # Store for use in search()
416
+
417
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
418
+ """
419
+ Search PubMed and return evidence.
420
+
421
+ Note: Full-text enrichment is controlled by constructor parameter,
422
+ not method parameter, because SearchHandler doesn't pass extra args.
423
+ """
424
+ # ... existing search logic ...
425
+
426
+ evidence_list = self._parse_pubmed_xml(fetch_resp.text)
427
+
428
+ # Optionally enrich with full text (if configured at construction)
429
+ if self.include_fulltext:
430
+ evidence_list = await self._enrich_with_fulltext(evidence_list)
431
+
432
+ return evidence_list
433
+
434
+
435
+ async def _enrich_with_fulltext(
436
+ self, evidence_list: list[Evidence]
437
+ ) -> list[Evidence]:
438
+ """Attempt to add full text to evidence items."""
439
+ enriched = []
440
+
441
+ for evidence in evidence_list:
442
+ # Extract PMID from URL
443
+ url = evidence.citation.url
444
+ pmid = url.rstrip("/").split("/")[-1] if url else None
445
+
446
+ if pmid:
447
+ fulltext = await self.get_fulltext(pmid)
448
+ if fulltext:
449
+ # Replace abstract with full text (truncated)
450
+ evidence = Evidence(
451
+ content=fulltext[:8000], # Larger limit for full text
452
+ citation=evidence.citation,
453
+ relevance=evidence.relevance,
454
+ metadata={
455
+ **evidence.metadata,
456
+ "has_fulltext": True,
457
+ },
458
+ )
459
+
460
+ enriched.append(evidence)
461
+
462
+ return enriched
463
+ ```
464
+
465
+ ---
466
+
467
+ ## Demo Script
468
+
469
+ **File**: `examples/pubmed_fulltext_demo.py`
470
+
471
+ ```python
472
+ #!/usr/bin/env python3
473
+ """Demo script to verify PubMed full-text retrieval."""
474
+
475
+ import asyncio
476
+ from src.tools.pubmed import PubMedTool
477
+
478
+
479
+ async def main():
480
+ """Run PubMed full-text demo."""
481
+ tool = PubMedTool()
482
+
483
+ print("=" * 60)
484
+ print("PubMed Full-Text Demo")
485
+ print("=" * 60)
486
+
487
+ # Test 1: Convert PMID to PMCID
488
+ print("\n[Test 1] Converting PMID to PMCID...")
489
+ # Use a known open-access paper
490
+ test_pmid = "34450029" # Example: COVID-related open-access paper
491
+ pmcid = await tool.get_pmc_id(test_pmid)
492
+ print(f"PMID {test_pmid} -> PMCID: {pmcid or 'Not in PMC'}")
493
+
494
+ # Test 2: Get full text
495
+ print("\n[Test 2] Fetching full text...")
496
+ if pmcid:
497
+ fulltext = await tool.get_fulltext(test_pmid)
498
+ if fulltext:
499
+ print(f"Full text length: {len(fulltext)} characters")
500
+ print(f"Preview: {fulltext[:500]}...")
501
+ else:
502
+ print("Full text not available")
503
+
504
+ # Test 3: Get structured sections
505
+ print("\n[Test 3] Fetching structured sections...")
506
+ if pmcid:
507
+ sections = await tool.get_fulltext_structured(test_pmid)
508
+ if sections:
509
+ print("Available sections:")
510
+ for section, text in sections.items():
511
+ print(f" - {section}: {len(text)} chars")
512
+ else:
513
+ print("Structured text not available")
514
+
515
+ # Test 4: Search with full text
516
+ print("\n[Test 4] Search with full-text enrichment...")
517
+ fulltext_tool = PubMedTool(include_fulltext=True)
518
+ results = await fulltext_tool.search(
519
+ "metformin cancer open access",
520
+ max_results=3,
521
+ )
522
+
523
+ for i, evidence in enumerate(results, 1):
524
+ has_ft = evidence.metadata.get("has_fulltext", False)
525
+ print(f"\n--- Result {i} ---")
526
+ print(f"Title: {evidence.citation.title}")
527
+ print(f"Has Full Text: {has_ft}")
528
+ print(f"Content Length: {len(evidence.content)} chars")
529
+
530
+ print("\n" + "=" * 60)
531
+ print("Demo complete!")
532
+
533
+
534
+ if __name__ == "__main__":
535
+ asyncio.run(main())
536
+ ```
537
+
538
+ ---
539
+
540
+ ## Verification Checklist
541
+
542
+ ### Unit Tests
543
+ ```bash
544
+ # Run full-text tests
545
+ uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v
546
+
547
+ # Run all PubMed tests
548
+ uv run pytest tests/unit/tools/test_pubmed.py -v
549
+
550
+ # Expected: All tests pass
551
+ ```
552
+
553
+ ### Integration Test (Manual)
554
+ ```bash
555
+ # Run demo with real API
556
+ uv run python examples/pubmed_fulltext_demo.py
557
+
558
+ # Expected: Real full text from PMC papers
559
+ ```
560
+
561
+ ### Full Test Suite
562
+ ```bash
563
+ make check
564
+ # Expected: All tests pass, mypy clean
565
+ ```
566
+
567
+ ---
568
+
569
+ ## Success Criteria
570
+
571
+ 1. **ID Conversion works**: PMID -> PMCID conversion successful
572
+ 2. **Full text retrieval works**: BioC API returns paper text
573
+ 3. **Structured sections work**: Can get intro/methods/results/discussion separately
574
+ 4. **Search integration works**: a `PubMedTool(include_fulltext=True)` instance enriches results with full text
575
+ 5. **No regressions**: Existing tests still pass
576
+ 6. **Graceful degradation**: Non-PMC papers still return abstracts
577
+
578
+ ---
579
+
580
+ ## Notes
581
+
582
+ - Only ~30% of PubMed papers have full text in PMC
583
+ - BioC API has no documented rate limit, but be respectful
584
+ - Full text can be very long - truncate appropriately
585
+ - Consider caching full text responses (they don't change); see the sketch below
586
+ - Timeout should be longer for full text (60s vs 30s)
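+
+ As flagged in the caching note above, a tiny in-memory cache in front of `get_fulltext` avoids refetching identical BioC responses. A minimal sketch (the wrapper class is illustrative, not part of the plan above):
+
+ ```python
+ class CachedFullText:
+     """Sketch: memoize full-text lookups for a PubMedTool-like object."""
+
+     def __init__(self, tool) -> None:
+         self._tool = tool
+         self._cache: dict[str, str | None] = {}
+
+     async def get_fulltext(self, pmid: str) -> str | None:
+         if pmid not in self._cache:
+             self._cache[pmid] = await self._tool.get_fulltext(pmid)
+         return self._cache[pmid]
+ ```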
docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md ADDED
@@ -0,0 +1,540 @@
1
+ # Phase 17: Rate Limiting with `limits` Library
2
+
3
+ **Priority**: P0 CRITICAL - Prevents API blocks
4
+ **Effort**: ~1 hour
5
+ **Dependencies**: None
6
+
7
+ ---
8
+
9
+ ## CRITICAL: Async Safety Requirements
10
+
11
+ **WARNING**: The rate limiter MUST be async-safe. Blocking the event loop will freeze:
12
+ - The Gradio UI
13
+ - All parallel searches
14
+ - The orchestrator
15
+
16
+ **Rules**:
17
+ 1. **NEVER use `time.sleep()`** - Always use `await asyncio.sleep()`
18
+ 2. **NEVER use blocking while loops** - Use async-aware polling
19
+ 3. **The `limits` library check is synchronous** - Wrap it carefully
20
+
21
+ The implementation below uses a polling pattern that:
22
+ - Checks the limit (synchronous, fast)
23
+ - If exceeded, `await asyncio.sleep()` (non-blocking)
24
+ - Retry the check
25
+
26
+ **Alternative**: If `limits` proves problematic, use `aiolimiter` which is pure-async.
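+
+ For reference, the pure-async fallback is only a few lines. A sketch assuming `aiolimiter`'s `AsyncLimiter(max_rate, time_period)` API; the helper name is illustrative:
+
+ ```python
+ from aiolimiter import AsyncLimiter
+
+ # 3 requests per 1-second window, matching the keyless NCBI limit
+ pubmed_limiter = AsyncLimiter(3, 1)
+
+ async def limited_get(client, url, **kwargs):
+     async with pubmed_limiter:  # suspends the coroutine instead of blocking
+         return await client.get(url, **kwargs)
+ ```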
27
+
28
+ ---
29
+
30
+ ## Overview
31
+
32
+ Replace naive `asyncio.sleep` rate limiting with proper rate limiter using the `limits` library, which provides:
33
+ - Moving window rate limiting
34
+ - Per-API configurable limits
35
+ - Thread-safe storage
36
+ - Already used in reference repo
37
+
38
+ **Why This Matters?**
39
+ - NCBI will block us without proper rate limiting (3/sec without key, 10/sec with)
40
+ - Current implementation only has simple sleep delay
41
+ - Need coordinated limits across all PubMed calls
42
+ - Professional-grade rate limiting prevents production issues
43
+
44
+ ---
45
+
46
+ ## Current State
47
+
48
+ ### What We Have (`src/tools/pubmed.py:20-21, 34-41`)
49
+
50
+ ```python
51
+ RATE_LIMIT_DELAY = 0.34 # ~3 requests/sec without API key
52
+
53
+ async def _rate_limit(self) -> None:
54
+ """Enforce NCBI rate limiting."""
55
+ loop = asyncio.get_running_loop()
56
+ now = loop.time()
57
+ elapsed = now - self._last_request_time
58
+ if elapsed < self.RATE_LIMIT_DELAY:
59
+ await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed)
60
+ self._last_request_time = loop.time()
61
+ ```
62
+
63
+ ### Problems
64
+
65
+ 1. **Not shared across instances**: Each `PubMedTool()` has its own counter
66
+ 2. **Simple delay vs moving window**: Doesn't handle bursts properly
67
+ 3. **Hardcoded rate**: Doesn't adapt to API key presence
68
+ 4. **No backoff on 429**: Just retries blindly
69
+
70
+ ---
71
+
72
+ ## TDD Implementation Plan
73
+
74
+ ### Step 1: Add Dependency
75
+
76
+ **File**: `pyproject.toml`
77
+
78
+ ```toml
79
+ dependencies = [
80
+ # ... existing deps ...
81
+ "limits>=3.0",
82
+ ]
83
+ ```
84
+
85
+ Then run:
86
+ ```bash
87
+ uv sync
88
+ ```
89
+
90
+ ---
91
+
92
+ ### Step 2: Write the Tests First
93
+
94
+ **File**: `tests/unit/tools/test_rate_limiting.py`
95
+
96
+ ```python
97
+ """Tests for rate limiting functionality."""
98
+
99
+ import asyncio
100
+ import time
101
+
102
+ import pytest
103
+
104
+ from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter
105
+
106
+
107
+ class TestRateLimiter:
108
+ """Test suite for rate limiter."""
109
+
110
+ def test_create_limiter_without_api_key(self) -> None:
111
+ """Should create 3/sec limiter without API key."""
112
+ limiter = RateLimiter(rate="3/second")
113
+ assert limiter.rate == "3/second"
114
+
115
+ def test_create_limiter_with_api_key(self) -> None:
116
+ """Should create 10/sec limiter with API key."""
117
+ limiter = RateLimiter(rate="10/second")
118
+ assert limiter.rate == "10/second"
119
+
120
+ @pytest.mark.asyncio
121
+ async def test_limiter_allows_requests_under_limit(self) -> None:
122
+ """Should allow requests under the rate limit."""
123
+ limiter = RateLimiter(rate="10/second")
124
+
125
+ # 3 requests should all succeed immediately
126
+ for _ in range(3):
127
+ allowed = await limiter.acquire()
128
+ assert allowed is True
129
+
130
+ @pytest.mark.asyncio
131
+ async def test_limiter_blocks_when_exceeded(self) -> None:
132
+ """Should wait when rate limit exceeded."""
133
+ limiter = RateLimiter(rate="2/second")
134
+
135
+ # First 2 should be instant
136
+ await limiter.acquire()
137
+ await limiter.acquire()
138
+
139
+ # Third should block briefly
140
+ start = time.monotonic()
141
+ await limiter.acquire()
142
+ elapsed = time.monotonic() - start
143
+
144
+ # Third request must wait for the 1-second moving window to free a slot (~1s); assert with slack
145
+ assert elapsed >= 0.3
146
+
147
+ @pytest.mark.asyncio
148
+ async def test_limiter_resets_after_window(self) -> None:
149
+ """Rate limit should reset after time window."""
150
+ limiter = RateLimiter(rate="5/second")
151
+
152
+ # Use up the limit
153
+ for _ in range(5):
154
+ await limiter.acquire()
155
+
156
+ # Wait for window to pass
157
+ await asyncio.sleep(1.1)
158
+
159
+ # Should be allowed again
160
+ start = time.monotonic()
161
+ await limiter.acquire()
162
+ elapsed = time.monotonic() - start
163
+
164
+ assert elapsed < 0.1 # Should be nearly instant
165
+
166
+
167
+ class TestGetPubmedLimiter:
168
+ """Test PubMed-specific limiter factory."""
169
+
170
+ def test_limiter_without_api_key(self) -> None:
171
+ """Should return 3/sec limiter without key."""
172
+ reset_pubmed_limiter()  # clear the module-level singleton between tests
+ limiter = get_pubmed_limiter(api_key=None)
173
+ assert "3" in limiter.rate
174
+
175
+ def test_limiter_with_api_key(self) -> None:
176
+ """Should return 10/sec limiter with key."""
177
+ reset_pubmed_limiter()  # clear the module-level singleton between tests
+ limiter = get_pubmed_limiter(api_key="my-api-key")
178
+ assert "10" in limiter.rate
179
+
180
+ def test_limiter_is_singleton(self) -> None:
181
+ """Same API key should return same limiter instance."""
182
+ limiter1 = get_pubmed_limiter(api_key="key1")
183
+ limiter2 = get_pubmed_limiter(api_key="key1")
184
+ assert limiter1 is limiter2
185
+
186
+ def test_different_keys_share_limiter(self) -> None:
187
+ """Different API keys should still share the same NCBI limiter."""
188
+ limiter1 = get_pubmed_limiter(api_key="key1")
189
+ limiter2 = get_pubmed_limiter(api_key="key2")
190
+ # Different keys intentionally share one limiter: every caller hits
191
+ # the same NCBI API, so the limit must be enforced process-wide
192
+ # rather than per key.
193
+ assert limiter1 is limiter2  # Shared NCBI rate limit
194
+ ```
195
+
196
+ ---
197
+
198
+ ### Step 3: Create Rate Limiter Module
199
+
200
+ **File**: `src/tools/rate_limiter.py`
201
+
202
+ ```python
203
+ """Rate limiting utilities using the limits library."""
204
+
205
+ import asyncio
206
+ from typing import ClassVar
207
+
208
+ from limits import RateLimitItem, parse
209
+ from limits.storage import MemoryStorage
210
+ from limits.strategies import MovingWindowRateLimiter
211
+
212
+
213
+ class RateLimiter:
214
+ """
215
+ Async-compatible rate limiter using limits library.
216
+
217
+ Uses moving window algorithm for smooth rate limiting.
218
+ """
219
+
220
+ def __init__(self, rate: str) -> None:
221
+ """
222
+ Initialize rate limiter.
223
+
224
+ Args:
225
+ rate: Rate string like "3/second" or "10/second"
226
+ """
227
+ self.rate = rate
228
+ self._storage = MemoryStorage()
229
+ self._limiter = MovingWindowRateLimiter(self._storage)
230
+ self._rate_limit: RateLimitItem = parse(rate)
231
+ self._identity = "default" # Single identity for shared limiting
232
+
233
+ async def acquire(self, wait: bool = True) -> bool:
234
+ """
235
+ Acquire permission to make a request.
236
+
237
+ ASYNC-SAFE: Uses asyncio.sleep(), never time.sleep().
238
+ The polling pattern allows other coroutines to run while waiting.
239
+
240
+ Args:
241
+ wait: If True, wait until allowed. If False, return immediately.
242
+
243
+ Returns:
244
+ True if allowed, False if not (only when wait=False)
245
+ """
246
+ while True:
247
+ # Check if we can proceed (synchronous, fast - ~microseconds)
248
+ if self._limiter.hit(self._rate_limit, self._identity):
249
+ return True
250
+
251
+ if not wait:
252
+ return False
253
+
254
+ # CRITICAL: Use asyncio.sleep(), NOT time.sleep()
255
+ # This yields control to the event loop, allowing other
256
+ # coroutines (UI, parallel searches) to run
257
+ await asyncio.sleep(0.1)
258
+
259
+ def reset(self) -> None:
260
+ """Reset the rate limiter (for testing)."""
261
+ self._storage.reset()
262
+
263
+
264
+ # Singleton limiter for PubMed/NCBI
265
+ _pubmed_limiter: RateLimiter | None = None
266
+
267
+
268
+ def get_pubmed_limiter(api_key: str | None = None) -> RateLimiter:
269
+ """
270
+ Get the shared PubMed rate limiter.
271
+
272
+ Rate depends on whether API key is provided:
273
+ - Without key: 3 requests/second
274
+ - With key: 10 requests/second
275
+
276
+ Args:
277
+ api_key: NCBI API key (optional)
278
+
279
+ Returns:
280
+ Shared RateLimiter instance
281
+ """
282
+ global _pubmed_limiter
283
+
284
+ if _pubmed_limiter is None:
285
+ rate = "10/second" if api_key else "3/second"
286
+ _pubmed_limiter = RateLimiter(rate)
287
+
288
+ return _pubmed_limiter
289
+
290
+
291
+ def reset_pubmed_limiter() -> None:
292
+ """Reset the PubMed limiter (for testing)."""
293
+ global _pubmed_limiter
294
+ _pubmed_limiter = None
295
+
296
+
297
+ # Factory for other APIs
298
+ class RateLimiterFactory:
299
+ """Factory for creating/getting rate limiters for different APIs."""
300
+
301
+ _limiters: ClassVar[dict[str, RateLimiter]] = {}
302
+
303
+ @classmethod
304
+ def get(cls, api_name: str, rate: str) -> RateLimiter:
305
+ """
306
+ Get or create a rate limiter for an API.
307
+
308
+ Args:
309
+ api_name: Unique identifier for the API
310
+ rate: Rate limit string (e.g., "10/second")
311
+
312
+ Returns:
313
+ RateLimiter instance (shared for same api_name)
314
+ """
315
+ if api_name not in cls._limiters:
316
+ cls._limiters[api_name] = RateLimiter(rate)
317
+ return cls._limiters[api_name]
318
+
319
+ @classmethod
320
+ def reset_all(cls) -> None:
321
+ """Reset all limiters (for testing)."""
322
+ cls._limiters.clear()
323
+ ```
324
+
325
+ ---
326
+
327
+ ### Step 4: Update PubMed Tool
328
+
329
+ **File**: `src/tools/pubmed.py` (replace rate limiting code)
330
+
331
+ ```python
332
+ # Replace imports and rate limiting
333
+
334
+ from src.tools.rate_limiter import get_pubmed_limiter
335
+
336
+
337
+ class PubMedTool:
338
+ """Search tool for PubMed/NCBI."""
339
+
340
+ BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
341
+ HTTP_TOO_MANY_REQUESTS = 429
342
+
343
+ def __init__(self, api_key: str | None = None) -> None:
344
+ self.api_key = api_key or settings.ncbi_api_key
345
+ if self.api_key == "your-ncbi-key-here":
346
+ self.api_key = None
347
+ # Use shared rate limiter
348
+ self._limiter = get_pubmed_limiter(self.api_key)
349
+
350
+ async def _rate_limit(self) -> None:
351
+ """Enforce NCBI rate limiting using shared limiter."""
352
+ await self._limiter.acquire()
353
+
354
+ # ... rest of class unchanged ...
355
+ ```
356
+
357
+ ---
358
+
359
+ ### Step 5: Add Rate Limiters for Other APIs
360
+
361
+ **File**: `src/tools/clinicaltrials.py` (optional)
362
+
363
+ ```python
364
+ from src.tools.rate_limiter import RateLimiterFactory
365
+
366
+
367
+ class ClinicalTrialsTool:
368
+ def __init__(self) -> None:
369
+ # ClinicalTrials.gov doesn't document limits, but be conservative
370
+ self._limiter = RateLimiterFactory.get("clinicaltrials", "5/second")
371
+
372
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
373
+ await self._limiter.acquire()
374
+ # ... rest of method ...
375
+ ```
376
+
377
+ **File**: `src/tools/europepmc.py` (optional)
378
+
379
+ ```python
380
+ from src.tools.rate_limiter import RateLimiterFactory
381
+
382
+
383
+ class EuropePMCTool:
384
+ def __init__(self) -> None:
385
+ # Europe PMC is generous, but still be respectful
386
+ self._limiter = RateLimiterFactory.get("europepmc", "10/second")
387
+
388
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
389
+ await self._limiter.acquire()
390
+ # ... rest of method ...
391
+ ```
392
+
393
+ ---
394
+
395
+ ## Demo Script
396
+
397
+ **File**: `examples/rate_limiting_demo.py`
398
+
399
+ ```python
400
+ #!/usr/bin/env python3
401
+ """Demo script to verify rate limiting works correctly."""
402
+
403
+ import asyncio
404
+ import time
405
+
406
+ from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter, reset_pubmed_limiter
407
+ from src.tools.pubmed import PubMedTool
408
+
409
+
410
+ async def test_basic_limiter():
411
+ """Test basic rate limiter behavior."""
412
+ print("=" * 60)
413
+ print("Rate Limiting Demo")
414
+ print("=" * 60)
415
+
416
+ # Test 1: Basic limiter
417
+ print("\n[Test 1] Testing 3/second limiter...")
418
+ limiter = RateLimiter("3/second")
419
+
420
+ start = time.monotonic()
421
+ for i in range(6):
422
+ await limiter.acquire()
423
+ elapsed = time.monotonic() - start
424
+ print(f" Request {i+1} at {elapsed:.2f}s")
425
+
426
+ total = time.monotonic() - start
427
+ print(f" Total time for 6 requests: {total:.2f}s (expected ~2s)")
428
+
429
+
430
+ async def test_pubmed_limiter():
431
+ """Test PubMed-specific limiter."""
432
+ print("\n[Test 2] Testing PubMed limiter (shared)...")
433
+
434
+ reset_pubmed_limiter() # Clean state
435
+
436
+ # Without API key: 3/sec
437
+ limiter = get_pubmed_limiter(api_key=None)
438
+ print(f" Rate without key: {limiter.rate}")
439
+
440
+ # Multiple tools should share the same limiter
441
+ tool1 = PubMedTool()
442
+ tool2 = PubMedTool()
443
+
444
+ # Verify they share the limiter
445
+ print(f" Tools share limiter: {tool1._limiter is tool2._limiter}")
446
+
447
+
448
+ async def test_concurrent_requests():
449
+ """Test rate limiting under concurrent load."""
450
+ print("\n[Test 3] Testing concurrent request limiting...")
451
+
452
+ limiter = RateLimiter("5/second")
453
+
454
+ async def make_request(i: int):
455
+ await limiter.acquire()
456
+ return time.monotonic()
457
+
458
+ start = time.monotonic()
459
+ # Launch 10 concurrent requests
460
+ tasks = [make_request(i) for i in range(10)]
461
+ times = await asyncio.gather(*tasks)
462
+
463
+ # Calculate distribution
464
+ relative_times = [t - start for t in times]
465
+ print(f" Request times: {[f'{t:.2f}s' for t in sorted(relative_times)]}")
466
+
467
+ total = max(relative_times)
468
+ print(f" All 10 requests completed in {total:.2f}s (expected ~2s)")
469
+
470
+
471
+ async def main():
472
+ await test_basic_limiter()
473
+ await test_pubmed_limiter()
474
+ await test_concurrent_requests()
475
+
476
+ print("\n" + "=" * 60)
477
+ print("Demo complete!")
478
+
479
+
480
+ if __name__ == "__main__":
481
+ asyncio.run(main())
482
+ ```
483
+
484
+ ---
485
+
486
+ ## Verification Checklist
487
+
488
+ ### Unit Tests
489
+ ```bash
490
+ # Run rate limiting tests
491
+ uv run pytest tests/unit/tools/test_rate_limiting.py -v
492
+
493
+ # Expected: All tests pass
494
+ ```
495
+
496
+ ### Integration Test (Manual)
497
+ ```bash
498
+ # Run demo
499
+ uv run python examples/rate_limiting_demo.py
500
+
501
+ # Expected: Requests properly spaced
502
+ ```
503
+
504
+ ### Full Test Suite
505
+ ```bash
506
+ make check
507
+ # Expected: All tests pass, mypy clean
508
+ ```
509
+
510
+ ---
511
+
512
+ ## Success Criteria
513
+
514
+ 1. **`limits` library installed**: Dependency added to pyproject.toml
515
+ 2. **RateLimiter class works**: Can create and use limiters
516
+ 3. **PubMed uses new limiter**: Shared limiter across instances
517
+ 4. **Rate adapts to API key**: 3/sec without, 10/sec with
518
+ 5. **Concurrent requests handled**: Multiple async requests properly queued
519
+ 6. **No regressions**: All existing tests pass
520
+
521
+ ---
522
+
523
+ ## API Rate Limit Reference
524
+
525
+ | API | Without Key | With Key |
526
+ |-----|-------------|----------|
527
+ | PubMed/NCBI | 3/sec | 10/sec |
528
+ | ClinicalTrials.gov | Undocumented (~5/sec safe) | N/A |
529
+ | Europe PMC | ~10-20/sec (generous) | N/A |
530
+ | OpenAlex | ~100k/day (no per-sec limit) | Faster with `mailto` |
531
+
532
+ ---
533
+
534
+ ## Notes
535
+
536
+ - `limits` library uses moving window algorithm (fairer than fixed window)
537
+ - Singleton pattern ensures all PubMed calls share the limit
538
+ - The factory pattern allows easy extension to other APIs
539
+ - Consider adding 429 response detection + exponential backoff
540
+ - In production, consider Redis storage for distributed rate limiting
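+
+ A minimal sketch of the 429-plus-backoff idea (hypothetical helper, not in the current codebase; it only assumes requests go through `httpx.AsyncClient`):
+
+ ```python
+ # Hypothetical sketch: exponential backoff when the API answers HTTP 429.
+ import asyncio
+
+ import httpx
+
+ HTTP_TOO_MANY_REQUESTS = 429
+
+ async def get_with_backoff(client: httpx.AsyncClient, url: str, max_retries: int = 3) -> httpx.Response:
+     delay = 1.0
+     for attempt in range(max_retries + 1):
+         response = await client.get(url)
+         if response.status_code != HTTP_TOO_MANY_REQUESTS or attempt == max_retries:
+             return response
+         await asyncio.sleep(delay)  # async sleep keeps the event loop free
+         delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
+     return response
+ ```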
docs/brainstorming/implementation/README.md ADDED
@@ -0,0 +1,143 @@
1
+ # Implementation Plans
2
+
3
+ TDD implementation plans based on the brainstorming documents. Each phase is a self-contained vertical slice with tests, implementation, and demo scripts.
4
+
5
+ ---
6
+
7
+ ## Prerequisites (COMPLETED)
8
+
9
+ The following foundational changes have been implemented to support all three phases:
10
+
11
+ | Change | File | Status |
12
+ |--------|------|--------|
13
+ | Add `"openalex"` to `SourceName` | `src/utils/models.py:9` | ✅ Done |
14
+ | Add `metadata` field to `Evidence` | `src/utils/models.py:39-42` | ✅ Done |
15
+ | Export all tools from `__init__.py` | `src/tools/__init__.py` | ✅ Done |
16
+
17
+ All 110 tests pass after these changes.
18
+
19
+ ---
20
+
21
+ ## Priority Order
22
+
23
+ | Phase | Name | Priority | Effort | Value |
24
+ |-------|------|----------|--------|-------|
25
+ | **17** | Rate Limiting | P0 CRITICAL | 1 hour | Stability |
26
+ | **15** | OpenAlex | HIGH | 2-3 hours | Very High |
27
+ | **16** | PubMed Full-Text | MEDIUM | 3 hours | High |
28
+
29
+ **Recommended implementation order**: 17 → 15 → 16
30
+
31
+ ---
32
+
33
+ ## Phase 15: OpenAlex Integration
34
+
35
+ **File**: [15_PHASE_OPENALEX.md](./15_PHASE_OPENALEX.md)
36
+
37
+ Add OpenAlex as 4th data source for:
38
+ - Citation networks (who cites whom)
39
+ - Concept tagging (semantic discovery)
40
+ - 209M+ scholarly works
41
+ - Free, no API key required
42
+
43
+ **Quick Start**:
44
+ ```bash
45
+ # Create the tool
46
+ touch src/tools/openalex.py
47
+ touch tests/unit/tools/test_openalex.py
48
+
49
+ # Run tests first (TDD)
50
+ uv run pytest tests/unit/tools/test_openalex.py -v
51
+
52
+ # Demo
53
+ uv run python examples/openalex_demo.py
54
+ ```
55
+
56
+ ---
57
+
58
+ ## Phase 16: PubMed Full-Text
59
+
60
+ **File**: [16_PHASE_PUBMED_FULLTEXT.md](./16_PHASE_PUBMED_FULLTEXT.md)
61
+
62
+ Add full-text retrieval via BioC API for:
63
+ - Complete paper text (not just abstracts)
64
+ - Structured sections (intro, methods, results)
65
+ - Better evidence for LLM synthesis
66
+
67
+ **Quick Start**:
68
+ ```bash
69
+ # Add methods to existing pubmed.py
70
+ # Tests in test_pubmed_fulltext.py
71
+
72
+ # Run tests
73
+ uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v
74
+
75
+ # Demo
76
+ uv run python examples/pubmed_fulltext_demo.py
77
+ ```
78
+
79
+ ---
80
+
81
+ ## Phase 17: Rate Limiting
82
+
83
+ **File**: [17_PHASE_RATE_LIMITING.md](./17_PHASE_RATE_LIMITING.md)
84
+
85
+ Replace naive sleep-based rate limiting with `limits` library for:
86
+ - Moving window algorithm
87
+ - Shared limits across instances
88
+ - Configurable per-API rates
89
+ - Production-grade stability
90
+
91
+ **Quick Start**:
92
+ ```bash
93
+ # Add dependency
94
+ uv add limits
95
+
96
+ # Create module
97
+ touch src/tools/rate_limiter.py
98
+ touch tests/unit/tools/test_rate_limiting.py
99
+
100
+ # Run tests
101
+ uv run pytest tests/unit/tools/test_rate_limiting.py -v
102
+
103
+ # Demo
104
+ uv run python examples/rate_limiting_demo.py
105
+ ```
106
+
107
+ ---
108
+
109
+ ## TDD Workflow
110
+
111
+ Each implementation doc follows this pattern:
112
+
113
+ 1. **Write tests first** - Define expected behavior
114
+ 2. **Run tests** - Verify they fail (red)
115
+ 3. **Implement** - Write minimal code to pass
116
+ 4. **Run tests** - Verify they pass (green)
117
+ 5. **Refactor** - Clean up if needed
118
+ 6. **Demo** - Verify end-to-end with real APIs
119
+ 7. **`make check`** - Ensure no regressions
120
+
121
+ ---
122
+
123
+ ## Related Brainstorming Docs
124
+
125
+ These implementation plans are derived from:
126
+
127
+ - [00_ROADMAP_SUMMARY.md](../00_ROADMAP_SUMMARY.md) - Priority overview
128
+ - [01_PUBMED_IMPROVEMENTS.md](../01_PUBMED_IMPROVEMENTS.md) - PubMed details
129
+ - [02_CLINICALTRIALS_IMPROVEMENTS.md](../02_CLINICALTRIALS_IMPROVEMENTS.md) - CT.gov details
130
+ - [03_EUROPEPMC_IMPROVEMENTS.md](../03_EUROPEPMC_IMPROVEMENTS.md) - Europe PMC details
131
+ - [04_OPENALEX_INTEGRATION.md](../04_OPENALEX_INTEGRATION.md) - OpenAlex integration
132
+
133
+ ---
134
+
135
+ ## Future Phases (Not Yet Documented)
136
+
137
+ Based on brainstorming, these could be added later:
138
+
139
+ - **Phase 18**: ClinicalTrials.gov Results Retrieval
140
+ - **Phase 19**: Europe PMC Annotations API
141
+ - **Phase 20**: Drug Name Normalization (RxNorm)
142
+ - **Phase 21**: Citation Network Queries (OpenAlex)
143
+ - **Phase 22**: Semantic Search with Embeddings
docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md ADDED
@@ -0,0 +1,189 @@
1
+ # Situation Analysis: Pydantic-AI + Microsoft Agent Framework Integration
2
+
3
+ **Date:** November 27, 2025
4
+ **Status:** ACTIVE DECISION REQUIRED
5
+ **Risk Level:** HIGH - DO NOT MERGE PR #41 UNTIL RESOLVED
6
+
7
+ ---
8
+
9
+ ## 1. The Problem
10
+
11
+ We almost merged a refactor that would have **deleted** multi-agent orchestration capability from the codebase, mistakenly believing pydantic-ai and Microsoft Agent Framework were mutually exclusive.
12
+
13
+ **They are not.** They are complementary:
14
+ - **pydantic-ai** (Library): Ensures LLM outputs match Pydantic schemas
15
+ - **Microsoft Agent Framework** (Framework): Orchestrates multi-agent workflows
16
+
17
+ ---
18
+
19
+ ## 2. Current Branch State
20
+
21
+ | Branch | Location | Has Agent Framework? | Has Pydantic-AI Improvements? | Status |
22
+ |--------|----------|---------------------|------------------------------|--------|
23
+ | `origin/dev` | GitHub | YES | NO | **SAFE - Source of Truth** |
24
+ | `huggingface-upstream/dev` | HF Spaces | YES | NO | **SAFE - Same as GitHub** |
25
+ | `origin/main` | GitHub | YES | NO | **SAFE** |
26
+ | `feat/pubmed-fulltext` | GitHub | NO (deleted) | YES | **DANGER - Has destructive refactor** |
27
+ | `refactor/pydantic-unification` | Local | NO (deleted) | YES | **DANGER - Redundant, delete** |
28
+ | Local `dev` | Local only | NO (deleted) | YES | **DANGER - NOT PUSHED (thankfully)** |
29
+
30
+ ### Key Files at Risk
31
+
32
+ **On `origin/dev` (PRESERVED):**
33
+ ```text
34
+ src/agents/
35
+ ├── analysis_agent.py # StatisticalAnalyzer wrapper
36
+ ├── hypothesis_agent.py # Hypothesis generation
37
+ ├── judge_agent.py # JudgeHandler wrapper
38
+ ├── magentic_agents.py # Multi-agent definitions
39
+ ├── report_agent.py # Report synthesis
40
+ ├── search_agent.py # SearchHandler wrapper
41
+ ├── state.py # Thread-safe state management
42
+ └── tools.py # @ai_function decorated tools
43
+
44
+ src/orchestrator_magentic.py # Multi-agent orchestrator
45
+ src/utils/llm_factory.py # Centralized LLM client factory
46
+ ```
47
+
48
+ **Deleted in refactor branch (would be lost if merged):**
49
+ - All of the above
50
+
51
+ ---
52
+
53
+ ## 3. Target Architecture
54
+
55
+ ```text
56
+ ┌─────────────────────────────────────────────────────────────────┐
57
+ │ Microsoft Agent Framework (Orchestration Layer) │
58
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
59
+ │ │ SearchAgent │→ │ JudgeAgent │→ │ ReportAgent │ │
60
+ │ │ (BaseAgent) │ │ (BaseAgent) │ │ (BaseAgent) │ │
61
+ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
62
+ │ │ │ │ │
63
+ │ ▼ ▼ ▼ │
64
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
65
+ │ │ pydantic-ai │ │ pydantic-ai │ │ pydantic-ai │ │
66
+ │ │ Agent() │ │ Agent() │ │ Agent() │ │
67
+ │ │ output_type= │ │ output_type= │ │ output_type= │ │
68
+ │ │ SearchResult │ │ JudgeAssess │ │ Report │ │
69
+ │ └──────────────┘ └──────────────┘ └──────────────┘ │
70
+ └─────────────────────────────────────────────────────────────────┘
71
+ ```
72
+
73
+ **Why this architecture:**
74
+ 1. **Agent Framework** handles: workflow coordination, state passing, middleware, observability
75
+ 2. **pydantic-ai** handles: type-safe LLM calls within each agent
76
+
77
+ ---
78
+
79
+ ## 4. CRITICAL: Naming Confusion Clarification
80
+
81
+ > **Senior Agent Review Finding:** The codebase uses "magentic" in file names (e.g., `orchestrator_magentic.py`, `magentic_agents.py`) but this is **NOT** the `magentic` PyPI package by Jacky Liang. It's Microsoft Agent Framework (`agent-framework-core`).
82
+
83
+ **The naming confusion:**
84
+ - `magentic` (PyPI package): A different library for structured LLM outputs
85
+ - "Magentic" (in our codebase): Our internal name for Microsoft Agent Framework integration
86
+ - `agent-framework-core` (PyPI package): Microsoft's actual multi-agent orchestration framework
87
+
88
+ **Recommended future action:** Rename `orchestrator_magentic.py` → `orchestrator_advanced.py` to eliminate confusion.
89
+
90
+ ---
91
+
92
+ ## 5. What the Refactor DID Get Right
93
+
94
+ The refactor branch (`feat/pubmed-fulltext`) has some valuable improvements:
95
+
96
+ 1. **`judges.py` unified `get_model()`** - Supports OpenAI, Anthropic, AND HuggingFace via pydantic-ai
97
+ 2. **HuggingFace free tier support** - `HuggingFaceModel` integration
98
+ 3. **Test fix** - Properly mocks `HuggingFaceModel` class
99
+ 4. **Removed broken magentic optional dependency** from pyproject.toml (this was correct - the old `magentic` package is different from Microsoft Agent Framework)
100
+
101
+ **What it got WRONG:**
102
+ 1. Deleted `src/agents/` entirely instead of refactoring them
103
+ 2. Deleted `src/orchestrator_magentic.py` instead of fixing it
104
+ 3. Conflated "magentic" (old package) with "Microsoft Agent Framework" (current framework)
105
+
106
+ ---
107
+
108
+ ## 6. Options for Path Forward
109
+
110
+ ### Option A: Abandon Refactor, Start Fresh
111
+ - Close PR #41
112
+ - Delete `feat/pubmed-fulltext` and `refactor/pydantic-unification` branches
113
+ - Reset local `dev` to match `origin/dev`
114
+ - Cherry-pick ONLY the good parts (judges.py improvements, HF support)
115
+ - **Pros:** Clean, safe
116
+ - **Cons:** Lose some work, need to redo carefully
117
+
118
+ ### Option B: Cherry-Pick Good Parts to origin/dev
119
+ - Do NOT merge PR #41
120
+ - Create new branch from `origin/dev`
121
+ - Cherry-pick specific commits/changes that improve pydantic-ai usage
122
+ - Keep agent framework code intact
123
+ - **Pros:** Preserves both, surgical
124
+ - **Cons:** Requires careful file-by-file review
125
+
126
+ ### Option C: Revert Deletions in Refactor Branch
127
+ - On `feat/pubmed-fulltext`, restore deleted agent files from `origin/dev`
128
+ - Keep the pydantic-ai improvements
129
+ - Merge THAT to dev
130
+ - **Pros:** Gets both
131
+ - **Cons:** Complex git operations, risk of conflicts
132
+
133
+ ---
134
+
135
+ ## 7. Recommended Action: Option B (Cherry-Pick)
136
+
137
+ **Step-by-step:**
138
+
139
+ 1. **Close PR #41** (do not merge)
140
+ 2. **Delete redundant branches:**
141
+ - `refactor/pydantic-unification` (local)
142
+ - Reset local `dev` to `origin/dev`
143
+ 3. **Create new branch from origin/dev:**
144
+ ```bash
145
+ git checkout -b feat/pydantic-ai-improvements origin/dev
146
+ ```
147
+ 4. **Cherry-pick or manually port these improvements:**
148
+ - `src/agent_factory/judges.py` - the unified `get_model()` function
149
+ - `examples/free_tier_demo.py` - HuggingFace demo
150
+ - Test improvements
151
+ 5. **Do NOT delete any agent framework files**
152
+ 6. **Create PR for review**
153
+
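+ A minimal example of how step 4 could be ported file-by-file (illustrative commands only; branch and file names are those listed above):
+
+ ```bash
+ # Port a single improved file from the refactor branch without merging the branch
+ git checkout feat/pubmed-fulltext -- src/agent_factory/judges.py
+ git add src/agent_factory/judges.py
+ git commit -m "Port unified get_model() with HuggingFace support"
+ ```
+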
154
+ ---
155
+
156
+ ## 8. Files to Cherry-Pick (Safe Improvements)
157
+
158
+ | File | What Changed | Safe to Port? |
159
+ |------|-------------|---------------|
160
+ | `src/agent_factory/judges.py` | Added `HuggingFaceModel` support in `get_model()` | YES |
161
+ | `examples/free_tier_demo.py` | New demo for HF inference | YES |
162
+ | `tests/unit/agent_factory/test_judges.py` | Fixed HF model mocking | YES |
163
+ | `pyproject.toml` | Removed old `magentic` optional dep | MAYBE (review carefully) |
164
+
165
+ ---
166
+
167
+ ## 9. Questions to Answer Before Proceeding
168
+
169
+ 1. **For the hackathon**: Do we need full multi-agent orchestration, or is single-agent sufficient?
170
+ 2. **For DeepCritical mainline**: Is the plan to use Microsoft Agent Framework for orchestration?
171
+ 3. **Timeline**: How much time do we have to get this right?
172
+
173
+ ---
174
+
175
+ ## 10. Immediate Actions (DO NOW)
176
+
177
+ - [ ] **DO NOT merge PR #41**
178
+ - [ ] Close PR #41 with comment explaining the situation
179
+ - [ ] Do not push local `dev` branch anywhere
180
+ - [ ] Confirm HuggingFace Spaces is untouched (it is - verified)
181
+
182
+ ---
183
+
184
+ ## 11. Decision Log
185
+
186
+ | Date | Decision | Rationale |
187
+ |------|----------|-----------|
188
+ | 2025-11-27 | Pause refactor merge | Discovered agent framework and pydantic-ai are complementary, not exclusive |
189
+ | TBD | ? | Awaiting decision on path forward |
docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md ADDED
@@ -0,0 +1,289 @@
1
+ # Architecture Specification: Dual-Mode Agent System
2
+
3
+ **Date:** November 27, 2025
4
+ **Status:** SPECIFICATION
5
+ **Goal:** Graceful degradation from full multi-agent orchestration to simple single-agent mode
6
+
7
+ ---
8
+
9
+ ## 1. Core Concept: Two Operating Modes
10
+
11
+ ```text
12
+ ┌─────────────────────────────────────────────────────────────────────┐
13
+ │ USER REQUEST │
14
+ │ │ │
15
+ │ ▼ │
16
+ │ ┌─────────────────┐ │
17
+ │ │ Mode Selection │ │
18
+ │ │ (Auto-detect) │ │
19
+ │ └────────┬────────┘ │
20
+ │ │ │
21
+ │ ┌───────────────┴───────────────┐ │
22
+ │ │ │ │
23
+ │ ▼ ▼ │
24
+ │ ┌─────────────────┐ ┌─────────────────┐ │
25
+ │ │ SIMPLE MODE │ │ ADVANCED MODE │ │
26
+ │ │ (Free Tier) │ │ (Paid Tier) │ │
27
+ │ │ │ │ │ │
28
+ │ │ pydantic-ai │ │ MS Agent Fwk │ │
29
+ │ │ single-agent │ │ + pydantic-ai │ │
30
+ │ │ loop │ │ multi-agent │ │
31
+ │ └─────────────────┘ └─────────────────┘ │
32
+ │ │ │ │
33
+ │ └───────────────┬───────────────┘ │
34
+ │ ▼ │
35
+ │ ┌─────────────────┐ │
36
+ │ │ Research Report │ │
37
+ │ │ with Citations │ │
38
+ │ └─────────────────┘ │
39
+ └─────────────────────────────────────────────────────────────────────┘
40
+ ```
41
+
42
+ ---
43
+
44
+ ## 2. Mode Comparison
45
+
46
+ | Aspect | Simple Mode | Advanced Mode |
47
+ |--------|-------------|---------------|
48
+ | **Trigger** | No API key OR `LLM_PROVIDER=huggingface` | OpenAI API key present (currently OpenAI only) |
49
+ | **Framework** | pydantic-ai only | Microsoft Agent Framework + pydantic-ai |
50
+ | **Architecture** | Single orchestrator loop | Multi-agent coordination |
51
+ | **Agents** | One agent does Search→Judge→Report | SearchAgent, JudgeAgent, ReportAgent, AnalysisAgent |
52
+ | **State Management** | Simple dict | Thread-safe `MagenticState` with context vars |
53
+ | **Quality** | Good (functional) | Better (specialized agents, coordination) |
54
+ | **Cost** | Free (HuggingFace Inference) | Paid (OpenAI/Anthropic) |
55
+ | **Use Case** | Demos, hackathon, budget-constrained | Production, research quality |
56
+
57
+ ---
58
+
59
+ ## 3. Simple Mode Architecture (pydantic-ai Only)
60
+
61
+ ```text
62
+ ┌─────────────────────────────────────────────────────┐
63
+ │ Orchestrator │
64
+ │ │
65
+ │ while not sufficient and iteration < max: │
66
+ │ 1. SearchHandler.execute(query) │
67
+ │ 2. JudgeHandler.assess(evidence) ◄── pydantic-ai Agent │
68
+ │ 3. if sufficient: break │
69
+ │ 4. query = judge.next_queries │
70
+ │ │
71
+ │ return ReportGenerator.generate(evidence) │
72
+ └─────────────────────────────────────────────────────┘
73
+ ```
74
+
75
+ **Components:**
76
+ - `src/orchestrator.py` - Simple loop orchestrator
77
+ - `src/agent_factory/judges.py` - JudgeHandler with pydantic-ai
78
+ - `src/tools/search_handler.py` - Scatter-gather search
79
+ - `src/tools/pubmed.py`, `clinicaltrials.py`, `europepmc.py` - Search tools
80
+
81
+ ---
82
+
83
+ ## 4. Advanced Mode Architecture (MS Agent Framework + pydantic-ai)
84
+
85
+ ```text
86
+ ┌─────────────────────────────────────────────────────────────────────┐
87
+ │ Microsoft Agent Framework Orchestrator │
88
+ │ │
89
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
90
+ │ │ SearchAgent │───▶│ JudgeAgent │───▶│ ReportAgent │ │
91
+ │ │ (BaseAgent) │ │ (BaseAgent) │ │ (BaseAgent) │ │
92
+ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
93
+ │ │ │ │ │
94
+ │ ▼ ▼ ▼ │
95
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
96
+ │ │ pydantic-ai │ │ pydantic-ai │ │ pydantic-ai │ │
97
+ │ │ Agent() │ │ Agent() │ │ Agent() │ │
98
+ │ │ output_type=│ │ output_type=│ │ output_type=│ │
99
+ │ │ SearchResult│ │ JudgeAssess │ │ Report │ │
100
+ │ └─────────────┘ └─────────────┘ └─────────────┘ │
101
+ │ │
102
+ │ Shared State: MagenticState (thread-safe via contextvars) │
103
+ │ - evidence: list[Evidence] │
104
+ │ - embedding_service: EmbeddingService │
105
+ └─────────────────────────────────────────────────────────────────────┘
106
+ ```
107
+
108
+ **Components:**
109
+ - `src/orchestrator_magentic.py` - Multi-agent orchestrator
110
+ - `src/agents/search_agent.py` - SearchAgent (BaseAgent)
111
+ - `src/agents/judge_agent.py` - JudgeAgent (BaseAgent)
112
+ - `src/agents/report_agent.py` - ReportAgent (BaseAgent)
113
+ - `src/agents/analysis_agent.py` - AnalysisAgent (BaseAgent)
114
+ - `src/agents/state.py` - Thread-safe state management
115
+ - `src/agents/tools.py` - @ai_function decorated tools
116
+
117
+ ---
118
+
119
+ ## 5. Mode Selection Logic
120
+
121
+ ```python
122
+ # src/orchestrator_factory.py (actual implementation)
123
+
124
+ def create_orchestrator(
125
+ search_handler: SearchHandlerProtocol | None = None,
126
+ judge_handler: JudgeHandlerProtocol | None = None,
127
+ config: OrchestratorConfig | None = None,
128
+ mode: Literal["simple", "magentic", "advanced"] | None = None,
129
+ ) -> Any:
130
+ """
131
+ Auto-select orchestrator based on available credentials.
132
+
133
+ Priority:
134
+ 1. If mode explicitly set, use that
135
+ 2. If OpenAI key available -> Advanced Mode (currently OpenAI only)
136
+ 3. Otherwise -> Simple Mode (HuggingFace free tier)
137
+ """
138
+ effective_mode = _determine_mode(mode)
139
+
140
+ if effective_mode == "advanced":
141
+ orchestrator_cls = _get_magentic_orchestrator_class()
142
+ return orchestrator_cls(max_rounds=config.max_iterations if config else 10)
143
+
144
+ # Simple mode requires handlers
145
+ if search_handler is None or judge_handler is None:
146
+ raise ValueError("Simple mode requires search_handler and judge_handler")
147
+
148
+ return Orchestrator(
149
+ search_handler=search_handler,
150
+ judge_handler=judge_handler,
151
+ config=config,
152
+ )
153
+ ```
154
+
155
+ ---
156
+
157
+ ## 6. Shared Components (Both Modes Use)
158
+
159
+ These components work in both modes:
160
+
161
+ | Component | Purpose |
162
+ |-----------|---------|
163
+ | `src/tools/pubmed.py` | PubMed search |
164
+ | `src/tools/clinicaltrials.py` | ClinicalTrials.gov search |
165
+ | `src/tools/europepmc.py` | Europe PMC search |
166
+ | `src/tools/search_handler.py` | Scatter-gather orchestration |
167
+ | `src/tools/rate_limiter.py` | Rate limiting |
168
+ | `src/utils/models.py` | Evidence, Citation, JudgeAssessment |
169
+ | `src/utils/config.py` | Settings |
170
+ | `src/services/embeddings.py` | Vector search (optional) |
171
+
172
+ ---
173
+
174
+ ## 7. pydantic-ai Integration Points
175
+
176
+ Both modes use pydantic-ai for structured LLM outputs:
177
+
178
+ ```python
179
+ # In JudgeHandler (both modes)
180
+ from pydantic_ai import Agent
181
+ from pydantic_ai.models.huggingface import HuggingFaceModel
182
+ from pydantic_ai.models.openai import OpenAIModel
183
+ from pydantic_ai.models.anthropic import AnthropicModel
184
+
185
+ class JudgeHandler:
186
+ def __init__(self, model: Any = None):
187
+ self.model = model or get_model() # Auto-selects based on config
188
+ self.agent = Agent(
189
+ model=self.model,
190
+ output_type=JudgeAssessment, # Structured output!
191
+ system_prompt=SYSTEM_PROMPT,
192
+ )
193
+
194
+ async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
195
+ result = await self.agent.run(format_prompt(question, evidence))
196
+ return result.output # Guaranteed to be JudgeAssessment
197
+ ```
198
+
199
+ ---
200
+
201
+ ## 8. Microsoft Agent Framework Integration Points
202
+
203
+ Advanced mode wraps pydantic-ai agents in BaseAgent:
204
+
205
+ ```python
206
+ # In JudgeAgent (advanced mode only)
207
+ from agent_framework import BaseAgent, AgentRunResponse, ChatMessage, Role
208
+
209
+ class JudgeAgent(BaseAgent):
210
+ def __init__(self, judge_handler: JudgeHandlerProtocol):
211
+ super().__init__(
212
+ name="JudgeAgent",
213
+ description="Evaluates evidence quality",
214
+ )
215
+ self._handler = judge_handler # Uses pydantic-ai internally
216
+
217
+ async def run(self, messages, **kwargs) -> AgentRunResponse:
218
+ question = extract_question(messages)
219
+ evidence = self._evidence_store.get("current", [])
220
+
221
+ # Delegate to pydantic-ai powered handler
222
+ assessment = await self._handler.assess(question, evidence)
223
+
224
+ return AgentRunResponse(
225
+ messages=[ChatMessage(role=Role.ASSISTANT, text=format_response(assessment))],
226
+ additional_properties={"assessment": assessment.model_dump()},
227
+ )
228
+ ```
229
+
230
+ ---
231
+
232
+ ## 9. Benefits of This Architecture
233
+
234
+ 1. **Graceful Degradation**: Works without API keys (free tier)
235
+ 2. **Progressive Enhancement**: Better with API keys (orchestration)
236
+ 3. **Code Reuse**: pydantic-ai handlers shared between modes
237
+ 4. **Hackathon Ready**: Demo works without requiring paid keys
238
+ 5. **Production Ready**: Full orchestration available when needed
239
+ 6. **Future Proof**: Can add more agents to advanced mode
240
+ 7. **Testable**: Simple mode is easier to unit test
241
+
242
+ ---
243
+
244
+ ## 10. Known Risks and Mitigations
245
+
246
+ > **From Senior Agent Review**
247
+
248
+ ### 10.1 Bridge Complexity (MEDIUM)
249
+
250
+ **Risk:** In Advanced Mode, agents (Agent Framework) wrap handlers (pydantic-ai). Both are async. Context variables (`MagenticState`) must propagate correctly through the pydantic-ai call stack.
251
+
252
+ **Mitigation:**
253
+ - pydantic-ai uses standard Python `contextvars`, which naturally propagate through `await` chains
254
+ - Test context propagation explicitly in integration tests
255
+ - If issues arise, pass state explicitly rather than via context vars
256
+
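+ A minimal sketch of such a propagation test (illustrative only; `state_var` stands in for the real `MagenticState` context variable):
+
+ ```python
+ # Hypothetical test: context variables set by an orchestrator are visible to awaited handlers.
+ import asyncio
+ import contextvars
+
+ state_var: contextvars.ContextVar[dict] = contextvars.ContextVar("magentic_state")
+
+ async def handler() -> dict:
+     # Simulates a pydantic-ai powered handler reading shared state.
+     return state_var.get()
+
+ async def orchestrate() -> dict:
+     state_var.set({"evidence": []})
+     return await handler()  # the value propagates through the await chain
+
+ def test_context_propagates_through_await() -> None:
+     assert asyncio.run(orchestrate()) == {"evidence": []}
+ ```
+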
257
+ ### 10.2 Integration Drift (MEDIUM)
258
+
259
+ **Risk:** Simple Mode and Advanced Mode might diverge in behavior over time (e.g., Simple Mode uses logic A, Advanced Mode uses logic B).
260
+
261
+ **Mitigation:**
262
+ - Both modes MUST call the exact same underlying Tools (`src/tools/*`) and Handlers (`src/agent_factory/*`)
263
+ - Handlers are the single source of truth for business logic
264
+ - Agents are thin wrappers that delegate to handlers
265
+
266
+ ### 10.3 Testing Burden (LOW-MEDIUM)
267
+
268
+ **Risk:** Two distinct orchestrators (`src/orchestrator.py` and `src/orchestrator_magentic.py`) doubles integration testing surface area.
269
+
270
+ **Mitigation:**
271
+ - Unit test handlers independently (shared code)
272
+ - Integration tests for each mode separately
273
+ - End-to-end tests verify same output for same input (determinism permitting)
274
+
275
+ ### 10.4 Dependency Conflicts (LOW)
276
+
277
+ **Risk:** `agent-framework-core` might conflict with `pydantic-ai`'s dependencies (e.g., different pydantic versions).
278
+
279
+ **Status:** Both use `pydantic>=2.x`. Should be compatible.
280
+
281
+ ---
282
+
283
+ ## 11. Naming Clarification
284
+
285
+ > See `00_SITUATION_AND_PLAN.md` Section 4 for full details.
286
+
287
+ **Important:** The codebase uses "magentic" in file names (`orchestrator_magentic.py`, `magentic_agents.py`) but this refers to our internal naming for Microsoft Agent Framework integration, **NOT** the `magentic` PyPI package.
288
+
289
+ **Future action:** Rename to `orchestrator_advanced.py` to eliminate confusion.
docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md ADDED
@@ -0,0 +1,112 @@
1
+ # Implementation Phases: Dual-Mode Agent System
2
+
3
+ **Date:** November 27, 2025
4
+ **Status:** IMPLEMENTATION PLAN (REVISED)
5
+ **Strategy:** TDD (Test-Driven Development), SOLID Principles
6
+ **Dependency Strategy:** PyPI (agent-framework-core)
7
+
8
+ ---
9
+
10
+ ## Phase 0: Environment Validation & Cleanup
11
+
12
+ **Goal:** Ensure clean state and dependencies are correctly installed.
13
+
14
+ ### Step 0.1: Verify PyPI Package
15
+ The `agent-framework-core` package is published on PyPI by Microsoft. Verify installation:
16
+
17
+ ```bash
18
+ uv sync --all-extras
19
+ python -c "from agent_framework import ChatAgent; print('OK')"
20
+ ```
21
+
22
+ ### Step 0.2: Branch State
23
+ We are on `feat/dual-mode-architecture`. Ensure it is up to date with `origin/dev` before starting.
24
+
25
+ **Note:** The `reference_repos/agent-framework` folder is kept for reference/documentation only.
26
+ The production dependency uses the official PyPI release.
27
+
28
+ ---
29
+
30
+ ## Phase 1: Pydantic-AI Improvements (Simple Mode)
31
+
32
+ **Goal:** Implement `HuggingFaceModel` support in `JudgeHandler` using strict TDD.
33
+
34
+ ### Step 1.1: Test First (Red)
35
+ Create `tests/unit/agent_factory/test_judges_factory.py`:
36
+ - Test `get_model()` returns `HuggingFaceModel` when `LLM_PROVIDER=huggingface`.
37
+ - Test `get_model()` respects `HF_TOKEN`.
38
+ - Test fallback to OpenAI.
39
+
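+ A minimal sketch of one such test (hypothetical: it assumes `get_model()` re-reads provider settings from the environment rather than caching them at import time):
+
+ ```python
+ # Hypothetical test sketch for Step 1.1.
+ import pytest
+ from pydantic_ai.models.huggingface import HuggingFaceModel
+
+ from src.agent_factory.judges import get_model
+
+ def test_get_model_returns_huggingface_model(monkeypatch: pytest.MonkeyPatch) -> None:
+     monkeypatch.setenv("LLM_PROVIDER", "huggingface")
+     monkeypatch.setenv("HF_TOKEN", "hf_dummy")
+     assert isinstance(get_model(), HuggingFaceModel)
+ ```
+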
40
+ ### Step 1.2: Implementation (Green)
41
+ Update `src/utils/config.py`:
42
+ - Add `huggingface_model` and `hf_token` fields.
43
+
44
+ Update `src/agent_factory/judges.py`:
45
+ - Implement `get_model` with the logic derived from the tests.
46
+ - Use dependency injection for the model where possible.
47
+
48
+ ### Step 1.3: Refactor
49
+ Ensure `JudgeHandler` is loosely coupled from the specific model provider.
50
+
51
+ ---
52
+
53
+ ## Phase 2: Orchestrator Factory (The Switch)
54
+
55
+ **Goal:** Implement the factory pattern to switch between Simple and Advanced modes.
56
+
57
+ ### Step 2.1: Test First (Red)
58
+ Create `tests/unit/test_orchestrator_factory.py`:
59
+ - Test `create_orchestrator` returns `Orchestrator` (simple) when API keys are missing.
60
+ - Test `create_orchestrator` returns `MagenticOrchestrator` (advanced) when OpenAI key exists.
61
+ - Test explicit mode override.
62
+
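+ A minimal sketch of one such test (hypothetical; it relies on the `ValueError` behaviour shown in the factory excerpt in `01_ARCHITECTURE_SPEC.md`):
+
+ ```python
+ # Hypothetical test sketch for Step 2.1.
+ import pytest
+
+ from src.orchestrator_factory import create_orchestrator
+
+ def test_simple_mode_without_handlers_raises() -> None:
+     # Simple mode needs search_handler and judge_handler injected.
+     with pytest.raises(ValueError):
+         create_orchestrator(mode="simple")
+ ```
+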
63
+ ### Step 2.2: Implementation (Green)
64
+ Update `src/orchestrator_factory.py` to implement the selection logic.
65
+
66
+ ---
67
+
68
+ ## Phase 3: Agent Framework Integration (Advanced Mode)
69
+
70
+ **Goal:** Integrate Microsoft Agent Framework from PyPI.
71
+
72
+ ### Step 3.1: Dependency Management
73
+ The `agent-framework-core` package is installed from PyPI:
74
+ ```toml
75
+ [project.optional-dependencies]
76
+ magentic = [
77
+ "agent-framework-core>=1.0.0b251120,<2.0.0", # Microsoft Agent Framework (PyPI)
78
+ ]
79
+ ```
80
+ Install with: `uv sync --all-extras`
81
+
82
+ ### Step 3.2: Verify Imports (Test First)
83
+ Create `tests/unit/agents/test_agent_imports.py`:
84
+ - Verify `from agent_framework import ChatAgent` works.
85
+ - Verify instantiation of `ChatAgent` with a mock client.
86
+
87
+ ### Step 3.3: Update Agents
88
+ Refactor `src/agents/*.py` to ensure they match the exact signature of the `ChatAgent` class shipped in `agent-framework-core`.
89
+ - **SOLID:** Ensure agents have single responsibilities.
90
+ - **DRY:** Share tool definitions between Pydantic-AI simple mode and Agent Framework advanced mode.
91
+
92
+ ---
93
+
94
+ ## Phase 4: UI & End-to-End Verification
95
+
96
+ **Goal:** Update Gradio to reflect the active mode.
97
+
98
+ ### Step 4.1: UI Updates
99
+ Update `src/app.py` to display "Simple Mode" vs "Advanced Mode".
100
+
101
+ ### Step 4.2: End-to-End Test
102
+ Run the full loop:
103
+ 1. Simple Mode (No Keys) -> Search -> Judge (HF) -> Report.
104
+ 2. Advanced Mode (OpenAI Key) -> SearchAgent -> JudgeAgent -> ReportAgent.
105
+
106
+ ---
107
+
108
+ ## Phase 5: Cleanup & Documentation
109
+
110
+ - Remove unused code.
111
+ - Update main README.md.
112
+ - Final `make check`.
docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md ADDED
@@ -0,0 +1,112 @@
1
+ # Immediate Actions Checklist
2
+
3
+ **Date:** November 27, 2025
4
+ **Priority:** Execute in order
5
+
6
+ ---
7
+
8
+ ## Before Starting Implementation
9
+
10
+ ### 1. Close PR #41 (CRITICAL)
11
+
12
+ ```bash
13
+ gh pr close 41 --comment "Architecture decision changed. Cherry-picking improvements to preserve both pydantic-ai and Agent Framework capabilities."
14
+ ```
15
+
16
+ ### 2. Verify HuggingFace Spaces is Safe
17
+
18
+ ```bash
19
+ # Should show agent framework files exist
20
+ git ls-tree --name-only huggingface-upstream/dev -- src/agents/
21
+ git ls-tree --name-only huggingface-upstream/dev -- src/orchestrator_magentic.py
22
+ ```
23
+
24
+ Expected output: Files should exist (they do as of this writing).
25
+
26
+ ### 3. Clean Local Environment
27
+
28
+ ```bash
29
+ # Switch to main first
30
+ git checkout main
31
+
32
+ # Delete problematic branches
33
+ git branch -D refactor/pydantic-unification 2>/dev/null || true
34
+ git branch -D feat/pubmed-fulltext 2>/dev/null || true
35
+
36
+ # Reset local dev to origin/dev
37
+ git branch -D dev 2>/dev/null || true
38
+ git checkout -b dev origin/dev
39
+
40
+ # Verify agent framework code exists
41
+ ls src/agents/
42
+ # Expected: __init__.py, analysis_agent.py, hypothesis_agent.py, judge_agent.py,
43
+ # magentic_agents.py, report_agent.py, search_agent.py, state.py, tools.py
44
+
45
+ ls src/orchestrator_magentic.py
46
+ # Expected: file exists
47
+ ```
48
+
49
+ ### 4. Create Fresh Feature Branch
50
+
51
+ ```bash
52
+ git checkout -b feat/dual-mode-architecture origin/dev
53
+ ```
54
+
55
+ ---
56
+
57
+ ## Decision Points
58
+
59
+ Before proceeding, confirm:
60
+
61
+ 1. **For hackathon**: Do we need advanced mode, or is simple mode sufficient?
62
+ - Simple mode = faster to implement, works today
63
+ - Advanced mode = better quality, more work
64
+
65
+ 2. **Timeline**: How much time do we have?
66
+ - If < 1 day: Focus on simple mode only
67
+ - If > 1 day: Implement dual-mode
68
+
69
+ 3. **Dependencies**: Is `agent-framework-core` available?
70
+ - Check: `pip index versions agent-framework-core`
71
+ - If not on PyPI, may need to install from GitHub
72
+
73
+ ---
74
+
75
+ ## Quick Start (Simple Mode Only)
76
+
77
+ If time is limited, implement only simple mode improvements:
78
+
79
+ ```bash
80
+ # On feat/dual-mode-architecture branch
81
+
82
+ # 1. Update judges.py to add HuggingFace support
83
+ # 2. Update config.py to add HF settings
84
+ # 3. Create free_tier_demo.py
85
+ # 4. Run make check
86
+ # 5. Create PR to dev
87
+ ```
88
+
89
+ This gives you free-tier capability without touching agent framework code.
90
+
91
+ ---
92
+
93
+ ## Quick Start (Full Dual-Mode)
94
+
95
+ If time permits, implement full dual-mode:
96
+
97
+ Follow Phases 0-5 in `02_IMPLEMENTATION_PHASES.md`
98
+
99
+ ---
100
+
101
+ ## Emergency Rollback
102
+
103
+ If anything goes wrong:
104
+
105
+ ```bash
106
+ # Reset to safe state
107
+ git checkout main
108
+ git branch -D feat/dual-mode-architecture
109
+ git checkout -b feat/dual-mode-architecture origin/dev
110
+ ```
111
+
112
+ Origin/dev is the safe fallback - it has agent framework intact.
docs/brainstorming/magentic-pydantic/04_FOLLOWUP_REVIEW_REQUEST.md ADDED
@@ -0,0 +1,158 @@
1
+ # Follow-Up Review Request: Did We Implement Your Feedback?
2
+
3
+ **Date:** November 27, 2025
4
+ **Context:** You previously reviewed our dual-mode architecture plan and provided feedback. We have updated the documentation. Please verify we correctly implemented your recommendations.
5
+
6
+ ---
7
+
8
+ ## Your Original Feedback vs Our Changes
9
+
10
+ ### 1. Naming Confusion Clarification
11
+
12
+ **Your feedback:** "You are using Microsoft Agent Framework, but you've named your integration 'Magentic'. This caused the confusion."
13
+
14
+ **Our change:** Added Section 4 in `00_SITUATION_AND_PLAN.md`:
15
+ ```markdown
16
+ ## 4. CRITICAL: Naming Confusion Clarification
17
+
18
+ > **Senior Agent Review Finding:** The codebase uses "magentic" in file names
19
+ > (e.g., `orchestrator_magentic.py`, `magentic_agents.py`) but this is **NOT**
20
+ > the `magentic` PyPI package by Jacky Liang. It's Microsoft Agent Framework.
21
+
22
+ **The naming confusion:**
23
+ - `magentic` (PyPI package): A different library for structured LLM outputs
24
+ - "Magentic" (in our codebase): Our internal name for Microsoft Agent Framework integration
25
+ - `agent-framework-core` (PyPI package): Microsoft's actual multi-agent orchestration framework
26
+
27
+ **Recommended future action:** Rename `orchestrator_magentic.py` → `orchestrator_advanced.py`
28
+ ```
29
+
30
+ **Status:** ✅ IMPLEMENTED
31
+
32
+ ---
33
+
34
+ ### 2. Bridge Complexity Warning
35
+
36
+ **Your feedback:** "You must ensure MagenticState (context vars) propagates correctly through the pydantic-ai call stack."
37
+
38
+ **Our change:** Added Section 10.1 in `01_ARCHITECTURE_SPEC.md`:
39
+ ```markdown
40
+ ### 10.1 Bridge Complexity (MEDIUM)
41
+
42
+ **Risk:** In Advanced Mode, agents (Agent Framework) wrap handlers (pydantic-ai).
43
+ Both are async. Context variables (`MagenticState`) must propagate correctly.
44
+
45
+ **Mitigation:**
46
+ - pydantic-ai uses standard Python `contextvars`, which naturally propagate through `await` chains
47
+ - Test context propagation explicitly in integration tests
48
+ - If issues arise, pass state explicitly rather than via context vars
49
+ ```
50
+
51
+ **Status:** ✅ IMPLEMENTED
52
+
53
+ ---
54
+
55
+ ### 3. Integration Drift Warning
56
+
57
+ **Your feedback:** "Simple Mode and Advanced Mode might diverge in behavior."
58
+
59
+ **Our change:** Added Section 10.2 in `01_ARCHITECTURE_SPEC.md`:
60
+ ```markdown
61
+ ### 10.2 Integration Drift (MEDIUM)
62
+
63
+ **Risk:** Simple Mode and Advanced Mode might diverge in behavior over time.
64
+
65
+ **Mitigation:**
66
+ - Both modes MUST call the exact same underlying Tools (`src/tools/*`) and Handlers (`src/agent_factory/*`)
67
+ - Handlers are the single source of truth for business logic
68
+ - Agents are thin wrappers that delegate to handlers
69
+ ```
70
+
71
+ **Status:** ✅ IMPLEMENTED
72
+
73
+ ---
74
+
75
+ ### 4. Testing Burden Warning
76
+
77
+ **Your feedback:** "You now have two distinct orchestrators to maintain. This doubles your integration testing surface area."
78
+
79
+ **Our change:** Added Section 10.3 in `01_ARCHITECTURE_SPEC.md`:
80
+ ```markdown
81
+ ### 10.3 Testing Burden (LOW-MEDIUM)
82
+
83
+ **Risk:** Two distinct orchestrators doubles integration testing surface area.
84
+
85
+ **Mitigation:**
86
+ - Unit test handlers independently (shared code)
87
+ - Integration tests for each mode separately
88
+ - End-to-end tests verify same output for same input
89
+ ```
90
+
91
+ **Status:** ✅ IMPLEMENTED
92
+
93
+ ---
94
+
95
+ ### 5. Rename Recommendation
96
+
97
+ **Your feedback:** "Rename `src/orchestrator_magentic.py` to `src/orchestrator_advanced.py`"
98
+
99
+ **Our change:** Added Step 3.4 in `02_IMPLEMENTATION_PHASES.md`:
100
+ ```markdown
101
+ ### Step 3.4: (OPTIONAL) Rename "Magentic" to "Advanced"
102
+
103
+ > **Senior Agent Recommendation:** Rename files to eliminate confusion.
104
+
105
+ git mv src/orchestrator_magentic.py src/orchestrator_advanced.py
106
+ git mv src/agents/magentic_agents.py src/agents/advanced_agents.py
107
+
108
+ **Note:** This is optional for the hackathon. Can be done in a follow-up PR.
109
+ ```
110
+
111
+ **Status:** ✅ DOCUMENTED (marked as optional for hackathon)
112
+
113
+ ---
114
+
115
+ ### 6. Standardize Wrapper Recommendation
116
+
117
+ **Your feedback:** "Create a generic `PydanticAiAgentWrapper(BaseAgent)` class instead of manually wrapping each handler."
118
+
119
+ **Our change:** NOT YET DOCUMENTED
120
+
121
+ **Status:** ⚠️ NOT IMPLEMENTED - Should we add this?
122
+
123
+ ---
124
+
125
+ ## Questions for Your Review
126
+
127
+ 1. **Did we correctly implement your feedback?** Are there any misunderstandings in how we interpreted your recommendations?
128
+
129
+ 2. **Is the "Standardize Wrapper" recommendation critical?** Should we add it to the implementation phases, or is it a nice-to-have for later?
130
+
131
+ 3. **Dependency versioning:** You noted `agent-framework-core>=1.0.0b251120` might be ephemeral. Should we:
132
+ - Pin to a specific version?
133
+ - Use a version range?
134
+ - Install from GitHub source?
135
+
136
+ 4. **Anything else we missed?**
137
+
138
+ ---
139
+
140
+ ## Files to Re-Review
141
+
142
+ 1. `00_SITUATION_AND_PLAN.md` - Added Section 4 (Naming Clarification)
143
+ 2. `01_ARCHITECTURE_SPEC.md` - Added Sections 10-11 (Risks, Naming)
144
+ 3. `02_IMPLEMENTATION_PHASES.md` - Added Step 3.4 (Optional Rename)
145
+
146
+ ---
147
+
148
+ ## Current Branch State
149
+
150
+ We are now on `feat/dual-mode-architecture` branched from `origin/dev`:
151
+ - ✅ Agent framework code intact (`src/agents/`, `src/orchestrator_magentic.py`)
152
+ - ✅ Documentation committed
153
+ - ❌ PR #41 still open (need to close it)
154
+ - ❌ Cherry-pick of pydantic-ai improvements not yet done
155
+
156
+ ---
157
+
158
+ Please confirm: **GO / NO-GO** to proceed with Phase 1 (cherry-picking pydantic-ai improvements)?
docs/brainstorming/magentic-pydantic/REVIEW_PROMPT_FOR_SENIOR_AGENT.md ADDED
@@ -0,0 +1,113 @@
1
+ # Senior Agent Review Prompt
2
+
3
+ Copy and paste everything below this line to a fresh Claude/AI session:
4
+
5
+ ---
6
+
7
+ ## Context
8
+
9
+ I am a junior developer working on a HuggingFace hackathon project called DeepCritical. We made a significant architectural mistake and are now trying to course-correct. I need you to act as a **senior staff engineer** and critically review our proposed solution.
10
+
11
+ ## The Situation
12
+
13
+ We almost merged a refactor that would have **deleted** our multi-agent orchestration capability, mistakenly believing that `pydantic-ai` (a library for structured LLM outputs) and Microsoft's `agent-framework` (a framework for multi-agent orchestration) were mutually exclusive alternatives.
14
+
15
+ **They are not.** They are complementary:
16
+ - `pydantic-ai` ensures LLM responses match Pydantic schemas (type-safe outputs)
17
+ - `agent-framework` orchestrates multiple agents working together (coordination layer)
18
+
19
+ We now want to implement a **dual-mode architecture** where:
20
+ - **Simple Mode (No API key):** Uses only pydantic-ai with HuggingFace free tier
21
+ - **Advanced Mode (With API key):** Uses Microsoft Agent Framework for orchestration, with pydantic-ai inside each agent for structured outputs
22
+
23
+ ## Your Task
24
+
25
+ Please perform a **deep, critical review** of:
26
+
27
+ 1. **The architecture diagram** (image attached: `assets/magentic-pydantic.png`)
28
+ 2. **Our documentation** (4 files listed below)
29
+ 3. **The actual codebase** to verify our claims
30
+
31
+ ## Specific Questions to Answer
32
+
33
+ ### Architecture Validation
34
+ 1. Is our understanding correct that pydantic-ai and agent-framework are complementary, not competing?
35
+ 2. Does the dual-mode architecture diagram accurately represent how these should integrate?
36
+ 3. Are there any architectural flaws or anti-patterns in our proposed design?
37
+
38
+ ### Documentation Accuracy
39
+ 4. Are the branch states we documented accurate? (Check `git log`, `git ls-tree`)
40
+ 5. Is our understanding of what code exists where correct?
41
+ 6. Are the implementation phases realistic and in the correct order?
42
+ 7. Are there any missing steps or dependencies we overlooked?
43
+
44
+ ### Codebase Reality Check
45
+ 8. Does `origin/dev` actually have the agent framework code intact? Verify by checking:
46
+ - `git ls-tree origin/dev -- src/agents/`
47
+ - `git ls-tree origin/dev -- src/orchestrator_magentic.py`
48
+ 9. What does the current `src/agents/` code actually import? Does it use `agent_framework` or `agent-framework-core`?
49
+ 10. Is the `agent-framework-core` package actually available on PyPI, or do we need to install from source?
50
+
51
+ ### Implementation Feasibility
52
+ 11. Can the cherry-pick strategy we outlined actually work, or are there merge conflicts we're not seeing?
53
+ 12. Is the mode auto-detection logic sound?
54
+ 13. What are the risks we haven't identified?
55
+
56
+ ### Critical Errors Check
57
+ 14. Did we miss anything critical in our analysis?
58
+ 15. Are there any factual errors in our documentation?
59
+ 16. Would a Google/DeepMind senior engineer approve this plan, or would they flag issues?
60
+
61
+ ## Files to Review
62
+
63
+ Please read these files in order:
64
+
65
+ 1. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/00_SITUATION_AND_PLAN.md`
66
+ 2. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/01_ARCHITECTURE_SPEC.md`
67
+ 3. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/02_IMPLEMENTATION_PHASES.md`
68
+ 4. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/docs/brainstorming/magentic-pydantic/03_IMMEDIATE_ACTIONS.md`
69
+
70
+ And the architecture diagram:
71
+ 5. `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/assets/magentic-pydantic.png`
72
+
73
+ ## Reference Repositories to Consult
74
+
75
+ We have local clones of the source-of-truth repositories:
76
+
77
+ - **Original DeepCritical:** `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/reference_repos/DeepCritical/`
78
+ - **Microsoft Agent Framework:** `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/reference_repos/agent-framework/`
79
+ - **Microsoft AutoGen:** `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/reference_repos/autogen-microsoft/`
80
+
81
+ Please cross-reference our hackathon fork against these to verify architectural alignment.
82
+
83
+ ## Codebase to Analyze
84
+
85
+ Our hackathon fork is at:
86
+ `/Users/ray/Desktop/CLARITY-DIGITAL-TWIN/DeepCritical-1/`
87
+
88
+ Key files to examine:
89
+ - `src/agents/` - Agent framework integration
90
+ - `src/agent_factory/judges.py` - pydantic-ai integration
91
+ - `src/orchestrator.py` - Simple mode orchestrator
92
+ - `src/orchestrator_magentic.py` - Advanced mode orchestrator
93
+ - `src/orchestrator_factory.py` - Mode selection
94
+ - `pyproject.toml` - Dependencies
95
+
96
+ ## Expected Output
97
+
98
+ Please provide:
99
+
100
+ 1. **Validation Summary:** Is our plan sound? (YES/NO with explanation)
101
+ 2. **Errors Found:** List any factual errors in our documentation
102
+ 3. **Missing Items:** What did we overlook?
103
+ 4. **Risk Assessment:** What could go wrong?
104
+ 5. **Recommended Changes:** Specific edits to our documentation or plan
105
+ 6. **Go/No-Go Recommendation:** Should we proceed with this plan?
106
+
107
+ ## Tone
108
+
109
+ Be brutally honest. If our plan is flawed, say so directly. We would rather know now than after implementation. Don't soften criticism - we need accuracy.
110
+
111
+ ---
112
+
113
+ END OF PROMPT
docs/bugs/FIX_PLAN_MAGENTIC_MODE.md ADDED
@@ -0,0 +1,227 @@
1
+ # Fix Plan: Magentic Mode Report Generation
2
+
3
+ **Related Bug**: `P0_MAGENTIC_MODE_BROKEN.md`
4
+ **Approach**: Test-Driven Development (TDD)
5
+ **Estimated Scope**: 4 phases, ~2-3 hours
6
+
7
+ ---
8
+
9
+ ## Problem Summary
10
+
11
+ Magentic mode runs but fails to produce readable reports due to:
12
+
13
+ 1. **Primary Bug**: `MagenticFinalResultEvent.message` returns `ChatMessage` object, not text
14
+ 2. **Secondary Bug**: Max rounds (3) reached before ReportAgent completes
15
+ 3. **Tertiary Issues**: Stale "bioRxiv" references in prompts
16
+
17
+ ---
18
+
19
+ ## Fix Order (TDD)
20
+
21
+ ### Phase 1: Write Failing Tests
22
+
23
+ **Task 1.1**: Create test for ChatMessage text extraction
24
+
25
+ ```python
26
+ # tests/unit/test_orchestrator_magentic.py
27
+
28
+ def test_process_event_extracts_text_from_chat_message():
29
+ """Final result event should extract text from ChatMessage object."""
30
+ # Arrange: Mock ChatMessage with .content attribute
31
+ # Act: Call _process_event with MagenticFinalResultEvent
32
+ # Assert: Returned AgentEvent.message is a string, not object repr
33
+ ```
34
+
35
+ **Task 1.2**: Create test for max rounds configuration
36
+
37
+ ```python
38
+ def test_orchestrator_uses_configured_max_rounds():
39
+ """MagenticOrchestrator should use max_rounds from constructor."""
40
+ # Arrange: Create orchestrator with max_rounds=10
41
+ # Act: Build workflow
42
+ # Assert: Workflow has max_round_count=10
43
+ ```
44
+
45
+ **Task 1.3**: Create test for bioRxiv reference removal
46
+
47
+ ```python
48
+ def test_task_prompt_references_europe_pmc():
49
+ """Task prompt should reference Europe PMC, not bioRxiv."""
50
+ # Arrange: Create orchestrator
51
+ # Act: Check task string in run()
52
+ # Assert: Contains "Europe PMC", not "bioRxiv"
53
+ ```
54
+
55
+ ---
56
+
57
+ ### Phase 2: Fix ChatMessage Text Extraction
58
+
59
+ **File**: `src/orchestrator_magentic.py`
60
+ **Lines**: 192-199
61
+
62
+ **Current Code**:
63
+ ```python
64
+ elif isinstance(event, MagenticFinalResultEvent):
65
+ text = event.message.text if event.message else "No result"
66
+ ```
67
+
68
+ **Fixed Code**:
69
+ ```python
70
+ elif isinstance(event, MagenticFinalResultEvent):
71
+ if event.message:
72
+ # ChatMessage may have .content or .text depending on version
73
+ if hasattr(event.message, 'content') and event.message.content:
74
+ text = str(event.message.content)
75
+ elif hasattr(event.message, 'text') and event.message.text:
76
+ text = str(event.message.text)
77
+ else:
78
+ # Fallback: convert entire message to string
79
+ text = str(event.message)
80
+ else:
81
+ text = "No result generated"
82
+ ```
83
+
84
+ **Why**: The `agent_framework.ChatMessage` object structure may vary. We need defensive extraction.
85
+
86
+ ---
87
+
88
+ ### Phase 3: Fix Max Rounds Configuration
89
+
90
+ **File**: `src/orchestrator_magentic.py`
91
+ **Lines**: 97-99
92
+
93
+ **Current Code**:
94
+ ```python
95
+ .with_standard_manager(
96
+ chat_client=manager_client,
97
+ max_round_count=self._max_rounds, # Already uses config
98
+ max_stall_count=3,
99
+ max_reset_count=2,
100
+ )
101
+ ```
102
+
103
+ **Issue**: Default `max_rounds` in `__init__` is 10, but workflow may need more for complex queries.
104
+
105
+ **Fix**: Verify the value flows through correctly. Add logging.
106
+
107
+ ```python
108
+ logger.info(
109
+ "Building Magentic workflow",
110
+ max_rounds=self._max_rounds,
111
+ max_stall=3,
112
+ max_reset=2,
113
+ )
114
+ ```
115
+
116
+ **Also check**: `src/orchestrator_factory.py` passes config correctly:
117
+ ```python
118
+ return MagenticOrchestrator(
119
+ max_rounds=config.max_iterations if config else 10,
120
+ )
121
+ ```
122
+
123
+ ---
124
+
125
+ ### Phase 4: Fix Stale bioRxiv References
126
+
127
+ **Files to update**:
128
+
129
+ | File | Line | Change |
130
+ |------|------|--------|
131
+ | `src/orchestrator_magentic.py` | 131 | "bioRxiv" → "Europe PMC" |
132
+ | `src/agents/magentic_agents.py` | 32-33 | "bioRxiv" → "Europe PMC" |
133
+ | `src/app.py` | 202-203 | "bioRxiv" → "Europe PMC" |
134
+
135
+ **Search command to verify**:
136
+ ```bash
137
+ grep -rn "bioRxiv\|biorxiv" src/
138
+ ```
139
+
140
+ ---
141
+
142
+ ## Implementation Checklist
143
+
144
+ ```
145
+ [ ] Phase 1: Write failing tests
146
+ [ ] 1.1 Test ChatMessage text extraction
147
+ [ ] 1.2 Test max rounds configuration
148
+ [ ] 1.3 Test Europe PMC references
149
+
150
+ [ ] Phase 2: Fix ChatMessage extraction
151
+ [ ] Update _process_event() in orchestrator_magentic.py
152
+ [ ] Run test 1.1 - should pass
153
+
154
+ [ ] Phase 3: Fix max rounds
155
+ [ ] Add logging to _build_workflow()
156
+ [ ] Verify factory passes config correctly
157
+ [ ] Run test 1.2 - should pass
158
+
159
+ [ ] Phase 4: Fix bioRxiv references
160
+ [ ] Update orchestrator_magentic.py task prompt
161
+ [ ] Update magentic_agents.py descriptions
162
+ [ ] Update app.py UI text
163
+ [ ] Run test 1.3 - should pass
164
+ [ ] Run grep to verify no remaining refs
165
+
166
+ [ ] Final Verification
167
+ [ ] make check passes
168
+ [ ] All tests pass (108+)
169
+ [ ] Manual test: run_magentic.py produces readable report
170
+ ```
171
+
172
+ ---
173
+
174
+ ## Test Commands
175
+
176
+ ```bash
177
+ # Run specific test file
178
+ uv run pytest tests/unit/test_orchestrator_magentic.py -v
179
+
180
+ # Run all tests
181
+ uv run pytest tests/unit/ -v
182
+
183
+ # Full check
184
+ make check
185
+
186
+ # Manual integration test
187
+ set -a && source .env && set +a
188
+ uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer"
189
+ ```
190
+
191
+ ---
192
+
193
+ ## Success Criteria
194
+
195
+ 1. `run_magentic.py` outputs a readable research report (not `<ChatMessage object>`)
196
+ 2. Report includes: Executive Summary, Key Findings, Drug Candidates, References
197
+ 3. No "Max round count reached" error with default settings
198
+ 4. No "bioRxiv" references anywhere in codebase
199
+ 5. All 108+ tests pass
200
+ 6. `make check` passes
201
+
202
+ ---
203
+
204
+ ## Files Modified
205
+
206
+ ```
207
+ src/
208
+ ├── orchestrator_magentic.py # ChatMessage fix, logging
209
+ ├── agents/magentic_agents.py # bioRxiv → Europe PMC
210
+ └── app.py # bioRxiv → Europe PMC
211
+
212
+ tests/unit/
213
+ └── test_orchestrator_magentic.py # NEW: 3 tests
214
+ ```
215
+
216
+ ---
217
+
218
+ ## Notes for AI Agent
219
+
220
+ When implementing this fix plan:
221
+
222
+ 1. **DO NOT** create mock data or fake responses
223
+ 2. **DO** write real tests that verify actual behavior
224
+ 3. **DO** run `make check` after each phase
225
+ 4. **DO** test with real OpenAI API key via `.env`
226
+ 5. **DO** preserve existing functionality - simple mode must still work
227
+ 6. **DO NOT** over-engineer - minimal changes to fix the specific bugs
docs/bugs/P0_MAGENTIC_MODE_BROKEN.md ADDED
@@ -0,0 +1,116 @@
1
+ # P0 Bug: Magentic Mode Returns ChatMessage Object Instead of Report Text
2
+
3
+ **Status**: OPEN
4
+ **Priority**: P0 (Critical)
5
+ **Date**: 2025-11-27
6
+
7
+ ---
8
+
9
+ ## Actual Bug Found (Not What We Thought)
10
+
11
+ **The OpenAI key works fine.** The real bug is different:
12
+
13
+ ### The Problem
14
+
15
+ When Magentic mode completes, the final report returns a `ChatMessage` object instead of the actual text:
16
+
17
+ ```
18
+ FINAL REPORT:
19
+ <agent_framework._types.ChatMessage object at 0x11db70310>
20
+ ```
21
+
22
+ ### Evidence
23
+
24
+ Full test output shows:
25
+ 1. Magentic orchestrator starts correctly
26
+ 2. SearchAgent finds evidence
27
+ 3. HypothesisAgent generates hypotheses
28
+ 4. JudgeAgent evaluates
29
+ 5. **BUT**: Final output is `ChatMessage` object, not text
30
+
31
+ ### Root Cause
32
+
33
+ In `src/orchestrator_magentic.py` line 193:
34
+
35
+ ```python
36
+ elif isinstance(event, MagenticFinalResultEvent):
37
+ text = event.message.text if event.message else "No result"
38
+ ```
39
+
40
+ The `event.message` is a `ChatMessage` object, and `.text` may not extract the content correctly, or the message structure changed in the agent-framework library.
41
+
42
+ ---
43
+
44
+ ## Secondary Issue: Max Rounds Reached
45
+
46
+ The orchestrator hits max rounds before producing a report:
47
+
48
+ ```
49
+ [ERROR] Magentic Orchestrator: Max round count reached
50
+ ```
51
+
52
+ This means the workflow times out before the ReportAgent synthesizes the final output.
53
+
54
+ ---
55
+
56
+ ## What Works
57
+
58
+ - OpenAI API key: **Works** (loaded from .env)
59
+ - SearchAgent: **Works** (finds evidence from PubMed, ClinicalTrials, Europe PMC)
60
+ - HypothesisAgent: **Works** (generates Drug -> Target -> Pathway chains)
61
+ - JudgeAgent: **Partial** (evaluates but sometimes loses context)
62
+
63
+ ---
64
+
65
+ ## Files to Fix
66
+
67
+ | File | Line | Issue |
68
+ |------|------|-------|
69
+ | `src/orchestrator_magentic.py` | 193 | `event.message.text` returns object, not string |
70
+ | `src/orchestrator_magentic.py` | 97-99 | `max_round_count=3` too low for full pipeline |
71
+
72
+ ---
73
+
74
+ ## Suggested Fix
75
+
76
+ ```python
77
+ # In _process_event, line 192-199
78
+ elif isinstance(event, MagenticFinalResultEvent):
79
+ # Handle ChatMessage object properly
80
+ if event.message:
81
+ if hasattr(event.message, 'content'):
82
+ text = event.message.content
83
+ elif hasattr(event.message, 'text'):
84
+ text = event.message.text
85
+ else:
86
+ text = str(event.message)
87
+ else:
88
+ text = "No result"
89
+ ```
90
+
91
+ And increase rounds:
92
+
93
+ ```python
94
+ # In _build_workflow, line 97
95
+ max_round_count=self._max_rounds, # Use configured value, default 10
96
+ ```
97
+
98
+ ---
99
+
100
+ ## Test Command
101
+
102
+ ```bash
103
+ set -a && source .env && set +a && uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer"
104
+ ```
105
+
106
+ ---
107
+
108
+ ## Simple Mode Works
109
+
110
+ For reference, simple mode produces full reports:
111
+
112
+ ```bash
113
+ uv run python examples/orchestrator_demo/run_agent.py "metformin alzheimer"
114
+ ```
115
+
116
+ Output includes a structured report with Drug Candidates, Key Findings, etc.
docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md ADDED
@@ -0,0 +1,81 @@
1
+ # P1 Bug: Gradio Settings Accordion Not Collapsing
2
+
3
+ **Priority**: P1 (UX Bug)
4
+ **Status**: OPEN
5
+ **Date**: 2025-11-27
6
+ **Target Component**: `src/app.py`
7
+
8
+ ---
9
+
10
+ ## 1. Problem Description
11
+
12
+ The "Settings" accordion in the Gradio UI (containing Orchestrator Mode, API Key, Provider) fails to collapse, even when configured with `open=False`. It remains permanently expanded, cluttering the interface and obscuring the chat history.
13
+
14
+ ### Symptoms
15
+ - Accordion arrow toggles visually, but content remains visible.
16
+ - Occurs in both local development (`uv run src/app.py`) and HuggingFace Spaces.
17
+
18
+ ---
19
+
20
+ ## 2. Root Cause Analysis
21
+
22
+ **Definitive Cause**: Nested `Blocks` Context Bug.
23
+ `gr.ChatInterface` is itself a high-level abstraction that creates a `gr.Blocks` context. Wrapping `gr.ChatInterface` inside an external `with gr.Blocks():` context causes event listener conflicts, specifically breaking the JavaScript state management for `additional_inputs_accordion`.
24
+
25
+ **Reference**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) confirms that `additional_inputs_accordion` malfunctions when `ChatInterface` is not the top-level block.
26
+
27
+ ---
28
+
29
+ ## 3. Solution Strategy: "The Unwrap Fix"
30
+
31
+ We will remove the redundant `gr.Blocks` wrapper. This restores the native behavior of `ChatInterface`, ensuring the accordion respects `open=False`.
32
+
33
+ ### Implementation Plan
34
+
35
+ **Refactor `src/app.py` / `create_demo()`**:
36
+
37
+ 1. **Remove** the `with gr.Blocks() as demo:` context manager.
38
+ 2. **Instantiate** `gr.ChatInterface` directly as the `demo` object.
39
+ 3. **Migrate UI Elements**:
40
+ * **Header**: Move the H1/Title text into the `title` parameter of `ChatInterface`.
41
+ * **Footer**: Move the footer text ("MCP Server Active...") into the `description` parameter. `ChatInterface` supports Markdown in `description`, making it the ideal place for static info below the title but above the chat.
42
+
43
+ ### Before (Buggy)
44
+ ```python
45
+ def create_demo():
46
+ with gr.Blocks() as demo: # <--- CAUSE OF BUG
47
+ gr.Markdown("# Title")
48
+ gr.ChatInterface(..., additional_inputs_accordion=gr.Accordion(open=False))
49
+ gr.Markdown("Footer")
50
+ return demo
51
+ ```
52
+
53
+ ### After (Correct)
54
+ ```python
55
+ def create_demo():
56
+ return gr.ChatInterface( # <--- FIX: Top-level component
57
+ ...,
58
+ title="🧬 DeepCritical",
59
+ description="*AI-Powered Drug Repurposing Agent...*\n\n---\n**MCP Server Active**...",
60
+ additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False)
61
+ )
62
+ ```
63
+
64
+ ---
65
+
66
+ ## 4. Validation
67
+
68
+ 1. **Run**: `uv run python src/app.py`
69
+ 2. **Check**: Open `http://localhost:7860`
70
+ 3. **Verify**:
71
+ * Settings accordion starts **COLLAPSED**.
72
+ * Header title ("DeepCritical") is visible.
73
+ * Footer text ("MCP Server Active") is visible in the description area.
74
+ * Chat functionality works (Magentic/Simple modes).
75
+
76
+ ---
77
+
78
+ ## 5. Constraints & Notes
79
+
80
+ - **Layout**: We lose the ability to place arbitrary elements *below* the chat box (footer will move to top, under title), but this is an acceptable trade-off for a working UI.
81
+ - **CSS**: `ChatInterface` handles its own CSS; any custom class styling from the previous footer will be standardized to the description text style.
docs/development/testing.md ADDED
@@ -0,0 +1,139 @@
1
+ # Testing Strategy
2
+ ## Ensuring DeepCritical Is Ironclad
3
+
4
+ ---
5
+
6
+ ## Overview
7
+
8
+ Our testing strategy follows a strict **Pyramid of Reliability**:
9
+ 1. **Unit Tests**: Fast, isolated logic checks (60% of tests)
10
+ 2. **Integration Tests**: Tool interactions & Agent loops (30% of tests)
11
+ 3. **E2E / Regression Tests**: Full research workflows (10% of tests)
12
+
13
+ **Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.
14
+
15
+ ---
16
+
17
+ ## 1. Unit Tests (Fast & Cheap)
18
+
19
+ **Location**: `tests/unit/`
20
+
21
+ Focus on individual components without external network calls. Mock everything.
22
+
23
+ ### Key Test Cases
24
+
25
+ #### Agent Logic
26
+ - **Initialization**: Verify default config loads correctly.
27
+ - **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment).
28
+ - **Budget Checks**: Test `should_continue()` returns `False` when budget exceeded.
29
+ - **Error Handling**: Test partial failure recovery (e.g., one tool fails, agent continues).
30
+
31
+ #### Tools (Mocked)
32
+ - **Parser Logic**: Feed raw XML/JSON to tool parsers and verify `Evidence` objects.
33
+ - **Validation**: Ensure tools reject invalid queries (empty strings, etc.).
34
+
35
+ #### Judge Prompts
36
+ - **Schema Compliance**: Verify prompt templates generate valid JSON structure instructions.
37
+ - **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts.
38
+
39
+ ```python
40
+ # Example: Testing State Logic
41
+ def test_budget_stop():
42
+ state = ResearchState(tokens_used=50001, max_tokens=50000)
43
+ assert should_continue(state) is False
44
+ ```
45
+
46
+ ---
47
+
48
+ ## 2. Integration Tests (Realistic & Mocked I/O)
49
+
50
+ **Location**: `tests/integration/`
51
+
52
+ Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or **Replay** patterns to record/replay API calls to save money/time.
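+
+ A minimal sketch of the record/replay pattern with VCR.py (assuming vcrpy's httpx support; the cassette directory and test file name are illustrative):
+
+ ```python
+ # tests/integration/test_pubmed_cassette.py (illustrative)
+ import httpx
+ import vcr
+
+ # Records the HTTP exchange to a YAML cassette on the first run,
+ # then replays it on later runs without touching the network.
+ pubmed_vcr = vcr.VCR(
+     cassette_library_dir="tests/fixtures/cassettes",
+     record_mode="once",
+ )
+
+
+ async def test_esearch_replayed():
+     with pubmed_vcr.use_cassette("esearch_metformin.yaml"):
+         async with httpx.AsyncClient() as client:
+             resp = await client.get(
+                 "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
+                 params={"db": "pubmed", "term": "metformin alzheimer", "retmode": "json"},
+             )
+     assert resp.status_code == 200
+ ```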
53
+
54
+ ### Key Test Cases
55
+
56
+ #### Search Loop
57
+ - **Iteration Flow**: Verify agent performs Search -> Judge -> Search loop.
58
+ - **Tool Selection**: Verify correct tools are called based on judge output (mocked judge).
59
+ - **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2.
60
+
61
+ #### MCP Server Integration
62
+ - **Server Startup**: Verify MCP server starts and exposes tools.
63
+ - **Client Connection**: Verify agent can call tools via MCP protocol.
64
+
65
+ ```python
66
+ # Example: Testing Search Loop with Mocked Tools
67
+ async def test_search_loop_flow():
68
+ agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
69
+ report = await agent.run("test query")
70
+ assert agent.state.iterations > 0
71
+ assert len(report.sources) > 0
72
+ ```
73
+
74
+ ---
75
+
76
+ ## 3. End-to-End (E2E) Tests (The "Real Deal")
77
+
78
+ **Location**: `tests/e2e/`
79
+
80
+ Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit.
81
+
82
+ ### Key Test Cases
83
+
84
+ #### The "Golden Query"
85
+ Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"*
86
+ - **Success Criteria**:
87
+ - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
88
+ - Includes citations from PubMed.
89
+ - Completes within 3 iterations.
90
+ - JSON output matches schema.
91
+
92
+ #### Deployment Smoke Test
93
+ - **Gradio UI**: Verify UI launches and accepts input.
94
+ - **Streaming**: Verify generator yields chunks (first chunk within 2s).
95
+
96
+ ---
97
+
98
+ ## 4. Tools & Config
99
+
100
+ ### Pytest Configuration
101
+ ```toml
102
+ # pyproject.toml
103
+ [tool.pytest.ini_options]
104
+ markers = [
105
+ "unit: fast, isolated tests",
106
+ "integration: mocked network tests",
107
+ "e2e: real network tests (slow, expensive)"
108
+ ]
109
+ asyncio_mode = "auto"
110
+ ```
111
+
112
+ ### CI/CD Pipeline (GitHub Actions)
113
+ 1. **Lint**: `ruff check .`
114
+ 2. **Type Check**: `mypy .`
115
+ 3. **Unit**: `pytest -m unit`
116
+ 4. **Integration**: `pytest -m integration`
117
+ 5. **E2E**: (Manual trigger only)
118
+
119
+ ---
120
+
121
+ ## 5. Anti-Hallucination Validation
122
+
123
+ How do we test if the agent is lying?
124
+
125
+ 1. **Citation Check**:
126
+ - Regex verify that every `[PMID: 12345]` in the report exists in the `Evidence` list.
127
+ - Fail if a citation is "orphaned" (hallucinated ID).
128
+
129
+ 2. **Negative Constraints**:
130
+ - Test queries for fake diseases ("Ligma syndrome") -> Agent should return "No evidence found".
131
+
132
+ ---
133
+
134
+ ## Checklist for Implementation
135
+
136
+ - [ ] Set up `tests/` directory structure
137
+ - [ ] Configure `pytest` and `vcrpy`
138
+ - [ ] Create `tests/fixtures/` for mock data (PubMed XMLs)
139
+ - [ ] Write first unit test for `ResearchState`
docs/examples/writer_agents_usage.md ADDED
@@ -0,0 +1,425 @@
1
+ # Writer Agents Usage Examples
2
+
3
+ This document provides examples of how to use the writer agents in DeepCritical for generating research reports.
4
+
5
+ ## Overview
6
+
7
+ DeepCritical provides three writer agents for different report generation scenarios:
8
+
9
+ 1. **WriterAgent** - Basic writer for simple reports from findings
10
+ 2. **LongWriterAgent** - Iterative writer for long-form multi-section reports
11
+ 3. **ProofreaderAgent** - Finalizes and polishes report drafts
12
+
13
+ ## WriterAgent
14
+
15
+ The `WriterAgent` generates final reports from research findings. It's used in iterative research flows.
16
+
17
+ ### Basic Usage
18
+
19
+ ```python
20
+ from src.agent_factory.agents import create_writer_agent
21
+
22
+ # Create writer agent
23
+ writer = create_writer_agent()
24
+
25
+ # Generate report
26
+ query = "What is the capital of France?"
27
+ findings = """
28
+ Paris is the capital of France [1].
29
+ It is located in the north-central part of the country [2].
30
+
31
+ [1] https://example.com/france-info
32
+ [2] https://example.com/paris-info
33
+ """
34
+
35
+ report = await writer.write_report(
36
+ query=query,
37
+ findings=findings,
38
+ )
39
+
40
+ print(report)
41
+ ```
42
+
43
+ ### With Output Length Specification
44
+
45
+ ```python
46
+ report = await writer.write_report(
47
+ query="Explain machine learning",
48
+ findings=findings,
49
+ output_length="500 words",
50
+ )
51
+ ```
52
+
53
+ ### With Additional Instructions
54
+
55
+ ```python
56
+ report = await writer.write_report(
57
+ query="Explain machine learning",
58
+ findings=findings,
59
+ output_length="A comprehensive overview",
60
+ output_instructions="Use formal academic language and include examples",
61
+ )
62
+ ```
63
+
64
+ ### Integration with IterativeResearchFlow
65
+
66
+ The `WriterAgent` is automatically used by `IterativeResearchFlow`:
67
+
68
+ ```python
69
+ from src.agent_factory.agents import create_iterative_flow
70
+
71
+ flow = create_iterative_flow(max_iterations=5, max_time_minutes=10)
72
+ report = await flow.run(
73
+ query="What is quantum computing?",
74
+ output_length="A detailed explanation",
75
+ output_instructions="Include practical applications",
76
+ )
77
+ ```
78
+
79
+ ## LongWriterAgent
80
+
81
+ The `LongWriterAgent` iteratively writes report sections with proper citation management. It's used in deep research flows.
82
+
83
+ ### Basic Usage
84
+
85
+ ```python
86
+ from src.agent_factory.agents import create_long_writer_agent
87
+ from src.utils.models import ReportDraft, ReportDraftSection
88
+
89
+ # Create long writer agent
90
+ long_writer = create_long_writer_agent()
91
+
92
+ # Create report draft with sections
93
+ report_draft = ReportDraft(
94
+ sections=[
95
+ ReportDraftSection(
96
+ section_title="Introduction",
97
+ section_content="Draft content for introduction with [1].",
98
+ ),
99
+ ReportDraftSection(
100
+ section_title="Methods",
101
+ section_content="Draft content for methods with [2].",
102
+ ),
103
+ ReportDraftSection(
104
+ section_title="Results",
105
+ section_content="Draft content for results with [3].",
106
+ ),
107
+ ]
108
+ )
109
+
110
+ # Generate full report
111
+ report = await long_writer.write_report(
112
+ original_query="What are the main features of Python?",
113
+ report_title="Python Programming Language Overview",
114
+ report_draft=report_draft,
115
+ )
116
+
117
+ print(report)
118
+ ```
119
+
120
+ ### Writing Individual Sections
121
+
122
+ You can also write sections one at a time:
123
+
124
+ ```python
125
+ # Write first section
126
+ section_output = await long_writer.write_next_section(
127
+ original_query="What is Python?",
128
+ report_draft="", # No existing draft
129
+ next_section_title="Introduction",
130
+ next_section_draft="Python is a programming language...",
131
+ )
132
+
133
+ print(section_output.next_section_markdown)
134
+ print(section_output.references)
135
+
136
+ # Write second section with existing draft
137
+ section_output = await long_writer.write_next_section(
138
+ original_query="What is Python?",
139
+ report_draft="# Report\n\n## Introduction\n\nContent...",
140
+ next_section_title="Features",
141
+ next_section_draft="Python features include...",
142
+ )
143
+ ```
144
+
145
+ ### Integration with DeepResearchFlow
146
+
147
+ The `LongWriterAgent` is automatically used by `DeepResearchFlow`:
148
+
149
+ ```python
150
+ from src.agent_factory.agents import create_deep_flow
151
+
152
+ flow = create_deep_flow(
153
+ max_iterations=5,
154
+ max_time_minutes=10,
155
+ use_long_writer=True, # Use long writer (default)
156
+ )
157
+
158
+ report = await flow.run("What are the main features of Python programming language?")
159
+ ```
160
+
161
+ ## ProofreaderAgent
162
+
163
+ The `ProofreaderAgent` finalizes and polishes report drafts by removing duplicates, adding summaries, and refining wording.
164
+
165
+ ### Basic Usage
166
+
167
+ ```python
168
+ from src.agent_factory.agents import create_proofreader_agent
169
+ from src.utils.models import ReportDraft, ReportDraftSection
170
+
171
+ # Create proofreader agent
172
+ proofreader = create_proofreader_agent()
173
+
174
+ # Create report draft
175
+ report_draft = ReportDraft(
176
+ sections=[
177
+ ReportDraftSection(
178
+ section_title="Introduction",
179
+ section_content="Python is a programming language [1].",
180
+ ),
181
+ ReportDraftSection(
182
+ section_title="Features",
183
+ section_content="Python has many features [2].",
184
+ ),
185
+ ]
186
+ )
187
+
188
+ # Proofread and finalize
189
+ final_report = await proofreader.proofread(
190
+ query="What is Python?",
191
+ report_draft=report_draft,
192
+ )
193
+
194
+ print(final_report)
195
+ ```
196
+
197
+ ### Integration with DeepResearchFlow
198
+
199
+ Use `ProofreaderAgent` instead of `LongWriterAgent`:
200
+
201
+ ```python
202
+ from src.agent_factory.agents import create_deep_flow
203
+
204
+ flow = create_deep_flow(
205
+ max_iterations=5,
206
+ max_time_minutes=10,
207
+ use_long_writer=False, # Use proofreader instead
208
+ )
209
+
210
+ report = await flow.run("What are the main features of Python?")
211
+ ```
212
+
213
+ ## Error Handling
214
+
215
+ All writer agents include robust error handling:
216
+
217
+ ### Handling Empty Inputs
218
+
219
+ ```python
220
+ # WriterAgent handles empty findings gracefully
221
+ report = await writer.write_report(
222
+ query="Test query",
223
+ findings="", # Empty findings
224
+ )
225
+ # Returns a fallback report
226
+
227
+ # LongWriterAgent handles empty sections
228
+ report = await long_writer.write_report(
229
+ original_query="Test",
230
+ report_title="Test Report",
231
+ report_draft=ReportDraft(sections=[]), # Empty draft
232
+ )
233
+ # Returns minimal report
234
+
235
+ # ProofreaderAgent handles empty drafts
236
+ report = await proofreader.proofread(
237
+ query="Test",
238
+ report_draft=ReportDraft(sections=[]),
239
+ )
240
+ # Returns minimal report
241
+ ```
242
+
243
+ ### Retry Logic
244
+
245
+ All agents automatically retry on transient errors (timeouts, connection errors):
246
+
247
+ ```python
248
+ # Automatically retries up to 3 times on transient failures
249
+ report = await writer.write_report(
250
+ query="Test query",
251
+ findings=findings,
252
+ )
253
+ ```
254
+
255
+ ### Fallback Reports
256
+
257
+ If all retries fail, agents return fallback reports:
258
+
259
+ ```python
260
+ # Returns fallback report with query and findings
261
+ report = await writer.write_report(
262
+ query="Test query",
263
+ findings=findings,
264
+ )
265
+ # Fallback includes: "# Research Report\n\n## Query\n...\n\n## Findings\n..."
266
+ ```
267
+
268
+ ## Citation Validation
269
+
270
+ ### For Markdown Reports
271
+
272
+ Use the markdown citation validator:
273
+
274
+ ```python
275
+ from src.utils.citation_validator import validate_markdown_citations
276
+ from src.utils.models import Evidence, Citation
277
+
278
+ # Collect evidence during research
279
+ evidence = [
280
+ Evidence(
281
+ content="Paris is the capital of France",
282
+ citation=Citation(
283
+ source="web",
284
+ title="France Information",
285
+ url="https://example.com/france",
286
+ date="2024-01-01",
287
+ ),
288
+ ),
289
+ ]
290
+
291
+ # Generate report
292
+ report = await writer.write_report(query="What is the capital of France?", findings=findings)
293
+
294
+ # Validate citations
295
+ validated_report, removed_count = validate_markdown_citations(report, evidence)
296
+
297
+ if removed_count > 0:
298
+ print(f"Removed {removed_count} invalid citations")
299
+ ```
300
+
301
+ ### For ResearchReport Objects
302
+
303
+ Use the structured citation validator:
304
+
305
+ ```python
306
+ from src.utils.citation_validator import validate_references
307
+
308
+ # For ResearchReport objects (from ReportAgent)
309
+ validated_report = validate_references(report, evidence)
310
+ ```
311
+
312
+ ## Custom Model Configuration
313
+
314
+ All writer agents support custom model configuration:
315
+
316
+ ```python
317
+ from pydantic_ai import Model
318
+
319
+ # Create custom model
320
+ custom_model = Model("openai", "gpt-4")
321
+
322
+ # Use with writer agents
323
+ writer = create_writer_agent(model=custom_model)
324
+ long_writer = create_long_writer_agent(model=custom_model)
325
+ proofreader = create_proofreader_agent(model=custom_model)
326
+ ```
327
+
328
+ ## Best Practices
329
+
330
+ 1. **Use WriterAgent for simple reports** - When you have findings as a string and need a quick report
331
+ 2. **Use LongWriterAgent for structured reports** - When you need multiple sections with proper citation management
332
+ 3. **Use ProofreaderAgent for final polish** - When you have draft sections and need a polished final report
333
+ 4. **Validate citations** - Always validate citations against collected evidence
334
+ 5. **Handle errors gracefully** - All agents return fallback reports on failure
335
+ 6. **Specify output length** - Use `output_length` parameter to control report size
336
+ 7. **Provide instructions** - Use `output_instructions` for specific formatting requirements
337
+
338
+ ## Integration Examples
339
+
340
+ ### Full Iterative Research Flow
341
+
342
+ ```python
343
+ from src.agent_factory.agents import create_iterative_flow
344
+
345
+ flow = create_iterative_flow(
346
+ max_iterations=5,
347
+ max_time_minutes=10,
348
+ )
349
+
350
+ report = await flow.run(
351
+ query="What is machine learning?",
352
+ output_length="A comprehensive 1000-word explanation",
353
+ output_instructions="Include practical examples and use cases",
354
+ )
355
+ ```
356
+
357
+ ### Full Deep Research Flow with Long Writer
358
+
359
+ ```python
360
+ from src.agent_factory.agents import create_deep_flow
361
+
362
+ flow = create_deep_flow(
363
+ max_iterations=5,
364
+ max_time_minutes=10,
365
+ use_long_writer=True,
366
+ )
367
+
368
+ report = await flow.run("What are the main features of Python programming language?")
369
+ ```
370
+
371
+ ### Full Deep Research Flow with Proofreader
372
+
373
+ ```python
374
+ from src.agent_factory.agents import create_deep_flow
375
+
376
+ flow = create_deep_flow(
377
+ max_iterations=5,
378
+ max_time_minutes=10,
379
+ use_long_writer=False, # Use proofreader
380
+ )
381
+
382
+ report = await flow.run("Explain quantum computing basics")
383
+ ```
384
+
385
+ ## Troubleshooting
386
+
387
+ ### Empty Reports
388
+
389
+ If you get empty reports, check:
390
+ - Input validation logs (agents log warnings for empty inputs)
391
+ - LLM API key configuration
392
+ - Network connectivity
393
+
394
+ ### Citation Issues
395
+
396
+ If citations are missing or invalid:
397
+ - Use `validate_markdown_citations()` to check citations
398
+ - Ensure Evidence objects are properly collected during research
399
+ - Check that URLs in findings match Evidence URLs
400
+
401
+ ### Performance Issues
402
+
403
+ For large reports:
404
+ - Use `LongWriterAgent` for better section management
405
+ - Consider truncating very long findings (agents do this automatically)
406
+ - Use appropriate `max_time_minutes` settings
407
+
408
+ ## See Also
409
+
410
+ - [Research Flows Documentation](../orchestrator/research_flows.md)
411
+ - [Citation Validation](../utils/citation_validation.md)
412
+ - [Agent Factory](../agent_factory/agents.md)
413
+
414
+
415
+
416
+
417
+
418
+
419
+
420
+
421
+
422
+
423
+
424
+
425
+
docs/guides/deployment.md ADDED
@@ -0,0 +1,142 @@
1
+ # Deployment Guide
2
+ ## Launching DeepCritical: Gradio, MCP, & Modal
3
+
4
+ ---
5
+
6
+ ## Overview
7
+
8
+ DeepCritical is designed for a multi-platform deployment strategy to maximize hackathon impact:
9
+
10
+ 1. **HuggingFace Spaces**: Host the Gradio UI (User Interface).
11
+ 2. **MCP Server**: Expose research tools to Claude Desktop/Agents.
12
+ 3. **Modal (Optional)**: Run heavy inference or local LLMs if API costs are prohibitive.
13
+
14
+ ---
15
+
16
+ ## 1. HuggingFace Spaces (Gradio UI)
17
+
18
+ **Goal**: A public URL where judges/users can try the research agent.
19
+
20
+ ### Prerequisites
21
+ - HuggingFace Account
22
+ - `gradio` installed (`uv add gradio`)
23
+
24
+ ### Steps
25
+
26
+ 1. **Create Space**:
27
+ - Go to HF Spaces -> Create New Space.
28
+ - SDK: **Gradio**.
29
+ - Hardware: **CPU Basic** (Free) is sufficient (since we use APIs).
30
+
31
+ 2. **Prepare Files**:
32
+ - Ensure `app.py` contains the Gradio interface construction.
33
+ - Ensure `requirements.txt` or `pyproject.toml` lists all dependencies.
34
+
35
+ 3. **Secrets**:
36
+ - Go to Space Settings -> **Repository secrets**.
37
+ - Add `ANTHROPIC_API_KEY` (or your chosen LLM provider key).
38
+ - Add `BRAVE_API_KEY` (for web search).
39
+
40
+ 4. **Deploy**:
41
+ - Push code to the Space's git repo.
42
+ - Watch "Build" logs.
43
+
44
+ ### Streaming Optimization
45
+ Ensure `app.py` uses generator functions for the chat interface to prevent timeouts:
46
+ ```python
47
+ # app.py
48
+ def predict(message, history):
49
+ agent = ResearchAgent()
50
+ for update in agent.research_stream(message):
51
+ yield update
52
+ ```
53
+
54
+ ---
55
+
56
+ ## 2. MCP Server Deployment
57
+
58
+ **Goal**: Allow other agents (like Claude Desktop) to use our PubMed/Research tools directly.
59
+
60
+ ### Local Usage (Claude Desktop)
61
+
62
+ 1. **Install**:
63
+ ```bash
64
+ uv sync
65
+ ```
66
+
67
+ 2. **Configure Claude Desktop**:
68
+ Edit `~/Library/Application Support/Claude/claude_desktop_config.json`:
69
+ ```json
70
+ {
71
+ "mcpServers": {
72
+ "deepcritical": {
73
+ "command": "uv",
74
+ "args": ["run", "fastmcp", "run", "src/mcp_servers/pubmed_server.py"],
75
+ "cwd": "/absolute/path/to/DeepCritical"
76
+ }
77
+ }
78
+ }
79
+ ```
80
+
81
+ 3. **Restart Claude**: You should see a 🔌 icon indicating connected tools.
82
+
83
+ ### Remote Deployment (Smithery/Glama)
84
+ *Target for "MCP Track" bonus points.*
85
+
86
+ 1. **Dockerize**: Create a `Dockerfile` for the MCP server.
87
+ ```dockerfile
88
+ FROM python:3.11-slim
89
+ COPY . /app
90
+ RUN pip install fastmcp httpx
91
+ CMD ["fastmcp", "run", "src/mcp_servers/pubmed_server.py", "--transport", "sse"]
92
+ ```
93
+ *Note: Use SSE transport for remote/HTTP servers.*
94
+
95
+ 2. **Deploy**: Host on Fly.io or Railway.
96
+
97
+ ---
98
+
99
+ ## 3. Modal (GPU/Heavy Compute)
100
+
101
+ **Goal**: Run a local LLM (e.g., Llama-3-70B) or handle massive parallel searches if APIs are too slow/expensive.
102
+
103
+ ### Setup
104
+ 1. **Install**: `uv add modal`
105
+ 2. **Auth**: `modal token new`
106
+
107
+ ### Logic
108
+ Instead of calling Anthropic API, we call a Modal function:
109
+
110
+ ```python
111
+ # src/llm/modal_client.py
112
+ import modal
113
+
114
+ stub = modal.Stub("deepcritical-inference")
115
+
116
+ @stub.function(gpu="A100")
117
+ def generate_text(prompt: str):
118
+ # Load vLLM or similar
119
+ ...
120
+ ```
121
+
122
+ ### When to use?
123
+ - **Hackathon Demo**: Stick to Anthropic/OpenAI APIs for speed/reliability.
124
+ - **Production/Stretch**: Use Modal if you hit rate limits or want to show off "Open Source Models" capability.
125
+
126
+ ---
127
+
128
+ ## Deployment Checklist
129
+
130
+ ### Pre-Flight
131
+ - [ ] Run `pytest -m unit` to ensure logic is sound.
132
+ - [ ] Run `pytest -m e2e` (one pass) to verify APIs connect.
133
+ - [ ] Check `requirements.txt` matches `pyproject.toml`.
134
+
135
+ ### Secrets Management
136
+ - [ ] **NEVER** commit `.env` files.
137
+ - [ ] Verify keys are added to HF Space settings.
138
+
139
+ ### Post-Launch
140
+ - [ ] Test the live URL.
141
+ - [ ] Verify "Stop" button in Gradio works (interrupts the agent).
142
+ - [ ] Record a walkthrough video (crucial for hackathon submission).
docs/implementation/01_phase_foundation.md ADDED
@@ -0,0 +1,587 @@
1
+ # Phase 1 Implementation Spec: Foundation & Tooling
2
+
3
+ **Goal**: Establish a "Gucci Banger" development environment using 2025 best practices.
4
+ **Philosophy**: "If the build isn't solid, the agent won't be."
5
+
6
+ ---
7
+
8
+ ## 1. Prerequisites
9
+
10
+ Before starting, ensure these are installed:
11
+
12
+ ```bash
13
+ # Install uv (Rust-based package manager)
14
+ curl -LsSf https://astral.sh/uv/install.sh | sh
15
+
16
+ # Verify
17
+ uv --version # Should be >= 0.4.0
18
+ ```
19
+
20
+ ---
21
+
22
+ ## 2. Project Initialization
23
+
24
+ ```bash
25
+ # From project root
26
+ uv init --name deepcritical
27
+ uv python install 3.11 # Pin Python version
28
+ ```
29
+
30
+ ---
31
+
32
+ ## 3. The Tooling Stack (Exact Dependencies)
33
+
34
+ ### `pyproject.toml` (Complete, Copy-Paste Ready)
35
+
36
+ ```toml
37
+ [project]
38
+ name = "deepcritical"
39
+ version = "0.1.0"
40
+ description = "AI-Native Drug Repurposing Research Agent"
41
+ readme = "README.md"
42
+ requires-python = ">=3.11"
43
+ dependencies = [
44
+ # Core
45
+ "pydantic>=2.7",
46
+ "pydantic-settings>=2.2", # For BaseSettings (config)
47
+ "pydantic-ai>=0.0.16", # Agent framework
48
+
49
+ # HTTP & Parsing
50
+ "httpx>=0.27", # Async HTTP client
51
+ "beautifulsoup4>=4.12", # HTML parsing
52
+ "xmltodict>=0.13", # PubMed XML -> dict
53
+
54
+ # Search
55
+ "duckduckgo-search>=6.0", # Free web search
56
+
57
+ # UI
58
+ "gradio>=5.0", # Chat interface
59
+
60
+ # Utils
61
+ "python-dotenv>=1.0", # .env loading
62
+ "tenacity>=8.2", # Retry logic
63
+ "structlog>=24.1", # Structured logging
64
+ ]
65
+
66
+ [project.optional-dependencies]
67
+ dev = [
68
+ # Testing
69
+ "pytest>=8.0",
70
+ "pytest-asyncio>=0.23",
71
+ "pytest-sugar>=1.0",
72
+ "pytest-cov>=5.0",
73
+ "pytest-mock>=3.12",
74
+ "respx>=0.21", # Mock httpx requests
75
+
76
+ # Quality
77
+ "ruff>=0.4.0",
78
+ "mypy>=1.10",
79
+ "pre-commit>=3.7",
80
+ ]
81
+
82
+ [build-system]
83
+ requires = ["hatchling"]
84
+ build-backend = "hatchling.build"
85
+
86
+ [tool.hatch.build.targets.wheel]
87
+ packages = ["src"]
88
+
89
+ # ============== RUFF CONFIG ==============
90
+ [tool.ruff]
91
+ line-length = 100
92
+ target-version = "py311"
93
+ src = ["src", "tests"]
94
+
95
+ [tool.ruff.lint]
96
+ select = [
97
+ "E", # pycodestyle errors
98
+ "F", # pyflakes
99
+ "B", # flake8-bugbear
100
+ "I", # isort
101
+ "N", # pep8-naming
102
+ "UP", # pyupgrade
103
+ "PL", # pylint
104
+ "RUF", # ruff-specific
105
+ ]
106
+ ignore = [
107
+ "PLR0913", # Too many arguments (agents need many params)
108
+ ]
109
+
110
+ [tool.ruff.lint.isort]
111
+ known-first-party = ["src"]
112
+
113
+ # ============== MYPY CONFIG ==============
114
+ [tool.mypy]
115
+ python_version = "3.11"
116
+ strict = true
117
+ ignore_missing_imports = true
118
+ disallow_untyped_defs = true
119
+ warn_return_any = true
120
+ warn_unused_ignores = true
121
+
122
+ # ============== PYTEST CONFIG ==============
123
+ [tool.pytest.ini_options]
124
+ testpaths = ["tests"]
125
+ asyncio_mode = "auto"
126
+ addopts = [
127
+ "-v",
128
+ "--tb=short",
129
+ "--strict-markers",
130
+ ]
131
+ markers = [
132
+ "unit: Unit tests (mocked)",
133
+ "integration: Integration tests (real APIs)",
134
+ "slow: Slow tests",
135
+ ]
136
+
137
+ # ============== COVERAGE CONFIG ==============
138
+ [tool.coverage.run]
139
+ source = ["src"]
140
+ omit = ["*/__init__.py"]
141
+
142
+ [tool.coverage.report]
143
+ exclude_lines = [
144
+ "pragma: no cover",
145
+ "if TYPE_CHECKING:",
146
+ "raise NotImplementedError",
147
+ ]
148
+ ```
149
+
150
+ ---
151
+
152
+ ## 4. Directory Structure (Maintainer's Structure)
153
+
154
+ ```bash
155
+ # Execute these commands to create the directory structure
156
+ mkdir -p src/utils
157
+ mkdir -p src/tools
158
+ mkdir -p src/prompts
159
+ mkdir -p src/agent_factory
160
+ mkdir -p src/middleware
161
+ mkdir -p src/database_services
162
+ mkdir -p src/retrieval_factory
163
+ mkdir -p tests/unit/tools
164
+ mkdir -p tests/unit/agent_factory
165
+ mkdir -p tests/unit/utils
166
+ mkdir -p tests/integration
167
+
168
+ # Create __init__.py files (required for imports)
169
+ touch src/__init__.py
170
+ touch src/utils/__init__.py
171
+ touch src/tools/__init__.py
172
+ touch src/prompts/__init__.py
173
+ touch src/agent_factory/__init__.py
174
+ touch tests/__init__.py
175
+ touch tests/unit/__init__.py
176
+ touch tests/unit/tools/__init__.py
177
+ touch tests/unit/agent_factory/__init__.py
178
+ touch tests/unit/utils/__init__.py
179
+ touch tests/integration/__init__.py
180
+ ```
181
+
182
+ ### Final Structure:
183
+
184
+ ```
185
+ src/
186
+ ├── __init__.py
187
+ ├── app.py # Entry point (Gradio UI)
188
+ ├── orchestrator.py # Agent loop
189
+ ├── agent_factory/ # Agent creation and judges
190
+ │ ├── __init__.py
191
+ │ ├── agents.py
192
+ │ └── judges.py
193
+ ├── tools/ # Search tools
194
+ │ ├── __init__.py
195
+ │ ├── pubmed.py
196
+ │ ├── websearch.py
197
+ │ └── search_handler.py
198
+ ├── prompts/ # Prompt templates
199
+ │ ├── __init__.py
200
+ │ └── judge.py
201
+ ├── utils/ # Shared utilities
202
+ │ ├── __init__.py
203
+ │ ├── config.py
204
+ │ ├── exceptions.py
205
+ │ ├── models.py
206
+ │ ├── dataloaders.py
207
+ │ └── parsers.py
208
+ ├── middleware/ # (Future)
209
+ ├── database_services/ # (Future)
210
+ └── retrieval_factory/ # (Future)
211
+
212
+ tests/
213
+ ├── __init__.py
214
+ ├── conftest.py
215
+ ├── unit/
216
+ │ ├── __init__.py
217
+ │ ├── tools/
218
+ │ │ ├── __init__.py
219
+ │ │ ├── test_pubmed.py
220
+ │ │ ├── test_websearch.py
221
+ │ │ └── test_search_handler.py
222
+ │ ├── agent_factory/
223
+ │ │ ├── __init__.py
224
+ │ │ └── test_judges.py
225
+ │ ├── utils/
226
+ │ │ ├── __init__.py
227
+ │ │ └── test_config.py
228
+ │ └── test_orchestrator.py
229
+ └── integration/
230
+ ├── __init__.py
231
+ └── test_pubmed_live.py
232
+ ```
233
+
234
+ ---
235
+
236
+ ## 5. Configuration Files
237
+
238
+ ### `.env.example` (Copy to `.env` and fill)
239
+
240
+ ```bash
241
+ # LLM Provider (choose one)
242
+ OPENAI_API_KEY=sk-your-key-here
243
+ ANTHROPIC_API_KEY=sk-ant-your-key-here
244
+
245
+ # Optional: PubMed API key (higher rate limits)
246
+ NCBI_API_KEY=your-ncbi-key-here
247
+
248
+ # Optional: For HuggingFace deployment
249
+ HF_TOKEN=hf_your-token-here
250
+
251
+ # Agent Config
252
+ MAX_ITERATIONS=10
253
+ LOG_LEVEL=INFO
254
+ ```
255
+
256
+ ### `.pre-commit-config.yaml`
257
+
258
+ ```yaml
259
+ repos:
260
+ - repo: https://github.com/astral-sh/ruff-pre-commit
261
+ rev: v0.4.4
262
+ hooks:
263
+ - id: ruff
264
+ args: [--fix]
265
+ - id: ruff-format
266
+
267
+ - repo: https://github.com/pre-commit/mirrors-mypy
268
+ rev: v1.10.0
269
+ hooks:
270
+ - id: mypy
271
+ additional_dependencies:
272
+ - pydantic>=2.7
273
+ - pydantic-settings>=2.2
274
+ args: [--ignore-missing-imports]
275
+ ```
276
+
277
+ ### `tests/conftest.py` (Pytest Fixtures)
278
+
279
+ ```python
280
+ """Shared pytest fixtures for all tests."""
281
+ import pytest
282
+ from unittest.mock import AsyncMock
283
+
284
+
285
+ @pytest.fixture
286
+ def mock_httpx_client(mocker):
287
+ """Mock httpx.AsyncClient for API tests."""
288
+ mock = mocker.patch("httpx.AsyncClient")
289
+ mock.return_value.__aenter__ = AsyncMock(return_value=mock.return_value)
290
+ mock.return_value.__aexit__ = AsyncMock(return_value=None)
291
+ return mock
292
+
293
+
294
+ @pytest.fixture
295
+ def mock_llm_response():
296
+ """Factory fixture for mocking LLM responses."""
297
+ def _mock(content: str):
298
+ return AsyncMock(return_value=content)
299
+ return _mock
300
+
301
+
302
+ @pytest.fixture
303
+ def sample_evidence():
304
+ """Sample Evidence objects for testing."""
305
+ from src.utils.models import Evidence, Citation
306
+ return [
307
+ Evidence(
308
+ content="Metformin shows promise in Alzheimer's...",
309
+ citation=Citation(
310
+ source="pubmed",
311
+ title="Metformin and Alzheimer's Disease",
312
+ url="https://pubmed.ncbi.nlm.nih.gov/12345678/",
313
+ date="2024-01-15"
314
+ ),
315
+ relevance=0.85
316
+ )
317
+ ]
318
+ ```
319
+
320
+ ---
321
+
322
+ ## 6. Core Utilities Implementation
323
+
324
+ ### `src/utils/config.py`
325
+
326
+ ```python
327
+ """Application configuration using Pydantic Settings."""
328
+ from pydantic_settings import BaseSettings, SettingsConfigDict
329
+ from pydantic import Field
330
+ from typing import Literal
331
+ import structlog
332
+
333
+
334
+ class Settings(BaseSettings):
335
+ """Strongly-typed application settings."""
336
+
337
+ model_config = SettingsConfigDict(
338
+ env_file=".env",
339
+ env_file_encoding="utf-8",
340
+ case_sensitive=False,
341
+ extra="ignore",
342
+ )
343
+
344
+ # LLM Configuration
345
+ openai_api_key: str | None = Field(default=None, description="OpenAI API key")
346
+ anthropic_api_key: str | None = Field(default=None, description="Anthropic API key")
347
+ llm_provider: Literal["openai", "anthropic"] = Field(
348
+ default="openai",
349
+ description="Which LLM provider to use"
350
+ )
351
+ openai_model: str = Field(default="gpt-4o", description="OpenAI model name")
352
+ anthropic_model: str = Field(default="claude-3-5-sonnet-20241022", description="Anthropic model")
353
+
354
+ # PubMed Configuration
355
+ ncbi_api_key: str | None = Field(default=None, description="NCBI API key for higher rate limits")
356
+
357
+ # Agent Configuration
358
+ max_iterations: int = Field(default=10, ge=1, le=50)
359
+ search_timeout: int = Field(default=30, description="Seconds to wait for search")
360
+
361
+ # Logging
362
+ log_level: Literal["DEBUG", "INFO", "WARNING", "ERROR"] = "INFO"
363
+
364
+ def get_api_key(self) -> str:
365
+ """Get the API key for the configured provider."""
366
+ if self.llm_provider == "openai":
367
+ if not self.openai_api_key:
368
+ raise ValueError("OPENAI_API_KEY not set")
369
+ return self.openai_api_key
370
+ else:
371
+ if not self.anthropic_api_key:
372
+ raise ValueError("ANTHROPIC_API_KEY not set")
373
+ return self.anthropic_api_key
374
+
375
+
376
+ def get_settings() -> Settings:
377
+ """Factory function to get settings (allows mocking in tests)."""
378
+ return Settings()
379
+
380
+
381
+ def configure_logging(settings: Settings) -> None:
382
+ """Configure structured logging."""
383
+ structlog.configure(
384
+ processors=[
385
+ structlog.stdlib.filter_by_level,
386
+ structlog.stdlib.add_logger_name,
387
+ structlog.stdlib.add_log_level,
388
+ structlog.processors.TimeStamper(fmt="iso"),
389
+ structlog.processors.JSONRenderer(),
390
+ ],
391
+ wrapper_class=structlog.stdlib.BoundLogger,
392
+ context_class=dict,
393
+ logger_factory=structlog.stdlib.LoggerFactory(),
394
+ )
395
+
396
+
397
+ # Singleton for easy import
398
+ settings = get_settings()
399
+ ```
400
+
401
+ ### `src/utils/exceptions.py`
402
+
403
+ ```python
404
+ """Custom exceptions for DeepCritical."""
405
+
406
+
407
+ class DeepCriticalError(Exception):
408
+ """Base exception for all DeepCritical errors."""
409
+ pass
410
+
411
+
412
+ class SearchError(DeepCriticalError):
413
+ """Raised when a search operation fails."""
414
+ pass
415
+
416
+
417
+ class JudgeError(DeepCriticalError):
418
+ """Raised when the judge fails to assess evidence."""
419
+ pass
420
+
421
+
422
+ class ConfigurationError(DeepCriticalError):
423
+ """Raised when configuration is invalid."""
424
+ pass
425
+
426
+
427
+ class RateLimitError(SearchError):
428
+ """Raised when we hit API rate limits."""
429
+ pass
430
+ ```
431
+
432
+ ---
433
+
434
+ ## 7. TDD Workflow: First Test
435
+
436
+ ### `tests/unit/utils/test_config.py`
437
+
438
+ ```python
439
+ """Unit tests for configuration loading."""
440
+ import pytest
441
+ from unittest.mock import patch
442
+ import os
443
+
444
+
445
+ class TestSettings:
446
+ """Tests for Settings class."""
447
+
448
+ def test_default_max_iterations(self):
449
+ """Settings should have default max_iterations of 10."""
450
+ from src.utils.config import Settings
451
+
452
+ # Clear any env vars
453
+ with patch.dict(os.environ, {}, clear=True):
454
+ settings = Settings()
455
+ assert settings.max_iterations == 10
456
+
457
+ def test_max_iterations_from_env(self):
458
+ """Settings should read MAX_ITERATIONS from env."""
459
+ from src.utils.config import Settings
460
+
461
+ with patch.dict(os.environ, {"MAX_ITERATIONS": "25"}):
462
+ settings = Settings()
463
+ assert settings.max_iterations == 25
464
+
465
+ def test_invalid_max_iterations_raises(self):
466
+ """Settings should reject invalid max_iterations."""
467
+ from src.utils.config import Settings
468
+ from pydantic import ValidationError
469
+
470
+ with patch.dict(os.environ, {"MAX_ITERATIONS": "100"}):
471
+ with pytest.raises(ValidationError):
472
+ Settings() # 100 > 50 (max)
473
+
474
+ def test_get_api_key_openai(self):
475
+ """get_api_key should return OpenAI key when provider is openai."""
476
+ from src.utils.config import Settings
477
+
478
+ with patch.dict(os.environ, {
479
+ "LLM_PROVIDER": "openai",
480
+ "OPENAI_API_KEY": "sk-test-key"
481
+ }):
482
+ settings = Settings()
483
+ assert settings.get_api_key() == "sk-test-key"
484
+
485
+ def test_get_api_key_missing_raises(self):
486
+ """get_api_key should raise when key is not set."""
487
+ from src.utils.config import Settings
488
+
489
+ with patch.dict(os.environ, {"LLM_PROVIDER": "openai"}, clear=True):
490
+ settings = Settings()
491
+ with pytest.raises(ValueError, match="OPENAI_API_KEY not set"):
492
+ settings.get_api_key()
493
+ ```
494
+
495
+ ---
496
+
497
+ ## 8. Makefile (Developer Experience)
498
+
499
+ Create a `Makefile` for standard devex commands:
500
+
501
+ ```makefile
502
+ .PHONY: install test lint format typecheck check clean
503
+
504
+ install:
505
+ uv sync --all-extras
506
+ uv run pre-commit install
507
+
508
+ test:
509
+ uv run pytest tests/unit/ -v
510
+
511
+ test-cov:
512
+ uv run pytest --cov=src --cov-report=term-missing
513
+
514
+ lint:
515
+ uv run ruff check src tests
516
+
517
+ format:
518
+ uv run ruff format src tests
519
+
520
+ typecheck:
521
+ uv run mypy src
522
+
523
+ check: lint typecheck test
524
+ @echo "All checks passed!"
525
+
526
+ clean:
527
+ rm -rf .pytest_cache .mypy_cache .ruff_cache __pycache__ .coverage
528
+ find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true
529
+ ```
530
+
531
+ ---
532
+
533
+ ## 9. Execution Commands
534
+
535
+ ```bash
536
+ # Install all dependencies
537
+ uv sync --all-extras
538
+
539
+ # Run tests (should pass after implementing config.py)
540
+ uv run pytest tests/unit/utils/test_config.py -v
541
+
542
+ # Run full test suite with coverage
543
+ uv run pytest --cov=src --cov-report=term-missing
544
+
545
+ # Run linting
546
+ uv run ruff check src tests
547
+ uv run ruff format src tests
548
+
549
+ # Run type checking
550
+ uv run mypy src
551
+
552
+ # Set up pre-commit hooks
553
+ uv run pre-commit install
554
+ ```
555
+
556
+ ---
557
+
558
+ ## 10. Implementation Checklist
559
+
560
+ - [ ] Install `uv` and verify version
561
+ - [ ] Run `uv init --name deepcritical`
562
+ - [ ] Create `pyproject.toml` (copy from above)
563
+ - [ ] Create directory structure (run mkdir commands)
564
+ - [ ] Create `.env.example` and `.env`
565
+ - [ ] Create `.pre-commit-config.yaml`
566
+ - [ ] Create `Makefile` (copy from above)
567
+ - [ ] Create `tests/conftest.py`
568
+ - [ ] Implement `src/utils/config.py`
569
+ - [ ] Implement `src/utils/exceptions.py`
570
+ - [ ] Write tests in `tests/unit/utils/test_config.py`
571
+ - [ ] Run `make install`
572
+ - [ ] Run `make check` — **ALL CHECKS MUST PASS**
573
+ - [ ] Commit: `git commit -m "feat: phase 1 foundation complete"`
574
+
575
+ ---
576
+
577
+ ## 11. Definition of Done
578
+
579
+ Phase 1 is **COMPLETE** when:
580
+
581
+ 1. `uv run pytest` passes with 100% of tests green
582
+ 2. `uv run ruff check src tests` has 0 errors
583
+ 3. `uv run mypy src` has 0 errors
584
+ 4. Pre-commit hooks are installed and working
585
+ 5. `from src.utils.config import settings` works in Python REPL
586
+
587
+ **Proceed to Phase 2 ONLY after all checkboxes are complete.**
docs/implementation/02_phase_search.md ADDED
@@ -0,0 +1,822 @@
1
+ # Phase 2 Implementation Spec: Search Vertical Slice
2
+
3
+ **Goal**: Implement the "Eyes and Ears" of the agent — retrieving real biomedical data.
4
+ **Philosophy**: "Real data, mocked connections."
5
+ **Prerequisite**: Phase 1 complete (all tests passing)
6
+
7
+ > **⚠️ Implementation Note (2025-01-27)**: The DuckDuckGo WebTool specified in this phase was removed in favor of the Europe PMC tool (see Phase 11). Europe PMC provides better coverage for biomedical research by including preprints, peer-reviewed articles, and patents. The current implementation uses PubMed, ClinicalTrials.gov, and Europe PMC as search sources.
8
+
9
+ ---
10
+
11
+ ## 1. The Slice Definition
12
+
13
+ This slice covers:
14
+ 1. **Input**: A string query (e.g., "metformin Alzheimer's disease").
15
+ 2. **Process**:
16
+ - Fetch from PubMed (E-utilities API).
17
+ - ~~Fetch from Web (DuckDuckGo).~~ **REMOVED** - Replaced by Europe PMC in Phase 11
18
+ - Normalize results into `Evidence` models.
19
+ 3. **Output**: A list of `Evidence` objects.
20
+
21
+ **Files to Create**:
22
+ - `src/utils/models.py` - Pydantic models (Evidence, Citation, SearchResult)
23
+ - `src/tools/pubmed.py` - PubMed E-utilities tool
24
+ - ~~`src/tools/websearch.py` - DuckDuckGo search tool~~ **REMOVED** - See Phase 11 for Europe PMC replacement
25
+ - `src/tools/search_handler.py` - Orchestrates multiple tools
26
+ - `src/tools/__init__.py` - Exports
27
+
28
+ **Additional Files (Post-Phase 2 Enhancements)**:
29
+ - `src/tools/query_utils.py` - Query preprocessing (removes question words, expands medical synonyms)
30
+
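+ The real `query_utils.py` is not reproduced in this spec; the sketch below only illustrates the two behaviours named above (question-word removal, synonym expansion) with made-up word lists:
+
+ ```python
+ # Illustrative sketch only - the shipped module may differ.
+ QUESTION_WORDS = {"what", "which", "can", "does", "is", "are", "how", "why"}
+ SYNONYMS = {"alzheimer": ["alzheimer disease", "AD"], "heart attack": ["myocardial infarction"]}
+
+
+ def preprocess_query(query: str) -> str:
+     """Strip question words and append known medical synonyms."""
+     terms = [w for w in query.lower().split() if w not in QUESTION_WORDS]
+     base = " ".join(terms)
+     expansions = [alt for key, alts in SYNONYMS.items() if key in base for alt in alts]
+     return " ".join([base, *expansions]).strip()
+ ```
+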
31
+ ---
32
+
33
+ ## 2. PubMed E-utilities API Reference
34
+
35
+ **Base URL**: `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/`
36
+
37
+ ### Key Endpoints
38
+
39
+ | Endpoint | Purpose | Example |
40
+ |----------|---------|---------|
41
+ | `esearch.fcgi` | Search for article IDs | `?db=pubmed&term=metformin+alzheimer&retmax=10` |
42
+ | `efetch.fcgi` | Fetch article details | `?db=pubmed&id=12345,67890&rettype=abstract&retmode=xml` |
43
+
44
+ ### Rate Limiting (CRITICAL!)
45
+
46
+ NCBI **requires** rate limiting:
47
+ - **Without API key**: 3 requests/second
48
+ - **With API key**: 10 requests/second
49
+
50
+ Get a free API key: https://www.ncbi.nlm.nih.gov/account/settings/
51
+
52
+ ```python
53
+ # Add to .env
54
+ NCBI_API_KEY=your-key-here # Optional but recommended
55
+ ```
56
+
57
+ ### Example Search Flow
58
+
59
+ ```
60
+ 1. esearch: "metformin alzheimer" → [PMID: 12345, 67890, ...]
61
+ 2. efetch: PMIDs → Full abstracts/metadata
62
+ 3. Parse XML → Evidence objects
63
+ ```
64
+
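+ Condensed into code, the flow looks roughly like this (a sketch of the two HTTP calls only; the full `PubMedTool` in section 4 adds retries, XML parsing and proper rate limiting):
+
+ ```python
+ import asyncio
+ import httpx
+
+ BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
+
+
+ async def pubmed_flow(query: str, api_key: str | None = None) -> str:
+     """esearch (query -> PMIDs) then efetch (PMIDs -> abstract XML)."""
+     extra = {"api_key": api_key} if api_key else {}
+     delay = 0.1 if api_key else 0.34  # ~10 req/s with a key, ~3 req/s without
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         r = await client.get(
+             f"{BASE}/esearch.fcgi",
+             params={"db": "pubmed", "term": query, "retmax": 5, "retmode": "json", **extra},
+         )
+         pmids = r.json()["esearchresult"]["idlist"]
+         await asyncio.sleep(delay)
+         r = await client.get(
+             f"{BASE}/efetch.fcgi",
+             params={"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "xml", **extra},
+         )
+         return r.text  # parsed into Evidence objects in section 4
+ ```
+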
65
+ ---
66
+
67
+ ## 3. Models (`src/utils/models.py`)
68
+
69
+ ```python
70
+ """Data models for the Search feature."""
71
+ from pydantic import BaseModel, Field
72
+ from typing import Literal
73
+
74
+
75
+ class Citation(BaseModel):
76
+ """A citation to a source document."""
77
+
78
+ source: Literal["pubmed", "web"] = Field(description="Where this came from")
79
+ title: str = Field(min_length=1, max_length=500)
80
+ url: str = Field(description="URL to the source")
81
+ date: str = Field(description="Publication date (YYYY-MM-DD or 'Unknown')")
82
+ authors: list[str] = Field(default_factory=list)
83
+
84
+ @property
85
+ def formatted(self) -> str:
86
+ """Format as a citation string."""
87
+ author_str = ", ".join(self.authors[:3])
88
+ if len(self.authors) > 3:
89
+ author_str += " et al."
90
+ return f"{author_str} ({self.date}). {self.title}. {self.source.upper()}"
91
+
92
+
93
+ class Evidence(BaseModel):
94
+ """A piece of evidence retrieved from search."""
95
+
96
+ content: str = Field(min_length=1, description="The actual text content")
97
+ citation: Citation
98
+ relevance: float = Field(default=0.0, ge=0.0, le=1.0, description="Relevance score 0-1")
99
+
100
+ class Config:
101
+ frozen = True # Immutable after creation
102
+
103
+
104
+ class SearchResult(BaseModel):
105
+ """Result of a search operation."""
106
+
107
+ query: str
108
+ evidence: list[Evidence]
109
+ sources_searched: list[str]  # names of the tools that were queried, e.g. "pubmed", "web"
110
+ total_found: int
111
+ errors: list[str] = Field(default_factory=list)
112
+ ```
113
+
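+ A small usage sketch (values are illustrative) showing how the three models compose:
+
+ ```python
+ from src.utils.models import Citation, Evidence, SearchResult
+
+ ev = Evidence(
+     content="Metformin shows neuroprotective properties in murine models...",
+     citation=Citation(
+         source="pubmed",
+         title="Metformin in Alzheimer's Disease",
+         url="https://pubmed.ncbi.nlm.nih.gov/12345678/",
+         date="2024-01-01",
+         authors=["Smith John", "Doe Jane"],
+     ),
+     relevance=0.8,
+ )
+ print(ev.citation.formatted)
+ # Smith John, Doe Jane (2024-01-01). Metformin in Alzheimer's Disease. PUBMED
+
+ result = SearchResult(query="metformin alzheimer", evidence=[ev], sources_searched=["pubmed"], total_found=1)
+ ```
+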
114
+ ---
115
+
116
+ ## 4. Tool Protocol (`src/tools/pubmed.py` and `src/tools/websearch.py`)
117
+
118
+ ### The Interface (Protocol) - Add to `src/tools/__init__.py`
119
+
120
+ ```python
121
+ """Search tools package."""
122
+ from typing import Protocol, List
+
+ from src.utils.models import Evidence  # resolves the Evidence annotation used below
123
+
124
+ # Import implementations
125
+ from src.tools.pubmed import PubMedTool
126
+ from src.tools.websearch import WebTool
127
+ from src.tools.search_handler import SearchHandler
128
+
129
+ # Re-export
130
+ __all__ = ["SearchTool", "PubMedTool", "WebTool", "SearchHandler"]
131
+
132
+
133
+ class SearchTool(Protocol):
134
+ """Protocol defining the interface for all search tools."""
135
+
136
+ @property
137
+ def name(self) -> str:
138
+ """Human-readable name of this tool."""
139
+ ...
140
+
141
+ async def search(self, query: str, max_results: int = 10) -> List["Evidence"]:
142
+ """
143
+ Execute a search and return evidence.
144
+
145
+ Args:
146
+ query: The search query string
147
+ max_results: Maximum number of results to return
148
+
149
+ Returns:
150
+ List of Evidence objects
151
+
152
+ Raises:
153
+ SearchError: If the search fails
154
+ RateLimitError: If we hit rate limits
155
+ """
156
+ ...
157
+ ```
158
+
159
+ ### PubMed Tool Implementation (`src/tools/pubmed.py`)
160
+
161
+ ```python
162
+ """PubMed search tool using NCBI E-utilities."""
163
+ import asyncio
164
+ import httpx
165
+ import xmltodict
166
+ from typing import List
167
+ from tenacity import retry, stop_after_attempt, wait_exponential
168
+
169
+ from src.utils.config import settings
170
+ from src.utils.exceptions import SearchError, RateLimitError
171
+ from src.utils.models import Evidence, Citation
172
+
173
+
174
+ class PubMedTool:
175
+ """Search tool for PubMed/NCBI."""
176
+
177
+ BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
178
+ RATE_LIMIT_DELAY = 0.34 # ~3 requests/sec without API key
179
+
180
+ def __init__(self, api_key: str | None = None):
181
+ self.api_key = api_key or getattr(settings, "ncbi_api_key", None)
182
+ self._last_request_time = 0.0
183
+
184
+ @property
185
+ def name(self) -> str:
186
+ return "pubmed"
187
+
188
+ async def _rate_limit(self) -> None:
189
+ """Enforce NCBI rate limiting."""
190
+ now = asyncio.get_event_loop().time()
191
+ elapsed = now - self._last_request_time
192
+ if elapsed < self.RATE_LIMIT_DELAY:
193
+ await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed)
194
+ self._last_request_time = asyncio.get_event_loop().time()
195
+
196
+ def _build_params(self, **kwargs) -> dict:
197
+ """Build request params with optional API key."""
198
+ params = {**kwargs, "retmode": "json"}
199
+ if self.api_key:
200
+ params["api_key"] = self.api_key
201
+ return params
202
+
203
+ @retry(
204
+ stop=stop_after_attempt(3),
205
+ wait=wait_exponential(multiplier=1, min=1, max=10),
206
+ reraise=True,
207
+ )
208
+ async def search(self, query: str, max_results: int = 10) -> List[Evidence]:
209
+ """
210
+ Search PubMed and return evidence.
211
+
212
+ 1. ESearch: Get PMIDs matching query
213
+ 2. EFetch: Get abstracts for those PMIDs
214
+ 3. Parse and return Evidence objects
215
+ """
216
+ await self._rate_limit()
217
+
218
+ async with httpx.AsyncClient(timeout=30.0) as client:
219
+ # Step 1: Search for PMIDs
220
+ search_params = self._build_params(
221
+ db="pubmed",
222
+ term=query,
223
+ retmax=max_results,
224
+ sort="relevance",
225
+ )
226
+
227
+ try:
228
+ search_resp = await client.get(
229
+ f"{self.BASE_URL}/esearch.fcgi",
230
+ params=search_params,
231
+ )
232
+ search_resp.raise_for_status()
233
+ except httpx.HTTPStatusError as e:
234
+ if e.response.status_code == 429:
235
+ raise RateLimitError("PubMed rate limit exceeded")
236
+ raise SearchError(f"PubMed search failed: {e}")
237
+
238
+ search_data = search_resp.json()
239
+ pmids = search_data.get("esearchresult", {}).get("idlist", [])
240
+
241
+ if not pmids:
242
+ return []
243
+
244
+ # Step 2: Fetch abstracts
245
+ await self._rate_limit()
246
+ fetch_params = self._build_params(
247
+ db="pubmed",
248
+ id=",".join(pmids),
249
+ rettype="abstract",
250
+ )
251
+ # Use XML for fetch (more reliable parsing)
252
+ fetch_params["retmode"] = "xml"
253
+
254
+ fetch_resp = await client.get(
255
+ f"{self.BASE_URL}/efetch.fcgi",
256
+ params=fetch_params,
257
+ )
258
+ fetch_resp.raise_for_status()
259
+
260
+ # Step 3: Parse XML to Evidence
261
+ return self._parse_pubmed_xml(fetch_resp.text)
262
+
263
+ def _parse_pubmed_xml(self, xml_text: str) -> List[Evidence]:
264
+ """Parse PubMed XML into Evidence objects."""
265
+ try:
266
+ data = xmltodict.parse(xml_text)
267
+ except Exception as e:
268
+ raise SearchError(f"Failed to parse PubMed XML: {e}")
269
+
270
+ articles = data.get("PubmedArticleSet", {}).get("PubmedArticle", [])
271
+
272
+ # Handle single article (xmltodict returns dict instead of list)
273
+ if isinstance(articles, dict):
274
+ articles = [articles]
275
+
276
+ evidence_list = []
277
+ for article in articles:
278
+ try:
279
+ evidence = self._article_to_evidence(article)
280
+ if evidence:
281
+ evidence_list.append(evidence)
282
+ except Exception:
283
+ continue # Skip malformed articles
284
+
285
+ return evidence_list
286
+
287
+ def _article_to_evidence(self, article: dict) -> Evidence | None:
288
+ """Convert a single PubMed article to Evidence."""
289
+ medline = article.get("MedlineCitation", {})
290
+ article_data = medline.get("Article", {})
291
+
292
+ # Extract PMID
293
+ pmid = medline.get("PMID", {})
294
+ if isinstance(pmid, dict):
295
+ pmid = pmid.get("#text", "")
296
+
297
+ # Extract title
298
+ title = article_data.get("ArticleTitle", "")
299
+ if isinstance(title, dict):
300
+ title = title.get("#text", str(title))
301
+
302
+ # Extract abstract
303
+ abstract_data = article_data.get("Abstract", {}).get("AbstractText", "")
304
+ if isinstance(abstract_data, list):
305
+ abstract = " ".join(
306
+ item.get("#text", str(item)) if isinstance(item, dict) else str(item)
307
+ for item in abstract_data
308
+ )
309
+ elif isinstance(abstract_data, dict):
310
+ abstract = abstract_data.get("#text", str(abstract_data))
311
+ else:
312
+ abstract = str(abstract_data)
313
+
314
+ if not abstract or not title:
315
+ return None
316
+
317
+ # Extract date
318
+ pub_date = article_data.get("Journal", {}).get("JournalIssue", {}).get("PubDate", {})
319
+ year = pub_date.get("Year", "Unknown")
320
+ month = pub_date.get("Month", "01")
321
+ day = pub_date.get("Day", "01")
322
+ date_str = f"{year}-{month}-{day}" if year != "Unknown" else "Unknown"
323
+
324
+ # Extract authors
325
+ author_list = article_data.get("AuthorList", {}).get("Author", [])
326
+ if isinstance(author_list, dict):
327
+ author_list = [author_list]
328
+ authors = []
329
+ for author in author_list[:5]: # Limit to 5 authors
330
+ last = author.get("LastName", "")
331
+ first = author.get("ForeName", "")
332
+ if last:
333
+ authors.append(f"{last} {first}".strip())
334
+
335
+ return Evidence(
336
+ content=abstract[:2000], # Truncate long abstracts
337
+ citation=Citation(
338
+ source="pubmed",
339
+ title=title[:500],
340
+ url=f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/",
341
+ date=date_str,
342
+ authors=authors,
343
+ ),
344
+ )
345
+ ```
346
+
347
+ ### DuckDuckGo Tool Implementation (`src/tools/websearch.py`)
348
+
349
+ ```python
350
+ """Web search tool using DuckDuckGo."""
351
+ from typing import List
352
+ from duckduckgo_search import DDGS
353
+
354
+ from src.utils.exceptions import SearchError
355
+ from src.utils.models import Evidence, Citation
356
+
357
+
358
+ class WebTool:
359
+ """Search tool for general web search via DuckDuckGo."""
360
+
361
+ def __init__(self):
362
+ pass
363
+
364
+ @property
365
+ def name(self) -> str:
366
+ return "web"
367
+
368
+ async def search(self, query: str, max_results: int = 10) -> List[Evidence]:
369
+ """
370
+ Search DuckDuckGo and return evidence.
371
+
372
+ Note: duckduckgo-search is synchronous, so we run it in executor.
373
+ """
374
+ import asyncio
375
+
376
+ loop = asyncio.get_event_loop()
377
+ try:
378
+ results = await loop.run_in_executor(
379
+ None,
380
+ lambda: self._sync_search(query, max_results),
381
+ )
382
+ return results
383
+ except Exception as e:
384
+ raise SearchError(f"Web search failed: {e}")
385
+
386
+ def _sync_search(self, query: str, max_results: int) -> List[Evidence]:
387
+ """Synchronous search implementation."""
388
+ evidence_list = []
389
+
390
+ with DDGS() as ddgs:
391
+ results = list(ddgs.text(query, max_results=max_results))
392
+
393
+ for result in results:
394
+ evidence_list.append(
395
+ Evidence(
396
+ content=result.get("body", "")[:1000],
397
+ citation=Citation(
398
+ source="web",
399
+ title=result.get("title", "Unknown")[:500],
400
+ url=result.get("href", ""),
401
+ date="Unknown",
402
+ authors=[],
403
+ ),
404
+ )
405
+ )
406
+
407
+ return evidence_list
408
+ ```
409
+
410
+ ---
411
+
412
+ ## 5. Search Handler (`src/tools/search_handler.py`)
413
+
414
+ The handler orchestrates multiple tools using the **Scatter-Gather** pattern.
415
+
416
+ ```python
417
+ """Search handler - orchestrates multiple search tools."""
418
+ import asyncio
419
+ from typing import List, Protocol
420
+ import structlog
421
+
422
+ from src.utils.exceptions import SearchError
423
+ from src.utils.models import Evidence, SearchResult
424
+
425
+ logger = structlog.get_logger()
426
+
427
+
428
+ class SearchTool(Protocol):
429
+ """Protocol defining the interface for all search tools."""
430
+
431
+ @property
432
+ def name(self) -> str:
433
+ ...
434
+
435
+ async def search(self, query: str, max_results: int = 10) -> List[Evidence]:
436
+ ...
437
+
438
+
439
+ def flatten(nested: List[List[Evidence]]) -> List[Evidence]:
440
+ """Flatten a list of lists into a single list."""
441
+ return [item for sublist in nested for item in sublist]
442
+
443
+
444
+ class SearchHandler:
445
+ """Orchestrates parallel searches across multiple tools."""
446
+
447
+ def __init__(self, tools: List[SearchTool], timeout: float = 30.0):
448
+ """
449
+ Initialize the search handler.
450
+
451
+ Args:
452
+ tools: List of search tools to use
453
+ timeout: Timeout for each search in seconds
454
+ """
455
+ self.tools = tools
456
+ self.timeout = timeout
457
+
458
+ async def execute(self, query: str, max_results_per_tool: int = 10) -> SearchResult:
459
+ """
460
+ Execute search across all tools in parallel.
461
+
462
+ Args:
463
+ query: The search query
464
+ max_results_per_tool: Max results from each tool
465
+
466
+ Returns:
467
+ SearchResult containing all evidence and metadata
468
+ """
469
+ logger.info("Starting search", query=query, tools=[t.name for t in self.tools])
470
+
471
+ # Create tasks for parallel execution
472
+ tasks = [
473
+ self._search_with_timeout(tool, query, max_results_per_tool)
474
+ for tool in self.tools
475
+ ]
476
+
477
+ # Gather results (don't fail if one tool fails)
478
+ results = await asyncio.gather(*tasks, return_exceptions=True)
479
+
480
+ # Process results
481
+ all_evidence: List[Evidence] = []
482
+ sources_searched: List[str] = []
483
+ errors: List[str] = []
484
+
485
+ for tool, result in zip(self.tools, results):
486
+ if isinstance(result, Exception):
487
+ errors.append(f"{tool.name}: {str(result)}")
488
+ logger.warning("Search tool failed", tool=tool.name, error=str(result))
489
+ else:
490
+ all_evidence.extend(result)
491
+ sources_searched.append(tool.name)
492
+ logger.info("Search tool succeeded", tool=tool.name, count=len(result))
493
+
494
+ return SearchResult(
495
+ query=query,
496
+ evidence=all_evidence,
497
+ sources_searched=sources_searched,
498
+ total_found=len(all_evidence),
499
+ errors=errors,
500
+ )
501
+
502
+ async def _search_with_timeout(
503
+ self,
504
+ tool: SearchTool,
505
+ query: str,
506
+ max_results: int,
507
+ ) -> List[Evidence]:
508
+ """Execute a single tool search with timeout."""
509
+ try:
510
+ return await asyncio.wait_for(
511
+ tool.search(query, max_results),
512
+ timeout=self.timeout,
513
+ )
514
+ except asyncio.TimeoutError:
515
+ raise SearchError(f"{tool.name} search timed out after {self.timeout}s")
516
+ ```
517
+
518
+ ---
519
+
520
+ ## 6. TDD Workflow
521
+
522
+ ### Test File: `tests/unit/tools/test_pubmed.py`
523
+
524
+ ```python
525
+ """Unit tests for PubMed tool."""
526
+ import pytest
527
+ from unittest.mock import AsyncMock, MagicMock
528
+
529
+
530
+ # Sample PubMed XML response for mocking
531
+ SAMPLE_PUBMED_XML = """<?xml version="1.0" ?>
532
+ <PubmedArticleSet>
533
+ <PubmedArticle>
534
+ <MedlineCitation>
535
+ <PMID>12345678</PMID>
536
+ <Article>
537
+ <ArticleTitle>Metformin in Alzheimer's Disease: A Systematic Review</ArticleTitle>
538
+ <Abstract>
539
+ <AbstractText>Metformin shows neuroprotective properties...</AbstractText>
540
+ </Abstract>
541
+ <AuthorList>
542
+ <Author>
543
+ <LastName>Smith</LastName>
544
+ <ForeName>John</ForeName>
545
+ </Author>
546
+ </AuthorList>
547
+ <Journal>
548
+ <JournalIssue>
549
+ <PubDate>
550
+ <Year>2024</Year>
551
+ <Month>01</Month>
552
+ </PubDate>
553
+ </JournalIssue>
554
+ </Journal>
555
+ </Article>
556
+ </MedlineCitation>
557
+ </PubmedArticle>
558
+ </PubmedArticleSet>
559
+ """
560
+
561
+
562
+ class TestPubMedTool:
563
+ """Tests for PubMedTool."""
564
+
565
+ @pytest.mark.asyncio
566
+ async def test_search_returns_evidence(self, mocker):
567
+ """PubMedTool should return Evidence objects from search."""
568
+ from src.tools.pubmed import PubMedTool
569
+
570
+ # Mock the HTTP responses
571
+ mock_search_response = MagicMock()
572
+ mock_search_response.json.return_value = {
573
+ "esearchresult": {"idlist": ["12345678"]}
574
+ }
575
+ mock_search_response.raise_for_status = MagicMock()
576
+
577
+ mock_fetch_response = MagicMock()
578
+ mock_fetch_response.text = SAMPLE_PUBMED_XML
579
+ mock_fetch_response.raise_for_status = MagicMock()
580
+
581
+ mock_client = AsyncMock()
582
+ mock_client.get = AsyncMock(side_effect=[mock_search_response, mock_fetch_response])
583
+ mock_client.__aenter__ = AsyncMock(return_value=mock_client)
584
+ mock_client.__aexit__ = AsyncMock(return_value=None)
585
+
586
+ mocker.patch("httpx.AsyncClient", return_value=mock_client)
587
+
588
+ # Act
589
+ tool = PubMedTool()
590
+ results = await tool.search("metformin alzheimer")
591
+
592
+ # Assert
593
+ assert len(results) == 1
594
+ assert results[0].citation.source == "pubmed"
595
+ assert "Metformin" in results[0].citation.title
596
+ assert "12345678" in results[0].citation.url
597
+
598
+ @pytest.mark.asyncio
599
+ async def test_search_empty_results(self, mocker):
600
+ """PubMedTool should return empty list when no results."""
601
+ from src.tools.pubmed import PubMedTool
602
+
603
+ mock_response = MagicMock()
604
+ mock_response.json.return_value = {"esearchresult": {"idlist": []}}
605
+ mock_response.raise_for_status = MagicMock()
606
+
607
+ mock_client = AsyncMock()
608
+ mock_client.get = AsyncMock(return_value=mock_response)
609
+ mock_client.__aenter__ = AsyncMock(return_value=mock_client)
610
+ mock_client.__aexit__ = AsyncMock(return_value=None)
611
+
612
+ mocker.patch("httpx.AsyncClient", return_value=mock_client)
613
+
614
+ tool = PubMedTool()
615
+ results = await tool.search("xyznonexistentquery123")
616
+
617
+ assert results == []
618
+
619
+ def test_parse_pubmed_xml(self):
620
+ """PubMedTool should correctly parse XML."""
621
+ from src.tools.pubmed import PubMedTool
622
+
623
+ tool = PubMedTool()
624
+ results = tool._parse_pubmed_xml(SAMPLE_PUBMED_XML)
625
+
626
+ assert len(results) == 1
627
+ assert results[0].citation.source == "pubmed"
628
+ assert "Smith John" in results[0].citation.authors
629
+ ```
630
+
631
+ ### Test File: `tests/unit/tools/test_websearch.py`
632
+
633
+ ```python
634
+ """Unit tests for WebTool."""
635
+ import pytest
636
+ from unittest.mock import MagicMock
637
+
638
+
639
+ class TestWebTool:
640
+ """Tests for WebTool."""
641
+
642
+ @pytest.mark.asyncio
643
+ async def test_search_returns_evidence(self, mocker):
644
+ """WebTool should return Evidence objects from search."""
645
+ from src.tools.websearch import WebTool
646
+
647
+ mock_results = [
648
+ {
649
+ "title": "Drug Repurposing Article",
650
+ "href": "https://example.com/article",
651
+ "body": "Some content about drug repurposing...",
652
+ }
653
+ ]
654
+
655
+ mock_ddgs = MagicMock()
656
+ mock_ddgs.__enter__ = MagicMock(return_value=mock_ddgs)
657
+ mock_ddgs.__exit__ = MagicMock(return_value=None)
658
+ mock_ddgs.text = MagicMock(return_value=mock_results)
659
+
660
+ mocker.patch("src.tools.websearch.DDGS", return_value=mock_ddgs)
661
+
662
+ tool = WebTool()
663
+ results = await tool.search("drug repurposing")
664
+
665
+ assert len(results) == 1
666
+ assert results[0].citation.source == "web"
667
+ assert "Drug Repurposing" in results[0].citation.title
668
+ ```
669
+
670
+ ### Test File: `tests/unit/tools/test_search_handler.py`
671
+
672
+ ```python
673
+ """Unit tests for SearchHandler."""
674
+ import pytest
675
+ from unittest.mock import AsyncMock
676
+
677
+ from src.utils.models import Evidence, Citation
678
+ from src.utils.exceptions import SearchError
679
+
680
+
681
+ class TestSearchHandler:
682
+ """Tests for SearchHandler."""
683
+
684
+ @pytest.mark.asyncio
685
+ async def test_execute_aggregates_results(self):
686
+ """SearchHandler should aggregate results from all tools."""
687
+ from src.tools.search_handler import SearchHandler
688
+
689
+ # Create mock tools
690
+ mock_tool_1 = AsyncMock()
691
+ mock_tool_1.name = "mock1"
692
+ mock_tool_1.search = AsyncMock(return_value=[
693
+ Evidence(
694
+ content="Result 1",
695
+ citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
696
+ )
697
+ ])
698
+
699
+ mock_tool_2 = AsyncMock()
700
+ mock_tool_2.name = "mock2"
701
+ mock_tool_2.search = AsyncMock(return_value=[
702
+ Evidence(
703
+ content="Result 2",
704
+ citation=Citation(source="web", title="T2", url="u2", date="2024"),
705
+ )
706
+ ])
707
+
708
+ handler = SearchHandler(tools=[mock_tool_1, mock_tool_2])
709
+ result = await handler.execute("test query")
710
+
711
+ assert result.total_found == 2
712
+ assert "mock1" in result.sources_searched
713
+ assert "mock2" in result.sources_searched
714
+ assert len(result.errors) == 0
715
+
716
+ @pytest.mark.asyncio
717
+ async def test_execute_handles_tool_failure(self):
718
+ """SearchHandler should continue if one tool fails."""
719
+ from src.tools.search_handler import SearchHandler
720
+
721
+ mock_tool_ok = AsyncMock()
722
+ mock_tool_ok.name = "ok_tool"
723
+ mock_tool_ok.search = AsyncMock(return_value=[
724
+ Evidence(
725
+ content="Good result",
726
+ citation=Citation(source="pubmed", title="T", url="u", date="2024"),
727
+ )
728
+ ])
729
+
730
+ mock_tool_fail = AsyncMock()
731
+ mock_tool_fail.name = "fail_tool"
732
+ mock_tool_fail.search = AsyncMock(side_effect=SearchError("API down"))
733
+
734
+ handler = SearchHandler(tools=[mock_tool_ok, mock_tool_fail])
735
+ result = await handler.execute("test")
736
+
737
+ assert result.total_found == 1
738
+ assert "ok_tool" in result.sources_searched
739
+ assert len(result.errors) == 1
740
+ assert "fail_tool" in result.errors[0]
741
+ ```
742
+
743
+ ---
744
+
745
+ ## 7. Integration Test (Optional, Real API)
746
+
747
+ ```python
748
+ # tests/integration/test_pubmed_live.py
749
+ """Integration tests that hit real APIs (run manually)."""
750
+ import pytest
751
+
752
+
753
+ @pytest.mark.integration
754
+ @pytest.mark.slow
755
+ @pytest.mark.asyncio
756
+ async def test_pubmed_live_search():
757
+ """Test real PubMed search (requires network)."""
758
+ from src.tools.pubmed import PubMedTool
759
+
760
+ tool = PubMedTool()
761
+ results = await tool.search("metformin diabetes", max_results=3)
762
+
763
+ assert len(results) > 0
764
+ assert results[0].citation.source == "pubmed"
765
+ assert "pubmed.ncbi.nlm.nih.gov" in results[0].citation.url
766
+
767
+
768
+ # Run with: uv run pytest tests/integration -m integration
769
+ ```
770
+
771
+ ---
772
+
773
+ ## 8. Implementation Checklist
774
+
775
+ - [x] Create `src/utils/models.py` with all Pydantic models (Evidence, Citation, SearchResult) - **COMPLETE**
776
+ - [x] Create `src/tools/__init__.py` with SearchTool Protocol and exports - **COMPLETE**
777
+ - [x] Implement `src/tools/pubmed.py` with PubMedTool class - **COMPLETE**
778
+ - [ ] ~~Implement `src/tools/websearch.py` with WebTool class~~ - **REMOVED** (replaced by Europe PMC in Phase 11)
779
+ - [x] Create `src/tools/search_handler.py` with SearchHandler class - **COMPLETE**
780
+ - [x] Write tests in `tests/unit/tools/test_pubmed.py` - **COMPLETE** (basic tests)
781
+ - [ ] Write tests in `tests/unit/tools/test_websearch.py` - **N/A** (WebTool removed)
782
+ - [x] Write tests in `tests/unit/tools/test_search_handler.py` - **COMPLETE** (basic tests)
783
+ - [x] Run `uv run pytest tests/unit/tools/ -v` — **ALL TESTS MUST PASS** - **PASSING**
784
+ - [ ] (Optional) Run integration test: `uv run pytest -m integration`
785
+ - [ ] Add edge case tests (rate limiting, error handling, timeouts) - **PENDING**
786
+ - [x] Commit: `git commit -m "feat: phase 2 search slice complete"` - **DONE**
787
+
788
+ **Post-Phase 2 Enhancements**:
789
+ - [x] Query preprocessing (`src/tools/query_utils.py`) - **ADDED**
790
+ - [x] Europe PMC tool (Phase 11) - **ADDED**
791
+ - [x] ClinicalTrials tool (Phase 10) - **ADDED**
792
+
793
+ ---
794
+
795
+ ## 9. Definition of Done
796
+
797
+ Phase 2 is **COMPLETE** when:
798
+
799
+ 1. ✅ All unit tests pass: `uv run pytest tests/unit/tools/ -v` - **PASSING**
800
+ 2. ✅ `SearchHandler` can execute with search tools - **WORKING**
801
+ 3. ✅ Graceful degradation: if one tool fails, other tools still return results - **IMPLEMENTED**
802
+ 4. ✅ Rate limiting is enforced (verify no 429 errors) - **IMPLEMENTED**
803
+ 5. ✅ Can run this in Python REPL:
804
+
805
+ ```python
806
+ import asyncio
807
+ from src.tools.pubmed import PubMedTool
808
+ from src.tools.search_handler import SearchHandler
809
+
810
+ async def test():
811
+ handler = SearchHandler([PubMedTool()])
812
+ result = await handler.execute("metformin alzheimer")
813
+ print(f"Found {result.total_found} results")
814
+ for e in result.evidence[:3]:
815
+ print(f"- {e.citation.title}")
816
+
817
+ asyncio.run(test())
818
+ ```
819
+
820
+ **Note**: WebTool was removed in favor of Europe PMC (Phase 11). The current implementation uses PubMed as the primary Phase 2 tool, with Europe PMC and ClinicalTrials added in later phases.
821
+
822
+ **Proceed to Phase 3 ONLY after all checkboxes are complete.**
docs/implementation/03_phase_judge.md ADDED
@@ -0,0 +1,1052 @@
1
+ # Phase 3 Implementation Spec: Judge Vertical Slice
2
+
3
+ **Goal**: Implement the "Brain" of the agent — evaluating evidence quality.
4
+ **Philosophy**: "Structured Output or Bust."
5
+ **Prerequisite**: Phase 2 complete (all search tests passing)
6
+
7
+ ---
8
+
9
+ ## 1. The Slice Definition
10
+
11
+ This slice covers:
12
+ 1. **Input**: A user question + a list of `Evidence` (from Phase 2).
13
+ 2. **Process**:
14
+ - Construct a prompt with the evidence.
15
+ - Call LLM (PydanticAI / OpenAI / Anthropic).
16
+ - Force JSON structured output.
17
+ 3. **Output**: A `JudgeAssessment` object.
18
+
19
+ **Files to Create**:
20
+ - `src/utils/models.py` - Add JudgeAssessment models (extend from Phase 2)
21
+ - `src/prompts/judge.py` - Judge prompt templates
22
+ - `src/agent_factory/judges.py` - JudgeHandler with PydanticAI
23
+ - `tests/unit/agent_factory/test_judges.py` - Unit tests
24
+
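+ End to end, the slice boils down to one call (a sketch; `JudgeHandler` is implemented in section 4, `evidence_list` comes from the Phase 2 `SearchHandler`, and the call runs inside an async context):
+
+ ```python
+ from src.agent_factory.judges import JudgeHandler
+
+ handler = JudgeHandler()
+ assessment = await handler.assess(question="metformin alzheimer", evidence=evidence_list)
+ if assessment.recommendation == "synthesize":
+     ...  # enough evidence - move on to synthesis
+ else:
+     ...  # feed assessment.next_search_queries back into search
+ ```
+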
25
+ ---
26
+
27
+ ## 2. Models (Add to `src/utils/models.py`)
28
+
29
+ The output schema must be strict for reliable structured output.
30
+
31
+ ```python
32
+ """Add these models to src/utils/models.py (after Evidence models from Phase 2)."""
33
+ from pydantic import BaseModel, Field
34
+ from typing import List, Literal
35
+
36
+
37
+ class AssessmentDetails(BaseModel):
38
+ """Detailed assessment of evidence quality."""
39
+
40
+ mechanism_score: int = Field(
41
+ ...,
42
+ ge=0,
43
+ le=10,
44
+ description="How well does the evidence explain the mechanism? 0-10"
45
+ )
46
+ mechanism_reasoning: str = Field(
47
+ ...,
48
+ min_length=10,
49
+ description="Explanation of mechanism score"
50
+ )
51
+ clinical_evidence_score: int = Field(
52
+ ...,
53
+ ge=0,
54
+ le=10,
55
+ description="Strength of clinical/preclinical evidence. 0-10"
56
+ )
57
+ clinical_reasoning: str = Field(
58
+ ...,
59
+ min_length=10,
60
+ description="Explanation of clinical evidence score"
61
+ )
62
+ drug_candidates: List[str] = Field(
63
+ default_factory=list,
64
+ description="List of specific drug candidates mentioned"
65
+ )
66
+ key_findings: List[str] = Field(
67
+ default_factory=list,
68
+ description="Key findings from the evidence"
69
+ )
70
+
71
+
72
+ class JudgeAssessment(BaseModel):
73
+ """Complete assessment from the Judge."""
74
+
75
+ details: AssessmentDetails
76
+ sufficient: bool = Field(
77
+ ...,
78
+ description="Is evidence sufficient to provide a recommendation?"
79
+ )
80
+ confidence: float = Field(
81
+ ...,
82
+ ge=0.0,
83
+ le=1.0,
84
+ description="Confidence in the assessment (0-1)"
85
+ )
86
+ recommendation: Literal["continue", "synthesize"] = Field(
87
+ ...,
88
+ description="continue = need more evidence, synthesize = ready to answer"
89
+ )
90
+ next_search_queries: List[str] = Field(
91
+ default_factory=list,
92
+ description="If continue, what queries to search next"
93
+ )
94
+ reasoning: str = Field(
95
+ ...,
96
+ min_length=20,
97
+ description="Overall reasoning for the recommendation"
98
+ )
99
+ ```
100
+
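+ A quick schema check (illustrative values; assumes Pydantic v2, which the rest of this spec uses). This is what "Structured Output or Bust" rests on: malformed or out-of-range judge output fails validation instead of propagating silently:
+
+ ```python
+ from src.utils.models import JudgeAssessment
+
+ raw = """{
+   "details": {
+     "mechanism_score": 8,
+     "mechanism_reasoning": "AMPK activation gives a plausible neuroprotective mechanism.",
+     "clinical_evidence_score": 6,
+     "clinical_reasoning": "Observational cohorts only; no completed phase III trials.",
+     "drug_candidates": ["metformin"],
+     "key_findings": ["Lower dementia incidence in treated diabetic cohorts"]
+   },
+   "sufficient": true,
+   "confidence": 0.8,
+   "recommendation": "synthesize",
+   "next_search_queries": [],
+   "reasoning": "Mechanism and clinical evidence are consistent across the retrieved sources."
+ }"""
+
+ assessment = JudgeAssessment.model_validate_json(raw)  # raises ValidationError on bad fields
+ assert assessment.recommendation == "synthesize"
+ ```
+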
101
+ ---
102
+
103
+ ## 3. Prompt Engineering (`src/prompts/judge.py`)
104
+
105
+ We treat prompts as code. They should be versioned and clean.
106
+
107
+ ```python
108
+ """Judge prompts for evidence assessment."""
109
+ from typing import List
110
+ from src.utils.models import Evidence
111
+
112
+
113
+ SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
114
+
115
+ Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to recommend drug candidates for a given condition.
116
+
117
+ ## Evaluation Criteria
118
+
119
+ 1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
120
+ - 0-3: No clear mechanism, speculative
121
+ - 4-6: Some mechanistic insight, but gaps exist
122
+ - 7-10: Clear, well-supported mechanism of action
123
+
124
+ 2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
125
+ - 0-3: No clinical data, only theoretical
126
+ - 4-6: Preclinical or early clinical data
127
+ - 7-10: Strong clinical evidence (trials, meta-analyses)
128
+
129
+ 3. **Sufficiency**: Evidence is sufficient when:
130
+ - Combined scores >= 12 AND
131
+ - At least one specific drug candidate identified AND
132
+ - Clear mechanistic rationale exists
133
+
134
+ ## Output Rules
135
+
136
+ - Always output valid JSON matching the schema
137
+ - Be conservative: only recommend "synthesize" when truly confident
138
+ - If continuing, suggest specific, actionable search queries
139
+ - Never hallucinate drug names or findings not in the evidence
140
+ """
141
+
142
+
143
+ def format_user_prompt(question: str, evidence: List[Evidence]) -> str:
144
+ """
145
+ Format the user prompt with question and evidence.
146
+
147
+ Args:
148
+ question: The user's research question
149
+ evidence: List of Evidence objects from search
150
+
151
+ Returns:
152
+ Formatted prompt string
153
+ """
154
+ evidence_text = "\n\n".join([
155
+ f"### Evidence {i+1}\n"
156
+ f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
157
+ f"**URL**: {e.citation.url}\n"
158
+ f"**Date**: {e.citation.date}\n"
159
+ f"**Content**:\n{e.content[:1500]}..."
160
+ if len(e.content) > 1500 else
161
+ f"### Evidence {i+1}\n"
162
+ f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
163
+ f"**URL**: {e.citation.url}\n"
164
+ f"**Date**: {e.citation.date}\n"
165
+ f"**Content**:\n{e.content}"
166
+ for i, e in enumerate(evidence)
167
+ ])
168
+
169
+ return f"""## Research Question
170
+ {question}
171
+
172
+ ## Available Evidence ({len(evidence)} sources)
173
+
174
+ {evidence_text}
175
+
176
+ ## Your Task
177
+
178
+ Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates.
179
+ Respond with a JSON object matching the JudgeAssessment schema.
180
+ """
181
+
182
+
183
+ def format_empty_evidence_prompt(question: str) -> str:
184
+ """
185
+ Format prompt when no evidence was found.
186
+
187
+ Args:
188
+ question: The user's research question
189
+
190
+ Returns:
191
+ Formatted prompt string
192
+ """
193
+ return f"""## Research Question
194
+ {question}
195
+
196
+ ## Available Evidence
197
+
198
+ No evidence was found from the search.
199
+
200
+ ## Your Task
201
+
202
+ Since no evidence was found, recommend search queries that might yield better results.
203
+ Set sufficient=False and recommendation="continue".
204
+ Suggest 3-5 specific search queries.
205
+ """
206
+ ```
207
+
208
+ ---
209
+
210
+ ## 4. JudgeHandler Implementation (`src/agent_factory/judges.py`)
211
+
212
+ Using PydanticAI for structured output with retry logic.
213
+
214
+ ```python
215
+ """Judge handler for evidence assessment using PydanticAI."""
216
+ import os
217
+ import asyncio
+ import json
+ from typing import Any, List
+
+ # InferenceClient, json, asyncio and tenacity are used by HFInferenceJudgeHandler below
+ from huggingface_hub import InferenceClient
+ from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
218
+ import structlog
219
+ from pydantic_ai import Agent
220
+ from pydantic_ai.models.openai import OpenAIModel
221
+ from pydantic_ai.models.anthropic import AnthropicModel
222
+
223
+ from src.utils.models import Evidence, JudgeAssessment, AssessmentDetails
224
+ from src.utils.config import settings
225
+ from src.prompts.judge import SYSTEM_PROMPT, format_user_prompt, format_empty_evidence_prompt
226
+
227
+ logger = structlog.get_logger()
228
+
229
+
230
+ def get_model():
231
+ """Get the LLM model based on configuration."""
232
+ provider = getattr(settings, "llm_provider", "openai")
233
+
234
+ if provider == "anthropic":
235
+ return AnthropicModel(
236
+ model_name=getattr(settings, "anthropic_model", "claude-3-5-sonnet-20241022"),
237
+ api_key=os.getenv("ANTHROPIC_API_KEY"),
238
+ )
239
+ else:
240
+ return OpenAIModel(
241
+ model_name=getattr(settings, "openai_model", "gpt-4o"),
242
+ api_key=os.getenv("OPENAI_API_KEY"),
243
+ )
244
+
245
+
246
+ class JudgeHandler:
247
+ """
248
+ Handles evidence assessment using an LLM with structured output.
249
+
250
+ Uses PydanticAI to ensure responses match the JudgeAssessment schema.
251
+ """
252
+
253
+ def __init__(self, model=None):
254
+ """
255
+ Initialize the JudgeHandler.
256
+
257
+ Args:
258
+ model: Optional PydanticAI model. If None, uses config default.
259
+ """
260
+ self.model = model or get_model()
261
+ self.agent = Agent(
262
+ model=self.model,
263
+ result_type=JudgeAssessment,
264
+ system_prompt=SYSTEM_PROMPT,
265
+ retries=3,
266
+ )
267
+
268
+ async def assess(
269
+ self,
270
+ question: str,
271
+ evidence: List[Evidence],
272
+ ) -> JudgeAssessment:
273
+ """
274
+ Assess evidence and determine if it's sufficient.
275
+
276
+ Args:
277
+ question: The user's research question
278
+ evidence: List of Evidence objects from search
279
+
280
+ Returns:
281
+ JudgeAssessment with evaluation results
282
+
283
+ Raises:
284
+ JudgeError: If assessment fails after retries
285
+ """
286
+ logger.info(
287
+ "Starting evidence assessment",
288
+ question=question[:100],
289
+ evidence_count=len(evidence),
290
+ )
291
+
292
+ # Format the prompt based on whether we have evidence
293
+ if evidence:
294
+ user_prompt = format_user_prompt(question, evidence)
295
+ else:
296
+ user_prompt = format_empty_evidence_prompt(question)
297
+
298
+ try:
299
+ # Run the agent with structured output
300
+ result = await self.agent.run(user_prompt)
301
+ assessment = result.data
302
+
303
+ logger.info(
304
+ "Assessment complete",
305
+ sufficient=assessment.sufficient,
306
+ recommendation=assessment.recommendation,
307
+ confidence=assessment.confidence,
308
+ )
309
+
310
+ return assessment
311
+
312
+ except Exception as e:
313
+ logger.error("Assessment failed", error=str(e))
314
+ # Return a safe default assessment on failure
315
+ return self._create_fallback_assessment(question, str(e))
316
+
317
+ def _create_fallback_assessment(
318
+ self,
319
+ question: str,
320
+ error: str,
321
+ ) -> JudgeAssessment:
322
+ """
323
+ Create a fallback assessment when LLM fails.
324
+
325
+ Args:
326
+ question: The original question
327
+ error: The error message
328
+
329
+ Returns:
330
+ Safe fallback JudgeAssessment
331
+ """
332
+ return JudgeAssessment(
333
+ details=AssessmentDetails(
334
+ mechanism_score=0,
335
+ mechanism_reasoning="Assessment failed due to LLM error",
336
+ clinical_evidence_score=0,
337
+ clinical_reasoning="Assessment failed due to LLM error",
338
+ drug_candidates=[],
339
+ key_findings=[],
340
+ ),
341
+ sufficient=False,
342
+ confidence=0.0,
343
+ recommendation="continue",
344
+ next_search_queries=[
345
+ f"{question} mechanism",
346
+ f"{question} clinical trials",
347
+ f"{question} drug candidates",
348
+ ],
349
+ reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.",
350
+ )
351
+
352
+
353
+ class HFInferenceJudgeHandler:
354
+ """
355
+ JudgeHandler using HuggingFace Inference API for FREE LLM calls.
356
+
357
+ This is the DEFAULT for demo mode - provides real AI analysis without
358
+ requiring users to have OpenAI/Anthropic API keys.
359
+
360
+ Model Fallback Chain (handles gated models and rate limits):
361
+ 1. meta-llama/Llama-3.1-8B-Instruct (best quality, requires HF_TOKEN)
362
+ 2. mistralai/Mistral-7B-Instruct-v0.3 (good quality, may require token)
363
+ 3. HuggingFaceH4/zephyr-7b-beta (ungated, always works)
364
+
365
+ Rate Limit Handling:
366
+ - Exponential backoff with 3 retries
367
+ - Falls back to next model on persistent 429/503 errors
368
+ """
369
+
370
+ # Model fallback chain: gated (best) → ungated (fallback)
371
+ FALLBACK_MODELS = [
372
+ "meta-llama/Llama-3.1-8B-Instruct", # Best quality (gated)
373
+ "mistralai/Mistral-7B-Instruct-v0.3", # Good quality
374
+ "HuggingFaceH4/zephyr-7b-beta", # Ungated fallback
375
+ ]
376
+
377
+ def __init__(self, model_id: str | None = None) -> None:
378
+ """
379
+ Initialize with HF Inference client.
380
+
381
+ Args:
382
+ model_id: Optional specific model ID. If None, uses FALLBACK_MODELS chain.
383
+ """
384
+ self.model_id = model_id
385
+ # Will automatically use HF_TOKEN from env if available
386
+ self.client = InferenceClient()
387
+ self.call_count = 0
388
+ self.last_question: str | None = None
389
+ self.last_evidence: list[Evidence] | None = None
390
+
391
+ def _extract_json(self, text: str) -> dict[str, Any] | None:
392
+ """
393
+ Robust JSON extraction that handles markdown blocks and nested braces.
394
+ """
395
+ text = text.strip()
396
+
397
+ # Remove markdown code blocks if present (with bounds checking)
398
+ if "```json" in text:
399
+ parts = text.split("```json", 1)
400
+ if len(parts) > 1:
401
+ inner_parts = parts[1].split("```", 1)
402
+ text = inner_parts[0]
403
+ elif "```" in text:
404
+ parts = text.split("```", 1)
405
+ if len(parts) > 1:
406
+ inner_parts = parts[1].split("```", 1)
407
+ text = inner_parts[0]
408
+
409
+ text = text.strip()
410
+
411
+ # Find first '{'
412
+ start_idx = text.find("{")
413
+ if start_idx == -1:
414
+ return None
415
+
416
+ # Stack-based parsing ignoring chars in strings
417
+ count = 0
418
+ in_string = False
419
+ escape = False
420
+
421
+ for i, char in enumerate(text[start_idx:], start=start_idx):
422
+ if in_string:
423
+ if escape:
424
+ escape = False
425
+ elif char == "\\":
426
+ escape = True
427
+ elif char == '"':
428
+ in_string = False
429
+ elif char == '"':
430
+ in_string = True
431
+ elif char == "{":
432
+ count += 1
433
+ elif char == "}":
434
+ count -= 1
435
+ if count == 0:
436
+ try:
437
+ result = json.loads(text[start_idx : i + 1])
438
+ if isinstance(result, dict):
439
+ return result
440
+ return None
441
+ except json.JSONDecodeError:
442
+ return None
443
+
444
+ return None
445
+
446
+ @retry(
447
+ stop=stop_after_attempt(3),
448
+ wait=wait_exponential(multiplier=1, min=1, max=4),
449
+ retry=retry_if_exception_type(Exception),
450
+ reraise=True,
451
+ )
452
+ async def _call_with_retry(self, model: str, prompt: str, question: str) -> JudgeAssessment:
453
+ """Make API call with retry logic using chat_completion."""
454
+ loop = asyncio.get_running_loop()
455
+
456
+ # Build messages for chat_completion (model-agnostic)
457
+ messages = [
458
+ {
459
+ "role": "system",
460
+ "content": f"""{SYSTEM_PROMPT}
461
+
462
+ IMPORTANT: Respond with ONLY valid JSON matching this schema:
463
+ {{
464
+ "details": {{
465
+ "mechanism_score": <int 0-10>,
466
+ "mechanism_reasoning": "<string>",
467
+ "clinical_evidence_score": <int 0-10>,
468
+ "clinical_reasoning": "<string>",
469
+ "drug_candidates": ["<string>", ...],
470
+ "key_findings": ["<string>", ...]
471
+ }},
472
+ "sufficient": <bool>,
473
+ "confidence": <float 0-1>,
474
+ "recommendation": "continue" | "synthesize",
475
+ "next_search_queries": ["<string>", ...],
476
+ "reasoning": "<string>"
477
+ }}""",
478
+ },
479
+ {"role": "user", "content": prompt},
480
+ ]
481
+
482
+ # Use chat_completion (conversational task - supported by all models)
483
+ response = await loop.run_in_executor(
484
+ None,
485
+ lambda: self.client.chat_completion(
486
+ messages=messages,
487
+ model=model,
488
+ max_tokens=1024,
489
+ temperature=0.1,
490
+ ),
491
+ )
492
+
493
+ # Extract content from response
494
+ content = response.choices[0].message.content
495
+ if not content:
496
+ raise ValueError("Empty response from model")
497
+
498
+ # Extract and parse JSON
499
+ json_data = self._extract_json(content)
500
+ if not json_data:
501
+ raise ValueError("No valid JSON found in response")
502
+
503
+ return JudgeAssessment(**json_data)
504
+
505
+ async def assess(
506
+ self,
507
+ question: str,
508
+ evidence: list[Evidence],
509
+ ) -> JudgeAssessment:
510
+ """
511
+ Assess evidence using HuggingFace Inference API.
512
+ Attempts models in order until one succeeds.
513
+ """
514
+ self.call_count += 1
515
+ self.last_question = question
516
+ self.last_evidence = evidence
517
+
518
+ # Format the user prompt
519
+ if evidence:
520
+ user_prompt = format_user_prompt(question, evidence)
521
+ else:
522
+ user_prompt = format_empty_evidence_prompt(question)
523
+
524
+ models_to_try: list[str] = [self.model_id] if self.model_id else self.FALLBACK_MODELS
525
+ last_error: Exception | None = None
526
+
527
+ for model in models_to_try:
528
+ try:
529
+ return await self._call_with_retry(model, user_prompt, question)
530
+ except Exception as e:
531
+ logger.warning("Model failed", model=model, error=str(e))
532
+ last_error = e
533
+ continue
534
+
535
+ # All models failed
536
+ logger.error("All HF models failed", error=str(last_error))
537
+ return self._create_fallback_assessment(question, str(last_error))
538
+
539
+ def _create_fallback_assessment(
540
+ self,
541
+ question: str,
542
+ error: str,
543
+ ) -> JudgeAssessment:
544
+ """Create a fallback assessment when inference fails."""
545
+ return JudgeAssessment(
546
+ details=AssessmentDetails(
547
+ mechanism_score=0,
548
+ mechanism_reasoning=f"Assessment failed: {error}",
549
+ clinical_evidence_score=0,
550
+ clinical_reasoning=f"Assessment failed: {error}",
551
+ drug_candidates=[],
552
+ key_findings=[],
553
+ ),
554
+ sufficient=False,
555
+ confidence=0.0,
556
+ recommendation="continue",
557
+ next_search_queries=[
558
+ f"{question} mechanism",
559
+ f"{question} clinical trials",
560
+ f"{question} drug candidates",
561
+ ],
562
+ reasoning=f"HF Inference failed: {error}. Recommend retrying.",
563
+ )
564
+
565
+
566
+ class MockJudgeHandler:
567
+ """
568
+ Mock JudgeHandler for UNIT TESTING ONLY.
569
+
570
+ NOT for production use. Use HFInferenceJudgeHandler for demo mode.
571
+ """
572
+
573
+ def __init__(self, mock_response: JudgeAssessment | None = None):
574
+ """Initialize with optional mock response for testing."""
575
+ self.mock_response = mock_response
576
+ self.call_count = 0
577
+ self.last_question = None
578
+ self.last_evidence = None
579
+
580
+ async def assess(
581
+ self,
582
+ question: str,
583
+ evidence: List[Evidence],
584
+ ) -> JudgeAssessment:
585
+ """Return the mock response (for testing only)."""
586
+ self.call_count += 1
587
+ self.last_question = question
588
+ self.last_evidence = evidence
589
+
590
+ if self.mock_response:
591
+ return self.mock_response
592
+
593
+ # Default mock response for tests
594
+ return JudgeAssessment(
595
+ details=AssessmentDetails(
596
+ mechanism_score=7,
597
+ mechanism_reasoning="Mock assessment for testing",
598
+ clinical_evidence_score=6,
599
+ clinical_reasoning="Mock assessment for testing",
600
+ drug_candidates=["TestDrug"],
601
+ key_findings=["Test finding"],
602
+ ),
603
+ sufficient=len(evidence) >= 3,
604
+ confidence=0.75,
605
+ recommendation="synthesize" if len(evidence) >= 3 else "continue",
606
+ next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [],
607
+ reasoning="Mock assessment for unit testing only",
608
+ )
609
+ ```
610
+
611
+ ---
612
+
613
+ ## 5. TDD Workflow
614
+
615
+ ### Test File: `tests/unit/agent_factory/test_judges.py`
616
+
617
+ ```python
618
+ """Unit tests for JudgeHandler."""
619
+ import pytest
620
+ from unittest.mock import AsyncMock, MagicMock, patch
621
+
622
+ from src.utils.models import (
623
+ Evidence,
624
+ Citation,
625
+ JudgeAssessment,
626
+ AssessmentDetails,
627
+ )
628
+
629
+
630
+ class TestJudgeHandler:
631
+ """Tests for JudgeHandler."""
632
+
633
+ @pytest.mark.asyncio
634
+ async def test_assess_returns_assessment(self):
635
+ """JudgeHandler should return JudgeAssessment from LLM."""
636
+ from src.agent_factory.judges import JudgeHandler
637
+
638
+ # Create mock assessment
639
+ mock_assessment = JudgeAssessment(
640
+ details=AssessmentDetails(
641
+ mechanism_score=8,
642
+ mechanism_reasoning="Strong mechanistic evidence",
643
+ clinical_evidence_score=7,
644
+ clinical_reasoning="Good clinical support",
645
+ drug_candidates=["Metformin"],
646
+ key_findings=["Neuroprotective effects"],
647
+ ),
648
+ sufficient=True,
649
+ confidence=0.85,
650
+ recommendation="synthesize",
651
+ next_search_queries=[],
652
+ reasoning="Evidence is sufficient for synthesis",
653
+ )
654
+
655
+ # Mock the PydanticAI agent
656
+ mock_result = MagicMock()
657
+ mock_result.data = mock_assessment
658
+
659
+ with patch("src.agent_factory.judges.Agent") as mock_agent_class:
660
+ mock_agent = AsyncMock()
661
+ mock_agent.run = AsyncMock(return_value=mock_result)
662
+ mock_agent_class.return_value = mock_agent
663
+
664
+ handler = JudgeHandler()
665
+ # Replace the agent with our mock
666
+ handler.agent = mock_agent
667
+
668
+ evidence = [
669
+ Evidence(
670
+ content="Metformin shows neuroprotective properties...",
671
+ citation=Citation(
672
+ source="pubmed",
673
+ title="Metformin in AD",
674
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
675
+ date="2024-01-01",
676
+ ),
677
+ )
678
+ ]
679
+
680
+ result = await handler.assess("metformin alzheimer", evidence)
681
+
682
+ assert result.sufficient is True
683
+ assert result.recommendation == "synthesize"
684
+ assert result.confidence == 0.85
685
+ assert "Metformin" in result.details.drug_candidates
686
+
687
+ @pytest.mark.asyncio
688
+ async def test_assess_empty_evidence(self):
689
+ """JudgeHandler should handle empty evidence gracefully."""
690
+ from src.agent_factory.judges import JudgeHandler
691
+
692
+ mock_assessment = JudgeAssessment(
693
+ details=AssessmentDetails(
694
+ mechanism_score=0,
695
+ mechanism_reasoning="No evidence to assess",
696
+ clinical_evidence_score=0,
697
+ clinical_reasoning="No evidence to assess",
698
+ drug_candidates=[],
699
+ key_findings=[],
700
+ ),
701
+ sufficient=False,
702
+ confidence=0.0,
703
+ recommendation="continue",
704
+ next_search_queries=["metformin alzheimer mechanism"],
705
+ reasoning="No evidence found, need to search more",
706
+ )
707
+
708
+ mock_result = MagicMock()
709
+ mock_result.data = mock_assessment
710
+
711
+ with patch("src.agent_factory.judges.Agent") as mock_agent_class:
712
+ mock_agent = AsyncMock()
713
+ mock_agent.run = AsyncMock(return_value=mock_result)
714
+ mock_agent_class.return_value = mock_agent
715
+
716
+ handler = JudgeHandler()
717
+ handler.agent = mock_agent
718
+
719
+ result = await handler.assess("metformin alzheimer", [])
720
+
721
+ assert result.sufficient is False
722
+ assert result.recommendation == "continue"
723
+ assert len(result.next_search_queries) > 0
724
+
725
+ @pytest.mark.asyncio
726
+ async def test_assess_handles_llm_failure(self):
727
+ """JudgeHandler should return fallback on LLM failure."""
728
+ from src.agent_factory.judges import JudgeHandler
729
+
730
+ with patch("src.agent_factory.judges.Agent") as mock_agent_class:
731
+ mock_agent = AsyncMock()
732
+ mock_agent.run = AsyncMock(side_effect=Exception("API Error"))
733
+ mock_agent_class.return_value = mock_agent
734
+
735
+ handler = JudgeHandler()
736
+ handler.agent = mock_agent
737
+
738
+ evidence = [
739
+ Evidence(
740
+ content="Some content",
741
+ citation=Citation(
742
+ source="pubmed",
743
+ title="Title",
744
+ url="url",
745
+ date="2024",
746
+ ),
747
+ )
748
+ ]
749
+
750
+ result = await handler.assess("test question", evidence)
751
+
752
+ # Should return fallback, not raise
753
+ assert result.sufficient is False
754
+ assert result.recommendation == "continue"
755
+ assert "failed" in result.reasoning.lower()
756
+
757
+
758
+ class TestHFInferenceJudgeHandler:
759
+ """Tests for HFInferenceJudgeHandler."""
760
+
761
+ @pytest.mark.asyncio
762
+ async def test_extract_json_raw(self):
763
+ """Should extract raw JSON."""
764
+ from src.agent_factory.judges import HFInferenceJudgeHandler
765
+
766
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
767
+ # Bypass __init__ for unit testing extraction
768
+
769
+ result = handler._extract_json('{"key": "value"}')
770
+ assert result == {"key": "value"}
771
+
772
+ @pytest.mark.asyncio
773
+ async def test_extract_json_markdown_block(self):
774
+ """Should extract JSON from markdown code block."""
775
+ from src.agent_factory.judges import HFInferenceJudgeHandler
776
+
777
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
778
+
779
+ response = '''Here is the assessment:
780
+ ```json
781
+ {"key": "value", "nested": {"inner": 1}}
782
+ ```
783
+ '''
784
+ result = handler._extract_json(response)
785
+ assert result == {"key": "value", "nested": {"inner": 1}}
786
+
787
+ @pytest.mark.asyncio
788
+ async def test_extract_json_with_preamble(self):
789
+ """Should extract JSON with preamble text."""
790
+ from src.agent_factory.judges import HFInferenceJudgeHandler
791
+
792
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
793
+
794
+ response = 'Here is your JSON response:\n{"sufficient": true, "confidence": 0.85}'
795
+ result = handler._extract_json(response)
796
+ assert result == {"sufficient": True, "confidence": 0.85}
797
+
798
+ @pytest.mark.asyncio
799
+ async def test_extract_json_nested_braces(self):
800
+ """Should handle nested braces correctly."""
801
+ from src.agent_factory.judges import HFInferenceJudgeHandler
802
+
803
+ handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler)
804
+
805
+ response = '{"details": {"mechanism_score": 8}, "reasoning": "test"}'
806
+ result = handler._extract_json(response)
807
+ assert result["details"]["mechanism_score"] == 8
808
+
809
+ @pytest.mark.asyncio
810
+ async def test_hf_handler_uses_fallback_models(self):
811
+ """HFInferenceJudgeHandler should have fallback model chain."""
812
+ from src.agent_factory.judges import HFInferenceJudgeHandler
813
+
814
+ # Check class has fallback models defined
815
+ assert len(HFInferenceJudgeHandler.FALLBACK_MODELS) >= 3
816
+ assert "zephyr-7b-beta" in HFInferenceJudgeHandler.FALLBACK_MODELS[-1]
817
+
818
+ @pytest.mark.asyncio
819
+ async def test_hf_handler_fallback_on_auth_error(self):
+ """Should fall back to the next model in the chain when the first one fails (e.g. gated 403)."""
+ from src.agent_factory.judges import HFInferenceJudgeHandler, MockJudgeHandler
+
+ handler = HFInferenceJudgeHandler()
+ expected = await MockJudgeHandler().assess("test", [])  # any valid JudgeAssessment
+
+ # First model raises an auth error, second model returns a valid assessment
+ handler._call_with_retry = AsyncMock(
+ side_effect=[Exception("403 Forbidden: gated model"), expected],
+ )
+
+ result = await handler.assess("test question", [])
+
+ assert result is expected
+ assert handler._call_with_retry.call_count == 2
+ # The second attempt used the next model in the fallback chain
+ assert handler._call_with_retry.call_args_list[1].args[0] == HFInferenceJudgeHandler.FALLBACK_MODELS[1]
837
+
838
+
839
+ class TestMockJudgeHandler:
840
+ """Tests for MockJudgeHandler (UNIT TESTING ONLY)."""
841
+
842
+ @pytest.mark.asyncio
843
+ async def test_mock_handler_returns_default(self):
844
+ """MockJudgeHandler should return default assessment."""
845
+ from src.agent_factory.judges import MockJudgeHandler
846
+
847
+ handler = MockJudgeHandler()
848
+
849
+ evidence = [
850
+ Evidence(
851
+ content="Content 1",
852
+ citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
853
+ ),
854
+ Evidence(
855
+ content="Content 2",
856
+ citation=Citation(source="web", title="T2", url="u2", date="2024"),
857
+ ),
858
+ ]
859
+
860
+ result = await handler.assess("test", evidence)
861
+
862
+ assert handler.call_count == 1
863
+ assert handler.last_question == "test"
864
+ assert len(handler.last_evidence) == 2
865
+ assert result.details.mechanism_score == 7
866
+
867
+ @pytest.mark.asyncio
868
+ async def test_mock_handler_custom_response(self):
869
+ """MockJudgeHandler should return custom response when provided."""
870
+ from src.agent_factory.judges import MockJudgeHandler
871
+
872
+ custom_assessment = JudgeAssessment(
873
+ details=AssessmentDetails(
874
+ mechanism_score=10,
875
+ mechanism_reasoning="Custom reasoning",
876
+ clinical_evidence_score=10,
877
+ clinical_reasoning="Custom clinical",
878
+ drug_candidates=["CustomDrug"],
879
+ key_findings=["Custom finding"],
880
+ ),
881
+ sufficient=True,
882
+ confidence=1.0,
883
+ recommendation="synthesize",
884
+ next_search_queries=[],
885
+ reasoning="Custom assessment",
886
+ )
887
+
888
+ handler = MockJudgeHandler(mock_response=custom_assessment)
889
+ result = await handler.assess("test", [])
890
+
891
+ assert result.details.mechanism_score == 10
892
+ assert result.details.drug_candidates == ["CustomDrug"]
893
+
894
+ @pytest.mark.asyncio
895
+ async def test_mock_handler_insufficient_with_few_evidence(self):
896
+ """MockJudgeHandler should recommend continue with < 3 evidence."""
897
+ from src.agent_factory.judges import MockJudgeHandler
898
+
899
+ handler = MockJudgeHandler()
900
+
901
+ # Only 2 pieces of evidence
902
+ evidence = [
903
+ Evidence(
904
+ content="Content",
905
+ citation=Citation(source="pubmed", title="T", url="u", date="2024"),
906
+ ),
907
+ Evidence(
908
+ content="Content 2",
909
+ citation=Citation(source="web", title="T2", url="u2", date="2024"),
910
+ ),
911
+ ]
912
+
913
+ result = await handler.assess("test", evidence)
914
+
915
+ assert result.sufficient is False
916
+ assert result.recommendation == "continue"
917
+ assert len(result.next_search_queries) > 0
918
+ ```
919
+
920
+ ---
921
+
922
+ ## 6. Dependencies
923
+
924
+ Add to `pyproject.toml`:
925
+
926
+ ```toml
927
+ [project]
928
+ dependencies = [
929
+ # ... existing deps ...
930
+ "pydantic-ai>=0.0.16",
931
+ "openai>=1.0.0",
932
+ "anthropic>=0.18.0",
933
+ "huggingface-hub>=0.20.0", # For HFInferenceJudgeHandler (FREE LLM)
934
+ ]
935
+ ```
936
+
937
+ **Note**: `huggingface-hub` is required for the free tier to work. It:
938
+ - Provides `InferenceClient` for API calls
939
+ - Auto-reads `HF_TOKEN` from environment (optional, for gated models)
940
+ - Works without any token for ungated models like `zephyr-7b-beta`
941
+
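+ As a quick sanity check, the snippet below shows roughly how `InferenceClient` can be
+ called without any token on a recent `huggingface-hub` version (the model id and prompt
+ are only illustrative):
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ # Ungated model: no HF_TOKEN required (a gated model like Llama 3.1 would need one)
+ client = InferenceClient(model="HuggingFaceH4/zephyr-7b-beta")
+
+ response = client.chat_completion(
+     messages=[{"role": "user", "content": "Reply with a one-line JSON object."}],
+     max_tokens=64,
+ )
+ print(response.choices[0].message.content)
+ ```
+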
942
+ ---
943
+
944
+ ## 7. Configuration (`src/utils/config.py`)
945
+
946
+ Add LLM configuration:
947
+
948
+ ```python
949
+ """Add to src/utils/config.py."""
950
+ from pydantic_settings import BaseSettings
951
+ from typing import Literal
952
+
953
+
954
+ class Settings(BaseSettings):
955
+ """Application settings."""
956
+
957
+ # LLM Configuration
958
+ llm_provider: Literal["openai", "anthropic"] = "openai"
959
+ openai_model: str = "gpt-4o"
960
+ anthropic_model: str = "claude-3-5-sonnet-20241022"
961
+
962
+ # API Keys (loaded from environment)
963
+ openai_api_key: str | None = None
964
+ anthropic_api_key: str | None = None
965
+ ncbi_api_key: str | None = None
966
+
967
+ class Config:
968
+ env_file = ".env"
969
+ env_file_encoding = "utf-8"
970
+
971
+
972
+ settings = Settings()
973
+ ```
974
+
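+ For orientation, a pydantic-ai style model identifier could be derived from these settings
+ roughly as sketched below; the actual factory in `src/agent_factory/judges.py` may differ:
+
+ ```python
+ from src.utils.config import settings
+
+ def resolve_model_id() -> str:
+     """Map provider settings to a pydantic-ai model string (sketch only)."""
+     if settings.llm_provider == "anthropic":
+         return f"anthropic:{settings.anthropic_model}"
+     return f"openai:{settings.openai_model}"
+
+ # e.g. "openai:gpt-4o" with the defaults above
+ print(resolve_model_id())
+ ```
+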
975
+ ---
976
+
977
+ ## 8. Implementation Checklist
978
+
979
+ - [ ] Add `AssessmentDetails` and `JudgeAssessment` models to `src/utils/models.py`
980
+ - [ ] Create `src/prompts/__init__.py` (empty, for package)
981
+ - [ ] Create `src/prompts/judge.py` with prompt templates
982
+ - [ ] Create `src/agent_factory/__init__.py` with exports
983
+ - [ ] Implement `src/agent_factory/judges.py` with JudgeHandler
984
+ - [ ] Update `src/utils/config.py` with LLM settings
985
+ - [ ] Create `tests/unit/agent_factory/__init__.py`
986
+ - [ ] Write tests in `tests/unit/agent_factory/test_judges.py`
987
+ - [ ] Run `uv run pytest tests/unit/agent_factory/ -v` — **ALL TESTS MUST PASS**
988
+ - [ ] Commit: `git commit -m "feat: phase 3 judge slice complete"`
989
+
990
+ ---
991
+
992
+ ## 9. Definition of Done
993
+
994
+ Phase 3 is **COMPLETE** when:
995
+
996
+ 1. All unit tests pass: `uv run pytest tests/unit/agent_factory/ -v`
997
+ 2. `JudgeHandler` can assess evidence and return structured output
998
+ 3. Graceful degradation: if LLM fails, returns safe fallback
999
+ 4. MockJudgeHandler works for testing without API calls
1000
+ 5. Can run this in Python REPL:
1001
+
1002
+ ```python
1003
+ import asyncio
1004
+ import os
1005
+ from src.utils.models import Evidence, Citation
1006
+ from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
1007
+
1008
+ # Test with mock (no API key needed)
1009
+ async def test_mock():
1010
+ handler = MockJudgeHandler()
1011
+ evidence = [
1012
+ Evidence(
1013
+ content="Metformin shows neuroprotective effects in AD models",
1014
+ citation=Citation(
1015
+ source="pubmed",
1016
+ title="Metformin and Alzheimer's",
1017
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
1018
+ date="2024-01-01",
1019
+ ),
1020
+ ),
1021
+ ]
1022
+ result = await handler.assess("metformin alzheimer", evidence)
1023
+ print(f"Sufficient: {result.sufficient}")
1024
+ print(f"Recommendation: {result.recommendation}")
1025
+ print(f"Drug candidates: {result.details.drug_candidates}")
1026
+
1027
+ asyncio.run(test_mock())
1028
+
1029
+ # Test with real LLM (requires API key)
1030
+ async def test_real():
1031
+ os.environ["OPENAI_API_KEY"] = "your-key-here" # Or set in .env
1032
+ handler = JudgeHandler()
1033
+ evidence = [
1034
+ Evidence(
1035
+ content="Metformin shows neuroprotective effects in AD models...",
1036
+ citation=Citation(
1037
+ source="pubmed",
1038
+ title="Metformin and Alzheimer's",
1039
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
1040
+ date="2024-01-01",
1041
+ ),
1042
+ ),
1043
+ ]
1044
+ result = await handler.assess("metformin alzheimer", evidence)
1045
+ print(f"Sufficient: {result.sufficient}")
1046
+ print(f"Confidence: {result.confidence}")
1047
+ print(f"Reasoning: {result.reasoning}")
1048
+
1049
+ # asyncio.run(test_real()) # Uncomment with valid API key
1050
+ ```
1051
+
1052
+ **Proceed to Phase 4 ONLY after all checkboxes are complete.**
docs/implementation/04_phase_ui.md ADDED
@@ -0,0 +1,1104 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Phase 4 Implementation Spec: Orchestrator & UI
2
+
3
+ **Goal**: Connect the Brain and the Body, then give it a Face.
4
+ **Philosophy**: "Streaming is Trust."
5
+ **Prerequisite**: Phase 3 complete (all judge tests passing)
6
+
7
+ ---
8
+
9
+ ## 1. The Slice Definition
10
+
11
+ This slice connects:
12
+ 1. **Orchestrator**: The state machine (While loop) calling Search -> Judge.
13
+ 2. **UI**: Gradio interface that visualizes the loop.
14
+
15
+ **Files to Create/Modify**:
16
+ - `src/orchestrator.py` - Agent loop logic
17
+ - `src/app.py` - Gradio UI
18
+ - `tests/unit/test_orchestrator.py` - Unit tests
19
+ - `Dockerfile` - Container for deployment
20
+ - `README.md` - Usage instructions (update)
21
+
22
+ ---
23
+
24
+ ## 2. Agent Events (`src/utils/models.py`)
25
+
26
+ Add event types for streaming UI updates:
27
+
28
+ ```python
29
+ """Add to src/utils/models.py (after JudgeAssessment models)."""
30
+ from pydantic import BaseModel, Field
31
+ from typing import Literal, Any
32
+ from datetime import datetime
33
+
34
+
35
+ class AgentEvent(BaseModel):
36
+ """Event emitted by the orchestrator for UI streaming."""
37
+
38
+ type: Literal[
39
+ "started",
40
+ "searching",
41
+ "search_complete",
42
+ "judging",
43
+ "judge_complete",
44
+ "looping",
45
+ "synthesizing",
46
+ "complete",
47
+ "error",
48
+ ]
49
+ message: str
50
+ data: Any = None
51
+ timestamp: datetime = Field(default_factory=datetime.now)
52
+ iteration: int = 0
53
+
54
+ def to_markdown(self) -> str:
55
+ """Format event as markdown for chat display."""
56
+ icons = {
57
+ "started": "🚀",
58
+ "searching": "🔍",
59
+ "search_complete": "📚",
60
+ "judging": "🧠",
61
+ "judge_complete": "✅",
62
+ "looping": "🔄",
63
+ "synthesizing": "📝",
64
+ "complete": "🎉",
65
+ "error": "❌",
66
+ }
67
+ icon = icons.get(self.type, "•")
68
+ return f"{icon} **{self.type.upper()}**: {self.message}"
69
+
70
+
71
+ class OrchestratorConfig(BaseModel):
72
+ """Configuration for the orchestrator."""
73
+
74
+ max_iterations: int = Field(default=5, ge=1, le=10)
75
+ max_results_per_tool: int = Field(default=10, ge=1, le=50)
76
+ search_timeout: float = Field(default=30.0, ge=5.0, le=120.0)
77
+ ```
78
+
79
+ ---
80
+
81
+ ## 3. The Orchestrator (`src/orchestrator.py`)
82
+
83
+ This is the "Agent" logic — the while loop that drives search and judgment.
84
+
85
+ ```python
86
+ """Orchestrator - the agent loop connecting Search and Judge."""
87
+ import asyncio
88
+ from typing import AsyncGenerator, List, Protocol
89
+ import structlog
90
+
91
+ from src.utils.models import (
92
+ Evidence,
93
+ SearchResult,
94
+ JudgeAssessment,
95
+ AgentEvent,
96
+ OrchestratorConfig,
97
+ )
98
+
99
+ logger = structlog.get_logger()
100
+
101
+
102
+ class SearchHandlerProtocol(Protocol):
103
+ """Protocol for search handler."""
104
+ async def execute(self, query: str, max_results_per_tool: int = 10) -> SearchResult:
105
+ ...
106
+
107
+
108
+ class JudgeHandlerProtocol(Protocol):
109
+ """Protocol for judge handler."""
110
+ async def assess(self, question: str, evidence: List[Evidence]) -> JudgeAssessment:
111
+ ...
112
+
113
+
114
+ class Orchestrator:
115
+ """
116
+ The agent orchestrator - runs the Search -> Judge -> Loop cycle.
117
+
118
+ This is a generator-based design that yields events for real-time UI updates.
119
+ """
120
+
121
+ def __init__(
122
+ self,
123
+ search_handler: SearchHandlerProtocol,
124
+ judge_handler: JudgeHandlerProtocol,
125
+ config: OrchestratorConfig | None = None,
126
+ ):
127
+ """
128
+ Initialize the orchestrator.
129
+
130
+ Args:
131
+ search_handler: Handler for executing searches
132
+ judge_handler: Handler for assessing evidence
133
+ config: Optional configuration (uses defaults if not provided)
134
+ """
135
+ self.search = search_handler
136
+ self.judge = judge_handler
137
+ self.config = config or OrchestratorConfig()
138
+ self.history: List[dict] = []
139
+
140
+ async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
141
+ """
142
+ Run the agent loop for a query.
143
+
144
+ Yields AgentEvent objects for each step, allowing real-time UI updates.
145
+
146
+ Args:
147
+ query: The user's research question
148
+
149
+ Yields:
150
+ AgentEvent objects for each step of the process
151
+ """
152
+ logger.info("Starting orchestrator", query=query)
153
+
154
+ yield AgentEvent(
155
+ type="started",
156
+ message=f"Starting research for: {query}",
157
+ iteration=0,
158
+ )
159
+
160
+ all_evidence: List[Evidence] = []
161
+ current_queries = [query]
162
+ iteration = 0
163
+
164
+ while iteration < self.config.max_iterations:
165
+ iteration += 1
166
+ logger.info("Iteration", iteration=iteration, queries=current_queries)
167
+
168
+ # === SEARCH PHASE ===
169
+ yield AgentEvent(
170
+ type="searching",
171
+ message=f"Searching for: {', '.join(current_queries[:3])}...",
172
+ iteration=iteration,
173
+ )
174
+
175
+ try:
176
+ # Execute searches for all current queries
177
+ search_tasks = [
178
+ self.search.execute(q, self.config.max_results_per_tool)
179
+ for q in current_queries[:3] # Limit to 3 queries per iteration
180
+ ]
181
+ search_results = await asyncio.gather(*search_tasks, return_exceptions=True)
182
+
183
+ # Collect evidence from successful searches
184
+ new_evidence: List[Evidence] = []
185
+ errors: List[str] = []
186
+
187
+ for q, result in zip(current_queries[:3], search_results):
188
+ if isinstance(result, Exception):
189
+ errors.append(f"Search for '{q}' failed: {str(result)}")
190
+ else:
191
+ new_evidence.extend(result.evidence)
192
+ errors.extend(result.errors)
193
+
194
+ # Deduplicate evidence by URL
195
+ seen_urls = {e.citation.url for e in all_evidence}
196
+ unique_new = [e for e in new_evidence if e.citation.url not in seen_urls]
197
+ all_evidence.extend(unique_new)
198
+
199
+ yield AgentEvent(
200
+ type="search_complete",
201
+ message=f"Found {len(unique_new)} new sources ({len(all_evidence)} total)",
202
+ data={"new_count": len(unique_new), "total_count": len(all_evidence)},
203
+ iteration=iteration,
204
+ )
205
+
206
+ if errors:
207
+ logger.warning("Search errors", errors=errors)
208
+
209
+ except Exception as e:
210
+ logger.error("Search phase failed", error=str(e))
211
+ yield AgentEvent(
212
+ type="error",
213
+ message=f"Search failed: {str(e)}",
214
+ iteration=iteration,
215
+ )
216
+ continue
217
+
218
+ # === JUDGE PHASE ===
219
+ yield AgentEvent(
220
+ type="judging",
221
+ message=f"Evaluating {len(all_evidence)} sources...",
222
+ iteration=iteration,
223
+ )
224
+
225
+ try:
226
+ assessment = await self.judge.assess(query, all_evidence)
227
+
228
+ yield AgentEvent(
229
+ type="judge_complete",
230
+ message=f"Assessment: {assessment.recommendation} (confidence: {assessment.confidence:.0%})",
231
+ data={
232
+ "sufficient": assessment.sufficient,
233
+ "confidence": assessment.confidence,
234
+ "mechanism_score": assessment.details.mechanism_score,
235
+ "clinical_score": assessment.details.clinical_evidence_score,
236
+ },
237
+ iteration=iteration,
238
+ )
239
+
240
+ # Record this iteration in history
241
+ self.history.append({
242
+ "iteration": iteration,
243
+ "queries": current_queries,
244
+ "evidence_count": len(all_evidence),
245
+ "assessment": assessment.model_dump(),
246
+ })
247
+
248
+ # === DECISION PHASE ===
249
+ if assessment.sufficient and assessment.recommendation == "synthesize":
250
+ yield AgentEvent(
251
+ type="synthesizing",
252
+ message="Evidence sufficient! Preparing synthesis...",
253
+ iteration=iteration,
254
+ )
255
+
256
+ # Generate final response
257
+ final_response = self._generate_synthesis(query, all_evidence, assessment)
258
+
259
+ yield AgentEvent(
260
+ type="complete",
261
+ message=final_response,
262
+ data={
263
+ "evidence_count": len(all_evidence),
264
+ "iterations": iteration,
265
+ "drug_candidates": assessment.details.drug_candidates,
266
+ "key_findings": assessment.details.key_findings,
267
+ },
268
+ iteration=iteration,
269
+ )
270
+ return
271
+
272
+ else:
273
+ # Need more evidence - prepare next queries
274
+ current_queries = assessment.next_search_queries or [
275
+ f"{query} mechanism of action",
276
+ f"{query} clinical evidence",
277
+ ]
278
+
279
+ yield AgentEvent(
280
+ type="looping",
281
+ message=f"Need more evidence. Next searches: {', '.join(current_queries[:2])}...",
282
+ data={"next_queries": current_queries},
283
+ iteration=iteration,
284
+ )
285
+
286
+ except Exception as e:
287
+ logger.error("Judge phase failed", error=str(e))
288
+ yield AgentEvent(
289
+ type="error",
290
+ message=f"Assessment failed: {str(e)}",
291
+ iteration=iteration,
292
+ )
293
+ continue
294
+
295
+ # Max iterations reached
296
+ yield AgentEvent(
297
+ type="complete",
298
+ message=self._generate_partial_synthesis(query, all_evidence),
299
+ data={
300
+ "evidence_count": len(all_evidence),
301
+ "iterations": iteration,
302
+ "max_reached": True,
303
+ },
304
+ iteration=iteration,
305
+ )
306
+
307
+ def _generate_synthesis(
308
+ self,
309
+ query: str,
310
+ evidence: List[Evidence],
311
+ assessment: JudgeAssessment,
312
+ ) -> str:
313
+ """
314
+ Generate the final synthesis response.
315
+
316
+ Args:
317
+ query: The original question
318
+ evidence: All collected evidence
319
+ assessment: The final assessment
320
+
321
+ Returns:
322
+ Formatted synthesis as markdown
323
+ """
324
+ drug_list = "\n".join([f"- **{d}**" for d in assessment.details.drug_candidates]) or "- No specific candidates identified"
325
+ findings_list = "\n".join([f"- {f}" for f in assessment.details.key_findings]) or "- See evidence below"
326
+
327
+ citations = "\n".join([
328
+ f"{i+1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()}, {e.citation.date})"
329
+ for i, e in enumerate(evidence[:10]) # Limit to 10 citations
330
+ ])
331
+
332
+ return f"""## Drug Repurposing Analysis
333
+
334
+ ### Question
335
+ {query}
336
+
337
+ ### Drug Candidates
338
+ {drug_list}
339
+
340
+ ### Key Findings
341
+ {findings_list}
342
+
343
+ ### Assessment
344
+ - **Mechanism Score**: {assessment.details.mechanism_score}/10
345
+ - **Clinical Evidence Score**: {assessment.details.clinical_evidence_score}/10
346
+ - **Confidence**: {assessment.confidence:.0%}
347
+
348
+ ### Reasoning
349
+ {assessment.reasoning}
350
+
351
+ ### Citations ({len(evidence)} sources)
352
+ {citations}
353
+
354
+ ---
355
+ *Analysis based on {len(evidence)} sources across {len(self.history)} iterations.*
356
+ """
357
+
358
+ def _generate_partial_synthesis(
359
+ self,
360
+ query: str,
361
+ evidence: List[Evidence],
362
+ ) -> str:
363
+ """
364
+ Generate a partial synthesis when max iterations reached.
365
+
366
+ Args:
367
+ query: The original question
368
+ evidence: All collected evidence
369
+
370
+ Returns:
371
+ Formatted partial synthesis as markdown
372
+ """
373
+ citations = "\n".join([
374
+ f"{i+1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})"
375
+ for i, e in enumerate(evidence[:10])
376
+ ])
377
+
378
+ return f"""## Partial Analysis (Max Iterations Reached)
379
+
380
+ ### Question
381
+ {query}
382
+
383
+ ### Status
384
+ Maximum search iterations reached. The evidence gathered may be incomplete.
385
+
386
+ ### Evidence Collected
387
+ Found {len(evidence)} sources. Consider refining your query for more specific results.
388
+
389
+ ### Citations
390
+ {citations}
391
+
392
+ ---
393
+ *Consider searching with more specific terms or drug names.*
394
+ """
395
+ ```
396
+
397
+ ---
398
+
399
+ ## 4. The Gradio UI (`src/app.py`)
400
+
401
+ Using Gradio 5 generator pattern for real-time streaming.
402
+
403
+ ```python
404
+ """Gradio UI for DeepCritical agent."""
405
+ import asyncio
406
+ import gradio as gr
407
+ from typing import AsyncGenerator
408
+
409
+ from src.orchestrator import Orchestrator
410
+ from src.tools.pubmed import PubMedTool
411
+ from src.tools.clinicaltrials import ClinicalTrialsTool
412
+ from src.tools.biorxiv import BioRxivTool
413
+ from src.tools.search_handler import SearchHandler
414
+ from src.agent_factory.judges import JudgeHandler, HFInferenceJudgeHandler
415
+ from src.utils.models import OrchestratorConfig, AgentEvent
416
+
417
+
418
+ def create_orchestrator(
419
+ user_api_key: str | None = None,
420
+ api_provider: str = "openai",
421
+ ) -> tuple[Orchestrator, str]:
422
+ """
423
+ Create an orchestrator instance.
424
+
425
+ Args:
426
+ user_api_key: Optional user-provided API key (BYOK)
427
+ api_provider: API provider ("openai" or "anthropic")
428
+
429
+ Returns:
430
+ Tuple of (Configured Orchestrator instance, backend_name)
431
+
432
+ Priority:
433
+ 1. User-provided API key → JudgeHandler (OpenAI/Anthropic)
434
+ 2. Environment API key → JudgeHandler (OpenAI/Anthropic)
435
+ 3. No key → HFInferenceJudgeHandler (FREE, automatic fallback chain)
436
+
437
+ HF Inference Fallback Chain:
438
+ 1. Llama 3.1 8B (requires HF_TOKEN for gated model)
439
+ 2. Mistral 7B (may require token)
440
+ 3. Zephyr 7B (ungated, always works)
441
+ """
442
+ import os
443
+
444
+ # Create search tools
445
+ search_handler = SearchHandler(
446
+ tools=[PubMedTool(), ClinicalTrialsTool(), BioRxivTool()],
447
+ timeout=30.0,
448
+ )
449
+
450
+ # Determine which judge to use
451
+ has_env_key = bool(os.getenv("OPENAI_API_KEY") or os.getenv("ANTHROPIC_API_KEY"))
452
+ has_user_key = bool(user_api_key)
453
+ has_hf_token = bool(os.getenv("HF_TOKEN"))
454
+
455
+ if has_user_key:
456
+ # User provided their own key
457
+ judge_handler = JudgeHandler(model=None)
458
+ backend_name = f"your {api_provider.upper()} API key"
459
+ elif has_env_key:
460
+ # Environment has API key configured
461
+ judge_handler = JudgeHandler(model=None)
462
+ backend_name = "configured API key"
463
+ else:
464
+ # Use FREE HuggingFace Inference with automatic fallback
465
+ judge_handler = HFInferenceJudgeHandler()
466
+ if has_hf_token:
467
+ backend_name = "HuggingFace Inference (Llama 3.1)"
468
+ else:
469
+ backend_name = "HuggingFace Inference (free tier)"
470
+
471
+ # Create orchestrator
472
+ config = OrchestratorConfig(
473
+ max_iterations=5,
474
+ max_results_per_tool=10,
475
+ )
476
+
477
+ return Orchestrator(
478
+ search_handler=search_handler,
479
+ judge_handler=judge_handler,
480
+ config=config,
481
+ ), backend_name
482
+
483
+
484
+ async def research_agent(
485
+ message: str,
+ history: list[dict],
+ mode: str = "simple",
+ api_key: str = "",
+ api_provider: str = "openai",
489
+ ) -> AsyncGenerator[str, None]:
490
+ """
491
+ Gradio chat function that runs the research agent.
492
+
493
+ Args:
494
+ message: User's research question
495
+ history: Chat history (Gradio format)
+ mode: Orchestrator mode ("simple" linear loop; "magentic" is wired up in Phase 5)
+ api_key: Optional user-provided API key (BYOK)
+ api_provider: API provider ("openai" or "anthropic")
498
+
499
+ Yields:
500
+ Markdown-formatted responses for streaming
501
+ """
502
+ if not message.strip():
503
+ yield "Please enter a research question."
504
+ return
505
+
506
+ import os
507
+
508
+ # Clean user-provided API key
509
+ user_api_key = api_key.strip() if api_key else None
510
+
511
+ # Create orchestrator with appropriate judge
512
+ orchestrator, backend_name = create_orchestrator(
513
+ user_api_key=user_api_key,
514
+ api_provider=api_provider,
515
+ )
516
+
517
+ # Determine icon based on backend
518
+ has_hf_token = bool(os.getenv("HF_TOKEN"))
519
+ if "HuggingFace" in backend_name:
520
+ icon = "🤗"
521
+ extra_note = (
522
+ "\n*For premium analysis, enter an OpenAI or Anthropic API key.*"
523
+ if not has_hf_token else ""
524
+ )
525
+ else:
526
+ icon = "🔑"
527
+ extra_note = ""
528
+
529
+ # Inform user which backend is being used
530
+ yield f"{icon} **Using {backend_name}**{extra_note}\n\n"
531
+
532
+ # Run the agent and stream events
533
+ response_parts = []
534
+
535
+ try:
536
+ async for event in orchestrator.run(message):
537
+ # Format event as markdown
538
+ event_md = event.to_markdown()
539
+ response_parts.append(event_md)
540
+
541
+ # If complete, show full response
542
+ if event.type == "complete":
543
+ yield event.message
544
+ else:
545
+ # Show progress
546
+ yield "\n\n".join(response_parts)
547
+
548
+ except Exception as e:
549
+ yield f"❌ **Error**: {str(e)}"
550
+
551
+
552
+ def create_demo() -> gr.Blocks:
553
+ """
554
+ Create the Gradio demo interface.
555
+
556
+ Returns:
557
+ Configured Gradio Blocks interface
558
+ """
559
+ with gr.Blocks(
560
+ title="DeepCritical - Drug Repurposing Research Agent",
561
+ theme=gr.themes.Soft(),
562
+ ) as demo:
563
+ gr.Markdown("""
564
+ # 🧬 DeepCritical
565
+ ## AI-Powered Drug Repurposing Research Agent
566
+
567
+ Ask questions about potential drug repurposing opportunities.
568
+ The agent will search PubMed and the web, evaluate evidence, and provide recommendations.
569
+
570
+ **Example questions:**
571
+ - "What drugs could be repurposed for Alzheimer's disease?"
572
+ - "Is metformin effective for cancer treatment?"
573
+ - "What existing medications show promise for Long COVID?"
574
+ """)
575
+
576
+ # Note: additional_inputs render in an accordion below the chat input
577
+ gr.ChatInterface(
578
+ fn=research_agent,
579
+ examples=[
580
+ [
581
+ "What drugs could be repurposed for Alzheimer's disease?",
582
+ "simple",
583
+ "",
584
+ "openai",
585
+ ],
586
+ [
587
+ "Is metformin effective for treating cancer?",
588
+ "simple",
589
+ "",
590
+ "openai",
591
+ ],
592
+ ],
593
+ additional_inputs=[
594
+ gr.Radio(
595
+ choices=["simple", "magentic"],
596
+ value="simple",
597
+ label="Orchestrator Mode",
598
+ info="Simple: Linear | Magentic: Multi-Agent (OpenAI)",
599
+ ),
600
+ gr.Textbox(
601
+ label="API Key (Optional - Bring Your Own Key)",
602
+ placeholder="sk-... or sk-ant-...",
603
+ type="password",
604
+ info="Enter your own API key for full AI analysis. Never stored.",
605
+ ),
606
+ gr.Radio(
607
+ choices=["openai", "anthropic"],
608
+ value="openai",
609
+ label="API Provider",
610
+ info="Select the provider for your API key",
611
+ ),
612
+ ],
613
+ )
614
+
615
+ gr.Markdown("""
616
+ ---
617
+ **Note**: This is a research tool and should not be used for medical decisions.
618
+ Always consult healthcare professionals for medical advice.
619
+
620
+ Built with 🤖 PydanticAI + 🔬 PubMed + 🦆 DuckDuckGo
621
+ """)
622
+
623
+ return demo
624
+
625
+
626
+ def main():
627
+ """Run the Gradio app."""
628
+ demo = create_demo()
629
+ demo.launch(
630
+ server_name="0.0.0.0",
631
+ server_port=7860,
632
+ share=False,
633
+ )
634
+
635
+
636
+ if __name__ == "__main__":
637
+ main()
638
+ ```
639
+
640
+ ---
641
+
642
+ ## 5. TDD Workflow
643
+
644
+ ### Test File: `tests/unit/test_orchestrator.py`
645
+
646
+ ```python
647
+ """Unit tests for Orchestrator."""
648
+ import pytest
649
+ from unittest.mock import AsyncMock, MagicMock
650
+
651
+ from src.utils.models import (
652
+ Evidence,
653
+ Citation,
654
+ SearchResult,
655
+ JudgeAssessment,
656
+ AssessmentDetails,
657
+ OrchestratorConfig,
658
+ )
659
+
660
+
661
+ class TestOrchestrator:
662
+ """Tests for Orchestrator."""
663
+
664
+ @pytest.fixture
665
+ def mock_search_handler(self):
666
+ """Create a mock search handler."""
667
+ handler = AsyncMock()
668
+ handler.execute = AsyncMock(return_value=SearchResult(
669
+ query="test",
670
+ evidence=[
671
+ Evidence(
672
+ content="Test content",
673
+ citation=Citation(
674
+ source="pubmed",
675
+ title="Test Title",
676
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
677
+ date="2024-01-01",
678
+ ),
679
+ ),
680
+ ],
681
+ sources_searched=["pubmed"],
682
+ total_found=1,
683
+ errors=[],
684
+ ))
685
+ return handler
686
+
687
+ @pytest.fixture
688
+ def mock_judge_sufficient(self):
689
+ """Create a mock judge that returns sufficient."""
690
+ handler = AsyncMock()
691
+ handler.assess = AsyncMock(return_value=JudgeAssessment(
692
+ details=AssessmentDetails(
693
+ mechanism_score=8,
694
+ mechanism_reasoning="Good mechanism",
695
+ clinical_evidence_score=7,
696
+ clinical_reasoning="Good clinical",
697
+ drug_candidates=["Drug A"],
698
+ key_findings=["Finding 1"],
699
+ ),
700
+ sufficient=True,
701
+ confidence=0.85,
702
+ recommendation="synthesize",
703
+ next_search_queries=[],
704
+ reasoning="Evidence is sufficient",
705
+ ))
706
+ return handler
707
+
708
+ @pytest.fixture
709
+ def mock_judge_insufficient(self):
710
+ """Create a mock judge that returns insufficient."""
711
+ handler = AsyncMock()
712
+ handler.assess = AsyncMock(return_value=JudgeAssessment(
713
+ details=AssessmentDetails(
714
+ mechanism_score=4,
715
+ mechanism_reasoning="Weak mechanism",
716
+ clinical_evidence_score=3,
717
+ clinical_reasoning="Weak clinical",
718
+ drug_candidates=[],
719
+ key_findings=[],
720
+ ),
721
+ sufficient=False,
722
+ confidence=0.3,
723
+ recommendation="continue",
724
+ next_search_queries=["more specific query"],
725
+ reasoning="Need more evidence",
726
+ ))
727
+ return handler
728
+
729
+ @pytest.mark.asyncio
730
+ async def test_orchestrator_completes_with_sufficient_evidence(
731
+ self,
732
+ mock_search_handler,
733
+ mock_judge_sufficient,
734
+ ):
735
+ """Orchestrator should complete when evidence is sufficient."""
736
+ from src.orchestrator import Orchestrator
737
+
738
+ config = OrchestratorConfig(max_iterations=5)
739
+ orchestrator = Orchestrator(
740
+ search_handler=mock_search_handler,
741
+ judge_handler=mock_judge_sufficient,
742
+ config=config,
743
+ )
744
+
745
+ events = []
746
+ async for event in orchestrator.run("test query"):
747
+ events.append(event)
748
+
749
+ # Should have started, searched, judged, and completed
750
+ event_types = [e.type for e in events]
751
+ assert "started" in event_types
752
+ assert "searching" in event_types
753
+ assert "search_complete" in event_types
754
+ assert "judging" in event_types
755
+ assert "judge_complete" in event_types
756
+ assert "complete" in event_types
757
+
758
+ # Should only have 1 iteration
759
+ complete_event = [e for e in events if e.type == "complete"][0]
760
+ assert complete_event.iteration == 1
761
+
762
+ @pytest.mark.asyncio
763
+ async def test_orchestrator_loops_when_insufficient(
764
+ self,
765
+ mock_search_handler,
766
+ mock_judge_insufficient,
767
+ ):
768
+ """Orchestrator should loop when evidence is insufficient."""
769
+ from src.orchestrator import Orchestrator
770
+
771
+ config = OrchestratorConfig(max_iterations=3)
772
+ orchestrator = Orchestrator(
773
+ search_handler=mock_search_handler,
774
+ judge_handler=mock_judge_insufficient,
775
+ config=config,
776
+ )
777
+
778
+ events = []
779
+ async for event in orchestrator.run("test query"):
780
+ events.append(event)
781
+
782
+ # Should have looping events
783
+ event_types = [e.type for e in events]
784
+ assert event_types.count("looping") >= 2 # At least 2 loop events
785
+
786
+ # Should hit max iterations
787
+ complete_event = [e for e in events if e.type == "complete"][0]
788
+ assert complete_event.data.get("max_reached") is True
789
+
790
+ @pytest.mark.asyncio
791
+ async def test_orchestrator_respects_max_iterations(
792
+ self,
793
+ mock_search_handler,
794
+ mock_judge_insufficient,
795
+ ):
796
+ """Orchestrator should stop at max_iterations."""
797
+ from src.orchestrator import Orchestrator
798
+
799
+ config = OrchestratorConfig(max_iterations=2)
800
+ orchestrator = Orchestrator(
801
+ search_handler=mock_search_handler,
802
+ judge_handler=mock_judge_insufficient,
803
+ config=config,
804
+ )
805
+
806
+ events = []
807
+ async for event in orchestrator.run("test query"):
808
+ events.append(event)
809
+
810
+ # Should have exactly 2 iterations
811
+ max_iteration = max(e.iteration for e in events)
812
+ assert max_iteration == 2
813
+
814
+ @pytest.mark.asyncio
815
+ async def test_orchestrator_handles_search_error(self):
816
+ """Orchestrator should handle search errors gracefully."""
817
+ from src.orchestrator import Orchestrator
818
+
819
+ mock_search = AsyncMock()
820
+ mock_search.execute = AsyncMock(side_effect=Exception("Search failed"))
821
+
822
+ mock_judge = AsyncMock()
823
+ mock_judge.assess = AsyncMock(return_value=JudgeAssessment(
824
+ details=AssessmentDetails(
825
+ mechanism_score=0,
826
+ mechanism_reasoning="N/A",
827
+ clinical_evidence_score=0,
828
+ clinical_reasoning="N/A",
829
+ drug_candidates=[],
830
+ key_findings=[],
831
+ ),
832
+ sufficient=False,
833
+ confidence=0.0,
834
+ recommendation="continue",
835
+ next_search_queries=["retry query"],
836
+ reasoning="Search failed",
837
+ ))
838
+
839
+ config = OrchestratorConfig(max_iterations=2)
840
+ orchestrator = Orchestrator(
841
+ search_handler=mock_search,
842
+ judge_handler=mock_judge,
843
+ config=config,
844
+ )
845
+
846
+ events = []
847
+ async for event in orchestrator.run("test query"):
848
+ events.append(event)
849
+
850
+ # Should have error events
851
+ event_types = [e.type for e in events]
852
+ assert "error" in event_types
853
+
854
+ @pytest.mark.asyncio
855
+ async def test_orchestrator_deduplicates_evidence(self, mock_judge_insufficient):
856
+ """Orchestrator should deduplicate evidence by URL."""
857
+ from src.orchestrator import Orchestrator
858
+
859
+ # Search returns same evidence each time
860
+ duplicate_evidence = Evidence(
861
+ content="Duplicate content",
862
+ citation=Citation(
863
+ source="pubmed",
864
+ title="Same Title",
865
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/", # Same URL
866
+ date="2024-01-01",
867
+ ),
868
+ )
869
+
870
+ mock_search = AsyncMock()
871
+ mock_search.execute = AsyncMock(return_value=SearchResult(
872
+ query="test",
873
+ evidence=[duplicate_evidence],
874
+ sources_searched=["pubmed"],
875
+ total_found=1,
876
+ errors=[],
877
+ ))
878
+
879
+ config = OrchestratorConfig(max_iterations=2)
880
+ orchestrator = Orchestrator(
881
+ search_handler=mock_search,
882
+ judge_handler=mock_judge_insufficient,
883
+ config=config,
884
+ )
885
+
886
+ events = []
887
+ async for event in orchestrator.run("test query"):
888
+ events.append(event)
889
+
890
+ # Second search_complete should show 0 new evidence
891
+ search_complete_events = [e for e in events if e.type == "search_complete"]
892
+ assert len(search_complete_events) == 2
893
+
894
+ # First iteration should have 1 new
895
+ assert search_complete_events[0].data["new_count"] == 1
896
+
897
+ # Second iteration should have 0 new (duplicate)
898
+ assert search_complete_events[1].data["new_count"] == 0
899
+
900
+
901
+ class TestAgentEvent:
902
+ """Tests for AgentEvent."""
903
+
904
+ def test_to_markdown(self):
905
+ """AgentEvent should format to markdown correctly."""
906
+ from src.utils.models import AgentEvent
907
+
908
+ event = AgentEvent(
909
+ type="searching",
910
+ message="Searching for: metformin alzheimer",
911
+ iteration=1,
912
+ )
913
+
914
+ md = event.to_markdown()
915
+ assert "🔍" in md
916
+ assert "SEARCHING" in md
917
+ assert "metformin alzheimer" in md
918
+
919
+ def test_complete_event_icon(self):
920
+ """Complete event should have celebration icon."""
921
+ from src.utils.models import AgentEvent
922
+
923
+ event = AgentEvent(
924
+ type="complete",
925
+ message="Done!",
926
+ iteration=3,
927
+ )
928
+
929
+ md = event.to_markdown()
930
+ assert "🎉" in md
931
+ ```
932
+
933
+ ---
934
+
935
+ ## 6. Dockerfile
936
+
937
+ ```dockerfile
938
+ # Dockerfile for DeepCritical
939
+ FROM python:3.11-slim
940
+
941
+ # Set working directory
942
+ WORKDIR /app
943
+
944
+ # Install system dependencies
945
+ RUN apt-get update && apt-get install -y \
946
+ git \
947
+ && rm -rf /var/lib/apt/lists/*
948
+
949
+ # Install uv
950
+ RUN pip install uv
951
+
952
+ # Copy project files
953
+ COPY pyproject.toml .
954
+ COPY src/ src/
955
+
956
+ # Install dependencies
957
+ RUN uv pip install --system .
958
+
959
+ # Expose port
960
+ EXPOSE 7860
961
+
962
+ # Set environment variables
963
+ ENV GRADIO_SERVER_NAME=0.0.0.0
964
+ ENV GRADIO_SERVER_PORT=7860
965
+
966
+ # Run the app
967
+ CMD ["python", "-m", "src.app"]
968
+ ```
969
+
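+ To sanity-check the image locally (image tag and environment variables are illustrative):
+
+ ```bash
+ # Build the image from the repo root
+ docker build -t deepcritical .
+
+ # Run it; pass an API key only if you want the premium judge instead of the free HF tier
+ docker run --rm -p 7860:7860 -e OPENAI_API_KEY=sk-... deepcritical
+
+ # UI is then available at http://localhost:7860
+ ```
+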
970
+ ---
971
+
972
+ ## 7. HuggingFace Spaces Configuration
973
+
974
+ Create `README.md` header for HuggingFace Spaces:
975
+
976
+ ```markdown
977
+ ---
978
+ title: DeepCritical
979
+ emoji: 🧬
980
+ colorFrom: blue
981
+ colorTo: purple
982
+ sdk: gradio
983
+ sdk_version: 5.0.0
984
+ app_file: src/app.py
985
+ pinned: false
986
+ license: mit
987
+ ---
988
+
989
+ # DeepCritical
990
+
991
+ AI-Powered Drug Repurposing Research Agent
992
+ ```
993
+
994
+ ---
995
+
996
+ ## 8. Implementation Checklist
997
+
998
+ - [ ] Add `AgentEvent` and `OrchestratorConfig` models to `src/utils/models.py`
999
+ - [ ] Implement `src/orchestrator.py` with full Orchestrator class
1000
+ - [ ] Implement `src/app.py` with Gradio interface
1001
+ - [ ] Create `tests/unit/test_orchestrator.py` with all tests
1002
+ - [ ] Create `Dockerfile` for deployment
1003
+ - [ ] Update project `README.md` with usage instructions
1004
+ - [ ] Run `uv run pytest tests/unit/test_orchestrator.py -v` — **ALL TESTS MUST PASS**
1005
+ - [ ] Test locally: `uv run python -m src.app`
1006
+ - [ ] Commit: `git commit -m "feat: phase 4 orchestrator and UI complete"`
1007
+
1008
+ ---
1009
+
1010
+ ## 9. Definition of Done
1011
+
1012
+ Phase 4 is **COMPLETE** when:
1013
+
1014
+ 1. All unit tests pass: `uv run pytest tests/unit/test_orchestrator.py -v`
1015
+ 2. Orchestrator correctly loops Search -> Judge until sufficient
1016
+ 3. Max iterations limit is enforced
1017
+ 4. Graceful error handling throughout
1018
+ 5. Gradio UI streams events in real-time
1019
+ 6. Can run locally:
1020
+
1021
+ ```bash
1022
+ # Start the UI
1023
+ uv run python -m src.app
1024
+
1025
+ # Open browser to http://localhost:7860
1026
+ # Enter a question like "What drugs could be repurposed for Alzheimer's disease?"
1027
+ # Watch the agent search, evaluate, and respond
1028
+ ```
1029
+
1030
+ 7. Can run the full flow in Python:
1031
+
1032
+ ```python
1033
+ import asyncio
1034
+ from src.orchestrator import Orchestrator
1035
+ from src.tools.pubmed import PubMedTool
1036
+ from src.tools.biorxiv import BioRxivTool
1037
+ from src.tools.clinicaltrials import ClinicalTrialsTool
1038
+ from src.tools.search_handler import SearchHandler
1039
+ from src.agent_factory.judges import HFInferenceJudgeHandler, MockJudgeHandler
1040
+ from src.utils.models import OrchestratorConfig
1041
+
1042
+ async def test_full_flow():
1043
+ # Create components
1044
+ search_handler = SearchHandler([PubMedTool(), ClinicalTrialsTool(), BioRxivTool()])
1045
+
1046
+ # Option 1: Use FREE HuggingFace Inference (real AI analysis)
1047
+ judge_handler = HFInferenceJudgeHandler()
1048
+
1049
+ # Option 2: Use MockJudgeHandler for UNIT TESTING ONLY
1050
+ # judge_handler = MockJudgeHandler()
1051
+
1052
+ config = OrchestratorConfig(max_iterations=3)
1053
+
1054
+ # Create orchestrator
1055
+ orchestrator = Orchestrator(
1056
+ search_handler=search_handler,
1057
+ judge_handler=judge_handler,
1058
+ config=config,
1059
+ )
1060
+
1061
+ # Run and collect events
1062
+ print("Starting agent...")
1063
+ async for event in orchestrator.run("metformin alzheimer"):
1064
+ print(event.to_markdown())
1065
+
1066
+ print("\nDone!")
1067
+
1068
+ asyncio.run(test_full_flow())
1069
+ ```
1070
+
1071
+ **Important**: `MockJudgeHandler` is for **unit testing only**. For actual demo/production use, always use `HFInferenceJudgeHandler` (free) or `JudgeHandler` (with API key).
1072
+
1073
+ ---
1074
+
1075
+ ## 10. Deployment Verification
1076
+
1077
+ After deployment to HuggingFace Spaces:
1078
+
1079
+ 1. **Visit the Space URL** and verify the UI loads
1080
+ 2. **Test with example queries**:
1081
+ - "What drugs could be repurposed for Alzheimer's disease?"
1082
+ - "Is metformin effective for cancer treatment?"
1083
+ 3. **Verify streaming** - events should appear in real-time (a scripted smoke test is sketched after this list)
1084
+ 4. **Check error handling** - try an empty query, verify graceful handling
1085
+ 5. **Monitor logs** for any errors
1086
+
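+ A scripted smoke test is sketched below. The Space id, `api_name`, and argument order are
+ assumptions; check the Space's "Use via API" page for the exact signature before relying on it.
+
+ ```python
+ from gradio_client import Client
+
+ # Hypothetical Space id; replace with the real one after deployment
+ client = Client("your-username/DeepCritical")
+
+ result = client.predict(
+     "Is metformin effective for cancer treatment?",  # message
+     "simple",   # orchestrator mode (additional input)
+     "",         # API key (empty -> free HF tier)
+     "openai",   # API provider
+     api_name="/chat",
+ )
+ print(result)
+ ```
+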
1087
+ ---
1088
+
1089
+ ## Project Complete! 🎉
1090
+
1091
+ When Phase 4 is done, the DeepCritical MVP is complete:
1092
+
1093
+ - **Phase 1**: Foundation (uv, pytest, config) ✅
1094
+ - **Phase 2**: Search Slice (PubMed, DuckDuckGo) ✅
1095
+ - **Phase 3**: Judge Slice (PydanticAI, structured output) ✅
1096
+ - **Phase 4**: Orchestrator + UI (Gradio, streaming) ✅
1097
+
1098
+ The agent can:
1099
+ 1. Accept a drug repurposing question
1100
+ 2. Search PubMed and the web for evidence
1101
+ 3. Evaluate evidence quality with an LLM
1102
+ 4. Loop until confident or max iterations
1103
+ 5. Synthesize a research-backed recommendation
1104
+ 6. Display real-time progress in a beautiful UI
docs/implementation/05_phase_magentic.md ADDED
@@ -0,0 +1,1091 @@
1
+ # Phase 5 Implementation Spec: Magentic Integration
2
+
3
+ **Goal**: Upgrade orchestrator to use Microsoft Agent Framework's Magentic-One pattern.
4
+ **Philosophy**: "Same API, Better Engine."
5
+ **Prerequisite**: Phase 4 complete (MVP working end-to-end)
6
+
7
+ ---
8
+
9
+ ## 1. Why Magentic?
10
+
11
+ Magentic-One provides:
12
+ - **LLM-powered manager** that dynamically plans, selects agents, tracks progress
13
+ - **Built-in stall detection** and automatic replanning
14
+ - **Checkpointing** for pause/resume workflows
15
+ - **Event streaming** for real-time UI updates
16
+ - **Multi-agent coordination** with round limits and reset logic
17
+
18
+ ---
19
+
20
+ ## 2. Critical Architecture Understanding
21
+
22
+ ### 2.1 How Magentic Actually Works
23
+
24
+ ```
25
+ ┌─────────────────────────────────────────────────────────────────────────┐
26
+ │ MagenticBuilder Workflow │
27
+ ├─────────────────────────────────────────────────────────────────────────┤
28
+ │ │
29
+ │ User Task: "Research drug repurposing for metformin alzheimer" │
30
+ │ ↓ │
31
+ │ ┌──────────────────────────────────────────────────────────────────┐ │
32
+ │ │ StandardMagenticManager │ │
33
+ │ │ │ │
34
+ │ │ 1. plan() → LLM generates facts & plan │ │
35
+ │ │ 2. create_progress_ledger() → LLM decides: │ │
36
+ │ │ - is_request_satisfied? │ │
37
+ │ │ - next_speaker: "searcher" │ │
38
+ │ │ - instruction_or_question: "Search for clinical trials..." │ │
39
+ │ │ │ │
40
+ │ └──────────────────────────────────────────────────────────────────┘ │
41
+ │ ↓ │
42
+ │ NATURAL LANGUAGE INSTRUCTION sent to agent │
43
+ │ "Search for clinical trials about metformin..." │
44
+ │ ↓ │
45
+ │ ┌──────────────────────────────────────────────────────────────────┐ │
46
+ │ │ ChatAgent (searcher) │ │
47
+ │ │ │ │
48
+ │ │ chat_client (INTERNAL LLM) ← understands instruction │ │
49
+ │ │ ↓ │ │
50
+ │ │ "I'll search for metformin alzheimer clinical trials" │ │
51
+ │ │ ↓ │ │
52
+ │ │ tools=[search_pubmed, search_clinicaltrials] ← calls tools │ │
53
+ │ │ ↓ │ │
54
+ │ │ Returns natural language response to manager │ │
55
+ │ │ │ │
56
+ │ └──────────────────────────────────────────────────────────────────┘ │
57
+ │ ↓ │
58
+ │ Manager evaluates response │
59
+ │ Decides next agent or completion │
60
+ │ │
61
+ └─────────────────────────────────────────────────────────────────────────┘
62
+ ```
63
+
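+ To make the diagram concrete, a wiring sketch is shown below. The method names on
+ `MagenticBuilder` and the standard manager are assumptions based on the flow above;
+ confirm them against the agent_framework documentation before use.
+
+ ```python
+ from agent_framework import MagenticBuilder
+ from agent_framework.openai import OpenAIChatClient
+
+ # search_agent is a ChatAgent like the one shown in section 2.3
+ workflow = (
+     MagenticBuilder()
+     .participants(searcher=search_agent)  # agents the manager can select by name
+     .with_standard_manager(chat_client=OpenAIChatClient(model_id="gpt-4o"))  # planner LLM
+     .build()
+ )
+
+ async def run() -> None:
+     # The manager plans, picks the next speaker, and streams events for the UI
+     async for event in workflow.run_stream("Research drug repurposing for metformin alzheimer"):
+         print(type(event).__name__)
+ ```
+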
64
+ ### 2.2 The Critical Insight
65
+
66
+ **Microsoft's ChatAgent has an INTERNAL LLM (`chat_client`) that:**
67
+ 1. Receives natural language instructions from the manager
68
+ 2. Understands what action to take
69
+ 3. Calls attached tools (functions)
70
+ 4. Returns natural language responses
71
+
72
+ **Our previous implementation was WRONG because:**
73
+ - We wrapped handlers as bare `BaseAgent` subclasses
74
+ - No internal LLM to understand instructions
75
+ - Raw instruction text was passed directly to APIs (PubMed doesn't understand "Search for clinical trials...")
76
+
77
+ ### 2.3 Correct Pattern: ChatAgent with Tools
78
+
79
+ ```python
80
+ # CORRECT: Agent backed by LLM that calls tools
81
+ from agent_framework import ChatAgent, AIFunction
82
+ from agent_framework.openai import OpenAIChatClient
83
+
84
+ # Define tool that ChatAgent can call
85
+ @AIFunction
86
+ async def search_pubmed(query: str, max_results: int = 10) -> str:
87
+ """Search PubMed for biomedical literature.
88
+
89
+ Args:
90
+ query: Search keywords (e.g., "metformin alzheimer mechanism")
91
+ max_results: Maximum number of results to return
92
+ """
93
+ result = await pubmed_tool.search(query, max_results)
94
+ return format_results(result)
95
+
96
+ # ChatAgent with internal LLM + tools
97
+ search_agent = ChatAgent(
98
+ name="SearchAgent",
99
+ description="Searches biomedical databases for drug repurposing evidence",
100
+ instructions="You search PubMed, ClinicalTrials.gov, and bioRxiv for evidence.",
101
+ chat_client=OpenAIChatClient(model_id="gpt-4o-mini"), # INTERNAL LLM
102
+ tools=[search_pubmed, search_clinicaltrials, search_biorxiv], # TOOLS
103
+ )
104
+ ```
105
+
106
+ ---
107
+
108
+ ## 3. Correct Implementation
109
+
110
+ ### 3.1 Shared State Module (`src/agents/state.py`)
111
+
112
+ **CRITICAL**: Tools must update shared state so:
113
+ 1. EmbeddingService can deduplicate across searches
114
+ 2. ReportAgent can access structured Evidence objects for citations
115
+
116
+ ```python
117
+ """Shared state for Magentic agents.
118
+
119
+ This module provides global state that tools update as a side effect.
120
+ ChatAgent tools return strings to the LLM, but also update this state
121
+ for semantic deduplication and structured citation access.
122
+ """
123
+ from __future__ import annotations
124
+
125
+ from typing import TYPE_CHECKING
126
+
127
+ import structlog
128
+
129
+ if TYPE_CHECKING:
130
+ from src.services.embeddings import EmbeddingService
131
+
132
+ from src.utils.models import Evidence
133
+
134
+ logger = structlog.get_logger()
135
+
136
+
137
+ class MagenticState:
138
+ """Shared state container for Magentic workflow.
139
+
140
+ Maintains:
141
+ - evidence_store: All collected Evidence objects (for citations)
142
+ - embedding_service: Optional semantic search (for deduplication)
143
+ """
144
+
145
+ def __init__(self) -> None:
146
+ self.evidence_store: list[Evidence] = []
147
+ self.embedding_service: EmbeddingService | None = None
148
+ self._seen_urls: set[str] = set()
149
+
150
+ def init_embedding_service(self) -> None:
151
+ """Lazy-initialize embedding service if available."""
152
+ if self.embedding_service is not None:
153
+ return
154
+ try:
155
+ from src.services.embeddings import get_embedding_service
156
+ self.embedding_service = get_embedding_service()
157
+ logger.info("Embedding service enabled for Magentic mode")
158
+ except Exception as e:
159
+ logger.warning("Embedding service unavailable", error=str(e))
160
+
161
+ async def add_evidence(self, evidence_list: list[Evidence]) -> list[Evidence]:
162
+ """Add evidence with semantic deduplication.
163
+
164
+ Args:
165
+ evidence_list: New evidence from search
166
+
167
+ Returns:
168
+ List of unique evidence (not duplicates)
169
+ """
170
+ if not evidence_list:
171
+ return []
172
+
173
+ # URL-based deduplication first (fast)
174
+ url_unique = [
175
+ e for e in evidence_list
176
+ if e.citation.url not in self._seen_urls
177
+ ]
178
+
179
+ # Semantic deduplication if available
180
+ if self.embedding_service and url_unique:
181
+ try:
182
+ unique = await self.embedding_service.deduplicate(url_unique, threshold=0.85)
183
+ logger.info(
184
+ "Semantic deduplication",
185
+ before=len(url_unique),
186
+ after=len(unique),
187
+ )
188
+ except Exception as e:
189
+ logger.warning("Deduplication failed, using URL-based", error=str(e))
190
+ unique = url_unique
191
+ else:
192
+ unique = url_unique
193
+
194
+ # Update state
195
+ for e in unique:
196
+ self._seen_urls.add(e.citation.url)
197
+ self.evidence_store.append(e)
198
+
199
+ return unique
200
+
201
+ async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]:
202
+ """Find semantically related evidence from vector store.
203
+
204
+ Args:
205
+ query: Search query
206
+ n_results: Number of related items
207
+
208
+ Returns:
209
+ Related Evidence objects (reconstructed from vector store)
210
+ """
211
+ if not self.embedding_service:
212
+ return []
213
+
214
+ try:
215
+ from src.utils.models import Citation
216
+
217
+ related = await self.embedding_service.search_similar(query, n_results)
218
+ evidence = []
219
+
220
+ for item in related:
221
+ if item["id"] in self._seen_urls:
222
+ continue # Already in results
223
+
224
+ meta = item.get("metadata", {})
225
+ authors_str = meta.get("authors", "")
226
+ authors = [a.strip() for a in authors_str.split(",") if a.strip()]
227
+
228
+ ev = Evidence(
229
+ content=item["content"],
230
+ citation=Citation(
231
+ title=meta.get("title", "Related Evidence"),
232
+ url=item["id"],
233
+ source=meta.get("source", "pubmed"),
234
+ date=meta.get("date", "n.d."),
235
+ authors=authors,
236
+ ),
237
+ relevance=max(0.0, 1.0 - item.get("distance", 0.5)),
238
+ )
239
+ evidence.append(ev)
240
+
241
+ return evidence
242
+ except Exception as e:
243
+ logger.warning("Related search failed", error=str(e))
244
+ return []
245
+
246
+ def reset(self) -> None:
247
+ """Reset state for new workflow run."""
248
+ self.evidence_store.clear()
249
+ self._seen_urls.clear()
250
+
251
+
252
+ # Global singleton for workflow
253
+ _state: MagenticState | None = None
254
+
255
+
256
+ def get_magentic_state() -> MagenticState:
257
+ """Get or create the global Magentic state."""
258
+ global _state
259
+ if _state is None:
260
+ _state = MagenticState()
261
+ return _state
262
+
263
+
264
+ def reset_magentic_state() -> None:
265
+ """Reset state for a fresh workflow run."""
266
+ global _state
267
+ if _state is not None:
268
+ _state.reset()
269
+ else:
270
+ _state = MagenticState()
271
+ ```
272
+
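+ A minimal sketch of how the shared state behaves across two tool calls (the `Evidence` and
+ `Citation` fields follow `src/utils/models.py` as used elsewhere in this spec):
+
+ ```python
+ import asyncio
+
+ from src.agents.state import get_magentic_state, reset_magentic_state
+ from src.utils.models import Citation, Evidence
+
+ async def demo() -> None:
+     reset_magentic_state()
+     state = get_magentic_state()
+     state.init_embedding_service()  # optional; without it, dedup is URL-based only
+
+     ev = Evidence(
+         content="Metformin shows neuroprotective effects in AD models",
+         citation=Citation(
+             source="pubmed",
+             title="Metformin and Alzheimer's",
+             url="https://pubmed.ncbi.nlm.nih.gov/12345/",
+             date="2024-01-01",
+         ),
+     )
+
+     first = await state.add_evidence([ev])   # new URL -> stored
+     second = await state.add_evidence([ev])  # same URL on a later call -> dropped
+     print(len(first), len(second), len(state.evidence_store))  # 1 0 1
+
+ asyncio.run(demo())
+ ```
+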
273
+ ### 3.2 Tool Functions (`src/agents/tools.py`)
274
+
275
+ Tools call APIs AND update shared state. Return strings to LLM, but also store structured Evidence.
276
+
277
+ ```python
278
+ """Tool functions for Magentic agents.
279
+
280
+ IMPORTANT: These tools do TWO things:
281
+ 1. Return formatted strings to the ChatAgent's internal LLM
282
+ 2. Update shared state (evidence_store, embeddings) as a side effect
283
+
284
+ This preserves semantic deduplication and structured citation access.
285
+ """
286
+ from agent_framework import AIFunction
287
+
288
+ from src.agents.state import get_magentic_state
289
+ from src.tools.biorxiv import BioRxivTool
290
+ from src.tools.clinicaltrials import ClinicalTrialsTool
291
+ from src.tools.pubmed import PubMedTool
292
+
293
+ # Singleton tool instances
294
+ _pubmed = PubMedTool()
295
+ _clinicaltrials = ClinicalTrialsTool()
296
+ _biorxiv = BioRxivTool()
297
+
298
+
299
+ def _format_results(results: list, source_name: str, query: str) -> str:
300
+ """Format search results for LLM consumption."""
301
+ if not results:
302
+ return f"No {source_name} results found for: {query}"
303
+
304
+ output = [f"Found {len(results)} {source_name} results:\n"]
305
+ for i, r in enumerate(results[:10], 1):
306
+ output.append(f"{i}. **{r.citation.title}**")
307
+ output.append(f" Source: {r.citation.source} | Date: {r.citation.date}")
308
+ output.append(f" {r.content[:300]}...")
309
+ output.append(f" URL: {r.citation.url}\n")
310
+
311
+ return "\n".join(output)
312
+
313
+
314
+ @AIFunction
315
+ async def search_pubmed(query: str, max_results: int = 10) -> str:
316
+ """Search PubMed for biomedical research papers.
317
+
318
+ Use this tool to find peer-reviewed scientific literature about
319
+ drugs, diseases, mechanisms of action, and clinical studies.
320
+
321
+ Args:
322
+ query: Search keywords (e.g., "metformin alzheimer mechanism")
323
+ max_results: Maximum results to return (default 10)
324
+
325
+ Returns:
326
+ Formatted list of papers with titles, abstracts, and citations
327
+ """
328
+ # 1. Execute search
329
+ results = await _pubmed.search(query, max_results)
330
+
331
+ # 2. Update shared state (semantic dedup + evidence store)
332
+ state = get_magentic_state()
333
+ unique = await state.add_evidence(results)
334
+
335
+ # 3. Also get related evidence from vector store
336
+ related = await state.search_related(query, n_results=3)
337
+ if related:
338
+ await state.add_evidence(related)
339
+
340
+ # 4. Return formatted string for LLM
341
+ total_new = len(unique)
342
+ total_stored = len(state.evidence_store)
343
+
344
+ output = _format_results(results, "PubMed", query)
345
+ output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"
346
+
347
+ if related:
348
+ output += f"\n[Also found {len(related)} semantically related items from previous searches]"
349
+
350
+ return output
351
+
352
+
353
+ @AIFunction
354
+ async def search_clinical_trials(query: str, max_results: int = 10) -> str:
355
+ """Search ClinicalTrials.gov for clinical studies.
356
+
357
+ Use this tool to find ongoing and completed clinical trials
358
+ for drug repurposing candidates.
359
+
360
+ Args:
361
+ query: Search terms (e.g., "metformin cancer phase 3")
362
+ max_results: Maximum results to return (default 10)
363
+
364
+ Returns:
365
+ Formatted list of clinical trials with status and details
366
+ """
367
+ # 1. Execute search
368
+ results = await _clinicaltrials.search(query, max_results)
369
+
370
+ # 2. Update shared state
371
+ state = get_magentic_state()
372
+ unique = await state.add_evidence(results)
373
+
374
+ # 3. Return formatted string
375
+ total_new = len(unique)
376
+ total_stored = len(state.evidence_store)
377
+
378
+ output = _format_results(results, "ClinicalTrials.gov", query)
379
+ output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"
380
+
381
+ return output
382
+
383
+
384
+ @AIFunction
385
+ async def search_preprints(query: str, max_results: int = 10) -> str:
386
+ """Search bioRxiv/medRxiv for preprint papers.
387
+
388
+ Use this tool to find the latest research that hasn't been
389
+ peer-reviewed yet. Good for cutting-edge findings.
390
+
391
+ Args:
392
+ query: Search terms (e.g., "long covid treatment")
393
+ max_results: Maximum results to return (default 10)
394
+
395
+ Returns:
396
+ Formatted list of preprints with abstracts and links
397
+ """
398
+ # 1. Execute search
399
+ results = await _biorxiv.search(query, max_results)
400
+
401
+ # 2. Update shared state
402
+ state = get_magentic_state()
403
+ unique = await state.add_evidence(results)
404
+
405
+ # 3. Return formatted string
406
+ total_new = len(unique)
407
+ total_stored = len(state.evidence_store)
408
+
409
+ output = _format_results(results, "bioRxiv/medRxiv", query)
410
+ output += f"\n[State: {total_new} new, {total_stored} total in evidence store]"
411
+
412
+ return output
413
+
414
+
415
+ @AIFunction
416
+ async def get_evidence_summary() -> str:
417
+ """Get summary of all collected evidence.
418
+
419
+ Use this tool when you need to review what evidence has been collected
420
+ before making an assessment or generating a report.
421
+
422
+ Returns:
423
+ Summary of evidence store with counts and key citations
424
+ """
425
+ state = get_magentic_state()
426
+ evidence = state.evidence_store
427
+
428
+ if not evidence:
429
+ return "No evidence collected yet."
430
+
431
+ # Group by source
432
+ by_source: dict[str, list] = {}
433
+ for e in evidence:
434
+ src = e.citation.source
435
+ if src not in by_source:
436
+ by_source[src] = []
437
+ by_source[src].append(e)
438
+
439
+ output = [f"**Evidence Store Summary** ({len(evidence)} total items)\n"]
440
+
441
+ for source, items in by_source.items():
442
+ output.append(f"\n### {source.upper()} ({len(items)} items)")
443
+ for e in items[:5]: # First 5 per source
444
+ output.append(f"- {e.citation.title[:80]}...")
445
+
446
+ return "\n".join(output)
447
+
448
+
449
+ @AIFunction
450
+ async def get_bibliography() -> str:
451
+ """Get full bibliography of all collected evidence.
452
+
453
+ Use this tool when generating a final report to get properly
454
+ formatted citations for all evidence.
455
+
456
+ Returns:
457
+ Numbered bibliography with full citation details
458
+ """
459
+ state = get_magentic_state()
460
+ evidence = state.evidence_store
461
+
462
+ if not evidence:
463
+ return "No evidence collected for bibliography."
464
+
465
+ output = ["## References\n"]
466
+
467
+ for i, e in enumerate(evidence, 1):
468
+ # Format: Authors (Year). Title. Source. URL
469
+ authors = ", ".join(e.citation.authors[:3]) if e.citation.authors else "Unknown"
470
+ if e.citation.authors and len(e.citation.authors) > 3:
471
+ authors += " et al."
472
+
473
+ year = e.citation.date[:4] if e.citation.date else "n.d."
474
+
475
+ output.append(
476
+ f"{i}. {authors} ({year}). {e.citation.title}. "
477
+ f"*{e.citation.source.upper()}*. [{e.citation.url}]({e.citation.url})"
478
+ )
479
+
480
+ return "\n".join(output)
481
+ ```
482
+
483
+ ### 3.3 ChatAgent-Based Agents (`src/agents/magentic_agents.py`)
484
+
485
+ ```python
486
+ """Magentic-compatible agents using ChatAgent pattern."""
487
+ from agent_framework import ChatAgent
488
+ from agent_framework.openai import OpenAIChatClient
489
+
490
+ from src.agents.tools import (
491
+ get_bibliography,
492
+ get_evidence_summary,
493
+ search_clinical_trials,
494
+ search_preprints,
495
+ search_pubmed,
496
+ )
497
+ from src.utils.config import settings
498
+
499
+
500
+ def create_search_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
501
+ """Create a search agent with internal LLM and search tools.
502
+
503
+ Args:
504
+ chat_client: Optional custom chat client. If None, uses default.
505
+
506
+ Returns:
507
+ ChatAgent configured for biomedical search
508
+ """
509
+ client = chat_client or OpenAIChatClient(
510
+ model_id="gpt-4o-mini", # Fast, cheap for tool orchestration
511
+ api_key=settings.openai_api_key,
512
+ )
513
+
514
+ return ChatAgent(
515
+ name="SearchAgent",
516
+ description="Searches biomedical databases (PubMed, ClinicalTrials.gov, bioRxiv) for drug repurposing evidence",
517
+ instructions="""You are a biomedical search specialist. When asked to find evidence:
518
+
519
+ 1. Analyze the request to determine what to search for
520
+ 2. Extract key search terms (drug names, disease names, mechanisms)
521
+ 3. Use the appropriate search tools:
522
+ - search_pubmed for peer-reviewed papers
523
+ - search_clinical_trials for clinical studies
524
+ - search_preprints for cutting-edge findings
525
+ 4. Summarize what you found and highlight key evidence
526
+
527
+ Be thorough - search multiple databases when appropriate.
528
+ Focus on finding: mechanisms of action, clinical evidence, and specific drug candidates.""",
529
+ chat_client=client,
530
+ tools=[search_pubmed, search_clinical_trials, search_preprints],
531
+ temperature=0.3, # More deterministic for tool use
532
+ )
533
+
534
+
535
+ def create_judge_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
536
+ """Create a judge agent that evaluates evidence quality.
537
+
538
+ Args:
539
+ chat_client: Optional custom chat client. If None, uses default.
540
+
541
+ Returns:
542
+ ChatAgent configured for evidence assessment
543
+ """
544
+ client = chat_client or OpenAIChatClient(
545
+ model_id="gpt-4o", # Better model for nuanced judgment
546
+ api_key=settings.openai_api_key,
547
+ )
548
+
549
+ return ChatAgent(
550
+ name="JudgeAgent",
551
+ description="Evaluates evidence quality and determines if sufficient for synthesis",
552
+ instructions="""You are an evidence quality assessor. When asked to evaluate:
553
+
554
+ 1. First, call get_evidence_summary() to see all collected evidence
555
+ 2. Score on two dimensions (0-10 each):
556
+ - Mechanism Score: How well is the biological mechanism explained?
557
+ - Clinical Score: How strong is the clinical/preclinical evidence?
558
+ 3. Determine if evidence is SUFFICIENT for a final report:
559
+ - Sufficient: Clear mechanism + supporting clinical data
560
+ - Insufficient: Gaps in mechanism OR weak clinical evidence
561
+ 4. If insufficient, suggest specific search queries to fill gaps
562
+
563
+ Be rigorous but fair. Look for:
564
+ - Molecular targets and pathways
565
+ - Animal model studies
566
+ - Human clinical trials
567
+ - Safety data
568
+ - Drug-drug interactions""",
569
+ chat_client=client,
570
+ tools=[get_evidence_summary], # Can review collected evidence
571
+ temperature=0.2, # Consistent judgments
572
+ )
573
+
574
+
575
+ def create_hypothesis_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
576
+ """Create a hypothesis generation agent.
577
+
578
+ Args:
579
+ chat_client: Optional custom chat client. If None, uses default.
580
+
581
+ Returns:
582
+ ChatAgent configured for hypothesis generation
583
+ """
584
+ client = chat_client or OpenAIChatClient(
585
+ model_id="gpt-4o",
586
+ api_key=settings.openai_api_key,
587
+ )
588
+
589
+ return ChatAgent(
590
+ name="HypothesisAgent",
591
+ description="Generates mechanistic hypotheses for drug repurposing",
592
+ instructions="""You are a biomedical hypothesis generator. Based on evidence:
593
+
594
+ 1. Identify the key molecular targets involved
595
+ 2. Map the biological pathways affected
596
+ 3. Generate testable hypotheses in this format:
597
+
598
+ DRUG → TARGET → PATHWAY → THERAPEUTIC EFFECT
599
+
600
+ Example:
601
+ Metformin → AMPK activation → mTOR inhibition → Reduced tau phosphorylation
602
+
603
+ 4. Explain the rationale for each hypothesis
604
+ 5. Suggest what additional evidence would support or refute it
605
+
606
+ Focus on mechanistic plausibility and existing evidence.""",
607
+ chat_client=client,
608
+ temperature=0.5, # Some creativity for hypothesis generation
609
+ )
610
+
611
+
612
+ def create_report_agent(chat_client: OpenAIChatClient | None = None) -> ChatAgent:
613
+ """Create a report synthesis agent.
614
+
615
+ Args:
616
+ chat_client: Optional custom chat client. If None, uses default.
617
+
618
+ Returns:
619
+ ChatAgent configured for report generation
620
+ """
621
+ client = chat_client or OpenAIChatClient(
622
+ model_id="gpt-4o",
623
+ api_key=settings.openai_api_key,
624
+ )
625
+
626
+ return ChatAgent(
627
+ name="ReportAgent",
628
+ description="Synthesizes research findings into structured reports",
629
+ instructions="""You are a scientific report writer. When asked to synthesize:
630
+
631
+ 1. First, call get_evidence_summary() to review all collected evidence
632
+ 2. Then call get_bibliography() to get properly formatted citations
633
+
634
+ Generate a structured report with these sections:
635
+
636
+ ## Executive Summary
637
+ Brief overview of findings and recommendation
638
+
639
+ ## Methodology
640
+ Databases searched, queries used, evidence reviewed
641
+
642
+ ## Key Findings
643
+ ### Mechanism of Action
644
+ - Molecular targets
645
+ - Biological pathways
646
+ - Proposed mechanism
647
+
648
+ ### Clinical Evidence
649
+ - Preclinical studies
650
+ - Clinical trials
651
+ - Safety profile
652
+
653
+ ## Drug Candidates
654
+ List specific drugs with repurposing potential
655
+
656
+ ## Limitations
657
+ Gaps in evidence, conflicting data, caveats
658
+
659
+ ## Conclusion
660
+ Final recommendation with confidence level
661
+
662
+ ## References
663
+ Use the output from get_bibliography() - do not make up citations!
664
+
665
+ Be comprehensive but concise. Cite evidence for all claims.""",
666
+ chat_client=client,
667
+ tools=[get_evidence_summary, get_bibliography], # Access to collected evidence
668
+ temperature=0.3,
669
+ )
670
+ ```
671
+
672
+ ### 3.4 Magentic Orchestrator (`src/orchestrator_magentic.py`)
673
+
674
+ ```python
675
+ """Magentic-based orchestrator using ChatAgent pattern."""
676
+ from collections.abc import AsyncGenerator
677
+ from typing import Any
678
+
679
+ import structlog
680
+ from agent_framework import (
681
+ MagenticAgentDeltaEvent,
682
+ MagenticAgentMessageEvent,
683
+ MagenticBuilder,
684
+ MagenticFinalResultEvent,
685
+ MagenticOrchestratorMessageEvent,
686
+ WorkflowOutputEvent,
687
+ )
688
+ from agent_framework.openai import OpenAIChatClient
689
+
690
+ from src.agents.magentic_agents import (
691
+ create_hypothesis_agent,
692
+ create_judge_agent,
693
+ create_report_agent,
694
+ create_search_agent,
695
+ )
696
+ from src.agents.state import get_magentic_state, reset_magentic_state
697
+ from src.utils.config import settings
698
+ from src.utils.exceptions import ConfigurationError
699
+ from src.utils.models import AgentEvent
700
+
701
+ logger = structlog.get_logger()
702
+
703
+
704
+ class MagenticOrchestrator:
705
+ """
706
+ Magentic-based orchestrator using ChatAgent pattern.
707
+
708
+ Each agent has an internal LLM that understands natural language
709
+ instructions from the manager and can call tools appropriately.
710
+ """
711
+
712
+ def __init__(
713
+ self,
714
+ max_rounds: int = 10,
715
+ chat_client: OpenAIChatClient | None = None,
716
+ ) -> None:
717
+ """Initialize orchestrator.
718
+
719
+ Args:
720
+ max_rounds: Maximum coordination rounds
721
+ chat_client: Optional shared chat client for agents
722
+ """
723
+ if not settings.openai_api_key:
724
+ raise ConfigurationError(
725
+ "Magentic mode requires OPENAI_API_KEY. "
726
+ "Set the key or use mode='simple'."
727
+ )
728
+
729
+ self._max_rounds = max_rounds
730
+ self._chat_client = chat_client
731
+
732
+ def _build_workflow(self) -> Any:
733
+ """Build the Magentic workflow with ChatAgent participants."""
734
+ # Create agents with internal LLMs
735
+ search_agent = create_search_agent(self._chat_client)
736
+ judge_agent = create_judge_agent(self._chat_client)
737
+ hypothesis_agent = create_hypothesis_agent(self._chat_client)
738
+ report_agent = create_report_agent(self._chat_client)
739
+
740
+ # Manager chat client (orchestrates the agents)
741
+ manager_client = OpenAIChatClient(
742
+ model_id="gpt-4o", # Good model for planning/coordination
743
+ api_key=settings.openai_api_key,
744
+ )
745
+
746
+ return (
747
+ MagenticBuilder()
748
+ .participants(
749
+ searcher=search_agent,
750
+ hypothesizer=hypothesis_agent,
751
+ judge=judge_agent,
752
+ reporter=report_agent,
753
+ )
754
+ .with_standard_manager(
755
+ chat_client=manager_client,
756
+ max_round_count=self._max_rounds,
757
+ max_stall_count=3,
758
+ max_reset_count=2,
759
+ )
760
+ .build()
761
+ )
762
+
763
+ async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
764
+ """
765
+ Run the Magentic workflow.
766
+
767
+ Args:
768
+ query: User's research question
769
+
770
+ Yields:
771
+ AgentEvent objects for real-time UI updates
772
+ """
773
+ logger.info("Starting Magentic orchestrator", query=query)
774
+
775
+ # CRITICAL: Reset state for fresh workflow run
776
+ reset_magentic_state()
777
+
778
+ # Initialize embedding service if available
779
+ state = get_magentic_state()
780
+ state.init_embedding_service()
781
+
782
+ yield AgentEvent(
783
+ type="started",
784
+ message=f"Starting research (Magentic mode): {query}",
785
+ iteration=0,
786
+ )
787
+
788
+ workflow = self._build_workflow()
789
+
790
+ task = f"""Research drug repurposing opportunities for: {query}
791
+
792
+ Workflow:
793
+ 1. SearchAgent: Find evidence from PubMed, ClinicalTrials.gov, and bioRxiv
794
+ 2. HypothesisAgent: Generate mechanistic hypotheses (Drug → Target → Pathway → Effect)
795
+ 3. JudgeAgent: Evaluate if evidence is sufficient
796
+ 4. If insufficient → SearchAgent refines search based on gaps
797
+ 5. If sufficient → ReportAgent synthesizes final report
798
+
799
+ Focus on:
800
+ - Identifying specific molecular targets
801
+ - Understanding mechanism of action
802
+ - Finding clinical evidence supporting hypotheses
803
+
804
+ The final output should be a structured research report."""
805
+
806
+ iteration = 0
807
+ try:
808
+ async for event in workflow.run_stream(task):
809
+ agent_event = self._process_event(event, iteration)
810
+ if agent_event:
811
+ if isinstance(event, MagenticAgentMessageEvent):
812
+ iteration += 1
813
+ yield agent_event
814
+
815
+ except Exception as e:
816
+ logger.error("Magentic workflow failed", error=str(e))
817
+ yield AgentEvent(
818
+ type="error",
819
+ message=f"Workflow error: {e!s}",
820
+ iteration=iteration,
821
+ )
822
+
823
+ def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
824
+ """Process workflow event into AgentEvent."""
825
+ if isinstance(event, MagenticOrchestratorMessageEvent):
826
+ text = event.message.text if event.message else ""
827
+ if text:
828
+ return AgentEvent(
829
+ type="judging",
830
+ message=f"Manager ({event.kind}): {text[:200]}...",
831
+ iteration=iteration,
832
+ )
833
+
834
+ elif isinstance(event, MagenticAgentMessageEvent):
835
+ agent_name = event.agent_id or "unknown"
836
+ text = event.message.text if event.message else ""
837
+
838
+ event_type = "judging"
839
+ if "search" in agent_name.lower():
840
+ event_type = "search_complete"
841
+ elif "judge" in agent_name.lower():
842
+ event_type = "judge_complete"
843
+ elif "hypothes" in agent_name.lower():
844
+ event_type = "hypothesizing"
845
+ elif "report" in agent_name.lower():
846
+ event_type = "synthesizing"
847
+
848
+ return AgentEvent(
849
+ type=event_type,
850
+ message=f"{agent_name}: {text[:200]}...",
851
+ iteration=iteration + 1,
852
+ )
853
+
854
+ elif isinstance(event, MagenticFinalResultEvent):
855
+ text = event.message.text if event.message else "No result"
856
+ return AgentEvent(
857
+ type="complete",
858
+ message=text,
859
+ data={"iterations": iteration},
860
+ iteration=iteration,
861
+ )
862
+
863
+ elif isinstance(event, MagenticAgentDeltaEvent):
864
+ if event.text:
865
+ return AgentEvent(
866
+ type="streaming",
867
+ message=event.text,
868
+ data={"agent_id": event.agent_id},
869
+ iteration=iteration,
870
+ )
871
+
872
+ elif isinstance(event, WorkflowOutputEvent):
873
+ if event.data:
874
+ return AgentEvent(
875
+ type="complete",
876
+ message=str(event.data),
877
+ iteration=iteration,
878
+ )
879
+
880
+ return None
881
+ ```
882
+
883
+ ### 3.5 Updated Factory (`src/orchestrator_factory.py`)
884
+
885
+ ```python
886
+ """Factory for creating orchestrators."""
887
+ from typing import Any, Literal
888
+
889
+ from src.orchestrator import JudgeHandlerProtocol, Orchestrator, SearchHandlerProtocol
890
+ from src.utils.models import OrchestratorConfig
891
+
892
+
893
+ def create_orchestrator(
894
+ search_handler: SearchHandlerProtocol | None = None,
895
+ judge_handler: JudgeHandlerProtocol | None = None,
896
+ config: OrchestratorConfig | None = None,
897
+ mode: Literal["simple", "magentic"] = "simple",
898
+ ) -> Any:
899
+ """
900
+ Create an orchestrator instance.
901
+
902
+ Args:
903
+ search_handler: The search handler (required for simple mode)
904
+ judge_handler: The judge handler (required for simple mode)
905
+ config: Optional configuration
906
+ mode: "simple" for Phase 4 loop, "magentic" for ChatAgent-based multi-agent
907
+
908
+ Returns:
909
+ Orchestrator instance
910
+
911
+ Note:
912
+ Magentic mode does NOT use search_handler/judge_handler.
913
+ It creates ChatAgent instances with internal LLMs that call tools directly.
914
+ """
915
+ if mode == "magentic":
916
+ try:
917
+ from src.orchestrator_magentic import MagenticOrchestrator
918
+
919
+ return MagenticOrchestrator(
920
+ max_rounds=config.max_iterations if config else 10,
921
+ )
922
+ except ImportError:
923
+ # Fallback to simple if agent-framework not installed
924
+ pass
925
+
926
+ # Simple mode requires handlers
927
+ if search_handler is None or judge_handler is None:
928
+ raise ValueError("Simple mode requires search_handler and judge_handler")
929
+
930
+ return Orchestrator(
931
+ search_handler=search_handler,
932
+ judge_handler=judge_handler,
933
+ config=config,
934
+ )
935
+ ```
936
+
937
+ ---
938
+
939
+ ## 4. Why This Works
940
+
941
+ ### 4.1 The Manager → Agent Communication
942
+
943
+ ```
944
+ Manager LLM decides: "Tell SearchAgent to find clinical trials for metformin"
945
+
946
+ Sends instruction: "Search for clinical trials about metformin and cancer"
947
+
948
+ SearchAgent's INTERNAL LLM receives this
949
+
950
+ Internal LLM understands: "I should call search_clinical_trials('metformin cancer')"
951
+
952
+ Tool executes: ClinicalTrials.gov API
953
+
954
+ Internal LLM formats response: "I found 15 trials. Here are the key ones..."
955
+
956
+ Manager receives natural language response
957
+ ```
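+ 
+ As a quick sanity check (a sketch, not part of the spec), the same translation can be exercised by driving the SearchAgent directly, outside the Magentic workflow. This assumes `ChatAgent.run()` accepts a plain string and returns messages the way the agents in this document do, and it needs `OPENAI_API_KEY` set:
+ 
+ ```python
+ # Hypothetical smoke test: the agent's internal LLM should turn the natural-language
+ # instruction into a search_clinical_trials("metformin cancer") tool call on its own.
+ import asyncio
+ 
+ from src.agents.magentic_agents import create_search_agent
+ 
+ 
+ async def demo() -> None:
+     agent = create_search_agent()
+     response = await agent.run("Search for clinical trials about metformin and cancer")
+     print(response.messages[0].text)
+ 
+ 
+ asyncio.run(demo())
+ ```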
958
+
959
+ ### 4.2 Why Our Old Implementation Failed
960
+
961
+ ```
962
+ Manager sends: "Search for clinical trials about metformin..."
963
+
964
+ OLD SearchAgent.run() extracts: query = "Search for clinical trials about metformin..."
965
+
966
+ Passes to PubMed: pubmed.search("Search for clinical trials about metformin...")
967
+
968
+ PubMed doesn't understand English instructions → garbage results or error
969
+ ```
970
+
971
+ ---
972
+
973
+ ## 5. Directory Structure
974
+
975
+ ```text
976
+ src/
977
+ ├── agents/
978
+ │ ├── __init__.py
979
+ │ ├── state.py # MagenticState (evidence_store + embeddings)
980
+ │ ├── tools.py # AIFunction tool definitions (update state)
981
+ │ └── magentic_agents.py # ChatAgent factory functions
982
+ ├── services/
983
+ │ └── embeddings.py # EmbeddingService (semantic dedup)
984
+ ├── orchestrator.py # Simple mode (unchanged)
985
+ ├── orchestrator_magentic.py # Magentic mode with ChatAgents
986
+ └── orchestrator_factory.py # Mode selection
987
+ ```
988
+
989
+ ---
990
+
991
+ ## 6. Dependencies
992
+
993
+ ```toml
994
+ [project.optional-dependencies]
995
+ magentic = [
996
+ "agent-framework-core>=1.0.0b",
997
+ "agent-framework-openai>=1.0.0b", # For OpenAIChatClient
998
+ ]
999
+ embeddings = [
1000
+ "chromadb>=0.4.0",
1001
+ "sentence-transformers>=2.2.0",
1002
+ ]
1003
+ ```
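+ 
+ One way to pull in both optional stacks during development (a sketch; assumes an editable install from the repository root, with `uv` as used by the test command in section 9):
+ 
+ ```bash
+ # With uv
+ uv sync --extra magentic --extra embeddings
+ 
+ # Or with plain pip
+ pip install -e ".[magentic,embeddings]"
+ ```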
1004
+
1005
+ **IMPORTANT: Magentic mode REQUIRES an OpenAI API key.**
1006
+
1007
+ The Microsoft Agent Framework's standard manager and ChatAgent use OpenAIChatClient internally.
1008
+ There is no AnthropicChatClient in the framework. If only `ANTHROPIC_API_KEY` is set:
1009
+ - `mode="simple"` works fine
1010
+ - `mode="magentic"` throws `ConfigurationError`
1011
+
1012
+ This is enforced in `MagenticOrchestrator.__init__`.
1013
+
1014
+ ---
1015
+
1016
+ ## 7. Implementation Checklist
1017
+
1018
+ - [ ] Create `src/agents/state.py` with MagenticState class
1019
+ - [ ] Create `src/agents/tools.py` with AIFunction search tools + state updates
1020
+ - [ ] Create `src/agents/magentic_agents.py` with ChatAgent factories
1021
+ - [ ] Rewrite `src/orchestrator_magentic.py` to use ChatAgent pattern
1022
+ - [ ] Update `src/orchestrator_factory.py` for new signature
1023
+ - [ ] Test with real OpenAI API
1024
+ - [ ] Verify manager properly coordinates agents
1025
+ - [ ] Ensure tools are called with correct parameters
1026
+ - [ ] Verify semantic deduplication works (evidence_store populates)
1027
+ - [ ] Verify bibliography generation in final reports
1028
+
1029
+ ---
1030
+
1031
+ ## 8. Definition of Done
1032
+
1033
+ Phase 5 is **COMPLETE** when:
1034
+
1035
+ 1. Magentic mode runs without hanging
1036
+ 2. Manager successfully coordinates agents via natural language
1037
+ 3. SearchAgent calls tools with proper search keywords (not raw instructions)
1038
+ 4. JudgeAgent evaluates evidence from conversation history
1039
+ 5. ReportAgent generates structured final report
1040
+ 6. Events stream to UI correctly
1041
+
1042
+ ---
1043
+
1044
+ ## 9. Testing Magentic Mode
1045
+
1046
+ ```bash
1047
+ # Test with real API
1048
+ OPENAI_API_KEY=sk-... uv run python -c "
1049
+ import asyncio
1050
+ from src.orchestrator_factory import create_orchestrator
1051
+
1052
+ async def test():
1053
+ orch = create_orchestrator(mode='magentic')
1054
+ async for event in orch.run('metformin alzheimer'):
1055
+ print(f'[{event.type}] {event.message[:100]}')
1056
+
1057
+ asyncio.run(test())
1058
+ "
1059
+ ```
1060
+
1061
+ Expected output:
1062
+ ```
1063
+ [started] Starting research (Magentic mode): metformin alzheimer
1064
+ [judging] Manager (plan): I will coordinate the agents to research...
1065
+ [search_complete] SearchAgent: Found 25 PubMed results for metformin alzheimer...
1066
+ [hypothesizing] HypothesisAgent: Based on the evidence, I propose...
1067
+ [judge_complete] JudgeAgent: Mechanism Score: 7/10, Clinical Score: 6/10...
1068
+ [synthesizing] ReportAgent: ## Executive Summary...
1069
+ [complete] <full research report>
1070
+ ```
1071
+
1072
+ ---
1073
+
1074
+ ## 10. Key Differences from Old Spec
1075
+
1076
+ | Aspect | OLD (Wrong) | NEW (Correct) |
1077
+ |--------|-------------|---------------|
1078
+ | Agent type | `BaseAgent` subclass | `ChatAgent` with `chat_client` |
1079
+ | Internal LLM | None | OpenAIChatClient |
1080
+ | How tools work | Handler.execute(raw_instruction) | LLM understands instruction, calls AIFunction |
1081
+ | Message handling | Extract text → pass to API | LLM interprets → extracts keywords → calls tool |
1082
+ | State management | Passed to agent constructors | Global MagenticState singleton |
1083
+ | Evidence storage | In agent instance | In MagenticState.evidence_store |
1084
+ | Semantic search | Coupled to agents | Tools call state.add_evidence() |
1085
+ | Citations for report | From agent's store | Via get_bibliography() tool |
1086
+
1087
+ **Key Insights:**
1088
+ 1. Magentic agents must have internal LLMs to understand natural language instructions
1089
+ 2. Tools must update shared state as a side effect (return strings, but also store Evidence)
1090
+ 3. ReportAgent uses `get_bibliography()` tool to access structured citations
1091
+ 4. State is reset at start of each workflow run via `reset_magentic_state()`
docs/implementation/06_phase_embeddings.md ADDED
@@ -0,0 +1,409 @@
1
+ # Phase 6 Implementation Spec: Embeddings & Semantic Search
2
+
3
+ **Goal**: Add vector search for semantic evidence retrieval.
4
+ **Philosophy**: "Find what you mean, not just what you type."
5
+ **Prerequisite**: Phase 5 complete (Magentic working)
6
+
7
+ ---
8
+
9
+ ## 1. Why Embeddings?
10
+
11
+ Current limitation: **Keyword-only search misses semantically related papers.**
12
+
13
+ Example problem:
14
+ - User searches: "metformin alzheimer"
15
+ - PubMed returns: Papers with exact keywords
16
+ - MISSED: Papers about "AMPK activation neuroprotection" (same mechanism, different words)
17
+
18
+ With embeddings:
19
+ - Embed the query AND all evidence
20
+ - Find semantically similar papers even without keyword match
21
+ - Deduplicate by meaning, not just URL
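+ 
+ As a quick illustration (a sketch, separate from the implementation below), the same model used later in this spec can score the "missed" paper against the query directly:
+ 
+ ```python
+ # Semantic similarity surfaces mechanistically related text with no keyword overlap.
+ from sentence_transformers import SentenceTransformer, util
+ 
+ model = SentenceTransformer("all-MiniLM-L6-v2")
+ query = "metformin alzheimer"
+ candidates = [
+     "AMPK activation confers neuroprotection in neurodegeneration",  # related, no shared keywords
+     "The weather is sunny today",  # unrelated control
+ ]
+ scores = util.cos_sim(model.encode(query), model.encode(candidates))
+ print(scores)  # the first candidate scores clearly higher than the second
+ ```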
22
+
23
+ ---
24
+
25
+ ## 2. Architecture
26
+
27
+ ### Current (Phase 5)
28
+ ```
29
+ Query → SearchAgent → PubMed/Web (keyword) → Evidence
30
+ ```
31
+
32
+ ### Phase 6
33
+ ```
34
+ Query → Embed(Query) → SearchAgent
35
+ ├── PubMed/Web (keyword) → Evidence
36
+ └── VectorDB (semantic) → Related Evidence
37
+
38
+ Evidence → Embed → Store
39
+ ```
40
+
41
+ ### Shared Context Enhancement
42
+ ```python
43
+ # Current
44
+ evidence_store = {"current": []}
45
+
46
+ # Phase 6
47
+ evidence_store = {
48
+ "current": [], # Raw evidence
49
+ "embeddings": {}, # URL -> embedding vector
50
+ "vector_index": None, # ChromaDB collection
51
+ }
52
+ ```
53
+
54
+ ---
55
+
56
+ ## 3. Technology Choice
57
+
58
+ ### ChromaDB (Recommended)
59
+ - **Free**, open-source, local-first
60
+ - No API keys, no cloud dependency
61
+ - Supports sentence-transformers out of the box
62
+ - Perfect for hackathon (no infra setup)
63
+
64
+ ### Embedding Model
65
+ - `sentence-transformers/all-MiniLM-L6-v2` (fast, good quality)
66
+ - Or `BAAI/bge-small-en-v1.5` (better quality, still fast)
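+ 
+ The model name is just a constructor argument on the `EmbeddingService` defined in section 4.2, so swapping it is a one-liner:
+ 
+ ```python
+ from src.services.embeddings import EmbeddingService
+ 
+ # Default is all-MiniLM-L6-v2; pass the BGE model if quality matters more than speed.
+ service = EmbeddingService(model_name="BAAI/bge-small-en-v1.5")
+ ```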
67
+
68
+ ---
69
+
70
+ ## 4. Implementation
71
+
72
+ ### 4.1 Dependencies
73
+
74
+ Add to `pyproject.toml`:
75
+ ```toml
76
+ [project.optional-dependencies]
77
+ embeddings = [
78
+ "chromadb>=0.4.0",
79
+ "sentence-transformers>=2.2.0",
80
+ ]
81
+ ```
82
+
83
+ ### 4.2 Embedding Service (`src/services/embeddings.py`)
84
+
85
+ > **CRITICAL: Async Pattern Required**
86
+ >
87
+ > `sentence-transformers` is synchronous and CPU-bound. Running it directly in async code
88
+ > will **block the event loop**, freezing the UI and halting all concurrent operations.
89
+ >
90
+ > **Solution**: Use `asyncio.run_in_executor()` to offload to thread pool.
91
+ > This pattern already exists in `src/tools/websearch.py:28-34`.
92
+
93
+ ```python
94
+ """Embedding service for semantic search.
95
+
96
+ IMPORTANT: All public methods are async to avoid blocking the event loop.
97
+ The sentence-transformers model is CPU-bound, so we use run_in_executor().
98
+ """
99
+ import asyncio
100
+ from typing import List
101
+
102
+ import chromadb
103
+ from sentence_transformers import SentenceTransformer
104
+
105
+
106
+ class EmbeddingService:
107
+ """Handles text embedding and vector storage.
108
+
109
+ All embedding operations run in a thread pool to avoid blocking
110
+ the async event loop. See src/tools/websearch.py for the pattern.
111
+ """
112
+
113
+ def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
114
+ self._model = SentenceTransformer(model_name)
115
+ self._client = chromadb.Client() # In-memory for hackathon
116
+ self._collection = self._client.create_collection(
117
+ name="evidence",
118
+ metadata={"hnsw:space": "cosine"}
119
+ )
120
+
121
+ # ─────────────────────────────────────────────────────────────────
122
+ # Sync internal methods (run in thread pool)
123
+ # ─────────────────────────────────────────────────────────────────
124
+
125
+ def _sync_embed(self, text: str) -> List[float]:
126
+ """Synchronous embedding - DO NOT call directly from async code."""
127
+ return self._model.encode(text).tolist()
128
+
129
+ def _sync_batch_embed(self, texts: List[str]) -> List[List[float]]:
130
+ """Batch embedding for efficiency - DO NOT call directly from async code."""
131
+ return [e.tolist() for e in self._model.encode(texts)]
132
+
133
+ # ─────────────────────────────────────────────────────────────────
134
+ # Async public methods (safe for event loop)
135
+ # ─────────────────────────────────────────────────────────────────
136
+
137
+ async def embed(self, text: str) -> List[float]:
138
+ """Embed a single text (async-safe).
139
+
140
+ Uses run_in_executor to avoid blocking the event loop.
141
+ """
142
+ loop = asyncio.get_running_loop()
143
+ return await loop.run_in_executor(None, self._sync_embed, text)
144
+
145
+ async def embed_batch(self, texts: List[str]) -> List[List[float]]:
146
+ """Batch embed multiple texts (async-safe, more efficient)."""
147
+ loop = asyncio.get_running_loop()
148
+ return await loop.run_in_executor(None, self._sync_batch_embed, texts)
149
+
150
+ async def add_evidence(self, evidence_id: str, content: str, metadata: dict) -> None:
151
+ """Add evidence to vector store (async-safe)."""
152
+ embedding = await self.embed(content)
153
+ # ChromaDB operations are fast, but wrap for consistency
154
+ loop = asyncio.get_running_loop()
155
+ await loop.run_in_executor(
156
+ None,
157
+ lambda: self._collection.add(
158
+ ids=[evidence_id],
159
+ embeddings=[embedding],
160
+ metadatas=[metadata],
161
+ documents=[content]
162
+ )
163
+ )
164
+
165
+ async def search_similar(self, query: str, n_results: int = 5) -> List[dict]:
166
+ """Find semantically similar evidence (async-safe)."""
167
+ query_embedding = await self.embed(query)
168
+
169
+ loop = asyncio.get_running_loop()
170
+ results = await loop.run_in_executor(
171
+ None,
172
+ lambda: self._collection.query(
173
+ query_embeddings=[query_embedding],
174
+ n_results=n_results
175
+ )
176
+ )
177
+
178
+ # Handle empty results gracefully
179
+ if not results["ids"] or not results["ids"][0]:
180
+ return []
181
+
182
+ return [
183
+ {"id": id, "content": doc, "metadata": meta, "distance": dist}
184
+ for id, doc, meta, dist in zip(
185
+ results["ids"][0],
186
+ results["documents"][0],
187
+ results["metadatas"][0],
188
+ results["distances"][0]
189
+ )
190
+ ]
191
+
192
+ async def deduplicate(self, new_evidence: List, threshold: float = 0.9) -> List:
193
+ """Remove semantically duplicate evidence (async-safe)."""
194
+ unique = []
195
+ for evidence in new_evidence:
196
+ similar = await self.search_similar(evidence.content, n_results=1)
197
+ if not similar or similar[0]["distance"] > (1 - threshold):
198
+ unique.append(evidence)
199
+ await self.add_evidence(
200
+ evidence_id=evidence.citation.url,
201
+ content=evidence.content,
202
+ metadata={"source": evidence.citation.source}
203
+ )
204
+ return unique
205
+ ```
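+ 
+ A note on the `deduplicate` threshold: the collection uses cosine *distance* (1 - similarity), so with `threshold=0.9` any new item within distance 0.1 of something already stored is treated as a duplicate. A minimal usage sketch of the service above (downloads the model on first run):
+ 
+ ```python
+ import asyncio
+ 
+ from src.services.embeddings import EmbeddingService
+ 
+ 
+ async def demo() -> None:
+     service = EmbeddingService()
+     await service.add_evidence(
+         evidence_id="pmid-1",
+         content="Metformin activates AMPK in hepatocytes",
+         metadata={"source": "pubmed"},
+     )
+     hits = await service.search_similar("AMPK activation by metformin", n_results=1)
+     print(hits[0]["distance"])  # small distance => near-duplicate of the stored item
+ 
+ 
+ asyncio.run(demo())
+ ```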
206
+
207
+ ### 4.3 Enhanced SearchAgent (`src/agents/search_agent.py`)
208
+
209
+ Update SearchAgent to use embeddings. **Note**: All embedding calls are `await`ed:
210
+
211
+ ```python
212
+ class SearchAgent(BaseAgent):
213
+ def __init__(
214
+ self,
215
+ search_handler: SearchHandlerProtocol,
216
+ evidence_store: dict,
217
+ embedding_service: EmbeddingService | None = None, # NEW
218
+ ):
219
+ # ... existing init ...
220
+ self._embeddings = embedding_service
221
+
222
+ async def run(self, messages, *, thread=None, **kwargs) -> AgentRunResponse:
223
+ # ... extract query ...
224
+
225
+ # Execute keyword search
226
+ result = await self._handler.execute(query, max_results_per_tool=10)
227
+
228
+ # Semantic deduplication (NEW) - ALL CALLS ARE AWAITED
229
+ if self._embeddings:
230
+ # Deduplicate by semantic similarity (async-safe)
231
+ unique_evidence = await self._embeddings.deduplicate(result.evidence)
232
+
233
+ # Also search for semantically related evidence (async-safe)
234
+ related = await self._embeddings.search_similar(query, n_results=5)
235
+
236
+ # Merge related evidence not already in results
237
+ existing_urls = {e.citation.url for e in unique_evidence}
238
+ for item in related:
239
+ if item["id"] not in existing_urls:
240
+ # Reconstruct Evidence from stored data
241
+ # ... merge logic ...
242
+
243
+ # ... rest of method ...
244
+ ```
245
+
246
+ ### 4.4 Semantic Expansion in Orchestrator
247
+
248
+ The MagenticOrchestrator can use embeddings to expand queries:
249
+
250
+ ```python
251
+ # In task instruction
252
+ task = f"""Research drug repurposing opportunities for: {query}
253
+
254
+ The system has semantic search enabled. When evidence is found:
255
+ 1. Related concepts will be automatically surfaced
256
+ 2. Duplicates are removed by meaning, not just URL
257
+ 3. Use the surfaced related concepts to refine searches
258
+ """
259
+ ```
260
+
261
+ ### 4.5 HuggingFace Spaces Deployment
262
+
263
+ > **⚠️ Important for HF Spaces**
264
+ >
265
+ > `sentence-transformers` downloads models (~500MB) to `~/.cache` on first use.
266
+ > HuggingFace Spaces have **ephemeral storage** - the cache is wiped on restart.
267
+ > This causes slow cold starts and bandwidth usage.
268
+
269
+ **Solution**: Pre-download the model in your Dockerfile:
270
+
271
+ ```dockerfile
272
+ # In Dockerfile
273
+ FROM python:3.11-slim
274
+
275
+ # Set cache directory
276
+ ENV HF_HOME=/app/.cache
277
+ ENV TRANSFORMERS_CACHE=/app/.cache
278
+
279
+ # Pre-download the embedding model during build
280
+ RUN pip install sentence-transformers && \
281
+ python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
282
+
283
+ # ... rest of Dockerfile
284
+ ```
285
+
286
+ **Alternative**: Use environment variable to specify persistent path:
287
+
288
+ ```yaml
289
+ # In HF Spaces settings or app.yaml
290
+ env:
291
+ - name: HF_HOME
292
+ value: /data/.cache # Persistent volume
293
+ ```
294
+
295
+ ---
296
+
297
+ ## 5. Directory Structure After Phase 6
298
+
299
+ ```
300
+ src/
301
+ ├── services/ # NEW
302
+ │ ├── __init__.py
303
+ │ └── embeddings.py # EmbeddingService
304
+ ├── agents/
305
+ │ ├── search_agent.py # Updated with embeddings
306
+ │ └── judge_agent.py
307
+ └── ...
308
+ ```
309
+
310
+ ---
311
+
312
+ ## 6. Tests
313
+
314
+ ### 6.1 Unit Tests (`tests/unit/services/test_embeddings.py`)
315
+
316
+ > **Note**: All tests are async since the EmbeddingService methods are async.
317
+
318
+ ```python
319
+ """Unit tests for EmbeddingService."""
320
+ import pytest
321
+ from src.services.embeddings import EmbeddingService
322
+
323
+
324
+ class TestEmbeddingService:
325
+ @pytest.mark.asyncio
326
+ async def test_embed_returns_vector(self):
327
+ """Embedding should return a float vector."""
328
+ service = EmbeddingService()
329
+ embedding = await service.embed("metformin diabetes")
330
+ assert isinstance(embedding, list)
331
+ assert len(embedding) > 0
332
+ assert all(isinstance(x, float) for x in embedding)
333
+
334
+ @pytest.mark.asyncio
335
+ async def test_similar_texts_have_close_embeddings(self):
336
+ """Semantically similar texts should have similar embeddings."""
337
+ service = EmbeddingService()
338
+ e1 = await service.embed("metformin treats diabetes")
339
+ e2 = await service.embed("metformin is used for diabetes treatment")
340
+ e3 = await service.embed("the weather is sunny today")
341
+
342
+ # Cosine similarity helper
343
+ from numpy import dot
344
+ from numpy.linalg import norm
345
+ cosine = lambda a, b: dot(a, b) / (norm(a) * norm(b))
346
+
347
+ # Similar texts should be closer
348
+ assert cosine(e1, e2) > cosine(e1, e3)
349
+
350
+ @pytest.mark.asyncio
351
+ async def test_batch_embed_efficient(self):
352
+         """Batch embedding should return one vector per input text."""
353
+ service = EmbeddingService()
354
+ texts = ["text one", "text two", "text three"]
355
+
356
+ # Batch embed
357
+ batch_results = await service.embed_batch(texts)
358
+ assert len(batch_results) == 3
359
+ assert all(isinstance(e, list) for e in batch_results)
360
+
361
+ @pytest.mark.asyncio
362
+ async def test_add_and_search(self):
363
+ """Should be able to add evidence and search for similar."""
364
+ service = EmbeddingService()
365
+ await service.add_evidence(
366
+ evidence_id="test1",
367
+ content="Metformin activates AMPK pathway",
368
+ metadata={"source": "pubmed"}
369
+ )
370
+
371
+ results = await service.search_similar("AMPK activation drugs", n_results=1)
372
+ assert len(results) == 1
373
+ assert "AMPK" in results[0]["content"]
374
+
375
+ @pytest.mark.asyncio
376
+ async def test_search_similar_empty_collection(self):
377
+ """Search on empty collection should return empty list, not error."""
378
+ service = EmbeddingService()
379
+ results = await service.search_similar("anything", n_results=5)
380
+ assert results == []
381
+ ```
382
+
383
+ ---
384
+
385
+ ## 7. Definition of Done
386
+
387
+ Phase 6 is **COMPLETE** when:
388
+
389
+ 1. `EmbeddingService` implemented with ChromaDB
390
+ 2. SearchAgent uses embeddings for deduplication
391
+ 3. Semantic search surfaces related evidence
392
+ 4. All unit tests pass
393
+ 5. Integration test shows improved recall (finds related papers)
394
+
395
+ ---
396
+
397
+ ## 8. Value Delivered
398
+
399
+ | Before (Phase 5) | After (Phase 6) |
400
+ |------------------|-----------------|
401
+ | Keyword-only search | Semantic + keyword search |
402
+ | URL-based deduplication | Meaning-based deduplication |
403
+ | Miss related papers | Surface related concepts |
404
+ | Exact match required | Fuzzy semantic matching |
405
+
406
+ **Real example improvement:**
407
+ - Query: "metformin alzheimer"
408
+ - Before: Only papers mentioning both words
409
+ - After: Also finds "AMPK neuroprotection", "biguanide cognitive", etc.
docs/implementation/07_phase_hypothesis.md ADDED
@@ -0,0 +1,630 @@
1
+ # Phase 7 Implementation Spec: Hypothesis Agent
2
+
3
+ **Goal**: Add an agent that generates scientific hypotheses to guide targeted searches.
4
+ **Philosophy**: "Don't just find evidence—understand the mechanisms."
5
+ **Prerequisite**: Phase 6 complete (Embeddings working)
6
+
7
+ ---
8
+
9
+ ## 1. Why Hypothesis Agent?
10
+
11
+ Current limitation: **Search is reactive, not hypothesis-driven.**
12
+
13
+ Current flow:
14
+ 1. User asks about "metformin alzheimer"
15
+ 2. Search finds papers
16
+ 3. Judge says "need more evidence"
17
+ 4. Search again with slightly different keywords
18
+
19
+ With Hypothesis Agent:
20
+ 1. User asks about "metformin alzheimer"
21
+ 2. Search finds initial papers
22
+ 3. **Hypothesis Agent analyzes**: "Evidence suggests metformin → AMPK activation → autophagy → amyloid clearance"
23
+ 4. Search can now target: "metformin AMPK", "autophagy neurodegeneration", "amyloid clearance drugs"
24
+
25
+ **Key insight**: Scientific research is hypothesis-driven. The agent should think like a researcher.
26
+
27
+ ---
28
+
29
+ ## 2. Architecture
30
+
31
+ ### Current (Phase 6)
32
+ ```
33
+ User Query → Magentic Manager
34
+ ├── SearchAgent → Evidence
35
+ └── JudgeAgent → Sufficient? → Synthesize/Continue
36
+ ```
37
+
38
+ ### Phase 7
39
+ ```
40
+ User Query → Magentic Manager
41
+ ├── SearchAgent → Evidence
42
+ ├── HypothesisAgent → Mechanistic Hypotheses ← NEW
43
+ └── JudgeAgent → Sufficient? → Synthesize/Continue
44
+
45
+ Uses hypotheses to guide next search
46
+ ```
47
+
48
+ ### Shared Context Enhancement
49
+ ```python
50
+ evidence_store = {
51
+ "current": [],
52
+ "embeddings": {},
53
+ "vector_index": None,
54
+ "hypotheses": [], # NEW: Generated hypotheses
55
+ "tested_hypotheses": [], # NEW: Hypotheses with supporting/contradicting evidence
56
+ }
57
+ ```
58
+
59
+ ---
60
+
61
+ ## 3. Hypothesis Model
62
+
63
+ ### 3.1 Data Model (`src/utils/models.py`)
64
+
65
+ ```python
66
+ class MechanismHypothesis(BaseModel):
67
+ """A scientific hypothesis about drug mechanism."""
68
+
69
+ drug: str = Field(description="The drug being studied")
70
+ target: str = Field(description="Molecular target (e.g., AMPK, mTOR)")
71
+ pathway: str = Field(description="Biological pathway affected")
72
+ effect: str = Field(description="Downstream effect on disease")
73
+ confidence: float = Field(ge=0, le=1, description="Confidence in hypothesis")
74
+ supporting_evidence: list[str] = Field(
75
+ default_factory=list,
76
+ description="PMIDs or URLs supporting this hypothesis"
77
+ )
78
+ contradicting_evidence: list[str] = Field(
79
+ default_factory=list,
80
+ description="PMIDs or URLs contradicting this hypothesis"
81
+ )
82
+ search_suggestions: list[str] = Field(
83
+ default_factory=list,
84
+ description="Suggested searches to test this hypothesis"
85
+ )
86
+
87
+ def to_search_queries(self) -> list[str]:
88
+ """Generate search queries to test this hypothesis."""
89
+ return [
90
+ f"{self.drug} {self.target}",
91
+ f"{self.target} {self.pathway}",
92
+ f"{self.pathway} {self.effect}",
93
+ *self.search_suggestions
94
+ ]
95
+ ```
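+ 
+ For example, a single hypothesis instance already yields the next round of targeted queries (field values are the metformin example reused in the hypothesis prompt below):
+ 
+ ```python
+ from src.utils.models import MechanismHypothesis
+ 
+ h = MechanismHypothesis(
+     drug="Metformin",
+     target="AMPK",
+     pathway="mTOR inhibition",
+     effect="Enhanced amyloid-beta clearance",
+     confidence=0.7,
+     search_suggestions=["metformin AMPK brain", "autophagy amyloid clearance"],
+ )
+ print(h.to_search_queries())
+ # ['Metformin AMPK', 'AMPK mTOR inhibition', 'mTOR inhibition Enhanced amyloid-beta clearance',
+ #  'metformin AMPK brain', 'autophagy amyloid clearance']
+ ```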
96
+
97
+ ### 3.2 Hypothesis Assessment
98
+
99
+ ```python
100
+ class HypothesisAssessment(BaseModel):
101
+ """Assessment of evidence against hypotheses."""
102
+
103
+ hypotheses: list[MechanismHypothesis]
104
+ primary_hypothesis: MechanismHypothesis | None = Field(
105
+ description="Most promising hypothesis based on current evidence"
106
+ )
107
+ knowledge_gaps: list[str] = Field(
108
+ description="What we don't know yet"
109
+ )
110
+ recommended_searches: list[str] = Field(
111
+ description="Searches to fill knowledge gaps"
112
+ )
113
+ ```
114
+
115
+ ---
116
+
117
+ ## 4. Implementation
118
+
119
+ ### 4.0 Text Utilities (`src/utils/text_utils.py`)
120
+
121
+ > **Why These Utilities?**
122
+ >
123
+ > The original spec used arbitrary truncation (`evidence[:10]` and `content[:300]`).
124
+ > This can drop important information arbitrarily. These utilities provide:
125
+ > 1. **Sentence-aware truncation** - cuts at sentence boundaries, not mid-word
126
+ > 2. **Diverse evidence selection** - uses embeddings to select varied evidence (MMR)
127
+
128
+ ```python
129
+ """Text processing utilities for evidence handling."""
130
+ from typing import TYPE_CHECKING
131
+
132
+ if TYPE_CHECKING:
133
+ from src.services.embeddings import EmbeddingService
134
+ from src.utils.models import Evidence
135
+
136
+
137
+ def truncate_at_sentence(text: str, max_chars: int = 300) -> str:
138
+ """Truncate text at sentence boundary, preserving meaning.
139
+
140
+ Args:
141
+ text: The text to truncate
142
+ max_chars: Maximum characters (default 300)
143
+
144
+ Returns:
145
+ Text truncated at last complete sentence within limit
146
+ """
147
+ if len(text) <= max_chars:
148
+ return text
149
+
150
+ # Find truncation point
151
+ truncated = text[:max_chars]
152
+
153
+ # Look for sentence endings: . ! ? followed by space or end
154
+ for sep in ['. ', '! ', '? ', '.\n', '!\n', '?\n']:
155
+ last_sep = truncated.rfind(sep)
156
+ if last_sep > max_chars // 2: # Don't truncate too aggressively
157
+ return text[:last_sep + 1].strip()
158
+
159
+ # Fallback: find last period
160
+ last_period = truncated.rfind('.')
161
+ if last_period > max_chars // 2:
162
+ return text[:last_period + 1].strip()
163
+
164
+ # Last resort: truncate at word boundary
165
+ last_space = truncated.rfind(' ')
166
+ if last_space > 0:
167
+ return text[:last_space].strip() + "..."
168
+
169
+ return truncated + "..."
170
+
171
+
172
+ async def select_diverse_evidence(
173
+ evidence: list["Evidence"],
174
+ n: int,
175
+ query: str,
176
+ embeddings: "EmbeddingService | None" = None
177
+ ) -> list["Evidence"]:
178
+ """Select n most diverse and relevant evidence items.
179
+
180
+ Uses Maximal Marginal Relevance (MMR) when embeddings available,
181
+ falls back to relevance_score sorting otherwise.
182
+
183
+ Args:
184
+ evidence: All available evidence
185
+ n: Number of items to select
186
+ query: Original query for relevance scoring
187
+ embeddings: Optional EmbeddingService for semantic diversity
188
+
189
+ Returns:
190
+ Selected evidence items, diverse and relevant
191
+ """
192
+ if not evidence:
193
+ return []
194
+
195
+ if n >= len(evidence):
196
+ return evidence
197
+
198
+ # Fallback: sort by relevance score if no embeddings
199
+ if embeddings is None:
200
+ return sorted(
201
+ evidence,
202
+ key=lambda e: e.relevance_score,
203
+ reverse=True
204
+ )[:n]
205
+
206
+ # MMR: Maximal Marginal Relevance for diverse selection
207
+ # Score = λ * relevance - (1-λ) * max_similarity_to_selected
208
+ lambda_param = 0.7 # Balance relevance vs diversity
209
+
210
+ # Get query embedding
211
+ query_emb = await embeddings.embed(query)
212
+
213
+ # Get all evidence embeddings
214
+ evidence_embs = await embeddings.embed_batch([e.content for e in evidence])
215
+
216
+ # Compute relevance scores (cosine similarity to query)
217
+ from numpy import dot
218
+ from numpy.linalg import norm
219
+ cosine = lambda a, b: float(dot(a, b) / (norm(a) * norm(b)))
220
+
221
+ relevance_scores = [cosine(query_emb, emb) for emb in evidence_embs]
222
+
223
+ # Greedy MMR selection
224
+ selected_indices: list[int] = []
225
+ remaining = set(range(len(evidence)))
226
+
227
+ for _ in range(n):
228
+ best_score = float('-inf')
229
+ best_idx = -1
230
+
231
+ for idx in remaining:
232
+ # Relevance component
233
+ relevance = relevance_scores[idx]
234
+
235
+ # Diversity component: max similarity to already selected
236
+ if selected_indices:
237
+ max_sim = max(
238
+ cosine(evidence_embs[idx], evidence_embs[sel])
239
+ for sel in selected_indices
240
+ )
241
+ else:
242
+ max_sim = 0
243
+
244
+ # MMR score
245
+ mmr_score = lambda_param * relevance - (1 - lambda_param) * max_sim
246
+
247
+ if mmr_score > best_score:
248
+ best_score = mmr_score
249
+ best_idx = idx
250
+
251
+ if best_idx >= 0:
252
+ selected_indices.append(best_idx)
253
+ remaining.remove(best_idx)
254
+
255
+ return [evidence[i] for i in selected_indices]
256
+ ```
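+ 
+ A quick check of the sentence-aware truncation (hypothetical input string):
+ 
+ ```python
+ from src.utils.text_utils import truncate_at_sentence
+ 
+ text = "Metformin activates AMPK. This inhibits mTOR signaling. Autophagy increases."
+ print(truncate_at_sentence(text, max_chars=40))
+ # "Metformin activates AMPK." - cut at a sentence boundary instead of mid-word
+ ```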
257
+
258
+ ### 4.1 Hypothesis Prompts (`src/prompts/hypothesis.py`)
259
+
260
+ ```python
261
+ """Prompts for Hypothesis Agent."""
262
+ from src.utils.text_utils import truncate_at_sentence, select_diverse_evidence
263
+
264
+ SYSTEM_PROMPT = """You are a biomedical research scientist specializing in drug repurposing.
265
+
266
+ Your role is to generate mechanistic hypotheses based on evidence.
267
+
268
+ A good hypothesis:
269
+ 1. Proposes a MECHANISM: Drug → Target → Pathway → Effect
270
+ 2. Is TESTABLE: Can be supported or refuted by literature search
271
+ 3. Is SPECIFIC: Names actual molecular targets and pathways
272
+ 4. Generates SEARCH QUERIES: Helps find more evidence
273
+
274
+ Example hypothesis format:
275
+ - Drug: Metformin
276
+ - Target: AMPK (AMP-activated protein kinase)
277
+ - Pathway: mTOR inhibition → autophagy activation
278
+ - Effect: Enhanced clearance of amyloid-beta in Alzheimer's
279
+ - Confidence: 0.7
280
+ - Search suggestions: ["metformin AMPK brain", "autophagy amyloid clearance"]
281
+
282
+ Be specific. Use actual gene/protein names when possible."""
283
+
284
+
285
+ async def format_hypothesis_prompt(
286
+ query: str,
287
+ evidence: list,
288
+ embeddings=None
289
+ ) -> str:
290
+ """Format prompt for hypothesis generation.
291
+
292
+ Uses smart evidence selection instead of arbitrary truncation.
293
+
294
+ Args:
295
+ query: The research query
296
+ evidence: All collected evidence
297
+ embeddings: Optional EmbeddingService for diverse selection
298
+ """
299
+ # Select diverse, relevant evidence (not arbitrary first 10)
300
+ selected = await select_diverse_evidence(
301
+ evidence, n=10, query=query, embeddings=embeddings
302
+ )
303
+
304
+ # Format with sentence-aware truncation
305
+ evidence_text = "\n".join([
306
+ f"- **{e.citation.title}** ({e.citation.source}): {truncate_at_sentence(e.content, 300)}"
307
+ for e in selected
308
+ ])
309
+
310
+ return f"""Based on the following evidence about "{query}", generate mechanistic hypotheses.
311
+
312
+ ## Evidence ({len(selected)} papers selected for diversity)
313
+ {evidence_text}
314
+
315
+ ## Task
316
+ 1. Identify potential drug targets mentioned in the evidence
317
+ 2. Propose mechanism hypotheses (Drug → Target → Pathway → Effect)
318
+ 3. Rate confidence based on evidence strength
319
+ 4. Suggest searches to test each hypothesis
320
+
321
+ Generate 2-4 hypotheses, prioritized by confidence."""
322
+ ```
323
+
324
+ ### 4.2 Hypothesis Agent (`src/agents/hypothesis_agent.py`)
325
+
326
+ ```python
327
+ """Hypothesis agent for mechanistic reasoning."""
328
+ from collections.abc import AsyncIterable
329
+ from typing import TYPE_CHECKING, Any
330
+
331
+ from agent_framework import (
332
+ AgentRunResponse,
333
+ AgentRunResponseUpdate,
334
+ AgentThread,
335
+ BaseAgent,
336
+ ChatMessage,
337
+ Role,
338
+ )
339
+ from pydantic_ai import Agent
340
+
341
+ from src.prompts.hypothesis import SYSTEM_PROMPT, format_hypothesis_prompt
342
+ from src.utils.config import settings
343
+ from src.utils.models import Evidence, HypothesisAssessment
344
+
345
+ if TYPE_CHECKING:
346
+ from src.services.embeddings import EmbeddingService
347
+
348
+
349
+ class HypothesisAgent(BaseAgent):
350
+ """Generates mechanistic hypotheses based on evidence."""
351
+
352
+ def __init__(
353
+ self,
354
+ evidence_store: dict[str, list[Evidence]],
355
+ embedding_service: "EmbeddingService | None" = None, # NEW: for diverse selection
356
+ ) -> None:
357
+ super().__init__(
358
+ name="HypothesisAgent",
359
+ description="Generates scientific hypotheses about drug mechanisms to guide research",
360
+ )
361
+ self._evidence_store = evidence_store
362
+ self._embeddings = embedding_service # Used for MMR evidence selection
363
+ self._agent = Agent(
364
+ model=settings.llm_provider, # Uses configured LLM
365
+ output_type=HypothesisAssessment,
366
+ system_prompt=SYSTEM_PROMPT,
367
+ )
368
+
369
+ async def run(
370
+ self,
371
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
372
+ *,
373
+ thread: AgentThread | None = None,
374
+ **kwargs: Any,
375
+ ) -> AgentRunResponse:
376
+ """Generate hypotheses based on current evidence."""
377
+ # Extract query
378
+ query = self._extract_query(messages)
379
+
380
+ # Get current evidence
381
+ evidence = self._evidence_store.get("current", [])
382
+
383
+ if not evidence:
384
+ return AgentRunResponse(
385
+ messages=[ChatMessage(
386
+ role=Role.ASSISTANT,
387
+ text="No evidence available yet. Search for evidence first."
388
+ )],
389
+ response_id="hypothesis-no-evidence",
390
+ )
391
+
392
+ # Generate hypotheses with diverse evidence selection
393
+ # NOTE: format_hypothesis_prompt is now async
394
+ prompt = await format_hypothesis_prompt(
395
+ query, evidence, embeddings=self._embeddings
396
+ )
397
+ result = await self._agent.run(prompt)
398
+ assessment = result.output
399
+
400
+ # Store hypotheses in shared context
401
+ existing = self._evidence_store.get("hypotheses", [])
402
+ self._evidence_store["hypotheses"] = existing + assessment.hypotheses
403
+
404
+ # Format response
405
+ response_text = self._format_response(assessment)
406
+
407
+ return AgentRunResponse(
408
+ messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
409
+ response_id=f"hypothesis-{len(assessment.hypotheses)}",
410
+ additional_properties={"assessment": assessment.model_dump()},
411
+ )
412
+
413
+ def _format_response(self, assessment: HypothesisAssessment) -> str:
414
+ """Format hypothesis assessment as markdown."""
415
+ lines = ["## Generated Hypotheses\n"]
416
+
417
+ for i, h in enumerate(assessment.hypotheses, 1):
418
+ lines.append(f"### Hypothesis {i} (Confidence: {h.confidence:.0%})")
419
+ lines.append(f"**Mechanism**: {h.drug} → {h.target} → {h.pathway} → {h.effect}")
420
+ lines.append(f"**Suggested searches**: {', '.join(h.search_suggestions)}\n")
421
+
422
+ if assessment.primary_hypothesis:
423
+ lines.append(f"### Primary Hypothesis")
424
+ h = assessment.primary_hypothesis
425
+ lines.append(f"{h.drug} → {h.target} → {h.pathway} → {h.effect}\n")
426
+
427
+ if assessment.knowledge_gaps:
428
+ lines.append("### Knowledge Gaps")
429
+ for gap in assessment.knowledge_gaps:
430
+ lines.append(f"- {gap}")
431
+
432
+ if assessment.recommended_searches:
433
+ lines.append("\n### Recommended Next Searches")
434
+ for search in assessment.recommended_searches:
435
+ lines.append(f"- `{search}`")
436
+
437
+ return "\n".join(lines)
438
+
439
+ def _extract_query(self, messages) -> str:
440
+ """Extract query from messages."""
441
+ if isinstance(messages, str):
442
+ return messages
443
+ elif isinstance(messages, ChatMessage):
444
+ return messages.text or ""
445
+ elif isinstance(messages, list):
446
+ for msg in reversed(messages):
447
+ if isinstance(msg, ChatMessage) and msg.role == Role.USER:
448
+ return msg.text or ""
449
+ elif isinstance(msg, str):
450
+ return msg
451
+ return ""
452
+
453
+ async def run_stream(
454
+ self,
455
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
456
+ *,
457
+ thread: AgentThread | None = None,
458
+ **kwargs: Any,
459
+ ) -> AsyncIterable[AgentRunResponseUpdate]:
460
+ """Streaming wrapper."""
461
+ result = await self.run(messages, thread=thread, **kwargs)
462
+ yield AgentRunResponseUpdate(
463
+ messages=result.messages,
464
+ response_id=result.response_id
465
+ )
466
+ ```
467
+
468
+ ### 4.3 Update MagenticOrchestrator
469
+
470
+ Add HypothesisAgent to the workflow:
471
+
472
+ ```python
473
+ # In MagenticOrchestrator.__init__
474
+ self._hypothesis_agent = HypothesisAgent(self._evidence_store)
475
+
476
+ # In workflow building
477
+ workflow = (
478
+ MagenticBuilder()
479
+ .participants(
480
+ searcher=search_agent,
481
+ hypothesizer=self._hypothesis_agent, # NEW
482
+ judge=judge_agent,
483
+ )
484
+ .with_standard_manager(...)
485
+ .build()
486
+ )
487
+
488
+ # Update task instruction
489
+ task = f"""Research drug repurposing opportunities for: {query}
490
+
491
+ Workflow:
492
+ 1. SearchAgent: Find initial evidence from PubMed and web
493
+ 2. HypothesisAgent: Generate mechanistic hypotheses (Drug → Target → Pathway → Effect)
494
+ 3. SearchAgent: Use hypothesis-suggested queries for targeted search
495
+ 4. JudgeAgent: Evaluate if evidence supports hypotheses
496
+ 5. Repeat until confident or max rounds
497
+
498
+ Focus on:
499
+ - Identifying specific molecular targets
500
+ - Understanding mechanism of action
501
+ - Finding supporting/contradicting evidence for hypotheses
502
+ """
503
+ ```
504
+
505
+ ---
506
+
507
+ ## 5. Directory Structure After Phase 7
508
+
509
+ ```
510
+ src/
511
+ ├── agents/
512
+ │ ├── search_agent.py
513
+ │ ├── judge_agent.py
514
+ │ └── hypothesis_agent.py # NEW
515
+ ├── prompts/
516
+ │ ├── judge.py
517
+ │ └── hypothesis.py # NEW
518
+ ├── services/
519
+ │ └── embeddings.py
520
+ └── utils/
521
+ └── models.py # Updated with hypothesis models
522
+ ```
523
+
524
+ ---
525
+
526
+ ## 6. Tests
527
+
528
+ ### 6.1 Unit Tests (`tests/unit/agents/test_hypothesis_agent.py`)
529
+
530
+ ```python
531
+ """Unit tests for HypothesisAgent."""
532
+ import pytest
533
+ from unittest.mock import AsyncMock, MagicMock, patch
534
+
535
+ from src.agents.hypothesis_agent import HypothesisAgent
536
+ from src.utils.models import Citation, Evidence, HypothesisAssessment, MechanismHypothesis
537
+
538
+
539
+ @pytest.fixture
540
+ def sample_evidence():
541
+ return [
542
+ Evidence(
543
+ content="Metformin activates AMPK, which inhibits mTOR signaling...",
544
+ citation=Citation(
545
+ source="pubmed",
546
+ title="Metformin and AMPK",
547
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
548
+ date="2023"
549
+ )
550
+ )
551
+ ]
552
+
553
+
554
+ @pytest.fixture
555
+ def mock_assessment():
556
+ return HypothesisAssessment(
557
+ hypotheses=[
558
+ MechanismHypothesis(
559
+ drug="Metformin",
560
+ target="AMPK",
561
+ pathway="mTOR inhibition",
562
+ effect="Reduced cancer cell proliferation",
563
+ confidence=0.75,
564
+ search_suggestions=["metformin AMPK cancer", "mTOR cancer therapy"]
565
+ )
566
+ ],
567
+ primary_hypothesis=None,
568
+ knowledge_gaps=["Clinical trial data needed"],
569
+ recommended_searches=["metformin clinical trial cancer"]
570
+ )
571
+
572
+
573
+ @pytest.mark.asyncio
574
+ async def test_hypothesis_agent_generates_hypotheses(sample_evidence, mock_assessment):
575
+ """HypothesisAgent should generate mechanistic hypotheses."""
576
+ store = {"current": sample_evidence, "hypotheses": []}
577
+
578
+ with patch("src.agents.hypothesis_agent.Agent") as MockAgent:
579
+ mock_result = MagicMock()
580
+ mock_result.output = mock_assessment
581
+ MockAgent.return_value.run = AsyncMock(return_value=mock_result)
582
+
583
+ agent = HypothesisAgent(store)
584
+ response = await agent.run("metformin cancer")
585
+
586
+ assert "AMPK" in response.messages[0].text
587
+ assert len(store["hypotheses"]) == 1
588
+
589
+
590
+ @pytest.mark.asyncio
591
+ async def test_hypothesis_agent_no_evidence():
592
+ """HypothesisAgent should handle empty evidence gracefully."""
593
+ store = {"current": [], "hypotheses": []}
594
+ agent = HypothesisAgent(store)
595
+
596
+ response = await agent.run("test query")
597
+
598
+ assert "No evidence" in response.messages[0].text
599
+ ```
600
+
601
+ ---
602
+
603
+ ## 7. Definition of Done
604
+
605
+ Phase 7 is **COMPLETE** when:
606
+
607
+ 1. `MechanismHypothesis` and `HypothesisAssessment` models implemented
608
+ 2. `HypothesisAgent` generates hypotheses from evidence
609
+ 3. Hypotheses stored in shared context
610
+ 4. Search queries generated from hypotheses
611
+ 5. Magentic workflow includes HypothesisAgent
612
+ 6. All unit tests pass
613
+
614
+ ---
615
+
616
+ ## 8. Value Delivered
617
+
618
+ | Before (Phase 6) | After (Phase 7) |
619
+ |------------------|-----------------|
620
+ | Reactive search | Hypothesis-driven search |
621
+ | Generic queries | Mechanism-targeted queries |
622
+ | No scientific reasoning | Drug → Target → Pathway → Effect |
623
+ | Judge says "need more" | Hypothesis says "search for X to test Y" |
624
+
625
+ **Real example improvement:**
626
+ - Query: "metformin alzheimer"
627
+ - Before: "metformin alzheimer mechanism", "metformin brain"
628
+ - After: "metformin AMPK activation", "AMPK autophagy neurodegeneration", "autophagy amyloid clearance"
629
+
630
+ The search becomes **scientifically targeted** rather than keyword variations.
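+
+ As a minimal illustrative sketch (not part of the spec), the "after" queries above can be assembled directly from the `search_suggestions` field of the stored hypotheses; the helper name and the store shape are assumptions based on the HypothesisAgent code above:
+
+ ```python
+ # Illustrative only: derive next-round queries from the hypotheses stored by HypothesisAgent.
+ def next_search_queries(store: dict, limit: int = 5) -> list[str]:
+     """Collect deduplicated search suggestions from the shared hypothesis store."""
+     queries: list[str] = []
+     for hypothesis in store.get("hypotheses", []):  # MechanismHypothesis instances
+         for suggestion in hypothesis.search_suggestions:
+             if suggestion not in queries:
+                 queries.append(suggestion)
+     return queries[:limit]
+ ```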
docs/implementation/08_phase_report.md ADDED
@@ -0,0 +1,854 @@
1
+ # Phase 8 Implementation Spec: Report Agent
2
+
3
+ **Goal**: Generate structured scientific reports with proper citations and methodology.
4
+ **Philosophy**: "Research isn't complete until it's communicated clearly."
5
+ **Prerequisite**: Phase 7 complete (Hypothesis Agent working)
6
+
7
+ ---
8
+
9
+ ## 1. Why Report Agent?
10
+
11
+ Current limitation: **Synthesis is basic markdown, not a scientific report.**
12
+
13
+ Current output:
14
+ ```markdown
15
+ ## Drug Repurposing Analysis
16
+ ### Drug Candidates
17
+ - Metformin
18
+ ### Key Findings
19
+ - Some findings
20
+ ### Citations
21
+ 1. [Paper 1](url)
22
+ ```
23
+
24
+ With Report Agent:
25
+ ```markdown
26
+ ## Executive Summary
27
+ One-paragraph summary for busy readers...
28
+
29
+ ## Research Question
30
+ Clear statement of what was investigated...
31
+
32
+ ## Methodology
33
+ - Sources searched: PubMed, DuckDuckGo
34
+ - Date range: ...
35
+ - Inclusion criteria: ...
36
+
37
+ ## Hypotheses Tested
38
+ 1. Metformin → AMPK → neuroprotection (Supported: 7 papers, Contradicted: 2)
39
+
40
+ ## Findings
41
+ ### Mechanistic Evidence
42
+ ...
43
+ ### Clinical Evidence
44
+ ...
45
+
46
+ ## Limitations
47
+ - Only English language papers
48
+ - Abstract-level analysis only
49
+
50
+ ## Conclusion
51
+ ...
52
+
53
+ ## References
54
+ Properly formatted citations...
55
+ ```
56
+
57
+ ---
58
+
59
+ ## 2. Architecture
60
+
61
+ ### Phase 8 Addition
62
+ ```text
63
+ Evidence + Hypotheses + Assessment
64
+
65
+ Report Agent
66
+
67
+ Structured Scientific Report
68
+ ```
69
+
70
+ ### Report Generation Flow
71
+ ```text
72
+ 1. JudgeAgent says "synthesize"
73
+ 2. Magentic Manager selects ReportAgent
74
+ 3. ReportAgent gathers:
75
+ - All evidence from shared context
76
+ - All hypotheses (supported/contradicted)
77
+ - Assessment scores
78
+ 4. ReportAgent generates structured report
79
+ 5. Final output to user
80
+ ```
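+
+ The "shared context" in step 3 is the same evidence store the other agents write to. A minimal sketch of its shape (key names mirror the ReportAgent code in section 4.2; the values are placeholders, not real data):
+
+ ```python
+ # Illustrative only: the shared evidence store as the ReportAgent reads it.
+ evidence_store = {
+     "current": [],           # list[Evidence] collected by SearchAgent
+     "hypotheses": [],        # list[MechanismHypothesis] from HypothesisAgent
+     "last_assessment": {},   # JudgeAgent scores, e.g. {"mechanism_score": 8, "clinical_score": 6}
+     "iteration_count": 0,    # completed search iterations
+     # "final_report" is written back by ReportAgent after synthesis
+ }
+ ```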
81
+
82
+ ---
83
+
84
+ ## 3. Report Model
85
+
86
+ ### 3.1 Data Model (`src/utils/models.py`)
87
+
88
+ ```python
89
+ class ReportSection(BaseModel):
90
+ """A section of the research report."""
91
+ title: str
92
+ content: str
93
+ citations: list[str] = Field(default_factory=list)
94
+
95
+
96
+ class ResearchReport(BaseModel):
97
+ """Structured scientific report."""
98
+
99
+ title: str = Field(description="Report title")
100
+ executive_summary: str = Field(
101
+ description="One-paragraph summary for quick reading",
102
+ min_length=100,
103
+ max_length=500
104
+ )
105
+ research_question: str = Field(description="Clear statement of what was investigated")
106
+
107
+ methodology: ReportSection = Field(description="How the research was conducted")
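+ # Each hypotheses_tested dict follows the shape consumed by to_markdown() below:
+ # {"mechanism": str, "supported": int, "contradicted": int}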
108
+ hypotheses_tested: list[dict] = Field(
109
+ description="Hypotheses with supporting/contradicting evidence counts"
110
+ )
111
+
112
+ mechanistic_findings: ReportSection = Field(
113
+ description="Findings about drug mechanisms"
114
+ )
115
+ clinical_findings: ReportSection = Field(
116
+ description="Findings from clinical/preclinical studies"
117
+ )
118
+
119
+ drug_candidates: list[str] = Field(description="Identified drug candidates")
120
+ limitations: list[str] = Field(description="Study limitations")
121
+ conclusion: str = Field(description="Overall conclusion")
122
+
123
+ references: list[dict] = Field(
124
+ description="Formatted references with title, authors, source, URL"
125
+ )
126
+
127
+ # Metadata
128
+ sources_searched: list[str] = Field(default_factory=list)
129
+ total_papers_reviewed: int = 0
130
+ search_iterations: int = 0
131
+ confidence_score: float = Field(ge=0, le=1)
132
+
133
+ def to_markdown(self) -> str:
134
+ """Render report as markdown."""
135
+ sections = [
136
+ f"# {self.title}\n",
137
+ f"## Executive Summary\n{self.executive_summary}\n",
138
+ f"## Research Question\n{self.research_question}\n",
139
+ f"## Methodology\n{self.methodology.content}\n",
140
+ ]
141
+
142
+ # Hypotheses
143
+ sections.append("## Hypotheses Tested\n")
144
+ for h in self.hypotheses_tested:
145
+ status = "✅ Supported" if h.get("supported", 0) > h.get("contradicted", 0) else "⚠️ Mixed"
146
+ sections.append(
147
+ f"- **{h['mechanism']}** ({status}): "
148
+ f"{h.get('supported', 0)} supporting, {h.get('contradicted', 0)} contradicting\n"
149
+ )
150
+
151
+ # Findings
152
+ sections.append(f"## Mechanistic Findings\n{self.mechanistic_findings.content}\n")
153
+ sections.append(f"## Clinical Findings\n{self.clinical_findings.content}\n")
154
+
155
+ # Drug candidates
156
+ sections.append("## Drug Candidates\n")
157
+ for drug in self.drug_candidates:
158
+ sections.append(f"- **{drug}**\n")
159
+
160
+ # Limitations
161
+ sections.append("## Limitations\n")
162
+ for lim in self.limitations:
163
+ sections.append(f"- {lim}\n")
164
+
165
+ # Conclusion
166
+ sections.append(f"## Conclusion\n{self.conclusion}\n")
167
+
168
+ # References
169
+ sections.append("## References\n")
170
+ for i, ref in enumerate(self.references, 1):
171
+ sections.append(
172
+ f"{i}. {ref.get('authors', 'Unknown')}. "
173
+ f"*{ref.get('title', 'Untitled')}*. "
174
+ f"{ref.get('source', '')} ({ref.get('date', '')}). "
175
+ f"[Link]({ref.get('url', '#')})\n"
176
+ )
177
+
178
+ # Metadata footer
179
+ sections.append("\n---\n")
180
+ sections.append(
181
+ f"*Report generated from {self.total_papers_reviewed} papers "
182
+ f"across {self.search_iterations} search iterations. "
183
+ f"Confidence: {self.confidence_score:.0%}*"
184
+ )
185
+
186
+ return "\n".join(sections)
187
+ ```
188
+
189
+ ---
190
+
191
+ ## 4. Implementation
192
+
193
+ ### 4.0 Citation Validation (`src/utils/citation_validator.py`)
194
+
195
+ > **🚨 CRITICAL: Why Citation Validation?**
196
+ >
197
+ > LLMs frequently **hallucinate** citations - inventing paper titles, authors, and URLs
198
+ > that don't exist. For a medical research tool, fake citations are **dangerous**.
199
+ >
200
+ > This validation layer ensures every reference in the report actually exists
201
+ > in the collected evidence.
202
+
203
+ ```python
204
+ """Citation validation to prevent LLM hallucination.
205
+
206
+ CRITICAL: Medical research requires accurate citations.
207
+ This module validates that all references exist in collected evidence.
208
+ """
209
+ import logging
210
+ from typing import TYPE_CHECKING
211
+
212
+ if TYPE_CHECKING:
213
+ from src.utils.models import Evidence, ResearchReport
214
+
215
+ logger = logging.getLogger(__name__)
216
+
217
+
218
+ def validate_references(
219
+ report: "ResearchReport",
220
+ evidence: list["Evidence"]
221
+ ) -> "ResearchReport":
222
+ """Ensure all references actually exist in collected evidence.
223
+
224
+ CRITICAL: Prevents LLM hallucination of citations.
225
+
226
+ Args:
227
+ report: The generated research report
228
+ evidence: All evidence collected during research
229
+
230
+ Returns:
231
+ Report with only valid references (hallucinated ones removed)
232
+ """
233
+ # Build set of valid URLs from evidence
234
+ valid_urls = {e.citation.url for e in evidence}
235
+ valid_titles = {e.citation.title.lower() for e in evidence}
236
+
237
+ validated_refs = []
238
+ removed_count = 0
239
+
240
+ for ref in report.references:
241
+ ref_url = ref.get("url", "")
242
+ ref_title = ref.get("title", "").lower()
243
+
244
+ # Check if URL matches collected evidence
245
+ if ref_url in valid_urls:
246
+ validated_refs.append(ref)
247
+ # Fallback: check title match (URLs might differ slightly)
248
+ elif ref_title and any(ref_title in t or t in ref_title for t in valid_titles):
249
+ validated_refs.append(ref)
250
+ else:
251
+ removed_count += 1
252
+ logger.warning(
253
+ f"Removed hallucinated reference: '{ref.get('title', 'Unknown')}' "
254
+ f"(URL: {ref_url[:50]}...)"
255
+ )
256
+
257
+ if removed_count > 0:
258
+ logger.info(
259
+ f"Citation validation removed {removed_count} hallucinated references. "
260
+ f"{len(validated_refs)} valid references remain."
261
+ )
262
+
263
+ # Update report with validated references
264
+ report.references = validated_refs
265
+ return report
266
+
267
+
268
+ def build_reference_from_evidence(evidence: "Evidence") -> dict:
269
+ """Build a properly formatted reference from evidence.
270
+
271
+ Use this to ensure references match the original evidence exactly.
272
+ """
273
+ return {
274
+ "title": evidence.citation.title,
275
+ "authors": evidence.citation.authors or ["Unknown"],
276
+ "source": evidence.citation.source,
277
+ "date": evidence.citation.date or "n.d.",
278
+ "url": evidence.citation.url,
279
+ }
280
+ ```
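+
+ A short usage sketch (illustrative; the authoritative call site is `ReportAgent.run` in section 4.2). The `finalize_references` helper name and the rebuild-from-evidence fallback are assumptions, not part of the spec:
+
+ ```python
+ # Illustrative helper: validate LLM-produced references, and optionally rebuild
+ # the list straight from collected evidence if nothing survives validation.
+ from src.utils.citation_validator import build_reference_from_evidence, validate_references
+ from src.utils.models import Evidence, ResearchReport
+
+
+ def finalize_references(report: ResearchReport, evidence: list[Evidence]) -> ResearchReport:
+     report = validate_references(report, evidence)
+     if not report.references:
+         # Stricter fallback: trust only the collected evidence, not the LLM.
+         report.references = [build_reference_from_evidence(e) for e in evidence]
+     return report
+ ```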
281
+
282
+ ### 4.1 Report Prompts (`src/prompts/report.py`)
283
+
284
+ ```python
285
+ """Prompts for Report Agent."""
286
+ from src.utils.text_utils import truncate_at_sentence, select_diverse_evidence
287
+
288
+ SYSTEM_PROMPT = """You are a scientific writer specializing in drug repurposing research reports.
289
+
290
+ Your role is to synthesize evidence and hypotheses into a clear, structured report.
291
+
292
+ A good report:
293
+ 1. Has a clear EXECUTIVE SUMMARY (one paragraph, key takeaways)
294
+ 2. States the RESEARCH QUESTION clearly
295
+ 3. Describes METHODOLOGY (what was searched, how)
296
+ 4. Evaluates HYPOTHESES with evidence counts
297
+ 5. Separates MECHANISTIC and CLINICAL findings
298
+ 6. Lists specific DRUG CANDIDATES
299
+ 7. Acknowledges LIMITATIONS honestly
300
+ 8. Provides a balanced CONCLUSION
301
+ 9. Includes properly formatted REFERENCES
302
+
303
+ Write in scientific but accessible language. Be specific about evidence strength.
304
+
305
+ ─────────────────────────────────────────────────────────────────────────────
306
+ 🚨 CRITICAL CITATION REQUIREMENTS 🚨
307
+ ─────────────────────────────────────────────────────────────────────────────
308
+
309
+ You MUST follow these rules for the References section:
310
+
311
+ 1. You may ONLY cite papers that appear in the Evidence section above
312
+ 2. Every reference URL must EXACTLY match a provided evidence URL
313
+ 3. Do NOT invent, fabricate, or hallucinate any references
314
+ 4. Do NOT modify paper titles, authors, dates, or URLs
315
+ 5. If unsure about a citation, OMIT it rather than guess
316
+ 6. Copy URLs exactly as provided - do not create similar-looking URLs
317
+
318
+ VIOLATION OF THESE RULES PRODUCES DANGEROUS MISINFORMATION.
319
+ ─────────────────────────────────────────────────────────────────────────────"""
320
+
321
+
322
+ async def format_report_prompt(
323
+ query: str,
324
+ evidence: list,
325
+ hypotheses: list,
326
+ assessment: dict,
327
+ metadata: dict,
328
+ embeddings=None
329
+ ) -> str:
330
+ """Format prompt for report generation.
331
+
332
+ Includes full evidence details for accurate citation.
333
+ """
334
+ # Select diverse evidence (not arbitrary truncation)
335
+ selected = await select_diverse_evidence(
336
+ evidence, n=20, query=query, embeddings=embeddings
337
+ )
338
+
339
+ # Include FULL citation details for each evidence item
340
+ # This helps the LLM create accurate references
341
+ evidence_summary = "\n".join([
342
+ f"- **Title**: {e.citation.title}\n"
343
+ f" **URL**: {e.citation.url}\n"
344
+ f" **Authors**: {', '.join(e.citation.authors or ['Unknown'])}\n"
345
+ f" **Date**: {e.citation.date or 'n.d.'}\n"
346
+ f" **Source**: {e.citation.source}\n"
347
+ f" **Content**: {truncate_at_sentence(e.content, 200)}\n"
348
+ for e in selected
349
+ ])
350
+
351
+ hypotheses_summary = "\n".join([
352
+ f"- {h.drug} → {h.target} → {h.pathway} → {h.effect} (Confidence: {h.confidence:.0%})"
353
+ for h in hypotheses
354
+ ]) if hypotheses else "No hypotheses generated yet."
355
+
356
+ return f"""Generate a structured research report for the following query.
357
+
358
+ ## Original Query
359
+ {query}
360
+
361
+ ## Evidence Collected ({len(selected)} papers, selected for diversity)
362
+
363
+ {evidence_summary}
364
+
365
+ ## Hypotheses Generated
366
+ {hypotheses_summary}
367
+
368
+ ## Assessment Scores
369
+ - Mechanism Score: {assessment.get('mechanism_score', 'N/A')}/10
370
+ - Clinical Evidence Score: {assessment.get('clinical_score', 'N/A')}/10
371
+ - Overall Confidence: {assessment.get('confidence', 0):.0%}
372
+
373
+ ## Metadata
374
+ - Sources Searched: {', '.join(metadata.get('sources', []))}
375
+ - Search Iterations: {metadata.get('iterations', 0)}
376
+
377
+ Generate a complete ResearchReport with all sections filled in.
378
+
379
+ REMINDER: Only cite papers from the Evidence section above. Copy URLs exactly."""
380
+ ```
381
+
382
+ ### 4.2 Report Agent (`src/agents/report_agent.py`)
383
+
384
+ ```python
385
+ """Report agent for generating structured research reports."""
386
+ from collections.abc import AsyncIterable
387
+ from typing import TYPE_CHECKING, Any
388
+
389
+ from agent_framework import (
390
+ AgentRunResponse,
391
+ AgentRunResponseUpdate,
392
+ AgentThread,
393
+ BaseAgent,
394
+ ChatMessage,
395
+ Role,
396
+ )
397
+ from pydantic_ai import Agent
398
+
399
+ from src.prompts.report import SYSTEM_PROMPT, format_report_prompt
400
+ from src.utils.citation_validator import validate_references # CRITICAL
401
+ from src.utils.config import settings
402
+ from src.utils.models import Evidence, MechanismHypothesis, ResearchReport
403
+
404
+ if TYPE_CHECKING:
405
+ from src.services.embeddings import EmbeddingService
406
+
407
+
408
+ class ReportAgent(BaseAgent):
409
+ """Generates structured scientific reports from evidence and hypotheses."""
410
+
411
+ def __init__(
412
+ self,
413
+ evidence_store: dict[str, list[Evidence]],
414
+ embedding_service: "EmbeddingService | None" = None, # For diverse selection
415
+ ) -> None:
416
+ super().__init__(
417
+ name="ReportAgent",
418
+ description="Generates structured scientific research reports with citations",
419
+ )
420
+ self._evidence_store = evidence_store
421
+ self._embeddings = embedding_service
422
+ self._agent = Agent(
423
+ model=settings.llm_provider,
424
+ output_type=ResearchReport,
425
+ system_prompt=SYSTEM_PROMPT,
426
+ )
427
+
428
+ async def run(
429
+ self,
430
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
431
+ *,
432
+ thread: AgentThread | None = None,
433
+ **kwargs: Any,
434
+ ) -> AgentRunResponse:
435
+ """Generate research report."""
436
+ query = self._extract_query(messages)
437
+
438
+ # Gather all context
439
+ evidence = self._evidence_store.get("current", [])
440
+ hypotheses = self._evidence_store.get("hypotheses", [])
441
+ assessment = self._evidence_store.get("last_assessment", {})
442
+
443
+ if not evidence:
444
+ return AgentRunResponse(
445
+ messages=[ChatMessage(
446
+ role=Role.ASSISTANT,
447
+ text="Cannot generate report: No evidence collected."
448
+ )],
449
+ response_id="report-no-evidence",
450
+ )
451
+
452
+ # Build metadata
453
+ metadata = {
454
+ "sources": list(set(e.citation.source for e in evidence)),
455
+ "iterations": self._evidence_store.get("iteration_count", 0),
456
+ }
457
+
458
+ # Generate report (format_report_prompt is now async)
459
+ prompt = await format_report_prompt(
460
+ query=query,
461
+ evidence=evidence,
462
+ hypotheses=hypotheses,
463
+ assessment=assessment,
464
+ metadata=metadata,
465
+ embeddings=self._embeddings,
466
+ )
467
+
468
+ result = await self._agent.run(prompt)
469
+ report = result.output
470
+
471
+ # ═══════════════════════════════════════════════════════════════════
472
+ # 🚨 CRITICAL: Validate citations to prevent hallucination
473
+ # ═══════════════════════════════════════════════════════════════════
474
+ report = validate_references(report, evidence)
475
+
476
+ # Store validated report
477
+ self._evidence_store["final_report"] = report
478
+
479
+ # Return markdown version
480
+ return AgentRunResponse(
481
+ messages=[ChatMessage(role=Role.ASSISTANT, text=report.to_markdown())],
482
+ response_id="report-complete",
483
+ additional_properties={"report": report.model_dump()},
484
+ )
485
+
486
+ def _extract_query(self, messages) -> str:
487
+ """Extract query from messages."""
488
+ if isinstance(messages, str):
489
+ return messages
490
+ elif isinstance(messages, ChatMessage):
491
+ return messages.text or ""
492
+ elif isinstance(messages, list):
493
+ for msg in reversed(messages):
494
+ if isinstance(msg, ChatMessage) and msg.role == Role.USER:
495
+ return msg.text or ""
496
+ elif isinstance(msg, str):
497
+ return msg
498
+ return ""
499
+
500
+ async def run_stream(
501
+ self,
502
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
503
+ *,
504
+ thread: AgentThread | None = None,
505
+ **kwargs: Any,
506
+ ) -> AsyncIterable[AgentRunResponseUpdate]:
507
+ """Streaming wrapper."""
508
+ result = await self.run(messages, thread=thread, **kwargs)
509
+ yield AgentRunResponseUpdate(
510
+ messages=result.messages,
511
+ response_id=result.response_id
512
+ )
513
+ ```
514
+
515
+ ### 4.3 Update MagenticOrchestrator
516
+
517
+ Add ReportAgent as the final synthesis step:
518
+
519
+ ```python
520
+ # In MagenticOrchestrator.__init__
521
+ self._report_agent = ReportAgent(self._evidence_store)
522
+
523
+ # In workflow building
524
+ workflow = (
525
+ MagenticBuilder()
526
+ .participants(
527
+ searcher=search_agent,
528
+ hypothesizer=hypothesis_agent,
529
+ judge=judge_agent,
530
+ reporter=self._report_agent, # NEW
531
+ )
532
+ .with_standard_manager(...)
533
+ .build()
534
+ )
535
+
536
+ # Update task instruction
537
+ task = f"""Research drug repurposing opportunities for: {query}
538
+
539
+ Workflow:
540
+ 1. SearchAgent: Find evidence from PubMed and web
541
+ 2. HypothesisAgent: Generate mechanistic hypotheses
542
+ 3. SearchAgent: Targeted search based on hypotheses
543
+ 4. JudgeAgent: Evaluate evidence sufficiency
544
+ 5. If sufficient → ReportAgent: Generate structured research report
545
+ 6. If not sufficient → Repeat from step 1 with refined queries
546
+
547
+ The final output should be a complete research report with:
548
+ - Executive summary
549
+ - Methodology
550
+ - Hypotheses tested
551
+ - Mechanistic and clinical findings
552
+ - Drug candidates
553
+ - Limitations
554
+ - Conclusion with references
555
+ """
556
+ ```
557
+
558
+ ---
559
+
560
+ ## 5. Directory Structure After Phase 8
561
+
562
+ ```
563
+ src/
564
+ ├── agents/
565
+ │ ├── search_agent.py
566
+ │ ├── judge_agent.py
567
+ │ ├── hypothesis_agent.py
568
+ │ └── report_agent.py # NEW
569
+ ├── prompts/
570
+ │ ├── judge.py
571
+ │ ├── hypothesis.py
572
+ │ └── report.py # NEW
573
+ ├── services/
574
+ │ └── embeddings.py
575
+ └── utils/
576
+ └── models.py # Updated with report models
577
+ ```
578
+
579
+ ---
580
+
581
+ ## 6. Tests
582
+
583
+ ### 6.1 Unit Tests (`tests/unit/agents/test_report_agent.py`)
584
+
585
+ ```python
586
+ """Unit tests for ReportAgent."""
587
+ import pytest
588
+ from unittest.mock import AsyncMock, MagicMock, patch
589
+
590
+ from src.agents.report_agent import ReportAgent
591
+ from src.utils.models import (
592
+ Citation, Evidence, MechanismHypothesis,
593
+ ResearchReport, ReportSection
594
+ )
595
+
596
+
597
+ @pytest.fixture
598
+ def sample_evidence():
599
+ return [
600
+ Evidence(
601
+ content="Metformin activates AMPK...",
602
+ citation=Citation(
603
+ source="pubmed",
604
+ title="Metformin mechanisms",
605
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
606
+ date="2023",
607
+ authors=["Smith J", "Jones A"]
608
+ )
609
+ )
610
+ ]
611
+
612
+
613
+ @pytest.fixture
614
+ def sample_hypotheses():
615
+ return [
616
+ MechanismHypothesis(
617
+ drug="Metformin",
618
+ target="AMPK",
619
+ pathway="mTOR inhibition",
620
+ effect="Neuroprotection",
621
+ confidence=0.8,
622
+ search_suggestions=[]
623
+ )
624
+ ]
625
+
626
+
627
+ @pytest.fixture
628
+ def mock_report():
629
+ return ResearchReport(
630
+ title="Drug Repurposing Analysis: Metformin for Alzheimer's",
631
+ executive_summary="This report analyzes metformin as a potential...",
632
+ research_question="Can metformin be repurposed for Alzheimer's disease?",
633
+ methodology=ReportSection(
634
+ title="Methodology",
635
+ content="Searched PubMed and web sources..."
636
+ ),
637
+ hypotheses_tested=[
638
+ {"mechanism": "Metformin → AMPK → neuroprotection", "supported": 5, "contradicted": 1}
639
+ ],
640
+ mechanistic_findings=ReportSection(
641
+ title="Mechanistic Findings",
642
+ content="Evidence suggests AMPK activation..."
643
+ ),
644
+ clinical_findings=ReportSection(
645
+ title="Clinical Findings",
646
+ content="Limited clinical data available..."
647
+ ),
648
+ drug_candidates=["Metformin"],
649
+ limitations=["Abstract-level analysis only"],
650
+ conclusion="Metformin shows promise...",
651
+ references=[],
652
+ sources_searched=["pubmed", "web"],
653
+ total_papers_reviewed=10,
654
+ search_iterations=3,
655
+ confidence_score=0.75
656
+ )
657
+
658
+
659
+ @pytest.mark.asyncio
660
+ async def test_report_agent_generates_report(
661
+ sample_evidence, sample_hypotheses, mock_report
662
+ ):
663
+ """ReportAgent should generate structured report."""
664
+ store = {
665
+ "current": sample_evidence,
666
+ "hypotheses": sample_hypotheses,
667
+ "last_assessment": {"mechanism_score": 8, "clinical_score": 6}
668
+ }
669
+
670
+ with patch("src.agents.report_agent.Agent") as MockAgent:
671
+ mock_result = MagicMock()
672
+ mock_result.output = mock_report
673
+ MockAgent.return_value.run = AsyncMock(return_value=mock_result)
674
+
675
+ agent = ReportAgent(store)
676
+ response = await agent.run("metformin alzheimer")
677
+
678
+ assert "Executive Summary" in response.messages[0].text
679
+ assert "Methodology" in response.messages[0].text
680
+ assert "References" in response.messages[0].text
681
+
682
+
683
+ @pytest.mark.asyncio
684
+ async def test_report_agent_no_evidence():
685
+ """ReportAgent should handle empty evidence gracefully."""
686
+ store = {"current": [], "hypotheses": []}
687
+ agent = ReportAgent(store)
688
+
689
+ response = await agent.run("test query")
690
+
691
+ assert "Cannot generate report" in response.messages[0].text
692
+
693
+
694
+ # ═══════════════════════════════════════════════════════════════════════════
695
+ # 🚨 CRITICAL: Citation Validation Tests
696
+ # ═══════════════════════════════════════════════════════════════════════════
697
+
698
+ @pytest.mark.asyncio
699
+ async def test_report_agent_removes_hallucinated_citations(sample_evidence):
700
+ """ReportAgent should remove citations not in evidence."""
701
+ from src.utils.citation_validator import validate_references
702
+
703
+ # Create report with mix of valid and hallucinated references
704
+ report_with_hallucinations = ResearchReport(
705
+ title="Test Report",
706
+ executive_summary="This is a test report for citation validation...",
707
+ research_question="Testing citation validation",
708
+ methodology=ReportSection(title="Methodology", content="Test"),
709
+ hypotheses_tested=[],
710
+ mechanistic_findings=ReportSection(title="Mechanistic", content="Test"),
711
+ clinical_findings=ReportSection(title="Clinical", content="Test"),
712
+ drug_candidates=["TestDrug"],
713
+ limitations=["Test limitation"],
714
+ conclusion="Test conclusion",
715
+ references=[
716
+ # Valid reference (matches sample_evidence)
717
+ {
718
+ "title": "Metformin mechanisms",
719
+ "url": "https://pubmed.ncbi.nlm.nih.gov/12345/",
720
+ "authors": ["Smith J", "Jones A"],
721
+ "date": "2023",
722
+ "source": "pubmed"
723
+ },
724
+ # HALLUCINATED reference (URL doesn't exist in evidence)
725
+ {
726
+ "title": "Fake Paper That Doesn't Exist",
727
+ "url": "https://fake-journal.com/made-up-paper",
728
+ "authors": ["Hallucinated A"],
729
+ "date": "2024",
730
+ "source": "fake"
731
+ },
732
+ # Another HALLUCINATED reference
733
+ {
734
+ "title": "Invented Research",
735
+ "url": "https://pubmed.ncbi.nlm.nih.gov/99999999/",
736
+ "authors": ["NotReal B"],
737
+ "date": "2025",
738
+ "source": "pubmed"
739
+ }
740
+ ],
741
+ sources_searched=["pubmed"],
742
+ total_papers_reviewed=1,
743
+ search_iterations=1,
744
+ confidence_score=0.5
745
+ )
746
+
747
+ # Validate - should remove hallucinated references
748
+ validated_report = validate_references(report_with_hallucinations, sample_evidence)
749
+
750
+ # Only the valid reference should remain
751
+ assert len(validated_report.references) == 1
752
+ assert validated_report.references[0]["title"] == "Metformin mechanisms"
753
+ assert "Fake Paper" not in str(validated_report.references)
754
+
755
+
756
+ def test_citation_validator_handles_empty_references():
757
+ """Citation validator should handle reports with no references."""
758
+ from src.utils.citation_validator import validate_references
759
+
760
+ report = ResearchReport(
761
+ title="Empty Refs Report",
762
+ executive_summary="This report has no references...",
763
+ research_question="Testing empty refs",
764
+ methodology=ReportSection(title="Methodology", content="Test"),
765
+ hypotheses_tested=[],
766
+ mechanistic_findings=ReportSection(title="Mechanistic", content="Test"),
767
+ clinical_findings=ReportSection(title="Clinical", content="Test"),
768
+ drug_candidates=[],
769
+ limitations=[],
770
+ conclusion="Test",
771
+ references=[], # Empty!
772
+ sources_searched=[],
773
+ total_papers_reviewed=0,
774
+ search_iterations=0,
775
+ confidence_score=0.0
776
+ )
777
+
778
+ validated = validate_references(report, [])
779
+ assert validated.references == []
780
+ ```
781
+
782
+ ---
783
+
784
+ ## 7. Definition of Done
785
+
786
+ Phase 8 is **COMPLETE** when:
787
+
788
+ 1. `ResearchReport` model implemented with all sections
789
+ 2. `ReportAgent` generates structured reports
790
+ 3. Reports include proper citations and methodology
791
+ 4. Magentic workflow uses ReportAgent for final synthesis
792
+ 5. Report renders as clean markdown
793
+ 6. All unit tests pass
794
+
795
+ ---
796
+
797
+ ## 8. Value Delivered
798
+
799
+ | Before (Phase 7) | After (Phase 8) |
800
+ |------------------|-----------------|
801
+ | Basic synthesis | Structured scientific report |
802
+ | Simple bullet points | Executive summary + methodology |
803
+ | List of citations | Formatted references |
804
+ | No methodology | Clear research process |
805
+ | No limitations | Honest limitations section |
806
+
807
+ **Sample output comparison:**
808
+
809
+ Before:
810
+ ```
811
+ ## Analysis
812
+ - Metformin might help
813
+ - Found 5 papers
814
+ [Link 1] [Link 2]
815
+ ```
816
+
817
+ After:
818
+ ```
819
+ # Drug Repurposing Analysis: Metformin for Alzheimer's Disease
820
+
821
+ ## Executive Summary
822
+ Analysis of 15 papers suggests metformin may provide neuroprotection
823
+ through AMPK activation. Mechanistic evidence is strong (8/10),
824
+ while clinical evidence is moderate (6/10)...
825
+
826
+ ## Methodology
827
+ Systematic search of PubMed and web sources using queries...
828
+
829
+ ## Hypotheses Tested
830
+ - ✅ Metformin → AMPK → neuroprotection (7 supporting, 2 contradicting)
831
+
832
+ ## References
833
+ 1. Smith J, Jones A. *Metformin mechanisms*. Nature (2023). [Link](...)
834
+ ```
835
+
836
+ ---
837
+
838
+ ## 9. Complete Magentic Architecture (Phases 5-8)
839
+
840
+ ```
841
+ User Query
842
+
843
+ Gradio UI
844
+
845
+ Magentic Manager (LLM Coordinator)
846
+ ├── SearchAgent ←→ PubMed + Web + VectorDB
847
+ ├── HypothesisAgent ←→ Mechanistic Reasoning
848
+ ├── JudgeAgent ←→ Evidence Assessment
849
+ └── ReportAgent ←→ Final Synthesis
850
+
851
+ Structured Research Report
852
+ ```
853
+
854
+ **This matches Mario's diagram** with the practical agents that add real value for drug repurposing research.
docs/implementation/09_phase_source_cleanup.md ADDED
@@ -0,0 +1,257 @@
1
+ # Phase 9 Implementation Spec: Remove DuckDuckGo
2
+
3
+ **Goal**: Remove unreliable web search, focus on credible scientific sources.
4
+ **Philosophy**: "Scientific credibility over source quantity."
5
+ **Prerequisite**: Phase 8 complete (all agents working)
6
+ **Estimated Time**: 30-45 minutes
7
+
8
+ ---
9
+
10
+ ## 1. Why Remove DuckDuckGo?
11
+
12
+ ### Current Problems
13
+
14
+ | Issue | Impact |
15
+ |-------|--------|
16
+ | Rate-limited aggressively | Returns 0 results frequently |
17
+ | Not peer-reviewed | Random blogs, news, misinformation |
18
+ | Not citable | Cannot use in scientific reports |
19
+ | Adds noise | Dilutes quality evidence |
20
+
21
+ ### After Removal
22
+
23
+ | Benefit | Impact |
24
+ |---------|--------|
25
+ | Cleaner codebase | -150 lines of dead code |
26
+ | No rate limit failures | 100% source reliability |
27
+ | Scientific credibility | All sources peer-reviewed/preprint |
28
+ | Simpler debugging | Fewer failure modes |
29
+
30
+ ---
31
+
32
+ ## 2. Files to Modify/Delete
33
+
34
+ ### 2.1 DELETE: `src/tools/websearch.py`
35
+
36
+ ```bash
37
+ # File to delete entirely
38
+ src/tools/websearch.py # ~80 lines
39
+ ```
40
+
41
+ ### 2.2 MODIFY: SearchHandler Usage
42
+
43
+ Update all files that instantiate `SearchHandler` with `WebTool()`:
44
+
45
+ | File | Change |
46
+ |------|--------|
47
+ | `examples/search_demo/run_search.py` | Remove `WebTool()` from tools list |
48
+ | `examples/hypothesis_demo/run_hypothesis.py` | Remove `WebTool()` from tools list |
49
+ | `examples/full_stack_demo/run_full.py` | Remove `WebTool()` from tools list |
50
+ | `examples/orchestrator_demo/run_agent.py` | Remove `WebTool()` from tools list |
51
+ | `examples/orchestrator_demo/run_magentic.py` | Remove `WebTool()` from tools list |
52
+
53
+ ### 2.3 MODIFY: Type Definitions
54
+
55
+ Update `src/utils/models.py`:
56
+
57
+ ```python
58
+ # BEFORE
59
+ sources_searched: list[Literal["pubmed", "web"]]
60
+
61
+ # AFTER (Phase 9)
62
+ sources_searched: list[Literal["pubmed"]]
63
+
64
+ # AFTER (Phase 10-11)
65
+ sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
66
+ ```
67
+
68
+ ### 2.4 DELETE: Tests for WebTool
69
+
70
+ ```bash
71
+ # File to delete
72
+ tests/unit/tools/test_websearch.py
73
+ ```
74
+
75
+ ---
76
+
77
+ ## 3. TDD Implementation
78
+
79
+ ### 3.1 Test: SearchHandler Works Without WebTool
80
+
81
+ ```python
82
+ # tests/unit/tools/test_search_handler.py
83
+
84
+ @pytest.mark.asyncio
85
+ async def test_search_handler_pubmed_only():
86
+ """SearchHandler should work with only PubMed tool."""
87
+ from src.tools.pubmed import PubMedTool
88
+ from src.tools.search_handler import SearchHandler
89
+
90
+ handler = SearchHandler(tools=[PubMedTool()], timeout=30.0)
91
+
92
+ # Should not raise
93
+ result = await handler.execute("metformin diabetes", max_results_per_tool=3)
94
+
95
+ assert result.sources_searched == ["pubmed"]
96
+ assert "web" not in result.sources_searched
97
+ assert len(result.errors) == 0 # No failures
98
+ ```
99
+
100
+ ### 3.2 Test: WebTool Import Fails (Deleted)
101
+
102
+ ```python
103
+ # tests/unit/tools/test_websearch_removed.py
104
+ import pytest
+
105
+ def test_websearch_module_deleted():
106
+ """WebTool should no longer exist."""
107
+ with pytest.raises(ImportError):
108
+ from src.tools.websearch import WebTool
109
+ ```
110
+
111
+ ### 3.3 Test: Examples Don't Reference WebTool
112
+
113
+ ```python
114
+ # tests/unit/test_no_webtool_references.py
115
+
116
+ import ast
117
+ import pathlib
+
+ import pytest
118
+
119
+ def test_examples_no_webtool_imports():
120
+ """No example files should import WebTool."""
121
+ examples_dir = pathlib.Path("examples")
122
+
123
+ for py_file in examples_dir.rglob("*.py"):
124
+ content = py_file.read_text()
125
+ tree = ast.parse(content)
126
+
127
+ for node in ast.walk(tree):
128
+ if isinstance(node, ast.ImportFrom):
129
+ if node.module and "websearch" in node.module:
130
+ pytest.fail(f"{py_file} imports websearch (should be removed)")
131
+ if isinstance(node, ast.Import):
132
+ for alias in node.names:
133
+ if "websearch" in alias.name:
134
+ pytest.fail(f"{py_file} imports websearch (should be removed)")
135
+ ```
136
+
137
+ ---
138
+
139
+ ## 4. Step-by-Step Implementation
140
+
141
+ ### Step 1: Write Tests First (TDD)
142
+
143
+ ```bash
144
+ # Create the test file
145
+ touch tests/unit/tools/test_websearch_removed.py
146
+ # Write the tests from section 3
147
+ ```
148
+
149
+ ### Step 2: Run Tests (Should Fail)
150
+
151
+ ```bash
152
+ uv run pytest tests/unit/tools/test_websearch_removed.py -v
153
+ # Expected: FAIL (websearch still exists)
154
+ ```
155
+
156
+ ### Step 3: Delete WebTool
157
+
158
+ ```bash
159
+ rm src/tools/websearch.py
160
+ rm tests/unit/tools/test_websearch.py
161
+ ```
162
+
163
+ ### Step 4: Update SearchHandler Usages
164
+
165
+ ```python
166
+ # BEFORE (in each example file)
167
+ from src.tools.websearch import WebTool
168
+ search_handler = SearchHandler(tools=[PubMedTool(), WebTool()], timeout=30.0)
169
+
170
+ # AFTER
171
+ from src.tools.pubmed import PubMedTool
172
+ search_handler = SearchHandler(tools=[PubMedTool()], timeout=30.0)
173
+ ```
174
+
175
+ ### Step 5: Update Type Definitions
176
+
177
+ ```python
178
+ # src/utils/models.py
179
+ # BEFORE
180
+ sources_searched: list[Literal["pubmed", "web"]]
181
+
182
+ # AFTER
183
+ sources_searched: list[Literal["pubmed"]]
184
+ ```
185
+
186
+ ### Step 6: Run All Tests
187
+
188
+ ```bash
189
+ uv run pytest tests/unit/ -v
190
+ # Expected: ALL PASS
191
+ ```
192
+
193
+ ### Step 7: Run Lints
194
+
195
+ ```bash
196
+ uv run ruff check src tests examples
197
+ uv run mypy src
198
+ # Expected: No errors
199
+ ```
200
+
201
+ ---
202
+
203
+ ## 5. Definition of Done
204
+
205
+ Phase 9 is **COMPLETE** when:
206
+
207
+ - [ ] `src/tools/websearch.py` deleted
208
+ - [ ] `tests/unit/tools/test_websearch.py` deleted
209
+ - [ ] All example files updated (no WebTool imports)
210
+ - [ ] Type definitions updated in models.py
211
+ - [ ] New tests verify WebTool is removed
212
+ - [ ] All existing tests pass
213
+ - [ ] Lints pass
214
+ - [ ] Examples run successfully with PubMed only
215
+
216
+ ---
217
+
218
+ ## 6. Verification Commands
219
+
220
+ ```bash
221
+ # 1. Verify websearch.py is gone
222
+ ls src/tools/websearch.py 2>&1 | grep "No such file"
223
+
224
+ # 2. Verify no WebTool imports remain
225
+ grep -r "WebTool" src/ examples/ && echo "FAIL: WebTool references found" || echo "PASS"
226
+ grep -r "websearch" src/ examples/ && echo "FAIL: websearch references found" || echo "PASS"
227
+
228
+ # 3. Run tests
229
+ uv run pytest tests/unit/ -v
230
+
231
+ # 4. Run example (should work)
232
+ source .env && uv run python examples/search_demo/run_search.py "metformin cancer"
233
+ ```
234
+
235
+ ---
236
+
237
+ ## 7. Rollback Plan
238
+
239
+ If something breaks:
240
+
241
+ ```bash
242
+ git checkout HEAD -- src/tools/websearch.py
243
+ git checkout HEAD -- tests/unit/tools/test_websearch.py
244
+ ```
245
+
246
+ ---
247
+
248
+ ## 8. Value Delivered
249
+
250
+ | Before | After |
251
+ |--------|-------|
252
+ | 2 search sources (1 broken) | 1 reliable source |
253
+ | Rate limit failures | No failures |
254
+ | Web noise in results | Pure scientific sources |
255
+ | ~230 lines for websearch | 0 lines |
256
+
257
+ **Net effect**: Simpler, more reliable, more credible.
docs/implementation/10_phase_clinicaltrials.md ADDED
@@ -0,0 +1,437 @@
1
+ # Phase 10 Implementation Spec: ClinicalTrials.gov Integration
2
+
3
+ **Goal**: Add clinical trial search for drug repurposing evidence.
4
+ **Philosophy**: "Clinical trials are the bridge from hypothesis to therapy."
5
+ **Prerequisite**: Phase 9 complete (DuckDuckGo removed)
6
+ **Estimated Time**: 2-3 hours
7
+
8
+ ---
9
+
10
+ ## 1. Why ClinicalTrials.gov?
11
+
12
+ ### Scientific Value
13
+
14
+ | Feature | Value for Drug Repurposing |
15
+ |---------|---------------------------|
16
+ | **400,000+ studies** | Massive evidence base |
17
+ | **Trial phase data** | Phase I/II/III = evidence strength |
18
+ | **Intervention details** | Exact drug + dosing |
19
+ | **Outcome measures** | What was measured |
20
+ | **Status tracking** | Completed vs recruiting |
21
+ | **Free API** | No cost, no key required |
22
+
23
+ ### Example Query Response
24
+
25
+ Query: "metformin Alzheimer's"
26
+
27
+ ```json
28
+ {
29
+ "studies": [
30
+ {
31
+ "nctId": "NCT04098666",
32
+ "briefTitle": "Metformin in Alzheimer's Dementia Prevention",
33
+ "phase": "Phase 2",
34
+ "status": "Recruiting",
35
+ "conditions": ["Alzheimer Disease"],
36
+ "interventions": ["Drug: Metformin"]
37
+ }
38
+ ]
39
+ }
40
+ ```
41
+
42
+ **This is GOLD for drug repurposing** - actual trials testing the hypothesis!
43
+
44
+ ---
45
+
46
+ ## 2. API Specification
47
+
48
+ ### Endpoint
49
+
50
+ ```
51
+ Base URL: https://clinicaltrials.gov/api/v2/studies
52
+ ```
53
+
54
+ ### Key Parameters
55
+
56
+ | Parameter | Description | Example |
57
+ |-----------|-------------|---------|
58
+ | `query.cond` | Condition/disease | `Alzheimer` |
59
+ | `query.intr` | Intervention/drug | `Metformin` |
60
+ | `query.term` | General search | `metformin alzheimer` |
61
+ | `pageSize` | Results per page | `20` |
62
+ | `fields` | Fields to return | See below |
63
+
64
+ ### Fields We Need
65
+
66
+ ```
67
+ NCTId, BriefTitle, Phase, OverallStatus, Condition,
68
+ InterventionName, StartDate, CompletionDate, BriefSummary
69
+ ```
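+
+ Putting the endpoint, parameters, and fields together, a minimal request sketch (using `requests`, which section 4 adopts for WAF reasons; the query string and `pageSize` are example values only):
+
+ ```python
+ # Illustrative raw call; the real tool wraps this in asyncio.to_thread (see section 4.1).
+ import requests
+
+ fields = [
+     "NCTId", "BriefTitle", "Phase", "OverallStatus", "Condition",
+     "InterventionName", "StartDate", "CompletionDate", "BriefSummary",
+ ]
+ params = {
+     "query.term": "metformin alzheimer",  # general search
+     "pageSize": 20,                       # max 100 per call
+     "fields": "|".join(fields),           # same separator the tool in 4.1 uses
+ }
+ response = requests.get("https://clinicaltrials.gov/api/v2/studies", params=params, timeout=30)
+ studies = response.json().get("studies", [])
+ print(f"{len(studies)} studies returned")
+ ```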
70
+
71
+ ### Rate Limits
72
+
73
+ - ~50 requests/minute per IP
74
+ - No authentication required
75
+ - Paginated (100 results max per call)
76
+
77
+ ### Documentation
78
+
79
+ - [API v2 Docs](https://clinicaltrials.gov/data-api/api)
80
+ - [Migration Guide](https://www.nlm.nih.gov/pubs/techbull/ma24/ma24_clinicaltrials_api.html)
81
+
82
+ ---
83
+
84
+ ## 3. Data Model
85
+
86
+ ### 3.1 Update Citation Source Type (`src/utils/models.py`)
87
+
88
+ ```python
89
+ # BEFORE
90
+ source: Literal["pubmed", "web"]
91
+
92
+ # AFTER
93
+ source: Literal["pubmed", "clinicaltrials", "biorxiv"]
94
+ ```
95
+
96
+ ### 3.2 Evidence from Clinical Trials
97
+
98
+ Clinical trial data maps to our existing `Evidence` model:
99
+
100
+ ```python
101
+ Evidence(
102
+ content=f"{brief_summary}. Phase: {phase}. Status: {status}.",
103
+ citation=Citation(
104
+ source="clinicaltrials",
105
+ title=brief_title,
106
+ url=f"https://clinicaltrials.gov/study/{nct_id}",
107
+ date=start_date or "Unknown",
108
+ authors=[] # Trials don't have authors in the same way
109
+ ),
110
+ relevance=0.8 # Trials are highly relevant for repurposing
111
+ )
112
+ ```
113
+
114
+ ---
115
+
116
+ ## 4. Implementation
117
+
118
+ ### 4.0 Important: HTTP Client Selection
119
+
120
+ **ClinicalTrials.gov's WAF blocks `httpx`'s TLS fingerprint.** Use `requests` instead.
121
+
122
+ | Library | Status | Notes |
123
+ |---------|--------|-------|
124
+ | `httpx` | ❌ 403 Blocked | TLS/JA3 fingerprint flagged |
125
+ | `httpx[http2]` | ❌ 403 Blocked | HTTP/2 doesn't help |
126
+ | `requests` | ✅ Works | Industry standard, not blocked |
127
+ | `urllib` | ✅ Works | Stdlib alternative |
128
+
129
+ We use `requests` wrapped in `asyncio.to_thread()` for async compatibility.
130
+
131
+ ### 4.1 ClinicalTrials Tool (`src/tools/clinicaltrials.py`)
132
+
133
+ ```python
134
+ """ClinicalTrials.gov search tool using API v2."""
135
+
136
+ import asyncio
137
+ from typing import Any, ClassVar
138
+
139
+ import requests
140
+ from tenacity import retry, stop_after_attempt, wait_exponential
141
+
142
+ from src.utils.exceptions import SearchError
143
+ from src.utils.models import Citation, Evidence
144
+
145
+
146
+ class ClinicalTrialsTool:
147
+ """Search tool for ClinicalTrials.gov.
148
+
149
+ Note: Uses `requests` library instead of `httpx` because ClinicalTrials.gov's
150
+ WAF blocks httpx's TLS fingerprint. The `requests` library is not blocked.
151
+ """
152
+
153
+ BASE_URL = "https://clinicaltrials.gov/api/v2/studies"
154
+ FIELDS: ClassVar[list[str]] = [
155
+ "NCTId",
156
+ "BriefTitle",
157
+ "Phase",
158
+ "OverallStatus",
159
+ "Condition",
160
+ "InterventionName",
161
+ "StartDate",
162
+ "BriefSummary",
163
+ ]
164
+
165
+ @property
166
+ def name(self) -> str:
167
+ return "clinicaltrials"
168
+
169
+ @retry(
170
+ stop=stop_after_attempt(3),
171
+ wait=wait_exponential(multiplier=1, min=1, max=10),
172
+ reraise=True,
173
+ )
174
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
175
+ """Search ClinicalTrials.gov for studies."""
176
+ params = {
177
+ "query.term": query,
178
+ "pageSize": min(max_results, 100),
179
+ "fields": "|".join(self.FIELDS),
180
+ }
181
+
182
+ try:
183
+ # Run blocking requests.get in a separate thread for async compatibility
184
+ response = await asyncio.to_thread(
185
+ requests.get,
186
+ self.BASE_URL,
187
+ params=params,
188
+ headers={"User-Agent": "DeepCritical-Research-Agent/1.0"},
189
+ timeout=30,
190
+ )
191
+ response.raise_for_status()
192
+
193
+ data = response.json()
194
+ studies = data.get("studies", [])
195
+ return [self._study_to_evidence(study) for study in studies[:max_results]]
196
+
197
+ except requests.HTTPError as e:
198
+ raise SearchError(f"ClinicalTrials.gov API error: {e}") from e
199
+ except requests.RequestException as e:
200
+ raise SearchError(f"ClinicalTrials.gov request failed: {e}") from e
201
+
202
+ def _study_to_evidence(self, study: dict) -> Evidence:
203
+ """Convert a clinical trial study to Evidence."""
204
+ # Navigate nested structure
205
+ protocol = study.get("protocolSection", {})
206
+ id_module = protocol.get("identificationModule", {})
207
+ status_module = protocol.get("statusModule", {})
208
+ desc_module = protocol.get("descriptionModule", {})
209
+ design_module = protocol.get("designModule", {})
210
+ conditions_module = protocol.get("conditionsModule", {})
211
+ arms_module = protocol.get("armsInterventionsModule", {})
212
+
213
+ nct_id = id_module.get("nctId", "Unknown")
214
+ title = id_module.get("briefTitle", "Untitled Study")
215
+ status = status_module.get("overallStatus", "Unknown")
216
+ start_date = status_module.get("startDateStruct", {}).get("date", "Unknown")
217
+
218
+ # Get phase (might be a list)
219
+ phases = design_module.get("phases", [])
220
+ phase = phases[0] if phases else "Not Applicable"
221
+
222
+ # Get conditions
223
+ conditions = conditions_module.get("conditions", [])
224
+ conditions_str = ", ".join(conditions[:3]) if conditions else "Unknown"
225
+
226
+ # Get interventions
227
+ interventions = arms_module.get("interventions", [])
228
+ intervention_names = [i.get("name", "") for i in interventions[:3]]
229
+ interventions_str = ", ".join(intervention_names) if intervention_names else "Unknown"
230
+
231
+ # Get summary
232
+ summary = desc_module.get("briefSummary", "No summary available.")
233
+
234
+ # Build content with key trial info
235
+ content = (
236
+ f"{summary[:500]}... "
237
+ f"Trial Phase: {phase}. "
238
+ f"Status: {status}. "
239
+ f"Conditions: {conditions_str}. "
240
+ f"Interventions: {interventions_str}."
241
+ )
242
+
243
+ return Evidence(
244
+ content=content[:2000],
245
+ citation=Citation(
246
+ source="clinicaltrials",
247
+ title=title[:500],
248
+ url=f"https://clinicaltrials.gov/study/{nct_id}",
249
+ date=start_date,
250
+ authors=[], # Trials don't have traditional authors
251
+ ),
252
+ relevance=0.85, # Trials are highly relevant for repurposing
253
+ )
254
+ ```
255
+
256
+ ---
257
+
258
+ ## 5. TDD Test Suite
259
+
260
+ ### 5.1 Unit Tests (`tests/unit/tools/test_clinicaltrials.py`)
261
+
262
+ Uses `unittest.mock.patch` to mock `requests.get` (not `respx` since we're not using `httpx`).
263
+
264
+ ```python
265
+ """Unit tests for ClinicalTrials.gov tool."""
266
+
267
+ from unittest.mock import MagicMock, patch
268
+
269
+ import pytest
270
+ import requests
271
+
272
+ from src.tools.clinicaltrials import ClinicalTrialsTool
273
+ from src.utils.exceptions import SearchError
274
+ from src.utils.models import Evidence
275
+
276
+
277
+ @pytest.fixture
278
+ def mock_clinicaltrials_response() -> dict:
279
+ """Mock ClinicalTrials.gov API response."""
280
+ return {
281
+ "studies": [
282
+ {
283
+ "protocolSection": {
284
+ "identificationModule": {
285
+ "nctId": "NCT04098666",
286
+ "briefTitle": "Metformin in Alzheimer's Dementia Prevention",
287
+ },
288
+ "statusModule": {
289
+ "overallStatus": "Recruiting",
290
+ "startDateStruct": {"date": "2020-01-15"},
291
+ },
292
+ "descriptionModule": {
293
+ "briefSummary": "This study evaluates metformin for Alzheimer's prevention."
294
+ },
295
+ "designModule": {"phases": ["PHASE2"]},
296
+ "conditionsModule": {"conditions": ["Alzheimer Disease", "Dementia"]},
297
+ "armsInterventionsModule": {
298
+ "interventions": [{"name": "Metformin", "type": "Drug"}]
299
+ },
300
+ }
301
+ }
302
+ ]
303
+ }
304
+
305
+
306
+ class TestClinicalTrialsTool:
307
+ """Tests for ClinicalTrialsTool."""
308
+
309
+ def test_tool_name(self) -> None:
310
+ """Tool should have correct name."""
311
+ tool = ClinicalTrialsTool()
312
+ assert tool.name == "clinicaltrials"
313
+
314
+ @pytest.mark.asyncio
315
+ async def test_search_returns_evidence(
316
+ self, mock_clinicaltrials_response: dict
317
+ ) -> None:
318
+ """Search should return Evidence objects."""
319
+ with patch("src.tools.clinicaltrials.requests.get") as mock_get:
320
+ mock_response = MagicMock()
321
+ mock_response.json.return_value = mock_clinicaltrials_response
322
+ mock_response.raise_for_status = MagicMock()
323
+ mock_get.return_value = mock_response
324
+
325
+ tool = ClinicalTrialsTool()
326
+ results = await tool.search("metformin alzheimer", max_results=5)
327
+
328
+ assert len(results) == 1
329
+ assert isinstance(results[0], Evidence)
330
+ assert results[0].citation.source == "clinicaltrials"
331
+ assert "NCT04098666" in results[0].citation.url
332
+ assert "Metformin" in results[0].citation.title
333
+
334
+ @pytest.mark.asyncio
335
+ async def test_search_api_error(self) -> None:
336
+ """Search should raise SearchError on API failure."""
337
+ with patch("src.tools.clinicaltrials.requests.get") as mock_get:
338
+ mock_response = MagicMock()
339
+ mock_response.raise_for_status.side_effect = requests.HTTPError(
340
+ "500 Server Error"
341
+ )
342
+ mock_get.return_value = mock_response
343
+
344
+ tool = ClinicalTrialsTool()
345
+
346
+ with pytest.raises(SearchError):
347
+ await tool.search("metformin alzheimer")
348
+
349
+
350
+ class TestClinicalTrialsIntegration:
351
+ """Integration tests (marked for separate run)."""
352
+
353
+ @pytest.mark.integration
354
+ @pytest.mark.asyncio
355
+ async def test_real_api_call(self) -> None:
356
+ """Test actual API call (requires network)."""
357
+ tool = ClinicalTrialsTool()
358
+ results = await tool.search("metformin diabetes", max_results=3)
359
+
360
+ assert len(results) > 0
361
+ assert all(isinstance(r, Evidence) for r in results)
362
+ assert all(r.citation.source == "clinicaltrials" for r in results)
363
+ ```
364
+
365
+ ---
366
+
367
+ ## 6. Integration with SearchHandler
368
+
369
+ ### 6.1 Update Example Files
370
+
371
+ ```python
372
+ # examples/search_demo/run_search.py
373
+ from src.tools.clinicaltrials import ClinicalTrialsTool
374
+ from src.tools.pubmed import PubMedTool
375
+ from src.tools.search_handler import SearchHandler
376
+
377
+ search_handler = SearchHandler(
378
+ tools=[PubMedTool(), ClinicalTrialsTool()],
379
+ timeout=30.0
380
+ )
381
+ ```
382
+
383
+ ### 6.2 Update SearchResult Type
384
+
385
+ ```python
386
+ # src/utils/models.py
387
+ sources_searched: list[Literal["pubmed", "clinicaltrials"]]
388
+ ```
389
+
390
+ ---
391
+
392
+ ## 7. Definition of Done
393
+
394
+ Phase 10 is **COMPLETE** when:
395
+
396
+ - [ ] `src/tools/clinicaltrials.py` implemented
397
+ - [ ] Unit tests in `tests/unit/tools/test_clinicaltrials.py`
398
+ - [ ] Integration test marked with `@pytest.mark.integration`
399
+ - [ ] SearchHandler updated to include ClinicalTrialsTool
400
+ - [ ] Type definitions updated in models.py
401
+ - [ ] Example files updated
402
+ - [ ] All unit tests pass
403
+ - [ ] Lints pass
404
+ - [ ] Manual verification with real API
405
+
406
+ ---
407
+
408
+ ## 8. Verification Commands
409
+
410
+ ```bash
411
+ # 1. Run unit tests
412
+ uv run pytest tests/unit/tools/test_clinicaltrials.py -v
413
+
414
+ # 2. Run integration test (requires network)
415
+ uv run pytest tests/unit/tools/test_clinicaltrials.py -v -m integration
416
+
417
+ # 3. Run full test suite
418
+ uv run pytest tests/unit/ -v
419
+
420
+ # 4. Run example
421
+ source .env && uv run python examples/search_demo/run_search.py "metformin alzheimer"
422
+ # Should show results from BOTH PubMed AND ClinicalTrials.gov
423
+ ```
424
+
425
+ ---
426
+
427
+ ## 9. Value Delivered
428
+
429
+ | Before | After |
430
+ |--------|-------|
431
+ | Papers only | Papers + Clinical Trials |
432
+ | "Drug X might help" | "Drug X is in Phase II trial" |
433
+ | No trial status | Recruiting/Completed/Terminated |
434
+ | No phase info | Phase I/II/III evidence strength |
435
+
436
+ **Demo pitch addition**:
437
+ > "DeepCritical searches PubMed for peer-reviewed evidence AND ClinicalTrials.gov for 400,000+ clinical trials."