VibecoderMcSwaggins committed on
Commit
3fcd8e7
·
1 Parent(s): 388cd05

feat(docs): update implementation roadmap and add specs for Phases 9-11


- Updated the implementation roadmap to reflect the completion of Phases 1-8.
- Added detailed specifications for Phase 9: Remove DuckDuckGo, Phase 10: ClinicalTrials.gov Integration, and Phase 11: bioRxiv Preprint Integration.
- Enhanced the status section to indicate the completion of Phases 1-8 and readiness for Phases 9-11.

docs/implementation/09_phase_source_cleanup.md ADDED
@@ -0,0 +1,257 @@
1
+ # Phase 9 Implementation Spec: Remove DuckDuckGo
2
+
3
+ **Goal**: Remove unreliable web search, focus on credible scientific sources.
4
+ **Philosophy**: "Scientific credibility over source quantity."
5
+ **Prerequisite**: Phase 8 complete (all agents working)
6
+ **Estimated Time**: 30-45 minutes
7
+
8
+ ---
9
+
10
+ ## 1. Why Remove DuckDuckGo?
11
+
12
+ ### Current Problems
13
+
14
+ | Issue | Impact |
15
+ |-------|--------|
16
+ | Rate-limited aggressively | Returns 0 results frequently |
17
+ | Not peer-reviewed | Random blogs, news, misinformation |
18
+ | Not citable | Cannot use in scientific reports |
19
+ | Adds noise | Dilutes quality evidence |
20
+
21
+ ### After Removal
22
+
23
+ | Benefit | Impact |
24
+ |---------|--------|
25
+ | Cleaner codebase | -150 lines of dead code |
26
+ | No rate limit failures | 100% source reliability |
27
+ | Scientific credibility | All sources peer-reviewed/preprint |
28
+ | Simpler debugging | Fewer failure modes |
29
+
30
+ ---
31
+
32
+ ## 2. Files to Modify/Delete
33
+
34
+ ### 2.1 DELETE: `src/tools/websearch.py`
35
+
36
+ ```bash
37
+ # File to delete entirely
38
+ src/tools/websearch.py # ~80 lines
39
+ ```
40
+
41
+ ### 2.2 MODIFY: SearchHandler Usage
42
+
43
+ Update all files that instantiate `SearchHandler` with `WebTool()`:
44
+
45
+ | File | Change |
46
+ |------|--------|
47
+ | `examples/search_demo/run_search.py` | Remove `WebTool()` from tools list |
48
+ | `examples/hypothesis_demo/run_hypothesis.py` | Remove `WebTool()` from tools list |
49
+ | `examples/full_stack_demo/run_full.py` | Remove `WebTool()` from tools list |
50
+ | `examples/orchestrator_demo/run_agent.py` | Remove `WebTool()` from tools list |
51
+ | `examples/orchestrator_demo/run_magentic.py` | Remove `WebTool()` from tools list |
52
+
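+ To enumerate the call sites before editing, a quick search helps (the same check reappears in the verification commands in section 6):
+
+ ```bash
+ # List every remaining WebTool / websearch reference
+ grep -rnE "WebTool|websearch" src/ examples/ tests/
+ ```
+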
53
+ ### 2.3 MODIFY: Type Definitions
54
+
55
+ Update `src/utils/models.py`:
56
+
57
+ ```python
58
+ # BEFORE
59
+ sources_searched: list[Literal["pubmed", "web"]]
60
+
61
+ # AFTER (Phase 9)
62
+ sources_searched: list[Literal["pubmed"]]
63
+
64
+ # AFTER (Phase 10-11)
65
+ sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
66
+ ```
67
+
68
+ ### 2.4 DELETE: Tests for WebTool
69
+
70
+ ```bash
71
+ # File to delete
72
+ tests/unit/tools/test_websearch.py
73
+ ```
74
+
75
+ ---
76
+
77
+ ## 3. TDD Implementation
78
+
79
+ ### 3.1 Test: SearchHandler Works Without WebTool
80
+
81
+ ```python
82
+ # tests/unit/tools/test_search_handler.py
83
+
+ import pytest
+
84
+ @pytest.mark.asyncio
85
+ async def test_search_handler_pubmed_only():
86
+ """SearchHandler should work with only PubMed tool."""
87
+ from src.tools.pubmed import PubMedTool
88
+ from src.tools.search_handler import SearchHandler
89
+
90
+ handler = SearchHandler(tools=[PubMedTool()], timeout=30.0)
91
+
92
+ # Should not raise
93
+ result = await handler.execute("metformin diabetes", max_results_per_tool=3)
94
+
95
+ assert result.sources_searched == ["pubmed"]
96
+ assert "web" not in result.sources_searched
97
+ assert len(result.errors) == 0 # No failures
98
+ ```
99
+
100
+ ### 3.2 Test: WebTool Import Fails (Deleted)
101
+
102
+ ```python
103
+ # tests/unit/tools/test_websearch_removed.py
104
+
+ import pytest
+
105
+ def test_websearch_module_deleted():
106
+ """WebTool should no longer exist."""
107
+ with pytest.raises(ImportError):
108
+ from src.tools.websearch import WebTool
109
+ ```
110
+
111
+ ### 3.3 Test: Examples Don't Reference WebTool
112
+
113
+ ```python
114
+ # tests/unit/test_no_webtool_references.py
115
+
116
+ import ast
117
+ import pathlib
+
+ import pytest
118
+
119
+ def test_examples_no_webtool_imports():
120
+ """No example files should import WebTool."""
121
+ examples_dir = pathlib.Path("examples")
122
+
123
+ for py_file in examples_dir.rglob("*.py"):
124
+ content = py_file.read_text()
125
+ tree = ast.parse(content)
126
+
127
+ for node in ast.walk(tree):
128
+ if isinstance(node, ast.ImportFrom):
129
+ if node.module and "websearch" in node.module:
130
+ pytest.fail(f"{py_file} imports websearch (should be removed)")
131
+ if isinstance(node, ast.Import):
132
+ for alias in node.names:
133
+ if "websearch" in alias.name:
134
+ pytest.fail(f"{py_file} imports websearch (should be removed)")
135
+ ```
136
+
137
+ ---
138
+
139
+ ## 4. Step-by-Step Implementation
140
+
141
+ ### Step 1: Write Tests First (TDD)
142
+
143
+ ```bash
144
+ # Create the test file
145
+ touch tests/unit/tools/test_websearch_removed.py
146
+ # Write the tests from section 3
147
+ ```
148
+
149
+ ### Step 2: Run Tests (Should Fail)
150
+
151
+ ```bash
152
+ uv run pytest tests/unit/tools/test_websearch_removed.py -v
153
+ # Expected: FAIL (websearch still exists)
154
+ ```
155
+
156
+ ### Step 3: Delete WebTool
157
+
158
+ ```bash
159
+ rm src/tools/websearch.py
160
+ rm tests/unit/tools/test_websearch.py
161
+ ```
162
+
163
+ ### Step 4: Update SearchHandler Usages
164
+
165
+ ```python
166
+ # BEFORE (in each example file)
167
+ from src.tools.websearch import WebTool
168
+ search_handler = SearchHandler(tools=[PubMedTool(), WebTool()], timeout=30.0)
169
+
170
+ # AFTER
171
+ from src.tools.pubmed import PubMedTool
172
+ search_handler = SearchHandler(tools=[PubMedTool()], timeout=30.0)
173
+ ```
174
+
175
+ ### Step 5: Update Type Definitions
176
+
177
+ ```python
178
+ # src/utils/models.py
179
+ # BEFORE
180
+ sources_searched: list[Literal["pubmed", "web"]]
181
+
182
+ # AFTER
183
+ sources_searched: list[Literal["pubmed"]]
184
+ ```
185
+
186
+ ### Step 6: Run All Tests
187
+
188
+ ```bash
189
+ uv run pytest tests/unit/ -v
190
+ # Expected: ALL PASS
191
+ ```
192
+
193
+ ### Step 7: Run Lints
194
+
195
+ ```bash
196
+ uv run ruff check src tests examples
197
+ uv run mypy src
198
+ # Expected: No errors
199
+ ```
200
+
201
+ ---
202
+
203
+ ## 5. Definition of Done
204
+
205
+ Phase 9 is **COMPLETE** when:
206
+
207
+ - [ ] `src/tools/websearch.py` deleted
208
+ - [ ] `tests/unit/tools/test_websearch.py` deleted
209
+ - [ ] All example files updated (no WebTool imports)
210
+ - [ ] Type definitions updated in models.py
211
+ - [ ] New tests verify WebTool is removed
212
+ - [ ] All existing tests pass
213
+ - [ ] Lints pass
214
+ - [ ] Examples run successfully with PubMed only
215
+
216
+ ---
217
+
218
+ ## 6. Verification Commands
219
+
220
+ ```bash
221
+ # 1. Verify websearch.py is gone
222
+ ls src/tools/websearch.py 2>&1 | grep "No such file"
223
+
224
+ # 2. Verify no WebTool imports remain
225
+ grep -r "WebTool" src/ examples/ && echo "FAIL: WebTool references found" || echo "PASS"
226
+ grep -r "websearch" src/ examples/ && echo "FAIL: websearch references found" || echo "PASS"
227
+
228
+ # 3. Run tests
229
+ uv run pytest tests/unit/ -v
230
+
231
+ # 4. Run example (should work)
232
+ source .env && uv run python examples/search_demo/run_search.py "metformin cancer"
233
+ ```
234
+
235
+ ---
236
+
237
+ ## 7. Rollback Plan
238
+
239
+ If something breaks:
240
+
241
+ ```bash
242
+ git checkout HEAD -- src/tools/websearch.py
243
+ git checkout HEAD -- tests/unit/tools/test_websearch.py
244
+ ```
245
+
246
+ ---
247
+
248
+ ## 8. Value Delivered
249
+
250
+ | Before | After |
251
+ |--------|-------|
252
+ | 2 search sources (1 broken) | 1 reliable source |
253
+ | Rate limit failures | No failures |
254
+ | Web noise in results | Pure scientific sources |
255
+ | ~230 lines for websearch | 0 lines |
256
+
257
+ **Net effect**: Simpler, more reliable, more credible.
docs/implementation/10_phase_clinicaltrials.md ADDED
@@ -0,0 +1,456 @@
1
+ # Phase 10 Implementation Spec: ClinicalTrials.gov Integration
2
+
3
+ **Goal**: Add clinical trial search for drug repurposing evidence.
4
+ **Philosophy**: "Clinical trials are the bridge from hypothesis to therapy."
5
+ **Prerequisite**: Phase 9 complete (DuckDuckGo removed)
6
+ **Estimated Time**: 2-3 hours
7
+
8
+ ---
9
+
10
+ ## 1. Why ClinicalTrials.gov?
11
+
12
+ ### Scientific Value
13
+
14
+ | Feature | Value for Drug Repurposing |
15
+ |---------|---------------------------|
16
+ | **400,000+ studies** | Massive evidence base |
17
+ | **Trial phase data** | Phase I/II/III = evidence strength |
18
+ | **Intervention details** | Exact drug + dosing |
19
+ | **Outcome measures** | What was measured |
20
+ | **Status tracking** | Completed vs recruiting |
21
+ | **Free API** | No cost, no key required |
22
+
23
+ ### Example Query Response
24
+
25
+ Query: "metformin Alzheimer's"
26
+
27
+ ```json
28
+ {
29
+ "studies": [
30
+ {
31
+ "nctId": "NCT04098666",
32
+ "briefTitle": "Metformin in Alzheimer's Dementia Prevention",
33
+ "phase": "Phase 2",
34
+ "status": "Recruiting",
35
+ "conditions": ["Alzheimer Disease"],
36
+ "interventions": ["Drug: Metformin"]
37
+ }
38
+ ]
39
+ }
40
+ ```
41
+
42
+ **This is GOLD for drug repurposing** - actual trials testing the hypothesis!
43
+
44
+ ---
45
+
46
+ ## 2. API Specification
47
+
48
+ ### Endpoint
49
+
50
+ ```
51
+ Base URL: https://clinicaltrials.gov/api/v2/studies
52
+ ```
53
+
54
+ ### Key Parameters
55
+
56
+ | Parameter | Description | Example |
57
+ |-----------|-------------|---------|
58
+ | `query.cond` | Condition/disease | `Alzheimer` |
59
+ | `query.intr` | Intervention/drug | `Metformin` |
60
+ | `query.term` | General search | `metformin alzheimer` |
61
+ | `pageSize` | Results per page | `20` |
62
+ | `fields` | Fields to return | See below |
63
+
64
+ ### Fields We Need
65
+
66
+ ```
67
+ NCTId, BriefTitle, Phase, OverallStatus, Condition,
68
+ InterventionName, StartDate, CompletionDate, BriefSummary
69
+ ```
70
+
71
+ ### Rate Limits
72
+
73
+ - ~50 requests/minute per IP
74
+ - No authentication required
75
+ - Paginated (100 results max per call)
76
+
77
+ ### Documentation
78
+
79
+ - [API v2 Docs](https://clinicaltrials.gov/data-api/api)
80
+ - [Migration Guide](https://www.nlm.nih.gov/pubs/techbull/ma24/ma24_clinicaltrials_api.html)
81
+
82
+ ---
83
+
84
+ ## 3. Data Model
85
+
86
+ ### 3.1 Update Citation Source Type (`src/utils/models.py`)
87
+
88
+ ```python
89
+ # BEFORE
90
+ source: Literal["pubmed", "web"]
91
+
92
+ # AFTER
93
+ source: Literal["pubmed", "clinicaltrials", "biorxiv"]
94
+ ```
95
+
96
+ ### 3.2 Evidence from Clinical Trials
97
+
98
+ Clinical trial data maps to our existing `Evidence` model:
99
+
100
+ ```python
101
+ Evidence(
102
+ content=f"{brief_summary}. Phase: {phase}. Status: {status}.",
103
+ citation=Citation(
104
+ source="clinicaltrials",
105
+ title=brief_title,
106
+ url=f"https://clinicaltrials.gov/study/{nct_id}",
107
+ date=start_date or "Unknown",
108
+ authors=[] # Trials don't have authors in the same way
109
+ ),
110
+ relevance=0.8 # Trials are highly relevant for repurposing
111
+ )
112
+ ```
113
+
114
+ ---
115
+
116
+ ## 4. Implementation
117
+
118
+ ### 4.1 ClinicalTrials Tool (`src/tools/clinicaltrials.py`)
119
+
120
+ ```python
121
+ """ClinicalTrials.gov search tool using API v2."""
122
+
123
+ import httpx
124
+ from tenacity import retry, stop_after_attempt, wait_exponential
125
+
126
+ from src.utils.exceptions import SearchError
127
+ from src.utils.models import Citation, Evidence
128
+
129
+
130
+ class ClinicalTrialsTool:
131
+ """Search tool for ClinicalTrials.gov."""
132
+
133
+ BASE_URL = "https://clinicaltrials.gov/api/v2/studies"
134
+ FIELDS = [
135
+ "NCTId",
136
+ "BriefTitle",
137
+ "Phase",
138
+ "OverallStatus",
139
+ "Condition",
140
+ "InterventionName",
141
+ "StartDate",
142
+ "BriefSummary",
143
+ ]
144
+
145
+ @property
146
+ def name(self) -> str:
147
+ return "clinicaltrials"
148
+
149
+ @retry(
150
+ stop=stop_after_attempt(3),
151
+ wait=wait_exponential(multiplier=1, min=1, max=10),
152
+ reraise=True,
153
+ )
154
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
155
+ """
156
+ Search ClinicalTrials.gov for studies.
157
+
158
+ Args:
159
+ query: Search query (e.g., "metformin alzheimer")
160
+ max_results: Maximum results to return
161
+
162
+ Returns:
163
+ List of Evidence objects from clinical trials
164
+ """
165
+ params = {
166
+ "query.term": query,
167
+ "pageSize": min(max_results, 100),
168
+ "fields": "|".join(self.FIELDS),
169
+ }
170
+
171
+ async with httpx.AsyncClient(timeout=30.0) as client:
172
+ try:
173
+ response = await client.get(self.BASE_URL, params=params)
174
+ response.raise_for_status()
175
+ except httpx.HTTPStatusError as e:
176
+ raise SearchError(f"ClinicalTrials.gov search failed: {e}") from e
177
+
178
+ data = response.json()
179
+ studies = data.get("studies", [])
180
+
181
+ return [self._study_to_evidence(study) for study in studies[:max_results]]
182
+
183
+ def _study_to_evidence(self, study: dict) -> Evidence:
184
+ """Convert a clinical trial study to Evidence."""
185
+ # Navigate nested structure
186
+ protocol = study.get("protocolSection", {})
187
+ id_module = protocol.get("identificationModule", {})
188
+ status_module = protocol.get("statusModule", {})
189
+ desc_module = protocol.get("descriptionModule", {})
190
+ design_module = protocol.get("designModule", {})
191
+ conditions_module = protocol.get("conditionsModule", {})
192
+ arms_module = protocol.get("armsInterventionsModule", {})
193
+
194
+ nct_id = id_module.get("nctId", "Unknown")
195
+ title = id_module.get("briefTitle", "Untitled Study")
196
+ status = status_module.get("overallStatus", "Unknown")
197
+ start_date = status_module.get("startDateStruct", {}).get("date", "Unknown")
198
+
199
+ # Get phase (might be a list)
200
+ phases = design_module.get("phases", [])
201
+ phase = phases[0] if phases else "Not Applicable"
202
+
203
+ # Get conditions
204
+ conditions = conditions_module.get("conditions", [])
205
+ conditions_str = ", ".join(conditions[:3]) if conditions else "Unknown"
206
+
207
+ # Get interventions
208
+ interventions = arms_module.get("interventions", [])
209
+ intervention_names = [i.get("name", "") for i in interventions[:3]]
210
+ interventions_str = ", ".join(intervention_names) if intervention_names else "Unknown"
211
+
212
+ # Get summary
213
+ summary = desc_module.get("briefSummary", "No summary available.")
214
+
215
+ # Build content with key trial info
216
+ content = (
217
+ f"{summary[:500]}... "
218
+ f"Trial Phase: {phase}. "
219
+ f"Status: {status}. "
220
+ f"Conditions: {conditions_str}. "
221
+ f"Interventions: {interventions_str}."
222
+ )
223
+
224
+ return Evidence(
225
+ content=content[:2000],
226
+ citation=Citation(
227
+ source="clinicaltrials",
228
+ title=title[:500],
229
+ url=f"https://clinicaltrials.gov/study/{nct_id}",
230
+ date=start_date,
231
+ authors=[], # Trials don't have traditional authors
232
+ ),
233
+ relevance=0.85, # Trials are highly relevant for repurposing
234
+ )
235
+ ```
236
+
237
+ ---
238
+
239
+ ## 5. TDD Test Suite
240
+
241
+ ### 5.1 Unit Tests (`tests/unit/tools/test_clinicaltrials.py`)
242
+
243
+ ```python
244
+ """Unit tests for ClinicalTrials.gov tool."""
245
+
246
+ import pytest
247
+ import respx
248
+ from httpx import Response
249
+
250
+ from src.tools.clinicaltrials import ClinicalTrialsTool
251
+ from src.utils.models import Evidence
252
+
253
+
254
+ @pytest.fixture
255
+ def mock_clinicaltrials_response():
256
+ """Mock ClinicalTrials.gov API response."""
257
+ return {
258
+ "studies": [
259
+ {
260
+ "protocolSection": {
261
+ "identificationModule": {
262
+ "nctId": "NCT04098666",
263
+ "briefTitle": "Metformin in Alzheimer's Dementia Prevention"
264
+ },
265
+ "statusModule": {
266
+ "overallStatus": "Recruiting",
267
+ "startDateStruct": {"date": "2020-01-15"}
268
+ },
269
+ "descriptionModule": {
270
+ "briefSummary": "This study evaluates metformin for Alzheimer's prevention."
271
+ },
272
+ "designModule": {
273
+ "phases": ["PHASE2"]
274
+ },
275
+ "conditionsModule": {
276
+ "conditions": ["Alzheimer Disease", "Dementia"]
277
+ },
278
+ "armsInterventionsModule": {
279
+ "interventions": [
280
+ {"name": "Metformin", "type": "Drug"}
281
+ ]
282
+ }
283
+ }
284
+ }
285
+ ]
286
+ }
287
+
288
+
289
+ class TestClinicalTrialsTool:
290
+ """Tests for ClinicalTrialsTool."""
291
+
292
+ def test_tool_name(self):
293
+ """Tool should have correct name."""
294
+ tool = ClinicalTrialsTool()
295
+ assert tool.name == "clinicaltrials"
296
+
297
+ @pytest.mark.asyncio
298
+ @respx.mock
299
+ async def test_search_returns_evidence(self, mock_clinicaltrials_response):
300
+ """Search should return Evidence objects."""
301
+ respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
302
+ return_value=Response(200, json=mock_clinicaltrials_response)
303
+ )
304
+
305
+ tool = ClinicalTrialsTool()
306
+ results = await tool.search("metformin alzheimer", max_results=5)
307
+
308
+ assert len(results) == 1
309
+ assert isinstance(results[0], Evidence)
310
+ assert results[0].citation.source == "clinicaltrials"
311
+ assert "NCT04098666" in results[0].citation.url
312
+ assert "Metformin" in results[0].citation.title
313
+
314
+ @pytest.mark.asyncio
315
+ @respx.mock
316
+ async def test_search_extracts_phase(self, mock_clinicaltrials_response):
317
+ """Search should extract trial phase."""
318
+ respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
319
+ return_value=Response(200, json=mock_clinicaltrials_response)
320
+ )
321
+
322
+ tool = ClinicalTrialsTool()
323
+ results = await tool.search("metformin alzheimer")
324
+
325
+ assert "PHASE2" in results[0].content
326
+
327
+ @pytest.mark.asyncio
328
+ @respx.mock
329
+ async def test_search_extracts_status(self, mock_clinicaltrials_response):
330
+ """Search should extract trial status."""
331
+ respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
332
+ return_value=Response(200, json=mock_clinicaltrials_response)
333
+ )
334
+
335
+ tool = ClinicalTrialsTool()
336
+ results = await tool.search("metformin alzheimer")
337
+
338
+ assert "Recruiting" in results[0].content
339
+
340
+ @pytest.mark.asyncio
341
+ @respx.mock
342
+ async def test_search_empty_results(self):
343
+ """Search should handle empty results gracefully."""
344
+ respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
345
+ return_value=Response(200, json={"studies": []})
346
+ )
347
+
348
+ tool = ClinicalTrialsTool()
349
+ results = await tool.search("nonexistent query xyz")
350
+
351
+ assert results == []
352
+
353
+ @pytest.mark.asyncio
354
+ @respx.mock
355
+ async def test_search_api_error(self):
356
+ """Search should raise SearchError on API failure."""
357
+ from src.utils.exceptions import SearchError
358
+
359
+ respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
360
+ return_value=Response(500, text="Internal Server Error")
361
+ )
362
+
363
+ tool = ClinicalTrialsTool()
364
+
365
+ with pytest.raises(SearchError):
366
+ await tool.search("metformin alzheimer")
367
+
368
+
369
+ class TestClinicalTrialsIntegration:
370
+ """Integration tests (marked for separate run)."""
371
+
372
+ @pytest.mark.integration
373
+ @pytest.mark.asyncio
374
+ async def test_real_api_call(self):
375
+ """Test actual API call (requires network)."""
376
+ tool = ClinicalTrialsTool()
377
+ results = await tool.search("metformin diabetes", max_results=3)
378
+
379
+ assert len(results) > 0
380
+ assert all(isinstance(r, Evidence) for r in results)
381
+ assert all(r.citation.source == "clinicaltrials" for r in results)
382
+ ```
383
+
384
+ ---
385
+
386
+ ## 6. Integration with SearchHandler
387
+
388
+ ### 6.1 Update Example Files
389
+
390
+ ```python
391
+ # examples/search_demo/run_search.py
392
+ from src.tools.clinicaltrials import ClinicalTrialsTool
393
+ from src.tools.pubmed import PubMedTool
394
+ from src.tools.search_handler import SearchHandler
395
+
396
+ search_handler = SearchHandler(
397
+ tools=[PubMedTool(), ClinicalTrialsTool()],
398
+ timeout=30.0
399
+ )
400
+ ```
401
+
402
+ ### 6.2 Update SearchResult Type
403
+
404
+ ```python
405
+ # src/utils/models.py
406
+ sources_searched: list[Literal["pubmed", "clinicaltrials"]]
407
+ ```
408
+
409
+ ---
410
+
411
+ ## 7. Definition of Done
412
+
413
+ Phase 10 is **COMPLETE** when:
414
+
415
+ - [ ] `src/tools/clinicaltrials.py` implemented
416
+ - [ ] Unit tests in `tests/unit/tools/test_clinicaltrials.py`
417
+ - [ ] Integration test marked with `@pytest.mark.integration`
418
+ - [ ] SearchHandler updated to include ClinicalTrialsTool
419
+ - [ ] Type definitions updated in models.py
420
+ - [ ] Example files updated
421
+ - [ ] All unit tests pass
422
+ - [ ] Lints pass
423
+ - [ ] Manual verification with real API
424
+
425
+ ---
426
+
427
+ ## 8. Verification Commands
428
+
429
+ ```bash
430
+ # 1. Run unit tests
431
+ uv run pytest tests/unit/tools/test_clinicaltrials.py -v
432
+
433
+ # 2. Run integration test (requires network)
434
+ uv run pytest tests/unit/tools/test_clinicaltrials.py -v -m integration
435
+
436
+ # 3. Run full test suite
437
+ uv run pytest tests/unit/ -v
438
+
439
+ # 4. Run example
440
+ source .env && uv run python examples/search_demo/run_search.py "metformin alzheimer"
441
+ # Should show results from BOTH PubMed AND ClinicalTrials.gov
442
+ ```
443
+
444
+ ---
445
+
446
+ ## 9. Value Delivered
447
+
448
+ | Before | After |
449
+ |--------|-------|
450
+ | Papers only | Papers + Clinical Trials |
451
+ | "Drug X might help" | "Drug X is in Phase II trial" |
452
+ | No trial status | Recruiting/Completed/Terminated |
453
+ | No phase info | Phase I/II/III evidence strength |
454
+
455
+ **Demo pitch addition**:
456
+ > "DeepCritical searches PubMed for peer-reviewed evidence AND ClinicalTrials.gov for 400,000+ clinical trials."
docs/implementation/11_phase_biorxiv.md ADDED
@@ -0,0 +1,572 @@
1
+ # Phase 11 Implementation Spec: bioRxiv Preprint Integration
2
+
3
+ **Goal**: Add cutting-edge preprint search for the latest research.
4
+ **Philosophy**: "Preprints are where breakthroughs appear first."
5
+ **Prerequisite**: Phase 10 complete (ClinicalTrials.gov working)
6
+ **Estimated Time**: 2-3 hours
7
+
8
+ ---
9
+
10
+ ## 1. Why bioRxiv?
11
+
12
+ ### Scientific Value
13
+
14
+ | Feature | Value for Drug Repurposing |
15
+ |---------|---------------------------|
16
+ | **Cutting-edge research** | 6-12 months ahead of PubMed |
17
+ | **Rapid publication** | Days, not months |
18
+ | **Free full-text** | Complete papers, not just abstracts |
19
+ | **medRxiv included** | Medical preprints via same API |
20
+ | **No API key required** | Free and open |
21
+
22
+ ### The Preprint Advantage
23
+
24
+ ```
25
+ Traditional Publication Timeline:
26
+ Research → Submit → Review → Revise → Accept → Publish
27
+ |___________________________ 6-18 months _______________|
28
+
29
+ Preprint Timeline:
30
+ Research → Upload → Available
31
+ |______ 1-3 days ______|
32
+ ```
33
+
34
+ **For drug repurposing**: Preprints contain the newest hypotheses and evidence!
35
+
36
+ ---
37
+
38
+ ## 2. API Specification
39
+
40
+ ### Endpoint
41
+
42
+ ```
43
+ Base URL: https://api.biorxiv.org/details/[server]/[interval]/[cursor]/[format]
44
+ ```
45
+
46
+ ### Servers
47
+
48
+ | Server | Content |
49
+ |--------|---------|
50
+ | `biorxiv` | Biology preprints |
51
+ | `medrxiv` | Medical preprints (more relevant for us!) |
52
+
53
+ ### Interval Formats
54
+
55
+ | Format | Example | Description |
56
+ |--------|---------|-------------|
57
+ | Date range | `2024-01-01/2024-12-31` | Papers between dates |
58
+ | Recent N | `50` | Most recent N papers |
59
+ | Recent N days | `30d` | Papers from last N days |
60
+
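+ For example, a date-range request against medRxiv (the same URL shape the tool in section 5 builds, with cursor `0` and `json` output) would be:
+
+ ```
+ https://api.biorxiv.org/details/medrxiv/2024-01-01/2024-03-31/0/json
+ ```
+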
61
+ ### Response Format
62
+
63
+ ```json
64
+ {
65
+ "collection": [
66
+ {
67
+ "doi": "10.1101/2024.01.15.123456",
68
+ "title": "Metformin repurposing for neurodegeneration",
69
+ "authors": "Smith, J; Jones, A",
70
+ "date": "2024-01-15",
71
+ "category": "neuroscience",
72
+ "abstract": "We investigated metformin's potential..."
73
+ }
74
+ ],
75
+ "messages": [{"status": "ok", "count": 100}]
76
+ }
77
+ ```
78
+
79
+ ### Rate Limits
80
+
81
+ - No official limit, but be respectful
82
+ - Results paginated (100 per call)
83
+ - Use cursor for pagination (a paging sketch follows below)
84
+
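+ A rough paging sketch, assuming the cursor is a start offset that advances by the number of records returned (worth verifying against the API docs before relying on it):
+
+ ```python
+ # Illustrative pagination over the details endpoint (assumption: cursor = offset)
+ import httpx
+
+
+ async def fetch_all(server: str, interval: str, max_pages: int = 5) -> list[dict]:
+     papers: list[dict] = []
+     cursor = 0
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         for _ in range(max_pages):
+             url = f"https://api.biorxiv.org/details/{server}/{interval}/{cursor}/json"
+             response = await client.get(url)
+             response.raise_for_status()
+             batch = response.json().get("collection", [])
+             papers.extend(batch)
+             if len(batch) < 100:  # less than a full page means we are done
+                 break
+             cursor += len(batch)
+     return papers
+
+
+ # usage: papers = asyncio.run(fetch_all("medrxiv", "2024-01-01/2024-03-31"))
+ ```
+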
85
+ ### Documentation
86
+
87
+ - [bioRxiv API](https://api.biorxiv.org/)
88
+ - [medrxivr R package docs](https://docs.ropensci.org/medrxivr/)
89
+
90
+ ---
91
+
92
+ ## 3. Search Strategy
93
+
94
+ ### Challenge: bioRxiv API Limitations
95
+
96
+ The bioRxiv API does NOT support keyword search directly. It returns papers by:
97
+ - Date range
98
+ - Recent count
99
+
100
+ ### Solution: Client-Side Filtering
101
+
102
+ ```python
103
+ # Strategy:
104
+ # 1. Fetch recent papers (e.g., last 90 days)
105
+ # 2. Filter by keyword matching in title/abstract
106
+ # 3. Use embeddings for semantic matching (leverage Phase 6!)
107
+ ```
108
+
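+ Step 3 is optional. The Phase 6 embedding interface is not restated here, so the sketch below takes any `embed` callable (hypothetical, `str -> list[float]`) and re-ranks the keyword matches by cosine similarity:
+
+ ```python
+ # Hypothetical re-ranking helper; `embed` stands in for whatever Phase 6 exposes.
+ import math
+ from collections.abc import Callable
+
+
+ def rerank_by_similarity(
+     query: str,
+     papers: list[dict],
+     embed: Callable[[str], list[float]],
+     top_k: int = 10,
+ ) -> list[dict]:
+     def cosine(a: list[float], b: list[float]) -> float:
+         dot = sum(x * y for x, y in zip(a, b))
+         norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
+         return dot / norm if norm else 0.0
+
+     query_vec = embed(query)
+     scored = [
+         (cosine(query_vec, embed(f"{p.get('title', '')} {p.get('abstract', '')}")), p)
+         for p in papers
+     ]
+     scored.sort(key=lambda pair: pair[0], reverse=True)
+     return [paper for _, paper in scored[:top_k]]
+ ```
+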
109
+ ### Alternative: Content Search Endpoint
110
+
111
+ ```
112
+ https://api.biorxiv.org/pubs/[server]/[doi_prefix]
113
+ ```
114
+
115
+ For searching, we can use the publisher endpoint with filtering.
116
+
117
+ ---
118
+
119
+ ## 4. Data Model
120
+
121
+ ### 4.1 Update Citation Source Type (`src/utils/models.py`)
122
+
123
+ ```python
124
+ # After Phase 11
125
+ source: Literal["pubmed", "clinicaltrials", "biorxiv"]
126
+ ```
127
+
128
+ ### 4.2 Evidence from Preprints
129
+
130
+ ```python
131
+ Evidence(
132
+ content=abstract[:2000],
133
+ citation=Citation(
134
+ source="biorxiv", # or "medrxiv"
135
+ title=title,
136
+ url=f"https://doi.org/{doi}",
137
+ date=date,
138
+ authors=authors.split("; ")[:5]
139
+ ),
140
+ relevance=0.75 # Preprints slightly lower than peer-reviewed
141
+ )
142
+ ```
143
+
144
+ ---
145
+
146
+ ## 5. Implementation
147
+
148
+ ### 5.1 bioRxiv Tool (`src/tools/biorxiv.py`)
149
+
150
+ ```python
151
+ """bioRxiv/medRxiv preprint search tool."""
152
+
153
+ import re
154
+ from datetime import datetime, timedelta
155
+
156
+ import httpx
157
+ from tenacity import retry, stop_after_attempt, wait_exponential
158
+
159
+ from src.utils.exceptions import SearchError
160
+ from src.utils.models import Citation, Evidence
161
+
162
+
163
+ class BioRxivTool:
164
+ """Search tool for bioRxiv and medRxiv preprints."""
165
+
166
+ BASE_URL = "https://api.biorxiv.org/details"
167
+ # Use medRxiv for medical/clinical content (more relevant for drug repurposing)
168
+ DEFAULT_SERVER = "medrxiv"
169
+ # Fetch papers from last N days
170
+ DEFAULT_DAYS = 90
171
+
172
+ def __init__(self, server: str = DEFAULT_SERVER, days: int = DEFAULT_DAYS):
173
+ """
174
+ Initialize bioRxiv tool.
175
+
176
+ Args:
177
+ server: "biorxiv" or "medrxiv"
178
+ days: How many days back to search
179
+ """
180
+ self.server = server
181
+ self.days = days
182
+
183
+ @property
184
+ def name(self) -> str:
185
+ return "biorxiv"
186
+
187
+ @retry(
188
+ stop=stop_after_attempt(3),
189
+ wait=wait_exponential(multiplier=1, min=1, max=10),
190
+ reraise=True,
191
+ )
192
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
193
+ """
194
+ Search bioRxiv/medRxiv for preprints matching query.
195
+
196
+ Note: bioRxiv API doesn't support keyword search directly.
197
+ We fetch recent papers and filter client-side.
198
+
199
+ Args:
200
+ query: Search query (keywords)
201
+ max_results: Maximum results to return
202
+
203
+ Returns:
204
+ List of Evidence objects from preprints
205
+ """
206
+ # Build date range for last N days
207
+ end_date = datetime.now().strftime("%Y-%m-%d")
208
+ start_date = (datetime.now() - timedelta(days=self.days)).strftime("%Y-%m-%d")
209
+ interval = f"{start_date}/{end_date}"
210
+
211
+ # Fetch recent papers
212
+ url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"
213
+
214
+ async with httpx.AsyncClient(timeout=30.0) as client:
215
+ try:
216
+ response = await client.get(url)
217
+ response.raise_for_status()
218
+ except httpx.HTTPStatusError as e:
219
+ raise SearchError(f"bioRxiv search failed: {e}") from e
220
+
221
+ data = response.json()
222
+ papers = data.get("collection", [])
223
+
224
+ # Filter papers by query keywords
225
+ query_terms = self._extract_terms(query)
226
+ matching = self._filter_by_keywords(papers, query_terms, max_results)
227
+
228
+ return [self._paper_to_evidence(paper) for paper in matching]
229
+
230
+ def _extract_terms(self, query: str) -> list[str]:
231
+ """Extract search terms from query."""
232
+ # Simple tokenization, lowercase
233
+ terms = re.findall(r'\b\w+\b', query.lower())
234
+ # Filter out common stop words
235
+ stop_words = {'the', 'a', 'an', 'in', 'on', 'for', 'and', 'or', 'of', 'to'}
236
+ return [t for t in terms if t not in stop_words and len(t) > 2]
237
+
238
+ def _filter_by_keywords(
239
+ self, papers: list[dict], terms: list[str], max_results: int
240
+ ) -> list[dict]:
241
+ """Filter papers that contain query terms in title or abstract."""
242
+ scored_papers = []
243
+
244
+ for paper in papers:
245
+ title = paper.get("title", "").lower()
246
+ abstract = paper.get("abstract", "").lower()
247
+ text = f"{title} {abstract}"
248
+
249
+ # Count matching terms
250
+ matches = sum(1 for term in terms if term in text)
251
+
252
+ if matches > 0:
253
+ scored_papers.append((matches, paper))
254
+
255
+ # Sort by match count (descending)
256
+ scored_papers.sort(key=lambda x: x[0], reverse=True)
257
+
258
+ return [paper for _, paper in scored_papers[:max_results]]
259
+
260
+ def _paper_to_evidence(self, paper: dict) -> Evidence:
261
+ """Convert a preprint paper to Evidence."""
262
+ doi = paper.get("doi", "")
263
+ title = paper.get("title", "Untitled")
264
+ authors_str = paper.get("authors", "Unknown")
265
+ date = paper.get("date", "Unknown")
266
+ abstract = paper.get("abstract", "No abstract available.")
267
+ category = paper.get("category", "")
268
+
269
+ # Parse authors (format: "Smith, J; Jones, A")
270
+ authors = [a.strip() for a in authors_str.split(";")][:5]
271
+
272
+ # Note this is a preprint in the content
273
+ content = (
274
+ f"[PREPRINT - Not peer-reviewed] "
275
+ f"{abstract[:1800]}... "
276
+ f"Category: {category}."
277
+ )
278
+
279
+ return Evidence(
280
+ content=content[:2000],
281
+ citation=Citation(
282
+ source="biorxiv",
283
+ title=title[:500],
284
+                 url=f"https://doi.org/{doi}" if doi else "https://www.medrxiv.org/",
285
+ date=date,
286
+ authors=authors,
287
+ ),
288
+ relevance=0.75, # Slightly lower than peer-reviewed
289
+ )
290
+ ```
291
+
292
+ ---
293
+
294
+ ## 6. TDD Test Suite
295
+
296
+ ### 6.1 Unit Tests (`tests/unit/tools/test_biorxiv.py`)
297
+
298
+ ```python
299
+ """Unit tests for bioRxiv tool."""
300
+
301
+ import pytest
302
+ import respx
303
+ from httpx import Response
304
+
305
+ from src.tools.biorxiv import BioRxivTool
306
+ from src.utils.models import Evidence
307
+
308
+
309
+ @pytest.fixture
310
+ def mock_biorxiv_response():
311
+ """Mock bioRxiv API response."""
312
+ return {
313
+ "collection": [
314
+ {
315
+ "doi": "10.1101/2024.01.15.24301234",
316
+ "title": "Metformin repurposing for Alzheimer's disease: a systematic review",
317
+ "authors": "Smith, John; Jones, Alice; Brown, Bob",
318
+ "date": "2024-01-15",
319
+ "category": "neurology",
320
+ "abstract": "Background: Metformin has shown neuroprotective effects. "
321
+ "We conducted a systematic review of metformin's potential "
322
+ "for Alzheimer's disease treatment."
323
+ },
324
+ {
325
+ "doi": "10.1101/2024.01.10.24301111",
326
+ "title": "COVID-19 vaccine efficacy study",
327
+ "authors": "Wilson, C",
328
+ "date": "2024-01-10",
329
+ "category": "infectious diseases",
330
+ "abstract": "This study evaluates COVID-19 vaccine efficacy."
331
+ }
332
+ ],
333
+ "messages": [{"status": "ok", "count": 2}]
334
+ }
335
+
336
+
337
+ class TestBioRxivTool:
338
+ """Tests for BioRxivTool."""
339
+
340
+ def test_tool_name(self):
341
+ """Tool should have correct name."""
342
+ tool = BioRxivTool()
343
+ assert tool.name == "biorxiv"
344
+
345
+ def test_default_server_is_medrxiv(self):
346
+ """Default server should be medRxiv for medical relevance."""
347
+ tool = BioRxivTool()
348
+ assert tool.server == "medrxiv"
349
+
350
+ @pytest.mark.asyncio
351
+ @respx.mock
352
+ async def test_search_returns_evidence(self, mock_biorxiv_response):
353
+ """Search should return Evidence objects."""
354
+ respx.get(url__startswith="https://api.biorxiv.org/details").mock(
355
+ return_value=Response(200, json=mock_biorxiv_response)
356
+ )
357
+
358
+ tool = BioRxivTool()
359
+ results = await tool.search("metformin alzheimer", max_results=5)
360
+
361
+ assert len(results) == 1 # Only the matching paper
362
+ assert isinstance(results[0], Evidence)
363
+ assert results[0].citation.source == "biorxiv"
364
+ assert "metformin" in results[0].citation.title.lower()
365
+
366
+ @pytest.mark.asyncio
367
+ @respx.mock
368
+ async def test_search_filters_by_keywords(self, mock_biorxiv_response):
369
+ """Search should filter papers by query keywords."""
370
+ respx.get(url__startswith="https://api.biorxiv.org/details").mock(
371
+ return_value=Response(200, json=mock_biorxiv_response)
372
+ )
373
+
374
+ tool = BioRxivTool()
375
+
376
+ # Search for metformin - should match first paper
377
+ results = await tool.search("metformin")
378
+ assert len(results) == 1
379
+ assert "metformin" in results[0].citation.title.lower()
380
+
381
+ # Search for COVID - should match second paper
382
+ results = await tool.search("covid vaccine")
383
+ assert len(results) == 1
384
+ assert "covid" in results[0].citation.title.lower()
385
+
386
+ @pytest.mark.asyncio
387
+ @respx.mock
388
+ async def test_search_marks_as_preprint(self, mock_biorxiv_response):
389
+ """Evidence content should note it's a preprint."""
390
+ respx.get(url__startswith="https://api.biorxiv.org/details").mock(
391
+ return_value=Response(200, json=mock_biorxiv_response)
392
+ )
393
+
394
+ tool = BioRxivTool()
395
+ results = await tool.search("metformin")
396
+
397
+ assert "PREPRINT" in results[0].content
398
+ assert "Not peer-reviewed" in results[0].content
399
+
400
+ @pytest.mark.asyncio
401
+ @respx.mock
402
+ async def test_search_empty_results(self):
403
+ """Search should handle empty results gracefully."""
404
+ respx.get(url__startswith="https://api.biorxiv.org/details").mock(
405
+ return_value=Response(200, json={"collection": [], "messages": []})
406
+ )
407
+
408
+ tool = BioRxivTool()
409
+ results = await tool.search("xyznonexistent")
410
+
411
+ assert results == []
412
+
413
+ @pytest.mark.asyncio
414
+ @respx.mock
415
+ async def test_search_api_error(self):
416
+ """Search should raise SearchError on API failure."""
417
+ from src.utils.exceptions import SearchError
418
+
419
+ respx.get(url__startswith="https://api.biorxiv.org/details").mock(
420
+ return_value=Response(500, text="Internal Server Error")
421
+ )
422
+
423
+ tool = BioRxivTool()
424
+
425
+ with pytest.raises(SearchError):
426
+ await tool.search("metformin")
427
+
428
+ def test_extract_terms(self):
429
+ """Should extract meaningful search terms."""
430
+ tool = BioRxivTool()
431
+
432
+ terms = tool._extract_terms("metformin for Alzheimer's disease")
433
+
434
+ assert "metformin" in terms
435
+ assert "alzheimer" in terms
436
+ assert "disease" in terms
437
+ assert "for" not in terms # Stop word
438
+ assert "the" not in terms # Stop word
439
+
440
+
441
+ class TestBioRxivIntegration:
442
+ """Integration tests (marked for separate run)."""
443
+
444
+ @pytest.mark.integration
445
+ @pytest.mark.asyncio
446
+ async def test_real_api_call(self):
447
+ """Test actual API call (requires network)."""
448
+ tool = BioRxivTool(days=30) # Last 30 days
449
+ results = await tool.search("diabetes", max_results=3)
450
+
451
+ # May or may not find results depending on recent papers
452
+ assert isinstance(results, list)
453
+ for r in results:
454
+ assert isinstance(r, Evidence)
455
+ assert r.citation.source == "biorxiv"
456
+ ```
457
+
458
+ ---
459
+
460
+ ## 7. Integration with SearchHandler
461
+
462
+ ### 7.1 Final SearchHandler Configuration
463
+
464
+ ```python
465
+ # examples/search_demo/run_search.py
466
+ from src.tools.biorxiv import BioRxivTool
467
+ from src.tools.clinicaltrials import ClinicalTrialsTool
468
+ from src.tools.pubmed import PubMedTool
469
+ from src.tools.search_handler import SearchHandler
470
+
471
+ search_handler = SearchHandler(
472
+ tools=[
473
+ PubMedTool(), # Peer-reviewed papers
474
+ ClinicalTrialsTool(), # Clinical trials
475
+ BioRxivTool(), # Preprints (cutting edge)
476
+ ],
477
+ timeout=30.0
478
+ )
479
+ ```
480
+
481
+ ### 7.2 Final Type Definition
482
+
483
+ ```python
484
+ # src/utils/models.py
485
+ sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
486
+ ```
487
+
488
+ ---
489
+
490
+ ## 8. Definition of Done
491
+
492
+ Phase 11 is **COMPLETE** when:
493
+
494
+ - [ ] `src/tools/biorxiv.py` implemented
495
+ - [ ] Unit tests in `tests/unit/tools/test_biorxiv.py`
496
+ - [ ] Integration test marked with `@pytest.mark.integration`
497
+ - [ ] SearchHandler updated to include BioRxivTool
498
+ - [ ] Type definitions updated in models.py
499
+ - [ ] Example files updated
500
+ - [ ] All unit tests pass
501
+ - [ ] Lints pass
502
+ - [ ] Manual verification with real API
503
+
504
+ ---
505
+
506
+ ## 9. Verification Commands
507
+
508
+ ```bash
509
+ # 1. Run unit tests
510
+ uv run pytest tests/unit/tools/test_biorxiv.py -v
511
+
512
+ # 2. Run integration test (requires network)
513
+ uv run pytest tests/unit/tools/test_biorxiv.py -v -m integration
514
+
515
+ # 3. Run full test suite
516
+ uv run pytest tests/unit/ -v
517
+
518
+ # 4. Run example with all three sources
519
+ source .env && uv run python examples/search_demo/run_search.py "metformin diabetes"
520
+ # Should show results from PubMed, ClinicalTrials.gov, AND bioRxiv/medRxiv
521
+ ```
522
+
523
+ ---
524
+
525
+ ## 10. Value Delivered
526
+
527
+ | Before | After |
528
+ |--------|-------|
529
+ | Only published papers | Published + Preprints |
530
+ | 6-18 month lag | Near real-time research |
531
+ | Miss cutting-edge | Catch breakthroughs early |
532
+
533
+ **Demo pitch (final)**:
534
+ > "DeepCritical searches PubMed for peer-reviewed evidence, ClinicalTrials.gov for 400,000+ clinical trials, and bioRxiv/medRxiv for cutting-edge preprints - then uses LLMs to generate mechanistic hypotheses and synthesize findings into publication-quality reports."
535
+
536
+ ---
537
+
538
+ ## 11. Complete Source Architecture (After Phase 11)
539
+
540
+ ```
541
+ User Query: "Can metformin treat Alzheimer's?"
542
+ |
543
+ v
544
+ SearchHandler
545
+ |
546
+        ┌───────────────┼───────────────┐
547
+ | | |
548
+ v v v
549
+ PubMedTool ClinicalTrials BioRxivTool
550
+ | Tool |
551
+ | | |
552
+ v v v
553
+ "15 peer- "3 Phase II "2 preprints
554
+ reviewed trials from last
555
+ papers" recruiting" 90 days"
556
+ | | |
557
+        └───────────────┼───────────────┘
558
+ |
559
+ v
560
+ Evidence Pool
561
+ |
562
+ v
563
+ EmbeddingService.deduplicate()
564
+ |
565
+ v
566
+ HypothesisAgent → JudgeAgent → ReportAgent
567
+ |
568
+ v
569
+ Structured Research Report
570
+ ```
571
+
572
+ **This is the Gucci Banger stack.**
docs/implementation/roadmap.md CHANGED
@@ -188,9 +188,12 @@ Structured Research Report
188
 3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)** ✅
189
 4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)** ✅
190
 5. **[Phase 5 Spec: Magentic Integration](05_phase_magentic.md)** ✅
191
- 6. **[Phase 6 Spec: Embeddings & Semantic Search](06_phase_embeddings.md)**
192
- 7. **[Phase 7 Spec: Hypothesis Agent](07_phase_hypothesis.md)**
193
- 8. **[Phase 8 Spec: Report Agent](08_phase_report.md)**
 
 
 
194
 
195
  ---
196
 
@@ -203,8 +206,11 @@ Structured Research Report
203
 | Phase 3: Judge | ✅ COMPLETE | LLM evidence assessment |
204
 | Phase 4: UI & Loop | ✅ COMPLETE | Working Gradio app |
205
 | Phase 5: Magentic | ✅ COMPLETE | Multi-agent orchestration |
206
- | Phase 6: Embeddings | 📝 SPEC READY | Semantic search |
207
- | Phase 7: Hypothesis | 📝 SPEC READY | Mechanistic reasoning |
208
- | Phase 8: Report | 📝 SPEC READY | Structured reports |
209
-
210
- *Phases 1-5 completed in ONE DAY. Phases 6-8 specs ready for implementation.*
 
 
 
 
188
 3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)** ✅
189
 4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)** ✅
190
 5. **[Phase 5 Spec: Magentic Integration](05_phase_magentic.md)** ✅
191
+ 6. **[Phase 6 Spec: Embeddings & Semantic Search](06_phase_embeddings.md)** ✅
192
+ 7. **[Phase 7 Spec: Hypothesis Agent](07_phase_hypothesis.md)** ✅
193
+ 8. **[Phase 8 Spec: Report Agent](08_phase_report.md)** ✅
194
+ 9. **[Phase 9 Spec: Remove DuckDuckGo](09_phase_source_cleanup.md)** 📝
195
+ 10. **[Phase 10 Spec: ClinicalTrials.gov](10_phase_clinicaltrials.md)** 📝
196
+ 11. **[Phase 11 Spec: bioRxiv Preprints](11_phase_biorxiv.md)** 📝
197
 
198
  ---
199
 
 
206
 | Phase 3: Judge | ✅ COMPLETE | LLM evidence assessment |
207
 | Phase 4: UI & Loop | ✅ COMPLETE | Working Gradio app |
208
 | Phase 5: Magentic | ✅ COMPLETE | Multi-agent orchestration |
209
+ | Phase 6: Embeddings | ✅ COMPLETE | Semantic search + ChromaDB |
210
+ | Phase 7: Hypothesis | ✅ COMPLETE | Mechanistic reasoning chains |
211
+ | Phase 8: Report | ✅ COMPLETE | Structured scientific reports |
212
+ | Phase 9: Source Cleanup | 📝 SPEC READY | Remove DuckDuckGo |
213
+ | Phase 10: ClinicalTrials | 📝 SPEC READY | ClinicalTrials.gov API |
214
+ | Phase 11: bioRxiv | 📝 SPEC READY | Preprint search |
215
+
216
+ *Phases 1-8 COMPLETE. Phases 9-11 will add multi-source credibility.*
docs/index.md CHANGED
@@ -14,10 +14,17 @@ AI-powered deep research system for accelerating drug repurposing discovery.
14
 
15
  ### Implementation (Start Here!)
16
  - **[Roadmap](implementation/roadmap.md)** - Phased execution plan with TDD
17
- - **[Phase 1: Foundation](implementation/01_phase_foundation.md)** - Tooling, config, first tests
18
- - **[Phase 2: Search](implementation/02_phase_search.md)** - PubMed + DuckDuckGo
19
- - **[Phase 3: Judge](implementation/03_phase_judge.md)** - LLM evidence assessment
20
- - **[Phase 4: UI](implementation/04_phase_ui.md)** - Orchestrator + Gradio + Deploy
 
 
 
 
 
 
 
21
 
22
  ### Guides
23
  - [Setup Guide](guides/setup.md) (coming soon)
@@ -76,6 +83,13 @@ User Question → Research Agent (Orchestrator)
76
 
77
  ## Status
78
 
 
 
 
 
 
 
 
79
  **Architecture Review**: PASSED (98-99/100)
80
- **Specs**: IRONCLAD
81
- **Next**: Implementation
 
14
 
15
  ### Implementation (Start Here!)
16
  - **[Roadmap](implementation/roadmap.md)** - Phased execution plan with TDD
17
+ - **[Phase 1: Foundation](implementation/01_phase_foundation.md)** ✅ - Tooling, config, first tests
18
+ - **[Phase 2: Search](implementation/02_phase_search.md)** ✅ - PubMed search
19
+ - **[Phase 3: Judge](implementation/03_phase_judge.md)** ✅ - LLM evidence assessment
20
+ - **[Phase 4: UI](implementation/04_phase_ui.md)** ✅ - Orchestrator + Gradio
21
+ - **[Phase 5: Magentic](implementation/05_phase_magentic.md)** ✅ - Multi-agent orchestration
22
+ - **[Phase 6: Embeddings](implementation/06_phase_embeddings.md)** ✅ - Semantic search + dedup
23
+ - **[Phase 7: Hypothesis](implementation/07_phase_hypothesis.md)** ✅ - Mechanistic reasoning
24
+ - **[Phase 8: Report](implementation/08_phase_report.md)** ✅ - Structured scientific reports
25
+ - **[Phase 9: Source Cleanup](implementation/09_phase_source_cleanup.md)** 📝 - Remove DuckDuckGo
26
+ - **[Phase 10: ClinicalTrials](implementation/10_phase_clinicaltrials.md)** 📝 - Clinical trials API
27
+ - **[Phase 11: bioRxiv](implementation/11_phase_biorxiv.md)** 📝 - Preprint search
28
 
29
  ### Guides
30
  - [Setup Guide](guides/setup.md) (coming soon)
 
83
 
84
  ## Status
85
 
86
+ | Phase | Status |
87
+ |-------|--------|
88
+ | Phases 1-8 | ✅ COMPLETE |
89
+ | Phase 9: Remove DuckDuckGo | 📝 SPEC READY |
90
+ | Phase 10: ClinicalTrials.gov | 📝 SPEC READY |
91
+ | Phase 11: bioRxiv | 📝 SPEC READY |
92
+
93
  **Architecture Review**: PASSED (98-99/100)
94
+ **Phases 1-8**: COMPLETE
95
+ **Next**: Phases 9-11 (Multi-Source Enhancement)