# Phase 11 Implementation Spec: bioRxiv Preprint Integration

**Goal**: Add cutting-edge preprint search for the latest research.
**Philosophy**: "Preprints are where breakthroughs appear first."
**Prerequisite**: Phase 10 complete (ClinicalTrials.gov working)
**Estimated Time**: 2-3 hours

---

## 1. Why bioRxiv?

### Scientific Value

| Feature | Value for Drug Repurposing |
|---------|---------------------------|
| **Cutting-edge research** | 6-12 months ahead of PubMed |
| **Rapid publication** | Days, not months |
| **Free full-text** | Complete papers, not just abstracts |
| **medRxiv included** | Medical preprints via same API |
| **No API key required** | Free and open |

### The Preprint Advantage

```
Traditional Publication Timeline:
  Research β†’ Submit β†’ Review β†’ Revise β†’ Accept β†’ Publish
  |___________________________ 6-18 months _______________|

Preprint Timeline:
  Research β†’ Upload β†’ Available
  |______ 1-3 days ______|
```

**For drug repurposing**: Preprints contain the newest hypotheses and evidence!

---

## 2. API Specification

### Endpoint

```
Base URL: https://api.biorxiv.org/details/[server]/[interval]/[cursor]/[format]
```

### Servers

| Server | Content |
|--------|---------|
| `biorxiv` | Biology preprints |
| `medrxiv` | Medical preprints (more relevant for us!) |

### Interval Formats

| Format | Example | Description |
|--------|---------|-------------|
| Date range | `2024-01-01/2024-12-31` | Papers between dates |
| Recent N | `50` | Most recent N papers |
| Recent N days | `30d` | Papers from last N days |

### Response Format

```json
{
  "collection": [
    {
      "doi": "10.1101/2024.01.15.123456",
      "title": "Metformin repurposing for neurodegeneration",
      "authors": "Smith, J; Jones, A",
      "date": "2024-01-15",
      "category": "neuroscience",
      "abstract": "We investigated metformin's potential..."
    }
  ],
  "messages": [{"status": "ok", "count": 100}]
}
```

### Rate Limits

- No official limit, but be respectful
- Results paginated (100 per call)
- Use cursor for pagination
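
Because each call returns at most 100 records, fetching a full date window means stepping the cursor in increments of 100. A minimal sketch of building the paginated URLs, following the endpoint format above (the function name and page-size constant are illustrative, not part of the API):

```python
PAGE_SIZE = 100  # bioRxiv returns up to 100 records per call


def biorxiv_page_urls(server: str, interval: str, pages: int) -> list[str]:
    """Build details-endpoint URLs, stepping the cursor by the 100-item page size."""
    base = "https://api.biorxiv.org/details"
    return [
        f"{base}/{server}/{interval}/{cursor}/json"
        for cursor in range(0, pages * PAGE_SIZE, PAGE_SIZE)
    ]


urls = biorxiv_page_urls("medrxiv", "2024-01-01/2024-03-31", pages=3)
# cursors step 0, 100, 200
```

In practice you would stop paging early once the `count` in the response's `messages` entry falls below 100.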

### Documentation

- [bioRxiv API](https://api.biorxiv.org/)
- [medrxivr R package docs](https://docs.ropensci.org/medrxivr/)

---

## 3. Search Strategy

### Challenge: bioRxiv API Limitations

The bioRxiv API does NOT support keyword search directly. It returns papers by:
- Date range
- Recent count

### Solution: Client-Side Filtering

```python
# Strategy:
# 1. Fetch recent papers (e.g., last 90 days)
# 2. Filter by keyword matching in title/abstract
# 3. Use embeddings for semantic matching (leverage Phase 6!)
```
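
Step 3 of the strategy can rank the fetched papers by cosine similarity between a query embedding and each paper's embedding. A minimal sketch with toy 3-d vectors standing in for the real Phase 6 embeddings (the helper names are illustrative):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_by_similarity(
    query_vec: list[float], papers: list[tuple[str, list[float]]]
) -> list[tuple[str, list[float]]]:
    """Sort (title, vector) pairs by similarity to the query, best first."""
    return sorted(papers, key=lambda p: cosine(query_vec, p[1]), reverse=True)


# Toy vectors stand in for real embeddings
query_vec = [1.0, 0.0, 0.0]
papers = [
    ("COVID-19 vaccine efficacy", [0.0, 1.0, 0.0]),
    ("Metformin for neurodegeneration", [0.9, 0.1, 0.0]),
]
best_title = rank_by_similarity(query_vec, papers)[0][0]
```

Semantic ranking catches papers that discuss the topic without repeating the exact query keywords, which pure term matching misses.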

### Alternative: Content Search Endpoint

```
https://api.biorxiv.org/pubs/[server]/[doi_prefix]
```

This endpoint links preprints to their published versions, keyed by publisher DOI prefix; it still offers no keyword search, so client-side filtering remains necessary either way.

---

## 4. Data Model

### 4.1 Update Citation Source Type (`src/utils/models.py`)

```python
# After Phase 11
source: Literal["pubmed", "clinicaltrials", "biorxiv"]
```

### 4.2 Evidence from Preprints

```python
Evidence(
    content=abstract[:2000],
    citation=Citation(
        source="biorxiv",  # or "medrxiv"
        title=title,
        url=f"https://doi.org/{doi}",
        date=date,
        authors=authors.split("; ")[:5]
    ),
    relevance=0.75  # Preprints slightly lower than peer-reviewed
)
```
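
One way to keep per-source relevance priors consistent is a single lookup table. Only the 0.75 preprint value comes from this spec; the other numbers below are placeholder assumptions:

```python
# Illustrative per-source relevance priors. Only the 0.75 preprint value
# comes from this spec -- the other values are placeholder assumptions.
SOURCE_RELEVANCE: dict[str, float] = {
    "pubmed": 0.9,           # assumed prior for peer-reviewed papers
    "clinicaltrials": 0.85,  # assumed prior for registered trials
    "biorxiv": 0.75,         # preprints, per this spec
}


def default_relevance(source: str) -> float:
    """Look up the prior for a source, falling back to a neutral 0.5."""
    return SOURCE_RELEVANCE.get(source, 0.5)
```

Centralizing the priors means the judge and report agents can weight evidence uniformly instead of each tool hard-coding its own number.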

---

## 5. Implementation

### 5.1 bioRxiv Tool (`src/tools/biorxiv.py`)

```python
"""bioRxiv/medRxiv preprint search tool."""

import re
from datetime import datetime, timedelta

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class BioRxivTool:
    """Search tool for bioRxiv and medRxiv preprints."""

    BASE_URL = "https://api.biorxiv.org/details"
    # Use medRxiv for medical/clinical content (more relevant for drug repurposing)
    DEFAULT_SERVER = "medrxiv"
    # Fetch papers from last N days
    DEFAULT_DAYS = 90

    def __init__(self, server: str = DEFAULT_SERVER, days: int = DEFAULT_DAYS):
        """
        Initialize bioRxiv tool.

        Args:
            server: "biorxiv" or "medrxiv"
            days: How many days back to search
        """
        self.server = server
        self.days = days

    @property
    def name(self) -> str:
        return "biorxiv"

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,
    )
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """
        Search bioRxiv/medRxiv for preprints matching query.

        Note: bioRxiv API doesn't support keyword search directly.
        We fetch recent papers and filter client-side.

        Args:
            query: Search query (keywords)
            max_results: Maximum results to return

        Returns:
            List of Evidence objects from preprints
        """
        # Build date range for last N days
        end_date = datetime.now().strftime("%Y-%m-%d")
        start_date = (datetime.now() - timedelta(days=self.days)).strftime("%Y-%m-%d")
        interval = f"{start_date}/{end_date}"

        # Fetch recent papers
        url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"

        async with httpx.AsyncClient(timeout=30.0) as client:
            try:
                response = await client.get(url)
                response.raise_for_status()
            except httpx.HTTPStatusError as e:
                raise SearchError(f"bioRxiv search failed: {e}") from e

            data = response.json()
            papers = data.get("collection", [])

            # Filter papers by query keywords
            query_terms = self._extract_terms(query)
            matching = self._filter_by_keywords(papers, query_terms, max_results)

            return [self._paper_to_evidence(paper) for paper in matching]

    def _extract_terms(self, query: str) -> list[str]:
        """Extract search terms from query."""
        # Simple tokenization, lowercase
        terms = re.findall(r'\b\w+\b', query.lower())
        # Filter out common stop words
        stop_words = {'the', 'a', 'an', 'in', 'on', 'for', 'and', 'or', 'of', 'to'}
        return [t for t in terms if t not in stop_words and len(t) > 2]

    def _filter_by_keywords(
        self, papers: list[dict], terms: list[str], max_results: int
    ) -> list[dict]:
        """Filter papers that contain query terms in title or abstract."""
        scored_papers = []

        for paper in papers:
            title = paper.get("title", "").lower()
            abstract = paper.get("abstract", "").lower()
            text = f"{title} {abstract}"

            # Count matching terms
            matches = sum(1 for term in terms if term in text)

            if matches > 0:
                scored_papers.append((matches, paper))

        # Sort by match count (descending)
        scored_papers.sort(key=lambda x: x[0], reverse=True)

        return [paper for _, paper in scored_papers[:max_results]]

    def _paper_to_evidence(self, paper: dict) -> Evidence:
        """Convert a preprint paper to Evidence."""
        doi = paper.get("doi", "")
        title = paper.get("title", "Untitled")
        authors_str = paper.get("authors", "Unknown")
        date = paper.get("date", "Unknown")
        abstract = paper.get("abstract", "No abstract available.")
        category = paper.get("category", "")

        # Parse authors (format: "Smith, J; Jones, A")
        authors = [a.strip() for a in authors_str.split(";")][:5]

        # Flag preprint status in the content so downstream agents can discount it
        truncated = abstract[:1800] + ("..." if len(abstract) > 1800 else "")
        content = (
            f"[PREPRINT - Not peer-reviewed] "
            f"{truncated} "
            f"Category: {category}."
        )

        return Evidence(
            content=content[:2000],
            citation=Citation(
                source="biorxiv",
                title=title[:500],
                url=f"https://doi.org/{doi}" if doi else "https://www.medrxiv.org/",
                date=date,
                authors=authors,
            ),
            relevance=0.75,  # Slightly lower than peer-reviewed
        )
```

---

## 6. TDD Test Suite

### 6.1 Unit Tests (`tests/unit/tools/test_biorxiv.py`)

```python
"""Unit tests for bioRxiv tool."""

import pytest
import respx
from httpx import Response

from src.tools.biorxiv import BioRxivTool
from src.utils.models import Evidence


@pytest.fixture
def mock_biorxiv_response():
    """Mock bioRxiv API response."""
    return {
        "collection": [
            {
                "doi": "10.1101/2024.01.15.24301234",
                "title": "Metformin repurposing for Alzheimer's disease: a systematic review",
                "authors": "Smith, John; Jones, Alice; Brown, Bob",
                "date": "2024-01-15",
                "category": "neurology",
                "abstract": "Background: Metformin has shown neuroprotective effects. "
                           "We conducted a systematic review of metformin's potential "
                           "for Alzheimer's disease treatment."
            },
            {
                "doi": "10.1101/2024.01.10.24301111",
                "title": "COVID-19 vaccine efficacy study",
                "authors": "Wilson, C",
                "date": "2024-01-10",
                "category": "infectious diseases",
                "abstract": "This study evaluates COVID-19 vaccine efficacy."
            }
        ],
        "messages": [{"status": "ok", "count": 2}]
    }


class TestBioRxivTool:
    """Tests for BioRxivTool."""

    def test_tool_name(self):
        """Tool should have correct name."""
        tool = BioRxivTool()
        assert tool.name == "biorxiv"

    def test_default_server_is_medrxiv(self):
        """Default server should be medRxiv for medical relevance."""
        tool = BioRxivTool()
        assert tool.server == "medrxiv"

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_returns_evidence(self, mock_biorxiv_response):
        """Search should return Evidence objects."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json=mock_biorxiv_response)
        )

        tool = BioRxivTool()
        results = await tool.search("metformin alzheimer", max_results=5)

        assert len(results) == 1  # Only the matching paper
        assert isinstance(results[0], Evidence)
        assert results[0].citation.source == "biorxiv"
        assert "metformin" in results[0].citation.title.lower()

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_filters_by_keywords(self, mock_biorxiv_response):
        """Search should filter papers by query keywords."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json=mock_biorxiv_response)
        )

        tool = BioRxivTool()

        # Search for metformin - should match first paper
        results = await tool.search("metformin")
        assert len(results) == 1
        assert "metformin" in results[0].citation.title.lower()

        # Search for COVID - should match second paper
        results = await tool.search("covid vaccine")
        assert len(results) == 1
        assert "covid" in results[0].citation.title.lower()

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_marks_as_preprint(self, mock_biorxiv_response):
        """Evidence content should note it's a preprint."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json=mock_biorxiv_response)
        )

        tool = BioRxivTool()
        results = await tool.search("metformin")

        assert "PREPRINT" in results[0].content
        assert "Not peer-reviewed" in results[0].content

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_empty_results(self):
        """Search should handle empty results gracefully."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json={"collection": [], "messages": []})
        )

        tool = BioRxivTool()
        results = await tool.search("xyznonexistent")

        assert results == []

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_api_error(self):
        """Search should raise SearchError on API failure."""
        from src.utils.exceptions import SearchError

        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(500, text="Internal Server Error")
        )

        tool = BioRxivTool()

        with pytest.raises(SearchError):
            await tool.search("metformin")

    def test_extract_terms(self):
        """Should extract meaningful search terms."""
        tool = BioRxivTool()

        terms = tool._extract_terms("metformin for Alzheimer's disease")

        assert "metformin" in terms
        assert "alzheimer" in terms
        assert "disease" in terms
        assert "for" not in terms  # Stop word
        assert "the" not in terms  # Stop word


class TestBioRxivIntegration:
    """Integration tests (marked for separate run)."""

    @pytest.mark.integration
    @pytest.mark.asyncio
    async def test_real_api_call(self):
        """Test actual API call (requires network)."""
        tool = BioRxivTool(days=30)  # Last 30 days
        results = await tool.search("diabetes", max_results=3)

        # May or may not find results depending on recent papers
        assert isinstance(results, list)
        for r in results:
            assert isinstance(r, Evidence)
            assert r.citation.source == "biorxiv"
```

---

## 7. Integration with SearchHandler

### 7.1 Final SearchHandler Configuration

```python
# examples/search_demo/run_search.py
from src.tools.biorxiv import BioRxivTool
from src.tools.clinicaltrials import ClinicalTrialsTool
from src.tools.pubmed import PubMedTool
from src.tools.search_handler import SearchHandler

search_handler = SearchHandler(
    tools=[
        PubMedTool(),           # Peer-reviewed papers
        ClinicalTrialsTool(),   # Clinical trials
        BioRxivTool(),          # Preprints (cutting edge)
    ],
    timeout=30.0
)
```

### 7.2 Final Type Definition

```python
# src/utils/models.py
sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
```

---

## 8. Definition of Done

Phase 11 is **COMPLETE** when:

- [ ] `src/tools/biorxiv.py` implemented
- [ ] Unit tests in `tests/unit/tools/test_biorxiv.py`
- [ ] Integration test marked with `@pytest.mark.integration`
- [ ] SearchHandler updated to include BioRxivTool
- [ ] Type definitions updated in models.py
- [ ] Example files updated
- [ ] All unit tests pass
- [ ] Lints pass
- [ ] Manual verification with real API

---

## 9. Verification Commands

```bash
# 1. Run unit tests
uv run pytest tests/unit/tools/test_biorxiv.py -v

# 2. Run integration test (requires network)
uv run pytest tests/unit/tools/test_biorxiv.py -v -m integration

# 3. Run full test suite
uv run pytest tests/unit/ -v

# 4. Run example with all three sources
source .env && uv run python examples/search_demo/run_search.py "metformin diabetes"
# Should show results from PubMed, ClinicalTrials.gov, AND bioRxiv/medRxiv
```

---

## 10. Value Delivered

| Before | After |
|--------|-------|
| Only published papers | Published + Preprints |
| 6-18 month lag | Near real-time research |
| Miss cutting-edge | Catch breakthroughs early |

**Demo pitch (final)**:
> "DeepCritical searches PubMed for peer-reviewed evidence, ClinicalTrials.gov for 400,000+ clinical trials, and bioRxiv/medRxiv for cutting-edge preprints - then uses LLMs to generate mechanistic hypotheses and synthesize findings into publication-quality reports."

---

## 11. Complete Source Architecture (After Phase 11)

```
User Query: "Can metformin treat Alzheimer's?"
                    |
                    v
            SearchHandler
                    |
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    |               |               |
    v               v               v
PubMedTool  ClinicalTrialsTool  BioRxivTool
    |               |               |
    |               |               |
    v               v               v
"15 peer-    "3 Phase II     "2 preprints
reviewed      trials          from last
papers"       recruiting"     90 days"
    |               |               |
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    |
                    v
            Evidence Pool
                    |
                    v
        EmbeddingService.deduplicate()
                    |
                    v
        HypothesisAgent β†’ JudgeAgent β†’ ReportAgent
                    |
                    v
        Structured Research Report
```

**This is the Gucci Banger stack.**