Phase 11 Implementation Spec: bioRxiv Preprint Integration
Goal: Add cutting-edge preprint search for the latest research.
Philosophy: "Preprints are where breakthroughs appear first."
Prerequisite: Phase 10 complete (ClinicalTrials.gov working)
Estimated Time: 2-3 hours
1. Why bioRxiv?
Scientific Value
| Feature | Value for Drug Repurposing |
|---|---|
| Cutting-edge research | 6-12 months ahead of PubMed |
| Rapid publication | Days, not months |
| Free full-text | Complete papers, not just abstracts |
| medRxiv included | Medical preprints via same API |
| No API key required | Free and open |
The Preprint Advantage
Traditional Publication Timeline:
Research → Submit → Review → Revise → Accept → Publish
|___________________________ 6-18 months _______________|
Preprint Timeline:
Research → Upload → Available
|______ 1-3 days ______|
For drug repurposing: Preprints contain the newest hypotheses and evidence!
2. API Specification
Endpoint
Base URL: https://api.biorxiv.org/details/[server]/[interval]/[cursor]/[format]
Servers
| Server | Content |
|---|---|
| biorxiv | Biology preprints |
| medrxiv | Medical preprints (more relevant for us!) |
Interval Formats
| Format | Example | Description |
|---|---|---|
| Date range | 2024-01-01/2024-12-31 | Papers between dates |
| Recent N | 50 | Most recent N papers |
| Recent N days | 30d | Papers from last N days |
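Putting the pieces together, a request for medRxiv papers in a date range, or for the last 30 days of bioRxiv papers, would look like this (cursor 0, JSON output):
https://api.biorxiv.org/details/medrxiv/2024-01-01/2024-12-31/0/json
https://api.biorxiv.org/details/biorxiv/30d/0/json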
Response Format
{
"collection": [
{
"doi": "10.1101/2024.01.15.123456",
"title": "Metformin repurposing for neurodegeneration",
"authors": "Smith, J; Jones, A",
"date": "2024-01-15",
"category": "neuroscience",
"abstract": "We investigated metformin's potential..."
}
],
"messages": [{"status": "ok", "count": 100}]
}
Rate Limits
- No official limit, but be respectful
- Results paginated (100 per call)
- Use cursor for pagination (see the sketch below)
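Pagination can be handled by advancing the cursor until a short page comes back. A minimal sketch, assuming each page holds up to 100 records in "collection" and that the cursor is a record offset (consistent with the response format above):
import httpx

async def fetch_all(server: str, interval: str) -> list[dict]:
    """Fetch every page of results for an interval by advancing the cursor."""
    papers: list[dict] = []
    cursor = 0
    async with httpx.AsyncClient(timeout=30.0) as client:
        while True:
            url = f"https://api.biorxiv.org/details/{server}/{interval}/{cursor}/json"
            response = await client.get(url)
            response.raise_for_status()
            batch = response.json().get("collection", [])
            papers.extend(batch)
            if len(batch) < 100:  # last page (partial or empty)
                break
            cursor += len(batch)
    return papers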
Documentation: https://api.biorxiv.org
3. Search Strategy
Challenge: bioRxiv API Limitations
The bioRxiv API does NOT support keyword search directly. It returns papers by:
- Date range
- Recent count
Solution: Client-Side Filtering
# Strategy:
# 1. Fetch recent papers (e.g., last 90 days)
# 2. Filter by keyword matching in title/abstract
# 3. Use embeddings for semantic matching (leverage Phase 6!)
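# For step 3, a minimal sketch of semantic re-ranking. It uses sentence-transformers
# directly as an illustration; the project's Phase 6 EmbeddingService could supply the
# vectors instead (its interface is not shown here, so this wiring is an assumption).
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def rank_semantically(query: str, papers: list[dict], top_k: int = 10) -> list[dict]:
    """Order fetched preprints by cosine similarity between query and title+abstract."""
    if not papers:
        return []
    texts = [f"{p.get('title', '')} {p.get('abstract', '')}" for p in papers]
    query_vec = _model.encode(query, convert_to_tensor=True)
    paper_vecs = _model.encode(texts, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, paper_vecs)[0]
    ranked = sorted(zip(scores.tolist(), papers), key=lambda item: item[0], reverse=True)
    return [paper for _, paper in ranked[:top_k]]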
Alternative: Content Search Endpoint
https://api.biorxiv.org/pubs/[server]/[doi_prefix]
For searching, we can use the publisher endpoint with filtering.
4. Data Model
4.1 Update Citation Source Type (src/utils/models.py)
# After Phase 11
source: Literal["pubmed", "clinicaltrials", "biorxiv"]
4.2 Evidence from Preprints
Evidence(
content=abstract[:2000],
citation=Citation(
source="biorxiv", # or "medrxiv"
title=title,
url=f"https://doi.org/{doi}",
date=date,
authors=authors.split("; ")[:5]
),
relevance=0.75 # Preprints slightly lower than peer-reviewed
)
5. Implementation
5.1 bioRxiv Tool (src/tools/biorxiv.py)
"""bioRxiv/medRxiv preprint search tool."""
import re
from datetime import datetime, timedelta
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential
from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence
class BioRxivTool:
"""Search tool for bioRxiv and medRxiv preprints."""
BASE_URL = "https://api.biorxiv.org/details"
# Use medRxiv for medical/clinical content (more relevant for drug repurposing)
DEFAULT_SERVER = "medrxiv"
# Fetch papers from last N days
DEFAULT_DAYS = 90
def __init__(self, server: str = DEFAULT_SERVER, days: int = DEFAULT_DAYS):
"""
Initialize bioRxiv tool.
Args:
server: "biorxiv" or "medrxiv"
days: How many days back to search
"""
self.server = server
self.days = days
@property
def name(self) -> str:
return "biorxiv"
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=1, max=10),
reraise=True,
)
async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
"""
Search bioRxiv/medRxiv for preprints matching query.
Note: bioRxiv API doesn't support keyword search directly.
We fetch recent papers and filter client-side.
Args:
query: Search query (keywords)
max_results: Maximum results to return
Returns:
List of Evidence objects from preprints
"""
# Build date range for last N days
end_date = datetime.now().strftime("%Y-%m-%d")
start_date = (datetime.now() - timedelta(days=self.days)).strftime("%Y-%m-%d")
interval = f"{start_date}/{end_date}"
        # Fetch recent papers (first page only: cursor 0 returns up to 100 records)
url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"
async with httpx.AsyncClient(timeout=30.0) as client:
try:
response = await client.get(url)
response.raise_for_status()
except httpx.HTTPStatusError as e:
raise SearchError(f"bioRxiv search failed: {e}") from e
data = response.json()
papers = data.get("collection", [])
# Filter papers by query keywords
query_terms = self._extract_terms(query)
matching = self._filter_by_keywords(papers, query_terms, max_results)
return [self._paper_to_evidence(paper) for paper in matching]
def _extract_terms(self, query: str) -> list[str]:
"""Extract search terms from query."""
# Simple tokenization, lowercase
terms = re.findall(r'\b\w+\b', query.lower())
# Filter out common stop words
stop_words = {'the', 'a', 'an', 'in', 'on', 'for', 'and', 'or', 'of', 'to'}
return [t for t in terms if t not in stop_words and len(t) > 2]
def _filter_by_keywords(
self, papers: list[dict], terms: list[str], max_results: int
) -> list[dict]:
"""Filter papers that contain query terms in title or abstract."""
scored_papers = []
for paper in papers:
title = paper.get("title", "").lower()
abstract = paper.get("abstract", "").lower()
text = f"{title} {abstract}"
# Count matching terms
matches = sum(1 for term in terms if term in text)
if matches > 0:
scored_papers.append((matches, paper))
# Sort by match count (descending)
scored_papers.sort(key=lambda x: x[0], reverse=True)
return [paper for _, paper in scored_papers[:max_results]]
def _paper_to_evidence(self, paper: dict) -> Evidence:
"""Convert a preprint paper to Evidence."""
doi = paper.get("doi", "")
title = paper.get("title", "Untitled")
authors_str = paper.get("authors", "Unknown")
date = paper.get("date", "Unknown")
abstract = paper.get("abstract", "No abstract available.")
category = paper.get("category", "")
# Parse authors (format: "Smith, J; Jones, A")
authors = [a.strip() for a in authors_str.split(";")][:5]
# Note this is a preprint in the content
content = (
f"[PREPRINT - Not peer-reviewed] "
f"{abstract[:1800]}... "
f"Category: {category}."
)
return Evidence(
content=content[:2000],
citation=Citation(
source="biorxiv",
title=title[:500],
url=f"https://doi.org/{doi}" if doi else f"https://www.medrxiv.org/",
date=date,
authors=authors,
),
relevance=0.75, # Slightly lower than peer-reviewed
)
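As a quick sanity check once the file is in place, the tool can be exercised standalone; switching servers is just a constructor argument (this snippet needs network access):
import asyncio

from src.tools.biorxiv import BioRxivTool

async def main() -> None:
    # medRxiv is the default; pass server="biorxiv" for biology preprints
    tool = BioRxivTool(server="medrxiv", days=90)
    results = await tool.search("metformin alzheimer", max_results=5)
    for evidence in results:
        print(evidence.citation.title, evidence.citation.url)

if __name__ == "__main__":
    asyncio.run(main())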
6. TDD Test Suite
6.1 Unit Tests (tests/unit/tools/test_biorxiv.py)
"""Unit tests for bioRxiv tool."""
import pytest
import respx
from httpx import Response
from src.tools.biorxiv import BioRxivTool
from src.utils.models import Evidence
@pytest.fixture
def mock_biorxiv_response():
"""Mock bioRxiv API response."""
return {
"collection": [
{
"doi": "10.1101/2024.01.15.24301234",
"title": "Metformin repurposing for Alzheimer's disease: a systematic review",
"authors": "Smith, John; Jones, Alice; Brown, Bob",
"date": "2024-01-15",
"category": "neurology",
"abstract": "Background: Metformin has shown neuroprotective effects. "
"We conducted a systematic review of metformin's potential "
"for Alzheimer's disease treatment."
},
{
"doi": "10.1101/2024.01.10.24301111",
"title": "COVID-19 vaccine efficacy study",
"authors": "Wilson, C",
"date": "2024-01-10",
"category": "infectious diseases",
"abstract": "This study evaluates COVID-19 vaccine efficacy."
}
],
"messages": [{"status": "ok", "count": 2}]
}
class TestBioRxivTool:
"""Tests for BioRxivTool."""
def test_tool_name(self):
"""Tool should have correct name."""
tool = BioRxivTool()
assert tool.name == "biorxiv"
def test_default_server_is_medrxiv(self):
"""Default server should be medRxiv for medical relevance."""
tool = BioRxivTool()
assert tool.server == "medrxiv"
@pytest.mark.asyncio
@respx.mock
async def test_search_returns_evidence(self, mock_biorxiv_response):
"""Search should return Evidence objects."""
respx.get(url__startswith="https://api.biorxiv.org/details").mock(
return_value=Response(200, json=mock_biorxiv_response)
)
tool = BioRxivTool()
results = await tool.search("metformin alzheimer", max_results=5)
assert len(results) == 1 # Only the matching paper
assert isinstance(results[0], Evidence)
assert results[0].citation.source == "biorxiv"
assert "metformin" in results[0].citation.title.lower()
@pytest.mark.asyncio
@respx.mock
async def test_search_filters_by_keywords(self, mock_biorxiv_response):
"""Search should filter papers by query keywords."""
respx.get(url__startswith="https://api.biorxiv.org/details").mock(
return_value=Response(200, json=mock_biorxiv_response)
)
tool = BioRxivTool()
# Search for metformin - should match first paper
results = await tool.search("metformin")
assert len(results) == 1
assert "metformin" in results[0].citation.title.lower()
# Search for COVID - should match second paper
results = await tool.search("covid vaccine")
assert len(results) == 1
assert "covid" in results[0].citation.title.lower()
@pytest.mark.asyncio
@respx.mock
async def test_search_marks_as_preprint(self, mock_biorxiv_response):
"""Evidence content should note it's a preprint."""
respx.get(url__startswith="https://api.biorxiv.org/details").mock(
return_value=Response(200, json=mock_biorxiv_response)
)
tool = BioRxivTool()
results = await tool.search("metformin")
assert "PREPRINT" in results[0].content
assert "Not peer-reviewed" in results[0].content
@pytest.mark.asyncio
@respx.mock
async def test_search_empty_results(self):
"""Search should handle empty results gracefully."""
respx.get(url__startswith="https://api.biorxiv.org/details").mock(
return_value=Response(200, json={"collection": [], "messages": []})
)
tool = BioRxivTool()
results = await tool.search("xyznonexistent")
assert results == []
@pytest.mark.asyncio
@respx.mock
async def test_search_api_error(self):
"""Search should raise SearchError on API failure."""
from src.utils.exceptions import SearchError
respx.get(url__startswith="https://api.biorxiv.org/details").mock(
return_value=Response(500, text="Internal Server Error")
)
tool = BioRxivTool()
with pytest.raises(SearchError):
await tool.search("metformin")
def test_extract_terms(self):
"""Should extract meaningful search terms."""
tool = BioRxivTool()
terms = tool._extract_terms("metformin for Alzheimer's disease")
assert "metformin" in terms
assert "alzheimer" in terms
assert "disease" in terms
assert "for" not in terms # Stop word
assert "the" not in terms # Stop word
class TestBioRxivIntegration:
"""Integration tests (marked for separate run)."""
@pytest.mark.integration
@pytest.mark.asyncio
async def test_real_api_call(self):
"""Test actual API call (requires network)."""
tool = BioRxivTool(days=30) # Last 30 days
results = await tool.search("diabetes", max_results=3)
# May or may not find results depending on recent papers
assert isinstance(results, list)
for r in results:
assert isinstance(r, Evidence)
assert r.citation.source == "biorxiv"
7. Integration with SearchHandler
7.1 Final SearchHandler Configuration
# examples/search_demo/run_search.py
from src.tools.biorxiv import BioRxivTool
from src.tools.clinicaltrials import ClinicalTrialsTool
from src.tools.pubmed import PubMedTool
from src.tools.search_handler import SearchHandler
search_handler = SearchHandler(
tools=[
PubMedTool(), # Peer-reviewed papers
ClinicalTrialsTool(), # Clinical trials
BioRxivTool(), # Preprints (cutting edge)
],
timeout=30.0
)
7.2 Final Type Definition
# src/utils/models.py
sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
8. Definition of Done
Phase 11 is COMPLETE when:
- src/tools/biorxiv.py implemented
- Unit tests in tests/unit/tools/test_biorxiv.py
- Integration test marked with @pytest.mark.integration
- SearchHandler updated to include BioRxivTool
- Type definitions updated in models.py
- Example files updated
- All unit tests pass
- Lints pass
- Manual verification with real API
9. Verification Commands
# 1. Run unit tests
uv run pytest tests/unit/tools/test_biorxiv.py -v
# 2. Run integration test (requires network)
uv run pytest tests/unit/tools/test_biorxiv.py -v -m integration
# 3. Run full test suite
uv run pytest tests/unit/ -v
# 4. Run example with all three sources
source .env && uv run python examples/search_demo/run_search.py "metformin diabetes"
# Should show results from PubMed, ClinicalTrials.gov, AND bioRxiv/medRxiv
10. Value Delivered
| Before | After |
|---|---|
| Only published papers | Published + Preprints |
| 6-18 month lag | Near real-time research |
| Miss cutting-edge | Catch breakthroughs early |
Demo pitch (final):
"DeepCritical searches PubMed for peer-reviewed evidence, ClinicalTrials.gov for 400,000+ clinical trials, and bioRxiv/medRxiv for cutting-edge preprints - then uses LLMs to generate mechanistic hypotheses and synthesize findings into publication-quality reports."
11. Complete Source Architecture (After Phase 11)
User Query: "Can metformin treat Alzheimer's?"
|
v
SearchHandler
|
βββββββββββββββββΌββββββββββββββββ
| | |
v v v
PubMedTool ClinicalTrials BioRxivTool
| Tool |
| | |
v v v
"15 peer- "3 Phase II "2 preprints
reviewed trials from last
papers" recruiting" 90 days"
| | |
βββββββββββββββββΌββββββββββββββββ
|
v
Evidence Pool
|
v
EmbeddingService.deduplicate()
|
v
HypothesisAgent β JudgeAgent β ReportAgent
|
v
Structured Research Report
This is the Gucci Banger stack.