Spaces:

DataQuests
/

DeepCritical

Running

App Files Files Community

DeepCritical / docs /implementation /11_phase_biorxiv.md

VibecoderMcSwaggins

feat(docs): update implementation roadmap and add specs for Phases 9-11

3fcd8e7 17 days ago

preview code

raw

history blame

17 kB

	# Phase 11 Implementation Spec: bioRxiv Preprint Integration

	Goal: Add cutting-edge preprint search for the latest research.
	Philosophy: "Preprints are where breakthroughs appear first."
	Prerequisite: Phase 10 complete (ClinicalTrials.gov working)
	Estimated Time: 2-3 hours

	---

	## 1. Why bioRxiv?

	### Scientific Value

	\| Feature \| Value for Drug Repurposing \|
	\|---------\|---------------------------\|
	\| Cutting-edge research \| 6-12 months ahead of PubMed \|
	\| Rapid publication \| Days, not months \|
	\| Free full-text \| Complete papers, not just abstracts \|
	\| medRxiv included \| Medical preprints via same API \|
	\| No API key required \| Free and open \|

	### The Preprint Advantage

	```
	Traditional Publication Timeline:
	Research → Submit → Review → Revise → Accept → Publish
	\|___________________________ 6-18 months _______________\|

	Preprint Timeline:
	Research → Upload → Available
	\|______ 1-3 days ______\|
	```

	For drug repurposing: Preprints contain the newest hypotheses and evidence!

	---

	## 2. API Specification

	### Endpoint

	```
	Base URL: https://api.biorxiv.org/details/[server]/[interval]/[cursor]/[format]
	```

	### Servers

	\| Server \| Content \|
	\|--------\|---------\|
	\| `biorxiv` \| Biology preprints \|
	\| `medrxiv` \| Medical preprints (more relevant for us!) \|

	### Interval Formats

	\| Format \| Example \| Description \|
	\|--------\|---------\|-------------\|
	\| Date range \| `2024-01-01/2024-12-31` \| Papers between dates \|
	\| Recent N \| `50` \| Most recent N papers \|
	\| Recent N days \| `30d` \| Papers from last N days \|

	### Response Format

	```json
	{
	"collection": [
	{
	"doi": "10.1101/2024.01.15.123456",
	"title": "Metformin repurposing for neurodegeneration",
	"authors": "Smith, J; Jones, A",
	"date": "2024-01-15",
	"category": "neuroscience",
	"abstract": "We investigated metformin's potential..."
	}
	],
	"messages": [{"status": "ok", "count": 100}]
	}
	```

	### Rate Limits

	- No official limit, but be respectful
	- Results paginated (100 per call)
	- Use cursor for pagination

	### Documentation

	- [bioRxiv API](https://api.biorxiv.org/)
	- [medrxivr R package docs](https://docs.ropensci.org/medrxivr/)

	---

	## 3. Search Strategy

	### Challenge: bioRxiv API Limitations

	The bioRxiv API does NOT support keyword search directly. It returns papers by:
	- Date range
	- Recent count

	### Solution: Client-Side Filtering

	```python
	# Strategy:
	# 1. Fetch recent papers (e.g., last 90 days)
	# 2. Filter by keyword matching in title/abstract
	# 3. Use embeddings for semantic matching (leverage Phase 6!)
	```

	### Alternative: Content Search Endpoint

	```
	https://api.biorxiv.org/pubs/[server]/[doi_prefix]
	```

	For searching, we can use the publisher endpoint with filtering.

	---

	## 4. Data Model

	### 4.1 Update Citation Source Type (`src/utils/models.py`)

	```python
	# After Phase 11
	source: Literal["pubmed", "clinicaltrials", "biorxiv"]
	```

	### 4.2 Evidence from Preprints

	```python
	Evidence(
	content=abstract[:2000],
	citation=Citation(
	source="biorxiv", # or "medrxiv"
	title=title,
	url=f"https://doi.org/{doi}",
	date=date,
	authors=authors.split("; ")[:5]
	),
	relevance=0.75 # Preprints slightly lower than peer-reviewed
	)
	```

	---

	## 5. Implementation

	### 5.1 bioRxiv Tool (`src/tools/biorxiv.py`)

	```python
	"""bioRxiv/medRxiv preprint search tool."""

	import re
	from datetime import datetime, timedelta

	import httpx
	from tenacity import retry, stop_after_attempt, wait_exponential

	from src.utils.exceptions import SearchError
	from src.utils.models import Citation, Evidence


	class BioRxivTool:
	"""Search tool for bioRxiv and medRxiv preprints."""

	BASE_URL = "https://api.biorxiv.org/details"
	# Use medRxiv for medical/clinical content (more relevant for drug repurposing)
	DEFAULT_SERVER = "medrxiv"
	# Fetch papers from last N days
	DEFAULT_DAYS = 90

	def __init__(self, server: str = DEFAULT_SERVER, days: int = DEFAULT_DAYS):
	"""
	Initialize bioRxiv tool.

	Args:
	server: "biorxiv" or "medrxiv"
	days: How many days back to search
	"""
	self.server = server
	self.days = days

	@property
	def name(self) -> str:
	return "biorxiv"

	@retry(
	stop=stop_after_attempt(3),
	wait=wait_exponential(multiplier=1, min=1, max=10),
	reraise=True,
	)
	async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
	"""
	Search bioRxiv/medRxiv for preprints matching query.

	Note: bioRxiv API doesn't support keyword search directly.
	We fetch recent papers and filter client-side.

	Args:
	query: Search query (keywords)
	max_results: Maximum results to return

	Returns:
	List of Evidence objects from preprints
	"""
	# Build date range for last N days
	end_date = datetime.now().strftime("%Y-%m-%d")
	start_date = (datetime.now() - timedelta(days=self.days)).strftime("%Y-%m-%d")
	interval = f"{start_date}/{end_date}"

	# Fetch recent papers
	url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"

	async with httpx.AsyncClient(timeout=30.0) as client:
	try:
	response = await client.get(url)
	response.raise_for_status()
	except httpx.HTTPStatusError as e:
	raise SearchError(f"bioRxiv search failed: {e}") from e

	data = response.json()
	papers = data.get("collection", [])

	# Filter papers by query keywords
	query_terms = self._extract_terms(query)
	matching = self._filter_by_keywords(papers, query_terms, max_results)

	return [self._paper_to_evidence(paper) for paper in matching]

	def _extract_terms(self, query: str) -> list[str]:
	"""Extract search terms from query."""
	# Simple tokenization, lowercase
	terms = re.findall(r'\b\w+\b', query.lower())
	# Filter out common stop words
	stop_words = {'the', 'a', 'an', 'in', 'on', 'for', 'and', 'or', 'of', 'to'}
	return [t for t in terms if t not in stop_words and len(t) > 2]

	def _filter_by_keywords(
	self, papers: list[dict], terms: list[str], max_results: int
	) -> list[dict]:
	"""Filter papers that contain query terms in title or abstract."""
	scored_papers = []

	for paper in papers:
	title = paper.get("title", "").lower()
	abstract = paper.get("abstract", "").lower()
	text = f"{title} {abstract}"

	# Count matching terms
	matches = sum(1 for term in terms if term in text)

	if matches > 0:
	scored_papers.append((matches, paper))

	# Sort by match count (descending)
	scored_papers.sort(key=lambda x: x[0], reverse=True)

	return [paper for _, paper in scored_papers[:max_results]]

	def _paper_to_evidence(self, paper: dict) -> Evidence:
	"""Convert a preprint paper to Evidence."""
	doi = paper.get("doi", "")
	title = paper.get("title", "Untitled")
	authors_str = paper.get("authors", "Unknown")
	date = paper.get("date", "Unknown")
	abstract = paper.get("abstract", "No abstract available.")
	category = paper.get("category", "")

	# Parse authors (format: "Smith, J; Jones, A")
	authors = [a.strip() for a in authors_str.split(";")][:5]

	# Note this is a preprint in the content
	content = (
	f"[PREPRINT - Not peer-reviewed] "
	f"{abstract[:1800]}... "
	f"Category: {category}."
	)

	return Evidence(
	content=content[:2000],
	citation=Citation(
	source="biorxiv",
	title=title[:500],
	url=f"https://doi.org/{doi}" if doi else f"https://www.medrxiv.org/",
	date=date,
	authors=authors,
	),
	relevance=0.75, # Slightly lower than peer-reviewed
	)
	```

	---

	## 6. TDD Test Suite

	### 6.1 Unit Tests (`tests/unit/tools/test_biorxiv.py`)

	```python
	"""Unit tests for bioRxiv tool."""

	import pytest
	import respx
	from httpx import Response

	from src.tools.biorxiv import BioRxivTool
	from src.utils.models import Evidence


	@pytest.fixture
	def mock_biorxiv_response():
	"""Mock bioRxiv API response."""
	return {
	"collection": [
	{
	"doi": "10.1101/2024.01.15.24301234",
	"title": "Metformin repurposing for Alzheimer's disease: a systematic review",
	"authors": "Smith, John; Jones, Alice; Brown, Bob",
	"date": "2024-01-15",
	"category": "neurology",
	"abstract": "Background: Metformin has shown neuroprotective effects. "
	"We conducted a systematic review of metformin's potential "
	"for Alzheimer's disease treatment."
	},
	{
	"doi": "10.1101/2024.01.10.24301111",
	"title": "COVID-19 vaccine efficacy study",
	"authors": "Wilson, C",
	"date": "2024-01-10",
	"category": "infectious diseases",
	"abstract": "This study evaluates COVID-19 vaccine efficacy."
	}
	],
	"messages": [{"status": "ok", "count": 2}]
	}


	class TestBioRxivTool:
	"""Tests for BioRxivTool."""

	def test_tool_name(self):
	"""Tool should have correct name."""
	tool = BioRxivTool()
	assert tool.name == "biorxiv"

	def test_default_server_is_medrxiv(self):
	"""Default server should be medRxiv for medical relevance."""
	tool = BioRxivTool()
	assert tool.server == "medrxiv"

	@pytest.mark.asyncio
	@respx.mock
	async def test_search_returns_evidence(self, mock_biorxiv_response):
	"""Search should return Evidence objects."""
	respx.get(url__startswith="https://api.biorxiv.org/details").mock(
	return_value=Response(200, json=mock_biorxiv_response)
	)

	tool = BioRxivTool()
	results = await tool.search("metformin alzheimer", max_results=5)

	assert len(results) == 1 # Only the matching paper
	assert isinstance(results[0], Evidence)
	assert results[0].citation.source == "biorxiv"
	assert "metformin" in results[0].citation.title.lower()

	@pytest.mark.asyncio
	@respx.mock
	async def test_search_filters_by_keywords(self, mock_biorxiv_response):
	"""Search should filter papers by query keywords."""
	respx.get(url__startswith="https://api.biorxiv.org/details").mock(
	return_value=Response(200, json=mock_biorxiv_response)
	)

	tool = BioRxivTool()

	# Search for metformin - should match first paper
	results = await tool.search("metformin")
	assert len(results) == 1
	assert "metformin" in results[0].citation.title.lower()

	# Search for COVID - should match second paper
	results = await tool.search("covid vaccine")
	assert len(results) == 1
	assert "covid" in results[0].citation.title.lower()

	@pytest.mark.asyncio
	@respx.mock
	async def test_search_marks_as_preprint(self, mock_biorxiv_response):
	"""Evidence content should note it's a preprint."""
	respx.get(url__startswith="https://api.biorxiv.org/details").mock(
	return_value=Response(200, json=mock_biorxiv_response)
	)

	tool = BioRxivTool()
	results = await tool.search("metformin")

	assert "PREPRINT" in results[0].content
	assert "Not peer-reviewed" in results[0].content

	@pytest.mark.asyncio
	@respx.mock
	async def test_search_empty_results(self):
	"""Search should handle empty results gracefully."""
	respx.get(url__startswith="https://api.biorxiv.org/details").mock(
	return_value=Response(200, json={"collection": [], "messages": []})
	)

	tool = BioRxivTool()
	results = await tool.search("xyznonexistent")

	assert results == []

	@pytest.mark.asyncio
	@respx.mock
	async def test_search_api_error(self):
	"""Search should raise SearchError on API failure."""
	from src.utils.exceptions import SearchError

	respx.get(url__startswith="https://api.biorxiv.org/details").mock(
	return_value=Response(500, text="Internal Server Error")
	)

	tool = BioRxivTool()

	with pytest.raises(SearchError):
	await tool.search("metformin")

	def test_extract_terms(self):
	"""Should extract meaningful search terms."""
	tool = BioRxivTool()

	terms = tool._extract_terms("metformin for Alzheimer's disease")

	assert "metformin" in terms
	assert "alzheimer" in terms
	assert "disease" in terms
	assert "for" not in terms # Stop word
	assert "the" not in terms # Stop word


	class TestBioRxivIntegration:
	"""Integration tests (marked for separate run)."""

	@pytest.mark.integration
	@pytest.mark.asyncio
	async def test_real_api_call(self):
	"""Test actual API call (requires network)."""
	tool = BioRxivTool(days=30) # Last 30 days
	results = await tool.search("diabetes", max_results=3)

	# May or may not find results depending on recent papers
	assert isinstance(results, list)
	for r in results:
	assert isinstance(r, Evidence)
	assert r.citation.source == "biorxiv"
	```

	---

	## 7. Integration with SearchHandler

	### 7.1 Final SearchHandler Configuration

	```python
	# examples/search_demo/run_search.py
	from src.tools.biorxiv import BioRxivTool
	from src.tools.clinicaltrials import ClinicalTrialsTool
	from src.tools.pubmed import PubMedTool
	from src.tools.search_handler import SearchHandler

	search_handler = SearchHandler(
	tools=[
	PubMedTool(), # Peer-reviewed papers
	ClinicalTrialsTool(), # Clinical trials
	BioRxivTool(), # Preprints (cutting edge)
	],
	timeout=30.0
	)
	```

	### 7.2 Final Type Definition

	```python
	# src/utils/models.py
	sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
	```

	---

	## 8. Definition of Done

	Phase 11 is COMPLETE when:

	- [ ] `src/tools/biorxiv.py` implemented
	- [ ] Unit tests in `tests/unit/tools/test_biorxiv.py`
	- [ ] Integration test marked with `@pytest.mark.integration`
	- [ ] SearchHandler updated to include BioRxivTool
	- [ ] Type definitions updated in models.py
	- [ ] Example files updated
	- [ ] All unit tests pass
	- [ ] Lints pass
	- [ ] Manual verification with real API

	---

	## 9. Verification Commands

	```bash
	# 1. Run unit tests
	uv run pytest tests/unit/tools/test_biorxiv.py -v

	# 2. Run integration test (requires network)
	uv run pytest tests/unit/tools/test_biorxiv.py -v -m integration

	# 3. Run full test suite
	uv run pytest tests/unit/ -v

	# 4. Run example with all three sources
	source .env && uv run python examples/search_demo/run_search.py "metformin diabetes"
	# Should show results from PubMed, ClinicalTrials.gov, AND bioRxiv/medRxiv
	```

	---

	## 10. Value Delivered

	\| Before \| After \|
	\|--------\|-------\|
	\| Only published papers \| Published + Preprints \|
	\| 6-18 month lag \| Near real-time research \|
	\| Miss cutting-edge \| Catch breakthroughs early \|

	Demo pitch (final):
	> "DeepCritical searches PubMed for peer-reviewed evidence, ClinicalTrials.gov for 400,000+ clinical trials, and bioRxiv/medRxiv for cutting-edge preprints - then uses LLMs to generate mechanistic hypotheses and synthesize findings into publication-quality reports."

	---

	## 11. Complete Source Architecture (After Phase 11)

	```
	User Query: "Can metformin treat Alzheimer's?"
	\|
	v
	SearchHandler
	\|
	┌───────────────┼───────────────┐
	\| \| \|
	v v v
	PubMedTool ClinicalTrials BioRxivTool
	\| Tool \|
	\| \| \|
	v v v
	"15 peer- "3 Phase II "2 preprints
	reviewed trials from last
	papers" recruiting" 90 days"
	\| \| \|
	└───────────────┼───────────────┘
	\|
	v
	Evidence Pool
	\|
	v
	EmbeddingService.deduplicate()
	\|
	v
	HypothesisAgent → JudgeAgent → ReportAgent
	\|
	v
	Structured Research Report
	```

	This is the Gucci Banger stack.