# Phase 11 Implementation Spec: bioRxiv Preprint Integration

**Goal**: Add cutting-edge preprint search for the latest research.

**Philosophy**: "Preprints are where breakthroughs appear first."

**Prerequisite**: Phase 10 complete (ClinicalTrials.gov working)

**Estimated Time**: 2-3 hours

---

## 1. Why bioRxiv?

### Scientific Value

| Feature | Value for Drug Repurposing |
|---------|---------------------------|
| **Cutting-edge research** | 6-12 months ahead of PubMed |
| **Rapid publication** | Days, not months |
| **Free full-text** | Complete papers, not just abstracts |
| **medRxiv included** | Medical preprints via the same API |
| **No API key required** | Free and open |

### The Preprint Advantage

```
Traditional Publication Timeline:
Research → Submit → Review → Revise → Accept → Publish
|___________________________ 6-18 months _______________|

Preprint Timeline:
Research → Upload → Available
|______ 1-3 days ______|
```

**For drug repurposing**: Preprints contain the newest hypotheses and evidence!

---

## 2. API Specification

### Endpoint

```
Base URL: https://api.biorxiv.org/details/[server]/[interval]/[cursor]/[format]
```

### Servers

| Server | Content |
|--------|---------|
| `biorxiv` | Biology preprints |
| `medrxiv` | Medical preprints (more relevant for us!) |

### Interval Formats

| Format | Example | Description |
|--------|---------|-------------|
| Date range | `2024-01-01/2024-12-31` | Papers between dates |
| Recent N | `50` | Most recent N papers |
| Recent N days | `30d` | Papers from last N days |
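
Putting the pieces together, a request for all medRxiv postings in a date window looks like this (the `0` is the pagination cursor and `json` the response format):

```
https://api.biorxiv.org/details/medrxiv/2024-01-01/2024-03-31/0/json
```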
### Response Format

```json
{
  "collection": [
    {
      "doi": "10.1101/2024.01.15.123456",
      "title": "Metformin repurposing for neurodegeneration",
      "authors": "Smith, J; Jones, A",
      "date": "2024-01-15",
      "category": "neuroscience",
      "abstract": "We investigated metformin's potential..."
    }
  ],
  "messages": [{"status": "ok", "count": 100}]
}
```

### Rate Limits

- No official limit, but be respectful
- Results paginated (100 per call)
- Use cursor for pagination
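
When a query window returns more than one page, the cursor advances through the result set. A minimal paging sketch; the step size of 100 follows the "100 per call" note above, and the exact cursor semantics should be treated as an assumption to verify against the live API:

```python
# Sketch: page through a date window 100 records at a time.
# Assumes the cursor is a record offset advancing in steps of 100.
import httpx


async def fetch_all(server: str, interval: str) -> list[dict]:
    papers: list[dict] = []
    cursor = 0
    async with httpx.AsyncClient(timeout=30.0) as client:
        while True:
            url = f"https://api.biorxiv.org/details/{server}/{interval}/{cursor}/json"
            resp = await client.get(url)
            resp.raise_for_status()
            batch = resp.json().get("collection", [])
            papers.extend(batch)
            if len(batch) < 100:  # last (partial) page
                break
            cursor += 100
    return papers
```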
### Documentation

- [bioRxiv API](https://api.biorxiv.org/)
- [medrxivr R package docs](https://docs.ropensci.org/medrxivr/)

---

## 3. Search Strategy

### Challenge: bioRxiv API Limitations

The bioRxiv API does NOT support keyword search directly. It returns papers by:

- Date range
- Recent count

### Solution: Client-Side Filtering

```python
# Strategy:
# 1. Fetch recent papers (e.g., last 90 days)
# 2. Filter by keyword matching in title/abstract
# 3. Use embeddings for semantic matching (leverage Phase 6!) - see sketch below
```
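
Step 3 is optional for this phase. If the Phase 6 embedding service is available, keyword-filtered papers can be re-ranked by semantic similarity to the query. A rough sketch; the `embed(texts) -> list[list[float]]` method is an assumed interface, not a fixed API:

```python
# Hypothetical re-ranking pass on top of keyword filtering.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0


def rerank_by_similarity(query: str, papers: list[dict], embedder) -> list[dict]:
    """Order papers by semantic similarity of title+abstract to the query."""
    texts = [f"{p.get('title', '')} {p.get('abstract', '')}" for p in papers]
    vectors = embedder.embed([query] + texts)  # assumed EmbeddingService method
    query_vec, paper_vecs = vectors[0], vectors[1:]
    scored = sorted(
        zip(papers, paper_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [paper for paper, _ in scored]
```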
### Alternative: Content Search Endpoint

```
https://api.biorxiv.org/pubs/[server]/[doi_prefix]
```

If needed, this publisher (`pubs`) endpoint can be combined with the same client-side filtering.

---
## 4. Data Model

### 4.1 Update Citation Source Type (`src/utils/models.py`)

```python
# After Phase 11
source: Literal["pubmed", "clinicaltrials", "biorxiv"]
```

### 4.2 Evidence from Preprints

```python
Evidence(
    content=abstract[:2000],
    citation=Citation(
        source="biorxiv",  # medRxiv results also use this source tag
        title=title,
        url=f"https://doi.org/{doi}",
        date=date,
        authors=authors.split("; ")[:5],
    ),
    relevance=0.75,  # Preprints slightly lower than peer-reviewed
)
```

---
## 5. Implementation

### 5.1 bioRxiv Tool (`src/tools/biorxiv.py`)

```python
"""bioRxiv/medRxiv preprint search tool."""
import re
from datetime import datetime, timedelta

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class BioRxivTool:
    """Search tool for bioRxiv and medRxiv preprints."""

    BASE_URL = "https://api.biorxiv.org/details"

    # Use medRxiv for medical/clinical content (more relevant for drug repurposing)
    DEFAULT_SERVER = "medrxiv"

    # Fetch papers from the last N days
    DEFAULT_DAYS = 90

    def __init__(self, server: str = DEFAULT_SERVER, days: int = DEFAULT_DAYS):
        """
        Initialize bioRxiv tool.

        Args:
            server: "biorxiv" or "medrxiv"
            days: How many days back to search
        """
        self.server = server
        self.days = days

    @property
    def name(self) -> str:
        return "biorxiv"

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,
    )
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """
        Search bioRxiv/medRxiv for preprints matching query.

        Note: the bioRxiv API doesn't support keyword search directly.
        We fetch recent papers and filter client-side.

        Args:
            query: Search query (keywords)
            max_results: Maximum results to return

        Returns:
            List of Evidence objects from preprints
        """
        # Build date range for the last N days
        end_date = datetime.now().strftime("%Y-%m-%d")
        start_date = (datetime.now() - timedelta(days=self.days)).strftime("%Y-%m-%d")
        interval = f"{start_date}/{end_date}"

        # Fetch recent papers
        url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"

        async with httpx.AsyncClient(timeout=30.0) as client:
            try:
                response = await client.get(url)
                response.raise_for_status()
            except httpx.HTTPError as e:  # covers status and transport errors
                raise SearchError(f"bioRxiv search failed: {e}") from e

            data = response.json()
            papers = data.get("collection", [])

        # Filter papers by query keywords
        query_terms = self._extract_terms(query)
        matching = self._filter_by_keywords(papers, query_terms, max_results)

        return [self._paper_to_evidence(paper) for paper in matching]

    def _extract_terms(self, query: str) -> list[str]:
        """Extract search terms from query."""
        # Simple tokenization, lowercase
        terms = re.findall(r"\b\w+\b", query.lower())
        # Filter out common stop words and very short tokens
        stop_words = {"the", "a", "an", "in", "on", "for", "and", "or", "of", "to"}
        return [t for t in terms if t not in stop_words and len(t) > 2]

    def _filter_by_keywords(
        self, papers: list[dict], terms: list[str], max_results: int
    ) -> list[dict]:
        """Filter papers that contain query terms in title or abstract."""
        scored_papers = []
        for paper in papers:
            title = paper.get("title", "").lower()
            abstract = paper.get("abstract", "").lower()
            text = f"{title} {abstract}"
            # Count matching terms
            matches = sum(1 for term in terms if term in text)
            if matches > 0:
                scored_papers.append((matches, paper))

        # Sort by match count (descending)
        scored_papers.sort(key=lambda x: x[0], reverse=True)
        return [paper for _, paper in scored_papers[:max_results]]

    def _paper_to_evidence(self, paper: dict) -> Evidence:
        """Convert a preprint paper to Evidence."""
        doi = paper.get("doi", "")
        title = paper.get("title", "Untitled")
        authors_str = paper.get("authors", "Unknown")
        date = paper.get("date", "Unknown")
        abstract = paper.get("abstract", "No abstract available.")
        category = paper.get("category", "")

        # Parse authors (format: "Smith, J; Jones, A")
        authors = [a.strip() for a in authors_str.split(";")][:5]

        # Flag preprint status in the content itself
        content = (
            f"[PREPRINT - Not peer-reviewed] "
            f"{abstract[:1800]}... "
            f"Category: {category}."
        )

        return Evidence(
            content=content[:2000],
            citation=Citation(
                source="biorxiv",
                title=title[:500],
                url=f"https://doi.org/{doi}" if doi else f"https://www.{self.server}.org/",
                date=date,
                authors=authors,
            ),
            relevance=0.75,  # Slightly lower than peer-reviewed
        )
```
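
For a quick manual check before wiring the tool into the SearchHandler, it can be exercised on its own. A throwaway snippet (not part of the phase deliverables), using only the class defined above:

```python
# Ad-hoc smoke test: save as scratch.py and run `uv run python scratch.py`
import asyncio

from src.tools.biorxiv import BioRxivTool


async def main() -> None:
    tool = BioRxivTool(days=30)
    results = await tool.search("metformin alzheimer", max_results=3)
    for ev in results:
        print(ev.citation.title, ev.citation.url)


if __name__ == "__main__":
    asyncio.run(main())
```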
---

## 6. TDD Test Suite

### 6.1 Unit Tests (`tests/unit/tools/test_biorxiv.py`)

```python
"""Unit tests for bioRxiv tool."""
import pytest
import respx
from httpx import Response

from src.tools.biorxiv import BioRxivTool
from src.utils.models import Evidence


@pytest.fixture
def mock_biorxiv_response():
    """Mock bioRxiv API response."""
    return {
        "collection": [
            {
                "doi": "10.1101/2024.01.15.24301234",
                "title": "Metformin repurposing for Alzheimer's disease: a systematic review",
                "authors": "Smith, John; Jones, Alice; Brown, Bob",
                "date": "2024-01-15",
                "category": "neurology",
                "abstract": "Background: Metformin has shown neuroprotective effects. "
                "We conducted a systematic review of metformin's potential "
                "for Alzheimer's disease treatment.",
            },
            {
                "doi": "10.1101/2024.01.10.24301111",
                "title": "COVID-19 vaccine efficacy study",
                "authors": "Wilson, C",
                "date": "2024-01-10",
                "category": "infectious diseases",
                "abstract": "This study evaluates COVID-19 vaccine efficacy.",
            },
        ],
        "messages": [{"status": "ok", "count": 2}],
    }


class TestBioRxivTool:
    """Tests for BioRxivTool."""

    def test_tool_name(self):
        """Tool should have correct name."""
        tool = BioRxivTool()
        assert tool.name == "biorxiv"

    def test_default_server_is_medrxiv(self):
        """Default server should be medRxiv for medical relevance."""
        tool = BioRxivTool()
        assert tool.server == "medrxiv"

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_returns_evidence(self, mock_biorxiv_response):
        """Search should return Evidence objects."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json=mock_biorxiv_response)
        )

        tool = BioRxivTool()
        results = await tool.search("metformin alzheimer", max_results=5)

        assert len(results) == 1  # Only the matching paper
        assert isinstance(results[0], Evidence)
        assert results[0].citation.source == "biorxiv"
        assert "metformin" in results[0].citation.title.lower()

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_filters_by_keywords(self, mock_biorxiv_response):
        """Search should filter papers by query keywords."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json=mock_biorxiv_response)
        )

        tool = BioRxivTool()

        # Search for metformin - should match first paper
        results = await tool.search("metformin")
        assert len(results) == 1
        assert "metformin" in results[0].citation.title.lower()

        # Search for COVID - should match second paper
        results = await tool.search("covid vaccine")
        assert len(results) == 1
        assert "covid" in results[0].citation.title.lower()

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_marks_as_preprint(self, mock_biorxiv_response):
        """Evidence content should note it's a preprint."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json=mock_biorxiv_response)
        )

        tool = BioRxivTool()
        results = await tool.search("metformin")

        assert "PREPRINT" in results[0].content
        assert "Not peer-reviewed" in results[0].content

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_empty_results(self):
        """Search should handle empty results gracefully."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json={"collection": [], "messages": []})
        )

        tool = BioRxivTool()
        results = await tool.search("xyznonexistent")

        assert results == []

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_api_error(self):
        """Search should raise SearchError on API failure."""
        from src.utils.exceptions import SearchError

        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(500, text="Internal Server Error")
        )

        tool = BioRxivTool()
        with pytest.raises(SearchError):
            await tool.search("metformin")

    def test_extract_terms(self):
        """Should extract meaningful search terms."""
        tool = BioRxivTool()
        terms = tool._extract_terms("metformin for Alzheimer's disease")

        assert "metformin" in terms
        assert "alzheimer" in terms
        assert "disease" in terms
        assert "for" not in terms  # Stop word
        assert "the" not in terms  # Stop word


class TestBioRxivIntegration:
    """Integration tests (marked for separate run)."""

    @pytest.mark.integration
    @pytest.mark.asyncio
    async def test_real_api_call(self):
        """Test actual API call (requires network)."""
        tool = BioRxivTool(days=30)  # Last 30 days
        results = await tool.search("diabetes", max_results=3)

        # May or may not find results depending on recent papers
        assert isinstance(results, list)
        for r in results:
            assert isinstance(r, Evidence)
            assert r.citation.source == "biorxiv"
```
---

## 7. Integration with SearchHandler

### 7.1 Final SearchHandler Configuration

```python
# examples/search_demo/run_search.py
from src.tools.biorxiv import BioRxivTool
from src.tools.clinicaltrials import ClinicalTrialsTool
from src.tools.pubmed import PubMedTool
from src.tools.search_handler import SearchHandler

search_handler = SearchHandler(
    tools=[
        PubMedTool(),          # Peer-reviewed papers
        ClinicalTrialsTool(),  # Clinical trials
        BioRxivTool(),         # Preprints (cutting edge)
    ],
    timeout=30.0,
)
```

### 7.2 Final Type Definition

```python
# src/utils/models.py
sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
```
---

## 8. Definition of Done

Phase 11 is **COMPLETE** when:

- [ ] `src/tools/biorxiv.py` implemented
- [ ] Unit tests in `tests/unit/tools/test_biorxiv.py`
- [ ] Integration test marked with `@pytest.mark.integration`
- [ ] SearchHandler updated to include BioRxivTool
- [ ] Type definitions updated in models.py
- [ ] Example files updated
- [ ] All unit tests pass
- [ ] Lints pass
- [ ] Manual verification with real API

---

## 9. Verification Commands

```bash
# 1. Run unit tests
uv run pytest tests/unit/tools/test_biorxiv.py -v

# 2. Run integration test (requires network)
uv run pytest tests/unit/tools/test_biorxiv.py -v -m integration

# 3. Run full test suite
uv run pytest tests/unit/ -v

# 4. Run example with all three sources
source .env && uv run python examples/search_demo/run_search.py "metformin diabetes"
# Should show results from PubMed, ClinicalTrials.gov, AND bioRxiv/medRxiv
```
---

## 10. Value Delivered

| Before | After |
|--------|-------|
| Only published papers | Published + Preprints |
| 6-18 month lag | Near real-time research |
| Miss cutting-edge | Catch breakthroughs early |

**Demo pitch (final)**:

> "DeepCritical searches PubMed for peer-reviewed evidence, ClinicalTrials.gov for 400,000+ clinical trials, and bioRxiv/medRxiv for cutting-edge preprints - then uses LLMs to generate mechanistic hypotheses and synthesize findings into publication-quality reports."

---

## 11. Complete Source Architecture (After Phase 11)

```
User Query: "Can metformin treat Alzheimer's?"
                     |
                     v
               SearchHandler
                     |
     ┌───────────────┼───────────────┐
     |               |               |
     v               v               v
PubMedTool   ClinicalTrialsTool  BioRxivTool
     |               |               |
     v               v               v
"15 peer-     "3 Phase II      "2 preprints
 reviewed      trials           from last
 papers"       recruiting"      90 days"
     |               |               |
     └───────────────┼───────────────┘
                     |
                     v
               Evidence Pool
                     |
                     v
     EmbeddingService.deduplicate()
                     |
                     v
HypothesisAgent → JudgeAgent → ReportAgent
                     |
                     v
       Structured Research Report
```

**This is the Gucci Banger stack.**