Commit 9286db5 · Parent: be7e1a2

feat: add roadmap summary and detailed improvement plans for data sources

- Introduced new documentation files outlining the current state and future improvements for DeepCritical's data sources: PubMed, ClinicalTrials.gov, Europe PMC, and OpenAlex.
- Each document includes sections on current implementation, strengths, limitations, recommended improvements, and integration opportunities.
- Added a comprehensive roadmap summary to guide future maintainers and enhance project maintainability.

Files changed:
- docs/brainstorming/00_ROADMAP_SUMMARY.md +194 -0
- docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md +193 -0
- docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md +211 -0
- docs/brainstorming/04_OPENALEX_INTEGRATION.md +303 -0
- docs/brainstorming/implementation/15_PHASE_OPENALEX.md +603 -0
- docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md +586 -0
- docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md +540 -0
- docs/brainstorming/implementation/README.md +143 -0
- src/tools/__init__.py +10 -2
- src/utils/models.py +5 -1
docs/brainstorming/00_ROADMAP_SUMMARY.md
ADDED
@@ -0,0 +1,194 @@
# DeepCritical Data Sources: Roadmap Summary

**Created**: 2024-11-27
**Purpose**: Future maintainability and hackathon continuation

---

## Current State

### Working Tools

| Tool | Status | Data Quality |
|------|--------|--------------|
| PubMed | ✅ Works | Good (abstracts only) |
| ClinicalTrials.gov | ✅ Works | Good (filtered for interventional) |
| Europe PMC | ✅ Works | Good (includes preprints) |

### Removed Tools

| Tool | Status | Reason |
|------|--------|--------|
| bioRxiv | ❌ Removed | No search API - only date/DOI lookup |

---

## Priority Improvements

### P0: Critical (Do First)

1. **Add Rate Limiting to PubMed**
   - NCBI will block us without it
   - Use the `limits` library (see reference repo)
   - 3 requests/sec without a key, 10/sec with a key
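Until the `limits` dependency is wired in, the policy those bullets describe can be sketched with a stdlib sliding-window limiter; the class and method names below are illustrative, not from the codebase:

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_hits` calls per `window` seconds (hypothetical helper)."""

    def __init__(self, max_hits: int, window: float = 1.0, clock=time.monotonic):
        self.max_hits = max_hits
        self.window = window
        self.clock = clock  # injectable for testing
        self.hits: deque = deque()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have fallen out of the window
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()
        if len(self.hits) < self.max_hits:
            self.hits.append(now)
            return True
        return False


# 3 requests/sec without an NCBI key, 10/sec with one
pubmed_limiter = SlidingWindowLimiter(max_hits=3, window=1.0)
```

With `limits` itself, the same policy would be expressed declaratively (e.g. a `3/second` rate plus a moving-window strategy); the sketch just makes the windowing explicit.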
### P1: High Value, Medium Effort

2. **Add OpenAlex as 4th Source**
   - Citation network (huge for drug repurposing)
   - Concept tagging (semantic discovery)
   - Already implemented in reference repo
   - Free, no API key

3. **PubMed Full-Text via BioC**
   - Get full paper text for PMC papers
   - Already in reference repo

### P2: Nice to Have

4. **ClinicalTrials.gov Results**
   - Get efficacy data from completed trials
   - Requires more complex API calls

5. **Europe PMC Annotations**
   - Text-mined entities (genes, drugs, diseases)
   - Automatic entity extraction

---

## Effort Estimates

| Improvement | Effort | Impact | Priority |
|-------------|--------|--------|----------|
| PubMed rate limiting | 1 hour | Stability | P0 |
| OpenAlex basic search | 2 hours | High | P1 |
| OpenAlex citations | 2 hours | Very High | P1 |
| PubMed full-text | 3 hours | Medium | P1 |
| CT.gov results | 4 hours | Medium | P2 |
| Europe PMC annotations | 3 hours | Medium | P2 |

---

## Architecture Decision

### Option A: Keep Current + Add OpenAlex

```
User Query
     │
┌────────────────────┼────────────────────┐
│                    │                    │
PubMed        ClinicalTrials        Europe PMC
(abstracts)   (trials only)         (preprints)
│                    │                    │
└────────────────────┼────────────────────┘
                     │
                 OpenAlex  ← NEW
            (citations, concepts)
                     │
                Orchestrator
                     │
                  Report
```

**Pros**: Low risk, additive
**Cons**: More complexity, some overlap

### Option B: OpenAlex as Primary

```
User Query
     │
┌────────────────────┼────────────────────┐
│                    │                    │
OpenAlex      ClinicalTrials        Europe PMC
(primary      (trials only)         (full-text
 search)                             fallback)
│                    │                    │
└────────────────────┼────────────────────┘
                     │
                Orchestrator
                     │
                  Report
```

**Pros**: Simpler, citation network built-in
**Cons**: Lose some PubMed-specific features

### Recommendation: Option A

Keep the current architecture working, add OpenAlex incrementally.

---

## Quick Wins (Can Do Today)

1. **Add `limits` to `pyproject.toml`**

   ```toml
   dependencies = [
       "limits>=3.0",
   ]
   ```

2. **Copy the OpenAlex tool from the reference repo**
   - File: `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`
   - Adapt it to our `SearchTool` base class

3. **Enable an NCBI API Key**
   - Add to `.env`: `NCBI_API_KEY=your_key`
   - 10x rate limit improvement
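Once the key is in `.env`, every E-utilities request just needs an extra `api_key` query parameter; a minimal sketch of building a PubMed esearch URL (the helper name is ours, not from the codebase):

```python
import os
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, retmax: int = 20) -> str:
    """Build a PubMed esearch URL, attaching NCBI_API_KEY when present."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    api_key = os.environ.get("NCBI_API_KEY")
    if api_key:
        params["api_key"] = api_key  # unlocks the higher (10 req/s) tier
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"
```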

---

## External Resources Worth Exploring

### Python Libraries

| Library | For | Notes |
|---------|-----|-------|
| `limits` | Rate limiting | Used by reference repo |
| `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) |
| `metapub` | PubMed | Full-featured |
| `sentence-transformers` | Semantic search | For embeddings |

### APIs Not Yet Used

| API | Provides | Effort |
|-----|----------|--------|
| RxNorm | Drug name normalization | Low |
| DrugBank | Drug targets/mechanisms | Medium (license) |
| UniProt | Protein data | Medium |
| ChEMBL | Bioactivity data | Medium |

### RAG Tools (Future)

| Tool | Purpose |
|------|---------|
| [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers |
| [txtai](https://github.com/neuml/txtai) | Embeddings + search |
| [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings |

---

## Files in This Directory

| File | Contents |
|------|----------|
| `00_ROADMAP_SUMMARY.md` | This file |
| `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
| `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
| `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
| `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |

---

## For Future Maintainers

If you're picking this up after the hackathon:

1. **Start with OpenAlex** - biggest bang for buck
2. **Add rate limiting** - prevents API blocks
3. **Don't bother with bioRxiv** - use Europe PMC instead
4. **Reference repo is gold** - `reference_repos/DeepCritical/` has working implementations

Good luck!
docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md
ADDED
@@ -0,0 +1,193 @@
# ClinicalTrials.gov Tool: Current State & Future Improvements

**Status**: Currently Implemented
**Priority**: High (Core Data Source for Drug Repurposing)

---

## Current Implementation

### What We Have (`src/tools/clinicaltrials.py`)

- V2 API search via `clinicaltrials.gov/api/v2/studies`
- Filters: `INTERVENTIONAL` study type, `RECRUITING` status
- Returns: NCT ID, title, conditions, interventions, phase, status
- Query preprocessing via shared `query_utils.py`

### Current Strengths

1. **Good Filtering**: Already filtering for interventional + recruiting
2. **V2 API**: Using the modern API (v1 is deprecated)
3. **Phase Info**: Extracting trial phases for drug development context

### Current Limitations

1. **No Outcome Data**: Missing primary/secondary outcomes
2. **No Eligibility Criteria**: Missing inclusion/exclusion details
3. **No Sponsor Info**: Missing who's running the trial
4. **No Results Data**: For completed trials, no efficacy data
5. **Limited Drug Mapping**: No integration with drug databases

---

## API Capabilities We're Not Using

### Fields We Could Request

```python
# Current fields
fields = ["NCTId", "BriefTitle", "Condition", "InterventionName", "Phase", "OverallStatus"]

# Additional valuable fields
additional_fields = [
    "PrimaryOutcomeMeasure",    # What are they measuring?
    "SecondaryOutcomeMeasure",  # Secondary endpoints
    "EligibilityCriteria",      # Who can participate?
    "LeadSponsorName",          # Who's funding?
    "ResultsFirstPostDate",     # Has results?
    "StudyFirstPostDate",       # When started?
    "CompletionDate",           # When finished?
    "EnrollmentCount",          # Sample size
    "InterventionDescription",  # Drug details
    "ArmGroupLabel",            # Treatment arms
    "InterventionOtherName",    # Drug aliases
]
```

### Filter Enhancements

```python
# Current
aggFilters = "studyType:INTERVENTIONAL,status:RECRUITING"

# Could add
"status:RECRUITING,ACTIVE_NOT_RECRUITING,COMPLETED"  # Include completed for results
"phase:PHASE2,PHASE3"                                # Only later-stage trials
"resultsFirstPostDateRange:2020-01-01_"              # Trials with posted results
```

---

## Recommended Improvements

### Phase 1: Richer Metadata

```python
EXTENDED_FIELDS = [
    "NCTId",
    "BriefTitle",
    "OfficialTitle",
    "Condition",
    "InterventionName",
    "InterventionDescription",
    "InterventionOtherName",  # Drug synonyms!
    "Phase",
    "OverallStatus",
    "PrimaryOutcomeMeasure",
    "EnrollmentCount",
    "LeadSponsorName",
    "StudyFirstPostDate",
]
```

### Phase 2: Results Retrieval

For completed trials, we can get actual efficacy data:

```python
async def get_trial_results(nct_id: str) -> dict | None:
    """Fetch results for completed trials."""
    url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
    params = {
        "fields": "ResultsSection",
    }
    # Returns outcome measures and statistics
```

### Phase 3: Drug Name Normalization

Map intervention names to standard identifiers:

```python
# Problem: "Metformin", "Metformin HCl", and "Glucophage" are the same drug
# Solution: use RxNorm or DrugBank for normalization

async def normalize_drug_name(intervention: str) -> str:
    """Normalize a drug name via the RxNorm API."""
    url = f"https://rxnav.nlm.nih.gov/REST/rxcui.json?name={intervention}"
    # Returns a standardized RxCUI
```
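To sketch what the normalization step would consume: `findRxcuiByString` responds with an `idGroup` object whose `rxnormId` list can be absent for unknown names. The sample payload and helper below are illustrative, not from the codebase:

```python
def extract_rxcui(payload: dict):
    """Pull the first RxCUI out of an RxNorm /rxcui.json response, or None."""
    ids = payload.get("idGroup", {}).get("rxnormId", [])
    return ids[0] if ids else None


# Shape of a typical response for "metformin"
sample = {"idGroup": {"name": "metformin", "rxnormId": ["6809"]}}
```

Normalizing both sides (trial interventions and query drug names) to RxCUIs before comparison avoids missing trials that use a brand name.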

---

## Integration Opportunities

### With PubMed

Cross-reference trials with publications:

```python
# ClinicalTrials.gov provides PMID links,
# so trial results can be correlated with published papers
```

### With DrugBank/ChEMBL

Map interventions to:
- Mechanism of action
- Known targets
- Adverse effects
- Drug-drug interactions

---

## Python Libraries to Consider

| Library | Purpose | Notes |
|---------|---------|-------|
| [pytrials](https://pypi.org/project/pytrials/) | CT.gov wrapper | V2 API support unclear |
| [clinicaltrials](https://github.com/ebmdatalab/clinicaltrials-act-tracker) | Data tracking | More for analysis |
| [drugbank-downloader](https://pypi.org/project/drugbank-downloader/) | Drug mapping | Requires license |

---

## API Quirks & Gotchas

1. **Rate Limiting**: Undocumented; be conservative
2. **Pagination**: Max 1000 results per request
3. **Field Names**: Case-sensitive, camelCase
4. **Empty Results**: Some fields may be null even if requested
5. **Status Changes**: Trials change status frequently

---

## Example Enhanced Query

```python
async def search_drug_repurposing_trials(
    drug_name: str,
    condition: str,
    include_completed: bool = True,
) -> list[Evidence]:
    """Search for trials repurposing a drug for a new condition."""

    statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"]
    if include_completed:
        statuses.append("COMPLETED")

    params = {
        "query.intr": drug_name,
        "query.cond": condition,
        "filter.overallStatus": ",".join(statuses),
        "filter.studyType": "INTERVENTIONAL",
        "fields": ",".join(EXTENDED_FIELDS),
        "pageSize": 50,
    }
```
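The function above stops at building `params`; completing it is mostly a matter of URL-encoding them against the v2 endpoint. A synchronous sketch, with a trimmed `EXTENDED_FIELDS` inlined for self-containment (parameter names follow the example above):

```python
from urllib.parse import urlencode

EXTENDED_FIELDS = ["NCTId", "BriefTitle", "Condition", "InterventionName", "Phase"]


def build_studies_url(drug_name: str, condition: str, include_completed: bool = True) -> str:
    """Assemble a ClinicalTrials.gov v2 /studies request URL."""
    statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"]
    if include_completed:
        statuses.append("COMPLETED")
    params = {
        "query.intr": drug_name,
        "query.cond": condition,
        "filter.overallStatus": ",".join(statuses),
        "fields": ",".join(EXTENDED_FIELDS),
        "pageSize": 50,
    }
    return f"https://clinicaltrials.gov/api/v2/studies?{urlencode(params)}"
```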

---

## Sources

- [ClinicalTrials.gov API Documentation](https://clinicaltrials.gov/data-api/api)
- [CT.gov Field Definitions](https://clinicaltrials.gov/data-api/about-api/study-data-structure)
- [RxNorm API](https://lhncbc.nlm.nih.gov/RxNav/APIs/api-RxNorm.findRxcuiByString.html)
docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md
ADDED
@@ -0,0 +1,211 @@
# Europe PMC Tool: Current State & Future Improvements

**Status**: Currently Implemented (Replaced bioRxiv)
**Priority**: High (Preprint + Open Access Source)

---

## Why Europe PMC Over bioRxiv?

### bioRxiv API Limitations (Why We Abandoned It)

1. **No Search API**: Only returns papers by date range or DOI
2. **No Query Capability**: Cannot search for "metformin cancer"
3. **Workaround Required**: Would need to download ALL preprints and build local search
4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation

### Europe PMC Advantages

1. **Full Search API**: Boolean queries, filters, facets
2. **Aggregates bioRxiv**: Includes bioRxiv and medRxiv content anyway
3. **Includes PubMed**: Also has MEDLINE content
4. **34 Preprint Servers**: Not just bioRxiv
5. **Open Access Focus**: Full text when available

---

## Current Implementation

### What We Have (`src/tools/europepmc.py`)

- REST API search via `europepmc.org/webservices/rest/search`
- Preprint flagging via `firstPublicationDate` heuristics
- Returns: title, abstract, authors, DOI, source
- Marks preprints for transparency

### Current Limitations

1. **No Full-Text Retrieval**: Only metadata/abstracts
2. **No Citation Network**: Missing references/citations
3. **No Supplementary Files**: Not fetching figures/data
4. **Basic Preprint Detection**: Heuristic, not an explicit flag

---

## Europe PMC API Capabilities

### Endpoints We Could Use

| Endpoint | Purpose | Currently Using |
|----------|---------|-----------------|
| `/search` | Query papers | Yes |
| `/fulltext/{ID}` | Full text (XML/JSON) | No |
| `/{PMCID}/supplementaryFiles` | Figures, data | No |
| `/citations/{ID}` | Who cited this | No |
| `/references/{ID}` | What this cites | No |
| `/annotations` | Text-mined entities | No |

### Rich Query Syntax

```python
# Current simple query
query = "metformin cancer"

# Could use advanced syntax
query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
query += " AND (SRC:PPR)"  # Only preprints
query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])"  # Date range
query += " AND (OPEN_ACCESS:y)"  # Only open access
```
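Concatenating clauses by hand gets error-prone; a small helper keeps the AND-joining and parenthesization consistent. The helper name and argument set are ours, invented for illustration:

```python
def build_epmc_query(
    term: str,
    preprints_only: bool = False,
    open_access_only: bool = False,
    date_range=None,
) -> str:
    """Compose a Europe PMC advanced query from individual clauses."""
    clauses = [f"(TITLE:{term} OR ABSTRACT:{term})"]
    if preprints_only:
        clauses.append("(SRC:PPR)")
    if open_access_only:
        clauses.append("(OPEN_ACCESS:y)")
    if date_range:
        start, end = date_range
        clauses.append(f"(FIRST_PDATE:[{start} TO {end}])")
    return " AND ".join(clauses)
```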

### Source Filters

```python
# Filter by source
"SRC:MED"  # MEDLINE
"SRC:PMC"  # PubMed Central
"SRC:PPR"  # Preprints (bioRxiv, medRxiv, etc.)
"SRC:AGR"  # Agricola
"SRC:CBA"  # Chinese Biological Abstracts
```

---

## Recommended Improvements

### Phase 1: Rich Metadata

```python
# Add to search results
additional_fields = [
    "citedByCount",        # Impact indicator
    "source",              # Explicit source (MED, PMC, PPR)
    "isOpenAccess",        # Boolean flag
    "fullTextUrlList",     # URLs for full text
    "authorAffiliations",  # Institution info
    "grantsList",          # Funding info
]
```

### Phase 2: Full-Text Retrieval

```python
async def get_fulltext(pmcid: str) -> str | None:
    """Get full text for open access papers."""
    # XML format
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
    # Or JSON
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextJSON"
```

### Phase 3: Citation Network

```python
async def get_citations(pmcid: str) -> list[str]:
    """Get papers that cite this one."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/citations"

async def get_references(pmcid: str) -> list[str]:
    """Get papers this one cites."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/references"
```

### Phase 4: Text-Mined Annotations

Europe PMC extracts entities automatically:

```python
async def get_annotations(pmcid: str) -> dict:
    """Get text-mined entities (genes, diseases, drugs)."""
    url = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
    params = {
        "articleIds": f"PMC:{pmcid}",
        "type": "Gene_Proteins,Diseases,Chemicals",
        "format": "JSON",
    }
    # Returns structured entity mentions with positions
```

---

## Supplementary File Retrieval

From the reference repo (`bioinformatics_tools.py`, lines 123-149):

```python
def get_figures(pmcid: str) -> dict[str, str]:
    """Download figures and supplementary files."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
    # Returns a ZIP with images, base64-encoded
```

---

## Preprint-Specific Features

### Identify Preprint Servers

```python
PREPRINT_SOURCES = {
    "PPR": "General preprints",
    "bioRxiv": "Biology preprints",
    "medRxiv": "Medical preprints",
    "chemRxiv": "Chemistry preprints",
    "Research Square": "Multi-disciplinary",
    "Preprints.org": "MDPI preprints",
}

# Check if a published version exists
async def check_published_version(preprint_doi: str) -> str | None:
    """Check if a preprint has been peer-reviewed and published."""
    # Europe PMC links preprints to final versions
```

---

## Rate Limiting

Europe PMC is more generous than NCBI:

```python
# No documented hard limit, but be respectful
# Recommended: 10-20 requests/second max
# Use an email in the User-Agent for the polite pool
headers = {
    "User-Agent": "DeepCritical/1.0 (mailto:your@email.com)"
}
```

---

## vs. The Lens & OpenAlex

| Feature | Europe PMC | The Lens | OpenAlex |
|---------|------------|----------|----------|
| Biomedical Focus | Yes | Partial | Partial |
| Preprints | Yes (34 servers) | Yes | Yes |
| Full Text | PMC papers | Links | No |
| Citations | Yes | Yes | Yes |
| Annotations | Yes (text-mined) | No | No |
| Rate Limits | Generous | Moderate | Very generous |
| API Key | Optional | Required | Optional |

---

## Sources

- [Europe PMC REST API](https://europepmc.org/RestfulWebService)
- [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
- [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
- [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
- [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)
docs/brainstorming/04_OPENALEX_INTEGRATION.md
ADDED
@@ -0,0 +1,303 @@
# OpenAlex Integration: The Missing Piece?

**Status**: NOT Implemented (Candidate for Addition)
**Priority**: HIGH - Could Replace Multiple Tools
**Reference**: Already implemented in `reference_repos/DeepCritical`

---

## What is OpenAlex?

OpenAlex is a **fully open** index of the global research system:

- **209M+ works** (papers, books, datasets)
- **2B+ author records** (disambiguated)
- **124K+ venues** (journals, repositories)
- **109K+ institutions**
- **65K+ concepts** (hierarchical, linked to Wikidata)

**Free. Open. No API key required.**

---

## Why OpenAlex for DeepCritical?

### Current Architecture

```
User Query
    │
┌───────────────────────────────────────┐
│  PubMed   ClinicalTrials   Europe PMC │ ← 3 separate APIs
└───────────────────────────────────────┘
    │
Orchestrator (deduplicate, judge, synthesize)
```

### With OpenAlex

```
User Query
    │
┌───────────────────────────────────────┐
│  OpenAlex                             │ ← Single API
│  (includes PubMed + preprints +       │
│   citations + concepts + authors)     │
└───────────────────────────────────────┘
    │
Orchestrator (enrich with CT.gov for trials)
```

**OpenAlex already aggregates**:
- PubMed/MEDLINE
- Crossref
- ORCID
- Unpaywall (open access links)
- Microsoft Academic Graph (legacy)
- Preprint servers

---

## Reference Implementation

From `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`:

```python
class OpenAlexFetchTool(ToolRunner):
    def __init__(self):
        super().__init__(
            ToolSpec(
                name="openalex_fetch",
                description="Fetch OpenAlex work or author",
                inputs={"entity": "TEXT", "identifier": "TEXT"},
                outputs={"result": "JSON"},
            )
        )

    def run(self, params: dict[str, Any]) -> ExecutionResult:
        entity = params["entity"]  # "works", "authors", "venues"
        identifier = params["identifier"]
        base = "https://api.openalex.org"
        url = f"{base}/{entity}/{identifier}"
        resp = requests.get(url, timeout=30)
        return ExecutionResult(success=True, data={"result": resp.json()})
```

---

## OpenAlex API Features

### Search Works (Papers)

```python
# Search for metformin + cancer papers
url = "https://api.openalex.org/works"
params = {
    "search": "metformin cancer drug repurposing",
    "filter": "publication_year:>2020,type:article",
    "sort": "cited_by_count:desc",
    "per_page": 50,
}
```
|
| 102 |
+
|
| 103 |
+
### Rich Filtering
|
| 104 |
+
|
| 105 |
+
```python
|
| 106 |
+
# Filter examples
|
| 107 |
+
"publication_year:2023"
|
| 108 |
+
"type:article" # vs preprint, book, etc.
|
| 109 |
+
"is_oa:true" # Open access only
|
| 110 |
+
"concepts.id:C71924100" # Papers about "Medicine"
|
| 111 |
+
"authorships.institutions.id:I27837315" # From Harvard
|
| 112 |
+
"cited_by_count:>100" # Highly cited
|
| 113 |
+
"has_fulltext:true" # Full text available
|
| 114 |
+
```
|
| 115 |
+
|
| 116 |
+
### What You Get Back
|
| 117 |
+
|
| 118 |
+
```json
|
| 119 |
+
{
|
| 120 |
+
"id": "W2741809807",
|
| 121 |
+
"title": "Metformin: A candidate drug for...",
|
| 122 |
+
"publication_year": 2023,
|
| 123 |
+
"type": "article",
|
| 124 |
+
"cited_by_count": 45,
|
| 125 |
+
"is_oa": true,
|
| 126 |
+
"primary_location": {
|
| 127 |
+
"source": {"display_name": "Nature Medicine"},
|
| 128 |
+
"pdf_url": "https://...",
|
| 129 |
+
"landing_page_url": "https://..."
|
| 130 |
+
},
|
| 131 |
+
"concepts": [
|
| 132 |
+
{"id": "C71924100", "display_name": "Medicine", "score": 0.95},
|
| 133 |
+
{"id": "C54355233", "display_name": "Pharmacology", "score": 0.88}
|
| 134 |
+
],
|
| 135 |
+
"authorships": [
|
| 136 |
+
{
|
| 137 |
+
"author": {"id": "A123", "display_name": "John Smith"},
|
| 138 |
+
"institutions": [{"display_name": "Harvard Medical School"}]
|
| 139 |
+
}
|
| 140 |
+
],
|
| 141 |
+
"referenced_works": ["W123", "W456"], # Citations
|
| 142 |
+
"related_works": ["W789", "W012"] # Similar papers
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
---
|
| 147 |
+
|
| 148 |
+
## Key Advantages Over Current Tools
|
| 149 |
+
|
| 150 |
+
### 1. Citation Network (We Don't Have This!)
|
| 151 |
+
|
| 152 |
+
```python
|
| 153 |
+
# Get papers that cite a work
|
| 154 |
+
url = f"https://api.openalex.org/works?filter=cites:{work_id}"
|
| 155 |
+
|
| 156 |
+
# Get papers cited by a work
|
| 157 |
+
# Already in `referenced_works` field
|
| 158 |
+
```
|
| 159 |
+
|
| 160 |
+
### 2. Concept Tagging (We Don't Have This!)
|
| 161 |
+
|
| 162 |
+
OpenAlex auto-tags papers with hierarchical concepts:
|
| 163 |
+
- "Medicine" β "Pharmacology" β "Drug Repurposing"
|
| 164 |
+
- Can search by concept, not just keywords
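Once works are fetched, those concept tags can also drive local filtering. A minimal sketch, assuming dicts shaped like the OpenAlex `works` schema; the `filter_by_concept` helper is illustrative, not part of the codebase:

```python
# Hypothetical helper: keep already-fetched works tagged with a given
# concept at or above a score threshold (field names mirror OpenAlex).
def filter_by_concept(works: list[dict], concept: str, min_score: float = 0.5) -> list[dict]:
    return [
        w
        for w in works
        if any(
            c.get("display_name") == concept and c.get("score", 0) >= min_score
            for c in w.get("concepts", [])
        )
    ]


works = [
    {"id": "W1", "concepts": [{"display_name": "Pharmacology", "score": 0.92}]},
    {"id": "W2", "concepts": [{"display_name": "History", "score": 0.90}]},
]
print([w["id"] for w in filter_by_concept(works, "Pharmacology")])  # ['W1']
```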
### 3. Author Disambiguation (We Don't Have This!)

```python
# Find all works by an author
url = f"https://api.openalex.org/works?filter=authorships.author.id:{author_id}"
```

### 4. Institution Tracking

```python
# Find drug repurposing papers from top institutions
url = "https://api.openalex.org/works"
params = {
    "search": "drug repurposing",
    "filter": "authorships.institutions.id:I27837315",  # Harvard
}
```

### 5. Related Works

Each paper comes with `related_works` - semantically similar papers discovered by OpenAlex's ML.
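Because `referenced_works` and `related_works` ship with every record, follow-up lookups need no extra search call. A sketch of harvesting those IDs from a fetched work (the dict shape mirrors the API; the helper name is ours):

```python
# Hypothetical helper: collect follow-up work IDs for crawl-style expansion.
def expansion_ids(work: dict) -> list[str]:
    ids = work.get("referenced_works", []) + work.get("related_works", [])
    # OpenAlex sometimes returns IDs as full URLs; strip the prefix.
    return [i.replace("https://openalex.org/", "") for i in ids]


work = {
    "referenced_works": ["https://openalex.org/W123", "W456"],
    "related_works": ["W789"],
}
print(expansion_ids(work))  # ['W123', 'W456', 'W789']
```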
---

## Proposed Implementation

### New Tool: `src/tools/openalex.py`

```python
"""OpenAlex search tool for comprehensive scholarly data."""

import httpx

from src.tools.base import SearchTool
from src.utils.models import Evidence


class OpenAlexTool(SearchTool):
    """Search OpenAlex for scholarly works with rich metadata."""

    name = "openalex"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        async with httpx.AsyncClient() as client:
            resp = await client.get(
                "https://api.openalex.org/works",
                params={
                    "search": query,
                    "filter": "type:article,is_oa:true",
                    "sort": "cited_by_count:desc",
                    "per_page": max_results,
                    "mailto": "deepcritical@example.com",  # Polite pool
                },
            )
            data = resp.json()

        return [
            Evidence(
                source="openalex",
                title=work["title"],
                abstract=work.get("abstract", ""),
                url=work["primary_location"]["landing_page_url"],
                metadata={
                    "cited_by_count": work["cited_by_count"],
                    "concepts": [c["display_name"] for c in work["concepts"][:5]],
                    "is_open_access": work["is_oa"],
                    "pdf_url": work["primary_location"].get("pdf_url"),
                },
            )
            for work in data["results"]
        ]
```

---

## Rate Limits

OpenAlex is **extremely generous**:

- No hard rate limit documented
- Recommended: <100,000 requests/day
- **Polite pool**: Add `mailto=your@email.com` param for faster responses
- No API key required (optional for priority support)
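Joining the polite pool costs nothing more than one extra query parameter. A stdlib-only sketch of what the final request URL looks like (the email is a placeholder):

```python
from urllib.parse import urlencode

# Build an OpenAlex request URL with the polite-pool `mailto` parameter.
params = {
    "search": "metformin cancer",
    "per_page": 25,
    "mailto": "you@example.com",  # placeholder address
}
url = "https://api.openalex.org/works?" + urlencode(params)
print(url)
# https://api.openalex.org/works?search=metformin+cancer&per_page=25&mailto=you%40example.com
```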
---

## Should We Add OpenAlex?

### Arguments FOR

1. **Already in reference repo** - proven pattern
2. **Richer data** - citations, concepts, authors
3. **Single source** - reduces API complexity
4. **Free & open** - no API keys, generous limits
5. **Institution adoption** - Leiden and Sorbonne switched to it

### Arguments AGAINST

1. **Adds complexity** - another data source
2. **Overlap** - duplicates some PubMed data
3. **Not biomedical-focused** - covers all disciplines
4. **No full text** - still need PMC/Europe PMC for that

### Recommendation

**Add OpenAlex as a 4th source**; don't replace the existing tools.

Use it for:

- Citation network analysis
- Concept-based discovery
- High-impact paper finding
- Author/institution tracking

Keep PubMed, ClinicalTrials, Europe PMC for:

- Authoritative biomedical search
- Clinical trial data
- Full-text access
- Preprint tracking
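Because OpenAlex overlaps with PubMed, the orchestrator's deduplication step matters more once it is added. A minimal sketch of merging result lists by DOI; the record shape and helper are illustrative, not the project's actual models:

```python
# Hypothetical merge step: keep the first record seen for each DOI,
# drop later duplicates from other sources; DOI-less records pass through.
def merge_by_doi(*result_lists: list[dict]) -> list[dict]:
    seen: set[str] = set()
    merged: list[dict] = []
    for results in result_lists:
        for rec in results:
            doi = (rec.get("doi") or "").lower()
            if doi and doi in seen:
                continue  # same paper already collected from another source
            if doi:
                seen.add(doi)
            merged.append(rec)
    return merged


pubmed = [{"doi": "10.1/abc", "source": "pubmed"}]
openalex = [
    {"doi": "10.1/ABC", "source": "openalex"},  # same DOI, different case
    {"doi": "10.2/xyz", "source": "openalex"},
]
print([r["source"] for r in merge_by_doi(pubmed, openalex)])  # ['pubmed', 'openalex']
```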
---

## Implementation Priority

| Task | Effort | Value |
|------|--------|-------|
| Basic search | Low | High |
| Citation network | Medium | Very High |
| Concept filtering | Low | High |
| Related works | Low | High |
| Author tracking | Medium | Medium |

---

## Sources

- [OpenAlex Documentation](https://docs.openalex.org)
- [OpenAlex API Overview](https://docs.openalex.org/api)
- [OpenAlex Wikipedia](https://en.wikipedia.org/wiki/OpenAlex)
- [Leiden University Announcement](https://www.leidenranking.com/information/openalex)
- [OpenAlex: A fully-open index (Paper)](https://arxiv.org/abs/2205.01833)
docs/brainstorming/implementation/15_PHASE_OPENALEX.md
ADDED
@@ -0,0 +1,603 @@
# Phase 15: OpenAlex Integration

**Priority**: HIGH - Biggest bang for buck
**Effort**: ~2-3 hours
**Dependencies**: None (existing codebase patterns sufficient)

---

## Prerequisites (COMPLETED)

The following model changes have been implemented to support this integration:

1. **`SourceName` Literal Updated** (`src/utils/models.py:9`)

   ```python
   SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
   ```

   - Without this, `source="openalex"` would fail Pydantic validation

2. **`Evidence.metadata` Field Added** (`src/utils/models.py:39-42`)

   ```python
   metadata: dict[str, Any] = Field(
       default_factory=dict,
       description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
   )
   ```

   - Required for storing `cited_by_count`, `concepts`, etc.
   - Model is still frozen - metadata must be passed at construction time

3. **`__init__.py` Exports Updated** (`src/tools/__init__.py`)
   - All tools are now exported: `ClinicalTrialsTool`, `EuropePMCTool`, `PubMedTool`
   - `OpenAlexTool` should be added here after implementation
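The frozen-model constraint is easy to trip over. A stdlib analogy of the behavior, using a frozen dataclass as a stand-in for the actual Pydantic `Evidence` model (names here are illustrative):

```python
from dataclasses import FrozenInstanceError, dataclass, field


# Stand-in for the frozen Evidence model: `metadata` must be supplied at
# construction time; assigning to the field afterwards raises.
@dataclass(frozen=True)
class FrozenEvidence:
    content: str
    metadata: dict = field(default_factory=dict)


ev = FrozenEvidence(content="abstract...", metadata={"cited_by_count": 45})
print(ev.metadata["cited_by_count"])  # 45

try:
    ev.metadata = {}  # type: ignore[misc]
except FrozenInstanceError:
    print("frozen: pass metadata at construction time")
```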
| 33 |
+
---
|
| 34 |
+
|
| 35 |
+
## Overview
|
| 36 |
+
|
| 37 |
+
Add OpenAlex as a 4th data source for comprehensive scholarly data including:
|
| 38 |
+
- Citation networks (who cites whom)
|
| 39 |
+
- Concept tagging (hierarchical topic classification)
|
| 40 |
+
- Author disambiguation
|
| 41 |
+
- 209M+ works indexed
|
| 42 |
+
|
| 43 |
+
**Why OpenAlex?**
|
| 44 |
+
- Free, no API key required
|
| 45 |
+
- Already implemented in reference repo
|
| 46 |
+
- Provides citation data we don't have
|
| 47 |
+
- Aggregates PubMed + preprints + more
|
| 48 |
+
|
| 49 |
+
---
|
| 50 |
+
|
| 51 |
+
## TDD Implementation Plan
|
| 52 |
+
|
| 53 |
+
### Step 1: Write the Tests First
|
| 54 |
+
|
| 55 |
+
**File**: `tests/unit/tools/test_openalex.py`
|
| 56 |
+
|
| 57 |
+
```python
|
| 58 |
+
"""Tests for OpenAlex search tool."""
|
| 59 |
+
|
| 60 |
+
import pytest
|
| 61 |
+
import respx
|
| 62 |
+
from httpx import Response
|
| 63 |
+
|
| 64 |
+
from src.tools.openalex import OpenAlexTool
|
| 65 |
+
from src.utils.models import Evidence
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
class TestOpenAlexTool:
|
| 69 |
+
"""Test suite for OpenAlex search functionality."""
|
| 70 |
+
|
| 71 |
+
@pytest.fixture
|
| 72 |
+
def tool(self) -> OpenAlexTool:
|
| 73 |
+
return OpenAlexTool()
|
| 74 |
+
|
| 75 |
+
def test_name_property(self, tool: OpenAlexTool) -> None:
|
| 76 |
+
"""Tool should identify itself as 'openalex'."""
|
| 77 |
+
assert tool.name == "openalex"
|
| 78 |
+
|
| 79 |
+
@respx.mock
|
| 80 |
+
@pytest.mark.asyncio
|
| 81 |
+
async def test_search_returns_evidence(self, tool: OpenAlexTool) -> None:
|
| 82 |
+
"""Search should return list of Evidence objects."""
|
| 83 |
+
mock_response = {
|
| 84 |
+
"results": [
|
| 85 |
+
{
|
| 86 |
+
"id": "W2741809807",
|
| 87 |
+
"title": "Metformin and cancer: A systematic review",
|
| 88 |
+
"publication_year": 2023,
|
| 89 |
+
"cited_by_count": 45,
|
| 90 |
+
"type": "article",
|
| 91 |
+
"is_oa": True,
|
| 92 |
+
"primary_location": {
|
| 93 |
+
"source": {"display_name": "Nature Medicine"},
|
| 94 |
+
"landing_page_url": "https://doi.org/10.1038/example",
|
| 95 |
+
"pdf_url": None,
|
| 96 |
+
},
|
| 97 |
+
"abstract_inverted_index": {
|
| 98 |
+
"Metformin": [0],
|
| 99 |
+
"shows": [1],
|
| 100 |
+
"anticancer": [2],
|
| 101 |
+
"effects": [3],
|
| 102 |
+
},
|
| 103 |
+
"concepts": [
|
| 104 |
+
{"display_name": "Medicine", "score": 0.95},
|
| 105 |
+
{"display_name": "Oncology", "score": 0.88},
|
| 106 |
+
],
|
| 107 |
+
"authorships": [
|
| 108 |
+
{
|
| 109 |
+
"author": {"display_name": "John Smith"},
|
| 110 |
+
"institutions": [{"display_name": "Harvard"}],
|
| 111 |
+
}
|
| 112 |
+
],
|
| 113 |
+
}
|
| 114 |
+
]
|
| 115 |
+
}
|
| 116 |
+
|
| 117 |
+
respx.get("https://api.openalex.org/works").mock(
|
| 118 |
+
return_value=Response(200, json=mock_response)
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
results = await tool.search("metformin cancer", max_results=10)
|
| 122 |
+
|
| 123 |
+
assert len(results) == 1
|
| 124 |
+
assert isinstance(results[0], Evidence)
|
| 125 |
+
assert "Metformin and cancer" in results[0].citation.title
|
| 126 |
+
assert results[0].citation.source == "openalex"
|
| 127 |
+
|
| 128 |
+
@respx.mock
|
| 129 |
+
@pytest.mark.asyncio
|
| 130 |
+
async def test_search_empty_results(self, tool: OpenAlexTool) -> None:
|
| 131 |
+
"""Search with no results should return empty list."""
|
| 132 |
+
respx.get("https://api.openalex.org/works").mock(
|
| 133 |
+
return_value=Response(200, json={"results": []})
|
| 134 |
+
)
|
| 135 |
+
|
| 136 |
+
results = await tool.search("xyznonexistentquery123")
|
| 137 |
+
assert results == []
|
| 138 |
+
|
| 139 |
+
@respx.mock
|
| 140 |
+
@pytest.mark.asyncio
|
| 141 |
+
async def test_search_handles_missing_abstract(self, tool: OpenAlexTool) -> None:
|
| 142 |
+
"""Tool should handle papers without abstracts."""
|
| 143 |
+
mock_response = {
|
| 144 |
+
"results": [
|
| 145 |
+
{
|
| 146 |
+
"id": "W123",
|
| 147 |
+
"title": "Paper without abstract",
|
| 148 |
+
"publication_year": 2023,
|
| 149 |
+
"cited_by_count": 10,
|
| 150 |
+
"type": "article",
|
| 151 |
+
"is_oa": False,
|
| 152 |
+
"primary_location": {
|
| 153 |
+
"source": {"display_name": "Journal"},
|
| 154 |
+
"landing_page_url": "https://example.com",
|
| 155 |
+
},
|
| 156 |
+
"abstract_inverted_index": None,
|
| 157 |
+
"concepts": [],
|
| 158 |
+
"authorships": [],
|
| 159 |
+
}
|
| 160 |
+
]
|
| 161 |
+
}
|
| 162 |
+
|
| 163 |
+
respx.get("https://api.openalex.org/works").mock(
|
| 164 |
+
return_value=Response(200, json=mock_response)
|
| 165 |
+
)
|
| 166 |
+
|
| 167 |
+
results = await tool.search("test query")
|
| 168 |
+
assert len(results) == 1
|
| 169 |
+
assert results[0].content == "" # No abstract
|
| 170 |
+
|
| 171 |
+
@respx.mock
|
| 172 |
+
@pytest.mark.asyncio
|
| 173 |
+
async def test_search_extracts_citation_count(self, tool: OpenAlexTool) -> None:
|
| 174 |
+
"""Citation count should be in metadata."""
|
| 175 |
+
mock_response = {
|
| 176 |
+
"results": [
|
| 177 |
+
{
|
| 178 |
+
"id": "W456",
|
| 179 |
+
"title": "Highly cited paper",
|
| 180 |
+
"publication_year": 2020,
|
| 181 |
+
"cited_by_count": 500,
|
| 182 |
+
"type": "article",
|
| 183 |
+
"is_oa": True,
|
| 184 |
+
"primary_location": {
|
| 185 |
+
"source": {"display_name": "Science"},
|
| 186 |
+
"landing_page_url": "https://example.com",
|
| 187 |
+
},
|
| 188 |
+
"abstract_inverted_index": {"Test": [0]},
|
| 189 |
+
"concepts": [],
|
| 190 |
+
"authorships": [],
|
| 191 |
+
}
|
| 192 |
+
]
|
| 193 |
+
}
|
| 194 |
+
|
| 195 |
+
respx.get("https://api.openalex.org/works").mock(
|
| 196 |
+
return_value=Response(200, json=mock_response)
|
| 197 |
+
)
|
| 198 |
+
|
| 199 |
+
results = await tool.search("highly cited")
|
| 200 |
+
assert results[0].metadata["cited_by_count"] == 500
|
| 201 |
+
|
| 202 |
+
@respx.mock
|
| 203 |
+
@pytest.mark.asyncio
|
| 204 |
+
async def test_search_extracts_concepts(self, tool: OpenAlexTool) -> None:
|
| 205 |
+
"""Concepts should be extracted for semantic discovery."""
|
| 206 |
+
mock_response = {
|
| 207 |
+
"results": [
|
| 208 |
+
{
|
| 209 |
+
"id": "W789",
|
| 210 |
+
"title": "Drug repurposing study",
|
| 211 |
+
"publication_year": 2023,
|
| 212 |
+
"cited_by_count": 25,
|
| 213 |
+
"type": "article",
|
| 214 |
+
"is_oa": True,
|
| 215 |
+
"primary_location": {
|
| 216 |
+
"source": {"display_name": "PLOS ONE"},
|
| 217 |
+
"landing_page_url": "https://example.com",
|
| 218 |
+
},
|
| 219 |
+
"abstract_inverted_index": {"Drug": [0], "repurposing": [1]},
|
| 220 |
+
"concepts": [
|
| 221 |
+
{"display_name": "Pharmacology", "score": 0.92},
|
| 222 |
+
{"display_name": "Drug Discovery", "score": 0.85},
|
| 223 |
+
{"display_name": "Medicine", "score": 0.80},
|
| 224 |
+
],
|
| 225 |
+
"authorships": [],
|
| 226 |
+
}
|
| 227 |
+
]
|
| 228 |
+
}
|
| 229 |
+
|
| 230 |
+
respx.get("https://api.openalex.org/works").mock(
|
| 231 |
+
return_value=Response(200, json=mock_response)
|
| 232 |
+
)
|
| 233 |
+
|
| 234 |
+
results = await tool.search("drug repurposing")
|
| 235 |
+
assert "Pharmacology" in results[0].metadata["concepts"]
|
| 236 |
+
assert "Drug Discovery" in results[0].metadata["concepts"]
|
| 237 |
+
|
| 238 |
+
@respx.mock
|
| 239 |
+
@pytest.mark.asyncio
|
| 240 |
+
async def test_search_api_error_raises_search_error(
|
| 241 |
+
self, tool: OpenAlexTool
|
| 242 |
+
) -> None:
|
| 243 |
+
"""API errors should raise SearchError."""
|
| 244 |
+
from src.utils.exceptions import SearchError
|
| 245 |
+
|
| 246 |
+
respx.get("https://api.openalex.org/works").mock(
|
| 247 |
+
return_value=Response(500, text="Internal Server Error")
|
| 248 |
+
)
|
| 249 |
+
|
| 250 |
+
with pytest.raises(SearchError):
|
| 251 |
+
await tool.search("test query")
|
| 252 |
+
|
| 253 |
+
def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None:
|
| 254 |
+
"""Test abstract reconstruction from inverted index."""
|
| 255 |
+
inverted_index = {
|
| 256 |
+
"Metformin": [0, 5],
|
| 257 |
+
"is": [1],
|
| 258 |
+
"a": [2],
|
| 259 |
+
"diabetes": [3],
|
| 260 |
+
"drug": [4],
|
| 261 |
+
"effective": [6],
|
| 262 |
+
}
|
| 263 |
+
abstract = tool._reconstruct_abstract(inverted_index)
|
| 264 |
+
assert abstract == "Metformin is a diabetes drug Metformin effective"
|
| 265 |
+
```
|
| 266 |
+
|
| 267 |
+
---
|
| 268 |
+
|
| 269 |
+
### Step 2: Create the Implementation
|
| 270 |
+
|
| 271 |
+
**File**: `src/tools/openalex.py`
|
| 272 |
+
|
| 273 |
+
```python
|
| 274 |
+
"""OpenAlex search tool for comprehensive scholarly data."""
|
| 275 |
+
|
| 276 |
+
from typing import Any
|
| 277 |
+
|
| 278 |
+
import httpx
|
| 279 |
+
from tenacity import retry, stop_after_attempt, wait_exponential
|
| 280 |
+
|
| 281 |
+
from src.utils.exceptions import SearchError
|
| 282 |
+
from src.utils.models import Citation, Evidence
|
| 283 |
+
|
| 284 |
+
|
| 285 |
+
class OpenAlexTool:
|
| 286 |
+
"""
|
| 287 |
+
Search OpenAlex for scholarly works with rich metadata.
|
| 288 |
+
|
| 289 |
+
OpenAlex provides:
|
| 290 |
+
- 209M+ scholarly works
|
| 291 |
+
- Citation counts and networks
|
| 292 |
+
- Concept tagging (hierarchical)
|
| 293 |
+
- Author disambiguation
|
| 294 |
+
- Open access links
|
| 295 |
+
|
| 296 |
+
API Docs: https://docs.openalex.org/
|
| 297 |
+
"""
|
| 298 |
+
|
| 299 |
+
BASE_URL = "https://api.openalex.org/works"
|
| 300 |
+
|
| 301 |
+
def __init__(self, email: str | None = None) -> None:
|
| 302 |
+
"""
|
| 303 |
+
Initialize OpenAlex tool.
|
| 304 |
+
|
| 305 |
+
Args:
|
| 306 |
+
email: Optional email for polite pool (faster responses)
|
| 307 |
+
"""
|
| 308 |
+
self.email = email or "deepcritical@example.com"
|
| 309 |
+
|
| 310 |
+
@property
|
| 311 |
+
def name(self) -> str:
|
| 312 |
+
return "openalex"
|
| 313 |
+
|
| 314 |
+
@retry(
|
| 315 |
+
stop=stop_after_attempt(3),
|
| 316 |
+
wait=wait_exponential(multiplier=1, min=1, max=10),
|
| 317 |
+
reraise=True,
|
| 318 |
+
)
|
| 319 |
+
async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
|
| 320 |
+
"""
|
| 321 |
+
Search OpenAlex for scholarly works.
|
| 322 |
+
|
| 323 |
+
Args:
|
| 324 |
+
query: Search terms
|
| 325 |
+
max_results: Maximum results to return (max 200 per request)
|
| 326 |
+
|
| 327 |
+
Returns:
|
| 328 |
+
List of Evidence objects with citation metadata
|
| 329 |
+
|
| 330 |
+
Raises:
|
| 331 |
+
SearchError: If API request fails
|
| 332 |
+
"""
|
| 333 |
+
params = {
|
| 334 |
+
"search": query,
|
| 335 |
+
"filter": "type:article", # Only peer-reviewed articles
|
| 336 |
+
"sort": "cited_by_count:desc", # Most cited first
|
| 337 |
+
"per_page": min(max_results, 200),
|
| 338 |
+
"mailto": self.email, # Polite pool for faster responses
|
| 339 |
+
}
|
| 340 |
+
|
| 341 |
+
async with httpx.AsyncClient(timeout=30.0) as client:
|
| 342 |
+
try:
|
| 343 |
+
response = await client.get(self.BASE_URL, params=params)
|
| 344 |
+
response.raise_for_status()
|
| 345 |
+
|
| 346 |
+
data = response.json()
|
| 347 |
+
results = data.get("results", [])
|
| 348 |
+
|
| 349 |
+
return [self._to_evidence(work) for work in results[:max_results]]
|
| 350 |
+
|
| 351 |
+
except httpx.HTTPStatusError as e:
|
| 352 |
+
raise SearchError(f"OpenAlex API error: {e}") from e
|
| 353 |
+
except httpx.RequestError as e:
|
| 354 |
+
raise SearchError(f"OpenAlex connection failed: {e}") from e
|
| 355 |
+
|
| 356 |
+
def _to_evidence(self, work: dict[str, Any]) -> Evidence:
|
| 357 |
+
"""Convert OpenAlex work to Evidence object."""
|
| 358 |
+
title = work.get("title", "Untitled")
|
| 359 |
+
pub_year = work.get("publication_year", "Unknown")
|
| 360 |
+
cited_by = work.get("cited_by_count", 0)
|
| 361 |
+
is_oa = work.get("is_oa", False)
|
| 362 |
+
|
| 363 |
+
# Reconstruct abstract from inverted index
|
| 364 |
+
abstract_index = work.get("abstract_inverted_index")
|
| 365 |
+
abstract = self._reconstruct_abstract(abstract_index) if abstract_index else ""
|
| 366 |
+
|
| 367 |
+
# Extract concepts (top 5)
|
| 368 |
+
concepts = [
|
| 369 |
+
c.get("display_name", "")
|
| 370 |
+
for c in work.get("concepts", [])[:5]
|
| 371 |
+
if c.get("display_name")
|
| 372 |
+
]
|
| 373 |
+
|
| 374 |
+
# Extract authors (top 5)
|
| 375 |
+
authorships = work.get("authorships", [])
|
| 376 |
+
authors = [
|
| 377 |
+
a.get("author", {}).get("display_name", "")
|
| 378 |
+
for a in authorships[:5]
|
| 379 |
+
if a.get("author", {}).get("display_name")
|
| 380 |
+
]
|
| 381 |
+
|
| 382 |
+
# Get URL
|
| 383 |
+
primary_loc = work.get("primary_location") or {}
|
| 384 |
+
url = primary_loc.get("landing_page_url", "")
|
| 385 |
+
if not url:
|
| 386 |
+
# Fallback to OpenAlex page
|
| 387 |
+
work_id = work.get("id", "").replace("https://openalex.org/", "")
|
| 388 |
+
url = f"https://openalex.org/{work_id}"
|
| 389 |
+
|
| 390 |
+
return Evidence(
|
| 391 |
+
content=abstract[:2000],
|
| 392 |
+
citation=Citation(
|
| 393 |
+
source="openalex",
|
| 394 |
+
title=title[:500],
|
| 395 |
+
url=url,
|
| 396 |
+
date=str(pub_year),
|
| 397 |
+
authors=authors,
|
| 398 |
+
),
|
| 399 |
+
relevance=min(0.9, 0.5 + (cited_by / 1000)), # Boost by citations
|
| 400 |
+
metadata={
|
| 401 |
+
"cited_by_count": cited_by,
|
| 402 |
+
"is_open_access": is_oa,
|
| 403 |
+
"concepts": concepts,
|
| 404 |
+
"pdf_url": primary_loc.get("pdf_url"),
|
| 405 |
+
},
|
| 406 |
+
)
|
| 407 |
+
|
| 408 |
+
def _reconstruct_abstract(
|
| 409 |
+
self, inverted_index: dict[str, list[int]]
|
| 410 |
+
) -> str:
|
| 411 |
+
"""
|
| 412 |
+
Reconstruct abstract from OpenAlex inverted index format.
|
| 413 |
+
|
| 414 |
+
OpenAlex stores abstracts as {"word": [position1, position2, ...]}.
|
| 415 |
+
This rebuilds the original text.
|
| 416 |
+
"""
|
| 417 |
+
if not inverted_index:
|
| 418 |
+
return ""
|
| 419 |
+
|
| 420 |
+
# Build position -> word mapping
|
| 421 |
+
position_word: dict[int, str] = {}
|
| 422 |
+
for word, positions in inverted_index.items():
|
| 423 |
+
for pos in positions:
|
| 424 |
+
position_word[pos] = word
|
| 425 |
+
|
| 426 |
+
# Reconstruct in order
|
| 427 |
+
if not position_word:
|
| 428 |
+
return ""
|
| 429 |
+
|
| 430 |
+
max_pos = max(position_word.keys())
|
| 431 |
+
words = [position_word.get(i, "") for i in range(max_pos + 1)]
|
| 432 |
+
return " ".join(w for w in words if w)
|
| 433 |
+
```
|
| 434 |
+
|
| 435 |
+
---
|
| 436 |
+
|
| 437 |
+
### Step 3: Register in Search Handler
|
| 438 |
+
|
| 439 |
+
**File**: `src/tools/search_handler.py` (add to imports and tool list)
|
| 440 |
+
|
| 441 |
+
```python
|
| 442 |
+
# Add import
|
| 443 |
+
from src.tools.openalex import OpenAlexTool
|
| 444 |
+
|
| 445 |
+
# Add to _create_tools method
|
| 446 |
+
def _create_tools(self) -> list[SearchTool]:
|
| 447 |
+
return [
|
| 448 |
+
PubMedTool(),
|
| 449 |
+
ClinicalTrialsTool(),
|
| 450 |
+
EuropePMCTool(),
|
| 451 |
+
OpenAlexTool(), # NEW
|
| 452 |
+
]
|
| 453 |
+
```
|
| 454 |
+
|
| 455 |
+
---
|
| 456 |
+
|
| 457 |
+
### Step 4: Update `__init__.py`
|
| 458 |
+
|
| 459 |
+
**File**: `src/tools/__init__.py`
|
| 460 |
+
|
| 461 |
+
```python
|
| 462 |
+
from src.tools.openalex import OpenAlexTool
|
| 463 |
+
|
| 464 |
+
__all__ = [
|
| 465 |
+
"PubMedTool",
|
| 466 |
+
"ClinicalTrialsTool",
|
| 467 |
+
"EuropePMCTool",
|
| 468 |
+
"OpenAlexTool", # NEW
|
| 469 |
+
# ...
|
| 470 |
+
]
|
| 471 |
+
```
|
| 472 |
+
|
| 473 |
+
---
|
| 474 |
+
|
| 475 |
+
## Demo Script
|
| 476 |
+
|
| 477 |
+
**File**: `examples/openalex_demo.py`
|
| 478 |
+
|
| 479 |
+
```python
|
| 480 |
+
#!/usr/bin/env python3
|
| 481 |
+
"""Demo script to verify OpenAlex integration."""
|
| 482 |
+
|
| 483 |
+
import asyncio
|
| 484 |
+
from src.tools.openalex import OpenAlexTool
|
| 485 |
+
|
| 486 |
+
|
| 487 |
+
async def main():
|
| 488 |
+
"""Run OpenAlex search demo."""
|
| 489 |
+
tool = OpenAlexTool()
|
| 490 |
+
|
| 491 |
+
print("=" * 60)
|
| 492 |
+
print("OpenAlex Integration Demo")
|
| 493 |
+
print("=" * 60)
|
| 494 |
+
|
| 495 |
+
# Test 1: Basic drug repurposing search
|
| 496 |
+
print("\n[Test 1] Searching for 'metformin cancer drug repurposing'...")
|
| 497 |
+
results = await tool.search("metformin cancer drug repurposing", max_results=5)
|
| 498 |
+
|
| 499 |
+
for i, evidence in enumerate(results, 1):
|
| 500 |
+
print(f"\n--- Result {i} ---")
|
| 501 |
+
print(f"Title: {evidence.citation.title}")
|
| 502 |
+
print(f"Year: {evidence.citation.date}")
|
| 503 |
+
print(f"Citations: {evidence.metadata.get('cited_by_count', 'N/A')}")
|
| 504 |
+
print(f"Concepts: {', '.join(evidence.metadata.get('concepts', []))}")
|
| 505 |
+
print(f"Open Access: {evidence.metadata.get('is_open_access', False)}")
|
| 506 |
+
print(f"URL: {evidence.citation.url}")
|
| 507 |
+
if evidence.content:
|
| 508 |
+
print(f"Abstract: {evidence.content[:200]}...")
|
| 509 |
+
|
| 510 |
+
# Test 2: High-impact papers
|
| 511 |
+
print("\n" + "=" * 60)
|
| 512 |
+
print("[Test 2] Finding highly-cited papers on 'long COVID treatment'...")
|
| 513 |
+
results = await tool.search("long COVID treatment", max_results=3)
|
| 514 |
+
|
| 515 |
+
for evidence in results:
|
| 516 |
+
print(f"\n- {evidence.citation.title}")
|
| 517 |
+
print(f" Citations: {evidence.metadata.get('cited_by_count', 0)}")
|
| 518 |
+
|
| 519 |
+
print("\n" + "=" * 60)
|
| 520 |
+
print("Demo complete!")
|
if __name__ == "__main__":
    asyncio.run(main())
```

---

## Verification Checklist

### Unit Tests
```bash
# Run just the OpenAlex tests
uv run pytest tests/unit/tools/test_openalex.py -v

# Expected: all tests pass
```

### Integration Test (Manual)
```bash
# Run the demo script against the real API
uv run python examples/openalex_demo.py

# Expected: real results from the OpenAlex API
```

### Full Test Suite
```bash
# Ensure nothing broke
make check

# Expected: all 110+ tests pass, mypy clean
```

---

## Success Criteria

1. **Unit tests pass**: all mocked tests in `test_openalex.py` pass
2. **Integration works**: the demo script returns real results
3. **No regressions**: `make check` passes completely
4. **SearchHandler integration**: OpenAlex appears in search results alongside the other sources
5. **Citation metadata**: results include `cited_by_count`, `concepts`, and `is_open_access`

---

## Future Enhancements (P2)

Once the basic integration works:

1. **Citation Network Queries**
   ```python
   # Get papers citing a specific work
   async def get_citing_works(self, work_id: str) -> list[Evidence]:
       params = {"filter": f"cites:{work_id}"}
       ...
   ```

2. **Concept-Based Search**
   ```python
   # Search by OpenAlex concept ID
   async def search_by_concept(self, concept_id: str) -> list[Evidence]:
       params = {"filter": f"concepts.id:{concept_id}"}
       ...
   ```

3. **Author Tracking**
   ```python
   # Find all works by an author
   async def search_by_author(self, author_id: str) -> list[Evidence]:
       params = {"filter": f"authorships.author.id:{author_id}"}
       ...
   ```

---

## Notes

- OpenAlex is **very generous** with rate limits (no documented hard limit)
- Adding the `mailto` parameter grants priority access (the "polite pool")
- Abstracts are stored as an inverted index and must be reconstructed
- Citation count is a good proxy for paper quality/impact
- Consider caching responses for repeated queries
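The notes above mention that OpenAlex stores abstracts as an inverted index: the work's `abstract_inverted_index` field maps each token to the list of positions where it occurs. A minimal reconstruction helper might look like this (the function name is illustrative, not part of the existing codebase):

```python
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild abstract text from OpenAlex's abstract_inverted_index.

    Each token maps to its positions, so sorting (position, token)
    pairs restores the original word order.
    """
    positions = [
        (pos, token)
        for token, token_positions in inverted_index.items()
        for pos in token_positions
    ]
    return " ".join(token for _, token in sorted(positions))


index = {"Metformin": [0], "inhibits": [1], "tumor": [2, 4], "growth": [3]}
print(reconstruct_abstract(index))  # -> "Metformin inhibits tumor growth tumor"
```

Note that tokens appearing multiple times (like "tumor" above) carry several positions, which is why the pairs are flattened before sorting.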
docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md
ADDED
@@ -0,0 +1,586 @@
# Phase 16: PubMed Full-Text Retrieval

**Priority**: MEDIUM - Enhances evidence quality
**Effort**: ~3 hours
**Dependencies**: None (existing PubMed tool sufficient)

---

## Prerequisites (COMPLETED)

The `Evidence.metadata` field has been added to `src/utils/models.py` to support:
```python
metadata={"has_fulltext": True}
```

---

## Architecture Decision: Constructor Parameter vs Method Parameter

**IMPORTANT**: The original spec proposed `include_fulltext` as a method parameter:
```python
# WRONG - SearchHandler won't pass this parameter
async def search(self, query: str, max_results: int = 10, include_fulltext: bool = False):
```

**Problem**: `SearchHandler` calls `tool.search(query, max_results)` uniformly across all tools. It has no mechanism to pass tool-specific parameters like `include_fulltext`.

**Solution**: Use a constructor parameter instead:
```python
# CORRECT - Configured at instantiation time
class PubMedTool:
    def __init__(self, api_key: str | None = None, include_fulltext: bool = False):
        self.include_fulltext = include_fulltext
        ...
```

This way, you can create a full-text-enabled PubMed tool:
```python
# In the orchestrator, or wherever tools are created
tools = [
    PubMedTool(include_fulltext=True),  # Full-text enabled
    ClinicalTrialsTool(),
    EuropePMCTool(),
]
```

---

## Overview

Add full-text retrieval for PubMed papers via the BioC API, enabling:
- Complete paper text for open-access PMC papers
- Structured sections (intro, methods, results, discussion)
- Better evidence for LLM synthesis

**Why Full-Text?**
- Abstracts give only ~200-300 words
- Full text provides detailed methods, results, and figures
- The reference repo already has this implemented
- Makes LLM judgments more accurate

---
## TDD Implementation Plan

### Step 1: Write the Tests First

**File**: `tests/unit/tools/test_pubmed_fulltext.py`

```python
"""Tests for PubMed full-text retrieval."""

import pytest
import respx
from httpx import Response

from src.tools.pubmed import PubMedTool


class TestPubMedFullText:
    """Test suite for PubMed full-text functionality."""

    @pytest.fixture
    def tool(self) -> PubMedTool:
        return PubMedTool()

    @respx.mock
    @pytest.mark.asyncio
    async def test_get_pmc_id_success(self, tool: PubMedTool) -> None:
        """Should convert PMID to PMCID for full-text access."""
        mock_response = {
            "records": [
                {
                    "pmid": "12345678",
                    "pmcid": "PMC1234567",
                }
            ]
        }

        respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
            return_value=Response(200, json=mock_response)
        )

        pmcid = await tool.get_pmc_id("12345678")
        assert pmcid == "PMC1234567"

    @respx.mock
    @pytest.mark.asyncio
    async def test_get_pmc_id_not_in_pmc(self, tool: PubMedTool) -> None:
        """Should return None if paper not in PMC."""
        mock_response = {
            "records": [
                {
                    "pmid": "12345678",
                    # No pmcid means not in PMC
                }
            ]
        }

        respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
            return_value=Response(200, json=mock_response)
        )

        pmcid = await tool.get_pmc_id("12345678")
        assert pmcid is None

    @respx.mock
    @pytest.mark.asyncio
    async def test_get_fulltext_success(self, tool: PubMedTool) -> None:
        """Should retrieve full text for PMC papers."""
        # Mock BioC API response
        mock_bioc = {
            "documents": [
                {
                    "passages": [
                        {
                            "infons": {"section_type": "INTRO"},
                            "text": "Introduction text here.",
                        },
                        {
                            "infons": {"section_type": "METHODS"},
                            "text": "Methods description here.",
                        },
                        {
                            "infons": {"section_type": "RESULTS"},
                            "text": "Results summary here.",
                        },
                        {
                            "infons": {"section_type": "DISCUSS"},
                            "text": "Discussion and conclusions.",
                        },
                    ]
                }
            ]
        }

        respx.get(
            "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
        ).mock(return_value=Response(200, json=mock_bioc))

        fulltext = await tool.get_fulltext("12345678")

        assert fulltext is not None
        assert "Introduction text here" in fulltext
        assert "Methods description here" in fulltext
        assert "Results summary here" in fulltext

    @respx.mock
    @pytest.mark.asyncio
    async def test_get_fulltext_not_available(self, tool: PubMedTool) -> None:
        """Should return None if full text not available."""
        respx.get(
            "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/99999999/unicode"
        ).mock(return_value=Response(404))

        fulltext = await tool.get_fulltext("99999999")
        assert fulltext is None

    @respx.mock
    @pytest.mark.asyncio
    async def test_get_fulltext_structured(self, tool: PubMedTool) -> None:
        """Should return structured sections dict."""
        mock_bioc = {
            "documents": [
                {
                    "passages": [
                        {"infons": {"section_type": "INTRO"}, "text": "Intro..."},
                        {"infons": {"section_type": "METHODS"}, "text": "Methods..."},
                        {"infons": {"section_type": "RESULTS"}, "text": "Results..."},
                        {"infons": {"section_type": "DISCUSS"}, "text": "Discussion..."},
                    ]
                }
            ]
        }

        respx.get(
            "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
        ).mock(return_value=Response(200, json=mock_bioc))

        sections = await tool.get_fulltext_structured("12345678")

        assert sections is not None
        assert "introduction" in sections
        assert "methods" in sections
        assert "results" in sections
        assert "discussion" in sections

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_with_fulltext_enabled(self) -> None:
        """Search should include full text when tool is configured for it."""
        # Create tool WITH full-text enabled via constructor
        tool = PubMedTool(include_fulltext=True)

        # Mock esearch
        respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi").mock(
            return_value=Response(
                200, json={"esearchresult": {"idlist": ["12345678"]}}
            )
        )

        # Mock efetch (abstract)
        mock_xml = """
        <PubmedArticleSet>
            <PubmedArticle>
                <MedlineCitation>
                    <PMID>12345678</PMID>
                    <Article>
                        <ArticleTitle>Test Paper</ArticleTitle>
                        <Abstract><AbstractText>Short abstract.</AbstractText></Abstract>
                        <AuthorList><Author><LastName>Smith</LastName></Author></AuthorList>
                    </Article>
                </MedlineCitation>
            </PubmedArticle>
        </PubmedArticleSet>
        """
        respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi").mock(
            return_value=Response(200, text=mock_xml)
        )

        # Mock ID converter
        respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
            return_value=Response(
                200, json={"records": [{"pmid": "12345678", "pmcid": "PMC1234567"}]}
            )
        )

        # Mock BioC full text
        mock_bioc = {
            "documents": [
                {
                    "passages": [
                        {"infons": {"section_type": "INTRO"}, "text": "Full intro..."},
                    ]
                }
            ]
        }
        respx.get(
            "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
        ).mock(return_value=Response(200, json=mock_bioc))

        # NOTE: no include_fulltext param - it's set via the constructor
        results = await tool.search("test", max_results=1)

        assert len(results) == 1
        # Full text should be appended or replace the abstract
        assert "Full intro" in results[0].content or "Short abstract" in results[0].content
```

---

### Step 2: Implement Full-Text Methods

**File**: `src/tools/pubmed.py` (additions to the existing class)

```python
# Add these methods to the PubMedTool class

async def get_pmc_id(self, pmid: str) -> str | None:
    """
    Convert PMID to PMCID for full-text access.

    Args:
        pmid: PubMed ID

    Returns:
        PMCID if the paper is in PMC, None otherwise
    """
    url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
    params = {"ids": pmid, "format": "json"}

    async with httpx.AsyncClient(timeout=30.0) as client:
        try:
            response = await client.get(url, params=params)
            response.raise_for_status()
            data = response.json()

            records = data.get("records", [])
            if records and records[0].get("pmcid"):
                return records[0]["pmcid"]
            return None

        except httpx.HTTPError:
            return None


async def get_fulltext(self, pmid: str) -> str | None:
    """
    Get full text for a PubMed paper via the BioC API.

    Only works for open-access papers in PubMed Central.

    Args:
        pmid: PubMed ID

    Returns:
        Full text as a string, or None if not available
    """
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"

    async with httpx.AsyncClient(timeout=60.0) as client:
        try:
            response = await client.get(url)
            if response.status_code == 404:
                return None
            response.raise_for_status()
            data = response.json()

            # Extract text from all passages
            documents = data.get("documents", [])
            if not documents:
                return None

            passages = documents[0].get("passages", [])
            text_parts = [p.get("text", "") for p in passages if p.get("text")]

            return "\n\n".join(text_parts) if text_parts else None

        except httpx.HTTPError:
            return None


async def get_fulltext_structured(self, pmid: str) -> dict[str, str] | None:
    """
    Get structured full text with sections.

    Args:
        pmid: PubMed ID

    Returns:
        Dict mapping section names to text, or None if not available
    """
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"

    async with httpx.AsyncClient(timeout=60.0) as client:
        try:
            response = await client.get(url)
            if response.status_code == 404:
                return None
            response.raise_for_status()
            data = response.json()

            documents = data.get("documents", [])
            if not documents:
                return None

            # Map section types to readable names
            section_map = {
                "INTRO": "introduction",
                "METHODS": "methods",
                "RESULTS": "results",
                "DISCUSS": "discussion",
                "CONCL": "conclusion",
                "ABSTRACT": "abstract",
            }

            sections: dict[str, list[str]] = {}
            for passage in documents[0].get("passages", []):
                section_type = passage.get("infons", {}).get("section_type", "other")
                section_name = section_map.get(section_type, "other")
                text = passage.get("text", "")

                if text:
                    if section_name not in sections:
                        sections[section_name] = []
                    sections[section_name].append(text)

            # Join multiple passages per section
            return {k: "\n\n".join(v) for k, v in sections.items()}

        except httpx.HTTPError:
            return None
```

---

### Step 3: Update Constructor and Search Method

Add a full-text flag to the constructor and update search to use it:

```python
class PubMedTool:
    """Search tool for PubMed/NCBI."""

    def __init__(
        self,
        api_key: str | None = None,
        include_fulltext: bool = False,  # NEW CONSTRUCTOR PARAM
    ) -> None:
        self.api_key = api_key or settings.ncbi_api_key
        if self.api_key == "your-ncbi-key-here":
            self.api_key = None
        self._last_request_time = 0.0
        self.include_fulltext = include_fulltext  # Stored for use in search()

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """
        Search PubMed and return evidence.

        Note: full-text enrichment is controlled by the constructor parameter,
        not a method parameter, because SearchHandler doesn't pass extra args.
        """
        # ... existing search logic ...

        evidence_list = self._parse_pubmed_xml(fetch_resp.text)

        # Optionally enrich with full text (if configured at construction)
        if self.include_fulltext:
            evidence_list = await self._enrich_with_fulltext(evidence_list)

        return evidence_list

    async def _enrich_with_fulltext(
        self, evidence_list: list[Evidence]
    ) -> list[Evidence]:
        """Attempt to add full text to evidence items."""
        enriched = []

        for evidence in evidence_list:
            # Extract PMID from the URL
            url = evidence.citation.url
            pmid = url.rstrip("/").split("/")[-1] if url else None

            if pmid:
                fulltext = await self.get_fulltext(pmid)
                if fulltext:
                    # Replace abstract with full text (truncated)
                    evidence = Evidence(
                        content=fulltext[:8000],  # Larger limit for full text
                        citation=evidence.citation,
                        relevance=evidence.relevance,
                        metadata={
                            **evidence.metadata,
                            "has_fulltext": True,
                        },
                    )

            enriched.append(evidence)

        return enriched
```

---

## Demo Script

**File**: `examples/pubmed_fulltext_demo.py`

```python
#!/usr/bin/env python3
"""Demo script to verify PubMed full-text retrieval."""

import asyncio

from src.tools.pubmed import PubMedTool


async def main():
    """Run PubMed full-text demo."""
    tool = PubMedTool()

    print("=" * 60)
    print("PubMed Full-Text Demo")
    print("=" * 60)

    # Test 1: Convert PMID to PMCID
    print("\n[Test 1] Converting PMID to PMCID...")
    # Use a known open-access paper
    test_pmid = "34450029"  # Example: COVID-related open-access paper
    pmcid = await tool.get_pmc_id(test_pmid)
    print(f"PMID {test_pmid} -> PMCID: {pmcid or 'Not in PMC'}")

    # Test 2: Get full text
    print("\n[Test 2] Fetching full text...")
    if pmcid:
        fulltext = await tool.get_fulltext(test_pmid)
        if fulltext:
            print(f"Full text length: {len(fulltext)} characters")
            print(f"Preview: {fulltext[:500]}...")
        else:
            print("Full text not available")

    # Test 3: Get structured sections
    print("\n[Test 3] Fetching structured sections...")
    if pmcid:
        sections = await tool.get_fulltext_structured(test_pmid)
        if sections:
            print("Available sections:")
            for section, text in sections.items():
                print(f"  - {section}: {len(text)} chars")
        else:
            print("Structured text not available")

    # Test 4: Search with full-text enrichment. Per the architecture
    # decision, this is enabled via the constructor - search() takes
    # no include_fulltext argument.
    print("\n[Test 4] Search with full-text enrichment...")
    fulltext_tool = PubMedTool(include_fulltext=True)
    results = await fulltext_tool.search(
        "metformin cancer open access",
        max_results=3,
    )

    for i, evidence in enumerate(results, 1):
        has_ft = evidence.metadata.get("has_fulltext", False)
        print(f"\n--- Result {i} ---")
        print(f"Title: {evidence.citation.title}")
        print(f"Has Full Text: {has_ft}")
        print(f"Content Length: {len(evidence.content)} chars")

    print("\n" + "=" * 60)
    print("Demo complete!")


if __name__ == "__main__":
    asyncio.run(main())
```

---

## Verification Checklist

### Unit Tests
```bash
# Run the full-text tests
uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v

# Run all PubMed tests
uv run pytest tests/unit/tools/test_pubmed.py -v

# Expected: all tests pass
```

### Integration Test (Manual)
```bash
# Run the demo against the real API
uv run python examples/pubmed_fulltext_demo.py

# Expected: real full text from PMC papers
```

### Full Test Suite
```bash
make check
# Expected: all tests pass, mypy clean
```

---

## Success Criteria

1. **ID conversion works**: PMID -> PMCID conversion succeeds
2. **Full-text retrieval works**: the BioC API returns paper text
3. **Structured sections work**: intro/methods/results/discussion can be fetched separately
4. **Search integration works**: a tool constructed with `include_fulltext=True` enriches results
5. **No regressions**: existing tests still pass
6. **Graceful degradation**: non-PMC papers still return abstracts

---

## Notes

- Only ~30% of PubMed papers have full text in PMC
- The BioC API has no documented rate limit, but be respectful
- Full text can be very long - truncate appropriately
- Consider caching full-text responses (they don't change)
- The timeout should be longer for full text (60s vs 30s)
docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md
ADDED
@@ -0,0 +1,540 @@
# Phase 17: Rate Limiting with `limits` Library

**Priority**: P0 CRITICAL - Prevents API blocks
**Effort**: ~1 hour
**Dependencies**: None

---

## CRITICAL: Async Safety Requirements

**WARNING**: The rate limiter MUST be async-safe. Blocking the event loop will freeze:
- The Gradio UI
- All parallel searches
- The orchestrator

**Rules**:
1. **NEVER use `time.sleep()`** - always use `await asyncio.sleep()`
2. **NEVER use blocking while loops** - use async-aware polling
3. **The `limits` library check is synchronous** - wrap it carefully

The implementation below uses a polling pattern that:
- Checks the limit (synchronous, fast)
- If exceeded, `await asyncio.sleep()` (non-blocking)
- Retries the check

**Alternative**: If `limits` proves problematic, use `aiolimiter`, which is pure-async.

---
|
| 29 |
+
|
| 30 |
+
## Overview
|
| 31 |
+
|
| 32 |
+
Replace naive `asyncio.sleep` rate limiting with proper rate limiter using the `limits` library, which provides:
|
| 33 |
+
- Moving window rate limiting
|
| 34 |
+
- Per-API configurable limits
|
| 35 |
+
- Thread-safe storage
|
| 36 |
+
- Already used in reference repo
|
| 37 |
+
|
| 38 |
+
**Why This Matters?**
|
| 39 |
+
- NCBI will block us without proper rate limiting (3/sec without key, 10/sec with)
|
| 40 |
+
- Current implementation only has simple sleep delay
|
| 41 |
+
- Need coordinated limits across all PubMed calls
|
| 42 |
+
- Professional-grade rate limiting prevents production issues
|
| 43 |
+
|
| 44 |
+
---
|
| 45 |
+
|
| 46 |
+
## Current State
|
| 47 |
+
|
| 48 |
+
### What We Have (`src/tools/pubmed.py:20-21, 34-41`)
|
| 49 |
+
|
| 50 |
+
```python
|
| 51 |
+
RATE_LIMIT_DELAY = 0.34 # ~3 requests/sec without API key
|
| 52 |
+
|
| 53 |
+
async def _rate_limit(self) -> None:
|
| 54 |
+
"""Enforce NCBI rate limiting."""
|
| 55 |
+
loop = asyncio.get_running_loop()
|
| 56 |
+
now = loop.time()
|
| 57 |
+
elapsed = now - self._last_request_time
|
| 58 |
+
if elapsed < self.RATE_LIMIT_DELAY:
|
| 59 |
+
await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed)
|
| 60 |
+
self._last_request_time = loop.time()
|
| 61 |
+
```
|
| 62 |
+
|
| 63 |
+
### Problems
|
| 64 |
+
|
| 65 |
+
1. **Not shared across instances**: Each `PubMedTool()` has its own counter
|
| 66 |
+
2. **Simple delay vs moving window**: Doesn't handle bursts properly
|
| 67 |
+
3. **Hardcoded rate**: Doesn't adapt to API key presence
|
| 68 |
+
4. **No backoff on 429**: Just retries blindly
|
| 69 |
+
|
| 70 |
+
---

## TDD Implementation Plan

### Step 1: Add Dependency

**File**: `pyproject.toml`

```toml
dependencies = [
    # ... existing deps ...
    "limits>=3.0",
]
```

Then run:

```bash
uv sync
```

---

### Step 2: Write the Tests First

**File**: `tests/unit/tools/test_rate_limiting.py`

```python
"""Tests for rate limiting functionality."""

import asyncio
import time

import pytest

from src.tools.rate_limiter import (
    RateLimiter,
    get_pubmed_limiter,
    reset_pubmed_limiter,
)


class TestRateLimiter:
    """Test suite for rate limiter."""

    def test_create_limiter_without_api_key(self) -> None:
        """Should create 3/sec limiter without API key."""
        limiter = RateLimiter(rate="3/second")
        assert limiter.rate == "3/second"

    def test_create_limiter_with_api_key(self) -> None:
        """Should create 10/sec limiter with API key."""
        limiter = RateLimiter(rate="10/second")
        assert limiter.rate == "10/second"

    @pytest.mark.asyncio
    async def test_limiter_allows_requests_under_limit(self) -> None:
        """Should allow requests under the rate limit."""
        limiter = RateLimiter(rate="10/second")

        # 3 requests should all succeed immediately
        for _ in range(3):
            allowed = await limiter.acquire()
            assert allowed is True

    @pytest.mark.asyncio
    async def test_limiter_blocks_when_exceeded(self) -> None:
        """Should wait when rate limit exceeded."""
        limiter = RateLimiter(rate="2/second")

        # First 2 should be instant
        await limiter.acquire()
        await limiter.acquire()

        # Third should block until the window frees up
        start = time.monotonic()
        await limiter.acquire()
        elapsed = time.monotonic() - start

        # Should have waited a noticeable fraction of the 1-second window
        assert elapsed >= 0.3

    @pytest.mark.asyncio
    async def test_limiter_resets_after_window(self) -> None:
        """Rate limit should reset after time window."""
        limiter = RateLimiter(rate="5/second")

        # Use up the limit
        for _ in range(5):
            await limiter.acquire()

        # Wait for window to pass
        await asyncio.sleep(1.1)

        # Should be allowed again
        start = time.monotonic()
        await limiter.acquire()
        elapsed = time.monotonic() - start

        assert elapsed < 0.1  # Should be nearly instant


class TestGetPubmedLimiter:
    """Test PubMed-specific limiter factory."""

    def setup_method(self) -> None:
        """Reset the singleton so each test sees a fresh limiter."""
        reset_pubmed_limiter()

    def test_limiter_without_api_key(self) -> None:
        """Should return 3/sec limiter without key."""
        limiter = get_pubmed_limiter(api_key=None)
        assert "3" in limiter.rate

    def test_limiter_with_api_key(self) -> None:
        """Should return 10/sec limiter with key."""
        limiter = get_pubmed_limiter(api_key="my-api-key")
        assert "10" in limiter.rate

    def test_limiter_is_singleton(self) -> None:
        """Same API key should return same limiter instance."""
        limiter1 = get_pubmed_limiter(api_key="key1")
        limiter2 = get_pubmed_limiter(api_key="key1")
        assert limiter1 is limiter2

    def test_different_keys_share_one_limiter(self) -> None:
        """Different API keys still share one limiter.

        We are rate-limiting against the same NCBI endpoint, so all
        PubMed calls must draw from the same budget.
        """
        limiter1 = get_pubmed_limiter(api_key="key1")
        limiter2 = get_pubmed_limiter(api_key="key2")
        assert limiter1 is limiter2  # Shared NCBI rate limit
```

---

### Step 3: Create Rate Limiter Module

**File**: `src/tools/rate_limiter.py`

```python
"""Rate limiting utilities using the limits library."""

import asyncio
from typing import ClassVar

from limits import RateLimitItem, parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter


class RateLimiter:
    """
    Async-compatible rate limiter using the limits library.

    Uses a moving window algorithm for smooth rate limiting.
    """

    def __init__(self, rate: str) -> None:
        """
        Initialize rate limiter.

        Args:
            rate: Rate string like "3/second" or "10/second"
        """
        self.rate = rate
        self._storage = MemoryStorage()
        self._limiter = MovingWindowRateLimiter(self._storage)
        self._rate_limit: RateLimitItem = parse(rate)
        self._identity = "default"  # Single identity for shared limiting

    async def acquire(self, wait: bool = True) -> bool:
        """
        Acquire permission to make a request.

        ASYNC-SAFE: Uses asyncio.sleep(), never time.sleep().
        The polling pattern allows other coroutines to run while waiting.

        Args:
            wait: If True, wait until allowed. If False, return immediately.

        Returns:
            True if allowed, False if not (only when wait=False)
        """
        while True:
            # Check if we can proceed (synchronous, fast - ~microseconds)
            if self._limiter.hit(self._rate_limit, self._identity):
                return True

            if not wait:
                return False

            # CRITICAL: Use asyncio.sleep(), NOT time.sleep().
            # This yields control to the event loop, allowing other
            # coroutines (UI, parallel searches) to run.
            await asyncio.sleep(0.1)

    def reset(self) -> None:
        """Reset the rate limiter (for testing)."""
        self._storage.reset()


# Singleton limiter for PubMed/NCBI
_pubmed_limiter: RateLimiter | None = None


def get_pubmed_limiter(api_key: str | None = None) -> RateLimiter:
    """
    Get the shared PubMed rate limiter.

    Rate depends on whether an API key is provided:
    - Without key: 3 requests/second
    - With key: 10 requests/second

    Args:
        api_key: NCBI API key (optional)

    Returns:
        Shared RateLimiter instance
    """
    global _pubmed_limiter

    if _pubmed_limiter is None:
        rate = "10/second" if api_key else "3/second"
        _pubmed_limiter = RateLimiter(rate)

    return _pubmed_limiter


def reset_pubmed_limiter() -> None:
    """Reset the PubMed limiter (for testing)."""
    global _pubmed_limiter
    _pubmed_limiter = None


# Factory for other APIs
class RateLimiterFactory:
    """Factory for creating/getting rate limiters for different APIs."""

    _limiters: ClassVar[dict[str, RateLimiter]] = {}

    @classmethod
    def get(cls, api_name: str, rate: str) -> RateLimiter:
        """
        Get or create a rate limiter for an API.

        Args:
            api_name: Unique identifier for the API
            rate: Rate limit string (e.g., "10/second")

        Returns:
            RateLimiter instance (shared for same api_name)
        """
        if api_name not in cls._limiters:
            cls._limiters[api_name] = RateLimiter(rate)
        return cls._limiters[api_name]

    @classmethod
    def reset_all(cls) -> None:
        """Reset all limiters (for testing)."""
        cls._limiters.clear()
```

---

### Step 4: Update PubMed Tool

**File**: `src/tools/pubmed.py` (replace rate limiting code)

```python
# Replace imports and rate limiting

from src.tools.rate_limiter import get_pubmed_limiter


class PubMedTool:
    """Search tool for PubMed/NCBI."""

    BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
    HTTP_TOO_MANY_REQUESTS = 429

    def __init__(self, api_key: str | None = None) -> None:
        self.api_key = api_key or settings.ncbi_api_key
        if self.api_key == "your-ncbi-key-here":
            self.api_key = None
        # Use shared rate limiter
        self._limiter = get_pubmed_limiter(self.api_key)

    async def _rate_limit(self) -> None:
        """Enforce NCBI rate limiting using shared limiter."""
        await self._limiter.acquire()

    # ... rest of class unchanged ...
```

---

### Step 5: Add Rate Limiters for Other APIs

**File**: `src/tools/clinicaltrials.py` (optional)

```python
from src.tools.rate_limiter import RateLimiterFactory


class ClinicalTrialsTool:
    def __init__(self) -> None:
        # ClinicalTrials.gov doesn't document limits, so be conservative
        self._limiter = RateLimiterFactory.get("clinicaltrials", "5/second")

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        await self._limiter.acquire()
        # ... rest of method ...
```

**File**: `src/tools/europepmc.py` (optional)

```python
from src.tools.rate_limiter import RateLimiterFactory


class EuropePMCTool:
    def __init__(self) -> None:
        # Europe PMC is generous, but still be respectful
        self._limiter = RateLimiterFactory.get("europepmc", "10/second")

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        await self._limiter.acquire()
        # ... rest of method ...
```

---

## Demo Script

**File**: `examples/rate_limiting_demo.py`

```python
#!/usr/bin/env python3
"""Demo script to verify rate limiting works correctly."""

import asyncio
import time

from src.tools.pubmed import PubMedTool
from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter, reset_pubmed_limiter


async def test_basic_limiter():
    """Test basic rate limiter behavior."""
    print("=" * 60)
    print("Rate Limiting Demo")
    print("=" * 60)

    # Test 1: Basic limiter
    print("\n[Test 1] Testing 3/second limiter...")
    limiter = RateLimiter("3/second")

    start = time.monotonic()
    for i in range(6):
        await limiter.acquire()
        elapsed = time.monotonic() - start
        print(f"  Request {i + 1} at {elapsed:.2f}s")

    total = time.monotonic() - start
    print(f"  Total time for 6 requests: {total:.2f}s (expected ~1s: 3 instantly, 3 once the window slides)")


async def test_pubmed_limiter():
    """Test PubMed-specific limiter."""
    print("\n[Test 2] Testing PubMed limiter (shared)...")

    reset_pubmed_limiter()  # Clean state

    # Without API key: 3/sec
    limiter = get_pubmed_limiter(api_key=None)
    print(f"  Rate without key: {limiter.rate}")

    # Multiple tools should share the same limiter
    tool1 = PubMedTool()
    tool2 = PubMedTool()

    # Verify they share the limiter
    print(f"  Tools share limiter: {tool1._limiter is tool2._limiter}")


async def test_concurrent_requests():
    """Test rate limiting under concurrent load."""
    print("\n[Test 3] Testing concurrent request limiting...")

    limiter = RateLimiter("5/second")

    async def make_request(i: int):
        await limiter.acquire()
        return time.monotonic()

    start = time.monotonic()
    # Launch 10 concurrent requests
    tasks = [make_request(i) for i in range(10)]
    times = await asyncio.gather(*tasks)

    # Calculate distribution
    relative_times = [t - start for t in times]
    print(f"  Request times: {[f'{t:.2f}s' for t in sorted(relative_times)]}")

    total = max(relative_times)
    print(f"  All 10 requests completed in {total:.2f}s (expected ~1s: 5 instantly, 5 once the window slides)")


async def main():
    await test_basic_limiter()
    await test_pubmed_limiter()
    await test_concurrent_requests()

    print("\n" + "=" * 60)
    print("Demo complete!")


if __name__ == "__main__":
    asyncio.run(main())
```

---

## Verification Checklist

### Unit Tests
```bash
# Run rate limiting tests
uv run pytest tests/unit/tools/test_rate_limiting.py -v

# Expected: All tests pass
```

### Integration Test (Manual)
```bash
# Run demo
uv run python examples/rate_limiting_demo.py

# Expected: Requests properly spaced
```

### Full Test Suite
```bash
make check
# Expected: All tests pass, mypy clean
```

---

## Success Criteria

1. **`limits` library installed**: dependency added to `pyproject.toml`
2. **`RateLimiter` class works**: can create and use limiters
3. **PubMed uses new limiter**: shared limiter across instances
4. **Rate adapts to API key**: 3/sec without, 10/sec with
5. **Concurrent requests handled**: multiple async requests properly queued
6. **No regressions**: all existing tests pass

---

## API Rate Limit Reference

| API | Without Key | With Key |
|-----|-------------|----------|
| PubMed/NCBI | 3/sec | 10/sec |
| ClinicalTrials.gov | Undocumented (~5/sec safe) | N/A |
| Europe PMC | ~10-20/sec (generous) | N/A |
| OpenAlex | ~10/sec, 100k/day | Faster with `mailto` (polite pool) |

---
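The per-API budgets in this table can be wired through a get-or-create registry, as `RateLimiterFactory` does above; a stdlib-only sketch of that pattern (plain dicts stand in for real `RateLimiter` objects):

```python
# Conservative defaults lifted from the reference table
DEFAULT_RATES = {
    "pubmed": "3/second",  # 10/second once an NCBI key is configured
    "clinicaltrials": "5/second",
    "europepmc": "10/second",
}

_REGISTRY: dict[str, dict] = {}


def get_limiter(api_name: str) -> dict:
    """Return the single shared limiter for an API, creating it on first use."""
    if api_name not in _REGISTRY:
        _REGISTRY[api_name] = {"api": api_name, "rate": DEFAULT_RATES[api_name]}
    return _REGISTRY[api_name]


# Every caller for the same API shares one limiter (and thus one budget)
assert get_limiter("pubmed") is get_limiter("pubmed")
assert get_limiter("pubmed") is not get_limiter("europepmc")
assert get_limiter("clinicaltrials")["rate"] == "5/second"
```

The get-or-create shape is what guarantees that two tool instances hitting the same API cannot accidentally double their shared budget.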

## Notes

- The `limits` library uses a moving window algorithm (fairer than a fixed window)
- The singleton pattern ensures all PubMed calls share the limit
- The factory pattern allows easy extension to other APIs
- Consider adding 429 response detection + exponential backoff
- In production, consider Redis storage for distributed rate limiting
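The 429-plus-backoff idea noted above can be sketched with the stdlib alone (all names here are hypothetical; real code would raise on an httpx 429 response):

```python
import asyncio
import random


class TooManyRequests(Exception):
    """Raised when the API responds with HTTP 429 (hypothetical)."""


async def fetch_with_backoff(fetch, max_retries: int = 4, base_delay: float = 0.5):
    """Retry `fetch()` with exponential backoff plus jitter on 429."""
    for attempt in range(max_retries):
        try:
            return await fetch()
        except TooManyRequests:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the 429 to the caller
            # base, 2*base, 4*base, ... plus jitter to avoid a thundering herd
            delay = base_delay * (2**attempt) + random.uniform(0, base_delay / 5)
            await asyncio.sleep(delay)


# Simulate an endpoint that returns 429 twice, then succeeds
calls = {"n": 0}


async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TooManyRequests()
    return "ok"


result = asyncio.run(fetch_with_backoff(flaky, base_delay=0.01))
assert result == "ok" and calls["n"] == 3
```

Backoff complements the limiter rather than replacing it: the limiter prevents most 429s, and backoff handles the ones that slip through anyway.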

docs/brainstorming/implementation/README.md ADDED
@@ -0,0 +1,143 @@
# Implementation Plans

TDD implementation plans based on the brainstorming documents. Each phase is a self-contained vertical slice with tests, implementation, and demo scripts.

---

## Prerequisites (COMPLETED)

The following foundational changes have been implemented to support all three phases:

| Change | File | Status |
|--------|------|--------|
| Add `"openalex"` to `SourceName` | `src/utils/models.py:9` | ✅ Done |
| Add `metadata` field to `Evidence` | `src/utils/models.py:39-42` | ✅ Done |
| Export all tools from `__init__.py` | `src/tools/__init__.py` | ✅ Done |

All 110 tests pass after these changes.

---

## Priority Order

| Phase | Name | Priority | Effort | Value |
|-------|------|----------|--------|-------|
| **17** | Rate Limiting | P0 CRITICAL | 1 hour | Stability |
| **15** | OpenAlex | HIGH | 2-3 hours | Very High |
| **16** | PubMed Full-Text | MEDIUM | 3 hours | High |

**Recommended implementation order**: 17 → 15 → 16

---

## Phase 15: OpenAlex Integration

**File**: [15_PHASE_OPENALEX.md](./15_PHASE_OPENALEX.md)

Add OpenAlex as a 4th data source for:
- Citation networks (who cites whom)
- Concept tagging (semantic discovery)
- 209M+ scholarly works
- Free, no API key required

**Quick Start**:
```bash
# Create the tool
touch src/tools/openalex.py
touch tests/unit/tools/test_openalex.py

# Run tests first (TDD)
uv run pytest tests/unit/tools/test_openalex.py -v

# Demo
uv run python examples/openalex_demo.py
```

---

## Phase 16: PubMed Full-Text

**File**: [16_PHASE_PUBMED_FULLTEXT.md](./16_PHASE_PUBMED_FULLTEXT.md)

Add full-text retrieval via the BioC API for:
- Complete paper text (not just abstracts)
- Structured sections (intro, methods, results)
- Better evidence for LLM synthesis

**Quick Start**:
```bash
# Add methods to existing pubmed.py
# Tests in test_pubmed_fulltext.py

# Run tests
uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v

# Demo
uv run python examples/pubmed_fulltext_demo.py
```

---

## Phase 17: Rate Limiting

**File**: [17_PHASE_RATE_LIMITING.md](./17_PHASE_RATE_LIMITING.md)

Replace naive sleep-based rate limiting with the `limits` library for:
- Moving window algorithm
- Shared limits across instances
- Configurable per-API rates
- Production-grade stability

**Quick Start**:
```bash
# Add dependency
uv add limits

# Create module
touch src/tools/rate_limiter.py
touch tests/unit/tools/test_rate_limiting.py

# Run tests
uv run pytest tests/unit/tools/test_rate_limiting.py -v

# Demo
uv run python examples/rate_limiting_demo.py
```

---

## TDD Workflow

Each implementation doc follows this pattern:

1. **Write tests first** - define expected behavior
2. **Run tests** - verify they fail (red)
3. **Implement** - write minimal code to pass
4. **Run tests** - verify they pass (green)
5. **Refactor** - clean up if needed
6. **Demo** - verify end-to-end with real APIs
7. **`make check`** - ensure no regressions

---

## Related Brainstorming Docs

These implementation plans are derived from:

- [00_ROADMAP_SUMMARY.md](../00_ROADMAP_SUMMARY.md) - Priority overview
- [01_PUBMED_IMPROVEMENTS.md](../01_PUBMED_IMPROVEMENTS.md) - PubMed details
- [02_CLINICALTRIALS_IMPROVEMENTS.md](../02_CLINICALTRIALS_IMPROVEMENTS.md) - CT.gov details
- [03_EUROPEPMC_IMPROVEMENTS.md](../03_EUROPEPMC_IMPROVEMENTS.md) - Europe PMC details
- [04_OPENALEX_INTEGRATION.md](../04_OPENALEX_INTEGRATION.md) - OpenAlex integration

---

## Future Phases (Not Yet Documented)

Based on brainstorming, these could be added later:

- **Phase 18**: ClinicalTrials.gov Results Retrieval
- **Phase 19**: Europe PMC Annotations API
- **Phase 20**: Drug Name Normalization (RxNorm)
- **Phase 21**: Citation Network Queries (OpenAlex)
- **Phase 22**: Semantic Search with Embeddings
src/tools/__init__.py CHANGED
```diff
@@ -1,8 +1,16 @@
 """Search tools package."""
 
 from src.tools.base import SearchTool
+from src.tools.clinicaltrials import ClinicalTrialsTool
+from src.tools.europepmc import EuropePMCTool
 from src.tools.pubmed import PubMedTool
 from src.tools.search_handler import SearchHandler
 
-# Re-export
-__all__ = [
+# Re-export all search tools
+__all__ = [
+    "ClinicalTrialsTool",
+    "EuropePMCTool",
+    "PubMedTool",
+    "SearchHandler",
+    "SearchTool",
+]
```
src/utils/models.py CHANGED
```diff
@@ -6,7 +6,7 @@ from typing import Any, ClassVar, Literal
 from pydantic import BaseModel, Field
 
 # Centralized source type - add new sources here (e.g., new databases)
-SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint"]
+SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
 
 
 class Citation(BaseModel):
@@ -36,6 +36,10 @@ class Evidence(BaseModel):
     content: str = Field(min_length=1, description="The actual text content")
     citation: Citation
     relevance: float = Field(default=0.0, ge=0.0, le=1.0, description="Relevance score 0-1")
+    metadata: dict[str, Any] = Field(
+        default_factory=dict,
+        description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
+    )
 
     model_config = {"frozen": True}
 
```
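One subtlety of the new `metadata` field: `frozen=True` blocks attribute reassignment, but the dict value itself stays mutable. A stdlib dataclass sketch (hypothetical `EvidenceSketch`, mirroring the pydantic pattern rather than importing it) demonstrates the behaviour:

```python
from dataclasses import FrozenInstanceError, dataclass, field


@dataclass(frozen=True)
class EvidenceSketch:
    """Toy stand-in for the pydantic Evidence model."""

    content: str
    relevance: float = 0.0
    metadata: dict = field(default_factory=dict)


ev = EvidenceSketch(content="abstract text", metadata={"cited_by_count": 42})

# Reassigning an attribute on a frozen instance raises
try:
    ev.relevance = 1.0  # type: ignore[misc]
    blocked = False
except FrozenInstanceError:
    blocked = True
assert blocked

# ...but the metadata dict itself can still be mutated in place
ev.metadata["is_open_access"] = True
assert ev.metadata == {"cited_by_count": 42, "is_open_access": True}
```

Consumers of `Evidence.metadata` should therefore treat it as read-only by convention; freezing the model alone does not enforce it.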