VibecoderMcSwaggins committed on
Commit
9286db5
·
1 Parent(s): be7e1a2

feat: add roadmap summary and detailed improvement plans for data sources


- Introduced new documentation files outlining the current state and future improvements for DeepCritical's data sources: PubMed, ClinicalTrials.gov, Europe PMC, and OpenAlex.
- Each document includes sections on current implementation, strengths, limitations, recommended improvements, and integration opportunities.
- Added a comprehensive roadmap summary to guide future maintainers.

docs/brainstorming/00_ROADMAP_SUMMARY.md ADDED
@@ -0,0 +1,194 @@
1
+ # DeepCritical Data Sources: Roadmap Summary
2
+
3
+ **Created**: 2024-11-27
4
+ **Purpose**: Future maintainability and hackathon continuation
5
+
6
+ ---
7
+
8
+ ## Current State
9
+
10
+ ### Working Tools
11
+
12
+ | Tool | Status | Data Quality |
13
+ |------|--------|--------------|
14
+ | PubMed | ✅ Works | Good (abstracts only) |
15
+ | ClinicalTrials.gov | ✅ Works | Good (filtered for interventional) |
16
+ | Europe PMC | ✅ Works | Good (includes preprints) |
17
+
18
+ ### Removed Tools
19
+
20
+ | Tool | Status | Reason |
21
+ |------|--------|--------|
22
+ | bioRxiv | ❌ Removed | No search API - only date/DOI lookup |
23
+
24
+ ---
25
+
26
+ ## Priority Improvements
27
+
28
+ ### P0: Critical (Do First)
29
+
30
+ 1. **Add Rate Limiting to PubMed**
31
+ - NCBI will block us without it
32
+ - Use `limits` library (see reference repo)
33
+    - 3 requests/sec without an API key, 10/sec with one (see the sketch below)
34
+
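+ A minimal sketch of this limiter, assuming the async `limits` API (`limits.aio`) with an in-memory store; `acquire_pubmed_slot` is an illustrative name, not an existing function in the codebase:
+
+ ```python
+ import asyncio
+ import os
+
+ from limits import parse
+ from limits.aio.storage import MemoryStorage
+ from limits.aio.strategies import MovingWindowRateLimiter
+
+ limiter = MovingWindowRateLimiter(MemoryStorage())
+ # NCBI policy: 3 requests/sec without an API key, 10/sec with one
+ RATE = parse("10/second" if os.getenv("NCBI_API_KEY") else "3/second")
+
+ async def acquire_pubmed_slot() -> None:
+     """Wait until the shared NCBI rate budget allows another request."""
+     while not await limiter.hit(RATE, "ncbi"):
+         await asyncio.sleep(0.1)
+ ```
+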
35
+ ### P1: High Value, Medium Effort
36
+
37
+ 2. **Add OpenAlex as 4th Source**
38
+ - Citation network (huge for drug repurposing)
39
+ - Concept tagging (semantic discovery)
40
+ - Already implemented in reference repo
41
+ - Free, no API key
42
+
43
+ 3. **PubMed Full-Text via BioC**
44
+ - Get full paper text for PMC papers
45
+ - Already in reference repo
46
+
47
+ ### P2: Nice to Have
48
+
49
+ 4. **ClinicalTrials.gov Results**
50
+ - Get efficacy data from completed trials
51
+ - Requires more complex API calls
52
+
53
+ 5. **Europe PMC Annotations**
54
+ - Text-mined entities (genes, drugs, diseases)
55
+ - Automatic entity extraction
56
+
57
+ ---
58
+
59
+ ## Effort Estimates
60
+
61
+ | Improvement | Effort | Impact | Priority |
62
+ |-------------|--------|--------|----------|
63
+ | PubMed rate limiting | 1 hour | Stability | P0 |
64
+ | OpenAlex basic search | 2 hours | High | P1 |
65
+ | OpenAlex citations | 2 hours | Very High | P1 |
66
+ | PubMed full-text | 3 hours | Medium | P1 |
67
+ | CT.gov results | 4 hours | Medium | P2 |
68
+ | Europe PMC annotations | 3 hours | Medium | P2 |
69
+
70
+ ---
71
+
72
+ ## Architecture Decision
73
+
74
+ ### Option A: Keep Current + Add OpenAlex
75
+
76
+ ```
77
+ User Query
78
+      ↓
+      ┌─────────────────────┼─────────────────────┐
+      ↓                     ↓                     ↓
+   PubMed           ClinicalTrials          Europe PMC
+ (abstracts)        (trials only)           (preprints)
+      ↓                     ↓                     ↓
+      └─────────────────────┼─────────────────────┘
+                            ↓
+                     OpenAlex  ← NEW
+                (citations, concepts)
+                            ↓
+                      Orchestrator
+                            ↓
+                         Report
92
+ ```
93
+
94
+ **Pros**: Low risk, additive
95
+ **Cons**: More complexity, some overlap
96
+
97
+ ### Option B: OpenAlex as Primary
98
+
99
+ ```
100
+ User Query
101
+      ↓
+      ┌─────────────────────┼─────────────────────┐
+      ↓                     ↓                     ↓
+  OpenAlex          ClinicalTrials          Europe PMC
+  (primary          (trials only)           (full-text
+   search)                                   fallback)
+      ↓                     ↓                     ↓
+      └─────────────────────┼─────────────────────┘
+                            ↓
+                      Orchestrator
+                            ↓
+                         Report
113
+ ```
114
+
115
+ **Pros**: Simpler, citation network built-in
116
+ **Cons**: Lose some PubMed-specific features
117
+
118
+ ### Recommendation: Option A
119
+
120
+ Keep current architecture working, add OpenAlex incrementally.
121
+
122
+ ---
123
+
124
+ ## Quick Wins (Can Do Today)
125
+
126
+ 1. **Add `limits` to `pyproject.toml`**
127
+ ```toml
128
+ dependencies = [
129
+ "limits>=3.0",
130
+ ]
131
+ ```
132
+
133
+ 2. **Copy OpenAlex tool from reference repo**
134
+ - File: `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`
135
+ - Adapt to our `SearchTool` base class
136
+
137
+ 3. **Enable NCBI API Key**
138
+ - Add to `.env`: `NCBI_API_KEY=your_key`
139
+    - Raises the NCBI limit from 3 to 10 requests/second
140
+
141
+ ---
142
+
143
+ ## External Resources Worth Exploring
144
+
145
+ ### Python Libraries
146
+
147
+ | Library | For | Notes |
148
+ |---------|-----|-------|
149
+ | `limits` | Rate limiting | Used by reference repo |
150
+ | `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) |
151
+ | `metapub` | PubMed | Full-featured |
152
+ | `sentence-transformers` | Semantic search | For embeddings |
153
+
154
+ ### APIs Not Yet Used
155
+
156
+ | API | Provides | Effort |
157
+ |-----|----------|--------|
158
+ | RxNorm | Drug name normalization | Low |
159
+ | DrugBank | Drug targets/mechanisms | Medium (license) |
160
+ | UniProt | Protein data | Medium |
161
+ | ChEMBL | Bioactivity data | Medium |
162
+
163
+ ### RAG Tools (Future)
164
+
165
+ | Tool | Purpose |
166
+ |------|---------|
167
+ | [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers |
168
+ | [txtai](https://github.com/neuml/txtai) | Embeddings + search |
169
+ | [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings |
170
+
171
+ ---
172
+
173
+ ## Files in This Directory
174
+
175
+ | File | Contents |
176
+ |------|----------|
177
+ | `00_ROADMAP_SUMMARY.md` | This file |
178
+ | `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
179
+ | `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
180
+ | `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
181
+ | `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |
182
+
183
+ ---
184
+
185
+ ## For Future Maintainers
186
+
187
+ If you're picking this up after the hackathon:
188
+
189
+ 1. **Start with OpenAlex** - biggest bang for buck
190
+ 2. **Add rate limiting** - prevents API blocks
191
+ 3. **Don't bother with bioRxiv** - use Europe PMC instead
192
+ 4. **Reference repo is gold** - `reference_repos/DeepCritical/` has working implementations
193
+
194
+ Good luck! 🚀
docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md ADDED
@@ -0,0 +1,193 @@
1
+ # ClinicalTrials.gov Tool: Current State & Future Improvements
2
+
3
+ **Status**: Currently Implemented
4
+ **Priority**: High (Core Data Source for Drug Repurposing)
5
+
6
+ ---
7
+
8
+ ## Current Implementation
9
+
10
+ ### What We Have (`src/tools/clinicaltrials.py`)
11
+
12
+ - V2 API search via `clinicaltrials.gov/api/v2/studies`
13
+ - Filters: `INTERVENTIONAL` study type, `RECRUITING` status
14
+ - Returns: NCT ID, title, conditions, interventions, phase, status
15
+ - Query preprocessing via shared `query_utils.py`
16
+
17
+ ### Current Strengths
18
+
19
+ 1. **Good Filtering**: Already filtering for interventional + recruiting
20
+ 2. **V2 API**: Using the modern API (v1 deprecated)
21
+ 3. **Phase Info**: Extracting trial phases for drug development context
22
+
23
+ ### Current Limitations
24
+
25
+ 1. **No Outcome Data**: Missing primary/secondary outcomes
26
+ 2. **No Eligibility Criteria**: Missing inclusion/exclusion details
27
+ 3. **No Sponsor Info**: Missing who's running the trial
28
+ 4. **No Result Data**: For completed trials, no efficacy data
29
+ 5. **Limited Drug Mapping**: No integration with drug databases
30
+
31
+ ---
32
+
33
+ ## API Capabilities We're Not Using
34
+
35
+ ### Fields We Could Request
36
+
37
+ ```python
38
+ # Current fields
39
+ fields = ["NCTId", "BriefTitle", "Condition", "InterventionName", "Phase", "OverallStatus"]
40
+
41
+ # Additional valuable fields
42
+ additional_fields = [
43
+ "PrimaryOutcomeMeasure", # What are they measuring?
44
+ "SecondaryOutcomeMeasure", # Secondary endpoints
45
+ "EligibilityCriteria", # Who can participate?
46
+ "LeadSponsorName", # Who's funding?
47
+ "ResultsFirstPostDate", # Has results?
48
+ "StudyFirstPostDate", # When started?
49
+ "CompletionDate", # When finished?
50
+ "EnrollmentCount", # Sample size
51
+ "InterventionDescription", # Drug details
52
+ "ArmGroupLabel", # Treatment arms
53
+ "InterventionOtherName", # Drug aliases
54
+ ]
55
+ ```
56
+
57
+ ### Filter Enhancements
58
+
59
+ ```python
60
+ # Current
61
+ aggFilters = "studyType:INTERVENTIONAL,status:RECRUITING"
62
+
63
+ # Could add
64
+ "status:RECRUITING,ACTIVE_NOT_RECRUITING,COMPLETED" # Include completed for results
65
+ "phase:PHASE2,PHASE3" # Only later-stage trials
66
+ "resultsFirstPostDateRange:2020-01-01_" # Trials with posted results
67
+ ```
68
+
69
+ ---
70
+
71
+ ## Recommended Improvements
72
+
73
+ ### Phase 1: Richer Metadata
74
+
75
+ ```python
76
+ EXTENDED_FIELDS = [
77
+ "NCTId",
78
+ "BriefTitle",
79
+ "OfficialTitle",
80
+ "Condition",
81
+ "InterventionName",
82
+ "InterventionDescription",
83
+ "InterventionOtherName", # Drug synonyms!
84
+ "Phase",
85
+ "OverallStatus",
86
+ "PrimaryOutcomeMeasure",
87
+ "EnrollmentCount",
88
+ "LeadSponsorName",
89
+ "StudyFirstPostDate",
90
+ ]
91
+ ```
92
+
93
+ ### Phase 2: Results Retrieval
94
+
95
+ For completed trials, we can get actual efficacy data:
96
+
97
+ ```python
98
+ import httpx
+
+ async def get_trial_results(nct_id: str) -> dict | None:
+     """Fetch results for completed trials, or None if none are posted."""
+     url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
+     params = {"fields": "ResultsSection"}
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         resp = await client.get(url, params=params)
+         if resp.status_code != 200:
+             return None
+         # Returns outcome measures and statistics; the "resultsSection"
+         # key is assumed from the v2 study record structure
+         return resp.json().get("resultsSection")
105
+ ```
106
+
107
+ ### Phase 3: Drug Name Normalization
108
+
109
+ Map intervention names to standard identifiers:
110
+
111
+ ```python
112
+ # Problem: "Metformin", "Metformin HCl", "Glucophage" are the same drug
113
+ # Solution: Use RxNorm or DrugBank for normalization
114
+
115
+ import httpx
+
+ async def normalize_drug_name(intervention: str) -> str | None:
+     """Map a drug name to a standardized RxCUI via the RxNorm API."""
+     url = "https://rxnav.nlm.nih.gov/REST/rxcui.json"
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         resp = await client.get(url, params={"name": intervention})
+         resp.raise_for_status()
+         # Response shape: {"idGroup": {"rxnormId": ["6809"]}}
+         ids = resp.json().get("idGroup", {}).get("rxnormId", [])
+         return ids[0] if ids else None
119
+ ```
120
+
121
+ ---
122
+
123
+ ## Integration Opportunities
124
+
125
+ ### With PubMed
126
+
127
+ Cross-reference trials with publications:
128
+ ```python
129
+ # ClinicalTrials.gov provides PMID links
130
+ # Can correlate trial results with published papers
131
+ ```
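+
+ A hedged sketch of that correlation step; the `protocolSection.referencesModule` path is an assumption about the v2 study record shape:
+
+ ```python
+ def extract_linked_pmids(study: dict) -> list[str]:
+     """Pull the PMIDs a study record links to (if any)."""
+     refs = (
+         study.get("protocolSection", {})
+         .get("referencesModule", {})
+         .get("references", [])
+     )
+     return [r["pmid"] for r in refs if "pmid" in r]
+ ```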
132
+
133
+ ### With DrugBank/ChEMBL
134
+
135
+ Map interventions to:
136
+ - Mechanism of action
137
+ - Known targets
138
+ - Adverse effects
139
+ - Drug-drug interactions
140
+
141
+ ---
142
+
143
+ ## Python Libraries to Consider
144
+
145
+ | Library | Purpose | Notes |
146
+ |---------|---------|-------|
147
+ | [pytrials](https://pypi.org/project/pytrials/) | CT.gov wrapper | V2 API support unclear |
148
+ | [clinicaltrials](https://github.com/ebmdatalab/clinicaltrials-act-tracker) | Data tracking | More for analysis |
149
+ | [drugbank-downloader](https://pypi.org/project/drugbank-downloader/) | Drug mapping | Requires license |
150
+
151
+ ---
152
+
153
+ ## API Quirks & Gotchas
154
+
155
+ 1. **Rate Limiting**: Undocumented, be conservative
156
+ 2. **Pagination**: Max 1,000 results per request; page with `pageToken` (see the sketch below)
157
+ 3. **Field Names**: Case-sensitive, camelCase
158
+ 4. **Empty Results**: Some fields may be null even if requested
159
+ 5. **Status Changes**: Trials change status frequently
160
+
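+ A hedged paging sketch, assuming the v2 API's `pageToken`/`nextPageToken` convention:
+
+ ```python
+ import httpx
+
+ async def iter_all_studies(params: dict) -> list[dict]:
+     """Collect every page of results for a v2 /studies query."""
+     studies: list[dict] = []
+     token: str | None = None
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         while True:
+             page = {**params, "pageToken": token} if token else dict(params)
+             resp = await client.get(
+                 "https://clinicaltrials.gov/api/v2/studies", params=page
+             )
+             resp.raise_for_status()
+             data = resp.json()
+             studies.extend(data.get("studies", []))
+             token = data.get("nextPageToken")
+             if not token:
+                 return studies
+ ```
+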
161
+ ---
162
+
163
+ ## Example Enhanced Query
164
+
165
+ ```python
+ import httpx
+
+ # Uses EXTENDED_FIELDS from Phase 1 and our Evidence model
+ async def search_drug_repurposing_trials(
+     drug_name: str,
+     condition: str,
+     include_completed: bool = True,
+ ) -> list[Evidence]:
+     """Search for trials repurposing a drug for a new condition."""
+     statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"]
+     if include_completed:
+         statuses.append("COMPLETED")
+
+     params = {
+         "query.intr": drug_name,
+         "query.cond": condition,
+         "filter.overallStatus": ",".join(statuses),
+         "filter.studyType": "INTERVENTIONAL",
+         "fields": ",".join(EXTENDED_FIELDS),
+         "pageSize": 50,
+     }
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         resp = await client.get(
+             "https://clinicaltrials.gov/api/v2/studies", params=params
+         )
+         resp.raise_for_status()
+     # _study_to_evidence is a hypothetical helper that maps raw v2 study
+     # records onto our Evidence model
+     return [_study_to_evidence(s) for s in resp.json().get("studies", [])]
185
+ ```
186
+
187
+ ---
188
+
189
+ ## Sources
190
+
191
+ - [ClinicalTrials.gov API Documentation](https://clinicaltrials.gov/data-api/api)
192
+ - [CT.gov Field Definitions](https://clinicaltrials.gov/data-api/about-api/study-data-structure)
193
+ - [RxNorm API](https://lhncbc.nlm.nih.gov/RxNav/APIs/api-RxNorm.findRxcuiByString.html)
docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md ADDED
@@ -0,0 +1,211 @@
1
+ # Europe PMC Tool: Current State & Future Improvements
2
+
3
+ **Status**: Currently Implemented (Replaced bioRxiv)
4
+ **Priority**: High (Preprint + Open Access Source)
5
+
6
+ ---
7
+
8
+ ## Why Europe PMC Over bioRxiv?
9
+
10
+ ### bioRxiv API Limitations (Why We Abandoned It)
11
+
12
+ 1. **No Search API**: Only returns papers by date range or DOI
13
+ 2. **No Query Capability**: Cannot search for "metformin cancer"
14
+ 3. **Workaround Required**: Would need to download ALL preprints and build local search
15
+ 4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation
16
+
17
+ ### Europe PMC Advantages
18
+
19
+ 1. **Full Search API**: Boolean queries, filters, facets
20
+ 2. **Aggregates bioRxiv**: Includes bioRxiv, medRxiv content anyway
21
+ 3. **Includes PubMed**: Also has MEDLINE content
22
+ 4. **34 Preprint Servers**: Not just bioRxiv
23
+ 5. **Open Access Focus**: Full-text when available
24
+
25
+ ---
26
+
27
+ ## Current Implementation
28
+
29
+ ### What We Have (`src/tools/europepmc.py`)
30
+
31
+ - REST API search via `europepmc.org/webservices/rest/search`
32
+ - Preprint flagging via `firstPublicationDate` heuristics
33
+ - Returns: title, abstract, authors, DOI, source
34
+ - Marks preprints for transparency
35
+
36
+ ### Current Limitations
37
+
38
+ 1. **No Full-Text Retrieval**: Only metadata/abstracts
39
+ 2. **No Citation Network**: Missing references/citations
40
+ 3. **No Supplementary Files**: Not fetching figures/data
41
+ 4. **Basic Preprint Detection**: Heuristic, not explicit flag
42
+
43
+ ---
44
+
45
+ ## Europe PMC API Capabilities
46
+
47
+ ### Endpoints We Could Use
48
+
49
+ | Endpoint | Purpose | Currently Using |
50
+ |----------|---------|-----------------|
51
+ | `/search` | Query papers | Yes |
52
+ | `/fulltext/{ID}` | Full text (XML/JSON) | No |
53
+ | `/{PMCID}/supplementaryFiles` | Figures, data | No |
54
+ | `/citations/{ID}` | Who cited this | No |
55
+ | `/references/{ID}` | What this cites | No |
56
+ | `/annotations` | Text-mined entities | No |
57
+
58
+ ### Rich Query Syntax
59
+
60
+ ```python
61
+ # Current simple query
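+ As a quick illustration of the `pyalex` wrapper listed above, here is a hedged sketch based on its README (verify names against the current docs):
+
+ ```python
+ import pyalex
+ from pyalex import Works
+
+ pyalex.config.email = "you@example.com"  # polite pool
+
+ # Search works and print the top hits with citation counts
+ for work in Works().search("metformin cancer drug repurposing").get()[:5]:
+     print(work["display_name"], work["cited_by_count"])
+ ```
+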
62
+ query = "metformin cancer"
63
+
64
+ # Could use advanced syntax
65
+ query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
66
+ query += " AND (SRC:PPR)" # Only preprints
67
+ query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])" # Date range
68
+ query += " AND (OPEN_ACCESS:y)" # Only open access
69
+ ```
70
+
71
+ ### Source Filters
72
+
73
+ ```python
74
+ # Filter by source
75
+ "SRC:MED" # MEDLINE
76
+ "SRC:PMC" # PubMed Central
77
+ "SRC:PPR" # Preprints (bioRxiv, medRxiv, etc.)
78
+ "SRC:AGR" # Agricola
79
+ "SRC:CBA" # Chinese Biological Abstracts
80
+ ```
81
+
82
+ ---
83
+
84
+ ## Recommended Improvements
85
+
86
+ ### Phase 1: Rich Metadata
87
+
88
+ ```python
89
+ # Add to search results
90
+ additional_fields = [
91
+ "citedByCount", # Impact indicator
92
+ "source", # Explicit source (MED, PMC, PPR)
93
+ "isOpenAccess", # Boolean flag
94
+ "fullTextUrlList", # URLs for full text
95
+ "authorAffiliations", # Institution info
96
+ "grantsList", # Funding info
97
+ ]
98
+ ```
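+
+ A hedged sketch of requesting richer records; `resultType=core` is assumed (per the REST docs) to include fields like `citedByCount` and `fullTextUrlList`:
+
+ ```python
+ import httpx
+
+ async def search_core(query: str, page_size: int = 25) -> list[dict]:
+     """Search Europe PMC with the richer 'core' result type."""
+     params = {
+         "query": query,
+         "format": "json",
+         "resultType": "core",  # richer metadata than the default "lite"
+         "pageSize": page_size,
+     }
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         resp = await client.get(
+             "https://www.ebi.ac.uk/europepmc/webservices/rest/search",
+             params=params,
+         )
+         resp.raise_for_status()
+         return resp.json().get("resultList", {}).get("result", [])
+ ```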
99
+
100
+ ### Phase 2: Full-Text Retrieval
101
+
102
+ ```python
103
+ import httpx
+
+ async def get_fulltext(pmcid: str) -> str | None:
+     """Get full text for open-access papers, or None if unavailable."""
+     # XML endpoint; the docs also describe a fullTextJSON variant
+     url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         resp = await client.get(url)
+         return resp.text if resp.status_code == 200 else None
109
+ ```
110
+
111
+ ### Phase 3: Citation Network
112
+
113
+ ```python
114
+ import httpx
+
+ async def _ids(url: str, list_key: str, item_key: str) -> list[str]:
+     # JSON shape assumed: {"<list_key>": {"<item_key>": [{"id": ...}, ...]}}
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         resp = await client.get(url, params={"format": "json"})
+         resp.raise_for_status()
+         items = resp.json().get(list_key, {}).get(item_key, [])
+         return [str(item.get("id", "")) for item in items]
+
+ async def get_citations(pmcid: str) -> list[str]:
+     """Get papers that cite this one."""
+     url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/citations"
+     return await _ids(url, "citationList", "citation")
+
+ async def get_references(pmcid: str) -> list[str]:
+     """Get papers this one cites."""
+     url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/references"
+     return await _ids(url, "referenceList", "reference")
121
+ ```
122
+
123
+ ### Phase 4: Text-Mined Annotations
124
+
125
+ Europe PMC extracts entities automatically:
126
+
127
+ ```python
128
+ import httpx
+
+ async def get_annotations(pmcid: str) -> dict:
+     """Get text-mined entities (genes, diseases, drugs)."""
+     url = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
+     params = {
+         "articleIds": f"PMC:{pmcid}",
+         "type": "Gene_Proteins,Diseases,Chemicals",
+         "format": "JSON",
+     }
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         resp = await client.get(url, params=params)
+         resp.raise_for_status()
+         # Returns structured entity mentions with positions
+         return resp.json()
137
+ ```
138
+
139
+ ---
140
+
141
+ ## Supplementary File Retrieval
142
+
143
+ From reference repo (`bioinformatics_tools.py` lines 123-149):
144
+
145
+ ```python
146
+ def get_figures(pmcid: str) -> dict[str, str]:
147
+ """Download figures and supplementary files."""
148
+ url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
149
+     # Returns a ZIP of images; the reference repo re-encodes them as base64
150
+ ```
151
+
152
+ ---
153
+
154
+ ## Preprint-Specific Features
155
+
156
+ ### Identify Preprint Servers
157
+
158
+ ```python
159
+ PREPRINT_SOURCES = {
160
+ "PPR": "General preprints",
161
+ "bioRxiv": "Biology preprints",
162
+ "medRxiv": "Medical preprints",
163
+ "chemRxiv": "Chemistry preprints",
164
+ "Research Square": "Multi-disciplinary",
165
+ "Preprints.org": "MDPI preprints",
166
+ }
167
+
168
+ # Check if published version exists
169
+ async def check_published_version(preprint_doi: str) -> str | None:
170
+ """Check if preprint has been peer-reviewed and published."""
171
+ # Europe PMC links preprints to final versions
172
+ ```
173
+
174
+ ---
175
+
176
+ ## Rate Limiting
177
+
178
+ Europe PMC is more generous than NCBI:
179
+
180
+ ```python
181
+ # No documented hard limit, but be respectful
182
+ # Recommend: 10-20 requests/second max
183
+ # Use email in User-Agent for polite pool
184
+ headers = {
185
+ "User-Agent": "DeepCritical/1.0 (mailto:your@email.com)"
186
+ }
187
+ ```
188
+
189
+ ---
190
+
191
+ ## vs. The Lens & OpenAlex
192
+
193
+ | Feature | Europe PMC | The Lens | OpenAlex |
194
+ |---------|------------|----------|----------|
195
+ | Biomedical Focus | Yes | Partial | Partial |
196
+ | Preprints | Yes (34 servers) | Yes | Yes |
197
+ | Full Text | PMC papers | Links | No |
198
+ | Citations | Yes | Yes | Yes |
199
+ | Annotations | Yes (text-mined) | No | No |
200
+ | Rate Limits | Generous | Moderate | Very generous |
201
+ | API Key | Optional | Required | Optional |
202
+
203
+ ---
204
+
205
+ ## Sources
206
+
207
+ - [Europe PMC REST API](https://europepmc.org/RestfulWebService)
208
+ - [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
209
+ - [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
210
+ - [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
211
+ - [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)
docs/brainstorming/04_OPENALEX_INTEGRATION.md ADDED
@@ -0,0 +1,303 @@
1
+ # OpenAlex Integration: The Missing Piece?
2
+
3
+ **Status**: NOT Implemented (Candidate for Addition)
4
+ **Priority**: HIGH - Could Replace Multiple Tools
5
+ **Reference**: Already implemented in `reference_repos/DeepCritical`
6
+
7
+ ---
8
+
9
+ ## What is OpenAlex?
10
+
11
+ OpenAlex is a **fully open** index of the global research system:
12
+
13
+ - **209M+ works** (papers, books, datasets)
14
+ - **2B+ author records** (disambiguated)
15
+ - **124K+ venues** (journals, repositories)
16
+ - **109K+ institutions**
17
+ - **65K+ concepts** (hierarchical, linked to Wikidata)
18
+
19
+ **Free. Open. No API key required.**
20
+
21
+ ---
22
+
23
+ ## Why OpenAlex for DeepCritical?
24
+
25
+ ### Current Architecture
26
+
27
+ ```
28
+ User Query
29
+                     ↓
+ ┌──────────────────────────────────────────┐
+ │  PubMed    ClinicalTrials    Europe PMC  │   ← 3 separate APIs
+ └──────────────────────────────────────────┘
+                     ↓
34
+ Orchestrator (deduplicate, judge, synthesize)
35
+ ```
36
+
37
+ ### With OpenAlex
38
+
39
+ ```
40
+ User Query
41
+                    ↓
+ ┌──────────────────────────────────────┐
+ │  OpenAlex                            │   ← Single API
+ │  (includes PubMed + preprints +      │
+ │   citations + concepts + authors)    │
+ └──────────────────────────────────────┘
+                    ↓
48
+ Orchestrator (enrich with CT.gov for trials)
49
+ ```
50
+
51
+ **OpenAlex already aggregates**:
52
+ - PubMed/MEDLINE
53
+ - Crossref
54
+ - ORCID
55
+ - Unpaywall (open access links)
56
+ - Microsoft Academic Graph (legacy)
57
+ - Preprint servers
58
+
59
+ ---
60
+
61
+ ## Reference Implementation
62
+
63
+ From `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`:
64
+
65
+ ```python
66
+ class OpenAlexFetchTool(ToolRunner):
67
+ def __init__(self):
68
+ super().__init__(
69
+ ToolSpec(
70
+ name="openalex_fetch",
71
+ description="Fetch OpenAlex work or author",
72
+ inputs={"entity": "TEXT", "identifier": "TEXT"},
73
+ outputs={"result": "JSON"},
74
+ )
75
+ )
76
+
77
+ def run(self, params: dict[str, Any]) -> ExecutionResult:
78
+ entity = params["entity"] # "works", "authors", "venues"
79
+ identifier = params["identifier"]
80
+ base = "https://api.openalex.org"
81
+ url = f"{base}/{entity}/{identifier}"
82
+ resp = requests.get(url, timeout=30)
83
+ return ExecutionResult(success=True, data={"result": resp.json()})
84
+ ```
85
+
86
+ ---
87
+
88
+ ## OpenAlex API Features
89
+
90
+ ### Search Works (Papers)
91
+
92
+ ```python
93
+ # Search for metformin + cancer papers
94
+ url = "https://api.openalex.org/works"
95
+ params = {
96
+ "search": "metformin cancer drug repurposing",
97
+ "filter": "publication_year:>2020,type:article",
98
+ "sort": "cited_by_count:desc",
99
+ "per_page": 50,
100
+ }
101
+ ```
102
+
103
+ ### Rich Filtering
104
+
105
+ ```python
106
+ # Filter examples
107
+ "publication_year:2023"
108
+ "type:article" # vs preprint, book, etc.
109
+ "is_oa:true" # Open access only
110
+ "concepts.id:C71924100" # Papers about "Medicine"
111
+ "authorships.institutions.id:I27837315" # From Harvard
112
+ "cited_by_count:>100" # Highly cited
113
+ "has_fulltext:true" # Full text available
114
+ ```
115
+
116
+ ### What You Get Back
117
+
118
+ ```json
119
+ {
120
+ "id": "W2741809807",
121
+ "title": "Metformin: A candidate drug for...",
122
+ "publication_year": 2023,
123
+ "type": "article",
124
+ "cited_by_count": 45,
125
+ "is_oa": true,
126
+ "primary_location": {
127
+ "source": {"display_name": "Nature Medicine"},
128
+ "pdf_url": "https://...",
129
+ "landing_page_url": "https://..."
130
+ },
131
+ "concepts": [
132
+ {"id": "C71924100", "display_name": "Medicine", "score": 0.95},
133
+ {"id": "C54355233", "display_name": "Pharmacology", "score": 0.88}
134
+ ],
135
+ "authorships": [
136
+ {
137
+ "author": {"id": "A123", "display_name": "John Smith"},
138
+ "institutions": [{"display_name": "Harvard Medical School"}]
139
+ }
140
+ ],
141
+ "referenced_works": ["W123", "W456"], # Citations
142
+ "related_works": ["W789", "W012"] # Similar papers
143
+ }
144
+ ```
145
+
146
+ ---
147
+
148
+ ## Key Advantages Over Current Tools
149
+
150
+ ### 1. Citation Network (We Don't Have This!)
151
+
152
+ ```python
153
+ # Get papers that cite a work
154
+ url = f"https://api.openalex.org/works?filter=cites:{work_id}"
155
+
156
+ # Get papers cited by a work
157
+ # Already in `referenced_works` field
158
+ ```
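+
+ A hedged sketch wrapping the citing-works query above; `get_citing_works` is an illustrative name:
+
+ ```python
+ import httpx
+
+ async def get_citing_works(work_id: str, per_page: int = 25) -> list[dict]:
+     """Fetch works whose reference lists include the given work."""
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         resp = await client.get(
+             "https://api.openalex.org/works",
+             params={"filter": f"cites:{work_id}", "per_page": per_page},
+         )
+         resp.raise_for_status()
+         return resp.json().get("results", [])
+ ```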
159
+
160
+ ### 2. Concept Tagging (We Don't Have This!)
161
+
162
+ OpenAlex auto-tags papers with hierarchical concepts:
163
+ - "Medicine" β†’ "Pharmacology" β†’ "Drug Repurposing"
164
+ - Can search by concept, not just keywords
165
+
166
+ ### 3. Author Disambiguation (We Don't Have This!)
167
+
168
+ ```python
169
+ # Find all works by an author
170
+ url = f"https://api.openalex.org/works?filter=authorships.author.id:{author_id}"
171
+ ```
172
+
173
+ ### 4. Institution Tracking
174
+
175
+ ```python
176
+ # Find drug repurposing papers from top institutions
177
+ url = "https://api.openalex.org/works"
178
+ params = {
179
+ "search": "drug repurposing",
180
+ "filter": "authorships.institutions.id:I27837315", # Harvard
181
+ }
182
+ ```
183
+
184
+ ### 5. Related Works
185
+
186
+ Each paper comes with `related_works` - semantically similar papers discovered by OpenAlex's ML.
187
+
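+ A hedged sketch of following those links; it assumes each entry is a full OpenAlex ID URL like `https://openalex.org/W...`:
+
+ ```python
+ import httpx
+
+ async def fetch_related(work: dict, limit: int = 5) -> list[dict]:
+     """Resolve a work's related_works IDs into full records."""
+     related = []
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         for id_url in work.get("related_works", [])[:limit]:
+             work_id = id_url.rsplit("/", 1)[-1]  # e.g. "W2741809807"
+             resp = await client.get(f"https://api.openalex.org/works/{work_id}")
+             if resp.status_code == 200:
+                 related.append(resp.json())
+     return related
+ ```
+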
188
+ ---
189
+
190
+ ## Proposed Implementation
191
+
192
+ ### New Tool: `src/tools/openalex.py`
193
+
194
+ ```python
195
+ """OpenAlex search tool for comprehensive scholarly data."""
196
+
197
+ import httpx
198
+ from src.tools.base import SearchTool
199
+ from src.utils.models import Evidence
200
+
201
+ class OpenAlexTool(SearchTool):
202
+ """Search OpenAlex for scholarly works with rich metadata."""
203
+
204
+ name = "openalex"
205
+
206
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
207
+ async with httpx.AsyncClient() as client:
208
+ resp = await client.get(
209
+ "https://api.openalex.org/works",
210
+ params={
211
+ "search": query,
212
+ "filter": "type:article,is_oa:true",
213
+ "sort": "cited_by_count:desc",
214
+ "per_page": max_results,
215
+ "mailto": "deepcritical@example.com", # Polite pool
216
+ },
217
+ )
218
+ data = resp.json()
219
+
220
+ return [
221
+ Evidence(
222
+ source="openalex",
223
+ title=work["title"],
224
+ abstract=work.get("abstract", ""),
225
+ url=work["primary_location"]["landing_page_url"],
226
+ metadata={
227
+ "cited_by_count": work["cited_by_count"],
228
+ "concepts": [c["display_name"] for c in work["concepts"][:5]],
229
+ "is_open_access": work["is_oa"],
230
+ "pdf_url": work["primary_location"].get("pdf_url"),
231
+ },
232
+ )
233
+ for work in data["results"]
234
+ ]
235
+ ```
236
+
237
+ ---
238
+
239
+ ## Rate Limits
240
+
241
+ OpenAlex is **extremely generous**:
242
+
243
+ - No hard rate limit documented
244
+ - Recommended: <100,000 requests/day
245
+ - **Polite pool**: Add `mailto=your@email.com` param for faster responses
246
+ - No API key required (optional for priority support)
247
+
248
+ ---
249
+
250
+ ## Should We Add OpenAlex?
251
+
252
+ ### Arguments FOR
253
+
254
+ 1. **Already in reference repo** - proven pattern
255
+ 2. **Richer data** - citations, concepts, authors
256
+ 3. **Single source** - reduces API complexity
257
+ 4. **Free & open** - no keys, no limits
258
+ 5. **Institution adoption** - Leiden, Sorbonne switched to it
259
+
260
+ ### Arguments AGAINST
261
+
262
+ 1. **Adds complexity** - another data source
263
+ 2. **Overlap** - duplicates some PubMed data
264
+ 3. **Not biomedical-focused** - covers all disciplines
265
+ 4. **No full text** - still need PMC/Europe PMC for that
266
+
267
+ ### Recommendation
268
+
269
+ **Add OpenAlex as a 4th source**, don't replace existing tools.
270
+
271
+ Use it for:
272
+ - Citation network analysis
273
+ - Concept-based discovery
274
+ - High-impact paper finding
275
+ - Author/institution tracking
276
+
277
+ Keep PubMed, ClinicalTrials, Europe PMC for:
278
+ - Authoritative biomedical search
279
+ - Clinical trial data
280
+ - Full-text access
281
+ - Preprint tracking
282
+
283
+ ---
284
+
285
+ ## Implementation Priority
286
+
287
+ | Task | Effort | Value |
288
+ |------|--------|-------|
289
+ | Basic search | Low | High |
290
+ | Citation network | Medium | Very High |
291
+ | Concept filtering | Low | High |
292
+ | Related works | Low | High |
293
+ | Author tracking | Medium | Medium |
294
+
295
+ ---
296
+
297
+ ## Sources
298
+
299
+ - [OpenAlex Documentation](https://docs.openalex.org)
300
+ - [OpenAlex API Overview](https://docs.openalex.org/api)
301
+ - [OpenAlex Wikipedia](https://en.wikipedia.org/wiki/OpenAlex)
302
+ - [Leiden University Announcement](https://www.leidenranking.com/information/openalex)
303
+ - [OpenAlex: A fully-open index (Paper)](https://arxiv.org/abs/2205.01833)
docs/brainstorming/implementation/15_PHASE_OPENALEX.md ADDED
@@ -0,0 +1,603 @@
1
+ # Phase 15: OpenAlex Integration
2
+
3
+ **Priority**: HIGH - Biggest bang for buck
4
+ **Effort**: ~2-3 hours
5
+ **Dependencies**: None (existing codebase patterns sufficient)
6
+
7
+ ---
8
+
9
+ ## Prerequisites (COMPLETED)
10
+
11
+ The following model changes have been implemented to support this integration:
12
+
13
+ 1. **`SourceName` Literal Updated** (`src/utils/models.py:9`)
14
+ ```python
15
+ SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
16
+ ```
17
+ - Without this, `source="openalex"` would fail Pydantic validation
18
+
19
+ 2. **`Evidence.metadata` Field Added** (`src/utils/models.py:39-42`)
20
+ ```python
21
+ metadata: dict[str, Any] = Field(
22
+ default_factory=dict,
23
+ description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
24
+ )
25
+ ```
26
+ - Required for storing `cited_by_count`, `concepts`, etc.
27
+    - The model is still frozen: metadata must be passed at construction time (see the sketch below)
28
+
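+ For example, a minimal sketch of passing metadata at construction (constructor fields taken from the implementation below; values illustrative):
+
+ ```python
+ from src.utils.models import Citation, Evidence
+
+ evidence = Evidence(
+     content="Metformin shows anticancer effects...",
+     citation=Citation(
+         source="openalex",
+         title="Metformin and cancer",
+         url="https://openalex.org/W2741809807",
+         date="2023",
+     ),
+     metadata={"cited_by_count": 45, "is_open_access": True},
+ )
+ ```
+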
29
+ 3. **`__init__.py` Exports Updated** (`src/tools/__init__.py`)
30
+ - All tools are now exported: `ClinicalTrialsTool`, `EuropePMCTool`, `PubMedTool`
31
+ - OpenAlexTool should be added here after implementation
32
+
33
+ ---
34
+
35
+ ## Overview
36
+
37
+ Add OpenAlex as a 4th data source for comprehensive scholarly data including:
38
+ - Citation networks (who cites whom)
39
+ - Concept tagging (hierarchical topic classification)
40
+ - Author disambiguation
41
+ - 209M+ works indexed
42
+
43
+ **Why OpenAlex?**
44
+ - Free, no API key required
45
+ - Already implemented in reference repo
46
+ - Provides citation data we don't have
47
+ - Aggregates PubMed + preprints + more
48
+
49
+ ---
50
+
51
+ ## TDD Implementation Plan
52
+
53
+ ### Step 1: Write the Tests First
54
+
55
+ **File**: `tests/unit/tools/test_openalex.py`
56
+
57
+ ```python
58
+ """Tests for OpenAlex search tool."""
59
+
60
+ import pytest
61
+ import respx
62
+ from httpx import Response
63
+
64
+ from src.tools.openalex import OpenAlexTool
65
+ from src.utils.models import Evidence
66
+
67
+
68
+ class TestOpenAlexTool:
69
+ """Test suite for OpenAlex search functionality."""
70
+
71
+ @pytest.fixture
72
+ def tool(self) -> OpenAlexTool:
73
+ return OpenAlexTool()
74
+
75
+ def test_name_property(self, tool: OpenAlexTool) -> None:
76
+ """Tool should identify itself as 'openalex'."""
77
+ assert tool.name == "openalex"
78
+
79
+ @respx.mock
80
+ @pytest.mark.asyncio
81
+ async def test_search_returns_evidence(self, tool: OpenAlexTool) -> None:
82
+ """Search should return list of Evidence objects."""
83
+ mock_response = {
84
+ "results": [
85
+ {
86
+ "id": "W2741809807",
87
+ "title": "Metformin and cancer: A systematic review",
88
+ "publication_year": 2023,
89
+ "cited_by_count": 45,
90
+ "type": "article",
91
+ "is_oa": True,
92
+ "primary_location": {
93
+ "source": {"display_name": "Nature Medicine"},
94
+ "landing_page_url": "https://doi.org/10.1038/example",
95
+ "pdf_url": None,
96
+ },
97
+ "abstract_inverted_index": {
98
+ "Metformin": [0],
99
+ "shows": [1],
100
+ "anticancer": [2],
101
+ "effects": [3],
102
+ },
103
+ "concepts": [
104
+ {"display_name": "Medicine", "score": 0.95},
105
+ {"display_name": "Oncology", "score": 0.88},
106
+ ],
107
+ "authorships": [
108
+ {
109
+ "author": {"display_name": "John Smith"},
110
+ "institutions": [{"display_name": "Harvard"}],
111
+ }
112
+ ],
113
+ }
114
+ ]
115
+ }
116
+
117
+ respx.get("https://api.openalex.org/works").mock(
118
+ return_value=Response(200, json=mock_response)
119
+ )
120
+
121
+ results = await tool.search("metformin cancer", max_results=10)
122
+
123
+ assert len(results) == 1
124
+ assert isinstance(results[0], Evidence)
125
+ assert "Metformin and cancer" in results[0].citation.title
126
+ assert results[0].citation.source == "openalex"
127
+
128
+ @respx.mock
129
+ @pytest.mark.asyncio
130
+ async def test_search_empty_results(self, tool: OpenAlexTool) -> None:
131
+ """Search with no results should return empty list."""
132
+ respx.get("https://api.openalex.org/works").mock(
133
+ return_value=Response(200, json={"results": []})
134
+ )
135
+
136
+ results = await tool.search("xyznonexistentquery123")
137
+ assert results == []
138
+
139
+ @respx.mock
140
+ @pytest.mark.asyncio
141
+ async def test_search_handles_missing_abstract(self, tool: OpenAlexTool) -> None:
142
+ """Tool should handle papers without abstracts."""
143
+ mock_response = {
144
+ "results": [
145
+ {
146
+ "id": "W123",
147
+ "title": "Paper without abstract",
148
+ "publication_year": 2023,
149
+ "cited_by_count": 10,
150
+ "type": "article",
151
+ "is_oa": False,
152
+ "primary_location": {
153
+ "source": {"display_name": "Journal"},
154
+ "landing_page_url": "https://example.com",
155
+ },
156
+ "abstract_inverted_index": None,
157
+ "concepts": [],
158
+ "authorships": [],
159
+ }
160
+ ]
161
+ }
162
+
163
+ respx.get("https://api.openalex.org/works").mock(
164
+ return_value=Response(200, json=mock_response)
165
+ )
166
+
167
+ results = await tool.search("test query")
168
+ assert len(results) == 1
169
+ assert results[0].content == "" # No abstract
170
+
171
+ @respx.mock
172
+ @pytest.mark.asyncio
173
+ async def test_search_extracts_citation_count(self, tool: OpenAlexTool) -> None:
174
+ """Citation count should be in metadata."""
175
+ mock_response = {
176
+ "results": [
177
+ {
178
+ "id": "W456",
179
+ "title": "Highly cited paper",
180
+ "publication_year": 2020,
181
+ "cited_by_count": 500,
182
+ "type": "article",
183
+ "is_oa": True,
184
+ "primary_location": {
185
+ "source": {"display_name": "Science"},
186
+ "landing_page_url": "https://example.com",
187
+ },
188
+ "abstract_inverted_index": {"Test": [0]},
189
+ "concepts": [],
190
+ "authorships": [],
191
+ }
192
+ ]
193
+ }
194
+
195
+ respx.get("https://api.openalex.org/works").mock(
196
+ return_value=Response(200, json=mock_response)
197
+ )
198
+
199
+ results = await tool.search("highly cited")
200
+ assert results[0].metadata["cited_by_count"] == 500
201
+
202
+ @respx.mock
203
+ @pytest.mark.asyncio
204
+ async def test_search_extracts_concepts(self, tool: OpenAlexTool) -> None:
205
+ """Concepts should be extracted for semantic discovery."""
206
+ mock_response = {
207
+ "results": [
208
+ {
209
+ "id": "W789",
210
+ "title": "Drug repurposing study",
211
+ "publication_year": 2023,
212
+ "cited_by_count": 25,
213
+ "type": "article",
214
+ "is_oa": True,
215
+ "primary_location": {
216
+ "source": {"display_name": "PLOS ONE"},
217
+ "landing_page_url": "https://example.com",
218
+ },
219
+ "abstract_inverted_index": {"Drug": [0], "repurposing": [1]},
220
+ "concepts": [
221
+ {"display_name": "Pharmacology", "score": 0.92},
222
+ {"display_name": "Drug Discovery", "score": 0.85},
223
+ {"display_name": "Medicine", "score": 0.80},
224
+ ],
225
+ "authorships": [],
226
+ }
227
+ ]
228
+ }
229
+
230
+ respx.get("https://api.openalex.org/works").mock(
231
+ return_value=Response(200, json=mock_response)
232
+ )
233
+
234
+ results = await tool.search("drug repurposing")
235
+ assert "Pharmacology" in results[0].metadata["concepts"]
236
+ assert "Drug Discovery" in results[0].metadata["concepts"]
237
+
238
+ @respx.mock
239
+ @pytest.mark.asyncio
240
+ async def test_search_api_error_raises_search_error(
241
+ self, tool: OpenAlexTool
242
+ ) -> None:
243
+ """API errors should raise SearchError."""
244
+ from src.utils.exceptions import SearchError
245
+
246
+ respx.get("https://api.openalex.org/works").mock(
247
+ return_value=Response(500, text="Internal Server Error")
248
+ )
249
+
250
+ with pytest.raises(SearchError):
251
+ await tool.search("test query")
252
+
253
+ def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None:
254
+ """Test abstract reconstruction from inverted index."""
255
+ inverted_index = {
256
+ "Metformin": [0, 5],
257
+ "is": [1],
258
+ "a": [2],
259
+ "diabetes": [3],
260
+ "drug": [4],
261
+ "effective": [6],
262
+ }
263
+ abstract = tool._reconstruct_abstract(inverted_index)
264
+ assert abstract == "Metformin is a diabetes drug Metformin effective"
265
+ ```
266
+
267
+ ---
268
+
269
+ ### Step 2: Create the Implementation
270
+
271
+ **File**: `src/tools/openalex.py`
272
+
273
+ ```python
274
+ """OpenAlex search tool for comprehensive scholarly data."""
275
+
276
+ from typing import Any
277
+
278
+ import httpx
279
+ from tenacity import retry, stop_after_attempt, wait_exponential
280
+
281
+ from src.utils.exceptions import SearchError
282
+ from src.utils.models import Citation, Evidence
283
+
284
+
285
+ class OpenAlexTool:
286
+ """
287
+ Search OpenAlex for scholarly works with rich metadata.
288
+
289
+ OpenAlex provides:
290
+ - 209M+ scholarly works
291
+ - Citation counts and networks
292
+ - Concept tagging (hierarchical)
293
+ - Author disambiguation
294
+ - Open access links
295
+
296
+ API Docs: https://docs.openalex.org/
297
+ """
298
+
299
+ BASE_URL = "https://api.openalex.org/works"
300
+
301
+ def __init__(self, email: str | None = None) -> None:
302
+ """
303
+ Initialize OpenAlex tool.
304
+
305
+ Args:
306
+ email: Optional email for polite pool (faster responses)
307
+ """
308
+ self.email = email or "deepcritical@example.com"
309
+
310
+ @property
311
+ def name(self) -> str:
312
+ return "openalex"
313
+
314
+ @retry(
315
+ stop=stop_after_attempt(3),
316
+ wait=wait_exponential(multiplier=1, min=1, max=10),
317
+ reraise=True,
318
+ )
319
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
320
+ """
321
+ Search OpenAlex for scholarly works.
322
+
323
+ Args:
324
+ query: Search terms
325
+ max_results: Maximum results to return (max 200 per request)
326
+
327
+ Returns:
328
+ List of Evidence objects with citation metadata
329
+
330
+ Raises:
331
+ SearchError: If API request fails
332
+ """
333
+ params = {
334
+ "search": query,
335
+ "filter": "type:article", # Only peer-reviewed articles
336
+ "sort": "cited_by_count:desc", # Most cited first
337
+ "per_page": min(max_results, 200),
338
+ "mailto": self.email, # Polite pool for faster responses
339
+ }
340
+
341
+ async with httpx.AsyncClient(timeout=30.0) as client:
342
+ try:
343
+ response = await client.get(self.BASE_URL, params=params)
344
+ response.raise_for_status()
345
+
346
+ data = response.json()
347
+ results = data.get("results", [])
348
+
349
+ return [self._to_evidence(work) for work in results[:max_results]]
350
+
351
+ except httpx.HTTPStatusError as e:
352
+ raise SearchError(f"OpenAlex API error: {e}") from e
353
+ except httpx.RequestError as e:
354
+ raise SearchError(f"OpenAlex connection failed: {e}") from e
355
+
356
+ def _to_evidence(self, work: dict[str, Any]) -> Evidence:
357
+ """Convert OpenAlex work to Evidence object."""
358
+ title = work.get("title", "Untitled")
359
+ pub_year = work.get("publication_year", "Unknown")
360
+ cited_by = work.get("cited_by_count", 0)
361
+ is_oa = work.get("is_oa", False)
362
+
363
+ # Reconstruct abstract from inverted index
364
+ abstract_index = work.get("abstract_inverted_index")
365
+ abstract = self._reconstruct_abstract(abstract_index) if abstract_index else ""
366
+
367
+ # Extract concepts (top 5)
368
+ concepts = [
369
+ c.get("display_name", "")
370
+ for c in work.get("concepts", [])[:5]
371
+ if c.get("display_name")
372
+ ]
373
+
374
+ # Extract authors (top 5)
375
+ authorships = work.get("authorships", [])
376
+ authors = [
377
+ a.get("author", {}).get("display_name", "")
378
+ for a in authorships[:5]
379
+ if a.get("author", {}).get("display_name")
380
+ ]
381
+
382
+ # Get URL
383
+ primary_loc = work.get("primary_location") or {}
384
+ url = primary_loc.get("landing_page_url", "")
385
+ if not url:
386
+ # Fallback to OpenAlex page
387
+ work_id = work.get("id", "").replace("https://openalex.org/", "")
388
+ url = f"https://openalex.org/{work_id}"
389
+
390
+ return Evidence(
391
+ content=abstract[:2000],
392
+ citation=Citation(
393
+ source="openalex",
394
+ title=title[:500],
395
+ url=url,
396
+ date=str(pub_year),
397
+ authors=authors,
398
+ ),
399
+ relevance=min(0.9, 0.5 + (cited_by / 1000)), # Boost by citations
400
+ metadata={
401
+ "cited_by_count": cited_by,
402
+ "is_open_access": is_oa,
403
+ "concepts": concepts,
404
+ "pdf_url": primary_loc.get("pdf_url"),
405
+ },
406
+ )
407
+
408
+ def _reconstruct_abstract(
409
+ self, inverted_index: dict[str, list[int]]
410
+ ) -> str:
411
+ """
412
+ Reconstruct abstract from OpenAlex inverted index format.
413
+
414
+ OpenAlex stores abstracts as {"word": [position1, position2, ...]}.
415
+ This rebuilds the original text.
416
+ """
417
+ if not inverted_index:
418
+ return ""
419
+
420
+ # Build position -> word mapping
421
+ position_word: dict[int, str] = {}
422
+ for word, positions in inverted_index.items():
423
+ for pos in positions:
424
+ position_word[pos] = word
425
+
426
+ # Reconstruct in order
427
+ if not position_word:
428
+ return ""
429
+
430
+ max_pos = max(position_word.keys())
431
+ words = [position_word.get(i, "") for i in range(max_pos + 1)]
432
+ return " ".join(w for w in words if w)
433
+ ```
434
+
435
+ ---
436
+
437
+ ### Step 3: Register in Search Handler
438
+
439
+ **File**: `src/tools/search_handler.py` (add to imports and tool list)
440
+
441
+ ```python
442
+ # Add import
443
+ from src.tools.openalex import OpenAlexTool
444
+
445
+ # Add to _create_tools method
446
+ def _create_tools(self) -> list[SearchTool]:
447
+ return [
448
+ PubMedTool(),
449
+ ClinicalTrialsTool(),
450
+ EuropePMCTool(),
451
+ OpenAlexTool(), # NEW
452
+ ]
453
+ ```
454
+
455
+ ---
456
+
457
+ ### Step 4: Update `__init__.py`
458
+
459
+ **File**: `src/tools/__init__.py`
460
+
461
+ ```python
462
+ from src.tools.openalex import OpenAlexTool
463
+
464
+ __all__ = [
465
+ "PubMedTool",
466
+ "ClinicalTrialsTool",
467
+ "EuropePMCTool",
468
+ "OpenAlexTool", # NEW
469
+ # ...
470
+ ]
471
+ ```
472
+
473
+ ---
474
+
475
+ ## Demo Script
476
+
477
+ **File**: `examples/openalex_demo.py`
478
+
479
+ ```python
480
+ #!/usr/bin/env python3
481
+ """Demo script to verify OpenAlex integration."""
482
+
483
+ import asyncio
484
+ from src.tools.openalex import OpenAlexTool
485
+
486
+
487
+ async def main():
488
+ """Run OpenAlex search demo."""
489
+ tool = OpenAlexTool()
490
+
491
+ print("=" * 60)
492
+ print("OpenAlex Integration Demo")
493
+ print("=" * 60)
494
+
495
+ # Test 1: Basic drug repurposing search
496
+ print("\n[Test 1] Searching for 'metformin cancer drug repurposing'...")
497
+ results = await tool.search("metformin cancer drug repurposing", max_results=5)
498
+
499
+ for i, evidence in enumerate(results, 1):
500
+ print(f"\n--- Result {i} ---")
501
+ print(f"Title: {evidence.citation.title}")
502
+ print(f"Year: {evidence.citation.date}")
503
+ print(f"Citations: {evidence.metadata.get('cited_by_count', 'N/A')}")
504
+ print(f"Concepts: {', '.join(evidence.metadata.get('concepts', []))}")
505
+ print(f"Open Access: {evidence.metadata.get('is_open_access', False)}")
506
+ print(f"URL: {evidence.citation.url}")
507
+ if evidence.content:
508
+ print(f"Abstract: {evidence.content[:200]}...")
509
+
510
+ # Test 2: High-impact papers
511
+ print("\n" + "=" * 60)
512
+ print("[Test 2] Finding highly-cited papers on 'long COVID treatment'...")
513
+ results = await tool.search("long COVID treatment", max_results=3)
514
+
515
+ for evidence in results:
516
+ print(f"\n- {evidence.citation.title}")
517
+ print(f" Citations: {evidence.metadata.get('cited_by_count', 0)}")
518
+
519
+ print("\n" + "=" * 60)
520
+ print("Demo complete!")
521
+
522
+
523
+ if __name__ == "__main__":
524
+ asyncio.run(main())
525
+ ```
526
+
527
+ ---
528
+
529
+ ## Verification Checklist
530
+
531
+ ### Unit Tests
532
+ ```bash
533
+ # Run just OpenAlex tests
534
+ uv run pytest tests/unit/tools/test_openalex.py -v
535
+
536
+ # Expected: All tests pass
537
+ ```
538
+
539
+ ### Integration Test (Manual)
540
+ ```bash
541
+ # Run demo script with real API
542
+ uv run python examples/openalex_demo.py
543
+
544
+ # Expected: Real results from OpenAlex API
545
+ ```
546
+
547
+ ### Full Test Suite
548
+ ```bash
549
+ # Ensure nothing broke
550
+ make check
551
+
552
+ # Expected: All 110+ tests pass, mypy clean
553
+ ```
554
+
555
+ ---
556
+
557
+ ## Success Criteria
558
+
559
+ 1. **Unit tests pass**: All mocked tests in `test_openalex.py` pass
560
+ 2. **Integration works**: Demo script returns real results
561
+ 3. **No regressions**: `make check` passes completely
562
+ 4. **SearchHandler integration**: OpenAlex appears in search results alongside other sources
563
+ 5. **Citation metadata**: Results include `cited_by_count`, `concepts`, `is_open_access`
564
+
565
+ ---
566
+
567
+ ## Future Enhancements (P2)
568
+
569
+ Once basic integration works:
570
+
571
+ 1. **Citation Network Queries**
572
+ ```python
573
+ # Get papers citing a specific work
574
+ async def get_citing_works(self, work_id: str) -> list[Evidence]:
575
+ params = {"filter": f"cites:{work_id}"}
576
+ ...
577
+ ```
578
+
579
+ 2. **Concept-Based Search**
580
+ ```python
581
+ # Search by OpenAlex concept ID
582
+ async def search_by_concept(self, concept_id: str) -> list[Evidence]:
583
+ params = {"filter": f"concepts.id:{concept_id}"}
584
+ ...
585
+ ```
586
+
587
+ 3. **Author Tracking**
588
+ ```python
589
+ # Find all works by an author
590
+ async def search_by_author(self, author_id: str) -> list[Evidence]:
591
+ params = {"filter": f"authorships.author.id:{author_id}"}
592
+ ...
593
+ ```
594
+
595
+ ---
596
+
597
+ ## Notes
598
+
599
+ - OpenAlex is **very generous** with rate limits (no documented hard limit)
600
+ - Adding `mailto` parameter gives priority access (polite pool)
601
+ - Abstract is stored as inverted index - must reconstruct
602
+ - Citation count is a good proxy for paper quality/impact
603
+ - Consider caching responses for repeated queries (sketch below)
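+
+ A hedged sketch of such a cache; a real version would add TTL and eviction, and `cached_search` is an illustrative name:
+
+ ```python
+ from src.tools.openalex import OpenAlexTool
+ from src.utils.models import Evidence
+
+ _CACHE: dict[tuple[str, int], list[Evidence]] = {}
+
+ async def cached_search(
+     tool: OpenAlexTool, query: str, max_results: int = 10
+ ) -> list[Evidence]:
+     """Memoize OpenAlex searches within a single run."""
+     key = (query, max_results)
+     if key not in _CACHE:
+         _CACHE[key] = await tool.search(query, max_results)
+     return _CACHE[key]
+ ```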
docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md ADDED
@@ -0,0 +1,586 @@
1
+ # Phase 16: PubMed Full-Text Retrieval
2
+
3
+ **Priority**: MEDIUM - Enhances evidence quality
4
+ **Effort**: ~3 hours
5
+ **Dependencies**: None (existing PubMed tool sufficient)
6
+
7
+ ---
8
+
9
+ ## Prerequisites (COMPLETED)
10
+
11
+ The `Evidence.metadata` field has been added to `src/utils/models.py` to support:
12
+ ```python
13
+ metadata={"has_fulltext": True}
14
+ ```
15
+
16
+ ---
17
+
18
+ ## Architecture Decision: Constructor Parameter vs Method Parameter
19
+
20
+ **IMPORTANT**: The original spec proposed `include_fulltext` as a method parameter:
21
+ ```python
22
+ # WRONG - SearchHandler won't pass this parameter
23
+ async def search(self, query: str, max_results: int = 10, include_fulltext: bool = False):
24
+ ```
25
+
26
+ **Problem**: `SearchHandler` calls `tool.search(query, max_results)` uniformly across all tools.
27
+ It has no mechanism to pass tool-specific parameters like `include_fulltext`.
28
+
29
+ **Solution**: Use constructor parameter instead:
30
+ ```python
31
+ # CORRECT - Configured at instantiation time
32
+ class PubMedTool:
33
+ def __init__(self, api_key: str | None = None, include_fulltext: bool = False):
34
+ self.include_fulltext = include_fulltext
35
+ ...
36
+ ```
37
+
38
+ This way, you can create a full-text-enabled PubMed tool:
39
+ ```python
40
+ # In orchestrator or wherever tools are created
41
+ tools = [
42
+ PubMedTool(include_fulltext=True), # Full-text enabled
43
+ ClinicalTrialsTool(),
44
+ EuropePMCTool(),
45
+ ]
46
+ ```
47
+
48
+ ---
49
+
50
+ ## Overview
51
+
52
+ Add full-text retrieval for PubMed papers via the BioC API, enabling:
53
+ - Complete paper text for open-access PMC papers
54
+ - Structured sections (intro, methods, results, discussion)
55
+ - Better evidence for LLM synthesis
56
+
57
+ **Why Full-Text?**
58
+ - Abstracts only give ~200-300 words
59
+ - Full text provides detailed methods, results, figures
60
+ - Reference repo already has this implemented
61
+ - Makes LLM judgments more accurate
62
+
63
+ ---
64
+
65
+ ## TDD Implementation Plan
66
+
67
+ ### Step 1: Write the Tests First
68
+
69
+ **File**: `tests/unit/tools/test_pubmed_fulltext.py`
70
+
71
+ ```python
+ """Tests for PubMed full-text retrieval."""
+
+ import pytest
+ import respx
+ from httpx import Response
+
+ from src.tools.pubmed import PubMedTool
+
+
+ class TestPubMedFullText:
+     """Test suite for PubMed full-text functionality."""
+
+     @pytest.fixture
+     def tool(self) -> PubMedTool:
+         return PubMedTool()
+
+     @respx.mock
+     @pytest.mark.asyncio
+     async def test_get_pmc_id_success(self, tool: PubMedTool) -> None:
+         """Should convert PMID to PMCID for full-text access."""
+         mock_response = {
+             "records": [
+                 {
+                     "pmid": "12345678",
+                     "pmcid": "PMC1234567",
+                 }
+             ]
+         }
+
+         respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
+             return_value=Response(200, json=mock_response)
+         )
+
+         pmcid = await tool.get_pmc_id("12345678")
+         assert pmcid == "PMC1234567"
+
+     @respx.mock
+     @pytest.mark.asyncio
+     async def test_get_pmc_id_not_in_pmc(self, tool: PubMedTool) -> None:
+         """Should return None if paper not in PMC."""
+         mock_response = {
+             "records": [
+                 {
+                     "pmid": "12345678",
+                     # No pmcid means not in PMC
+                 }
+             ]
+         }
+
+         respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
+             return_value=Response(200, json=mock_response)
+         )
+
+         pmcid = await tool.get_pmc_id("12345678")
+         assert pmcid is None
+
+     @respx.mock
+     @pytest.mark.asyncio
+     async def test_get_fulltext_success(self, tool: PubMedTool) -> None:
+         """Should retrieve full text for PMC papers."""
+         # Mock BioC API response
+         mock_bioc = {
+             "documents": [
+                 {
+                     "passages": [
+                         {
+                             "infons": {"section_type": "INTRO"},
+                             "text": "Introduction text here.",
+                         },
+                         {
+                             "infons": {"section_type": "METHODS"},
+                             "text": "Methods description here.",
+                         },
+                         {
+                             "infons": {"section_type": "RESULTS"},
+                             "text": "Results summary here.",
+                         },
+                         {
+                             "infons": {"section_type": "DISCUSS"},
+                             "text": "Discussion and conclusions.",
+                         },
+                     ]
+                 }
+             ]
+         }
+
+         respx.get(
+             "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
+         ).mock(return_value=Response(200, json=mock_bioc))
+
+         fulltext = await tool.get_fulltext("12345678")
+
+         assert fulltext is not None
+         assert "Introduction text here" in fulltext
+         assert "Methods description here" in fulltext
+         assert "Results summary here" in fulltext
+
+     @respx.mock
+     @pytest.mark.asyncio
+     async def test_get_fulltext_not_available(self, tool: PubMedTool) -> None:
+         """Should return None if full text not available."""
+         respx.get(
+             "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/99999999/unicode"
+         ).mock(return_value=Response(404))
+
+         fulltext = await tool.get_fulltext("99999999")
+         assert fulltext is None
+
+     @respx.mock
+     @pytest.mark.asyncio
+     async def test_get_fulltext_structured(self, tool: PubMedTool) -> None:
+         """Should return structured sections dict."""
+         mock_bioc = {
+             "documents": [
+                 {
+                     "passages": [
+                         {"infons": {"section_type": "INTRO"}, "text": "Intro..."},
+                         {"infons": {"section_type": "METHODS"}, "text": "Methods..."},
+                         {"infons": {"section_type": "RESULTS"}, "text": "Results..."},
+                         {"infons": {"section_type": "DISCUSS"}, "text": "Discussion..."},
+                     ]
+                 }
+             ]
+         }
+
+         respx.get(
+             "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
+         ).mock(return_value=Response(200, json=mock_bioc))
+
+         sections = await tool.get_fulltext_structured("12345678")
+
+         assert sections is not None
+         assert "introduction" in sections
+         assert "methods" in sections
+         assert "results" in sections
+         assert "discussion" in sections
+
+     @respx.mock
+     @pytest.mark.asyncio
+     async def test_search_with_fulltext_enabled(self) -> None:
+         """Search should include full text when tool is configured for it."""
+         # Create tool WITH full-text enabled via constructor
+         tool = PubMedTool(include_fulltext=True)
+
+         # Mock esearch
+         respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi").mock(
+             return_value=Response(
+                 200, json={"esearchresult": {"idlist": ["12345678"]}}
+             )
+         )
+
+         # Mock efetch (abstract)
+         mock_xml = """
+         <PubmedArticleSet>
+           <PubmedArticle>
+             <MedlineCitation>
+               <PMID>12345678</PMID>
+               <Article>
+                 <ArticleTitle>Test Paper</ArticleTitle>
+                 <Abstract><AbstractText>Short abstract.</AbstractText></Abstract>
+                 <AuthorList><Author><LastName>Smith</LastName></Author></AuthorList>
+               </Article>
+             </MedlineCitation>
+           </PubmedArticle>
+         </PubmedArticleSet>
+         """
+         respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi").mock(
+             return_value=Response(200, text=mock_xml)
+         )
+
+         # Mock ID converter
+         respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
+             return_value=Response(
+                 200, json={"records": [{"pmid": "12345678", "pmcid": "PMC1234567"}]}
+             )
+         )
+
+         # Mock BioC full text
+         mock_bioc = {
+             "documents": [
+                 {
+                     "passages": [
+                         {"infons": {"section_type": "INTRO"}, "text": "Full intro..."},
+                     ]
+                 }
+             ]
+         }
+         respx.get(
+             "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
+         ).mock(return_value=Response(200, json=mock_bioc))
+
+         # NOTE: No include_fulltext param - it's set via constructor
+         results = await tool.search("test", max_results=1)
+
+         assert len(results) == 1
+         # Full text should be appended or replace abstract
+         assert "Full intro" in results[0].content or "Short abstract" in results[0].content
+ ```
+
+ ---
+
+ ### Step 2: Implement Full-Text Methods
+
+ **File**: `src/tools/pubmed.py` (additions to existing class)
+
+ ```python
+ # Add these methods to PubMedTool class
+
+ async def get_pmc_id(self, pmid: str) -> str | None:
+     """
+     Convert PMID to PMCID for full-text access.
+
+     Args:
+         pmid: PubMed ID
+
+     Returns:
+         PMCID if paper is in PMC, None otherwise
+     """
+     url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
+     params = {"ids": pmid, "format": "json"}
+
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         try:
+             response = await client.get(url, params=params)
+             response.raise_for_status()
+             data = response.json()
+
+             records = data.get("records", [])
+             if records and records[0].get("pmcid"):
+                 return records[0]["pmcid"]
+             return None
+
+         except httpx.HTTPError:
+             return None
+
+
+ async def get_fulltext(self, pmid: str) -> str | None:
+     """
+     Get full text for a PubMed paper via BioC API.
+
+     Only works for open-access papers in PubMed Central.
+
+     Args:
+         pmid: PubMed ID
+
+     Returns:
+         Full text as string, or None if not available
+     """
+     url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
+
+     async with httpx.AsyncClient(timeout=60.0) as client:
+         try:
+             response = await client.get(url)
+             if response.status_code == 404:
+                 return None
+             response.raise_for_status()
+             data = response.json()
+
+             # Extract text from all passages
+             documents = data.get("documents", [])
+             if not documents:
+                 return None
+
+             passages = documents[0].get("passages", [])
+             text_parts = [p.get("text", "") for p in passages if p.get("text")]
+
+             return "\n\n".join(text_parts) if text_parts else None
+
+         except httpx.HTTPError:
+             return None
+
+
+ async def get_fulltext_structured(self, pmid: str) -> dict[str, str] | None:
+     """
+     Get structured full text with sections.
+
+     Args:
+         pmid: PubMed ID
+
+     Returns:
+         Dict mapping section names to text, or None if not available
+     """
+     url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
+
+     async with httpx.AsyncClient(timeout=60.0) as client:
+         try:
+             response = await client.get(url)
+             if response.status_code == 404:
+                 return None
+             response.raise_for_status()
+             data = response.json()
+
+             documents = data.get("documents", [])
+             if not documents:
+                 return None
+
+             # Map section types to readable names
+             section_map = {
+                 "INTRO": "introduction",
+                 "METHODS": "methods",
+                 "RESULTS": "results",
+                 "DISCUSS": "discussion",
+                 "CONCL": "conclusion",
+                 "ABSTRACT": "abstract",
+             }
+
+             sections: dict[str, list[str]] = {}
+             for passage in documents[0].get("passages", []):
+                 section_type = passage.get("infons", {}).get("section_type", "other")
+                 section_name = section_map.get(section_type, "other")
+                 text = passage.get("text", "")
+
+                 if text:
+                     if section_name not in sections:
+                         sections[section_name] = []
+                     sections[section_name].append(text)
+
+             # Join multiple passages per section
+             return {k: "\n\n".join(v) for k, v in sections.items()}
+
+         except httpx.HTTPError:
+             return None
+ ```
+
+ ---
+
+ ### Step 3: Update Constructor and Search Method
+
+ Add full-text flag to constructor and update search to use it:
+
+ ```python
+ class PubMedTool:
+     """Search tool for PubMed/NCBI."""
+
+     def __init__(
+         self,
+         api_key: str | None = None,
+         include_fulltext: bool = False,  # NEW CONSTRUCTOR PARAM
+     ) -> None:
+         self.api_key = api_key or settings.ncbi_api_key
+         if self.api_key == "your-ncbi-key-here":
+             self.api_key = None
+         self._last_request_time = 0.0
+         self.include_fulltext = include_fulltext  # Store for use in search()
+
+     async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
+         """
+         Search PubMed and return evidence.
+
+         Note: Full-text enrichment is controlled by the constructor parameter,
+         not a method parameter, because SearchHandler doesn't pass extra args.
+         """
+         # ... existing search logic ...
+
+         evidence_list = self._parse_pubmed_xml(fetch_resp.text)
+
+         # Optionally enrich with full text (if configured at construction)
+         if self.include_fulltext:
+             evidence_list = await self._enrich_with_fulltext(evidence_list)
+
+         return evidence_list
+
+     async def _enrich_with_fulltext(
+         self, evidence_list: list[Evidence]
+     ) -> list[Evidence]:
+         """Attempt to add full text to evidence items."""
+         enriched = []
+
+         for evidence in evidence_list:
+             # Extract PMID from URL
+             url = evidence.citation.url
+             pmid = url.rstrip("/").split("/")[-1] if url else None
+
+             if pmid:
+                 fulltext = await self.get_fulltext(pmid)
+                 if fulltext:
+                     # Replace abstract with full text (truncated)
+                     evidence = Evidence(
+                         content=fulltext[:8000],  # Larger limit for full text
+                         citation=evidence.citation,
+                         relevance=evidence.relevance,
+                         metadata={
+                             **evidence.metadata,
+                             "has_fulltext": True,
+                         },
+                     )
+
+             enriched.append(evidence)
+
+         return enriched
+ ```
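+
+ Call sites then opt in at construction time. A minimal sketch of the wiring (the `SearchHandler` constructor signature here is an assumption for illustration, not the actual registration code):
+
+ ```python
+ # Hypothetical wiring: enable full text only for the PubMed tool
+ pubmed = PubMedTool(include_fulltext=True)
+ handler = SearchHandler(tools=[pubmed])  # assumption: SearchHandler accepts a tool list
+ ```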
+
+ ---
+
+ ## Demo Script
+
+ **File**: `examples/pubmed_fulltext_demo.py`
+
+ ```python
+ #!/usr/bin/env python3
+ """Demo script to verify PubMed full-text retrieval."""
+
+ import asyncio
+
+ from src.tools.pubmed import PubMedTool
+
+
+ async def main():
+     """Run PubMed full-text demo."""
+     tool = PubMedTool()
+
+     print("=" * 60)
+     print("PubMed Full-Text Demo")
+     print("=" * 60)
+
+     # Test 1: Convert PMID to PMCID
+     print("\n[Test 1] Converting PMID to PMCID...")
+     # Use a known open-access paper
+     test_pmid = "34450029"  # Example: COVID-related open-access paper
+     pmcid = await tool.get_pmc_id(test_pmid)
+     print(f"PMID {test_pmid} -> PMCID: {pmcid or 'Not in PMC'}")
+
+     # Test 2: Get full text
+     print("\n[Test 2] Fetching full text...")
+     if pmcid:
+         fulltext = await tool.get_fulltext(test_pmid)
+         if fulltext:
+             print(f"Full text length: {len(fulltext)} characters")
+             print(f"Preview: {fulltext[:500]}...")
+         else:
+             print("Full text not available")
+
+     # Test 3: Get structured sections
+     print("\n[Test 3] Fetching structured sections...")
+     if pmcid:
+         sections = await tool.get_fulltext_structured(test_pmid)
+         if sections:
+             print("Available sections:")
+             for section, text in sections.items():
+                 print(f"  - {section}: {len(text)} chars")
+         else:
+             print("Structured text not available")
+
+     # Test 4: Search with full text
+     # NOTE: enrichment is a constructor flag, not a search() argument
+     print("\n[Test 4] Search with full-text enrichment...")
+     ft_tool = PubMedTool(include_fulltext=True)
+     results = await ft_tool.search("metformin cancer open access", max_results=3)
+
+     for i, evidence in enumerate(results, 1):
+         has_ft = evidence.metadata.get("has_fulltext", False)
+         print(f"\n--- Result {i} ---")
+         print(f"Title: {evidence.citation.title}")
+         print(f"Has Full Text: {has_ft}")
+         print(f"Content Length: {len(evidence.content)} chars")
+
+     print("\n" + "=" * 60)
+     print("Demo complete!")
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
+ ```
+
+ ---
+
+ ## Verification Checklist
+
+ ### Unit Tests
+ ```bash
+ # Run full-text tests
+ uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v
+
+ # Run all PubMed tests
+ uv run pytest tests/unit/tools/test_pubmed.py -v
+
+ # Expected: All tests pass
+ ```
+
+ ### Integration Test (Manual)
+ ```bash
+ # Run demo with real API
+ uv run python examples/pubmed_fulltext_demo.py
+
+ # Expected: Real full text from PMC papers
+ ```
+
+ ### Full Test Suite
+ ```bash
+ make check
+ # Expected: All tests pass, mypy clean
+ ```
+
+ ---
+
+ ## Success Criteria
+
+ 1. **ID Conversion works**: PMID -> PMCID conversion successful
+ 2. **Full text retrieval works**: BioC API returns paper text
+ 3. **Structured sections work**: Can get intro/methods/results/discussion separately
+ 4. **Search integration works**: constructing with `include_fulltext=True` enriches results
+ 5. **No regressions**: Existing tests still pass
+ 6. **Graceful degradation**: Non-PMC papers still return abstracts
+
+ ---
+
+ ## Notes
+
+ - Only ~30% of PubMed papers have full text in PMC
+ - BioC API has no documented rate limit, but be respectful
+ - Full text can be very long - truncate appropriately
+ - Consider caching full-text responses, since they don't change (a minimal sketch follows below)
+ - Timeout should be longer for full text (60s vs 30s)
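+
+ A minimal caching sketch for the note above (an in-process dict keyed by PMID; the wrapper name is illustrative, and a real implementation might prefer an on-disk cache):
+
+ ```python
+ # Hypothetical wrapper: cache BioC full-text responses per PMID.
+ # Full text for a published PMID is effectively immutable, so a
+ # simple unbounded dict is usually acceptable for a single run.
+ _fulltext_cache: dict[str, str | None] = {}
+
+ async def get_fulltext_cached(tool: "PubMedTool", pmid: str) -> str | None:
+     if pmid not in _fulltext_cache:
+         _fulltext_cache[pmid] = await tool.get_fulltext(pmid)
+     return _fulltext_cache[pmid]
+ ```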
docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md ADDED
@@ -0,0 +1,540 @@
+ # Phase 17: Rate Limiting with `limits` Library
+
+ **Priority**: P0 CRITICAL - Prevents API blocks
+ **Effort**: ~1 hour
+ **Dependencies**: None
+
+ ---
+
+ ## CRITICAL: Async Safety Requirements
+
+ **WARNING**: The rate limiter MUST be async-safe. Blocking the event loop will freeze:
+ - The Gradio UI
+ - All parallel searches
+ - The orchestrator
+
+ **Rules**:
+ 1. **NEVER use `time.sleep()`** - Always use `await asyncio.sleep()`
+ 2. **NEVER use blocking while loops** - Use async-aware polling
+ 3. **The `limits` library check is synchronous** - Wrap it carefully
+
+ The implementation below uses a polling pattern that:
+ - Checks the limit (synchronous, fast)
+ - Sleeps via `await asyncio.sleep()` if the limit is exceeded (non-blocking)
+ - Retries the check
+
+ **Alternative**: If `limits` proves problematic, use `aiolimiter`, which is pure-async (a minimal sketch follows).
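+
+ For reference, a minimal `aiolimiter` sketch (assumes `aiolimiter` would be added as a dependency; the fetch helper is illustrative, not existing code):
+
+ ```python
+ import httpx
+ from aiolimiter import AsyncLimiter
+
+ # 3 requests per 1-second window (NCBI default without an API key)
+ pubmed_limiter = AsyncLimiter(max_rate=3, time_period=1)
+
+ async def fetch(client: httpx.AsyncClient, url: str) -> httpx.Response:
+     # The context manager waits for a free slot while yielding
+     # to the event loop instead of blocking it.
+     async with pubmed_limiter:
+         return await client.get(url)
+ ```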
+
+ ---
+
+ ## Overview
+
+ Replace the naive `asyncio.sleep` rate limiting with a proper rate limiter built on the `limits` library, which provides:
+ - Moving window rate limiting
+ - Per-API configurable limits
+ - Thread-safe storage
+ - Already used in the reference repo
+
+ **Why This Matters**
+ - NCBI will block us without proper rate limiting (3/sec without key, 10/sec with)
+ - Current implementation only has a simple sleep delay
+ - Need coordinated limits across all PubMed calls
+ - Professional-grade rate limiting prevents production issues
+
+ ---
+
+ ## Current State
+
+ ### What We Have (`src/tools/pubmed.py:20-21, 34-41`)
+
+ ```python
+ RATE_LIMIT_DELAY = 0.34  # ~3 requests/sec without API key
+
+ async def _rate_limit(self) -> None:
+     """Enforce NCBI rate limiting."""
+     loop = asyncio.get_running_loop()
+     now = loop.time()
+     elapsed = now - self._last_request_time
+     if elapsed < self.RATE_LIMIT_DELAY:
+         await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed)
+     self._last_request_time = loop.time()
+ ```
+
+ ### Problems
+
+ 1. **Not shared across instances**: Each `PubMedTool()` has its own counter
+ 2. **Simple delay vs moving window**: Doesn't handle bursts properly
+ 3. **Hardcoded rate**: Doesn't adapt to API key presence
+ 4. **No backoff on 429**: Just retries blindly
+
+ ---
+
+ ## TDD Implementation Plan
+
+ ### Step 1: Add Dependency
+
+ **File**: `pyproject.toml`
+
+ ```toml
+ dependencies = [
+     # ... existing deps ...
+     "limits>=3.0",
+ ]
+ ```
+
+ Then run:
+ ```bash
+ uv sync
+ ```
+
+ ---
+
+ ### Step 2: Write the Tests First
+
+ **File**: `tests/unit/tools/test_rate_limiting.py`
+
+ ```python
+ """Tests for rate limiting functionality."""
+
+ import asyncio
+ import time
+
+ import pytest
+
+ from src.tools.rate_limiter import (
+     RateLimiter,
+     get_pubmed_limiter,
+     reset_pubmed_limiter,
+ )
+
+
+ class TestRateLimiter:
+     """Test suite for rate limiter."""
+
+     def test_create_limiter_without_api_key(self) -> None:
+         """Should create 3/sec limiter without API key."""
+         limiter = RateLimiter(rate="3/second")
+         assert limiter.rate == "3/second"
+
+     def test_create_limiter_with_api_key(self) -> None:
+         """Should create 10/sec limiter with API key."""
+         limiter = RateLimiter(rate="10/second")
+         assert limiter.rate == "10/second"
+
+     @pytest.mark.asyncio
+     async def test_limiter_allows_requests_under_limit(self) -> None:
+         """Should allow requests under the rate limit."""
+         limiter = RateLimiter(rate="10/second")
+
+         # 3 requests should all succeed immediately
+         for _ in range(3):
+             allowed = await limiter.acquire()
+             assert allowed is True
+
+     @pytest.mark.asyncio
+     async def test_limiter_blocks_when_exceeded(self) -> None:
+         """Should wait when rate limit exceeded."""
+         limiter = RateLimiter(rate="2/second")
+
+         # First 2 should be instant
+         await limiter.acquire()
+         await limiter.acquire()
+
+         # Third should block until the 1-second moving window frees a slot (~1s)
+         start = time.monotonic()
+         await limiter.acquire()
+         elapsed = time.monotonic() - start
+
+         # Conservative lower bound keeps the test robust to timing jitter
+         assert elapsed >= 0.3
+
+     @pytest.mark.asyncio
+     async def test_limiter_resets_after_window(self) -> None:
+         """Rate limit should reset after time window."""
+         limiter = RateLimiter(rate="5/second")
+
+         # Use up the limit
+         for _ in range(5):
+             await limiter.acquire()
+
+         # Wait for window to pass
+         await asyncio.sleep(1.1)
+
+         # Should be allowed again
+         start = time.monotonic()
+         await limiter.acquire()
+         elapsed = time.monotonic() - start
+
+         assert elapsed < 0.1  # Should be nearly instant
+
+
+ class TestGetPubmedLimiter:
+     """Test PubMed-specific limiter factory."""
+
+     def test_limiter_without_api_key(self) -> None:
+         """Should return 3/sec limiter without key."""
+         reset_pubmed_limiter()  # Clear singleton state from other tests
+         limiter = get_pubmed_limiter(api_key=None)
+         assert "3" in limiter.rate
+
+     def test_limiter_with_api_key(self) -> None:
+         """Should return 10/sec limiter with key."""
+         reset_pubmed_limiter()  # Clear singleton state from other tests
+         limiter = get_pubmed_limiter(api_key="my-api-key")
+         assert "10" in limiter.rate
+
+     def test_limiter_is_singleton(self) -> None:
+         """Same API key should return same limiter instance."""
+         reset_pubmed_limiter()
+         limiter1 = get_pubmed_limiter(api_key="key1")
+         limiter2 = get_pubmed_limiter(api_key="key1")
+         assert limiter1 is limiter2
+
+     def test_different_keys_share_limiter(self) -> None:
+         """Different API keys still share one limiter, since all calls
+         count against the same NCBI-wide quota."""
+         reset_pubmed_limiter()
+         limiter1 = get_pubmed_limiter(api_key="key1")
+         limiter2 = get_pubmed_limiter(api_key="key2")
+         assert limiter1 is limiter2  # Shared NCBI rate limit
+ ```
+
+ ---
+
+ ### Step 3: Create Rate Limiter Module
+
+ **File**: `src/tools/rate_limiter.py`
+
+ ```python
+ """Rate limiting utilities using the limits library."""
+
+ import asyncio
+ from typing import ClassVar
+
+ from limits import RateLimitItem, parse
+ from limits.storage import MemoryStorage
+ from limits.strategies import MovingWindowRateLimiter
+
+
+ class RateLimiter:
+     """
+     Async-compatible rate limiter using limits library.
+
+     Uses moving window algorithm for smooth rate limiting.
+     """
+
+     def __init__(self, rate: str) -> None:
+         """
+         Initialize rate limiter.
+
+         Args:
+             rate: Rate string like "3/second" or "10/second"
+         """
+         self.rate = rate
+         self._storage = MemoryStorage()
+         self._limiter = MovingWindowRateLimiter(self._storage)
+         self._rate_limit: RateLimitItem = parse(rate)
+         self._identity = "default"  # Single identity for shared limiting
+
+     async def acquire(self, wait: bool = True) -> bool:
+         """
+         Acquire permission to make a request.
+
+         ASYNC-SAFE: Uses asyncio.sleep(), never time.sleep().
+         The polling pattern allows other coroutines to run while waiting.
+
+         Args:
+             wait: If True, wait until allowed. If False, return immediately.
+
+         Returns:
+             True if allowed, False if not (only when wait=False)
+         """
+         while True:
+             # Check if we can proceed (synchronous, fast - ~microseconds)
+             if self._limiter.hit(self._rate_limit, self._identity):
+                 return True
+
+             if not wait:
+                 return False
+
+             # CRITICAL: Use asyncio.sleep(), NOT time.sleep()
+             # This yields control to the event loop, allowing other
+             # coroutines (UI, parallel searches) to run
+             await asyncio.sleep(0.1)
+
+     def reset(self) -> None:
+         """Reset the rate limiter (for testing)."""
+         self._storage.reset()
+
+
+ # Singleton limiter for PubMed/NCBI
+ _pubmed_limiter: RateLimiter | None = None
+
+
+ def get_pubmed_limiter(api_key: str | None = None) -> RateLimiter:
+     """
+     Get the shared PubMed rate limiter.
+
+     Rate depends on whether API key is provided:
+     - Without key: 3 requests/second
+     - With key: 10 requests/second
+
+     NOTE: The first call fixes the rate; call reset_pubmed_limiter()
+     if the API key changes at runtime.
+
+     Args:
+         api_key: NCBI API key (optional)
+
+     Returns:
+         Shared RateLimiter instance
+     """
+     global _pubmed_limiter
+
+     if _pubmed_limiter is None:
+         rate = "10/second" if api_key else "3/second"
+         _pubmed_limiter = RateLimiter(rate)
+
+     return _pubmed_limiter
+
+
+ def reset_pubmed_limiter() -> None:
+     """Reset the PubMed limiter (for testing)."""
+     global _pubmed_limiter
+     _pubmed_limiter = None
+
+
+ # Factory for other APIs
+ class RateLimiterFactory:
+     """Factory for creating/getting rate limiters for different APIs."""
+
+     _limiters: ClassVar[dict[str, RateLimiter]] = {}
+
+     @classmethod
+     def get(cls, api_name: str, rate: str) -> RateLimiter:
+         """
+         Get or create a rate limiter for an API.
+
+         Args:
+             api_name: Unique identifier for the API
+             rate: Rate limit string (e.g., "10/second")
+
+         Returns:
+             RateLimiter instance (shared for same api_name)
+         """
+         if api_name not in cls._limiters:
+             cls._limiters[api_name] = RateLimiter(rate)
+         return cls._limiters[api_name]
+
+     @classmethod
+     def reset_all(cls) -> None:
+         """Reset all limiters (for testing)."""
+         cls._limiters.clear()
+ ```
+
+ ---
+
+ ### Step 4: Update PubMed Tool
+
+ **File**: `src/tools/pubmed.py` (replace rate limiting code)
+
+ ```python
+ # Replace imports and rate limiting
+
+ from src.tools.rate_limiter import get_pubmed_limiter
+
+
+ class PubMedTool:
+     """Search tool for PubMed/NCBI."""
+
+     BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
+     HTTP_TOO_MANY_REQUESTS = 429
+
+     def __init__(self, api_key: str | None = None) -> None:
+         self.api_key = api_key or settings.ncbi_api_key
+         if self.api_key == "your-ncbi-key-here":
+             self.api_key = None
+         # Use shared rate limiter
+         self._limiter = get_pubmed_limiter(self.api_key)
+
+     async def _rate_limit(self) -> None:
+         """Enforce NCBI rate limiting using shared limiter."""
+         await self._limiter.acquire()
+
+     # ... rest of class unchanged ...
+ ```
+
+ ---
+
+ ### Step 5: Add Rate Limiters for Other APIs
+
+ **File**: `src/tools/clinicaltrials.py` (optional)
+
+ ```python
+ from src.tools.rate_limiter import RateLimiterFactory
+
+
+ class ClinicalTrialsTool:
+     def __init__(self) -> None:
+         # ClinicalTrials.gov doesn't document limits, but be conservative
+         self._limiter = RateLimiterFactory.get("clinicaltrials", "5/second")
+
+     async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
+         await self._limiter.acquire()
+         # ... rest of method ...
+ ```
+
+ **File**: `src/tools/europepmc.py` (optional)
+
+ ```python
+ from src.tools.rate_limiter import RateLimiterFactory
+
+
+ class EuropePMCTool:
+     def __init__(self) -> None:
+         # Europe PMC is generous, but still be respectful
+         self._limiter = RateLimiterFactory.get("europepmc", "10/second")
+
+     async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
+         await self._limiter.acquire()
+         # ... rest of method ...
+ ```
+
+ ---
+
+ ## Demo Script
+
+ **File**: `examples/rate_limiting_demo.py`
+
+ ```python
+ #!/usr/bin/env python3
+ """Demo script to verify rate limiting works correctly."""
+
+ import asyncio
+ import time
+
+ from src.tools.pubmed import PubMedTool
+ from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter, reset_pubmed_limiter
+
+
+ async def test_basic_limiter():
+     """Test basic rate limiter behavior."""
+     print("=" * 60)
+     print("Rate Limiting Demo")
+     print("=" * 60)
+
+     # Test 1: Basic limiter
+     print("\n[Test 1] Testing 3/second limiter...")
+     limiter = RateLimiter("3/second")
+
+     start = time.monotonic()
+     for i in range(6):
+         await limiter.acquire()
+         elapsed = time.monotonic() - start
+         print(f"  Request {i+1} at {elapsed:.2f}s")
+
+     total = time.monotonic() - start
+     print(f"  Total time for 6 requests: {total:.2f}s (expected ~1s: 3 instant, 3 after the window frees)")
+
+
+ async def test_pubmed_limiter():
+     """Test PubMed-specific limiter."""
+     print("\n[Test 2] Testing PubMed limiter (shared)...")
+
+     reset_pubmed_limiter()  # Clean state
+
+     # Without API key: 3/sec
+     limiter = get_pubmed_limiter(api_key=None)
+     print(f"  Rate without key: {limiter.rate}")
+
+     # Multiple tools should share the same limiter
+     tool1 = PubMedTool()
+     tool2 = PubMedTool()
+
+     # Verify they share the limiter
+     print(f"  Tools share limiter: {tool1._limiter is tool2._limiter}")
+
+
+ async def test_concurrent_requests():
+     """Test rate limiting under concurrent load."""
+     print("\n[Test 3] Testing concurrent request limiting...")
+
+     limiter = RateLimiter("5/second")
+
+     async def make_request(i: int):
+         await limiter.acquire()
+         return time.monotonic()
+
+     start = time.monotonic()
+     # Launch 10 concurrent requests
+     tasks = [make_request(i) for i in range(10)]
+     times = await asyncio.gather(*tasks)
+
+     # Calculate distribution
+     relative_times = [t - start for t in times]
+     print(f"  Request times: {[f'{t:.2f}s' for t in sorted(relative_times)]}")
+
+     total = max(relative_times)
+     print(f"  All 10 requests completed in {total:.2f}s (expected ~1s: 5 instant, 5 after the window frees)")
+
+
+ async def main():
+     await test_basic_limiter()
+     await test_pubmed_limiter()
+     await test_concurrent_requests()
+
+     print("\n" + "=" * 60)
+     print("Demo complete!")
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
+ ```
+
+ ---
+
+ ## Verification Checklist
+
+ ### Unit Tests
+ ```bash
+ # Run rate limiting tests
+ uv run pytest tests/unit/tools/test_rate_limiting.py -v
+
+ # Expected: All tests pass
+ ```
+
+ ### Integration Test (Manual)
+ ```bash
+ # Run demo
+ uv run python examples/rate_limiting_demo.py
+
+ # Expected: Requests properly spaced
+ ```
+
+ ### Full Test Suite
+ ```bash
+ make check
+ # Expected: All tests pass, mypy clean
+ ```
+
+ ---
+
+ ## Success Criteria
+
+ 1. **`limits` library installed**: Dependency added to pyproject.toml
+ 2. **RateLimiter class works**: Can create and use limiters
+ 3. **PubMed uses new limiter**: Shared limiter across instances
+ 4. **Rate adapts to API key**: 3/sec without, 10/sec with
+ 5. **Concurrent requests handled**: Multiple async requests properly queued
+ 6. **No regressions**: All existing tests pass
+
+ ---
+
+ ## API Rate Limit Reference
+
+ | API | Without Key | With Key |
+ |-----|-------------|----------|
+ | PubMed/NCBI | 3/sec | 10/sec |
+ | ClinicalTrials.gov | Undocumented (~5/sec safe) | N/A |
+ | Europe PMC | ~10-20/sec (generous) | N/A |
+ | OpenAlex | ~100k/day (no per-sec limit) | Faster with `mailto` |
+
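+ The `mailto` hint for OpenAlex is just a query parameter that routes requests into their faster "polite pool". A minimal sketch (the email address is a placeholder):
+
+ ```python
+ import httpx
+
+ async def search_openalex(query: str) -> dict:
+     # "mailto" identifies the caller and unlocks OpenAlex's polite pool
+     params = {"search": query, "mailto": "maintainer@example.org"}
+     async with httpx.AsyncClient(timeout=30.0) as client:
+         resp = await client.get("https://api.openalex.org/works", params=params)
+         resp.raise_for_status()
+         return resp.json()
+ ```
+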
+ ---
+
+ ## Notes
+
+ - `limits` uses a moving window algorithm (fairer than a fixed window)
+ - Singleton pattern ensures all PubMed calls share the limit
+ - The factory pattern allows easy extension to other APIs
+ - Consider adding 429 response detection + exponential backoff (a minimal sketch follows below)
+ - In production, consider Redis storage for distributed rate limiting
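+
+ A minimal 429-backoff sketch for the note above (retry count and delays are illustrative, not a fixed design):
+
+ ```python
+ import asyncio
+ import httpx
+
+ async def get_with_backoff(
+     client: httpx.AsyncClient, url: str, retries: int = 4
+ ) -> httpx.Response:
+     """Retry on HTTP 429 with exponential backoff."""
+     delay = 1.0
+     response = await client.get(url)
+     for _ in range(retries):
+         if response.status_code != 429:
+             break
+         await asyncio.sleep(delay)  # async-safe wait
+         delay *= 2  # 1s, 2s, 4s, 8s
+         response = await client.get(url)
+     return response
+ ```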
docs/brainstorming/implementation/README.md ADDED
@@ -0,0 +1,143 @@
+ # Implementation Plans
+
+ TDD implementation plans based on the brainstorming documents. Each phase is a self-contained vertical slice with tests, implementation, and demo scripts.
+
+ ---
+
+ ## Prerequisites (COMPLETED)
+
+ The following foundational changes have been implemented to support all three phases:
+
+ | Change | File | Status |
+ |--------|------|--------|
+ | Add `"openalex"` to `SourceName` | `src/utils/models.py:9` | ✅ Done |
+ | Add `metadata` field to `Evidence` | `src/utils/models.py:39-42` | ✅ Done |
+ | Export all tools from `__init__.py` | `src/tools/__init__.py` | ✅ Done |
+
+ All 110 tests pass after these changes.
+
+ ---
+
+ ## Priority Order
+
+ | Phase | Name | Priority | Effort | Value |
+ |-------|------|----------|--------|-------|
+ | **17** | Rate Limiting | P0 CRITICAL | 1 hour | Stability |
+ | **15** | OpenAlex | HIGH | 2-3 hours | Very High |
+ | **16** | PubMed Full-Text | MEDIUM | 3 hours | High |
+
+ **Recommended implementation order**: 17 → 15 → 16
+
+ ---
+
+ ## Phase 15: OpenAlex Integration
+
+ **File**: [15_PHASE_OPENALEX.md](./15_PHASE_OPENALEX.md)
+
+ Add OpenAlex as 4th data source for:
+ - Citation networks (who cites whom)
+ - Concept tagging (semantic discovery)
+ - 209M+ scholarly works
+ - Free, no API key required
+
+ **Quick Start**:
+ ```bash
+ # Create the tool
+ touch src/tools/openalex.py
+ touch tests/unit/tools/test_openalex.py
+
+ # Run tests first (TDD)
+ uv run pytest tests/unit/tools/test_openalex.py -v
+
+ # Demo
+ uv run python examples/openalex_demo.py
+ ```
+
+ ---
+
+ ## Phase 16: PubMed Full-Text
+
+ **File**: [16_PHASE_PUBMED_FULLTEXT.md](./16_PHASE_PUBMED_FULLTEXT.md)
+
+ Add full-text retrieval via BioC API for:
+ - Complete paper text (not just abstracts)
+ - Structured sections (intro, methods, results)
+ - Better evidence for LLM synthesis
+
+ **Quick Start**:
+ ```bash
+ # Add methods to existing pubmed.py
+ # Tests in test_pubmed_fulltext.py
+
+ # Run tests
+ uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v
+
+ # Demo
+ uv run python examples/pubmed_fulltext_demo.py
+ ```
+
+ ---
+
+ ## Phase 17: Rate Limiting
+
+ **File**: [17_PHASE_RATE_LIMITING.md](./17_PHASE_RATE_LIMITING.md)
+
+ Replace naive sleep-based rate limiting with the `limits` library for:
+ - Moving window algorithm
+ - Shared limits across instances
+ - Configurable per-API rates
+ - Production-grade stability
+
+ **Quick Start**:
+ ```bash
+ # Add dependency
+ uv add limits
+
+ # Create module
+ touch src/tools/rate_limiter.py
+ touch tests/unit/tools/test_rate_limiting.py
+
+ # Run tests
+ uv run pytest tests/unit/tools/test_rate_limiting.py -v
+
+ # Demo
+ uv run python examples/rate_limiting_demo.py
+ ```
+
+ ---
+
+ ## TDD Workflow
+
+ Each implementation doc follows this pattern:
+
+ 1. **Write tests first** - Define expected behavior
+ 2. **Run tests** - Verify they fail (red)
+ 3. **Implement** - Write minimal code to pass
+ 4. **Run tests** - Verify they pass (green)
+ 5. **Refactor** - Clean up if needed
+ 6. **Demo** - Verify end-to-end with real APIs
+ 7. **`make check`** - Ensure no regressions
+
+ ---
+
+ ## Related Brainstorming Docs
+
+ These implementation plans are derived from:
+
+ - [00_ROADMAP_SUMMARY.md](../00_ROADMAP_SUMMARY.md) - Priority overview
+ - [01_PUBMED_IMPROVEMENTS.md](../01_PUBMED_IMPROVEMENTS.md) - PubMed details
+ - [02_CLINICALTRIALS_IMPROVEMENTS.md](../02_CLINICALTRIALS_IMPROVEMENTS.md) - CT.gov details
+ - [03_EUROPEPMC_IMPROVEMENTS.md](../03_EUROPEPMC_IMPROVEMENTS.md) - Europe PMC details
+ - [04_OPENALEX_INTEGRATION.md](../04_OPENALEX_INTEGRATION.md) - OpenAlex integration
+
+ ---
+
+ ## Future Phases (Not Yet Documented)
+
+ Based on brainstorming, these could be added later:
+
+ - **Phase 18**: ClinicalTrials.gov Results Retrieval
+ - **Phase 19**: Europe PMC Annotations API
+ - **Phase 20**: Drug Name Normalization (RxNorm)
+ - **Phase 21**: Citation Network Queries (OpenAlex)
+ - **Phase 22**: Semantic Search with Embeddings
src/tools/__init__.py CHANGED
@@ -1,8 +1,16 @@
 """Search tools package."""
 
 from src.tools.base import SearchTool
+from src.tools.clinicaltrials import ClinicalTrialsTool
+from src.tools.europepmc import EuropePMCTool
 from src.tools.pubmed import PubMedTool
 from src.tools.search_handler import SearchHandler
 
-# Re-export
-__all__ = ["PubMedTool", "SearchHandler", "SearchTool"]
+# Re-export all search tools
+__all__ = [
+    "ClinicalTrialsTool",
+    "EuropePMCTool",
+    "PubMedTool",
+    "SearchHandler",
+    "SearchTool",
+]
src/utils/models.py CHANGED
@@ -6,7 +6,7 @@ from typing import Any, ClassVar, Literal
 from pydantic import BaseModel, Field
 
 # Centralized source type - add new sources here (e.g., new databases)
-SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint"]
+SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
 
 
 class Citation(BaseModel):
@@ -36,6 +36,10 @@ class Evidence(BaseModel):
     content: str = Field(min_length=1, description="The actual text content")
     citation: Citation
     relevance: float = Field(default=0.0, ge=0.0, le=1.0, description="Relevance score 0-1")
+    metadata: dict[str, Any] = Field(
+        default_factory=dict,
+        description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
+    )
 
     model_config = {"frozen": True}