VibecoderMcSwaggins committed on
Commit b72f9f1 · 2 Parent(s): 5b5d2ca 316dc7d

Merge branch 'dev'
.env.example CHANGED
@@ -8,8 +8,16 @@ OPENAI_API_KEY=sk-your-key-here
 ANTHROPIC_API_KEY=sk-ant-your-key-here
 
 # Model names (optional - sensible defaults)
-OPENAI_MODEL=gpt-5.1
-ANTHROPIC_MODEL=claude-sonnet-4-5-20250929
+ANTHROPIC_MODEL=claude-3-5-sonnet-20240620
+OPENAI_MODEL=gpt-4-turbo
+
+# ============== EMBEDDINGS ==============
+
+# OpenAI Embedding Model (used if LLM_PROVIDER is openai and performing RAG/Embeddings)
+OPENAI_EMBEDDING_MODEL=text-embedding-3-small
+
+# Local Embedding Model (used for local/offline embeddings)
+LOCAL_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
 
 # ============== HUGGINGFACE (FREE TIER) ==============
@@ -20,7 +28,7 @@ ANTHROPIC_MODEL=claude-sonnet-4-5-20250929
 # WITH HF_TOKEN: Uses Llama 3.1 8B Instruct (requires accepting license)
 #
 # For HuggingFace Spaces deployment:
-# Set this as a "Secret" in Space Settings Variables and secrets
+# Set this as a "Secret" in Space Settings -> Variables and secrets
 # Users/judges don't need their own token - the Space secret is used
 #
 HF_TOKEN=hf_your-token-here
@@ -36,9 +44,5 @@ LOG_LEVEL=INFO
 # PubMed (optional - higher rate limits)
 NCBI_API_KEY=your-ncbi-key-here
 
-# Modal Sandbox (optional - for secure code execution)
-MODAL_TOKEN_ID=ak-your-modal-token-id-here
-MODAL_TOKEN_SECRET=your-modal-token-secret-here
-
 # Vector Database (optional - for LlamaIndex RAG)
 CHROMA_DB_PATH=./chroma_db
docs/brainstorming/00_ROADMAP_SUMMARY.md ADDED
@@ -0,0 +1,194 @@
# DeepCritical Data Sources: Roadmap Summary

**Created**: 2024-11-27
**Purpose**: Future maintainability and hackathon continuation

---

## Current State

### Working Tools

| Tool | Status | Data Quality |
|------|--------|--------------|
| PubMed | ✅ Works | Good (abstracts only) |
| ClinicalTrials.gov | ✅ Works | Good (filtered for interventional) |
| Europe PMC | ✅ Works | Good (includes preprints) |

### Removed Tools

| Tool | Status | Reason |
|------|--------|--------|
| bioRxiv | ❌ Removed | No search API - only date/DOI lookup |

---

## Priority Improvements

### P0: Critical (Do First)

1. **Add Rate Limiting to PubMed**
   - NCBI will block us without it
   - Use the `limits` library (see reference repo and the sketch below)
   - 3/sec without a key, 10/sec with one

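A minimal sketch of that guard, assuming the `limits` setup shown in `01_PUBMED_IMPROVEMENTS.md` (the `"ncbi"` bucket name and the synchronous client are illustrative):

```python
import time

import httpx
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)
NCBI_RATE = parse("3/second")  # switch to "10/second" when NCBI_API_KEY is set


def rate_limited_get(url: str, params: dict) -> httpx.Response:
    """Wait until the moving window has room, then issue the request."""
    while not limiter.hit(NCBI_RATE, "ncbi"):
        time.sleep(0.05)  # hit() returns False while the window is full
    return httpx.get(url, params=params, timeout=30.0)
```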

### P1: High Value, Medium Effort

2. **Add OpenAlex as 4th Source**
   - Citation network (huge for drug repurposing)
   - Concept tagging (semantic discovery)
   - Already implemented in reference repo
   - Free, no API key

3. **PubMed Full-Text via BioC**
   - Get full paper text for PMC papers
   - Already in reference repo

### P2: Nice to Have

4. **ClinicalTrials.gov Results**
   - Get efficacy data from completed trials
   - Requires more complex API calls

5. **Europe PMC Annotations**
   - Text-mined entities (genes, drugs, diseases)
   - Automatic entity extraction

---

## Effort Estimates

| Improvement | Effort | Impact | Priority |
|-------------|--------|--------|----------|
| PubMed rate limiting | 1 hour | Stability | P0 |
| OpenAlex basic search | 2 hours | High | P1 |
| OpenAlex citations | 2 hours | Very High | P1 |
| PubMed full-text | 3 hours | Medium | P1 |
| CT.gov results | 4 hours | Medium | P2 |
| Europe PMC annotations | 3 hours | Medium | P2 |

---

## Architecture Decision

### Option A: Keep Current + Add OpenAlex

```
User Query
    ↓
┌───────────────────┼───────────────────┐
↓                   ↓                   ↓
PubMed        ClinicalTrials      Europe PMC
(abstracts)   (trials only)       (preprints)
↓                   ↓                   ↓
└───────────────────┼───────────────────┘
    ↓
OpenAlex ← NEW
(citations, concepts)
    ↓
Orchestrator
    ↓
Report
```

**Pros**: Low risk, additive
**Cons**: More complexity, some overlap

### Option B: OpenAlex as Primary

```
User Query
    ↓
┌───────────────────┼───────────────────┐
↓                   ↓                   ↓
OpenAlex      ClinicalTrials      Europe PMC
(primary      (trials only)       (full-text
search)                           fallback)
↓                   ↓                   ↓
└───────────────────┼───────────────────┘
    ↓
Orchestrator
    ↓
Report
```

**Pros**: Simpler, citation network built-in
**Cons**: Lose some PubMed-specific features

### Recommendation: Option A

Keep the current architecture working; add OpenAlex incrementally.

---

## Quick Wins (Can Do Today)

1. **Add `limits` to `pyproject.toml`**
   ```toml
   dependencies = [
       "limits>=3.0",
   ]
   ```

2. **Copy OpenAlex tool from reference repo**
   - File: `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`
   - Adapt to our `SearchTool` base class

3. **Enable NCBI API Key**
   - Add to `.env`: `NCBI_API_KEY=your_key`
   - 10x rate limit improvement

---

## External Resources Worth Exploring

### Python Libraries

| Library | For | Notes |
|---------|-----|-------|
| `limits` | Rate limiting | Used by reference repo |
| `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) |
| `metapub` | PubMed | Full-featured |
| `sentence-transformers` | Semantic search | For embeddings |

### APIs Not Yet Used

| API | Provides | Effort |
|-----|----------|--------|
| RxNorm | Drug name normalization | Low |
| DrugBank | Drug targets/mechanisms | Medium (license) |
| UniProt | Protein data | Medium |
| ChEMBL | Bioactivity data | Medium |

### RAG Tools (Future)

| Tool | Purpose |
|------|---------|
| [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers |
| [txtai](https://github.com/neuml/txtai) | Embeddings + search |
| [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings |

---

## Files in This Directory

| File | Contents |
|------|----------|
| `00_ROADMAP_SUMMARY.md` | This file |
| `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
| `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
| `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
| `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |

---

## For Future Maintainers

If you're picking this up after the hackathon:

1. **Start with OpenAlex** - biggest bang for the buck
2. **Add rate limiting** - prevents API blocks
3. **Don't bother with bioRxiv** - use Europe PMC instead
4. **Reference repo is gold** - `reference_repos/DeepCritical/` has working implementations

Good luck! 🚀
docs/brainstorming/01_PUBMED_IMPROVEMENTS.md ADDED
@@ -0,0 +1,125 @@
# PubMed Tool: Current State & Future Improvements

**Status**: Currently Implemented
**Priority**: High (Core Data Source)

---

## Current Implementation

### What We Have (`src/tools/pubmed.py`)

- Basic E-utilities search via `esearch.fcgi` and `efetch.fcgi`
- Query preprocessing (strips question words, expands synonyms)
- Returns: title, abstract, authors, journal, PMID
- Rate limiting: none implemented (relying on NCBI defaults)

### Current Limitations

1. **No Full-Text Access**: Only retrieves abstracts, not full paper text
2. **No Rate Limiting**: Risk of being blocked by NCBI
3. **No BioC Format**: Missing structured full-text extraction
4. **No Figure Retrieval**: No supplementary materials access
5. **No PMC Integration**: Missing open-access full-text via PMC

---

## Reference Implementation (DeepCritical Reference Repo)

The reference repo at `reference_repos/DeepCritical/DeepResearch/src/tools/bioinformatics_tools.py` has a more sophisticated implementation:

### Features We're Missing

```python
# Rate limiting (lines 47-50)
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)
rate_limit = parse("3/second")  # NCBI allows 3/sec without API key, 10/sec with

# Full-text via BioC format (lines 108-120)
def _get_fulltext(pmid: int) -> dict[str, Any] | None:
    pmid_url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Returns structured JSON with full text for open-access papers

# Figure retrieval via Europe PMC (lines 123-149)
def _get_figures(pmcid: str) -> dict[str, str]:
    suppl_url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles"
    # Returns base64-encoded images from supplementary materials
```

---

## Recommended Improvements

### Phase 1: Rate Limiting (Critical)

```python
# Add to src/tools/pubmed.py
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)

# With NCBI_API_KEY: 10/sec; without: 3/sec
def get_rate_limit():
    if settings.ncbi_api_key:
        return parse("10/second")
    return parse("3/second")
```

**Dependencies**: `pip install limits`

### Phase 2: Full-Text Retrieval

```python
async def get_fulltext(pmid: str) -> str | None:
    """Get full text for open-access papers via BioC API."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Only works for PMC papers (open access)
```

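A hedged completion of that stub, assuming `httpx` as the HTTP client (the BioC endpoint is the one documented above; the error handling is a sketch):

```python
import httpx


async def get_fulltext(pmid: str) -> str | None:
    """Fetch BioC full text for a PMID; None if the paper is not open access."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.get(url)
    if resp.status_code != 200:
        return None  # not in PMC open access, so no BioC full text
    return resp.text  # BioC JSON document; parse passages/sections as needed
```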

### Phase 3: PMC ID Resolution

```python
async def get_pmc_id(pmid: str) -> str | None:
    """Convert PMID to PMCID for full-text access."""
    url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids={pmid}&format=json"
```

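A hedged completion, again assuming `httpx`; the `records`/`pmcid` response shape matches the idconv mock used in the Phase 16 tests:

```python
import httpx


async def get_pmc_id(pmid: str) -> str | None:
    """Convert a PMID to a PMCID (sketch)."""
    url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.get(url, params={"ids": pmid, "format": "json"})
    resp.raise_for_status()
    records = resp.json().get("records", [])
    # Papers not deposited in PMC have no "pmcid" key
    return records[0].get("pmcid") if records else None
```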

---

## Python Libraries to Consider

| Library | Purpose | Notes |
|---------|---------|-------|
| [Biopython](https://biopython.org/) | `Bio.Entrez` module | Official, well-maintained |
| [PyMed](https://pypi.org/project/pymed/) | PubMed wrapper | Simpler API, less control |
| [metapub](https://pypi.org/project/metapub/) | Full-featured | Tested on 1/3 of PubMed |
| [limits](https://pypi.org/project/limits/) | Rate limiting | Used by reference repo |

---

## API Endpoints Reference

| Endpoint | Purpose | Rate Limit |
|----------|---------|------------|
| `esearch.fcgi` | Search for PMIDs | 3/sec (10 with key) |
| `efetch.fcgi` | Fetch metadata | 3/sec (10 with key) |
| `esummary.fcgi` | Quick metadata | 3/sec (10 with key) |
| `pmcoa.cgi/BioC_json` | Full text (PMC only) | Unknown |
| `idconv/v1.0` | PMID ↔ PMCID | Unknown |

---

## Sources

- [PubMed E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25501/)
- [NCBI BioC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/)
- [Searching PubMed with Python](https://marcobonzanini.com/2015/01/12/searching-pubmed-with-python/)
- [PyMed on PyPI](https://pypi.org/project/pymed/)
docs/brainstorming/02_CLINICALTRIALS_IMPROVEMENTS.md ADDED
@@ -0,0 +1,193 @@
# ClinicalTrials.gov Tool: Current State & Future Improvements

**Status**: Currently Implemented
**Priority**: High (Core Data Source for Drug Repurposing)

---

## Current Implementation

### What We Have (`src/tools/clinicaltrials.py`)

- V2 API search via `clinicaltrials.gov/api/v2/studies`
- Filters: `INTERVENTIONAL` study type, `RECRUITING` status
- Returns: NCT ID, title, conditions, interventions, phase, status
- Query preprocessing via shared `query_utils.py`

### Current Strengths

1. **Good Filtering**: Already filtering for interventional + recruiting
2. **V2 API**: Using the modern API (v1 deprecated)
3. **Phase Info**: Extracting trial phases for drug development context

### Current Limitations

1. **No Outcome Data**: Missing primary/secondary outcomes
2. **No Eligibility Criteria**: Missing inclusion/exclusion details
3. **No Sponsor Info**: Missing who's running the trial
4. **No Result Data**: For completed trials, no efficacy data
5. **Limited Drug Mapping**: No integration with drug databases

---

## API Capabilities We're Not Using

### Fields We Could Request

```python
# Current fields
fields = ["NCTId", "BriefTitle", "Condition", "InterventionName", "Phase", "OverallStatus"]

# Additional valuable fields
additional_fields = [
    "PrimaryOutcomeMeasure",    # What are they measuring?
    "SecondaryOutcomeMeasure",  # Secondary endpoints
    "EligibilityCriteria",      # Who can participate?
    "LeadSponsorName",          # Who's funding?
    "ResultsFirstPostDate",     # Has results?
    "StudyFirstPostDate",       # When started?
    "CompletionDate",           # When finished?
    "EnrollmentCount",          # Sample size
    "InterventionDescription",  # Drug details
    "ArmGroupLabel",            # Treatment arms
    "InterventionOtherName",    # Drug aliases
]
```

### Filter Enhancements

```python
# Current
aggFilters = "studyType:INTERVENTIONAL,status:RECRUITING"

# Could add
"status:RECRUITING,ACTIVE_NOT_RECRUITING,COMPLETED"  # Include completed for results
"phase:PHASE2,PHASE3"                                # Only later-stage trials
"resultsFirstPostDateRange:2020-01-01_"              # Trials with posted results
```

---

## Recommended Improvements

### Phase 1: Richer Metadata

```python
EXTENDED_FIELDS = [
    "NCTId",
    "BriefTitle",
    "OfficialTitle",
    "Condition",
    "InterventionName",
    "InterventionDescription",
    "InterventionOtherName",  # Drug synonyms!
    "Phase",
    "OverallStatus",
    "PrimaryOutcomeMeasure",
    "EnrollmentCount",
    "LeadSponsorName",
    "StudyFirstPostDate",
]
```

### Phase 2: Results Retrieval

For completed trials, we can get actual efficacy data:

```python
async def get_trial_results(nct_id: str) -> dict | None:
    """Fetch results for completed trials."""
    url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
    params = {
        "fields": "ResultsSection",
    }
    # Returns outcome measures and statistics
```

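A sketch of the complete call, assuming `httpx`; note that the `resultsSection` key casing in the response is an assumption about the v2 JSON:

```python
import httpx


async def get_trial_results(nct_id: str) -> dict | None:
    """Fetch the results section for a trial (sketch)."""
    url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.get(url, params={"fields": "ResultsSection"})
    if resp.status_code != 200:
        return None
    # Trials without posted results simply omit the section (key casing assumed)
    return resp.json().get("resultsSection")
```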

### Phase 3: Drug Name Normalization

Map intervention names to standard identifiers:

```python
# Problem: "Metformin", "Metformin HCl", "Glucophage" are the same drug
# Solution: Use RxNorm or DrugBank for normalization

async def normalize_drug_name(intervention: str) -> str:
    """Normalize drug name via RxNorm API."""
    url = f"https://rxnav.nlm.nih.gov/REST/rxcui.json?name={intervention}"
    # Returns standardized RxCUI
```

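A sketch of the lookup, assuming `httpx`; the `idGroup`/`rxnormId` keys follow the RxNorm docs linked under Sources:

```python
import httpx


async def normalize_drug_name(intervention: str) -> str | None:
    """Look up an RxCUI for a drug name (sketch)."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.get(
            "https://rxnav.nlm.nih.gov/REST/rxcui.json",
            params={"name": intervention},
        )
    resp.raise_for_status()
    ids = resp.json().get("idGroup", {}).get("rxnormId", [])
    # "Metformin HCl" and "Glucophage" should both resolve to metformin's RxCUI
    return ids[0] if ids else None
```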

---

## Integration Opportunities

### With PubMed

Cross-reference trials with publications:
```python
# ClinicalTrials.gov provides PMID links
# Can correlate trial results with published papers
```

### With DrugBank/ChEMBL

Map interventions to:
- Mechanism of action
- Known targets
- Adverse effects
- Drug-drug interactions

---

## Python Libraries to Consider

| Library | Purpose | Notes |
|---------|---------|-------|
| [pytrials](https://pypi.org/project/pytrials/) | CT.gov wrapper | V2 API support unclear |
| [clinicaltrials](https://github.com/ebmdatalab/clinicaltrials-act-tracker) | Data tracking | More for analysis |
| [drugbank-downloader](https://pypi.org/project/drugbank-downloader/) | Drug mapping | Requires license |

---

## API Quirks & Gotchas

1. **Rate Limiting**: Undocumented; be conservative
2. **Pagination**: Max 1000 results per request
3. **Field Names**: Case-sensitive, camelCase
4. **Empty Results**: Some fields may be null even if requested
5. **Status Changes**: Trials change status frequently

---

## Example Enhanced Query

```python
async def search_drug_repurposing_trials(
    drug_name: str,
    condition: str,
    include_completed: bool = True,
) -> list[Evidence]:
    """Search for trials repurposing a drug for a new condition."""

    statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"]
    if include_completed:
        statuses.append("COMPLETED")

    params = {
        "query.intr": drug_name,
        "query.cond": condition,
        "filter.overallStatus": ",".join(statuses),
        "filter.studyType": "INTERVENTIONAL",
        "fields": ",".join(EXTENDED_FIELDS),
        "pageSize": 50,
    }
```

---

## Sources

- [ClinicalTrials.gov API Documentation](https://clinicaltrials.gov/data-api/api)
- [CT.gov Field Definitions](https://clinicaltrials.gov/data-api/about-api/study-data-structure)
- [RxNorm API](https://lhncbc.nlm.nih.gov/RxNav/APIs/api-RxNorm.findRxcuiByString.html)
docs/brainstorming/03_EUROPEPMC_IMPROVEMENTS.md ADDED
@@ -0,0 +1,211 @@
# Europe PMC Tool: Current State & Future Improvements

**Status**: Currently Implemented (Replaced bioRxiv)
**Priority**: High (Preprint + Open Access Source)

---

## Why Europe PMC Over bioRxiv?

### bioRxiv API Limitations (Why We Abandoned It)

1. **No Search API**: Only returns papers by date range or DOI
2. **No Query Capability**: Cannot search for "metformin cancer"
3. **Workaround Required**: Would need to download ALL preprints and build local search
4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation

### Europe PMC Advantages

1. **Full Search API**: Boolean queries, filters, facets
2. **Aggregates bioRxiv**: Includes bioRxiv, medRxiv content anyway
3. **Includes PubMed**: Also has MEDLINE content
4. **34 Preprint Servers**: Not just bioRxiv
5. **Open Access Focus**: Full text when available

---

## Current Implementation

### What We Have (`src/tools/europepmc.py`)

- REST API search via `europepmc.org/webservices/rest/search`
- Preprint flagging via `firstPublicationDate` heuristics
- Returns: title, abstract, authors, DOI, source
- Marks preprints for transparency

### Current Limitations

1. **No Full-Text Retrieval**: Only metadata/abstracts
2. **No Citation Network**: Missing references/citations
3. **No Supplementary Files**: Not fetching figures/data
4. **Basic Preprint Detection**: Heuristic, not an explicit flag

---

## Europe PMC API Capabilities

### Endpoints We Could Use

| Endpoint | Purpose | Currently Using |
|----------|---------|-----------------|
| `/search` | Query papers | Yes |
| `/fulltext/{ID}` | Full text (XML/JSON) | No |
| `/{PMCID}/supplementaryFiles` | Figures, data | No |
| `/citations/{ID}` | Who cited this | No |
| `/references/{ID}` | What this cites | No |
| `/annotations` | Text-mined entities | No |

### Rich Query Syntax

```python
# Current simple query
query = "metformin cancer"

# Could use advanced syntax
query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
query += " AND (SRC:PPR)"                                 # Only preprints
query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])"  # Date range
query += " AND (OPEN_ACCESS:y)"                           # Only open access
```

### Source Filters

```python
# Filter by source
"SRC:MED"  # MEDLINE
"SRC:PMC"  # PubMed Central
"SRC:PPR"  # Preprints (bioRxiv, medRxiv, etc.)
"SRC:AGR"  # Agricola
"SRC:CBA"  # Chinese Biological Abstracts
```

---

## Recommended Improvements

### Phase 1: Rich Metadata

```python
# Add to search results
additional_fields = [
    "citedByCount",        # Impact indicator
    "source",              # Explicit source (MED, PMC, PPR)
    "isOpenAccess",        # Boolean flag
    "fullTextUrlList",     # URLs for full text
    "authorAffiliations",  # Institution info
    "grantsList",          # Funding info
]
```

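One way to get those fields is to request the `core` result type from the search endpoint - a sketch, assuming `httpx` (field availability still varies per record):

```python
import httpx


async def search_core(query: str, page_size: int = 25) -> list[dict]:
    """Search Europe PMC with full ('core') metadata (sketch)."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.get(
            "https://www.ebi.ac.uk/europepmc/webservices/rest/search",
            params={
                "query": query,
                "format": "json",
                "resultType": "core",  # 'core' includes citedByCount, fullTextUrlList, etc.
                "pageSize": page_size,
            },
        )
    resp.raise_for_status()
    return resp.json().get("resultList", {}).get("result", [])
```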

### Phase 2: Full-Text Retrieval

```python
async def get_fulltext(pmcid: str) -> str | None:
    """Get full text for open access papers."""
    # XML format
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
    # Or JSON
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextJSON"
```

### Phase 3: Citation Network

```python
async def get_citations(pmcid: str) -> list[str]:
    """Get papers that cite this one."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/citations"

async def get_references(pmcid: str) -> list[str]:
    """Get papers this one cites."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/references"
```

### Phase 4: Text-Mined Annotations

Europe PMC extracts entities automatically:

```python
async def get_annotations(pmcid: str) -> dict:
    """Get text-mined entities (genes, diseases, drugs)."""
    url = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
    params = {
        "articleIds": f"PMC:{pmcid}",
        "type": "Gene_Proteins,Diseases,Chemicals",
        "format": "JSON",
    }
    # Returns structured entity mentions with positions
```

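A hedged completion of that stub, assuming `httpx`; the endpoint and params are the ones above, while the exact response shape (one entry per article, each carrying an `annotations` list) is an assumption to verify against the Annotations API docs:

```python
import httpx


async def fetch_annotations(pmcid: str) -> list[dict]:
    """Fetch text-mined entities for one article (sketch)."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.get(
            "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds",
            params={
                "articleIds": f"PMC:{pmcid}",
                "type": "Gene_Proteins,Diseases,Chemicals",
                "format": "JSON",
            },
        )
    resp.raise_for_status()
    return resp.json()  # assumed: one entry per article with an "annotations" list
```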

---

## Supplementary File Retrieval

From the reference repo (`bioinformatics_tools.py`, lines 123-149):

```python
def get_figures(pmcid: str) -> dict[str, str]:
    """Download figures and supplementary files."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
    # Returns a ZIP with images, delivered base64-encoded
```

---

## Preprint-Specific Features

### Identify Preprint Servers

```python
PREPRINT_SOURCES = {
    "PPR": "General preprints",
    "bioRxiv": "Biology preprints",
    "medRxiv": "Medical preprints",
    "chemRxiv": "Chemistry preprints",
    "Research Square": "Multi-disciplinary",
    "Preprints.org": "MDPI preprints",
}

# Check if published version exists
async def check_published_version(preprint_doi: str) -> str | None:
    """Check if preprint has been peer-reviewed and published."""
    # Europe PMC links preprints to final versions
```

---

## Rate Limiting

Europe PMC is more generous than NCBI:

```python
# No documented hard limit, but be respectful
# Recommend: 10-20 requests/second max
# Use an email in the User-Agent for the polite pool
headers = {
    "User-Agent": "DeepCritical/1.0 (mailto:your@email.com)"
}
```

---

## vs. The Lens & OpenAlex

| Feature | Europe PMC | The Lens | OpenAlex |
|---------|------------|----------|----------|
| Biomedical Focus | Yes | Partial | Partial |
| Preprints | Yes (34 servers) | Yes | Yes |
| Full Text | PMC papers | Links | No |
| Citations | Yes | Yes | Yes |
| Annotations | Yes (text-mined) | No | No |
| Rate Limits | Generous | Moderate | Very generous |
| API Key | Optional | Required | Optional |

---

## Sources

- [Europe PMC REST API](https://europepmc.org/RestfulWebService)
- [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
- [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
- [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
- [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)
docs/brainstorming/04_OPENALEX_INTEGRATION.md ADDED
@@ -0,0 +1,303 @@
# OpenAlex Integration: The Missing Piece?

**Status**: NOT Implemented (Candidate for Addition)
**Priority**: HIGH - Could Replace Multiple Tools
**Reference**: Already implemented in `reference_repos/DeepCritical`

---

## What is OpenAlex?

OpenAlex is a **fully open** index of the global research system:

- **209M+ works** (papers, books, datasets)
- **2B+ author records** (disambiguated)
- **124K+ venues** (journals, repositories)
- **109K+ institutions**
- **65K+ concepts** (hierarchical, linked to Wikidata)

**Free. Open. No API key required.**

---

## Why OpenAlex for DeepCritical?

### Current Architecture

```
User Query
    ↓
┌──────────────────────────────────────┐
│ PubMed   ClinicalTrials   Europe PMC │ ← 3 separate APIs
└──────────────────────────────────────┘
    ↓
Orchestrator (deduplicate, judge, synthesize)
```

### With OpenAlex

```
User Query
    ↓
┌──────────────────────────────────────┐
│              OpenAlex                │ ← Single API
│  (includes PubMed + preprints +      │
│   citations + concepts + authors)    │
└──────────────────────────────────────┘
    ↓
Orchestrator (enrich with CT.gov for trials)
```

**OpenAlex already aggregates**:
- PubMed/MEDLINE
- Crossref
- ORCID
- Unpaywall (open access links)
- Microsoft Academic Graph (legacy)
- Preprint servers

---

## Reference Implementation

From `reference_repos/DeepCritical/DeepResearch/src/tools/openalex_tools.py`:

```python
class OpenAlexFetchTool(ToolRunner):
    def __init__(self):
        super().__init__(
            ToolSpec(
                name="openalex_fetch",
                description="Fetch OpenAlex work or author",
                inputs={"entity": "TEXT", "identifier": "TEXT"},
                outputs={"result": "JSON"},
            )
        )

    def run(self, params: dict[str, Any]) -> ExecutionResult:
        entity = params["entity"]  # "works", "authors", "venues"
        identifier = params["identifier"]
        base = "https://api.openalex.org"
        url = f"{base}/{entity}/{identifier}"
        resp = requests.get(url, timeout=30)
        return ExecutionResult(success=True, data={"result": resp.json()})
```

---

## OpenAlex API Features

### Search Works (Papers)

```python
# Search for metformin + cancer papers
url = "https://api.openalex.org/works"
params = {
    "search": "metformin cancer drug repurposing",
    "filter": "publication_year:>2020,type:article",
    "sort": "cited_by_count:desc",
    "per_page": 50,
}
```

### Rich Filtering

```python
# Filter examples
"publication_year:2023"
"type:article"                           # vs preprint, book, etc.
"is_oa:true"                             # Open access only
"concepts.id:C71924100"                  # Papers about "Medicine"
"authorships.institutions.id:I27837315"  # From Harvard
"cited_by_count:>100"                    # Highly cited
"has_fulltext:true"                      # Full text available
```

### What You Get Back

```json
{
  "id": "W2741809807",
  "title": "Metformin: A candidate drug for...",
  "publication_year": 2023,
  "type": "article",
  "cited_by_count": 45,
  "is_oa": true,
  "primary_location": {
    "source": {"display_name": "Nature Medicine"},
    "pdf_url": "https://...",
    "landing_page_url": "https://..."
  },
  "concepts": [
    {"id": "C71924100", "display_name": "Medicine", "score": 0.95},
    {"id": "C54355233", "display_name": "Pharmacology", "score": 0.88}
  ],
  "authorships": [
    {
      "author": {"id": "A123", "display_name": "John Smith"},
      "institutions": [{"display_name": "Harvard Medical School"}]
    }
  ],
  "referenced_works": ["W123", "W456"],
  "related_works": ["W789", "W012"]
}
```

(`referenced_works` holds outgoing citations; `related_works` holds similar papers.)

---

## Key Advantages Over Current Tools

### 1. Citation Network (We Don't Have This!)

```python
# Get papers that cite a work
url = f"https://api.openalex.org/works?filter=cites:{work_id}"

# Get papers cited by a work
# Already in the `referenced_works` field
```

### 2. Concept Tagging (We Don't Have This!)

OpenAlex auto-tags papers with hierarchical concepts:
- "Medicine" → "Pharmacology" → "Drug Repurposing"
- Can search by concept, not just keywords (see the sketch below)

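For example, a concept-filtered query might look like this (a sketch; `C71924100` is the "Medicine" concept ID from the sample response above, and the `mailto` address is a placeholder):

```python
import httpx

resp = httpx.get(
    "https://api.openalex.org/works",
    params={
        "filter": "concepts.id:C71924100,publication_year:>2020",
        "sort": "cited_by_count:desc",
        "per_page": 10,
        "mailto": "you@example.com",  # polite pool
    },
    timeout=30.0,
)
top_titles = [w["title"] for w in resp.json()["results"]]
```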
166
+ ### 3. Author Disambiguation (We Don't Have This!)
167
+
168
+ ```python
169
+ # Find all works by an author
170
+ url = f"https://api.openalex.org/works?filter=authorships.author.id:{author_id}"
171
+ ```
172
+
173
+ ### 4. Institution Tracking
174
+
175
+ ```python
176
+ # Find drug repurposing papers from top institutions
177
+ url = "https://api.openalex.org/works"
178
+ params = {
179
+ "search": "drug repurposing",
180
+ "filter": "authorships.institutions.id:I27837315", # Harvard
181
+ }
182
+ ```
183
+
184
+ ### 5. Related Works
185
+
186
+ Each paper comes with `related_works` - semantically similar papers discovered by OpenAlex's ML.
187
+
188
+ ---
189
+
190
+ ## Proposed Implementation
191
+
192
+ ### New Tool: `src/tools/openalex.py`
193
+
194
+ ```python
195
+ """OpenAlex search tool for comprehensive scholarly data."""
196
+
197
+ import httpx
198
+ from src.tools.base import SearchTool
199
+ from src.utils.models import Evidence
200
+
201
+ class OpenAlexTool(SearchTool):
202
+ """Search OpenAlex for scholarly works with rich metadata."""
203
+
204
+ name = "openalex"
205
+
206
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
207
+ async with httpx.AsyncClient() as client:
208
+ resp = await client.get(
209
+ "https://api.openalex.org/works",
210
+ params={
211
+ "search": query,
212
+ "filter": "type:article,is_oa:true",
213
+ "sort": "cited_by_count:desc",
214
+ "per_page": max_results,
215
+ "mailto": "deepcritical@example.com", # Polite pool
216
+ },
217
+ )
218
+ data = resp.json()
219
+
220
+ return [
221
+ Evidence(
222
+ source="openalex",
223
+ title=work["title"],
224
+ abstract=work.get("abstract", ""),
225
+ url=work["primary_location"]["landing_page_url"],
226
+ metadata={
227
+ "cited_by_count": work["cited_by_count"],
228
+ "concepts": [c["display_name"] for c in work["concepts"][:5]],
229
+ "is_open_access": work["is_oa"],
230
+ "pdf_url": work["primary_location"].get("pdf_url"),
231
+ },
232
+ )
233
+ for work in data["results"]
234
+ ]
235
+ ```
236
+
237
+ ---
238
+
239
+ ## Rate Limits
240
+
241
+ OpenAlex is **extremely generous**:
242
+
243
+ - No hard rate limit documented
244
+ - Recommended: <100,000 requests/day
245
+ - **Polite pool**: Add `mailto=your@email.com` param for faster responses
246
+ - No API key required (optional for priority support)
247
+
248
+ ---
249
+
250
+ ## Should We Add OpenAlex?
251
+
252
+ ### Arguments FOR
253
+
254
+ 1. **Already in reference repo** - proven pattern
255
+ 2. **Richer data** - citations, concepts, authors
256
+ 3. **Single source** - reduces API complexity
257
+ 4. **Free & open** - no keys, no limits
258
+ 5. **Institution adoption** - Leiden, Sorbonne switched to it
259
+
260
+ ### Arguments AGAINST
261
+
262
+ 1. **Adds complexity** - another data source
263
+ 2. **Overlap** - duplicates some PubMed data
264
+ 3. **Not biomedical-focused** - covers all disciplines
265
+ 4. **No full text** - still need PMC/Europe PMC for that
266
+
267
+ ### Recommendation
268
+
269
+ **Add OpenAlex as a 4th source**, don't replace existing tools.
270
+
271
+ Use it for:
272
+ - Citation network analysis
273
+ - Concept-based discovery
274
+ - High-impact paper finding
275
+ - Author/institution tracking
276
+
277
+ Keep PubMed, ClinicalTrials, Europe PMC for:
278
+ - Authoritative biomedical search
279
+ - Clinical trial data
280
+ - Full-text access
281
+ - Preprint tracking
282
+
283
+ ---
284
+
285
+ ## Implementation Priority
286
+
287
+ | Task | Effort | Value |
288
+ |------|--------|-------|
289
+ | Basic search | Low | High |
290
+ | Citation network | Medium | Very High |
291
+ | Concept filtering | Low | High |
292
+ | Related works | Low | High |
293
+ | Author tracking | Medium | Medium |
294
+
295
+ ---
296
+
297
+ ## Sources
298
+
299
+ - [OpenAlex Documentation](https://docs.openalex.org)
300
+ - [OpenAlex API Overview](https://docs.openalex.org/api)
301
+ - [OpenAlex Wikipedia](https://en.wikipedia.org/wiki/OpenAlex)
302
+ - [Leiden University Announcement](https://www.leidenranking.com/information/openalex)
303
+ - [OpenAlex: A fully-open index (Paper)](https://arxiv.org/abs/2205.01833)
docs/brainstorming/implementation/15_PHASE_OPENALEX.md ADDED
@@ -0,0 +1,603 @@
# Phase 15: OpenAlex Integration

**Priority**: HIGH - Biggest bang for the buck
**Effort**: ~2-3 hours
**Dependencies**: None (existing codebase patterns sufficient)

---

## Prerequisites (COMPLETED)

The following model changes have been implemented to support this integration:

1. **`SourceName` Literal Updated** (`src/utils/models.py:9`)
   ```python
   SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
   ```
   - Without this, `source="openalex"` would fail Pydantic validation

2. **`Evidence.metadata` Field Added** (`src/utils/models.py:39-42`)
   ```python
   metadata: dict[str, Any] = Field(
       default_factory=dict,
       description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
   )
   ```
   - Required for storing `cited_by_count`, `concepts`, etc.
   - The model is still frozen - metadata must be passed at construction time

3. **`__init__.py` Exports Updated** (`src/tools/__init__.py`)
   - All tools are now exported: `ClinicalTrialsTool`, `EuropePMCTool`, `PubMedTool`
   - `OpenAlexTool` should be added here after implementation

---

## Overview

Add OpenAlex as a 4th data source for comprehensive scholarly data including:
- Citation networks (who cites whom)
- Concept tagging (hierarchical topic classification)
- Author disambiguation
- 209M+ works indexed

**Why OpenAlex?**
- Free, no API key required
- Already implemented in reference repo
- Provides citation data we don't have
- Aggregates PubMed + preprints + more

---

## TDD Implementation Plan

### Step 1: Write the Tests First

**File**: `tests/unit/tools/test_openalex.py`

```python
"""Tests for OpenAlex search tool."""

import pytest
import respx
from httpx import Response

from src.tools.openalex import OpenAlexTool
from src.utils.models import Evidence


class TestOpenAlexTool:
    """Test suite for OpenAlex search functionality."""

    @pytest.fixture
    def tool(self) -> OpenAlexTool:
        return OpenAlexTool()

    def test_name_property(self, tool: OpenAlexTool) -> None:
        """Tool should identify itself as 'openalex'."""
        assert tool.name == "openalex"

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_returns_evidence(self, tool: OpenAlexTool) -> None:
        """Search should return list of Evidence objects."""
        mock_response = {
            "results": [
                {
                    "id": "W2741809807",
                    "title": "Metformin and cancer: A systematic review",
                    "publication_year": 2023,
                    "cited_by_count": 45,
                    "type": "article",
                    "is_oa": True,
                    "primary_location": {
                        "source": {"display_name": "Nature Medicine"},
                        "landing_page_url": "https://doi.org/10.1038/example",
                        "pdf_url": None,
                    },
                    "abstract_inverted_index": {
                        "Metformin": [0],
                        "shows": [1],
                        "anticancer": [2],
                        "effects": [3],
                    },
                    "concepts": [
                        {"display_name": "Medicine", "score": 0.95},
                        {"display_name": "Oncology", "score": 0.88},
                    ],
                    "authorships": [
                        {
                            "author": {"display_name": "John Smith"},
                            "institutions": [{"display_name": "Harvard"}],
                        }
                    ],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("metformin cancer", max_results=10)

        assert len(results) == 1
        assert isinstance(results[0], Evidence)
        assert "Metformin and cancer" in results[0].citation.title
        assert results[0].citation.source == "openalex"

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_empty_results(self, tool: OpenAlexTool) -> None:
        """Search with no results should return empty list."""
        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json={"results": []})
        )

        results = await tool.search("xyznonexistentquery123")
        assert results == []

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_handles_missing_abstract(self, tool: OpenAlexTool) -> None:
        """Tool should handle papers without abstracts."""
        mock_response = {
            "results": [
                {
                    "id": "W123",
                    "title": "Paper without abstract",
                    "publication_year": 2023,
                    "cited_by_count": 10,
                    "type": "article",
                    "is_oa": False,
                    "primary_location": {
                        "source": {"display_name": "Journal"},
                        "landing_page_url": "https://example.com",
                    },
                    "abstract_inverted_index": None,
                    "concepts": [],
                    "authorships": [],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("test query")
        assert len(results) == 1
        assert results[0].content == ""  # No abstract

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_extracts_citation_count(self, tool: OpenAlexTool) -> None:
        """Citation count should be in metadata."""
        mock_response = {
            "results": [
                {
                    "id": "W456",
                    "title": "Highly cited paper",
                    "publication_year": 2020,
                    "cited_by_count": 500,
                    "type": "article",
                    "is_oa": True,
                    "primary_location": {
                        "source": {"display_name": "Science"},
                        "landing_page_url": "https://example.com",
                    },
                    "abstract_inverted_index": {"Test": [0]},
                    "concepts": [],
                    "authorships": [],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("highly cited")
        assert results[0].metadata["cited_by_count"] == 500

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_extracts_concepts(self, tool: OpenAlexTool) -> None:
        """Concepts should be extracted for semantic discovery."""
        mock_response = {
            "results": [
                {
                    "id": "W789",
                    "title": "Drug repurposing study",
                    "publication_year": 2023,
                    "cited_by_count": 25,
                    "type": "article",
                    "is_oa": True,
                    "primary_location": {
                        "source": {"display_name": "PLOS ONE"},
                        "landing_page_url": "https://example.com",
                    },
                    "abstract_inverted_index": {"Drug": [0], "repurposing": [1]},
                    "concepts": [
                        {"display_name": "Pharmacology", "score": 0.92},
                        {"display_name": "Drug Discovery", "score": 0.85},
                        {"display_name": "Medicine", "score": 0.80},
                    ],
                    "authorships": [],
                }
            ]
        }

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(200, json=mock_response)
        )

        results = await tool.search("drug repurposing")
        assert "Pharmacology" in results[0].metadata["concepts"]
        assert "Drug Discovery" in results[0].metadata["concepts"]

    @respx.mock
    @pytest.mark.asyncio
    async def test_search_api_error_raises_search_error(
        self, tool: OpenAlexTool
    ) -> None:
        """API errors should raise SearchError."""
        from src.utils.exceptions import SearchError

        respx.get("https://api.openalex.org/works").mock(
            return_value=Response(500, text="Internal Server Error")
        )

        with pytest.raises(SearchError):
            await tool.search("test query")

    def test_reconstruct_abstract(self, tool: OpenAlexTool) -> None:
        """Test abstract reconstruction from inverted index."""
        inverted_index = {
            "Metformin": [0, 5],
            "is": [1],
            "a": [2],
            "diabetes": [3],
            "drug": [4],
            "effective": [6],
        }
        abstract = tool._reconstruct_abstract(inverted_index)
        assert abstract == "Metformin is a diabetes drug Metformin effective"
```

---

### Step 2: Create the Implementation

**File**: `src/tools/openalex.py`

```python
"""OpenAlex search tool for comprehensive scholarly data."""

from typing import Any

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class OpenAlexTool:
    """
    Search OpenAlex for scholarly works with rich metadata.

    OpenAlex provides:
    - 209M+ scholarly works
    - Citation counts and networks
    - Concept tagging (hierarchical)
    - Author disambiguation
    - Open access links

    API Docs: https://docs.openalex.org/
    """

    BASE_URL = "https://api.openalex.org/works"

    def __init__(self, email: str | None = None) -> None:
        """
        Initialize OpenAlex tool.

        Args:
            email: Optional email for polite pool (faster responses)
        """
        self.email = email or "deepcritical@example.com"

    @property
    def name(self) -> str:
        return "openalex"

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,
    )
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """
        Search OpenAlex for scholarly works.

        Args:
            query: Search terms
            max_results: Maximum results to return (max 200 per request)

        Returns:
            List of Evidence objects with citation metadata

        Raises:
            SearchError: If API request fails
        """
        params = {
            "search": query,
            "filter": "type:article",       # Only peer-reviewed articles
            "sort": "cited_by_count:desc",  # Most cited first
            "per_page": min(max_results, 200),
            "mailto": self.email,           # Polite pool for faster responses
        }

        async with httpx.AsyncClient(timeout=30.0) as client:
            try:
                response = await client.get(self.BASE_URL, params=params)
                response.raise_for_status()

                data = response.json()
                results = data.get("results", [])

                return [self._to_evidence(work) for work in results[:max_results]]

            except httpx.HTTPStatusError as e:
                raise SearchError(f"OpenAlex API error: {e}") from e
            except httpx.RequestError as e:
                raise SearchError(f"OpenAlex connection failed: {e}") from e

    def _to_evidence(self, work: dict[str, Any]) -> Evidence:
        """Convert OpenAlex work to Evidence object."""
        title = work.get("title", "Untitled")
        pub_year = work.get("publication_year", "Unknown")
        cited_by = work.get("cited_by_count", 0)
        is_oa = work.get("is_oa", False)

        # Reconstruct abstract from inverted index
        abstract_index = work.get("abstract_inverted_index")
        abstract = self._reconstruct_abstract(abstract_index) if abstract_index else ""

        # Extract concepts (top 5)
        concepts = [
            c.get("display_name", "")
            for c in work.get("concepts", [])[:5]
            if c.get("display_name")
        ]

        # Extract authors (top 5)
        authorships = work.get("authorships", [])
        authors = [
            a.get("author", {}).get("display_name", "")
            for a in authorships[:5]
            if a.get("author", {}).get("display_name")
        ]

        # Get URL
        primary_loc = work.get("primary_location") or {}
        url = primary_loc.get("landing_page_url", "")
        if not url:
            # Fallback to OpenAlex page
            work_id = work.get("id", "").replace("https://openalex.org/", "")
            url = f"https://openalex.org/{work_id}"

        return Evidence(
            content=abstract[:2000],
            citation=Citation(
                source="openalex",
                title=title[:500],
                url=url,
                date=str(pub_year),
                authors=authors,
            ),
            relevance=min(0.9, 0.5 + (cited_by / 1000)),  # Boost by citations
            metadata={
                "cited_by_count": cited_by,
                "is_open_access": is_oa,
                "concepts": concepts,
                "pdf_url": primary_loc.get("pdf_url"),
            },
        )

    def _reconstruct_abstract(
        self, inverted_index: dict[str, list[int]]
    ) -> str:
        """
        Reconstruct abstract from OpenAlex inverted index format.

        OpenAlex stores abstracts as {"word": [position1, position2, ...]}.
        This rebuilds the original text.
        """
        if not inverted_index:
            return ""

        # Build position -> word mapping
        position_word: dict[int, str] = {}
        for word, positions in inverted_index.items():
            for pos in positions:
                position_word[pos] = word

        # Reconstruct in order
        if not position_word:
            return ""

        max_pos = max(position_word.keys())
        words = [position_word.get(i, "") for i in range(max_pos + 1)]
        return " ".join(w for w in words if w)
```

---

### Step 3: Register in Search Handler

**File**: `src/tools/search_handler.py` (add to imports and tool list)

```python
# Add import
from src.tools.openalex import OpenAlexTool

# Add to _create_tools method
def _create_tools(self) -> list[SearchTool]:
    return [
        PubMedTool(),
        ClinicalTrialsTool(),
        EuropePMCTool(),
        OpenAlexTool(),  # NEW
    ]
```

---

### Step 4: Update `__init__.py`

**File**: `src/tools/__init__.py`

```python
from src.tools.openalex import OpenAlexTool

__all__ = [
    "PubMedTool",
    "ClinicalTrialsTool",
    "EuropePMCTool",
    "OpenAlexTool",  # NEW
    # ...
]
```

---

## Demo Script

**File**: `examples/openalex_demo.py`

```python
#!/usr/bin/env python3
"""Demo script to verify OpenAlex integration."""

import asyncio

from src.tools.openalex import OpenAlexTool


async def main():
    """Run OpenAlex search demo."""
    tool = OpenAlexTool()

    print("=" * 60)
    print("OpenAlex Integration Demo")
    print("=" * 60)

    # Test 1: Basic drug repurposing search
    print("\n[Test 1] Searching for 'metformin cancer drug repurposing'...")
    results = await tool.search("metformin cancer drug repurposing", max_results=5)

    for i, evidence in enumerate(results, 1):
        print(f"\n--- Result {i} ---")
        print(f"Title: {evidence.citation.title}")
        print(f"Year: {evidence.citation.date}")
        print(f"Citations: {evidence.metadata.get('cited_by_count', 'N/A')}")
        print(f"Concepts: {', '.join(evidence.metadata.get('concepts', []))}")
        print(f"Open Access: {evidence.metadata.get('is_open_access', False)}")
        print(f"URL: {evidence.citation.url}")
        if evidence.content:
            print(f"Abstract: {evidence.content[:200]}...")

    # Test 2: High-impact papers
    print("\n" + "=" * 60)
    print("[Test 2] Finding highly-cited papers on 'long COVID treatment'...")
    results = await tool.search("long COVID treatment", max_results=3)

    for evidence in results:
        print(f"\n- {evidence.citation.title}")
        print(f"  Citations: {evidence.metadata.get('cited_by_count', 0)}")

    print("\n" + "=" * 60)
    print("Demo complete!")


if __name__ == "__main__":
    asyncio.run(main())
```

---

## Verification Checklist

### Unit Tests
```bash
# Run just the OpenAlex tests
uv run pytest tests/unit/tools/test_openalex.py -v

# Expected: all tests pass
```

### Integration Test (Manual)
```bash
# Run the demo script against the real API
uv run python examples/openalex_demo.py

# Expected: real results from the OpenAlex API
```

### Full Test Suite
```bash
# Ensure nothing broke
make check

# Expected: all 110+ tests pass, mypy clean
```

---

## Success Criteria

1. **Unit tests pass**: All mocked tests in `test_openalex.py` pass
2. **Integration works**: Demo script returns real results
3. **No regressions**: `make check` passes completely
4. **SearchHandler integration**: OpenAlex appears in search results alongside other sources
5. **Citation metadata**: Results include `cited_by_count`, `concepts`, `is_open_access`

---

## Future Enhancements (P2)

Once basic integration works:

1. **Citation Network Queries**
   ```python
   # Get papers citing a specific work
   async def get_citing_works(self, work_id: str) -> list[Evidence]:
       params = {"filter": f"cites:{work_id}"}
       ...
   ```

2. **Concept-Based Search**
   ```python
   # Search by OpenAlex concept ID
   async def search_by_concept(self, concept_id: str) -> list[Evidence]:
       params = {"filter": f"concepts.id:{concept_id}"}
       ...
   ```

3. **Author Tracking**
   ```python
   # Find all works by an author
   async def search_by_author(self, author_id: str) -> list[Evidence]:
       params = {"filter": f"authorships.author.id:{author_id}"}
       ...
   ```

---

## Notes

- OpenAlex is **very generous** with rate limits (no documented hard limit)
- Adding the `mailto` parameter gives priority access (polite pool)
- Abstracts are stored as an inverted index - they must be reconstructed
- Citation count is a good proxy for paper quality/impact
- Consider caching responses for repeated queries
docs/brainstorming/implementation/16_PHASE_PUBMED_FULLTEXT.md ADDED
@@ -0,0 +1,586 @@
1
+ # Phase 16: PubMed Full-Text Retrieval
2
+
3
+ **Priority**: MEDIUM - Enhances evidence quality
4
+ **Effort**: ~3 hours
5
+ **Dependencies**: None (existing PubMed tool sufficient)
6
+
7
+ ---
8
+
9
+ ## Prerequisites (COMPLETED)
10
+
11
+ The `Evidence.metadata` field has been added to `src/utils/models.py` to support:
12
+ ```python
13
+ metadata={"has_fulltext": True}
14
+ ```
15
+
16
+ ---
17
+
18
+ ## Architecture Decision: Constructor Parameter vs Method Parameter
19
+
20
+ **IMPORTANT**: The original spec proposed `include_fulltext` as a method parameter:
21
+ ```python
22
+ # WRONG - SearchHandler won't pass this parameter
23
+ async def search(self, query: str, max_results: int = 10, include_fulltext: bool = False):
24
+ ```
25
+
26
+ **Problem**: `SearchHandler` calls `tool.search(query, max_results)` uniformly across all tools.
27
+ It has no mechanism to pass tool-specific parameters like `include_fulltext`.
28
+
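+ For context, the handler's fan-out looks roughly like this (a sketch of the
+ call pattern, not the exact `SearchHandler` code):
+
+ ```python
+ # Every tool receives the same two arguments - nowhere to thread extra kwargs
+ results = await asyncio.gather(
+     *(tool.search(query, max_results) for tool in self.tools)
+ )
+ ```
+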
29
+ **Solution**: Use constructor parameter instead:
30
+ ```python
31
+ # CORRECT - Configured at instantiation time
32
+ class PubMedTool:
33
+ def __init__(self, api_key: str | None = None, include_fulltext: bool = False):
34
+ self.include_fulltext = include_fulltext
35
+ ...
36
+ ```
37
+
38
+ This way, you can create a full-text-enabled PubMed tool:
39
+ ```python
40
+ # In orchestrator or wherever tools are created
41
+ tools = [
42
+ PubMedTool(include_fulltext=True), # Full-text enabled
43
+ ClinicalTrialsTool(),
44
+ EuropePMCTool(),
45
+ ]
46
+ ```
47
+
48
+ ---
49
+
50
+ ## Overview
51
+
52
+ Add full-text retrieval for PubMed papers via the BioC API, enabling:
53
+ - Complete paper text for open-access PMC papers
54
+ - Structured sections (intro, methods, results, discussion)
55
+ - Better evidence for LLM synthesis
56
+
57
+ **Why Full-Text?**
58
+ - Abstracts only give ~200-300 words
59
+ - Full text provides detailed methods, results, figures
60
+ - Reference repo already has this implemented
61
+ - Makes LLM judgments more accurate
62
+
63
+ ---
64
+
65
+ ## TDD Implementation Plan
66
+
67
+ ### Step 1: Write the Tests First
68
+
69
+ **File**: `tests/unit/tools/test_pubmed_fulltext.py`
70
+
71
+ ```python
72
+ """Tests for PubMed full-text retrieval."""
73
+
74
+ import pytest
75
+ import respx
76
+ from httpx import Response
77
+
78
+ from src.tools.pubmed import PubMedTool
79
+
80
+
81
+ class TestPubMedFullText:
82
+ """Test suite for PubMed full-text functionality."""
83
+
84
+ @pytest.fixture
85
+ def tool(self) -> PubMedTool:
86
+ return PubMedTool()
87
+
88
+ @respx.mock
89
+ @pytest.mark.asyncio
90
+ async def test_get_pmc_id_success(self, tool: PubMedTool) -> None:
91
+ """Should convert PMID to PMCID for full-text access."""
92
+ mock_response = {
93
+ "records": [
94
+ {
95
+ "pmid": "12345678",
96
+ "pmcid": "PMC1234567",
97
+ }
98
+ ]
99
+ }
100
+
101
+ respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
102
+ return_value=Response(200, json=mock_response)
103
+ )
104
+
105
+ pmcid = await tool.get_pmc_id("12345678")
106
+ assert pmcid == "PMC1234567"
107
+
108
+ @respx.mock
109
+ @pytest.mark.asyncio
110
+ async def test_get_pmc_id_not_in_pmc(self, tool: PubMedTool) -> None:
111
+ """Should return None if paper not in PMC."""
112
+ mock_response = {
113
+ "records": [
114
+ {
115
+ "pmid": "12345678",
116
+ # No pmcid means not in PMC
117
+ }
118
+ ]
119
+ }
120
+
121
+ respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
122
+ return_value=Response(200, json=mock_response)
123
+ )
124
+
125
+ pmcid = await tool.get_pmc_id("12345678")
126
+ assert pmcid is None
127
+
128
+ @respx.mock
129
+ @pytest.mark.asyncio
130
+ async def test_get_fulltext_success(self, tool: PubMedTool) -> None:
131
+ """Should retrieve full text for PMC papers."""
132
+ # Mock BioC API response
133
+ mock_bioc = {
134
+ "documents": [
135
+ {
136
+ "passages": [
137
+ {
138
+ "infons": {"section_type": "INTRO"},
139
+ "text": "Introduction text here.",
140
+ },
141
+ {
142
+ "infons": {"section_type": "METHODS"},
143
+ "text": "Methods description here.",
144
+ },
145
+ {
146
+ "infons": {"section_type": "RESULTS"},
147
+ "text": "Results summary here.",
148
+ },
149
+ {
150
+ "infons": {"section_type": "DISCUSS"},
151
+ "text": "Discussion and conclusions.",
152
+ },
153
+ ]
154
+ }
155
+ ]
156
+ }
157
+
158
+ respx.get(
159
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
160
+ ).mock(return_value=Response(200, json=mock_bioc))
161
+
162
+ fulltext = await tool.get_fulltext("12345678")
163
+
164
+ assert fulltext is not None
165
+ assert "Introduction text here" in fulltext
166
+ assert "Methods description here" in fulltext
167
+ assert "Results summary here" in fulltext
168
+
169
+ @respx.mock
170
+ @pytest.mark.asyncio
171
+ async def test_get_fulltext_not_available(self, tool: PubMedTool) -> None:
172
+ """Should return None if full text not available."""
173
+ respx.get(
174
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/99999999/unicode"
175
+ ).mock(return_value=Response(404))
176
+
177
+ fulltext = await tool.get_fulltext("99999999")
178
+ assert fulltext is None
179
+
180
+ @respx.mock
181
+ @pytest.mark.asyncio
182
+ async def test_get_fulltext_structured(self, tool: PubMedTool) -> None:
183
+ """Should return structured sections dict."""
184
+ mock_bioc = {
185
+ "documents": [
186
+ {
187
+ "passages": [
188
+ {"infons": {"section_type": "INTRO"}, "text": "Intro..."},
189
+ {"infons": {"section_type": "METHODS"}, "text": "Methods..."},
190
+ {"infons": {"section_type": "RESULTS"}, "text": "Results..."},
191
+ {"infons": {"section_type": "DISCUSS"}, "text": "Discussion..."},
192
+ ]
193
+ }
194
+ ]
195
+ }
196
+
197
+ respx.get(
198
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
199
+ ).mock(return_value=Response(200, json=mock_bioc))
200
+
201
+ sections = await tool.get_fulltext_structured("12345678")
202
+
203
+ assert sections is not None
204
+ assert "introduction" in sections
205
+ assert "methods" in sections
206
+ assert "results" in sections
207
+ assert "discussion" in sections
208
+
209
+ @respx.mock
210
+ @pytest.mark.asyncio
211
+ async def test_search_with_fulltext_enabled(self) -> None:
212
+ """Search should include full text when tool is configured for it."""
213
+ # Create tool WITH full-text enabled via constructor
214
+ tool = PubMedTool(include_fulltext=True)
215
+
216
+ # Mock esearch
217
+ respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi").mock(
218
+ return_value=Response(
219
+ 200, json={"esearchresult": {"idlist": ["12345678"]}}
220
+ )
221
+ )
222
+
223
+ # Mock efetch (abstract)
224
+ mock_xml = """
225
+ <PubmedArticleSet>
226
+ <PubmedArticle>
227
+ <MedlineCitation>
228
+ <PMID>12345678</PMID>
229
+ <Article>
230
+ <ArticleTitle>Test Paper</ArticleTitle>
231
+ <Abstract><AbstractText>Short abstract.</AbstractText></Abstract>
232
+ <AuthorList><Author><LastName>Smith</LastName></Author></AuthorList>
233
+ </Article>
234
+ </MedlineCitation>
235
+ </PubmedArticle>
236
+ </PubmedArticleSet>
237
+ """
238
+ respx.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi").mock(
239
+ return_value=Response(200, text=mock_xml)
240
+ )
241
+
242
+ # Mock ID converter
243
+ respx.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/").mock(
244
+ return_value=Response(
245
+ 200, json={"records": [{"pmid": "12345678", "pmcid": "PMC1234567"}]}
246
+ )
247
+ )
248
+
249
+ # Mock BioC full text
250
+ mock_bioc = {
251
+ "documents": [
252
+ {
253
+ "passages": [
254
+ {"infons": {"section_type": "INTRO"}, "text": "Full intro..."},
255
+ ]
256
+ }
257
+ ]
258
+ }
259
+ respx.get(
260
+ "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/12345678/unicode"
261
+ ).mock(return_value=Response(200, json=mock_bioc))
262
+
263
+ # NOTE: No include_fulltext param - it's set via constructor
264
+ results = await tool.search("test", max_results=1)
265
+
266
+ assert len(results) == 1
267
+ # Full text should be appended or replace abstract
268
+ assert "Full intro" in results[0].content or "Short abstract" in results[0].content
269
+ ```
270
+
271
+ ---
272
+
273
+ ### Step 2: Implement Full-Text Methods
274
+
275
+ **File**: `src/tools/pubmed.py` (additions to existing class)
276
+
277
+ ```python
278
+ # Add these methods to PubMedTool class
279
+
280
+ async def get_pmc_id(self, pmid: str) -> str | None:
281
+ """
282
+ Convert PMID to PMCID for full-text access.
283
+
284
+ Args:
285
+ pmid: PubMed ID
286
+
287
+ Returns:
288
+ PMCID if paper is in PMC, None otherwise
289
+ """
290
+ url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
291
+ params = {"ids": pmid, "format": "json"}
292
+
293
+ async with httpx.AsyncClient(timeout=30.0) as client:
294
+ try:
295
+ response = await client.get(url, params=params)
296
+ response.raise_for_status()
297
+ data = response.json()
298
+
299
+ records = data.get("records", [])
300
+ if records and records[0].get("pmcid"):
301
+ return records[0]["pmcid"]
302
+ return None
303
+
304
+ except httpx.HTTPError:
305
+ return None
306
+
307
+
308
+ async def get_fulltext(self, pmid: str) -> str | None:
309
+ """
310
+ Get full text for a PubMed paper via BioC API.
311
+
312
+ Only works for open-access papers in PubMed Central.
313
+
314
+ Args:
315
+ pmid: PubMed ID
316
+
317
+ Returns:
318
+ Full text as string, or None if not available
319
+ """
320
+ url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
321
+
322
+ async with httpx.AsyncClient(timeout=60.0) as client:
323
+ try:
324
+ response = await client.get(url)
325
+ if response.status_code == 404:
326
+ return None
327
+ response.raise_for_status()
328
+ data = response.json()
329
+
330
+ # Extract text from all passages
331
+ documents = data.get("documents", [])
332
+ if not documents:
333
+ return None
334
+
335
+ passages = documents[0].get("passages", [])
336
+ text_parts = [p.get("text", "") for p in passages if p.get("text")]
337
+
338
+ return "\n\n".join(text_parts) if text_parts else None
339
+
340
+ except httpx.HTTPError:
341
+ return None
342
+
343
+
344
+ async def get_fulltext_structured(self, pmid: str) -> dict[str, str] | None:
345
+ """
346
+ Get structured full text with sections.
347
+
348
+ Args:
349
+ pmid: PubMed ID
350
+
351
+ Returns:
352
+ Dict mapping section names to text, or None if not available
353
+ """
354
+ url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
355
+
356
+ async with httpx.AsyncClient(timeout=60.0) as client:
357
+ try:
358
+ response = await client.get(url)
359
+ if response.status_code == 404:
360
+ return None
361
+ response.raise_for_status()
362
+ data = response.json()
363
+
364
+ documents = data.get("documents", [])
365
+ if not documents:
366
+ return None
367
+
368
+ # Map section types to readable names
369
+ section_map = {
370
+ "INTRO": "introduction",
371
+ "METHODS": "methods",
372
+ "RESULTS": "results",
373
+ "DISCUSS": "discussion",
374
+ "CONCL": "conclusion",
375
+ "ABSTRACT": "abstract",
376
+ }
377
+
378
+ sections: dict[str, list[str]] = {}
379
+ for passage in documents[0].get("passages", []):
380
+ section_type = passage.get("infons", {}).get("section_type", "other")
381
+ section_name = section_map.get(section_type, "other")
382
+ text = passage.get("text", "")
383
+
384
+ if text:
385
+ if section_name not in sections:
386
+ sections[section_name] = []
387
+ sections[section_name].append(text)
388
+
389
+ # Join multiple passages per section
390
+ return {k: "\n\n".join(v) for k, v in sections.items()}
391
+
392
+ except httpx.HTTPError:
393
+ return None
394
+ ```
395
+
396
+ ---
397
+
398
+ ### Step 3: Update Constructor and Search Method
399
+
400
+ Add full-text flag to constructor and update search to use it:
401
+
402
+ ```python
403
+ class PubMedTool:
404
+ """Search tool for PubMed/NCBI."""
405
+
406
+ def __init__(
407
+ self,
408
+ api_key: str | None = None,
409
+ include_fulltext: bool = False, # NEW CONSTRUCTOR PARAM
410
+ ) -> None:
411
+ self.api_key = api_key or settings.ncbi_api_key
412
+ if self.api_key == "your-ncbi-key-here":
413
+ self.api_key = None
414
+ self._last_request_time = 0.0
415
+ self.include_fulltext = include_fulltext # Store for use in search()
416
+
417
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
418
+ """
419
+ Search PubMed and return evidence.
420
+
421
+ Note: Full-text enrichment is controlled by constructor parameter,
422
+ not method parameter, because SearchHandler doesn't pass extra args.
423
+ """
424
+ # ... existing search logic ...
425
+
426
+ evidence_list = self._parse_pubmed_xml(fetch_resp.text)
427
+
428
+ # Optionally enrich with full text (if configured at construction)
429
+ if self.include_fulltext:
430
+ evidence_list = await self._enrich_with_fulltext(evidence_list)
431
+
432
+ return evidence_list
433
+
434
+
435
+ async def _enrich_with_fulltext(
436
+ self, evidence_list: list[Evidence]
437
+ ) -> list[Evidence]:
438
+ """Attempt to add full text to evidence items."""
439
+ enriched = []
440
+
441
+ for evidence in evidence_list:
442
+ # Extract PMID from URL
443
+ url = evidence.citation.url
444
+ pmid = url.rstrip("/").split("/")[-1] if url else None
445
+
446
+ if pmid:
447
+ fulltext = await self.get_fulltext(pmid)
448
+ if fulltext:
449
+ # Replace abstract with full text (truncated)
450
+ evidence = Evidence(
451
+ content=fulltext[:8000], # Larger limit for full text
452
+ citation=evidence.citation,
453
+ relevance=evidence.relevance,
454
+ metadata={
455
+ **evidence.metadata,
456
+ "has_fulltext": True,
457
+ },
458
+ )
459
+
460
+ enriched.append(evidence)
461
+
462
+ return enriched
463
+ ```
464
+
465
+ ---
466
+
467
+ ## Demo Script
468
+
469
+ **File**: `examples/pubmed_fulltext_demo.py`
470
+
471
+ ```python
472
+ #!/usr/bin/env python3
473
+ """Demo script to verify PubMed full-text retrieval."""
474
+
475
+ import asyncio
476
+ from src.tools.pubmed import PubMedTool
477
+
478
+
479
+ async def main():
480
+ """Run PubMed full-text demo."""
481
+ tool = PubMedTool()
482
+
483
+ print("=" * 60)
484
+ print("PubMed Full-Text Demo")
485
+ print("=" * 60)
486
+
487
+ # Test 1: Convert PMID to PMCID
488
+ print("\n[Test 1] Converting PMID to PMCID...")
489
+ # Use a known open-access paper
490
+ test_pmid = "34450029" # Example: COVID-related open-access paper
491
+ pmcid = await tool.get_pmc_id(test_pmid)
492
+ print(f"PMID {test_pmid} -> PMCID: {pmcid or 'Not in PMC'}")
493
+
494
+ # Test 2: Get full text
495
+ print("\n[Test 2] Fetching full text...")
496
+ if pmcid:
497
+ fulltext = await tool.get_fulltext(test_pmid)
498
+ if fulltext:
499
+ print(f"Full text length: {len(fulltext)} characters")
500
+ print(f"Preview: {fulltext[:500]}...")
501
+ else:
502
+ print("Full text not available")
503
+
504
+ # Test 3: Get structured sections
505
+ print("\n[Test 3] Fetching structured sections...")
506
+ if pmcid:
507
+ sections = await tool.get_fulltext_structured(test_pmid)
508
+ if sections:
509
+ print("Available sections:")
510
+ for section, text in sections.items():
511
+ print(f" - {section}: {len(text)} chars")
512
+ else:
513
+ print("Structured text not available")
514
+
515
+ # Test 4: Search with full text
516
+ print("\n[Test 4] Search with full-text enrichment...")
517
+ # Full-text enrichment is configured at construction (see Architecture Decision)
518
+ ft_tool = PubMedTool(include_fulltext=True)
519
+ results = await ft_tool.search("metformin cancer open access", max_results=3)
522
+
523
+ for i, evidence in enumerate(results, 1):
524
+ has_ft = evidence.metadata.get("has_fulltext", False)
525
+ print(f"\n--- Result {i} ---")
526
+ print(f"Title: {evidence.citation.title}")
527
+ print(f"Has Full Text: {has_ft}")
528
+ print(f"Content Length: {len(evidence.content)} chars")
529
+
530
+ print("\n" + "=" * 60)
531
+ print("Demo complete!")
532
+
533
+
534
+ if __name__ == "__main__":
535
+ asyncio.run(main())
536
+ ```
537
+
538
+ ---
539
+
540
+ ## Verification Checklist
541
+
542
+ ### Unit Tests
543
+ ```bash
544
+ # Run full-text tests
545
+ uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v
546
+
547
+ # Run all PubMed tests
548
+ uv run pytest tests/unit/tools/test_pubmed.py -v
549
+
550
+ # Expected: All tests pass
551
+ ```
552
+
553
+ ### Integration Test (Manual)
554
+ ```bash
555
+ # Run demo with real API
556
+ uv run python examples/pubmed_fulltext_demo.py
557
+
558
+ # Expected: Real full text from PMC papers
559
+ ```
560
+
561
+ ### Full Test Suite
562
+ ```bash
563
+ make check
564
+ # Expected: All tests pass, mypy clean
565
+ ```
566
+
567
+ ---
568
+
569
+ ## Success Criteria
570
+
571
+ 1. **ID Conversion works**: PMID -> PMCID conversion successful
572
+ 2. **Full text retrieval works**: BioC API returns paper text
573
+ 3. **Structured sections work**: Can get intro/methods/results/discussion separately
574
+ 4. **Search integration works**: `include_fulltext=True` enriches results
575
+ 5. **No regressions**: Existing tests still pass
576
+ 6. **Graceful degradation**: Non-PMC papers still return abstracts
577
+
578
+ ---
579
+
580
+ ## Notes
581
+
582
+ - Only ~30% of PubMed papers have full text in PMC
583
+ - BioC API has no documented rate limit, but be respectful
584
+ - Full text can be very long - truncate appropriately
585
+ - Consider caching full text responses (they don't change)
586
+ - Timeout should be longer for full text (60s vs 30s)
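+
+ For the caching note, a minimal in-memory sketch (hypothetical helper; assumes
+ published full text is immutable - a real version might use an LRU or disk cache):
+
+ ```python
+ _fulltext_cache: dict[str, str | None] = {}
+
+ async def get_fulltext_cached(tool: PubMedTool, pmid: str) -> str | None:
+     """Return cached full text, fetching at most once per PMID (404s cached too)."""
+     if pmid not in _fulltext_cache:
+         _fulltext_cache[pmid] = await tool.get_fulltext(pmid)
+     return _fulltext_cache[pmid]
+ ```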
docs/brainstorming/implementation/17_PHASE_RATE_LIMITING.md ADDED
@@ -0,0 +1,540 @@
1
+ # Phase 17: Rate Limiting with `limits` Library
2
+
3
+ **Priority**: P0 CRITICAL - Prevents API blocks
4
+ **Effort**: ~1 hour
5
+ **Dependencies**: None
6
+
7
+ ---
8
+
9
+ ## CRITICAL: Async Safety Requirements
10
+
11
+ **WARNING**: The rate limiter MUST be async-safe. Blocking the event loop will freeze:
12
+ - The Gradio UI
13
+ - All parallel searches
14
+ - The orchestrator
15
+
16
+ **Rules**:
17
+ 1. **NEVER use `time.sleep()`** - Always use `await asyncio.sleep()`
18
+ 2. **NEVER use blocking while loops** - Use async-aware polling
19
+ 3. **The `limits` library check is synchronous** - Wrap it carefully
20
+
21
+ The implementation below uses a polling pattern that:
22
+ - Checks the limit (synchronous, fast)
23
+ - If exceeded, `await asyncio.sleep()` (non-blocking)
24
+ - Retry the check
25
+
26
+ **Alternative**: If `limits` proves problematic, use `aiolimiter` which is pure-async.
27
+
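+ A sketch of that alternative (assumes `uv add aiolimiter`; `AsyncLimiter(3, 1)`
+ permits 3 acquisitions per 1-second window and suspends instead of polling):
+
+ ```python
+ import httpx
+ from aiolimiter import AsyncLimiter
+
+ pubmed_limiter = AsyncLimiter(3, 1)  # 3 requests per second
+
+ async def fetch_one(client: httpx.AsyncClient, url: str) -> httpx.Response:
+     async with pubmed_limiter:  # awaits a free slot, never blocks the loop
+         return await client.get(url)
+ ```
+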
28
+ ---
29
+
30
+ ## Overview
31
+
32
+ Replace naive `asyncio.sleep` rate limiting with proper rate limiter using the `limits` library, which provides:
33
+ - Moving window rate limiting
34
+ - Per-API configurable limits
35
+ - Thread-safe storage
36
+ - Already used in reference repo
37
+
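+ For orientation, the raw `limits` primitives look like this (the wrapper built
+ in Step 3 packages exactly this pattern):
+
+ ```python
+ from limits import parse
+ from limits.storage import MemoryStorage
+ from limits.strategies import MovingWindowRateLimiter
+
+ limiter = MovingWindowRateLimiter(MemoryStorage())
+ three_per_second = parse("3/second")
+ allowed = limiter.hit(three_per_second, "ncbi")  # True while under the limit
+ ```
+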
38
+ **Why This Matters**
39
+ - NCBI will block us without proper rate limiting (3/sec without key, 10/sec with)
40
+ - The current implementation has only a simple sleep delay
41
+ - Need coordinated limits across all PubMed calls
42
+ - Professional-grade rate limiting prevents production issues
43
+
44
+ ---
45
+
46
+ ## Current State
47
+
48
+ ### What We Have (`src/tools/pubmed.py:20-21, 34-41`)
49
+
50
+ ```python
51
+ RATE_LIMIT_DELAY = 0.34 # ~3 requests/sec without API key
52
+
53
+ async def _rate_limit(self) -> None:
54
+ """Enforce NCBI rate limiting."""
55
+ loop = asyncio.get_running_loop()
56
+ now = loop.time()
57
+ elapsed = now - self._last_request_time
58
+ if elapsed < self.RATE_LIMIT_DELAY:
59
+ await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed)
60
+ self._last_request_time = loop.time()
61
+ ```
62
+
63
+ ### Problems
64
+
65
+ 1. **Not shared across instances**: Each `PubMedTool()` has its own counter
66
+ 2. **Simple delay vs moving window**: Doesn't handle bursts properly
67
+ 3. **Hardcoded rate**: Doesn't adapt to API key presence
68
+ 4. **No backoff on 429**: Just retries blindly
69
+
70
+ ---
71
+
72
+ ## TDD Implementation Plan
73
+
74
+ ### Step 1: Add Dependency
75
+
76
+ **File**: `pyproject.toml`
77
+
78
+ ```toml
79
+ dependencies = [
80
+ # ... existing deps ...
81
+ "limits>=3.0",
82
+ ]
83
+ ```
84
+
85
+ Then run:
86
+ ```bash
87
+ uv sync
88
+ ```
89
+
90
+ ---
91
+
92
+ ### Step 2: Write the Tests First
93
+
94
+ **File**: `tests/unit/tools/test_rate_limiting.py`
95
+
96
+ ```python
97
+ """Tests for rate limiting functionality."""
98
+
99
+ import asyncio
100
+ import time
101
+
102
+ import pytest
103
+
104
+ from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter
105
+
106
+
107
+ class TestRateLimiter:
108
+ """Test suite for rate limiter."""
109
+
110
+ def test_create_limiter_without_api_key(self) -> None:
111
+ """Should create 3/sec limiter without API key."""
112
+ limiter = RateLimiter(rate="3/second")
113
+ assert limiter.rate == "3/second"
114
+
115
+ def test_create_limiter_with_api_key(self) -> None:
116
+ """Should create 10/sec limiter with API key."""
117
+ limiter = RateLimiter(rate="10/second")
118
+ assert limiter.rate == "10/second"
119
+
120
+ @pytest.mark.asyncio
121
+ async def test_limiter_allows_requests_under_limit(self) -> None:
122
+ """Should allow requests under the rate limit."""
123
+ limiter = RateLimiter(rate="10/second")
124
+
125
+ # 3 requests should all succeed immediately
126
+ for _ in range(3):
127
+ allowed = await limiter.acquire()
128
+ assert allowed is True
129
+
130
+ @pytest.mark.asyncio
131
+ async def test_limiter_blocks_when_exceeded(self) -> None:
132
+ """Should wait when rate limit exceeded."""
133
+ limiter = RateLimiter(rate="2/second")
134
+
135
+ # First 2 should be instant
136
+ await limiter.acquire()
137
+ await limiter.acquire()
138
+
139
+ # Third should block briefly
140
+ start = time.monotonic()
141
+ await limiter.acquire()
142
+ elapsed = time.monotonic() - start
143
+
144
+ # Should have waited until the 1-second moving window freed a slot
145
+ assert elapsed >= 0.3
146
+
147
+ @pytest.mark.asyncio
148
+ async def test_limiter_resets_after_window(self) -> None:
149
+ """Rate limit should reset after time window."""
150
+ limiter = RateLimiter(rate="5/second")
151
+
152
+ # Use up the limit
153
+ for _ in range(5):
154
+ await limiter.acquire()
155
+
156
+ # Wait for window to pass
157
+ await asyncio.sleep(1.1)
158
+
159
+ # Should be allowed again
160
+ start = time.monotonic()
161
+ await limiter.acquire()
162
+ elapsed = time.monotonic() - start
163
+
164
+ assert elapsed < 0.1 # Should be nearly instant
165
+
166
+
167
+ class TestGetPubmedLimiter:
168
+ """Test PubMed-specific limiter factory."""
169
+
170
+ def test_limiter_without_api_key(self) -> None:
171
+ """Should return 3/sec limiter without key."""
172
+ limiter = get_pubmed_limiter(api_key=None)
173
+ assert "3" in limiter.rate
174
+
175
+ def test_limiter_with_api_key(self) -> None:
176
+ """Should return 10/sec limiter with key."""
177
+ limiter = get_pubmed_limiter(api_key="my-api-key")
178
+ assert "10" in limiter.rate
179
+
180
+ def test_limiter_is_singleton(self) -> None:
181
+ """Same API key should return same limiter instance."""
182
+ limiter1 = get_pubmed_limiter(api_key="key1")
183
+ limiter2 = get_pubmed_limiter(api_key="key1")
184
+ assert limiter1 is limiter2
185
+
186
+ def test_different_keys_share_limiter(self) -> None:
187
+ """Different API keys share one limiter - NCBI enforces a single limit."""
188
+ limiter1 = get_pubmed_limiter(api_key="key1")
189
+ limiter2 = get_pubmed_limiter(api_key="key2")
190
+ # All PubMed calls hit the same API, so the rate limit is shared
191
+ assert limiter1 is limiter2
194
+ ```
195
+
196
+ ---
197
+
198
+ ### Step 3: Create Rate Limiter Module
199
+
200
+ **File**: `src/tools/rate_limiter.py`
201
+
202
+ ```python
203
+ """Rate limiting utilities using the limits library."""
204
+
205
+ import asyncio
206
+ from typing import ClassVar
207
+
208
+ from limits import RateLimitItem, parse
209
+ from limits.storage import MemoryStorage
210
+ from limits.strategies import MovingWindowRateLimiter
211
+
212
+
213
+ class RateLimiter:
214
+ """
215
+ Async-compatible rate limiter using limits library.
216
+
217
+ Uses moving window algorithm for smooth rate limiting.
218
+ """
219
+
220
+ def __init__(self, rate: str) -> None:
221
+ """
222
+ Initialize rate limiter.
223
+
224
+ Args:
225
+ rate: Rate string like "3/second" or "10/second"
226
+ """
227
+ self.rate = rate
228
+ self._storage = MemoryStorage()
229
+ self._limiter = MovingWindowRateLimiter(self._storage)
230
+ self._rate_limit: RateLimitItem = parse(rate)
231
+ self._identity = "default" # Single identity for shared limiting
232
+
233
+ async def acquire(self, wait: bool = True) -> bool:
234
+ """
235
+ Acquire permission to make a request.
236
+
237
+ ASYNC-SAFE: Uses asyncio.sleep(), never time.sleep().
238
+ The polling pattern allows other coroutines to run while waiting.
239
+
240
+ Args:
241
+ wait: If True, wait until allowed. If False, return immediately.
242
+
243
+ Returns:
244
+ True if allowed, False if not (only when wait=False)
245
+ """
246
+ while True:
247
+ # Check if we can proceed (synchronous, fast - ~microseconds)
248
+ if self._limiter.hit(self._rate_limit, self._identity):
249
+ return True
250
+
251
+ if not wait:
252
+ return False
253
+
254
+ # CRITICAL: Use asyncio.sleep(), NOT time.sleep()
255
+ # This yields control to the event loop, allowing other
256
+ # coroutines (UI, parallel searches) to run
257
+ await asyncio.sleep(0.1)
258
+
259
+ def reset(self) -> None:
260
+ """Reset the rate limiter (for testing)."""
261
+ self._storage.reset()
262
+
263
+
264
+ # Singleton limiter for PubMed/NCBI
265
+ _pubmed_limiter: RateLimiter | None = None
266
+
267
+
268
+ def get_pubmed_limiter(api_key: str | None = None) -> RateLimiter:
269
+ """
270
+ Get the shared PubMed rate limiter.
271
+
272
+ Rate depends on whether API key is provided:
273
+ - Without key: 3 requests/second
274
+ - With key: 10 requests/second
275
+
276
+ Args:
277
+ api_key: NCBI API key (optional)
278
+
279
+ Returns:
280
+ Shared RateLimiter instance
281
+ """
282
+ global _pubmed_limiter
283
+
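+ # NOTE: the rate is fixed by whichever caller creates the singleton first;
+ # call reset_pubmed_limiter() if the API key changes at runtime.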
284
+ if _pubmed_limiter is None:
285
+ rate = "10/second" if api_key else "3/second"
286
+ _pubmed_limiter = RateLimiter(rate)
287
+
288
+ return _pubmed_limiter
289
+
290
+
291
+ def reset_pubmed_limiter() -> None:
292
+ """Reset the PubMed limiter (for testing)."""
293
+ global _pubmed_limiter
294
+ _pubmed_limiter = None
295
+
296
+
297
+ # Factory for other APIs
298
+ class RateLimiterFactory:
299
+ """Factory for creating/getting rate limiters for different APIs."""
300
+
301
+ _limiters: ClassVar[dict[str, RateLimiter]] = {}
302
+
303
+ @classmethod
304
+ def get(cls, api_name: str, rate: str) -> RateLimiter:
305
+ """
306
+ Get or create a rate limiter for an API.
307
+
308
+ Args:
309
+ api_name: Unique identifier for the API
310
+ rate: Rate limit string (e.g., "10/second")
311
+
312
+ Returns:
313
+ RateLimiter instance (shared for same api_name)
314
+ """
315
+ if api_name not in cls._limiters:
316
+ cls._limiters[api_name] = RateLimiter(rate)
317
+ return cls._limiters[api_name]
318
+
319
+ @classmethod
320
+ def reset_all(cls) -> None:
321
+ """Reset all limiters (for testing)."""
322
+ cls._limiters.clear()
323
+ ```
324
+
325
+ ---
326
+
327
+ ### Step 4: Update PubMed Tool
328
+
329
+ **File**: `src/tools/pubmed.py` (replace rate limiting code)
330
+
331
+ ```python
332
+ # Replace imports and rate limiting
333
+
334
+ from src.tools.rate_limiter import get_pubmed_limiter
335
+
336
+
337
+ class PubMedTool:
338
+ """Search tool for PubMed/NCBI."""
339
+
340
+ BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
341
+ HTTP_TOO_MANY_REQUESTS = 429
342
+
343
+ def __init__(self, api_key: str | None = None) -> None:
344
+ self.api_key = api_key or settings.ncbi_api_key
345
+ if self.api_key == "your-ncbi-key-here":
346
+ self.api_key = None
347
+ # Use shared rate limiter
348
+ self._limiter = get_pubmed_limiter(self.api_key)
349
+
350
+ async def _rate_limit(self) -> None:
351
+ """Enforce NCBI rate limiting using shared limiter."""
352
+ await self._limiter.acquire()
353
+
354
+ # ... rest of class unchanged ...
355
+ ```
356
+
357
+ ---
358
+
359
+ ### Step 5: Add Rate Limiters for Other APIs
360
+
361
+ **File**: `src/tools/clinicaltrials.py` (optional)
362
+
363
+ ```python
364
+ from src.tools.rate_limiter import RateLimiterFactory
365
+
366
+
367
+ class ClinicalTrialsTool:
368
+ def __init__(self) -> None:
369
+ # ClinicalTrials.gov doesn't document limits, but be conservative
370
+ self._limiter = RateLimiterFactory.get("clinicaltrials", "5/second")
371
+
372
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
373
+ await self._limiter.acquire()
374
+ # ... rest of method ...
375
+ ```
376
+
377
+ **File**: `src/tools/europepmc.py` (optional)
378
+
379
+ ```python
380
+ from src.tools.rate_limiter import RateLimiterFactory
381
+
382
+
383
+ class EuropePMCTool:
384
+ def __init__(self) -> None:
385
+ # Europe PMC is generous, but still be respectful
386
+ self._limiter = RateLimiterFactory.get("europepmc", "10/second")
387
+
388
+ async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
389
+ await self._limiter.acquire()
390
+ # ... rest of method ...
391
+ ```
392
+
393
+ ---
394
+
395
+ ## Demo Script
396
+
397
+ **File**: `examples/rate_limiting_demo.py`
398
+
399
+ ```python
400
+ #!/usr/bin/env python3
401
+ """Demo script to verify rate limiting works correctly."""
402
+
403
+ import asyncio
404
+ import time
405
+
406
+ from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter, reset_pubmed_limiter
407
+ from src.tools.pubmed import PubMedTool
408
+
409
+
410
+ async def test_basic_limiter():
411
+ """Test basic rate limiter behavior."""
412
+ print("=" * 60)
413
+ print("Rate Limiting Demo")
414
+ print("=" * 60)
415
+
416
+ # Test 1: Basic limiter
417
+ print("\n[Test 1] Testing 3/second limiter...")
418
+ limiter = RateLimiter("3/second")
419
+
420
+ start = time.monotonic()
421
+ for i in range(6):
422
+ await limiter.acquire()
423
+ elapsed = time.monotonic() - start
424
+ print(f" Request {i+1} at {elapsed:.2f}s")
425
+
426
+ total = time.monotonic() - start
427
+ print(f" Total time for 6 requests: {total:.2f}s (expected ~2s)")
428
+
429
+
430
+ async def test_pubmed_limiter():
431
+ """Test PubMed-specific limiter."""
432
+ print("\n[Test 2] Testing PubMed limiter (shared)...")
433
+
434
+ reset_pubmed_limiter() # Clean state
435
+
436
+ # Without API key: 3/sec
437
+ limiter = get_pubmed_limiter(api_key=None)
438
+ print(f" Rate without key: {limiter.rate}")
439
+
440
+ # Multiple tools should share the same limiter
441
+ tool1 = PubMedTool()
442
+ tool2 = PubMedTool()
443
+
444
+ # Verify they share the limiter
445
+ print(f" Tools share limiter: {tool1._limiter is tool2._limiter}")
446
+
447
+
448
+ async def test_concurrent_requests():
449
+ """Test rate limiting under concurrent load."""
450
+ print("\n[Test 3] Testing concurrent request limiting...")
451
+
452
+ limiter = RateLimiter("5/second")
453
+
454
+ async def make_request(i: int):
455
+ await limiter.acquire()
456
+ return time.monotonic()
457
+
458
+ start = time.monotonic()
459
+ # Launch 10 concurrent requests
460
+ tasks = [make_request(i) for i in range(10)]
461
+ times = await asyncio.gather(*tasks)
462
+
463
+ # Calculate distribution
464
+ relative_times = [t - start for t in times]
465
+ print(f" Request times: {[f'{t:.2f}s' for t in sorted(relative_times)]}")
466
+
467
+ total = max(relative_times)
468
+ print(f" All 10 requests completed in {total:.2f}s (expected ~2s)")
469
+
470
+
471
+ async def main():
472
+ await test_basic_limiter()
473
+ await test_pubmed_limiter()
474
+ await test_concurrent_requests()
475
+
476
+ print("\n" + "=" * 60)
477
+ print("Demo complete!")
478
+
479
+
480
+ if __name__ == "__main__":
481
+ asyncio.run(main())
482
+ ```
483
+
484
+ ---
485
+
486
+ ## Verification Checklist
487
+
488
+ ### Unit Tests
489
+ ```bash
490
+ # Run rate limiting tests
491
+ uv run pytest tests/unit/tools/test_rate_limiting.py -v
492
+
493
+ # Expected: All tests pass
494
+ ```
495
+
496
+ ### Integration Test (Manual)
497
+ ```bash
498
+ # Run demo
499
+ uv run python examples/rate_limiting_demo.py
500
+
501
+ # Expected: Requests properly spaced
502
+ ```
503
+
504
+ ### Full Test Suite
505
+ ```bash
506
+ make check
507
+ # Expected: All tests pass, mypy clean
508
+ ```
509
+
510
+ ---
511
+
512
+ ## Success Criteria
513
+
514
+ 1. **`limits` library installed**: Dependency added to pyproject.toml
515
+ 2. **RateLimiter class works**: Can create and use limiters
516
+ 3. **PubMed uses new limiter**: Shared limiter across instances
517
+ 4. **Rate adapts to API key**: 3/sec without, 10/sec with
518
+ 5. **Concurrent requests handled**: Multiple async requests properly queued
519
+ 6. **No regressions**: All existing tests pass
520
+
521
+ ---
522
+
523
+ ## API Rate Limit Reference
524
+
525
+ | API | Without Key | With Key |
526
+ |-----|-------------|----------|
527
+ | PubMed/NCBI | 3/sec | 10/sec |
528
+ | ClinicalTrials.gov | Undocumented (~5/sec safe) | N/A |
529
+ | Europe PMC | ~10-20/sec (generous) | N/A |
530
+ | OpenAlex | ~100k/day (no per-sec limit) | Faster with `mailto` |
531
+
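+ One way to centralize these values with the factory from Step 3 (dict name is
+ illustrative):
+
+ ```python
+ API_RATES = {
+     "pubmed": "3/second",  # use "10/second" when an NCBI key is set
+     "clinicaltrials": "5/second",
+     "europepmc": "10/second",
+ }
+
+ limiter = RateLimiterFactory.get("europepmc", API_RATES["europepmc"])
+ ```
+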
532
+ ---
533
+
534
+ ## Notes
535
+
536
+ - `limits` library uses moving window algorithm (fairer than fixed window)
537
+ - Singleton pattern ensures all PubMed calls share the limit
538
+ - The factory pattern allows easy extension to other APIs
539
+ - Consider adding 429 response detection + exponential backoff
540
+ - In production, consider Redis storage for distributed rate limiting
docs/brainstorming/implementation/README.md ADDED
@@ -0,0 +1,143 @@
1
+ # Implementation Plans
2
+
3
+ TDD implementation plans based on the brainstorming documents. Each phase is a self-contained vertical slice with tests, implementation, and demo scripts.
4
+
5
+ ---
6
+
7
+ ## Prerequisites (COMPLETED)
8
+
9
+ The following foundational changes have been implemented to support all three phases:
10
+
11
+ | Change | File | Status |
12
+ |--------|------|--------|
13
+ | Add `"openalex"` to `SourceName` | `src/utils/models.py:9` | ✅ Done |
14
+ | Add `metadata` field to `Evidence` | `src/utils/models.py:39-42` | ✅ Done |
15
+ | Export all tools from `__init__.py` | `src/tools/__init__.py` | ✅ Done |
16
+
17
+ All 110 tests pass after these changes.
18
+
19
+ ---
20
+
21
+ ## Priority Order
22
+
23
+ | Phase | Name | Priority | Effort | Value |
24
+ |-------|------|----------|--------|-------|
25
+ | **17** | Rate Limiting | P0 CRITICAL | 1 hour | Stability |
26
+ | **15** | OpenAlex | HIGH | 2-3 hours | Very High |
27
+ | **16** | PubMed Full-Text | MEDIUM | 3 hours | High |
28
+
29
+ **Recommended implementation order**: 17 → 15 → 16
30
+
31
+ ---
32
+
33
+ ## Phase 15: OpenAlex Integration
34
+
35
+ **File**: [15_PHASE_OPENALEX.md](./15_PHASE_OPENALEX.md)
36
+
37
+ Add OpenAlex as 4th data source for:
38
+ - Citation networks (who cites whom)
39
+ - Concept tagging (semantic discovery)
40
+ - 209M+ scholarly works
41
+ - Free, no API key required
42
+
43
+ **Quick Start**:
44
+ ```bash
45
+ # Create the tool
46
+ touch src/tools/openalex.py
47
+ touch tests/unit/tools/test_openalex.py
48
+
49
+ # Run tests first (TDD)
50
+ uv run pytest tests/unit/tools/test_openalex.py -v
51
+
52
+ # Demo
53
+ uv run python examples/openalex_demo.py
54
+ ```
55
+
56
+ ---
57
+
58
+ ## Phase 16: PubMed Full-Text
59
+
60
+ **File**: [16_PHASE_PUBMED_FULLTEXT.md](./16_PHASE_PUBMED_FULLTEXT.md)
61
+
62
+ Add full-text retrieval via BioC API for:
63
+ - Complete paper text (not just abstracts)
64
+ - Structured sections (intro, methods, results)
65
+ - Better evidence for LLM synthesis
66
+
67
+ **Quick Start**:
68
+ ```bash
69
+ # Add methods to existing pubmed.py
70
+ # Tests in test_pubmed_fulltext.py
71
+
72
+ # Run tests
73
+ uv run pytest tests/unit/tools/test_pubmed_fulltext.py -v
74
+
75
+ # Demo
76
+ uv run python examples/pubmed_fulltext_demo.py
77
+ ```
78
+
79
+ ---
80
+
81
+ ## Phase 17: Rate Limiting
82
+
83
+ **File**: [17_PHASE_RATE_LIMITING.md](./17_PHASE_RATE_LIMITING.md)
84
+
85
+ Replace naive sleep-based rate limiting with `limits` library for:
86
+ - Moving window algorithm
87
+ - Shared limits across instances
88
+ - Configurable per-API rates
89
+ - Production-grade stability
90
+
91
+ **Quick Start**:
92
+ ```bash
93
+ # Add dependency
94
+ uv add limits
95
+
96
+ # Create module
97
+ touch src/tools/rate_limiter.py
98
+ touch tests/unit/tools/test_rate_limiting.py
99
+
100
+ # Run tests
101
+ uv run pytest tests/unit/tools/test_rate_limiting.py -v
102
+
103
+ # Demo
104
+ uv run python examples/rate_limiting_demo.py
105
+ ```
106
+
107
+ ---
108
+
109
+ ## TDD Workflow
110
+
111
+ Each implementation doc follows this pattern:
112
+
113
+ 1. **Write tests first** - Define expected behavior
114
+ 2. **Run tests** - Verify they fail (red)
115
+ 3. **Implement** - Write minimal code to pass
116
+ 4. **Run tests** - Verify they pass (green)
117
+ 5. **Refactor** - Clean up if needed
118
+ 6. **Demo** - Verify end-to-end with real APIs
119
+ 7. **`make check`** - Ensure no regressions
120
+
121
+ ---
122
+
123
+ ## Related Brainstorming Docs
124
+
125
+ These implementation plans are derived from:
126
+
127
+ - [00_ROADMAP_SUMMARY.md](../00_ROADMAP_SUMMARY.md) - Priority overview
128
+ - [01_PUBMED_IMPROVEMENTS.md](../01_PUBMED_IMPROVEMENTS.md) - PubMed details
129
+ - [02_CLINICALTRIALS_IMPROVEMENTS.md](../02_CLINICALTRIALS_IMPROVEMENTS.md) - CT.gov details
130
+ - [03_EUROPEPMC_IMPROVEMENTS.md](../03_EUROPEPMC_IMPROVEMENTS.md) - Europe PMC details
131
+ - [04_OPENALEX_INTEGRATION.md](../04_OPENALEX_INTEGRATION.md) - OpenAlex integration
132
+
133
+ ---
134
+
135
+ ## Future Phases (Not Yet Documented)
136
+
137
+ Based on brainstorming, these could be added later:
138
+
139
+ - **Phase 18**: ClinicalTrials.gov Results Retrieval
140
+ - **Phase 19**: Europe PMC Annotations API
141
+ - **Phase 20**: Drug Name Normalization (RxNorm)
142
+ - **Phase 21**: Citation Network Queries (OpenAlex)
143
+ - **Phase 22**: Semantic Search with Embeddings
docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md CHANGED
@@ -3,131 +3,79 @@
3
  **Priority**: P1 (UX Bug)
4
  **Status**: OPEN
5
  **Date**: 2025-11-27
 
6
 
7
  ---
8
 
9
- ## Bug Description
10
 
11
- The "Settings" accordion in the Gradio UI does not collapse/hide its content. Even when the accordion arrow shows "collapsed" state, all settings (Orchestrator Mode, API Key, API Provider) remain visible.
12
 
13
- ---
14
-
15
- ## Root Cause
16
 
17
- **Known Gradio Bug**: `additional_inputs_accordion` does not work correctly when `ChatInterface` is used inside `gr.Blocks()`.
18
 
19
- **GitHub Issue**: [gradio-app/gradio#8861](https://github.com/gradio-app/gradio/issues/8861)
20
- > "Is there any subsequent plan to support gr.ChatInterface inheritance under gr.Block()? Currently using accordion is not working well."
21
 
22
- **Our Code** (`src/app.py` lines 196-250):
23
- ```python
24
- with gr.Blocks(...) as demo: # <-- Using gr.Blocks wrapper
25
- gr.ChatInterface(
26
- ...
27
- additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
28
- additional_inputs=[...],
29
- )
30
- ```
31
 
32
- The `additional_inputs_accordion` parameter is designed for standalone `ChatInterface`, but breaks when wrapped in `gr.Blocks()`.
33
 
34
  ---
35
 
36
- ## Evidence
37
 
38
- - Accordion arrow toggles (visual feedback works)
39
- - Content does NOT hide when collapsed
40
- - Same behavior in local dev and HuggingFace Spaces
41
-
42
- ---
43
 
44
- ## Possible Fixes
45
 
46
- ### Option 1: Remove gr.Blocks Wrapper (Recommended)
47
 
48
- If we don't need the header/footer markdown, use standalone `ChatInterface`:
49
 
 
50
  ```python
51
- # Instead of gr.Blocks wrapper
52
- demo = gr.ChatInterface(
53
- fn=research_agent,
54
- title="🧬 DeepCritical",
55
- description="AI-Powered Drug Repurposing Agent",
56
- additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
57
- additional_inputs=[...],
58
- )
59
  ```
60
 
61
- **Pros**: Accordion should work correctly
62
- **Cons**: Less control over layout, no custom header/footer
63
-
64
- ### Option 2: Manual Accordion Outside ChatInterface
65
-
66
- Move settings outside `ChatInterface` into a proper `gr.Accordion`:
67
-
68
  ```python
69
- with gr.Blocks() as demo:
70
- gr.Markdown("# 🧬 DeepCritical")
71
-
72
- with gr.Accordion("⚙️ Settings", open=False):
73
- mode = gr.Radio(choices=["simple", "magentic"], value="simple", label="Mode")
74
- api_key = gr.Textbox(label="API Key", type="password")
75
- provider = gr.Radio(choices=["openai", "anthropic"], value="openai", label="Provider")
76
-
77
- chatbot = gr.Chatbot()
78
- msg = gr.Textbox(label="Ask a research question")
79
-
80
- msg.submit(research_agent, [msg, chatbot, mode, api_key, provider], chatbot)
81
- ```
82
-
83
- **Pros**: Full control, accordion works
84
- **Cons**: More code, lose ChatInterface conveniences (examples, etc.)
85
-
86
- ### Option 3: Wait for Gradio Fix
87
-
88
- Gradio added `.expand()` and `.collapse()` events in recent versions. Upgrading might help.
89
-
90
- **Check current version**:
91
- ```bash
92
- pip show gradio | grep Version
93
- ```
94
-
95
- **Upgrade**:
96
- ```bash
97
- pip install --upgrade gradio
98
  ```
99
 
100
  ---
101
 
102
- ## Recommendation
103
-
104
- **Option 1** (Remove gr.Blocks) is cleanest if we can live without custom header/footer.
105
-
106
- If header/footer needed, **Option 2** gives working accordion at cost of more code.
107
-
108
- ---
109
-
110
- ## Files to Modify
111
-
112
- | File | Change |
113
- |------|--------|
114
- | `src/app.py` | Restructure UI per chosen option |
115
- | `pyproject.toml` | Possibly upgrade Gradio version |
116
-
117
- ---
118
-
119
- ## Test Plan
120
 
121
- 1. Run locally: `uv run python -m src.app`
122
- 2. Click Settings accordion to collapse
123
- 3. Verify content hides when collapsed
124
- 4. Verify content shows when expanded
125
- 5. Test on HuggingFace Spaces after deploy
 
 
126
 
127
  ---
128
 
129
- ## Sources
130
 
131
- - [Gradio Issue #8861 - Accordion not working in Blocks](https://github.com/gradio-app/gradio/issues/8861)
132
- - [Gradio ChatInterface Docs](https://www.gradio.app/docs/gradio/chatinterface)
133
- - [Gradio Accordion Docs](https://www.gradio.app/docs/gradio/accordion)
 
3
  **Priority**: P1 (UX Bug)
4
  **Status**: OPEN
5
  **Date**: 2025-11-27
6
+ **Target Component**: `src/app.py`
7
 
8
  ---
9
 
10
+ ## 1. Problem Description
11
 
12
+ The "Settings" accordion in the Gradio UI (containing Orchestrator Mode, API Key, Provider) fails to collapse, even when configured with `open=False`. It remains permanently expanded, cluttering the interface and obscuring the chat history.
13
 
14
+ ### Symptoms
15
+ - Accordion arrow toggles visually, but content remains visible.
16
+ - Occurs in both local development (`uv run src/app.py`) and HuggingFace Spaces.
17
 
18
+ ---
19
 
20
+ ## 2. Root Cause Analysis
 
21
 
22
+ **Definitive Cause**: Nested `Blocks` Context Bug.
23
+ `gr.ChatInterface` is itself a high-level abstraction that creates a `gr.Blocks` context. Wrapping `gr.ChatInterface` inside an external `with gr.Blocks():` context causes event listener conflicts, specifically breaking the JavaScript state management for `additional_inputs_accordion`.
24
 
25
+ **Reference**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) confirms that `additional_inputs_accordion` malfunctions when `ChatInterface` is not the top-level block.
26
 
27
  ---
28
 
29
+ ## 3. Solution Strategy: "The Unwrap Fix"
30
 
31
32
 
33
+ ### Implementation Plan
34
 
35
+ **Refactor `src/app.py` / `create_demo()`**:
36
 
37
+ 1. **Remove** the `with gr.Blocks() as demo:` context manager.
38
+ 2. **Instantiate** `gr.ChatInterface` directly as the `demo` object.
39
+ 3. **Migrate UI Elements**:
40
+ * **Header**: Move the H1/Title text into the `title` parameter of `ChatInterface`.
41
+ * **Footer**: Move the footer text ("MCP Server Active...") into the `description` parameter. `ChatInterface` supports Markdown in `description`, making it the ideal place for static info below the title but above the chat.
42
 
43
+ ### Before (Buggy)
44
  ```python
45
+ def create_demo():
46
+ with gr.Blocks() as demo: # <--- CAUSE OF BUG
47
+ gr.Markdown("# Title")
48
+ gr.ChatInterface(..., additional_inputs_accordion=gr.Accordion(open=False))
49
+ gr.Markdown("Footer")
50
+ return demo
 
 
51
  ```
52
 
53
+ ### After (Correct)
54
  ```python
55
+ def create_demo():
56
+ return gr.ChatInterface( # <--- FIX: Top-level component
57
+ ...,
58
+ title="🧬 DeepCritical",
59
+ description="*AI-Powered Drug Repurposing Agent...*\n\n---\n**MCP Server Active**...",
60
+ additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False)
61
+ )
62
  ```
63
 
64
  ---
65
 
66
+ ## 4. Validation
67
 
68
+ 1. **Run**: `uv run python src/app.py`
69
+ 2. **Check**: Open `http://localhost:7860`
70
+ 3. **Verify**:
71
+ * Settings accordion starts **COLLAPSED**.
72
+ * Header title ("DeepCritical") is visible.
73
+ * Footer text ("MCP Server Active") is visible in the description area.
74
+ * Chat functionality works (Magentic/Simple modes).
75
 
76
  ---
77
 
78
+ ## 5. Constraints & Notes
79
 
80
+ - **Layout**: We lose the ability to place arbitrary elements *below* the chat box (footer will move to top, under title), but this is an acceptable trade-off for a working UI.
81
+ - **CSS**: `ChatInterface` handles its own CSS; any custom class styling from the previous footer will be standardized to the description text style.
 
examples/rate_limiting_demo.py ADDED
@@ -0,0 +1,82 @@
1
+ #!/usr/bin/env python3
2
+ """Demo script to verify rate limiting works correctly."""
3
+
4
+ import asyncio
5
+ import time
6
+
7
+ from src.tools.pubmed import PubMedTool
8
+ from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter, reset_pubmed_limiter
9
+
10
+
11
+ async def test_basic_limiter():
12
+ """Test basic rate limiter behavior."""
13
+ print("=" * 60)
14
+ print("Rate Limiting Demo")
15
+ print("=" * 60)
16
+
17
+ # Test 1: Basic limiter
18
+ print("\n[Test 1] Testing 3/second limiter...")
19
+ limiter = RateLimiter("3/second")
20
+
21
+ start = time.monotonic()
22
+ for i in range(6):
23
+ await limiter.acquire()
24
+ elapsed = time.monotonic() - start
25
+ print(f" Request {i+1} at {elapsed:.2f}s")
26
+
27
+ total = time.monotonic() - start
28
+ print(f" Total time for 6 requests: {total:.2f}s (expected ~2s)")
29
+
30
+
31
+ async def test_pubmed_limiter():
32
+ """Test PubMed-specific limiter."""
33
+ print("\n[Test 2] Testing PubMed limiter (shared)...")
34
+
35
+ reset_pubmed_limiter() # Clean state
36
+
37
+ # Without API key: 3/sec
38
+ limiter = get_pubmed_limiter(api_key=None)
39
+ print(f" Rate without key: {limiter.rate}")
40
+
41
+ # Multiple tools should share the same limiter
42
+ tool1 = PubMedTool()
43
+ tool2 = PubMedTool()
44
+
45
+ # Verify they share the limiter
46
+ print(f" Tools share limiter: {tool1._limiter is tool2._limiter}")
47
+
48
+
49
+ async def test_concurrent_requests():
50
+ """Test rate limiting under concurrent load."""
51
+ print("\n[Test 3] Testing concurrent request limiting...")
52
+
53
+ limiter = RateLimiter("5/second")
54
+
55
+ async def make_request(i: int):
56
+ await limiter.acquire()
57
+ return time.monotonic()
58
+
59
+ start = time.monotonic()
60
+ # Launch 10 concurrent requests
61
+ tasks = [make_request(i) for i in range(10)]
62
+ times = await asyncio.gather(*tasks)
63
+
64
+ # Calculate distribution
65
+ relative_times = [t - start for t in times]
66
+ print(f" Request times: {[f'{t:.2f}s' for t in sorted(relative_times)]}")
67
+
68
+ total = max(relative_times)
69
+ print(f" All 10 requests completed in {total:.2f}s (expected ~2s)")
70
+
71
+
72
+ async def main():
73
+ await test_basic_limiter()
74
+ await test_pubmed_limiter()
75
+ await test_concurrent_requests()
76
+
77
+ print("\n" + "=" * 60)
78
+ print("Demo complete!")
79
+
80
+
81
+ if __name__ == "__main__":
82
+ asyncio.run(main())
pyproject.toml CHANGED
@@ -24,6 +24,7 @@ dependencies = [
24
  "tenacity>=8.2", # Retry logic
25
  "structlog>=24.1", # Structured logging
26
  "requests>=2.32.5", # ClinicalTrials.gov (httpx blocked by WAF)
 
27
  ]
28
 
29
  [project.optional-dependencies]
 
24
  "tenacity>=8.2", # Retry logic
25
  "structlog>=24.1", # Structured logging
26
  "requests>=2.32.5", # ClinicalTrials.gov (httpx blocked by WAF)
27
+ "limits>=3.0", # Rate limiting
28
  ]
29
 
30
  [project.optional-dependencies]
src/app.py CHANGED
@@ -186,78 +186,66 @@ async def research_agent(
186
  yield f"❌ **Error**: {e!s}"
187
 
188
 
189
- def create_demo() -> Any:
190
  """
191
  Create the Gradio demo interface with MCP support.
192
 
193
  Returns:
194
  Configured Gradio Blocks interface with MCP server enabled
195
  """
196
- with gr.Blocks(
197
- title="DeepCritical - Drug Repurposing Research Agent",
198
- ) as demo:
199
- # 1. Minimal Header (Option A: 2 lines max)
200
- gr.Markdown(
201
- "# 🧬 DeepCritical\n"
202
- "*AI-Powered Drug Repurposing Agent — searches PubMed, ClinicalTrials.gov & Europe PMC*"
203
- )
204
-
205
- # 2. Main Chat Interface
206
- # Config inputs will be in a collapsed accordion below the chat input
207
- gr.ChatInterface(
208
- fn=research_agent,
209
- examples=[
210
- [
211
- "What drugs could be repurposed for Alzheimer's disease?",
212
- "simple",
213
- "",
214
- "openai",
215
- ],
216
- [
217
- "Is metformin effective for treating cancer?",
218
- "simple",
219
- "",
220
- "openai",
221
- ],
222
- [
223
- "What medications show promise for Long COVID treatment?",
224
- "simple",
225
- "",
226
- "openai",
227
- ],
228
  ],
229
- additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
230
- additional_inputs=[
231
- gr.Radio(
232
- choices=["simple", "magentic"],
233
- value="simple",
234
- label="Orchestrator Mode",
235
- info="Simple: Linear | Magentic: Multi-Agent (OpenAI)",
236
- ),
237
- gr.Textbox(
238
- label="🔑 API Key (Optional - BYOK)",
239
- placeholder="sk-... or sk-ant-...",
240
- type="password",
241
- info="Enter your own API key. Never stored.",
242
- ),
243
- gr.Radio(
244
- choices=["openai", "anthropic"],
245
- value="openai",
246
- label="API Provider",
247
- info="Select the provider for your API key",
248
- ),
249
  ],
250
- )
251
-
252
- # 3. Minimal Footer (Option C: Remove MCP Tabs, keep info)
253
- gr.Markdown(
254
- """
255
- ---
256
- *Research tool only — not for medical advice.*
257
- **MCP Server Active**: Connect Claude Desktop to `/gradio_api/mcp/`
258
- """,
259
- elem_classes=["footer"],
260
- )
261
 
262
  return demo
263
 
 
186
  yield f"❌ **Error**: {e!s}"
187
 
188
 
189
+ def create_demo() -> gr.ChatInterface:
190
  """
191
  Create the Gradio demo interface with MCP support.
192
 
193
  Returns:
194
  Configured Gradio Blocks interface with MCP server enabled
195
  """
196
+ # 1. Unwrapped ChatInterface (Fixes Accordion Bug)
197
+ demo = gr.ChatInterface(
198
+ fn=research_agent,
199
+ title="🧬 DeepCritical",
200
+ description=(
201
+ "*AI-Powered Drug Repurposing Agent — searches PubMed, "
202
+ "ClinicalTrials.gov & Europe PMC*\n\n"
203
+ "---\n"
204
+ "*Research tool only — not for medical advice.* \n"
205
+ "**MCP Server Active**: Connect Claude Desktop to `/gradio_api/mcp/`"
206
+ ),
207
+ examples=[
208
+ [
209
+ "What drugs could be repurposed for Alzheimer's disease?",
210
+ "simple",
211
+ "",
212
+ "openai",
 
213
  ],
214
+ [
215
+ "Is metformin effective for treating cancer?",
216
+ "simple",
217
+ "",
218
+ "openai",
219
  ],
220
+ [
221
+ "What medications show promise for Long COVID treatment?",
222
+ "simple",
223
+ "",
224
+ "openai",
225
+ ],
226
+ ],
227
+ additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
228
+ additional_inputs=[
229
+ gr.Radio(
230
+ choices=["simple", "magentic"],
231
+ value="simple",
232
+ label="Orchestrator Mode",
233
+ info="Simple: Linear | Magentic: Multi-Agent (OpenAI)",
234
+ ),
235
+ gr.Textbox(
236
+ label="🔑 API Key (Optional - BYOK)",
237
+ placeholder="sk-... or sk-ant-...",
238
+ type="password",
239
+ info="Enter your own API key. Never stored.",
240
+ ),
241
+ gr.Radio(
242
+ choices=["openai", "anthropic"],
243
+ value="openai",
244
+ label="API Provider",
245
+ info="Select the provider for your API key",
246
+ ),
247
+ ],
248
+ )
249
 
250
  return demo
251
 
src/tools/__init__.py CHANGED
@@ -1,8 +1,16 @@
1
  """Search tools package."""
2
 
3
  from src.tools.base import SearchTool
 
 
4
  from src.tools.pubmed import PubMedTool
5
  from src.tools.search_handler import SearchHandler
6
 
7
- # Re-export
8
- __all__ = ["PubMedTool", "SearchHandler", "SearchTool"]
 
1
  """Search tools package."""
2
 
3
  from src.tools.base import SearchTool
4
+ from src.tools.clinicaltrials import ClinicalTrialsTool
5
+ from src.tools.europepmc import EuropePMCTool
6
  from src.tools.pubmed import PubMedTool
7
  from src.tools.search_handler import SearchHandler
8
 
9
+ # Re-export all search tools
10
+ __all__ = [
11
+ "ClinicalTrialsTool",
12
+ "EuropePMCTool",
13
+ "PubMedTool",
14
+ "SearchHandler",
15
+ "SearchTool",
16
+ ]
src/tools/pubmed.py CHANGED
@@ -1,6 +1,5 @@
1
  """PubMed search tool using NCBI E-utilities."""
2
 
3
- import asyncio
4
  from typing import Any
5
 
6
  import httpx
@@ -8,6 +7,7 @@ import xmltodict
8
  from tenacity import retry, stop_after_attempt, wait_exponential
9
 
10
  from src.tools.query_utils import preprocess_query
 
11
  from src.utils.config import settings
12
  from src.utils.exceptions import RateLimitError, SearchError
13
  from src.utils.models import Citation, Evidence
@@ -17,7 +17,6 @@ class PubMedTool:
17
  """Search tool for PubMed/NCBI."""
18
 
19
  BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
20
- RATE_LIMIT_DELAY = 0.34 # ~3 requests/sec without API key
21
  HTTP_TOO_MANY_REQUESTS = 429
22
 
23
  def __init__(self, api_key: str | None = None) -> None:
@@ -25,7 +24,9 @@ class PubMedTool:
25
  # Ignore placeholder values from .env.example
26
  if self.api_key == "your-ncbi-key-here":
27
  self.api_key = None
28
- self._last_request_time = 0.0
 
 
29
 
30
  @property
31
  def name(self) -> str:
@@ -33,12 +34,7 @@ class PubMedTool:
33
 
34
  async def _rate_limit(self) -> None:
35
  """Enforce NCBI rate limiting."""
36
- loop = asyncio.get_running_loop()
37
- now = loop.time()
38
- elapsed = now - self._last_request_time
39
- if elapsed < self.RATE_LIMIT_DELAY:
40
- await asyncio.sleep(self.RATE_LIMIT_DELAY - elapsed)
41
- self._last_request_time = loop.time()
42
 
43
  def _build_params(self, **kwargs: Any) -> dict[str, Any]:
44
  """Build request params with optional API key."""
 
1
  """PubMed search tool using NCBI E-utilities."""
2
 
 
3
  from typing import Any
4
 
5
  import httpx
 
7
  from tenacity import retry, stop_after_attempt, wait_exponential
8
 
9
  from src.tools.query_utils import preprocess_query
10
+ from src.tools.rate_limiter import get_pubmed_limiter
11
  from src.utils.config import settings
12
  from src.utils.exceptions import RateLimitError, SearchError
13
  from src.utils.models import Citation, Evidence
 
17
  """Search tool for PubMed/NCBI."""
18
 
19
  BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
 
20
  HTTP_TOO_MANY_REQUESTS = 429
21
 
22
  def __init__(self, api_key: str | None = None) -> None:
 
24
  # Ignore placeholder values from .env.example
25
  if self.api_key == "your-ncbi-key-here":
26
  self.api_key = None
27
+
28
+ # Use shared rate limiter
29
+ self._limiter = get_pubmed_limiter(self.api_key)
30
 
31
  @property
32
  def name(self) -> str:
 
34
 
35
  async def _rate_limit(self) -> None:
36
  """Enforce NCBI rate limiting."""
37
+ await self._limiter.acquire()
38
 
39
  def _build_params(self, **kwargs: Any) -> dict[str, Any]:
40
  """Build request params with optional API key."""
src/tools/rate_limiter.py ADDED
@@ -0,0 +1,121 @@
1
+ """Rate limiting utilities using the limits library."""
2
+
3
+ import asyncio
4
+ from typing import ClassVar
5
+
6
+ from limits import RateLimitItem, parse
7
+ from limits.storage import MemoryStorage
8
+ from limits.strategies import MovingWindowRateLimiter
9
+
10
+
11
+ class RateLimiter:
12
+ """
13
+ Async-compatible rate limiter using limits library.
14
+
15
+ Uses moving window algorithm for smooth rate limiting.
16
+ """
17
+
18
+ def __init__(self, rate: str) -> None:
19
+ """
20
+ Initialize rate limiter.
21
+
22
+ Args:
23
+ rate: Rate string like "3/second" or "10/second"
24
+ """
25
+ self.rate = rate
26
+ self._storage = MemoryStorage()
27
+ self._limiter = MovingWindowRateLimiter(self._storage)
28
+ self._rate_limit: RateLimitItem = parse(rate)
29
+ self._identity = "default" # Single identity for shared limiting
30
+
31
+ async def acquire(self, wait: bool = True) -> bool:
32
+ """
33
+ Acquire permission to make a request.
34
+
35
+ ASYNC-SAFE: Uses asyncio.sleep(), never time.sleep().
36
+ The polling pattern allows other coroutines to run while waiting.
37
+
38
+ Args:
39
+ wait: If True, wait until allowed. If False, return immediately.
40
+
41
+ Returns:
42
+ True if allowed, False if not (only when wait=False)
43
+ """
44
+ while True:
45
+ # Check if we can proceed (synchronous, fast - ~microseconds)
46
+ if self._limiter.hit(self._rate_limit, self._identity):
47
+ return True
48
+
49
+ if not wait:
50
+ return False
51
+
52
+ # CRITICAL: Use asyncio.sleep(), NOT time.sleep()
53
+ # This yields control to the event loop, allowing other
54
+ # coroutines (UI, parallel searches) to run.
55
+ # Using 0.01s for fine-grained responsiveness.
56
+ await asyncio.sleep(0.01)
57
+
58
+ def reset(self) -> None:
59
+ """Reset the rate limiter (for testing)."""
60
+ self._storage.reset()
61
+
62
+
63
+ # Singleton limiter for PubMed/NCBI
64
+ _pubmed_limiter: RateLimiter | None = None
65
+
66
+
67
+ def get_pubmed_limiter(api_key: str | None = None) -> RateLimiter:
68
+ """
69
+ Get the shared PubMed rate limiter.
70
+
71
+ Rate depends on whether API key is provided:
72
+ - Without key: 3 requests/second
73
+ - With key: 10 requests/second
74
+
75
+ Args:
76
+ api_key: NCBI API key (optional)
77
+
78
+ Returns:
79
+ Shared RateLimiter instance
80
+ """
81
+ global _pubmed_limiter
82
+
83
+ if _pubmed_limiter is None:
84
+ rate = "10/second" if api_key else "3/second"
85
+ _pubmed_limiter = RateLimiter(rate)
86
+
87
+ return _pubmed_limiter
88
+
89
+
90
+ def reset_pubmed_limiter() -> None:
91
+ """Reset the PubMed limiter (for testing)."""
92
+ global _pubmed_limiter
93
+ _pubmed_limiter = None
94
+
95
+
96
+ # Factory for other APIs
97
+ class RateLimiterFactory:
98
+ """Factory for creating/getting rate limiters for different APIs."""
99
+
100
+ _limiters: ClassVar[dict[str, RateLimiter]] = {}
101
+
102
+ @classmethod
103
+ def get(cls, api_name: str, rate: str) -> RateLimiter:
104
+ """
105
+ Get or create a rate limiter for an API.
106
+
107
+ Args:
108
+ api_name: Unique identifier for the API
109
+ rate: Rate limit string (e.g., "10/second")
110
+
111
+ Returns:
112
+ RateLimiter instance (shared for same api_name)
113
+ """
114
+ if api_name not in cls._limiters:
115
+ cls._limiters[api_name] = RateLimiter(rate)
116
+ return cls._limiters[api_name]
117
+
118
+ @classmethod
119
+ def reset_all(cls) -> None:
120
+ """Reset all limiters (for testing)."""
121
+ cls._limiters.clear()
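`RateLimiterFactory` exists so other tools can share per-API limits without adding a new module-level singleton each time. A sketch: the `"europepmc"` name and `"10/second"` rate here are illustrative assumptions, not values from this diff.

```python
import asyncio

from src.tools.rate_limiter import RateLimiterFactory


async def guarded_request() -> None:
    # Same api_name -> same shared limiter across all callers in the process.
    limiter = RateLimiterFactory.get("europepmc", "10/second")
    await limiter.acquire()
    # ...issue the HTTP request here...


asyncio.run(guarded_request())
```
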
src/utils/models.py CHANGED
@@ -6,7 +6,7 @@ from typing import Any, ClassVar, Literal
6
  from pydantic import BaseModel, Field
7
 
8
  # Centralized source type - add new sources here (e.g., new databases)
9
- SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint"]
10
 
11
 
12
  class Citation(BaseModel):
@@ -36,6 +36,10 @@ class Evidence(BaseModel):
36
  content: str = Field(min_length=1, description="The actual text content")
37
  citation: Citation
38
  relevance: float = Field(default=0.0, ge=0.0, le=1.0, description="Relevance score 0-1")
39
 
40
  model_config = {"frozen": True}
41
 
 
6
  from pydantic import BaseModel, Field
7
 
8
  # Centralized source type - add new sources here (e.g., new databases)
9
+ SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex"]
10
 
11
 
12
  class Citation(BaseModel):
 
36
  content: str = Field(min_length=1, description="The actual text content")
37
  citation: Citation
38
  relevance: float = Field(default=0.0, ge=0.0, le=1.0, description="Relevance score 0-1")
39
+ metadata: dict[str, Any] = Field(
40
+ default_factory=dict,
41
+ description="Additional metadata (e.g., cited_by_count, concepts, is_open_access)",
42
+ )
43
 
44
  model_config = {"frozen": True}
45
 
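A minimal sketch of the new `metadata` field together with the `"openalex"` source value. The `Citation` kwargs below are assumptions about fields not shown in this hunk; verify against the full `models.py`.

```python
from src.utils.models import Citation, Evidence

evidence = Evidence(
    content="Metformin shows antitumor activity in preclinical models.",
    citation=Citation(  # field names assumed for illustration only
        title="Example OpenAlex record",
        url="https://openalex.org/W0000000000",
        source="openalex",
    ),
    relevance=0.8,
    metadata={"cited_by_count": 42, "is_open_access": True},
)
```
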
tests/unit/tools/test_rate_limiting.py ADDED
@@ -0,0 +1,104 @@
1
+ """Tests for rate limiting functionality."""
2
+
3
+ import asyncio
4
+ import time
5
+
6
+ import pytest
7
+
8
+ from src.tools.rate_limiter import RateLimiter, get_pubmed_limiter, reset_pubmed_limiter
9
+
10
+
11
+ class TestRateLimiter:
12
+ """Test suite for rate limiter."""
13
+
14
+ def test_create_limiter_without_api_key(self) -> None:
15
+ """Should create 3/sec limiter without API key."""
16
+ limiter = RateLimiter(rate="3/second")
17
+ assert limiter.rate == "3/second"
18
+
19
+ def test_create_limiter_with_api_key(self) -> None:
20
+ """Should create 10/sec limiter with API key."""
21
+ limiter = RateLimiter(rate="10/second")
22
+ assert limiter.rate == "10/second"
23
+
24
+ @pytest.mark.asyncio
25
+ async def test_limiter_allows_requests_under_limit(self) -> None:
26
+ """Should allow requests under the rate limit."""
27
+ limiter = RateLimiter(rate="10/second")
28
+
29
+ # 3 requests should all succeed immediately
30
+ for _ in range(3):
31
+ allowed = await limiter.acquire()
32
+ assert allowed is True
33
+
34
+ @pytest.mark.asyncio
35
+ async def test_limiter_blocks_when_exceeded(self) -> None:
36
+ """Should wait when rate limit exceeded."""
37
+ limiter = RateLimiter(rate="2/second")
38
+
39
+ # First 2 should be instant
40
+ await limiter.acquire()
41
+ await limiter.acquire()
42
+
43
+ # Third should block briefly
44
+ start = time.monotonic()
45
+ await limiter.acquire()
46
+ elapsed = time.monotonic() - start
47
+
48
+ # Should have waited ~0.5 seconds (half second window for 2/sec)
49
+ assert elapsed >= 0.3
50
+
51
+ @pytest.mark.asyncio
52
+ async def test_limiter_resets_after_window(self) -> None:
53
+ """Rate limit should reset after time window."""
54
+ limiter = RateLimiter(rate="5/second")
55
+
56
+ # Use up the limit
57
+ for _ in range(5):
58
+ await limiter.acquire()
59
+
60
+ # Wait for window to pass
61
+ await asyncio.sleep(1.1)
62
+
63
+ # Should be allowed again
64
+ start = time.monotonic()
65
+ await limiter.acquire()
66
+ elapsed = time.monotonic() - start
67
+
68
+ assert elapsed < 0.1 # Should be nearly instant
69
+
70
+
71
+ class TestGetPubmedLimiter:
72
+ """Test PubMed-specific limiter factory."""
73
+
74
+ @pytest.fixture(autouse=True)
75
+ def setup_teardown(self):
76
+ """Reset limiter before and after each test."""
77
+ reset_pubmed_limiter()
78
+ yield
79
+ reset_pubmed_limiter()
80
+
81
+ def test_limiter_without_api_key(self) -> None:
82
+ """Should return 3/sec limiter without key."""
83
+ limiter = get_pubmed_limiter(api_key=None)
84
+ assert "3" in limiter.rate
85
+
86
+ def test_limiter_with_api_key(self) -> None:
87
+ """Should return 10/sec limiter with key."""
88
+ limiter = get_pubmed_limiter(api_key="my-api-key")
89
+ assert "10" in limiter.rate
90
+
91
+ def test_limiter_is_singleton(self) -> None:
92
+ """Same API key should return same limiter instance."""
93
+ limiter1 = get_pubmed_limiter(api_key="key1")
94
+ limiter2 = get_pubmed_limiter(api_key="key1")
95
+ assert limiter1 is limiter2
96
+
97
+ def test_different_keys_share_limiter(self) -> None:
98
+ """Different API keys still share one limiter instance."""
99
+ limiter1 = get_pubmed_limiter(api_key="key1")
100
+ limiter2 = get_pubmed_limiter(api_key="key2")
101
+ # The limiter is a process-wide singleton keyed to the API, not the
102
+ # credential: every request counts against the same NCBI-wide limit.
103
+ # (This mirrors the caching behavior of get_pubmed_limiter above.)
104
+ assert limiter1 is limiter2 # Shared NCBI rate limit
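To exercise just this suite, something like `uv run pytest tests/unit/tools/test_rate_limiting.py -v` should work, assuming pytest-asyncio is configured for the `@pytest.mark.asyncio` markers used above.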
uv.lock CHANGED
@@ -1066,6 +1066,7 @@ dependencies = [
1066
  { name = "gradio", extra = ["mcp"] },
1067
  { name = "httpx" },
1068
  { name = "huggingface-hub" },
 
1069
  { name = "openai" },
1070
  { name = "pydantic" },
1071
  { name = "pydantic-ai" },
@@ -1116,6 +1117,7 @@ requires-dist = [
1116
  { name = "gradio", extras = ["mcp"], specifier = ">=6.0.0" },
1117
  { name = "httpx", specifier = ">=0.27" },
1118
  { name = "huggingface-hub", specifier = ">=0.20.0" },
 
1119
  { name = "llama-index", marker = "extra == 'modal'", specifier = ">=0.11.0" },
1120
  { name = "llama-index-embeddings-openai", marker = "extra == 'modal'" },
1121
  { name = "llama-index-llms-openai", marker = "extra == 'modal'" },
@@ -2259,6 +2261,20 @@ wheels = [
2259
  { url = "https://files.pythonhosted.org/packages/ca/ec/65f7d563aa4a62dd58777e8f6aa882f15db53b14eb29aba0c28a20f7eb26/kubernetes-34.1.0-py2.py3-none-any.whl", hash = "sha256:bffba2272534e224e6a7a74d582deb0b545b7c9879d2cd9e4aae9481d1f2cc2a", size = 2008380 },
2260
  ]
2261
 
2262
  [[package]]
2263
  name = "llama-cloud"
2264
  version = "0.1.35"
 
1066
  { name = "gradio", extra = ["mcp"] },
1067
  { name = "httpx" },
1068
  { name = "huggingface-hub" },
1069
+ { name = "limits" },
1070
  { name = "openai" },
1071
  { name = "pydantic" },
1072
  { name = "pydantic-ai" },
 
1117
  { name = "gradio", extras = ["mcp"], specifier = ">=6.0.0" },
1118
  { name = "httpx", specifier = ">=0.27" },
1119
  { name = "huggingface-hub", specifier = ">=0.20.0" },
1120
+ { name = "limits", specifier = ">=3.0" },
1121
  { name = "llama-index", marker = "extra == 'modal'", specifier = ">=0.11.0" },
1122
  { name = "llama-index-embeddings-openai", marker = "extra == 'modal'" },
1123
  { name = "llama-index-llms-openai", marker = "extra == 'modal'" },
 
2261
  { url = "https://files.pythonhosted.org/packages/ca/ec/65f7d563aa4a62dd58777e8f6aa882f15db53b14eb29aba0c28a20f7eb26/kubernetes-34.1.0-py2.py3-none-any.whl", hash = "sha256:bffba2272534e224e6a7a74d582deb0b545b7c9879d2cd9e4aae9481d1f2cc2a", size = 2008380 },
2262
  ]
2263
 
2264
+ [[package]]
2265
+ name = "limits"
2266
+ version = "5.6.0"
2267
+ source = { registry = "https://pypi.org/simple" }
2268
+ dependencies = [
2269
+ { name = "deprecated" },
2270
+ { name = "packaging" },
2271
+ { name = "typing-extensions" },
2272
+ ]
2273
+ sdist = { url = "https://files.pythonhosted.org/packages/bb/e5/c968d43a65128cd54fb685f257aafb90cd5e4e1c67d084a58f0e4cbed557/limits-5.6.0.tar.gz", hash = "sha256:807fac75755e73912e894fdd61e2838de574c5721876a19f7ab454ae1fffb4b5", size = 182984 }
2274
+ wheels = [
2275
+ { url = "https://files.pythonhosted.org/packages/40/96/4fcd44aed47b8fcc457653b12915fcad192cd646510ef3f29fd216f4b0ab/limits-5.6.0-py3-none-any.whl", hash = "sha256:b585c2104274528536a5b68864ec3835602b3c4a802cd6aa0b07419798394021", size = 60604 },
2276
+ ]
2277
+
2278
  [[package]]
2279
  name = "llama-cloud"
2280
  version = "0.1.35"