# PubMed Tool: Current State & Future Improvements

**Status**: Currently Implemented
**Priority**: High (Core Data Source)

---

## Current Implementation

### What We Have (`src/tools/pubmed.py`)

- Basic E-utilities search via `esearch.fcgi` and `efetch.fcgi`
- Query preprocessing (strips question words, expands synonyms)
- Returns: title, abstract, authors, journal, PMID
- Rate limiting: None implemented (relying on NCBI defaults)

### Current Limitations

1. **No Full-Text Access**: Only retrieves abstracts, not full paper text
2. **No Rate Limiting**: Risk of being blocked by NCBI
3. **No BioC Format**: Missing structured full-text extraction
4. **No Figure Retrieval**: No supplementary materials access
5. **No PMC Integration**: Missing open-access full-text via PMC

---

## Reference Implementation (DeepCritical Reference Repo)

The reference repo at `reference_repos/DeepCritical/DeepResearch/src/tools/bioinformatics_tools.py` has a more sophisticated implementation:

### Features We're Missing

```python
# Rate limiting (lines 47-50)
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)
rate_limit = parse("3/second")  # NCBI allows 3/sec without API key, 10/sec with

# Full-text via BioC format (lines 108-120)
def _get_fulltext(pmid: int) -> dict[str, Any] | None:
    pmid_url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Returns structured JSON with full text for open-access papers

# Figure retrieval via Europe PMC (lines 123-149)
def _get_figures(pmcid: str) -> dict[str, str]:
    suppl_url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles"
    # Returns base64-encoded images from supplementary materials
```

---

## Recommended Improvements

### Phase 1: Rate Limiting (Critical)

```python
# Add to src/tools/pubmed.py
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import MovingWindowRateLimiter

storage = MemoryStorage()
limiter = MovingWindowRateLimiter(storage)

# With NCBI_API_KEY: 10/sec, without: 3/sec
def get_rate_limit():
    if settings.ncbi_api_key:
        return parse("10/second")
    return parse("3/second")
```

**Dependencies**: `pip install limits`

### Phase 2: Full-Text Retrieval

```python
async def get_fulltext(pmid: str) -> str | None:
    """Get full text for open-access papers via BioC API."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
    # Only works for PMC papers (open access)
```

### Phase 3: PMC ID Resolution

```python
async def get_pmc_id(pmid: str) -> str | None:
    """Convert PMID to PMCID for full-text access."""
    url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids={pmid}&format=json"
```

---

## Python Libraries to Consider

| Library | Purpose | Notes |
|---------|---------|-------|
| [Biopython](https://biopython.org/) | `Bio.Entrez` module | Official, well-maintained |
| [PyMed](https://pypi.org/project/pymed/) | PubMed wrapper | Simpler API, less control |
| [metapub](https://pypi.org/project/metapub/) | Full-featured | Tested on 1/3 of PubMed |
| [limits](https://pypi.org/project/limits/) | Rate limiting | Used by reference repo |

---

## API Endpoints Reference

| Endpoint | Purpose | Rate Limit |
|----------|---------|------------|
| `esearch.fcgi` | Search for PMIDs | 3/sec (10 with key) |
| `efetch.fcgi` | Fetch metadata | 3/sec (10 with key) |
| `esummary.fcgi` | Quick metadata | 3/sec (10 with key) |
| `pmcoa.cgi/BioC_json` | Full text (PMC only) | Unknown |
| `idconv/v1.0` | PMID ↔ PMCID | Unknown |

---

## Sources

- [PubMed E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25501/)
- [NCBI BioC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/)
- [Searching PubMed with Python](https://marcobonzanini.com/2015/01/12/searching-pubmed-with-python/)
- [PyMed on PyPI](https://pypi.org/project/pymed/)