# Critical Analysis: Search Tools - Limitations, Gaps, and Improvements **Date**: November 2025 **Purpose**: Honest assessment of all search tools to identify what's working, what's broken, and what needs improvement WITHOUT horizontal sprawl. --- ## Executive Summary DeepBoner currently has **4 search tools**: 1. PubMed (NCBI E-utilities) 2. ClinicalTrials.gov (API v2) 3. Europe PMC (includes preprints) 4. OpenAlex (citation-aware) **Overall Assessment**: Tools are functional but have significant gaps in: - Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap) - Full-text retrieval (only abstracts currently) - Citation graph traversal (OpenAlex has data but we don't use it) - Query optimization (basic synonym expansion, no MeSH term mapping) --- ## Tool 1: PubMed (NCBI E-utilities) **File**: `src/tools/pubmed.py` ### What It Does Well | Feature | Status | Notes | |---------|--------|-------| | Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) | | Retry logic | ✅ | tenacity with exponential backoff | | Query preprocessing | ✅ | Strips question words, expands synonyms | | Abstract parsing | ✅ | Handles XML edge cases (dict vs list) | ### Limitations (API-Level) | Limitation | Severity | Workaround Possible? | |------------|----------|---------------------| | **10,000 result cap per query** | Medium | Yes - use date ranges to paginate | | **Abstracts only** (no full text) | High | No - full text requires PMC or publisher | | **No citation counts** | Medium | Yes - cross-reference with OpenAlex | | **Rate limit (10/sec max)** | Low | Already handled | ### Current Implementation Gaps ```python # GAP 1: No MeSH term expansion # Current: expand_synonyms() uses hardcoded dict # Better: Use NCBI's E-utilities to get MeSH terms for query # GAP 2: No date filtering # Current: Gets whatever PubMed returns (biased toward recent) # Better: Add date range parameter for historical research # GAP 3: No publication type filtering # Current: Returns all types (reviews, case reports, RCTs) # Better: Filter for RCTs and systematic reviews when appropriate ``` ### Priority Improvements 1. **HIGH**: Add publication type filter (Reviews, RCTs, Meta-analyses) 2. **MEDIUM**: Add date range parameter 3. **LOW**: MeSH term expansion via E-utilities --- ## Tool 2: ClinicalTrials.gov **File**: `src/tools/clinicaltrials.py` ### What It Does Well | Feature | Status | Notes | |---------|--------|-------| | API v2 usage | ✅ | Modern API, not deprecated v1 | | Interventional filter | ✅ | Only gets drug/treatment studies | | Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING | | httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block | ### Limitations (API-Level) | Limitation | Severity | Workaround Possible? | |------------|----------|---------------------| | **No results data** | High | Yes - available via different endpoint | | **No outcome measures** | High | Yes - add to FIELDS list | | **No adverse events** | Medium | Yes - separate API call | | **Sparse drug mechanism data** | Medium | No - not in API | ### Current Implementation Gaps ```python # GAP 1: Missing critical fields FIELDS: ClassVar[list[str]] = [ "NCTId", "BriefTitle", "Phase", "OverallStatus", "Condition", "InterventionName", "StartDate", "BriefSummary", # MISSING: # "PrimaryOutcome", # "SecondaryOutcome", # "ResultsFirstSubmitDate", # "StudyResults", # Whether results are posted ] # GAP 2: No results retrieval # Many completed trials have posted results # We could get actual efficacy data, not just trial existence # GAP 3: No linked publications # Trials often link to PubMed articles with results # We could follow these links for richer evidence ``` ### Priority Improvements 1. **HIGH**: Add outcome measures to FIELDS 2. **HIGH**: Check for and retrieve posted results 3. **MEDIUM**: Follow linked publications (NCT → PMID) --- ## Tool 3: Europe PMC **File**: `src/tools/europepmc.py` ### What It Does Well | Feature | Status | Notes | |---------|--------|-------| | Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed | | Preprint labeling | ✅ | `[PREPRINT - Not peer-reviewed]` marker | | DOI/PMID fallback URLs | ✅ | Smart URL construction | | Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) | ### Limitations (API-Level) | Limitation | Severity | Workaround Possible? | |------------|----------|---------------------| | **No full text for most articles** | High | Partial - CC-licensed available after 14 days | | **Citation data limited** | Medium | Only journal articles, not preprints | | **Preprint-publication linking gaps** | Medium | ~50% of links missing per Crossref | | **License info sometimes missing** | Low | Manual review required | ### Current Implementation Gaps ```python # GAP 1: No full-text retrieval # Europe PMC has full text for many CC-licensed articles # Could retrieve full text XML via separate endpoint # GAP 2: Massive overlap with PubMed # Europe PMC indexes all of PubMed/MEDLINE # We're getting duplicates with no deduplication # GAP 3: No citation network # Europe PMC has "citedByCount" but we don't use it # Could prioritize highly-cited preprints ``` ### Priority Improvements 1. **HIGH**: Add deduplication with PubMed (by PMID) 2. **MEDIUM**: Retrieve citation counts for ranking 3. **LOW**: Full-text retrieval for CC-licensed articles --- ## Tool 4: OpenAlex **File**: `src/tools/openalex.py` ### What It Does Well | Feature | Status | Notes | |---------|--------|-------| | Citation counts | ✅ | Sorted by `cited_by_count:desc` | | Abstract reconstruction | ✅ | Handles inverted index format | | Concept extraction | ✅ | Hierarchical classification | | Open access detection | ✅ | `is_oa` and `pdf_url` | | Polite pool | ✅ | mailto for 100k/day limit | | Rich metadata | ✅ | Best metadata of all tools | ### Limitations (API-Level) | Limitation | Severity | Workaround Possible? | |------------|----------|---------------------| | **Author truncation at 100** | Low | Only affects mega-author papers | | **No full text** | High | No - OpenAlex is metadata only | | **Stale data (1-2 day lag)** | Low | Acceptable for research | ### Current Implementation Gaps ```python # GAP 1: No citation graph traversal # OpenAlex has `cited_by` and `references` endpoints # We could find seminal papers by following citation chains # GAP 2: No related works # OpenAlex has ML-powered "related_works" field # Could expand search to similar papers # GAP 3: No concept filtering # OpenAlex has hierarchical concepts # Could filter for specific domains (e.g., "Sexual health" concept) # GAP 4: Overlap with PubMed # OpenAlex indexes most of PubMed # More duplicates without deduplication ``` ### Priority Improvements 1. **HIGH**: Add citation graph traversal (find seminal papers) 2. **HIGH**: Add deduplication with PubMed/Europe PMC 3. **MEDIUM**: Use `related_works` for query expansion 4. **LOW**: Concept-based filtering --- ## Cross-Tool Issues ### Issue 1: MASSIVE DUPLICATION ``` PubMed: 36M+ articles Europe PMC: Indexes ALL of PubMed + preprints OpenAlex: 250M+ works (includes PubMed) Current behavior: All 3 return the same papers Result: Duplicate evidence, wasted tokens, inflated counts ``` **Solution**: Deduplication by PMID/DOI ```python # Proposed: Add to SearchHandler def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]: seen_ids: set[str] = set() unique: list[Evidence] = [] for e in evidence_list: # Extract PMID or DOI from URL paper_id = extract_paper_id(e.citation.url) if paper_id not in seen_ids: seen_ids.add(paper_id) unique.append(e) return unique ``` ### Issue 2: NO FULL-TEXT RETRIEVAL All tools return **abstracts only**. For deep research, this is limiting. **What's Actually Possible**: | Source | Full Text Access | How | |--------|------------------|-----| | PubMed Central (PMC) | Yes, for OA articles | Separate API: `efetch` with `db=pmc` | | Europe PMC | Yes, CC-licensed after 14 days | `/fullTextXML/{id}` endpoint | | OpenAlex | No | Metadata only | | Unpaywall | Yes, OA link discovery | Separate API | **Recommendation**: Add PMC full-text retrieval for open access articles. ### Issue 3: NO CITATION GRAPH OpenAlex has rich citation data but we only use `cited_by_count` for sorting. **Untapped Capabilities**: - `cited_by`: Find papers that cite a key paper - `references`: Find sources a paper cites - `related_works`: ML-powered similar papers **Use Case**: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find: - Papers that cite it (newer evidence) - Papers it cites (foundational research) - Related papers (similar topics) --- ## What's NOT Possible (API Constraints) | Feature | Why Not Possible | |---------|------------------| | **bioRxiv direct search** | No keyword search API, only RSS feed of latest | | **arXiv search** | API exists but irrelevant for sexual health | | **PubMed full text** | Requires publisher access or PMC | | **Real-time trial results** | ClinicalTrials.gov results are static snapshots | | **Drug mechanism data** | Not in any API - would need ChEMBL or DrugBank | --- ## Recommended Improvements (Priority Order) ### Phase 1: Fix Fundamentals (High ROI) 1. **Deduplication** - Stop returning the same paper 3 times 2. **Outcome measures in ClinicalTrials** - Get actual efficacy data 3. **Citation counts from all sources** - Rank by influence, not recency ### Phase 2: Depth Improvements (Medium ROI) 4. **PMC full-text retrieval** - Get full papers for OA articles 5. **Citation graph traversal** - Find seminal papers automatically 6. **Publication type filtering** - Prioritize RCTs and meta-analyses ### Phase 3: Quality Improvements (Lower ROI, Nice-to-Have) 7. **MeSH term expansion** - Better PubMed queries 8. **Related works expansion** - Use OpenAlex ML similarity 9. **Date range filtering** - Historical vs recent research --- ## Neo4j Integration (Future Consideration) **Question**: Should we add Neo4j for citation graph storage? **Answer**: Not yet. Here's why: | Approach | Complexity | Value | |----------|------------|-------| | OpenAlex API for citation traversal | Low | High | | Neo4j for local citation graph | High | Medium (unless doing graph analytics) | | Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access | **Recommendation**: Use OpenAlex API for citation traversal first. Only add Neo4j if: 1. We need to do complex graph queries (PageRank on citations, community detection) 2. We need offline access to citation data 3. We're hitting OpenAlex rate limits --- ## Summary: What's Broken vs What's Working ### Working Well - Basic search across all 4 sources - Rate limiting and retry logic - Query preprocessing - Evidence model with citations ### Needs Fixing (Current Scope) - Deduplication (critical) - Outcome measures in ClinicalTrials (critical) - Citation-based ranking (important) ### Future Enhancements (Out of Current Scope) - Full-text retrieval - Citation graph traversal - Neo4j integration - Drug mechanism data (would need new data sources) --- ## Sources - [NCBI E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/) - [NCBI Rate Limits](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/) - [OpenAlex API Docs](https://docs.openalex.org/) - [OpenAlex Limitations](https://docs.openalex.org/api-entities/authors/limitations) - [Europe PMC RESTful API](https://europepmc.org/RestfulWebService) - [Europe PMC Preprints](https://pmc.ncbi.nlm.nih.gov/articles/PMC11426508/) - [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api)