
SERPER Web Search Implementation Plan

Executive Summary

This plan details the implementation of Serper-based web search by vendoring the search logic from folder/tools/web_search.py into src/tools/, creating a protocol-compliant SerperWebSearchTool, fixing the existing WebSearchTool (DuckDuckGo), and integrating both into the main search flow.

Project Structure

Project 1: Vendor and Refactor Core Web Search Components

Goal: Extract and vendor Serper/SearchXNG search logic from folder/tools/web_search.py into src/tools/

Project 2: Create Protocol-Compliant SerperWebSearchTool

Goal: Implement SerperWebSearchTool class that fully complies with SearchTool protocol

Project 3: Fix Existing WebSearchTool Protocol Compliance

Goal: Make existing WebSearchTool (DuckDuckGo) protocol-compliant

Project 4: Integrate Web Search into SearchHandler

Goal: Add web search tools to main search flow in src/app.py

Project 5: Update Callers and Dependencies

Goal: Update all code that uses web search to work with new implementation

Project 6: Testing and Validation

Goal: Add comprehensive tests for all web search implementations


Detailed Implementation Plan

PROJECT 1: Vendor and Refactor Core Web Search Components

Activity 1.1: Create Vendor Module Structure

File: src/tools/vendored/__init__.py

  • Task 1.1.1: Create src/tools/vendored/ directory
  • Task 1.1.2: Create __init__.py with exports

File: src/tools/vendored/web_search_core.py

  • Task 1.1.3: Vendor ScrapeResult, WebpageSnippet, SearchResults models from folder/tools/web_search.py (lines 23-37)
  • Task 1.1.4: Vendor scrape_urls() function (lines 274-299)
  • Task 1.1.5: Vendor fetch_and_process_url() function (lines 302-348)
  • Task 1.1.6: Vendor html_to_text() function (lines 351-368)
  • Task 1.1.7: Vendor is_valid_url() function (lines 371-410)
  • Task 1.1.8: Vendor ssl_context setup (lines 115-120)
  • Task 1.1.9: Add imports: aiohttp, asyncio, BeautifulSoup, ssl
  • Task 1.1.10: Add CONTENT_LENGTH_LIMIT = 10000 constant
  • Task 1.1.11: Add type hints following project standards
  • Task 1.1.12: Add structlog logging
  • Task 1.1.13: Replace print() statements with logger calls
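
A minimal sketch of the vendored module after Tasks 1.1.3–1.1.13; the field names are assumptions based on typical scraping code, not confirmed against folder/tools/web_search.py:

```python
# src/tools/vendored/web_search_core.py -- sketch; field names are assumptions
import ssl

import structlog
from pydantic import BaseModel

logger = structlog.get_logger()

CONTENT_LENGTH_LIMIT = 10000  # truncate scraped page text (Task 1.1.10)


class WebpageSnippet(BaseModel):
    """A single search hit, before the page is scraped."""
    url: str
    title: str
    description: str | None = None


class ScrapeResult(BaseModel):
    """Full page content after scraping a WebpageSnippet."""
    url: str
    title: str
    text: str


async def scrape_urls(items: list[WebpageSnippet]) -> list[ScrapeResult]:
    """Fetch and extract text for each snippet (Task 1.1.4; body vendored)."""
    ...


# Relaxed SSL context for scraping arbitrary sites (Task 1.1.8);
# check_hostname must be disabled before verify_mode is set to CERT_NONE.
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
```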

File: src/tools/vendored/serper_client.py

  • Task 1.1.14: Vendor SerperClient class from folder/tools/web_search.py (lines 123-196)
  • Task 1.1.15: Remove dependency on ResearchAgent and ResearchRunner
  • Task 1.1.16: Replace filter agent with simple relevance filtering or remove it
  • Task 1.1.17: Add __init__ that takes api_key: str | None parameter
  • Task 1.1.18: Update search() method to return list[WebpageSnippet] without filtering
  • Task 1.1.19: Remove _filter_results() method (or make it optional)
  • Task 1.1.20: Add error handling with SearchError and RateLimitError
  • Task 1.1.21: Add structlog logging
  • Task 1.1.22: Add type hints
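
A sketch of the refactored client per Tasks 1.1.14–1.1.22. The request/response shape follows the public Serper API (POST to https://google.serper.dev/search, "organic" results); treat the result-field names as assumptions to verify against the vendored code:

```python
# src/tools/vendored/serper_client.py -- sketch
import aiohttp
import structlog

from src.tools.vendored.web_search_core import WebpageSnippet
from src.utils.exceptions import RateLimitError, SearchError

logger = structlog.get_logger()

SERPER_URL = "https://google.serper.dev/search"


class SerperClient:
    def __init__(self, api_key: str | None = None) -> None:
        self.api_key = api_key
        self._headers = {"X-API-KEY": api_key or "", "Content-Type": "application/json"}

    async def search(
        self, query: str, filter_for_relevance: bool = False, max_results: int = 10
    ) -> list[WebpageSnippet]:
        """Return raw organic results; the LLM filter agent is removed (Task 1.1.19)."""
        payload = {"q": query, "num": max_results}
        async with aiohttp.ClientSession() as session:
            async with session.post(SERPER_URL, headers=self._headers, json=payload) as resp:
                if resp.status == 429:
                    raise RateLimitError("Serper rate limit exceeded")
                if resp.status != 200:
                    raise SearchError(f"Serper returned HTTP {resp.status}")
                data = await resp.json()
        return [
            WebpageSnippet(url=r["link"], title=r["title"], description=r.get("snippet"))
            for r in data.get("organic", [])[:max_results]
        ]
```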

File: src/tools/vendored/searchxng_client.py

  • Task 1.1.23: Vendor SearchXNGClient class from folder/tools/web_search.py (lines 199-271)
  • Task 1.1.24: Remove dependency on ResearchAgent and ResearchRunner
  • Task 1.1.25: Replace filter agent with simple relevance filtering or remove it
  • Task 1.1.26: Add __init__ that takes host: str parameter
  • Task 1.1.27: Update search() method to return list[WebpageSnippet] without filtering
  • Task 1.1.28: Remove _filter_results() method (or make it optional)
  • Task 1.1.29: Add error handling with SearchError and RateLimitError
  • Task 1.1.30: Add structlog logging
  • Task 1.1.31: Add type hints

Activity 1.2: Create Rate Limiting for Web Search

File: src/tools/rate_limiter.py

  • Task 1.2.1: Add get_serper_limiter() function (rate: "10/second", one limiter per API key)
  • Task 1.2.2: Add get_searchxng_limiter() function (rate: "5/second")
  • Task 1.2.3: Use RateLimiterFactory.get() pattern
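
A sketch of the two helpers, assuming RateLimiterFactory.get(key, rate=...) matches the existing PubMed pattern (the factory's exact signature is not confirmed here):

```python
# src/tools/rate_limiter.py additions -- sketch; factory signature assumed
def get_serper_limiter(api_key: str):
    # One limiter per API key, since Serper quotas are per key (Task 1.2.1).
    return RateLimiterFactory.get(f"serper:{api_key}", rate="10/second")


def get_searchxng_limiter():
    # SearchXNG is self-hosted, so a single shared limiter suffices (Task 1.2.2).
    return RateLimiterFactory.get("searchxng", rate="5/second")
```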

PROJECT 2: Create Protocol-Compliant SerperWebSearchTool

Activity 2.1: Implement SerperWebSearchTool Class

File: src/tools/serper_web_search.py

  • Task 2.1.1: Create new file src/tools/serper_web_search.py

  • Task 2.1.2: Add imports:

    • from src.tools.base import SearchTool
    • from src.tools.vendored.serper_client import SerperClient
    • from src.tools.vendored.web_search_core import scrape_urls, WebpageSnippet
    • from src.tools.rate_limiter import get_serper_limiter
    • from src.tools.query_utils import preprocess_query
    • from src.utils.config import settings
    • from src.utils.exceptions import SearchError, RateLimitError, ConfigurationError
    • from src.utils.models import Citation, Evidence
    • import structlog
    • from tenacity import retry, stop_after_attempt, wait_exponential
  • Task 2.1.3: Create SerperWebSearchTool class

  • Task 2.1.4: Add __init__(self, api_key: str | None = None) method

    • Line 2.1.4.1: Get API key from parameter or settings.serper_api_key
    • Line 2.1.4.2: Validate API key is not None, raise ConfigurationError if missing
    • Line 2.1.4.3: Initialize SerperClient(api_key=self.api_key)
    • Line 2.1.4.4: Get rate limiter: self._limiter = get_serper_limiter(self.api_key)
  • Task 2.1.5: Add @property def name(self) -> str: returning "serper"

  • Task 2.1.6: Add async def _rate_limit(self) -> None: method

    • Line 2.1.6.1: Call await self._limiter.acquire()
  • Task 2.1.7: Add @retry(...) decorator with exponential backoff

  • Task 2.1.8: Add async def search(self, query: str, max_results: int = 10) -> list[Evidence]: method

    • Line 2.1.8.1: Call await self._rate_limit()
    • Line 2.1.8.2: Preprocess query: clean_query = preprocess_query(query)
    • Line 2.1.8.3: Fall back to the raw query if preprocessing strips everything: clean_query if clean_query else query
    • Line 2.1.8.4: Call search_results = await self._client.search(clean_query or query, filter_for_relevance=False, max_results=max_results)
    • Line 2.1.8.5: Call scraped = await scrape_urls(search_results)
    • Line 2.1.8.6: Convert ScrapeResult to Evidence objects:
      • Line 2.1.8.6.1: Create Citation with title, url, source="serper", date="Unknown", authors=[]
      • Line 2.1.8.6.2: Create Evidence with content=scraped.text, citation, relevance=0.0
    • Line 2.1.8.7: Return list[Evidence]
    • Line 2.1.8.8: Add try/except for aiohttp.ClientResponseError (the vendored client uses aiohttp, per Task 1.1.9):
      • Line 2.1.8.8.1: Check for a 429 status, raise RateLimitError
      • Line 2.1.8.8.2: Otherwise raise SearchError
    • Line 2.1.8.9: Add try/except for asyncio.TimeoutError, raise SearchError
    • Line 2.1.8.10: Add generic exception handler, log and raise SearchError
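
Putting Tasks 2.1.2–2.1.8 together, a minimal sketch; the retry/exception wiring shown is one reasonable arrangement, not the definitive implementation:

```python
# src/tools/serper_web_search.py -- sketch
import structlog
from tenacity import retry, stop_after_attempt, wait_exponential

from src.tools.query_utils import preprocess_query
from src.tools.rate_limiter import get_serper_limiter
from src.tools.vendored.serper_client import SerperClient
from src.tools.vendored.web_search_core import scrape_urls
from src.utils.config import settings
from src.utils.exceptions import ConfigurationError, RateLimitError, SearchError
from src.utils.models import Citation, Evidence

logger = structlog.get_logger()


class SerperWebSearchTool:
    def __init__(self, api_key: str | None = None) -> None:
        self.api_key = api_key or settings.serper_api_key
        if not self.api_key:
            raise ConfigurationError("SERPER_API_KEY is not configured")
        self._client = SerperClient(api_key=self.api_key)
        self._limiter = get_serper_limiter(self.api_key)

    @property
    def name(self) -> str:
        return "serper"

    async def _rate_limit(self) -> None:
        await self._limiter.acquire()

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        await self._rate_limit()
        clean_query = preprocess_query(query)
        try:
            results = await self._client.search(
                clean_query or query, filter_for_relevance=False, max_results=max_results
            )
            scraped = await scrape_urls(results)
        except (RateLimitError, SearchError):
            raise
        except Exception as exc:  # Lines 2.1.8.8-2.1.8.10: wrap transport errors
            logger.error("serper_search_failed", error=str(exc))
            raise SearchError(f"Serper search failed: {exc}") from exc
        return [
            Evidence(
                content=page.text,
                citation=Citation(
                    title=page.title, url=page.url, source="serper",
                    date="Unknown", authors=[],
                ),
                relevance=0.0,
            )
            for page in scraped
        ]
```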

Activity 2.2: Implement SearchXNGWebSearchTool Class

File: src/tools/searchxng_web_search.py

  • Task 2.2.1: Create new file src/tools/searchxng_web_search.py

  • Task 2.2.2: Add imports (similar to SerperWebSearchTool)

  • Task 2.2.3: Create SearchXNGWebSearchTool class

  • Task 2.2.4: Add __init__(self, host: str | None = None) method

    • Line 2.2.4.1: Get host from parameter or settings.searchxng_host
    • Line 2.2.4.2: Validate host is not None, raise ConfigurationError if missing
    • Line 2.2.4.3: Initialize SearchXNGClient(host=self.host)
    • Line 2.2.4.4: Get rate limiter: self._limiter = get_searchxng_limiter()
  • Task 2.2.5: Add @property def name(self) -> str: returning "searchxng"

  • Task 2.2.6: Add async def _rate_limit(self) -> None: method

  • Task 2.2.7: Add @retry(...) decorator

  • Task 2.2.8: Add async def search(self, query: str, max_results: int = 10) -> list[Evidence]: method

    • Lines 2.2.8.1-2.2.8.10: Same structure as SerperWebSearchTool.search(), substituting the SearchXNG client and limiter

PROJECT 3: Fix Existing WebSearchTool Protocol Compliance

Activity 3.1: Update WebSearchTool Class

File: src/tools/web_search.py

  • Task 3.1.1: Add @property def name(self) -> str: method returning "duckduckgo" (after line 17)

  • Task 3.1.2: Change search() return type from SearchResult to list[Evidence] (line 19)

  • Task 3.1.3: Update search() method body:

    • Line 3.1.3.1: Keep existing search logic (lines 21-43)
    • Line 3.1.3.2: Instead of returning SearchResult, return evidence list directly (line 44)
    • Line 3.1.3.3: Update exception handler to return empty list [] instead of SearchResult (line 51)
  • Task 3.1.4: Add imports if needed:

    • Line 3.1.4.1: from src.utils.exceptions import SearchError
    • Line 3.1.4.2: Update exception handling to raise SearchError instead of returning error SearchResult
  • Task 3.1.5: Add query preprocessing:

    • Line 3.1.5.1: Import from src.tools.query_utils import preprocess_query
    • Line 3.1.5.2: Add clean_query = preprocess_query(query) before search
    • Line 3.1.5.3: Use clean_query if clean_query else query
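
A sketch of the rewritten method, assuming the duckduckgo-search library's DDGS().text() call (which returns dicts with "title", "href", and "body" keys); the existing search logic kept by Line 3.1.3.1 is compressed here:

```python
# src/tools/web_search.py -- sketch of the protocol-compliant tool
from duckduckgo_search import DDGS

from src.tools.query_utils import preprocess_query
from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class WebSearchTool:
    @property
    def name(self) -> str:
        return "duckduckgo"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        clean_query = preprocess_query(query)
        try:
            hits = DDGS().text(clean_query or query, max_results=max_results)
        except Exception as exc:  # Task 3.1.4: raise instead of returning an error result
            raise SearchError(f"DuckDuckGo search failed: {exc}") from exc
        return [
            Evidence(
                content=hit.get("body", ""),
                citation=Citation(
                    title=hit.get("title", ""), url=hit.get("href", ""),
                    source="duckduckgo", date="Unknown", authors=[],
                ),
                relevance=0.0,
            )
            for hit in hits
        ]
```

Note that DDGS().text() is synchronous; the real implementation may already wrap it (e.g. in a thread executor), and the sketch keeps the call inline only for brevity.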

Activity 3.2: Update Retrieval Agent Caller

File: src/agents/retrieval_agent.py

  • Task 3.2.1: Update search_web() function (line 31):
    • Line 3.2.1.1: Locate results = await _web_search.search(query, max_results)
    • Line 3.2.1.2: Rename to evidence = await _web_search.search(query, max_results)
    • Line 3.2.1.3: Update check: if not evidence: instead of if not results.evidence:
    • Line 3.2.1.4: Update state update: new_count = state.add_evidence(evidence) instead of results.evidence
    • Line 3.2.1.5: Update logging: results_found=len(evidence) instead of len(results.evidence)
    • Line 3.2.1.6: Update output formatting: for i, r in enumerate(evidence[:max_results], 1): instead of results.evidence[:max_results]
    • Line 3.2.1.7: Update deduplication: await state.embedding_service.deduplicate(evidence) instead of results.evidence
    • Line 3.2.1.8: Update output message: Found {len(evidence)} web results instead of len(results.evidence)
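
The updated body of search_web() might then read as follows (a sketch; the surrounding function and the exact ordering of deduplication versus state updates are paraphrased from Tasks 3.2.1.3–3.2.1.8, not copied from the real file):

```python
# src/agents/retrieval_agent.py, inside search_web() -- sketch
evidence = await _web_search.search(query, max_results)
if not evidence:
    return "No web results found."
evidence = await state.embedding_service.deduplicate(evidence)  # Line 3.2.1.7
new_count = state.add_evidence(evidence)                        # Line 3.2.1.4
logger.info("web_search_complete", results_found=len(evidence), new=new_count)
lines = [
    f"{i}. {e.citation.title} ({e.citation.url})"
    for i, e in enumerate(evidence[:max_results], 1)            # Line 3.2.1.6
]
return f"Found {len(evidence)} web results:\n" + "\n".join(lines)
```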

PROJECT 4: Integrate Web Search into SearchHandler

Activity 4.1: Create Web Search Tool Factory

File: src/tools/web_search_factory.py

  • Task 4.1.1: Create new file src/tools/web_search_factory.py

  • Task 4.1.2: Add imports:

    • from src.tools.web_search import WebSearchTool
    • from src.tools.serper_web_search import SerperWebSearchTool
    • from src.tools.searchxng_web_search import SearchXNGWebSearchTool
    • from src.utils.config import settings
    • from src.utils.exceptions import ConfigurationError
    • import structlog
  • Task 4.1.3: Add logger = structlog.get_logger()

  • Task 4.1.4: Create def create_web_search_tool() -> SearchTool | None: function

    • Line 4.1.4.1: Check settings.web_search_provider
    • Line 4.1.4.2: If "serper":
      • Line 4.1.4.2.1: Check settings.serper_api_key or settings.web_search_available()
      • Line 4.1.4.2.2: If available, return SerperWebSearchTool()
      • Line 4.1.4.2.3: Else log warning and return None
    • Line 4.1.4.3: If "searchxng":
      • Line 4.1.4.3.1: Check settings.searchxng_host or settings.web_search_available()
      • Line 4.1.4.3.2: If available, return SearchXNGWebSearchTool()
      • Line 4.1.4.3.3: Else log warning and return None
    • Line 4.1.4.4: If "duckduckgo":
      • Line 4.1.4.4.1: Return WebSearchTool() (always available)
    • Line 4.1.4.5: If "brave" or "tavily":
      • Line 4.1.4.5.1: Log warning "Not yet implemented"
      • Line 4.1.4.5.2: Return None
    • Line 4.1.4.6: Default: return WebSearchTool() (fallback to DuckDuckGo)
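
A sketch of the factory; the settings attribute names are taken from the task list above:

```python
# src/tools/web_search_factory.py -- sketch
import structlog

from src.tools.base import SearchTool
from src.tools.searchxng_web_search import SearchXNGWebSearchTool
from src.tools.serper_web_search import SerperWebSearchTool
from src.tools.web_search import WebSearchTool
from src.utils.config import settings

logger = structlog.get_logger()


def create_web_search_tool() -> SearchTool | None:
    provider = settings.web_search_provider
    if provider == "serper":
        if settings.serper_api_key:
            return SerperWebSearchTool()
        logger.warning("web_search_unavailable", provider="serper", reason="no API key")
        return None
    if provider == "searchxng":
        if settings.searchxng_host:
            return SearchXNGWebSearchTool()
        logger.warning("web_search_unavailable", provider="searchxng", reason="no host")
        return None
    if provider in ("brave", "tavily"):
        logger.warning("web_search_not_implemented", provider=provider)
        return None
    # "duckduckgo" and any unrecognized value fall back to the keyless tool
    return WebSearchTool()
```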

Activity 4.2: Update SearchHandler Initialization

File: src/app.py

  • Task 4.2.1: Add import: from src.tools.web_search_factory import create_web_search_tool

  • Task 4.2.2: Update configure_orchestrator() function (around line 73):

    • Line 4.2.2.1: Before creating SearchHandler, call web_search_tool = create_web_search_tool()
    • Line 4.2.2.2: Create tools list: tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]
    • Line 4.2.2.3: If web_search_tool is not None:
      • Line 4.2.2.3.1: Append web_search_tool to tools list
      • Line 4.2.2.3.2: Log info: "Web search tool added to search handler"
    • Line 4.2.2.4: Update SearchHandler initialization to use tools list
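
Inside configure_orchestrator(), the wiring could look like this (a sketch; SearchHandler's constructor signature and the existing tool imports are assumptions):

```python
# src/app.py, inside configure_orchestrator() -- sketch
from src.tools.web_search_factory import create_web_search_tool

tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]  # existing tools
web_search_tool = create_web_search_tool()
if web_search_tool is not None:
    tools.append(web_search_tool)
    logger.info("web_search_tool_added", tool=web_search_tool.name)
search_handler = SearchHandler(tools=tools)
```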

PROJECT 5: Update Callers and Dependencies

Activity 5.1: Update web_search_adapter

File: src/tools/web_search_adapter.py

  • Task 5.1.1: Update web_search() function to use new implementation:
    • Line 5.1.1.1: Import from src.tools.web_search_factory import create_web_search_tool
    • Line 5.1.1.2: Remove dependency on folder.tools.web_search
    • Line 5.1.1.3: Get tool: tool = create_web_search_tool()
    • Line 5.1.1.4: If tool is None, return error message
    • Line 5.1.1.5: Call evidence = await tool.search(query, max_results=5)
    • Line 5.1.1.6: Convert Evidence objects to formatted string:
      • Line 5.1.1.6.1: Format each evidence with title, URL, content preview
    • Line 5.1.1.7: Return formatted string
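
A sketch of the rewritten adapter (the 500-character content preview is an arbitrary choice):

```python
# src/tools/web_search_adapter.py -- sketch
from src.tools.web_search_factory import create_web_search_tool


async def web_search(query: str) -> str:
    tool = create_web_search_tool()
    if tool is None:
        return "Web search is not configured: no provider is available."
    evidence = await tool.search(query, max_results=5)
    if not evidence:
        return "No web results found."
    blocks = [
        f"{e.citation.title}\n{e.citation.url}\n{e.content[:500]}"
        for e in evidence
    ]
    return "\n\n".join(blocks)
```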

Activity 5.2: Update Tool Executor

File: src/tools/tool_executor.py

  • Task 5.2.1: Verify web_search_adapter.web_search() usage (line 86) still works
  • Task 5.2.2: No changes needed if adapter is updated correctly

Activity 5.3: Update Planner Agent

File: src/orchestrator/planner_agent.py

  • Task 5.3.1: Verify web_search_adapter.web_search() usage (line 14) still works
  • Task 5.3.2: No changes needed if adapter is updated correctly

Activity 5.4: Remove Legacy Dependencies

File: src/tools/web_search_adapter.py

  • Task 5.4.1: Remove import of folder.llm_config and folder.tools.web_search
  • Task 5.4.2: Update error messages to reflect new implementation

PROJECT 6: Testing and Validation

Activity 6.1: Unit Tests for Vendored Components

File: tests/unit/tools/test_vendored_web_search_core.py

  • Task 6.1.1: Test scrape_urls() function
  • Task 6.1.2: Test fetch_and_process_url() function
  • Task 6.1.3: Test html_to_text() function
  • Task 6.1.4: Test is_valid_url() function

File: tests/unit/tools/test_vendored_serper_client.py

  • Task 6.1.5: Mock SerperClient API calls
  • Task 6.1.6: Test successful search
  • Task 6.1.7: Test error handling
  • Task 6.1.8: Test rate limiting

File: tests/unit/tools/test_vendored_searchxng_client.py

  • Task 6.1.9: Mock SearchXNGClient API calls
  • Task 6.1.10: Test successful search
  • Task 6.1.11: Test error handling
  • Task 6.1.12: Test rate limiting

Activity 6.2: Unit Tests for Web Search Tools

File: tests/unit/tools/test_serper_web_search.py

  • Task 6.2.1: Test SerperWebSearchTool.__init__() with valid API key
  • Task 6.2.2: Test SerperWebSearchTool.__init__() without API key (should raise)
  • Task 6.2.3: Test name property returns "serper"
  • Task 6.2.4: Test search() returns list[Evidence]
  • Task 6.2.5: Test search() with mocked SerperClient
  • Task 6.2.6: Test error handling (SearchError, RateLimitError)
  • Task 6.2.7: Test query preprocessing
  • Task 6.2.8: Test rate limiting
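
Two representative tests as a sketch, assuming pytest-asyncio and the module layout proposed above (the patch targets would need to match the real imports):

```python
# tests/unit/tools/test_serper_web_search.py -- sketch
from unittest.mock import AsyncMock, patch

import pytest

from src.tools.serper_web_search import SerperWebSearchTool
from src.utils.exceptions import ConfigurationError
from src.utils.models import Evidence


def test_init_without_api_key_raises():
    # Task 6.2.2: no key passed and none in settings -> ConfigurationError
    with patch("src.tools.serper_web_search.settings") as settings:
        settings.serper_api_key = None
        with pytest.raises(ConfigurationError):
            SerperWebSearchTool()


@pytest.mark.asyncio
async def test_search_returns_evidence_list():
    # Tasks 6.2.3-6.2.5: mocked client, protocol-shaped return value
    tool = SerperWebSearchTool(api_key="test-key")
    tool._client.search = AsyncMock(return_value=[])
    with patch("src.tools.serper_web_search.scrape_urls", AsyncMock(return_value=[])):
        results = await tool.search("glp-1 agonists")
    assert tool.name == "serper"
    assert isinstance(results, list)
    assert all(isinstance(r, Evidence) for r in results)
```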

File: tests/unit/tools/test_searchxng_web_search.py

  • Task 6.2.9: Similar tests for SearchXNGWebSearchTool

File: tests/unit/tools/test_web_search.py

  • Task 6.2.10: Test WebSearchTool.name property returns "duckduckgo"
  • Task 6.2.11: Test WebSearchTool.search() returns list[Evidence]
  • Task 6.2.12: Test WebSearchTool.search() with mocked DDGS
  • Task 6.2.13: Test error handling
  • Task 6.2.14: Test query preprocessing

Activity 6.3: Integration Tests

File: tests/integration/test_web_search_integration.py

  • Task 6.3.1: Test SerperWebSearchTool with real API (marked @pytest.mark.integration)
  • Task 6.3.2: Test SearchXNGWebSearchTool with real API (marked @pytest.mark.integration)
  • Task 6.3.3: Test WebSearchTool with real DuckDuckGo (marked @pytest.mark.integration)
  • Task 6.3.4: Test create_web_search_tool() factory function
  • Task 6.3.5: Test SearchHandler with web search tool
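
A sketch of one live test, skipping cleanly when no key is present so CI without secrets stays green:

```python
# tests/integration/test_web_search_integration.py -- sketch
import os

import pytest

from src.tools.serper_web_search import SerperWebSearchTool


@pytest.mark.integration
@pytest.mark.asyncio
@pytest.mark.skipif(not os.environ.get("SERPER_API_KEY"), reason="requires SERPER_API_KEY")
async def test_serper_live_search():
    tool = SerperWebSearchTool()
    evidence = await tool.search("semaglutide cardiovascular outcomes", max_results=3)
    assert evidence, "expected at least one live result"
    assert evidence[0].citation.url.startswith("http")
```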

Activity 6.4: Update Existing Tests

File: tests/unit/agents/test_retrieval_agent.py

  • Task 6.4.1: Update tests to expect list[Evidence] instead of SearchResult
  • Task 6.4.2: Mock WebSearchTool.search() to return list[Evidence]

File: tests/unit/tools/test_tool_executor.py

  • Task 6.4.3: Verify tests still pass with updated web_search_adapter

Implementation Order

  1. PROJECT 1: Vendor core components (foundation)
  2. PROJECT 3: Fix existing WebSearchTool (quick win, unblocks retrieval agent)
  3. PROJECT 2: Create SerperWebSearchTool (new functionality)
  4. PROJECT 4: Integrate into SearchHandler (main integration)
  5. PROJECT 5: Update callers (cleanup dependencies)
  6. PROJECT 6: Testing (validation)

Dependencies and Prerequisites

External Dependencies

  • aiohttp - Already in requirements
  • beautifulsoup4 - Already in requirements
  • duckduckgo-search - Already in requirements
  • tenacity - Already in requirements
  • structlog - Already in requirements

Internal Dependencies

  • src/tools/base.py - SearchTool protocol
  • src/tools/rate_limiter.py - Rate limiting utilities
  • src/tools/query_utils.py - Query preprocessing
  • src/utils/config.py - Settings and configuration
  • src/utils/exceptions.py - Custom exceptions
  • src/utils/models.py - Evidence, Citation models

Configuration Requirements

  • SERPER_API_KEY - For Serper provider
  • SEARCHXNG_HOST - For SearchXNG provider
  • WEB_SEARCH_PROVIDER - Environment variable (default: "duckduckgo")
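
The corresponding settings fields might look like this (a sketch assuming pydantic-settings; the real src/utils/config.py may differ):

```python
# src/utils/config.py additions -- sketch
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    serper_api_key: str | None = None        # SERPER_API_KEY
    searchxng_host: str | None = None        # SEARCHXNG_HOST
    web_search_provider: str = "duckduckgo"  # WEB_SEARCH_PROVIDER

    def web_search_available(self) -> bool:
        """True when the selected provider has the configuration it needs."""
        if self.web_search_provider == "serper":
            return self.serper_api_key is not None
        if self.web_search_provider == "searchxng":
            return self.searchxng_host is not None
        return True  # DuckDuckGo needs no credentials
```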

Risk Assessment

High Risk

  • Breaking changes to retrieval_agent.py: Must update carefully to handle list[Evidence] instead of SearchResult
  • Legacy folder dependencies: Need to ensure all code is properly vendored

Medium Risk

  • Rate limiting: Serper API may have different limits than expected
  • Error handling: Need to handle API failures gracefully

Low Risk

  • Query preprocessing: May need adjustment for web search vs PubMed
  • Testing: Integration tests require API keys

Success Criteria

  1. ✅ SerperWebSearchTool implements SearchTool protocol correctly
  2. ✅ WebSearchTool implements SearchTool protocol correctly
  3. ✅ Both tools can be added to SearchHandler
  4. ✅ web_search_adapter works with new implementation
  5. ✅ retrieval_agent works with updated WebSearchTool
  6. ✅ All unit tests pass
  7. ✅ Integration tests pass (with API keys)
  8. ✅ No dependencies on folder/tools/web_search.py in src/ code
  9. ✅ Configuration supports multiple providers
  10. ✅ Error handling is robust

Notes

  • The vendored code should be self-contained and not depend on folder/ modules
  • The filter-agent functionality from the original code is removed (it can be restored later if needed)
  • Rate limiting follows the same pattern as the PubMed tool
  • Query preprocessing may need web-specific adjustments (less aggressive than PubMed)
  • Consider adding relevance scoring in the future