
SERPER Web Search Implementation Plan

Executive Summary

This plan details the implementation of Serper-based web search by vendoring the search logic from folder/tools/web_search.py into src/tools/, creating a protocol-compliant SerperWebSearchTool, fixing the existing WebSearchTool (DuckDuckGo), and integrating both into the main search flow.

Project Structure

Project 1: Vendor and Refactor Core Web Search Components

Goal: Extract and vendor Serper/SearchXNG search logic from folder/tools/web_search.py into src/tools/

Project 2: Create Protocol-Compliant SerperWebSearchTool

Goal: Implement SerperWebSearchTool class that fully complies with SearchTool protocol

Project 3: Fix Existing WebSearchTool Protocol Compliance

Goal: Make existing WebSearchTool (DuckDuckGo) protocol-compliant

Project 4: Integrate Web Search into SearchHandler

Goal: Add web search tools to main search flow in src/app.py

Project 5: Update Callers and Dependencies

Goal: Update all code that uses web search to work with new implementation

Project 6: Testing and Validation

Goal: Add comprehensive tests for all web search implementations


Detailed Implementation Plan

PROJECT 1: Vendor and Refactor Core Web Search Components

Activity 1.1: Create Vendor Module Structure

File: src/tools/vendored/__init__.py

  • Task 1.1.1: Create src/tools/vendored/ directory
  • Task 1.1.2: Create __init__.py with exports

File: src/tools/vendored/web_search_core.py

  • Task 1.1.3: Vendor ScrapeResult, WebpageSnippet, SearchResults models from folder/tools/web_search.py (lines 23-37)
  • Task 1.1.4: Vendor scrape_urls() function (lines 274-299)
  • Task 1.1.5: Vendor fetch_and_process_url() function (lines 302-348)
  • Task 1.1.6: Vendor html_to_text() function (lines 351-368)
  • Task 1.1.7: Vendor is_valid_url() function (lines 371-410)
  • Task 1.1.8: Vendor ssl_context setup (lines 115-120)
  • Task 1.1.9: Add imports: aiohttp, asyncio, BeautifulSoup, ssl
  • Task 1.1.10: Add CONTENT_LENGTH_LIMIT = 10000 constant
  • Task 1.1.11: Add type hints following project standards
  • Task 1.1.12: Add structlog logging
  • Task 1.1.13: Replace print() statements with logger calls
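
A minimal sketch of the vendored module after Tasks 1.1.3–1.1.13; the field names are assumptions based on typical scraping code, not confirmed against folder/tools/web_search.py:

```python
# src/tools/vendored/web_search_core.py -- sketch; field names are assumptions
import ssl

import structlog
from pydantic import BaseModel

logger = structlog.get_logger()

CONTENT_LENGTH_LIMIT = 10000  # truncate scraped page text (Task 1.1.10)


class WebpageSnippet(BaseModel):
    """A single search hit, before the page is scraped."""
    url: str
    title: str
    description: str | None = None


class ScrapeResult(BaseModel):
    """Full page content after scraping a WebpageSnippet."""
    url: str
    title: str
    text: str


async def scrape_urls(items: list[WebpageSnippet]) -> list[ScrapeResult]:
    """Fetch and extract text for each snippet (Task 1.1.4; body vendored)."""
    ...


# Relaxed SSL context for scraping arbitrary sites (Task 1.1.8);
# check_hostname must be disabled before verify_mode is set to CERT_NONE.
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
```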

File: src/tools/vendored/serper_client.py

  • Task 1.1.14: Vendor SerperClient class from folder/tools/web_search.py (lines 123-196)
  • Task 1.1.15: Remove dependency on ResearchAgent and ResearchRunner
  • Task 1.1.16: Replace filter agent with simple relevance filtering or remove it
  • Task 1.1.17: Add __init__ that takes api_key: str | None parameter
  • Task 1.1.18: Update search() method to return list[WebpageSnippet] without filtering
  • Task 1.1.19: Remove _filter_results() method (or make it optional)
  • Task 1.1.20: Add error handling with SearchError and RateLimitError
  • Task 1.1.21: Add structlog logging
  • Task 1.1.22: Add type hints
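
A sketch of the refactored client per Tasks 1.1.14–1.1.22. The request/response shape follows the public Serper API (POST to https://google.serper.dev/search, "organic" results); treat the result-field names as assumptions to verify against the vendored code:

```python
# src/tools/vendored/serper_client.py -- sketch
import aiohttp
import structlog

from src.tools.vendored.web_search_core import WebpageSnippet
from src.utils.exceptions import RateLimitError, SearchError

logger = structlog.get_logger()

SERPER_URL = "https://google.serper.dev/search"


class SerperClient:
    def __init__(self, api_key: str | None = None) -> None:
        self.api_key = api_key
        self._headers = {"X-API-KEY": api_key or "", "Content-Type": "application/json"}

    async def search(
        self, query: str, filter_for_relevance: bool = False, max_results: int = 10
    ) -> list[WebpageSnippet]:
        """Return raw organic results; the LLM filter agent is removed (Task 1.1.19)."""
        payload = {"q": query, "num": max_results}
        async with aiohttp.ClientSession() as session:
            async with session.post(SERPER_URL, headers=self._headers, json=payload) as resp:
                if resp.status == 429:
                    raise RateLimitError("Serper rate limit exceeded")
                if resp.status != 200:
                    raise SearchError(f"Serper returned HTTP {resp.status}")
                data = await resp.json()
        return [
            WebpageSnippet(url=r["link"], title=r["title"], description=r.get("snippet"))
            for r in data.get("organic", [])[:max_results]
        ]
```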

File: src/tools/vendored/searchxng_client.py

  • Task 1.1.23: Vendor SearchXNGClient class from folder/tools/web_search.py (lines 199-271)
  • Task 1.1.24: Remove dependency on ResearchAgent and ResearchRunner
  • Task 1.1.25: Replace filter agent with simple relevance filtering or remove it
  • Task 1.1.26: Add __init__ that takes host: str parameter
  • Task 1.1.27: Update search() method to return list[WebpageSnippet] without filtering
  • Task 1.1.28: Remove _filter_results() method (or make it optional)
  • Task 1.1.29: Add error handling with SearchError and RateLimitError
  • Task 1.1.30: Add structlog logging
  • Task 1.1.31: Add type hints

Activity 1.2: Create Rate Limiting for Web Search

File: src/tools/rate_limiter.py

  • Task 1.2.1: Add get_serper_limiter() function (rate: "10/second", one limiter per API key)
  • Task 1.2.2: Add get_searchxng_limiter() function (rate: "5/second")
  • Task 1.2.3: Use RateLimiterFactory.get() pattern
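
A sketch of the two helpers, assuming RateLimiterFactory.get(key, rate=...) matches the existing PubMed pattern (the factory's exact signature is not confirmed here):

```python
# src/tools/rate_limiter.py additions -- sketch; factory signature assumed
def get_serper_limiter(api_key: str):
    # One limiter per API key, since Serper quotas are per key (Task 1.2.1).
    return RateLimiterFactory.get(f"serper:{api_key}", rate="10/second")


def get_searchxng_limiter():
    # SearchXNG is self-hosted, so a single shared limiter suffices (Task 1.2.2).
    return RateLimiterFactory.get("searchxng", rate="5/second")
```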

PROJECT 2: Create Protocol-Compliant SerperWebSearchTool

Activity 2.1: Implement SerperWebSearchTool Class

File: src/tools/serper_web_search.py

  • Task 2.1.1: Create new file src/tools/serper_web_search.py

  • Task 2.1.2: Add imports:

    • from src.tools.base import SearchTool
    • from src.tools.vendored.serper_client import SerperClient
    • from src.tools.vendored.web_search_core import scrape_urls, WebpageSnippet
    • from src.tools.rate_limiter import get_serper_limiter
    • from src.tools.query_utils import preprocess_query
    • from src.utils.config import settings
    • from src.utils.exceptions import SearchError, RateLimitError, ConfigurationError
    • from src.utils.models import Citation, Evidence
    • import structlog
    • from tenacity import retry, stop_after_attempt, wait_exponential
  • Task 2.1.3: Create SerperWebSearchTool class

  • Task 2.1.4: Add __init__(self, api_key: str | None = None) method

    • Line 2.1.4.1: Get API key from parameter or settings.serper_api_key
    • Line 2.1.4.2: Validate API key is not None, raise ConfigurationError if missing
    • Line 2.1.4.3: Initialize SerperClient(api_key=self.api_key)
    • Line 2.1.4.4: Get rate limiter: self._limiter = get_serper_limiter(self.api_key)
  • Task 2.1.5: Add @property def name(self) -> str: returning "serper"

  • Task 2.1.6: Add async def _rate_limit(self) -> None: method

    • Line 2.1.6.1: Call await self._limiter.acquire()
  • Task 2.1.7: Add @retry(...) decorator with exponential backoff

  • Task 2.1.8: Add async def search(self, query: str, max_results: int = 10) -> list[Evidence]: method

    • Line 2.1.8.1: Call await self._rate_limit()
    • Line 2.1.8.2: Preprocess query: clean_query = preprocess_query(query)
    • Line 2.1.8.3: Fall back to the raw query if preprocessing strips everything: clean_query if clean_query else query
    • Line 2.1.8.4: Call search_results = await self._client.search(clean_query or query, filter_for_relevance=False, max_results=max_results)
    • Line 2.1.8.5: Call scraped = await scrape_urls(search_results)
    • Line 2.1.8.6: Convert ScrapeResult to Evidence objects:
      • Line 2.1.8.6.1: Create Citation with title, url, source="serper", date="Unknown", authors=[]
      • Line 2.1.8.6.2: Create Evidence with content=scraped.text, citation, relevance=0.0
    • Line 2.1.8.7: Return list[Evidence]
    • Line 2.1.8.8: Add try/except for aiohttp.ClientResponseError (the vendored client uses aiohttp, per Task 1.1.9):
      • Line 2.1.8.8.1: Check for a 429 status, raise RateLimitError
      • Line 2.1.8.8.2: Otherwise raise SearchError
    • Line 2.1.8.9: Add try/except for asyncio.TimeoutError, raise SearchError
    • Line 2.1.8.10: Add generic exception handler, log and raise SearchError
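
Putting Tasks 2.1.2–2.1.8 together, a minimal sketch; the retry/exception wiring shown is one reasonable arrangement, not the definitive implementation:

```python
# src/tools/serper_web_search.py -- sketch
import structlog
from tenacity import retry, stop_after_attempt, wait_exponential

from src.tools.query_utils import preprocess_query
from src.tools.rate_limiter import get_serper_limiter
from src.tools.vendored.serper_client import SerperClient
from src.tools.vendored.web_search_core import scrape_urls
from src.utils.config import settings
from src.utils.exceptions import ConfigurationError, RateLimitError, SearchError
from src.utils.models import Citation, Evidence

logger = structlog.get_logger()


class SerperWebSearchTool:
    def __init__(self, api_key: str | None = None) -> None:
        self.api_key = api_key or settings.serper_api_key
        if not self.api_key:
            raise ConfigurationError("SERPER_API_KEY is not configured")
        self._client = SerperClient(api_key=self.api_key)
        self._limiter = get_serper_limiter(self.api_key)

    @property
    def name(self) -> str:
        return "serper"

    async def _rate_limit(self) -> None:
        await self._limiter.acquire()

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        await self._rate_limit()
        clean_query = preprocess_query(query)
        try:
            results = await self._client.search(
                clean_query or query, filter_for_relevance=False, max_results=max_results
            )
            scraped = await scrape_urls(results)
        except (RateLimitError, SearchError):
            raise
        except Exception as exc:  # Lines 2.1.8.8-2.1.8.10: wrap transport errors
            logger.error("serper_search_failed", error=str(exc))
            raise SearchError(f"Serper search failed: {exc}") from exc
        return [
            Evidence(
                content=page.text,
                citation=Citation(
                    title=page.title, url=page.url, source="serper",
                    date="Unknown", authors=[],
                ),
                relevance=0.0,
            )
            for page in scraped
        ]
```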

Activity 2.2: Implement SearchXNGWebSearchTool Class

File: src/tools/searchxng_web_search.py

  • Task 2.2.1: Create new file src/tools/searchxng_web_search.py

  • Task 2.2.2: Add imports (similar to SerperWebSearchTool)

  • Task 2.2.3: Create SearchXNGWebSearchTool class

  • Task 2.2.4: Add __init__(self, host: str | None = None) method

    • Line 2.2.4.1: Get host from parameter or settings.searchxng_host
    • Line 2.2.4.2: Validate host is not None, raise ConfigurationError if missing
    • Line 2.2.4.3: Initialize SearchXNGClient(host=self.host)
    • Line 2.2.4.4: Get rate limiter: self._limiter = get_searchxng_limiter()
  • Task 2.2.5: Add @property def name(self) -> str: returning "searchxng"

  • Task 2.2.6: Add async def _rate_limit(self) -> None: method

  • Task 2.2.7: Add @retry(...) decorator

  • Task 2.2.8: Add async def search(self, query: str, max_results: int = 10) -> list[Evidence]: method

    • Lines 2.2.8.1-2.2.8.10: Same structure as SerperWebSearchTool.search(), substituting the SearchXNG client and limiter

PROJECT 3: Fix Existing WebSearchTool Protocol Compliance

Activity 3.1: Update WebSearchTool Class

File: src/tools/web_search.py

  • Task 3.1.1: Add @property def name(self) -> str: method returning "duckduckgo" (after line 17)

  • Task 3.1.2: Change search() return type from SearchResult to list[Evidence] (line 19)

  • Task 3.1.3: Update search() method body:

    • Line 3.1.3.1: Keep existing search logic (lines 21-43)
    • Line 3.1.3.2: Instead of returning SearchResult, return evidence list directly (line 44)
    • Line 3.1.3.3: Update exception handler to return empty list [] instead of SearchResult (line 51)
  • Task 3.1.4: Add imports if needed:

    • Line 3.1.4.1: from src.utils.exceptions import SearchError
    • Line 3.1.4.2: Update exception handling to raise SearchError instead of returning error SearchResult
  • Task 3.1.5: Add query preprocessing:

    • Line 3.1.5.1: Import from src.tools.query_utils import preprocess_query
    • Line 3.1.5.2: Add clean_query = preprocess_query(query) before search
    • Line 3.1.5.3: Use clean_query if clean_query else query
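
A sketch of the rewritten method, assuming the duckduckgo-search library's DDGS().text() call (which returns dicts with "title", "href", and "body" keys); the existing search logic kept by Line 3.1.3.1 is compressed here:

```python
# src/tools/web_search.py -- sketch of the protocol-compliant tool
from duckduckgo_search import DDGS

from src.tools.query_utils import preprocess_query
from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class WebSearchTool:
    @property
    def name(self) -> str:
        return "duckduckgo"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        clean_query = preprocess_query(query)
        try:
            hits = DDGS().text(clean_query or query, max_results=max_results)
        except Exception as exc:  # Task 3.1.4: raise instead of returning an error result
            raise SearchError(f"DuckDuckGo search failed: {exc}") from exc
        return [
            Evidence(
                content=hit.get("body", ""),
                citation=Citation(
                    title=hit.get("title", ""), url=hit.get("href", ""),
                    source="duckduckgo", date="Unknown", authors=[],
                ),
                relevance=0.0,
            )
            for hit in hits
        ]
```

Note that DDGS().text() is synchronous; the real implementation may already wrap it (e.g. in a thread executor), and the sketch keeps the call inline only for brevity.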

Activity 3.2: Update Retrieval Agent Caller

File: src/agents/retrieval_agent.py

  • Task 3.2.1: Update search_web() function (line 31):
    • Line 3.2.1.1: Locate results = await _web_search.search(query, max_results)
    • Line 3.2.1.2: Rename to evidence = await _web_search.search(query, max_results)
    • Line 3.2.1.3: Update check: if not evidence: instead of if not results.evidence:
    • Line 3.2.1.4: Update state update: new_count = state.add_evidence(evidence) instead of results.evidence
    • Line 3.2.1.5: Update logging: results_found=len(evidence) instead of len(results.evidence)
    • Line 3.2.1.6: Update output formatting: for i, r in enumerate(evidence[:max_results], 1): instead of results.evidence[:max_results]
    • Line 3.2.1.7: Update deduplication: await state.embedding_service.deduplicate(evidence) instead of results.evidence
    • Line 3.2.1.8: Update output message: Found {len(evidence)} web results instead of len(results.evidence)
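
The updated body of search_web() might then read as follows (a sketch; the surrounding function and the exact ordering of deduplication versus state updates are paraphrased from Tasks 3.2.1.3–3.2.1.8, not copied from the real file):

```python
# src/agents/retrieval_agent.py, inside search_web() -- sketch
evidence = await _web_search.search(query, max_results)
if not evidence:
    return "No web results found."
evidence = await state.embedding_service.deduplicate(evidence)  # Line 3.2.1.7
new_count = state.add_evidence(evidence)                        # Line 3.2.1.4
logger.info("web_search_complete", results_found=len(evidence), new=new_count)
lines = [
    f"{i}. {e.citation.title} ({e.citation.url})"
    for i, e in enumerate(evidence[:max_results], 1)            # Line 3.2.1.6
]
return f"Found {len(evidence)} web results:\n" + "\n".join(lines)
```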

PROJECT 4: Integrate Web Search into SearchHandler

Activity 4.1: Create Web Search Tool Factory

File: src/tools/web_search_factory.py

  • Task 4.1.1: Create new file src/tools/web_search_factory.py

  • Task 4.1.2: Add imports:

    • from src.tools.web_search import WebSearchTool
    • from src.tools.serper_web_search import SerperWebSearchTool
    • from src.tools.searchxng_web_search import SearchXNGWebSearchTool
    • from src.utils.config import settings
    • from src.utils.exceptions import ConfigurationError
    • import structlog
  • Task 4.1.3: Add logger = structlog.get_logger()

  • Task 4.1.4: Create def create_web_search_tool() -> SearchTool | None: function

    • Line 4.1.4.1: Check settings.web_search_provider
    • Line 4.1.4.2: If "serper":
      • Line 4.1.4.2.1: Check settings.serper_api_key or settings.web_search_available()
      • Line 4.1.4.2.2: If available, return SerperWebSearchTool()
      • Line 4.1.4.2.3: Else log warning and return None
    • Line 4.1.4.3: If "searchxng":
      • Line 4.1.4.3.1: Check settings.searchxng_host or settings.web_search_available()
      • Line 4.1.4.3.2: If available, return SearchXNGWebSearchTool()
      • Line 4.1.4.3.3: Else log warning and return None
    • Line 4.1.4.4: If "duckduckgo":
      • Line 4.1.4.4.1: Return WebSearchTool() (always available)
    • Line 4.1.4.5: If "brave" or "tavily":
      • Line 4.1.4.5.1: Log warning "Not yet implemented"
      • Line 4.1.4.5.2: Return None
    • Line 4.1.4.6: Default: return WebSearchTool() (fallback to DuckDuckGo)
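
A sketch of the factory; the settings attribute names are taken from the task list above:

```python
# src/tools/web_search_factory.py -- sketch
import structlog

from src.tools.base import SearchTool
from src.tools.searchxng_web_search import SearchXNGWebSearchTool
from src.tools.serper_web_search import SerperWebSearchTool
from src.tools.web_search import WebSearchTool
from src.utils.config import settings

logger = structlog.get_logger()


def create_web_search_tool() -> SearchTool | None:
    provider = settings.web_search_provider
    if provider == "serper":
        if settings.serper_api_key:
            return SerperWebSearchTool()
        logger.warning("web_search_unavailable", provider="serper", reason="no API key")
        return None
    if provider == "searchxng":
        if settings.searchxng_host:
            return SearchXNGWebSearchTool()
        logger.warning("web_search_unavailable", provider="searchxng", reason="no host")
        return None
    if provider in ("brave", "tavily"):
        logger.warning("web_search_not_implemented", provider=provider)
        return None
    # "duckduckgo" and any unrecognized value fall back to the keyless tool
    return WebSearchTool()
```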

Activity 4.2: Update SearchHandler Initialization

File: src/app.py

  • Task 4.2.1: Add import: from src.tools.web_search_factory import create_web_search_tool

  • Task 4.2.2: Update configure_orchestrator() function (around line 73):

    • Line 4.2.2.1: Before creating SearchHandler, call web_search_tool = create_web_search_tool()
    • Line 4.2.2.2: Create tools list: tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]
    • Line 4.2.2.3: If web_search_tool is not None:
      • Line 4.2.2.3.1: Append web_search_tool to tools list
      • Line 4.2.2.3.2: Log info: "Web search tool added to search handler"
    • Line 4.2.2.4: Update SearchHandler initialization to use tools list
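
Inside configure_orchestrator(), the wiring could look like this (a sketch; SearchHandler's constructor signature and the existing tool imports are assumptions):

```python
# src/app.py, inside configure_orchestrator() -- sketch
from src.tools.web_search_factory import create_web_search_tool

tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]  # existing tools
web_search_tool = create_web_search_tool()
if web_search_tool is not None:
    tools.append(web_search_tool)
    logger.info("web_search_tool_added", tool=web_search_tool.name)
search_handler = SearchHandler(tools=tools)
```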

PROJECT 5: Update Callers and Dependencies

Activity 5.1: Update web_search_adapter

File: src/tools/web_search_adapter.py

  • Task 5.1.1: Update web_search() function to use new implementation:
    • Line 5.1.1.1: Import from src.tools.web_search_factory import create_web_search_tool
    • Line 5.1.1.2: Remove dependency on folder.tools.web_search
    • Line 5.1.1.3: Get tool: tool = create_web_search_tool()
    • Line 5.1.1.4: If tool is None, return error message
    • Line 5.1.1.5: Call evidence = await tool.search(query, max_results=5)
    • Line 5.1.1.6: Convert Evidence objects to formatted string:
      • Line 5.1.1.6.1: Format each evidence with title, URL, content preview
    • Line 5.1.1.7: Return formatted string
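
A sketch of the rewritten adapter (the 500-character content preview is an arbitrary choice):

```python
# src/tools/web_search_adapter.py -- sketch
from src.tools.web_search_factory import create_web_search_tool


async def web_search(query: str) -> str:
    tool = create_web_search_tool()
    if tool is None:
        return "Web search is not configured: no provider is available."
    evidence = await tool.search(query, max_results=5)
    if not evidence:
        return "No web results found."
    blocks = [
        f"{e.citation.title}\n{e.citation.url}\n{e.content[:500]}"
        for e in evidence
    ]
    return "\n\n".join(blocks)
```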

Activity 5.2: Update Tool Executor

File: src/tools/tool_executor.py

  • Task 5.2.1: Verify web_search_adapter.web_search() usage (line 86) still works
  • Task 5.2.2: No changes needed if adapter is updated correctly

Activity 5.3: Update Planner Agent

File: src/orchestrator/planner_agent.py

  • Task 5.3.1: Verify web_search_adapter.web_search() usage (line 14) still works
  • Task 5.3.2: No changes needed if adapter is updated correctly

Activity 5.4: Remove Legacy Dependencies

File: src/tools/web_search_adapter.py

  • Task 5.4.1: Remove import of folder.llm_config and folder.tools.web_search
  • Task 5.4.2: Update error messages to reflect new implementation

PROJECT 6: Testing and Validation

Activity 6.1: Unit Tests for Vendored Components

File: tests/unit/tools/test_vendored_web_search_core.py

  • Task 6.1.1: Test scrape_urls() function
  • Task 6.1.2: Test fetch_and_process_url() function
  • Task 6.1.3: Test html_to_text() function
  • Task 6.1.4: Test is_valid_url() function

File: tests/unit/tools/test_vendored_serper_client.py

  • Task 6.1.5: Mock SerperClient API calls
  • Task 6.1.6: Test successful search
  • Task 6.1.7: Test error handling
  • Task 6.1.8: Test rate limiting

File: tests/unit/tools/test_vendored_searchxng_client.py

  • Task 6.1.9: Mock SearchXNGClient API calls
  • Task 6.1.10: Test successful search
  • Task 6.1.11: Test error handling
  • Task 6.1.12: Test rate limiting

Activity 6.2: Unit Tests for Web Search Tools

File: tests/unit/tools/test_serper_web_search.py

  • Task 6.2.1: Test SerperWebSearchTool.__init__() with valid API key
  • Task 6.2.2: Test SerperWebSearchTool.__init__() without API key (should raise)
  • Task 6.2.3: Test name property returns "serper"
  • Task 6.2.4: Test search() returns list[Evidence]
  • Task 6.2.5: Test search() with mocked SerperClient
  • Task 6.2.6: Test error handling (SearchError, RateLimitError)
  • Task 6.2.7: Test query preprocessing
  • Task 6.2.8: Test rate limiting
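
Two representative tests as a sketch, assuming pytest-asyncio and the module layout proposed above (the patch targets would need to match the real imports):

```python
# tests/unit/tools/test_serper_web_search.py -- sketch
from unittest.mock import AsyncMock, patch

import pytest

from src.tools.serper_web_search import SerperWebSearchTool
from src.utils.exceptions import ConfigurationError
from src.utils.models import Evidence


def test_init_without_api_key_raises():
    # Task 6.2.2: no key passed and none in settings -> ConfigurationError
    with patch("src.tools.serper_web_search.settings") as settings:
        settings.serper_api_key = None
        with pytest.raises(ConfigurationError):
            SerperWebSearchTool()


@pytest.mark.asyncio
async def test_search_returns_evidence_list():
    # Tasks 6.2.3-6.2.5: mocked client, protocol-shaped return value
    tool = SerperWebSearchTool(api_key="test-key")
    tool._client.search = AsyncMock(return_value=[])
    with patch("src.tools.serper_web_search.scrape_urls", AsyncMock(return_value=[])):
        results = await tool.search("glp-1 agonists")
    assert tool.name == "serper"
    assert isinstance(results, list)
    assert all(isinstance(r, Evidence) for r in results)
```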

File: tests/unit/tools/test_searchxng_web_search.py

  • Task 6.2.9: Similar tests for SearchXNGWebSearchTool

File: tests/unit/tools/test_web_search.py

  • Task 6.2.10: Test WebSearchTool.name property returns "duckduckgo"
  • Task 6.2.11: Test WebSearchTool.search() returns list[Evidence]
  • Task 6.2.12: Test WebSearchTool.search() with mocked DDGS
  • Task 6.2.13: Test error handling
  • Task 6.2.14: Test query preprocessing

Activity 6.3: Integration Tests

File: tests/integration/test_web_search_integration.py

  • Task 6.3.1: Test SerperWebSearchTool with real API (marked @pytest.mark.integration)
  • Task 6.3.2: Test SearchXNGWebSearchTool with real API (marked @pytest.mark.integration)
  • Task 6.3.3: Test WebSearchTool with real DuckDuckGo (marked @pytest.mark.integration)
  • Task 6.3.4: Test create_web_search_tool() factory function
  • Task 6.3.5: Test SearchHandler with web search tool
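
A sketch of one live test, skipping cleanly when no key is present so CI without secrets stays green:

```python
# tests/integration/test_web_search_integration.py -- sketch
import os

import pytest

from src.tools.serper_web_search import SerperWebSearchTool


@pytest.mark.integration
@pytest.mark.asyncio
@pytest.mark.skipif(not os.environ.get("SERPER_API_KEY"), reason="requires SERPER_API_KEY")
async def test_serper_live_search():
    tool = SerperWebSearchTool()
    evidence = await tool.search("semaglutide cardiovascular outcomes", max_results=3)
    assert evidence, "expected at least one live result"
    assert evidence[0].citation.url.startswith("http")
```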

Activity 6.4: Update Existing Tests

File: tests/unit/agents/test_retrieval_agent.py

  • Task 6.4.1: Update tests to expect list[Evidence] instead of SearchResult
  • Task 6.4.2: Mock WebSearchTool.search() to return list[Evidence]

File: tests/unit/tools/test_tool_executor.py

  • Task 6.4.3: Verify tests still pass with updated web_search_adapter

Implementation Order

  1. PROJECT 1: Vendor core components (foundation)
  2. PROJECT 3: Fix existing WebSearchTool (quick win, unblocks retrieval agent)
  3. PROJECT 2: Create SerperWebSearchTool (new functionality)
  4. PROJECT 4: Integrate into SearchHandler (main integration)
  5. PROJECT 5: Update callers (cleanup dependencies)
  6. PROJECT 6: Testing (validation)

Dependencies and Prerequisites

External Dependencies

  • aiohttp - Already in requirements
  • beautifulsoup4 - Already in requirements
  • duckduckgo-search - Already in requirements
  • tenacity - Already in requirements
  • structlog - Already in requirements

Internal Dependencies

  • src/tools/base.py - SearchTool protocol
  • src/tools/rate_limiter.py - Rate limiting utilities
  • src/tools/query_utils.py - Query preprocessing
  • src/utils/config.py - Settings and configuration
  • src/utils/exceptions.py - Custom exceptions
  • src/utils/models.py - Evidence, Citation models

Configuration Requirements

  • SERPER_API_KEY - For Serper provider
  • SEARCHXNG_HOST - For SearchXNG provider
  • WEB_SEARCH_PROVIDER - Environment variable (default: "duckduckgo")
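
The corresponding settings fields might look like this (a sketch assuming pydantic-settings; the real src/utils/config.py may differ):

```python
# src/utils/config.py additions -- sketch
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    serper_api_key: str | None = None        # SERPER_API_KEY
    searchxng_host: str | None = None        # SEARCHXNG_HOST
    web_search_provider: str = "duckduckgo"  # WEB_SEARCH_PROVIDER

    def web_search_available(self) -> bool:
        """True when the selected provider has the configuration it needs."""
        if self.web_search_provider == "serper":
            return self.serper_api_key is not None
        if self.web_search_provider == "searchxng":
            return self.searchxng_host is not None
        return True  # DuckDuckGo needs no credentials
```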

Risk Assessment

High Risk

  • Breaking changes to retrieval_agent.py: Must update carefully to handle list[Evidence] instead of SearchResult
  • Legacy folder dependencies: Need to ensure all code is properly vendored

Medium Risk

  • Rate limiting: Serper API may have different limits than expected
  • Error handling: Need to handle API failures gracefully

Low Risk

  • Query preprocessing: May need adjustment for web search vs PubMed
  • Testing: Integration tests require API keys

Success Criteria

  1. ✅ SerperWebSearchTool implements SearchTool protocol correctly
  2. ✅ WebSearchTool implements SearchTool protocol correctly
  3. ✅ Both tools can be added to SearchHandler
  4. ✅ web_search_adapter works with new implementation
  5. ✅ retrieval_agent works with updated WebSearchTool
  6. ✅ All unit tests pass
  7. ✅ Integration tests pass (with API keys)
  8. ✅ No dependencies on folder/tools/web_search.py in src/ code
  9. ✅ Configuration supports multiple providers
  10. ✅ Error handling is robust

Notes

  • The vendored code should be self-contained and not depend on folder/ modules
  • The filter-agent functionality from the original code is removed (it can be restored later if needed)
  • Rate limiting follows the same pattern as the PubMed tool
  • Query preprocessing may need web-specific adjustments (less aggressive than PubMed)
  • Consider adding relevance scoring in the future