
Testing Strategy

Ensuring DeepCritical Is Ironclad


Overview

Our testing strategy follows a strict Pyramid of Reliability:

  1. Unit Tests: Fast, isolated logic checks (60% of tests)
  2. Integration Tests: Tool interactions & Agent loops (30% of tests)
  3. E2E / Regression Tests: Full research workflows (10% of tests)

Goal: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.


1. Unit Tests (Fast & Cheap)

Location: tests/unit/

Focus on individual components with no external network calls. Mock every external dependency.

Key Test Cases

Agent Logic

  • Initialization: Verify default config loads correctly.
  • State Updates: Ensure ResearchState updates correctly (e.g., token counts increment).
  • Budget Checks: Verify that should_continue() returns False when the budget is exceeded.
  • Error Handling: Verify recovery from partial failure (e.g., one tool fails and the agent continues).

Tools (Mocked)

  • Parser Logic: Feed raw XML/JSON to tool parsers and verify Evidence objects.
  • Validation: Ensure tools reject invalid queries (empty strings, etc.).
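
As a rough illustration of the two tool checks above, the sketch below parses a small PubMed-style XML payload into Evidence objects and rejects an empty query. The `Evidence` dataclass, `parse_pubmed_xml`, `validate_query`, and the sample XML are all hypothetical stand-ins invented for this example; the project's real parser and field names may differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical stand-in for the project's Evidence type."""
    pmid: str
    title: str


def parse_pubmed_xml(raw_xml: str) -> list[Evidence]:
    """Hypothetical parser: turn a PubMed-style XML payload into Evidence objects."""
    root = ET.fromstring(raw_xml)
    results = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        title = article.findtext(".//ArticleTitle")
        if pmid and title:  # skip records missing a required field
            results.append(Evidence(pmid=pmid.strip(), title=title.strip()))
    return results


def validate_query(query: str) -> str:
    """Hypothetical validator: reject empty or whitespace-only queries."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must not be empty")
    return cleaned


# Assumed fixture shape; a real fixture would live under tests/fixtures/.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article><ArticleTitle>Example title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <Article><ArticleTitle>Record without a PMID</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def test_parser_builds_evidence():
    evidence = parse_pubmed_xml(SAMPLE_XML)
    assert evidence == [Evidence(pmid="12345", title="Example title")]


def test_empty_query_rejected():
    try:
        validate_query("   ")
    except ValueError:
        return
    raise AssertionError("empty query should be rejected")
```

Feeding the parser a record with a missing field, as the second article does here, is a cheap way to cover malformed upstream data without touching the network.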

Judge Prompts

  • Schema Compliance: Verify prompt templates generate valid JSON structure instructions.
  • Variable Injection: Ensure {question} and {context} are injected correctly into prompts.
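
# Example: Testing State Logic
# ResearchState and should_continue are the project's own names; their import path is not shown here.
def test_budget_stop():
    # One token over the limit must stop the loop.
    state = ResearchState(tokens_used=50001, max_tokens=50000)
    assert should_continue(state) is False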
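
The variable-injection check for judge prompts could look roughly like the following. `JUDGE_TEMPLATE` is a made-up template used only for illustration, and the sketch assumes prompts are rendered with `str.format`; if the project uses a different templating mechanism, the field-extraction step would need to change.

```python
from string import Formatter

# Hypothetical template; the project's real judge prompt is not shown in this document.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Evidence so far:\n{context}\n"
    'Reply with JSON of the form {{"sufficient": true, "next_queries": []}}.'
)


def template_fields(template: str) -> set[str]:
    """Return the placeholder names a str.format template expects."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def render_prompt(question: str, context: str) -> str:
    """Fill the template with the question and the accumulated context."""
    return JUDGE_TEMPLATE.format(question=question, context=context)


def test_variable_injection():
    # The template should expose exactly the two documented placeholders.
    assert template_fields(JUDGE_TEMPLATE) == {"question", "context"}
    prompt = render_prompt("Does X treat Y?", "- PMID 12345: ...")
    assert "Does X treat Y?" in prompt
    assert "{question}" not in prompt and "{context}" not in prompt
    # Escaped braces should survive as literal JSON braces.
    assert '{"sufficient": true' in prompt
```

Checking the set of placeholders separately from the rendered output catches a renamed or misspelled variable before any model call is made.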

2. Integration Tests (Realistic & Mocked I/O)

Location: tests/integration/

Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use VCR.py or a similar record/replay pattern so that each API call is recorded once and replayed on later runs, which saves both money and time.
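
To make the record/replay idea concrete, here is a hand-rolled sketch. It is not VCR.py's API. `ReplayCache` and `fake_live_fetch` are hypothetical names for this example, and the "cassette" is just a JSON file keyed by query string.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class ReplayCache:
    """Hand-rolled stand-in for a cassette: record a response once, replay it afterwards."""

    def __init__(self, path: Path):
        self.path = path
        self.data = json.loads(path.read_text()) if path.exists() else {}

    def call(self, key: str, live_fetch: Callable[[str], str]) -> str:
        if key not in self.data:  # first run: hit the "network" and record
            self.data[key] = live_fetch(key)
            self.path.write_text(json.dumps(self.data))
        return self.data[key]  # later runs: replay from disk


def demo() -> tuple[str, str, int]:
    calls = []

    def fake_live_fetch(query: str) -> str:
        calls.append(query)  # counts how often the "real" API is hit
        return f"results for {query}"

    with tempfile.TemporaryDirectory() as tmp:
        cassette = Path(tmp) / "pubmed.json"
        first = ReplayCache(cassette).call("long covid fatigue", fake_live_fetch)
        second = ReplayCache(cassette).call("long covid fatigue", fake_live_fetch)
    return first, second, len(calls)
```

The second `ReplayCache` instance reads the file written by the first, so the fake live fetch runs only once. A real setup would more likely rely on VCR.py's cassettes, which intercept HTTP traffic instead of wrapping individual calls.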

Key Test Cases

Search Loop

  • Iteration Flow: Verify agent performs Search -> Judge -> Search loop.
  • Tool Selection: Verify correct tools are called based on judge output (mocked judge).
  • Context Accumulation: Ensure findings from Iteration 1 are passed to Iteration 2.

MCP Server Integration

  • Server Startup: Verify MCP server starts and exposes tools.
  • Client Connection: Verify agent can call tools via MCP protocol.
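
# Example: Testing Search Loop with Mocked Tools
# ResearchAgent, MockPubMed, and MockWeb are the project's own names; their import paths are not shown here.
# A bare async test like this relies on asyncio_mode = "auto" (see Pytest Configuration).
async def test_search_loop_flow():
    agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
    report = await agent.run("test query")
    # The agent ran at least one iteration and produced at least one source.
    assert agent.state.iterations > 0
    assert len(report.sources) > 0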
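
The iteration-flow and context-accumulation cases can be exercised with a scripted judge, sketched below. `ScriptedJudge`, `FakeSearch`, and `run_loop` are hypothetical and only mimic the Search -> Judge -> Search shape described above; they are not the project's orchestrator. The test drives the loop with `asyncio.run` so the sketch does not depend on any pytest plugin.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ScriptedJudge:
    """Hypothetical mock judge: returns pre-scripted verdicts and records the context it saw."""
    verdicts: list
    seen_contexts: list = field(default_factory=list)

    async def assess(self, question: str, context: list) -> dict:
        self.seen_contexts.append(list(context))  # snapshot, since context keeps growing
        return self.verdicts[len(self.seen_contexts) - 1]


class FakeSearch:
    """Hypothetical mock tool that returns one canned finding per query."""

    async def search(self, query: str) -> list:
        return [f"finding for {query!r}"]


async def run_loop(question, tool, judge, max_iterations=3):
    context, query = [], question
    for iteration in range(1, max_iterations + 1):
        context.extend(await tool.search(query))  # Search
        verdict = await judge.assess(question, context)  # Judge
        if verdict["sufficient"]:
            return iteration, context
        query = verdict["next_query"]  # feeds the next Search
    return max_iterations, context


def test_context_accumulates():
    judge = ScriptedJudge(verdicts=[
        {"sufficient": False, "next_query": "follow-up"},
        {"sufficient": True},
    ])
    iterations, context = asyncio.run(run_loop("test query", FakeSearch(), judge))
    assert iterations == 2
    # The second judge call must see the finding from iteration 1 plus the new one.
    assert judge.seen_contexts[0] == ["finding for 'test query'"]
    assert judge.seen_contexts[1] == ["finding for 'test query'", "finding for 'follow-up'"]
```

Recording what the judge was shown on each call, instead of only checking the final report, is what lets the test prove that iteration 1 findings reached iteration 2.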

3. End-to-End (E2E) Tests (The "Real Deal")

Location: tests/e2e/

Run against real APIs (with strict rate limits) to verify system integrity. Run these on demand or nightly, not on every commit.

Key Test Cases

The "Golden Query"

Run the primary demo query: "What existing drugs might help treat long COVID fatigue?"

  • Success Criteria:
    • Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
    • Includes citations from PubMed.
    • Completes within 3 iterations.
    • JSON output matches schema.
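
The success criteria above can be encoded as a single checker, sketched here under an assumed report shape. The keys `candidates`, `citations`, and `iterations`, and the `source` field on each citation, are inventions for this example and not the project's actual schema. The sketch only counts candidates; it does not judge whether they are medically valid.

```python
def check_golden_report(report: dict, max_iterations: int = 3) -> list[str]:
    """Return a list of failed criteria; an empty list means the report passes.

    The report shape (keys 'candidates', 'citations', 'iterations') is an
    assumption made for this sketch, not the project's actual schema.
    """
    failures = []
    required = {"candidates": list, "citations": list, "iterations": int}
    for key, expected_type in required.items():
        if not isinstance(report.get(key), expected_type):
            failures.append(f"schema: '{key}' missing or not {expected_type.__name__}")
    if failures:
        return failures  # remaining checks need a well-formed report

    if len(report["candidates"]) < 2:
        failures.append("fewer than 2 drug candidates")
    if not any(c.get("source") == "pubmed" for c in report["citations"]):
        failures.append("no PubMed citation")
    if report["iterations"] > max_iterations:
        failures.append(f"took more than {max_iterations} iterations")
    return failures
```

Returning every failed criterion, and not stopping at the first, makes a nightly failure easier to diagnose from the log alone.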

Deployment Smoke Test

  • Gradio UI: Verify UI launches and accepts input.
  • Streaming: Verify generator yields chunks (first chunk within 2s).
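
The streaming check boils down to timing the first item a generator yields. A minimal sketch follows, with `fake_stream` standing in for the app's real streaming generator, which is not shown in this document.

```python
import time
from typing import Iterable, Iterator


def time_to_first_chunk(stream: Iterable[str]) -> tuple[float, str]:
    """Return (seconds until the first chunk arrived, the first chunk)."""
    start = time.monotonic()
    first = next(iter(stream))  # raises StopIteration if the stream yields nothing
    return time.monotonic() - start, first


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for the app's streaming generator."""
    time.sleep(0.05)  # simulated startup latency
    yield "Searching PubMed..."
    yield "Done."


def test_first_chunk_is_fast():
    elapsed, chunk = time_to_first_chunk(fake_stream())
    assert chunk == "Searching PubMed..."
    assert elapsed < 2.0  # the 2s budget from the smoke-test criteria
```

`time.monotonic()` is used so the measurement is unaffected by system clock adjustments during the run.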

4. Tools & Config

Pytest Configuration

# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: mocked network tests",
    "e2e: real network tests (slow, expensive)"
]
asyncio_mode = "auto"  # requires the pytest-asyncio plugin

CI/CD Pipeline (GitHub Actions)

  1. Lint: ruff check .
  2. Type Check: mypy .
  3. Unit: pytest -m unit
  4. Integration: pytest -m integration
  5. E2E: (Manual trigger only)

5. Anti-Hallucination Validation

How do we test if the agent is lying?

  1. Citation Check:
    • Use a regex to verify that every citation of the form [PMID: 12345] in the report exists in the Evidence list.
    • Fail if a citation is "orphaned" (a hallucinated ID).
  2. Negative Constraints:
    • Run queries for fake diseases ("Ligma syndrome"); the agent should return "No evidence found".
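
The citation check can be sketched in a few lines. The regex assumes citations appear exactly in the `[PMID: 12345]` form quoted above, and `orphaned_citations` is a hypothetical helper that takes the evidence as a plain set of PMID strings instead of the project's Evidence objects.

```python
import re

# Matches citations written like [PMID: 12345]; other citation formats are not handled.
PMID_PATTERN = re.compile(r"\[PMID:\s*(\d+)\]")


def orphaned_citations(report_text: str, evidence_pmids: set[str]) -> set[str]:
    """Return PMIDs cited in the report that are absent from the collected evidence."""
    cited = set(PMID_PATTERN.findall(report_text))
    return cited - evidence_pmids


def test_no_orphaned_citations():
    report = "Claim one [PMID: 12345]. Claim two [PMID: 67890]."
    assert orphaned_citations(report, {"12345", "67890"}) == set()


def test_hallucinated_id_is_flagged():
    report = "A made-up claim [PMID: 99999] next to a real one [PMID: 12345]."
    assert orphaned_citations(report, {"12345"}) == {"99999"}
```

A set difference keeps the check independent of citation order and of how many times an ID is repeated in the report.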

Checklist for Implementation

  • Set up tests/ directory structure
  • Configure pytest and vcrpy
  • Create tests/fixtures/ for mock data (PubMed XMLs)
  • Write first unit test for ResearchState