Testing Strategy
Ensuring DeepCritical is Ironclad
Overview
Our testing strategy follows a strict Pyramid of Reliability:
- Unit Tests: Fast, isolated logic checks (60% of tests)
- Integration Tests: Tool interactions & Agent loops (30% of tests)
- E2E / Regression Tests: Full research workflows (10% of tests)
Goal: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.
1. Unit Tests (Fast & Cheap)
Location: tests/unit/
Focus on individual components without external network calls. Mock everything.
Key Test Cases
Agent Logic
- Initialization: Verify default config loads correctly.
- State Updates: Ensure ResearchState updates correctly (e.g., token counts increment).
- Budget Checks: Test should_continue() returns False when the budget is exceeded.
- Error Handling: Test partial failure recovery (e.g., one tool fails, agent continues).
Tools (Mocked)
- Parser Logic: Feed raw XML/JSON to tool parsers and verify the resulting Evidence objects.
- Validation: Ensure tools reject invalid queries (empty strings, etc.).
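For instance, a parser test can feed a canned fixture to the tool and assert on the result. The import paths and the parse()/search() method names below are assumptions about the project layout; adjust them to the real tool API.

# Example: Testing Parser Logic with a Canned Fixture (names are illustrative)
from pathlib import Path

import pytest

from deepcritical.tools import PubMedTool   # assumed import path
from deepcritical.models import Evidence    # assumed import path

def test_pubmed_parser_returns_evidence():
    # Canned XML fixture recorded once; no network involved
    raw_xml = Path("tests/fixtures/pubmed_sample.xml").read_text()
    results = PubMedTool().parse(raw_xml)   # hypothetical parser entry point
    assert results
    assert all(isinstance(e, Evidence) for e in results)

def test_tool_rejects_empty_query():
    with pytest.raises(ValueError):
        PubMedTool().search("")             # hypothetical validation behavior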
Judge Prompts
- Schema Compliance: Verify prompt templates generate valid JSON structure instructions.
- Variable Injection: Ensure {question} and {context} are injected correctly into prompts.
# Example: Testing State Logic
def test_budget_stop():
    state = ResearchState(tokens_used=50001, max_tokens=50000)
    assert should_continue(state) is False
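A similar sketch for the judge-prompt checks, assuming a build_judge_prompt helper (hypothetical name and import path):

# Example: Testing Prompt Variable Injection (build_judge_prompt is a hypothetical helper)
from deepcritical.prompts import build_judge_prompt  # assumed import path

def test_prompt_injects_variables():
    prompt = build_judge_prompt(
        question="What treats long COVID fatigue?",
        context="[PMID: 12345] CoQ10 trial ...",
    )
    assert "What treats long COVID fatigue?" in prompt
    assert "[PMID: 12345]" in prompt

def test_prompt_demands_json_output():
    prompt = build_judge_prompt(question="q", context="c")
    # The template should spell out the JSON structure the judge must return
    assert "json" in prompt.lower()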
2. Integration Tests (Realistic & Mocked I/O)
Location: tests/integration/
Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use VCR.py or replay patterns to record and replay API calls, saving money and time.
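For example, a recorded-cassette test might look like the sketch below. record_mode="once" records real traffic on the first run and replays it afterwards; the PubMedTool call is illustrative.

# Example: Record/Replay with VCR.py (the tool call is illustrative)
import vcr

replay = vcr.VCR(
    cassette_library_dir="tests/fixtures/cassettes",
    record_mode="once",                  # record on first run, replay afterwards
    filter_headers=["authorization"],    # never commit API keys into cassettes
)

@replay.use_cassette("pubmed_search.yaml")
def test_pubmed_search_replay():
    from deepcritical.tools import PubMedTool  # assumed import path
    results = PubMedTool().search("long covid fatigue")
    assert len(results) > 0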
Key Test Cases
Search Loop
- Iteration Flow: Verify agent performs Search -> Judge -> Search loop.
- Tool Selection: Verify correct tools are called based on judge output (mocked judge).
- Context Accumulation: Ensure findings from Iteration 1 are passed to Iteration 2.
MCP Server Integration
- Server Startup: Verify MCP server starts and exposes tools.
- Client Connection: Verify agent can call tools via MCP protocol.
# Example: Testing Search Loop with Mocked Tools
async def test_search_loop_flow():
    agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
    report = await agent.run("test query")
    assert agent.state.iterations > 0
    assert len(report.sources) > 0
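For the MCP cases, a rough sketch assuming the server is launched over stdio via the official mcp Python SDK; the server script path and tool name are placeholders.

# Example: MCP Round-Trip (assumes the `mcp` Python SDK; script path and tool name are placeholders)
import pytest

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

@pytest.mark.integration
async def test_mcp_server_exposes_tools():
    params = StdioServerParameters(command="python", args=["src/mcp_server.py"])  # placeholder path
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            assert any(t.name == "search_pubmed" for t in tools.tools)  # placeholder tool name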
3. End-to-End (E2E) Tests (The "Real Deal")
Location: tests/e2e/
Run against real APIs (with strict rate limits) to verify system integrity. Run these on demand or nightly, not on every commit.
Key Test Cases
The "Golden Query"
Run the primary demo query: "What existing drugs might help treat long COVID fatigue?"
- Success Criteria:
  - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
  - Includes citations from PubMed.
  - Completes within 3 iterations.
  - JSON output matches schema.
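A sketch of the golden-query test, marked e2e so it only runs on demand; the report fields (drug_candidates, sources.source) and the import path are assumptions about the output schema.

# Example: Golden Query E2E (run on demand: pytest -m e2e; field names are assumptions)
import pytest

from deepcritical.agent import ResearchAgent  # assumed import path

@pytest.mark.e2e
async def test_golden_query_long_covid():
    agent = ResearchAgent()                    # real tools, real APIs
    report = await agent.run("What existing drugs might help treat long COVID fatigue?")
    assert len(report.drug_candidates) >= 2    # e.g., CoQ10, LDN
    assert any(s.source == "pubmed" for s in report.sources)
    assert agent.state.iterations <= 3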
Deployment Smoke Test
- Gradio UI: Verify UI launches and accepts input.
- Streaming: Verify generator yields chunks (first chunk within 2s).
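A rough first-chunk latency check, assuming the agent exposes an async stream() generator (hypothetical; adapt to the actual Gradio handler):

# Example: First-Chunk Latency Smoke Test (stream() is a hypothetical async generator)
import time

import pytest

from deepcritical.agent import ResearchAgent  # assumed import path

@pytest.mark.e2e
async def test_first_chunk_within_two_seconds():
    agent = ResearchAgent()
    start = time.monotonic()
    async for chunk in agent.stream("test query"):   # hypothetical streaming API
        assert time.monotonic() - start < 2.0
        break                                        # only the first chunk matters here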
4. Tools & Config
Pytest Configuration
# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: mocked network tests",
    "e2e: real network tests (slow, expensive)"
]
asyncio_mode = "auto"
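Markers are then applied per test; with asyncio_mode = "auto" (pytest-asyncio), plain async def tests run without an explicit asyncio marker.

# Example: Applying Markers
import pytest

@pytest.mark.unit
def test_budget_math():
    assert 50001 > 50000

@pytest.mark.e2e
async def test_real_pubmed_roundtrip():
    ...  # hits the real API; excluded from the default CI run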
CI/CD Pipeline (GitHub Actions)
- Lint: ruff check .
- Type Check: mypy .
- Unit: pytest -m unit
- Integration: pytest -m integration
- E2E: (manual trigger only)
5. Anti-Hallucination Validation
How do we test if the agent is lying?
Citation Check:
- Regex-verify that every [PMID: 12345] in the report exists in the Evidence list.
- Fail if a citation is "orphaned" (hallucinated ID).
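A minimal sketch of the orphan check; the report text and PMID set shown are illustrative.

# Example: Orphan Citation Check (report text and PMID set are illustrative)
import re

def find_orphan_pmids(report_text: str, evidence_pmids: set[str]) -> set[str]:
    # Every cited PMID must appear in the collected Evidence, else it is hallucinated
    cited = set(re.findall(r"\[PMID:\s*(\d+)\]", report_text))
    return cited - evidence_pmids

def test_no_hallucinated_citations():
    report_text = "CoQ10 shows benefit [PMID: 12345]."
    evidence_pmids = {"12345"}
    assert find_orphan_pmids(report_text, evidence_pmids) == set()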
Negative Constraints:
- Test queries for fake diseases ("Ligma syndrome") -> Agent should return "No evidence found".
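And a sketch of the negative-constraint case, run as part of the e2e suite; the agent import path and report attributes are assumptions.

# Example: Fake Disease Negative Constraint (agent/report attribute names are assumptions)
import pytest

from deepcritical.agent import ResearchAgent  # assumed import path

@pytest.mark.e2e
async def test_fake_disease_yields_no_evidence():
    report = await ResearchAgent().run("What drugs treat Ligma syndrome?")
    assert report.drug_candidates == []
    assert "no evidence" in report.summary.lower()  # hypothetical summary field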
Checklist for Implementation
- Set up tests/ directory structure
- Configure pytest and vcrpy
- Create tests/fixtures/ for mock data (PubMed XMLs)
- Write first unit test for ResearchState