Initialization: Verify default config loads correctly.
State Updates: Ensure ResearchState updates correctly (e.g., token counts increment).
Budget Checks: Verify that should_continue() returns False once the token budget is exceeded.
Error Handling: Test partial failure recovery (e.g., one tool fails, agent continues).
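The partial-failure case can be sketched as follows. This is a minimal illustration, not the project's orchestrator: `ToolRunResult`, `run_tools`, and the two stand-in tools are hypothetical names, and tools are assumed to be plain callables that return a list of findings.

```python
from dataclasses import dataclass, field


@dataclass
class ToolRunResult:
    """Outcome of one round of tool calls (hypothetical structure)."""
    evidence: list = field(default_factory=list)
    errors: dict = field(default_factory=dict)


def run_tools(tools: dict, query: str) -> ToolRunResult:
    """Call every tool; record failures instead of aborting the round."""
    result = ToolRunResult()
    for name, tool in tools.items():
        try:
            result.evidence.extend(tool(query))
        except Exception as exc:  # one failing tool must not stop the others
            result.errors[name] = str(exc)
    return result


def failing_tool(query: str) -> list:
    """Stand-in for a tool whose upstream API is down."""
    raise TimeoutError("upstream timed out")


def working_tool(query: str) -> list:
    """Stand-in for a healthy tool."""
    return [f"finding for {query}"]


def test_partial_failure_recovery():
    tools = {"pubmed": failing_tool, "web": working_tool}
    result = run_tools(tools, "long covid")
    # The healthy tool's evidence survives and the failure is recorded.
    assert result.evidence == ["finding for long covid"]
    assert result.errors == {"pubmed": "upstream timed out"}
```

Recording the error per tool, rather than swallowing it, lets a test assert both that the run continued and that the failure stayed visible.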

Tools (Mocked)

Parser Logic: Feed raw XML/JSON to tool parsers and verify the resulting Evidence objects.
Validation: Ensure tools reject invalid queries (empty strings, etc.).
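A parser and validation test might look like the sketch below. The `Evidence` fields, `parse_pubmed_xml`, and `validate_query` are assumptions made for illustration; the real models and parsers may differ. The sample XML is a heavily trimmed imitation of a PubMed record, not a recorded response.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Evidence:
    """Stand-in for the project's Evidence model (fields are assumed)."""
    pmid: str
    title: str


def parse_pubmed_xml(raw_xml: str) -> list[Evidence]:
    """Hypothetical parser: one Evidence per <PubmedArticle> element."""
    root = ET.fromstring(raw_xml)
    return [
        Evidence(
            pmid=article.findtext(".//PMID", default=""),
            title=article.findtext(".//ArticleTitle", default=""),
        )
        for article in root.iter("PubmedArticle")
    ]


def validate_query(query: str) -> str:
    """Hypothetical validator: reject empty or whitespace-only queries."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must not be empty")
    return cleaned


SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article><ArticleTitle>Example title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def test_parser_returns_evidence():
    expected = [Evidence(pmid="12345", title="Example title")]
    assert parse_pubmed_xml(SAMPLE_XML) == expected


def test_empty_query_rejected():
    for bad in ("", "   "):
        try:
            validate_query(bad)
        except ValueError:
            continue
        raise AssertionError(f"query {bad!r} should have been rejected")
```

In a real suite the XML would live under tests/fixtures/ and the rejection check would usually be written with `pytest.raises`.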

Judge Prompts

Schema Compliance: Verify that prompt templates include instructions for a valid JSON output structure.
Variable Injection: Ensure {question} and {context} are injected correctly into prompts.
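Both prompt checks can be done without calling an LLM. The sketch below assumes the templates are plain `str.format` strings; `JUDGE_TEMPLATE` and its JSON keys are invented for illustration, since the real prompt text is not shown here.

```python
import json
import string

# Hypothetical judge template; literal braces are escaped as {{ and }}.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Context: {context}\n"
    'Reply with JSON only, e.g. {{"sufficient": true, "next_queries": []}}'
)


def template_fields(template: str) -> set[str]:
    """Return the placeholder names that str.format would fill in."""
    parsed = string.Formatter().parse(template)
    return {name for _, name, _, _ in parsed if name}


def render_judge_prompt(question: str, context: str) -> str:
    """Inject the question and context into the template."""
    return JUDGE_TEMPLATE.format(question=question, context=context)


def test_variable_injection():
    assert template_fields(JUDGE_TEMPLATE) == {"question", "context"}
    prompt = render_judge_prompt("Q1", "C1")
    assert "Question: Q1" in prompt
    assert "Context: C1" in prompt
    # No unfilled placeholders may survive rendering.
    assert "{question}" not in prompt and "{context}" not in prompt


def test_json_example_is_valid():
    prompt = render_judge_prompt("Q1", "C1")
    example = prompt.split("e.g. ", 1)[1]
    # The JSON example shown to the model must itself parse.
    assert json.loads(example) == {"sufficient": True, "next_queries": []}
```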

# Example: Testing State Logic
def test_budget_stop():
    # One token over the 50,000-token budget: the loop must stop.
    state = ResearchState(tokens_used=50001, max_tokens=50000)
    assert should_continue(state) is False

2. Integration Tests (Realistic & Mocked I/O)

Location: tests/integration/

Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use VCR.py or a record/replay pattern to capture API calls once and replay them on later runs, which saves money and time.
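To make the record/replay idea concrete, here is a hand-rolled sketch using only the standard library. It is not VCR.py's API: `ReplayClient`, the JSON "cassette" layout, and the example URL are all hypothetical, and VCR.py would normally replace this class.

```python
import json
import tempfile
from pathlib import Path


class ReplayClient:
    """Wraps a fetch function; records responses to a JSON file, then replays them."""

    def __init__(self, cassette: Path, fetch):
        self.cassette = cassette
        self.fetch = fetch
        # Load earlier recordings if the cassette file already exists.
        self.recorded = json.loads(cassette.read_text()) if cassette.exists() else {}

    def get(self, url: str) -> str:
        if url not in self.recorded:
            # First run: call the real backend and persist the response.
            self.recorded[url] = self.fetch(url)
            self.cassette.write_text(json.dumps(self.recorded))
        # Later runs: served from disk with no network traffic.
        return self.recorded[url]


def test_second_run_makes_no_network_call():
    calls = []

    def fake_fetch(url: str) -> str:
        calls.append(url)
        return "<xml>payload</xml>"

    with tempfile.TemporaryDirectory() as tmp:
        cassette = Path(tmp) / "search.json"
        url = "https://example.invalid/search?q=test"
        first = ReplayClient(cassette, fake_fetch).get(url)
        second = ReplayClient(cassette, fake_fetch).get(url)

    assert first == second == "<xml>payload</xml>"
    assert calls == [url]  # only the recording run reached fake_fetch
```

Committing the recorded files alongside the tests keeps integration runs deterministic and free of API costs.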

Key Test Cases

Search Loop

Iteration Flow: Verify the agent performs the Search -> Judge -> Search loop.
Tool Selection: Verify correct tools are called based on judge output (mocked judge).
Context Accumulation: Ensure findings from Iteration 1 are passed to Iteration 2.
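Tool selection and context accumulation can both be checked with a scripted judge. The loop below is a simplified stand-in written for this example, not the project's orchestrator; `ScriptedJudge`, `search_loop`, and the decision dictionary keys are assumptions.

```python
import asyncio


class ScriptedJudge:
    """Mock judge that replays fixed decisions and records what it saw."""

    def __init__(self, decisions):
        self.decisions = list(decisions)
        self.seen_contexts = []

    async def assess(self, question, context):
        self.seen_contexts.append(list(context))  # snapshot per iteration
        return self.decisions.pop(0)


async def search_loop(question, tools, judge, max_iterations=3):
    """Hypothetical Search -> Judge loop that follows the judge's tool choice."""
    context, called = [], []
    next_tool = "pubmed"
    for _ in range(max_iterations):
        called.append(next_tool)
        context.extend(await tools[next_tool](question))
        decision = await judge.assess(question, context)
        if decision["done"]:
            break
        next_tool = decision["next_tool"]
    return context, called


async def mock_pubmed(query):
    return ["pubmed finding"]


async def mock_web(query):
    return ["web finding"]


def test_search_loop_accumulates_context():
    judge = ScriptedJudge([{"done": False, "next_tool": "web"}, {"done": True}])
    tools = {"pubmed": mock_pubmed, "web": mock_web}
    context, called = asyncio.run(search_loop("test query", tools, judge))
    # The judge's output picked the second tool.
    assert called == ["pubmed", "web"]
    # Iteration 2 saw the findings from iteration 1 plus its own.
    assert judge.seen_contexts == [
        ["pubmed finding"],
        ["pubmed finding", "web finding"],
    ]
    assert context == ["pubmed finding", "web finding"]
```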

MCP Server Integration

Server Startup: Verify MCP server starts and exposes tools.
Client Connection: Verify agent can call tools via MCP protocol.

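# Example: Testing Search Loop with Mocked Tools
# With asyncio_mode = "auto", pytest-asyncio runs async tests without a decorator.
async def test_search_loop_flow():
    agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
    report = await agent.run("test query")
    # At least one iteration ran and the report cites at least one source.
    assert agent.state.iterations > 0
    assert len(report.sources) > 0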

3. End-to-End (E2E) Tests (The "Real Deal")

Location: tests/e2e/

Run against real APIs (with strict rate limits) to verify system integrity. Run these on demand or nightly, not on every commit.

Key Test Cases

The "Golden Query"

Run the primary demo query: "What existing drugs might help treat long COVID fatigue?"

Success Criteria:
- Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
- Includes citations from PubMed.
- Completes within 3 iterations.
- JSON output matches schema.
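The success criteria above can be turned into a single checker that reports every failed criterion. This is a sketch under assumptions: the report keys (`candidates`, `citations`, `iterations`) and the `source` field are invented here, so adapt them to the actual output schema.

```python
import json

# Assumed top-level keys of the report; the real schema may differ.
REQUIRED_KEYS = {"candidates", "citations", "iterations"}


def check_golden_report(raw_json: str) -> list[str]:
    """Return the failed criteria; an empty list means the run passed."""
    try:
        report = json.loads(raw_json)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(report, dict) or not REQUIRED_KEYS.issubset(report):
        return ["output does not match schema"]

    failures = []
    if len(report["candidates"]) < 2:
        failures.append("fewer than 2 drug candidates")
    if not any(c.get("source") == "pubmed" for c in report["citations"]):
        failures.append("no PubMed citation")
    if report["iterations"] > 3:
        failures.append("took more than 3 iterations")
    return failures


def test_passing_report():
    sample = json.dumps({
        "candidates": ["CoQ10", "LDN"],
        "citations": [{"source": "pubmed", "pmid": "12345"}],
        "iterations": 2,
    })
    assert check_golden_report(sample) == []


def test_failing_report_lists_every_problem():
    sample = json.dumps({"candidates": ["CoQ10"], "citations": [], "iterations": 4})
    assert check_golden_report(sample) == [
        "fewer than 2 drug candidates",
        "no PubMed citation",
        "took more than 3 iterations",
    ]
```

Returning a list of failures rather than asserting on the first one makes a nightly E2E report easier to read when several criteria regress at once.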

Deployment Smoke Test

Gradio UI: Verify UI launches and accepts input.
Streaming: Verify the response generator yields chunks, with the first chunk arriving within 2 s.
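The first-chunk latency check can be sketched as below. `fake_stream` is a hypothetical stand-in for the app's streaming generator; a real smoke test would pass the deployed handler's generator instead.

```python
import time


def time_to_first_chunk(stream) -> tuple[float, object]:
    """Return (seconds until the first chunk arrived, the chunk itself)."""
    start = time.monotonic()
    first = next(iter(stream))
    return time.monotonic() - start, first


def fake_stream():
    """Stand-in for the app's streaming generator (hypothetical)."""
    yield "Searching PubMed..."
    time.sleep(0.01)
    yield "Judging evidence..."


def test_first_chunk_within_two_seconds():
    elapsed, chunk = time_to_first_chunk(fake_stream())
    assert chunk == "Searching PubMed..."
    assert elapsed < 2.0
```

`time.monotonic()` is used instead of `time.time()` so the measurement is unaffected by system clock adjustments.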

4. Tools & Config

Pytest Configuration

# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: mocked network tests",
    "e2e: real network tests (slow, expensive)"
]
asyncio_mode = "auto"

CI/CD Pipeline (GitHub Actions)

Lint: ruff check .
Type Check: mypy .
Unit: pytest -m unit
Integration: pytest -m integration
E2E: pytest -m e2e (manual or nightly trigger only, never on every commit)

5. Anti-Hallucination Validation

How do we test if the agent is lying?

Citation Check:
- Use a regex to extract every citation of the form [PMID: 12345] from the report and verify that each ID exists in the Evidence list.
- Fail if a citation is "orphaned" (hallucinated ID).
Negative Constraints:
- Query for a nonexistent disease (e.g., "Ligma syndrome"); the agent should return "No evidence found" rather than inventing results.
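The citation check reduces to a set difference between cited and retrieved IDs. A minimal sketch, assuming citations appear in the report exactly as `[PMID: <digits>]` and that the Evidence list can be reduced to a set of PMID strings; `orphaned_citations` is a hypothetical helper name.

```python
import re

# Matches citations such as "[PMID: 12345]" and captures the numeric ID.
PMID_PATTERN = re.compile(r"\[PMID:\s*(\d+)\]")


def orphaned_citations(report_text: str, evidence_pmids: set[str]) -> set[str]:
    """Return PMIDs cited in the report that are absent from the evidence."""
    cited = set(PMID_PATTERN.findall(report_text))
    return cited - evidence_pmids


def test_detects_hallucinated_pmid():
    report = "Claim A [PMID: 12345]. Claim B [PMID: 99999]."
    assert orphaned_citations(report, {"12345"}) == {"99999"}


def test_clean_report_passes():
    assert orphaned_citations("See [PMID: 12345].", {"12345"}) == set()
```

A test over a full report would then fail whenever `orphaned_citations` returns a non-empty set. This catches invented IDs only; it does not verify that a cited paper actually supports the claim attached to it.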

Checklist for Implementation

Set up tests/ directory structure
Configure pytest and vcrpy
Create tests/fixtures/ for mock data (sample PubMed XML responses)
Write first unit test for ResearchState