# Testing Strategy
## Ensuring DeepCritical Is Ironclad
---
## Overview
Our testing strategy follows a strict **Pyramid of Reliability**:
1. **Unit Tests**: Fast, isolated logic checks (60% of tests)
2. **Integration Tests**: Tool interactions & Agent loops (30% of tests)
3. **E2E / Regression Tests**: Full research workflows (10% of tests)
**Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.
---
## 1. Unit Tests (Fast & Cheap)
**Location**: `tests/unit/`
Focus on individual components without external network calls. Mock everything.
### Key Test Cases
#### Agent Logic
- **Initialization**: Verify default config loads correctly.
- **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment).
- **Budget Checks**: Test `should_continue()` returns `False` when budget exceeded.
- **Error Handling**: Test partial failure recovery (e.g., one tool fails, agent continues).
#### Tools (Mocked)
- **Parser Logic**: Feed raw XML/JSON to tool parsers and verify `Evidence` objects.
- **Validation**: Ensure tools reject invalid queries (empty strings, etc.).
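The parser checks above can be sketched as a small unit test. The `Evidence` dataclass and `parse_pubmed_xml` helper below are hypothetical stand-ins defined inline for illustration; the real project's parser and fixture XML will differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Evidence:
    pmid: str
    title: str


def parse_pubmed_xml(raw: str) -> list[Evidence]:
    # Hypothetical parser: pull (PMID, title) pairs out of a PubMed-style payload.
    root = ET.fromstring(raw)
    return [
        Evidence(pmid=a.findtext("PMID", ""), title=a.findtext("Title", ""))
        for a in root.findall(".//Article")
    ]


def test_parser_builds_evidence():
    raw = "<Set><Article><PMID>12345</PMID><Title>CoQ10 trial</Title></Article></Set>"
    assert parse_pubmed_xml(raw) == [Evidence(pmid="12345", title="CoQ10 trial")]


test_parser_builds_evidence()
```

Feeding canned XML from `tests/fixtures/` instead of an inline string keeps these tests fast while exercising the exact payloads the real API returns.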
#### Judge Prompts
- **Schema Compliance**: Verify prompt templates generate valid JSON structure instructions.
- **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts.
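A minimal sketch of the variable-injection check, assuming a hypothetical `PROMPT_TEMPLATE` and `render_judge_prompt` helper; note the doubled braces, which keep literal JSON instructions intact through `str.format`:

```python
PROMPT_TEMPLATE = (
    "You are a research judge. Question: {question}\n"
    "Context so far: {context}\n"
    'Respond with JSON: {{"sufficient": bool, "next_queries": [str]}}'
)


def render_judge_prompt(question: str, context: str) -> str:
    return PROMPT_TEMPLATE.format(question=question, context=context)


def test_variable_injection():
    prompt = render_judge_prompt("What treats long COVID fatigue?", "3 abstracts")
    assert "What treats long COVID fatigue?" in prompt
    assert "3 abstracts" in prompt
    # Escaped {{ }} survive .format(), so the JSON schema instruction is intact.
    assert '{"sufficient"' in prompt


test_variable_injection()
```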
```python
# Example: Testing State Logic
def test_budget_stop():
    state = ResearchState(tokens_used=50001, max_tokens=50000)
    assert should_continue(state) is False
```
---
## 2. Integration Tests (Realistic & Mocked I/O)
**Location**: `tests/integration/`
Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or a similar record/replay pattern to capture real API responses once and replay them in later runs, saving both money and time.
### Key Test Cases
#### Search Loop
- **Iteration Flow**: Verify agent performs Search -> Judge -> Search loop.
- **Tool Selection**: Verify correct tools are called based on judge output (mocked judge).
- **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2.
#### MCP Server Integration
- **Server Startup**: Verify MCP server starts and exposes tools.
- **Client Connection**: Verify agent can call tools via MCP protocol.
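The MCP checks above can be sketched against an in-memory stub. `StubMCPServer` below is a hypothetical stand-in for the real MCP server (which would use the actual MCP SDK and transport); the stub only demonstrates the test shape — startup exposes tools, and a client can call them by name.

```python
import asyncio


class StubMCPServer:
    """In-memory stand-in for an MCP server exposing named tools."""

    def __init__(self):
        self._tools = {}

    def tool(self, name):
        def register(fn):
            self._tools[name] = fn
            return fn
        return register

    def list_tools(self):
        return sorted(self._tools)

    async def call_tool(self, name, **kwargs):
        return await self._tools[name](**kwargs)


server = StubMCPServer()


@server.tool("search_pubmed")
async def search_pubmed(query: str) -> list[str]:
    return [f"PMID: 12345 ({query})"]


async def test_client_can_call_tools():
    # "Server startup": the tool is registered and discoverable.
    assert "search_pubmed" in server.list_tools()
    # "Client connection": calling through the server yields results.
    results = await server.call_tool("search_pubmed", query="long covid")
    assert results and results[0].startswith("PMID:")


asyncio.run(test_client_can_call_tools())
```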
```python
# Example: Testing Search Loop with Mocked Tools
async def test_search_loop_flow():
    agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
    report = await agent.run("test query")
    assert agent.state.iterations > 0
    assert len(report.sources) > 0
```
---
## 3. End-to-End (E2E) Tests (The "Real Deal")
**Location**: `tests/e2e/`
Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit.
### Key Test Cases
#### The "Golden Query"
Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"*
- **Success Criteria**:
  - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
  - Includes citations from PubMed.
  - Completes within 3 iterations.
  - JSON output matches schema.
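The success criteria above can be encoded as a single validation helper, so the E2E test reports every failed criterion at once instead of stopping at the first assertion. The `Report` fields (`drug_candidates`, `sources`, `iterations`, `raw_json`) are hypothetical names; the stubbed report stands in for a real agent run.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Report:
    drug_candidates: list[str]
    sources: list[str]  # e.g. "PMID: 12345"
    iterations: int
    raw_json: str


def check_golden_query(report: Report) -> list[str]:
    """Return the list of failed success criteria (empty list = pass)."""
    failures = []
    if len(report.drug_candidates) < 2:
        failures.append("fewer than 2 drug candidates")
    if not any(re.match(r"PMID: \d+", s) for s in report.sources):
        failures.append("no PubMed citations")
    if report.iterations > 3:
        failures.append("too many iterations")
    try:
        json.loads(report.raw_json)
    except json.JSONDecodeError:
        failures.append("output is not valid JSON")
    return failures


# Stubbed report standing in for a real agent run:
report = Report(
    drug_candidates=["CoQ10", "LDN"],
    sources=["PMID: 33532785"],
    iterations=2,
    raw_json='{"candidates": ["CoQ10", "LDN"]}',
)
assert check_golden_query(report) == []
```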
#### Deployment Smoke Test
- **Gradio UI**: Verify UI launches and accepts input.
- **Streaming**: Verify generator yields chunks (first chunk within 2s).
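The streaming check can be sketched by timing the first chunk from the async generator. `stream_report` below is a stub standing in for the real streaming backend; the timing logic is what the actual test would keep.

```python
import asyncio
import time


async def stream_report(query: str):
    # Stub generator standing in for the real streaming backend.
    for chunk in ("Searching PubMed...", "Found 3 papers.", "Done."):
        await asyncio.sleep(0.01)
        yield chunk


async def test_first_chunk_latency():
    start = time.monotonic()
    first = None
    async for chunk in stream_report("test query"):
        first = chunk
        break  # only the first chunk matters for the latency check
    assert first  # a non-empty chunk arrived
    assert time.monotonic() - start < 2.0  # within the 2s budget


asyncio.run(test_first_chunk_latency())
```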
---
## 4. Tools & Config
### Pytest Configuration
```toml
# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: mocked network tests",
    "e2e: real network tests (slow, expensive)",
]
asyncio_mode = "auto"
```
### CI/CD Pipeline (GitHub Actions)
1. **Lint**: `ruff check .`
2. **Type Check**: `mypy .`
3. **Unit**: `pytest -m unit`
4. **Integration**: `pytest -m integration`
5. **E2E**: (Manual trigger only)
---
## 5. Anti-Hallucination Validation
How do we test if the agent is lying?
1. **Citation Check**:
   - Use a regex to verify that every `[PMID: 12345]` citation in the report corresponds to an entry in the `Evidence` list.
   - Fail if a citation is "orphaned" (a hallucinated ID).
2. **Negative Constraints**:
   - Query for fake diseases ("Ligma syndrome") -> the agent should return "No evidence found" rather than fabricating results.
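The citation check above can be sketched as a pure function: extract every cited PMID with a regex and subtract the PMIDs actually collected as evidence. Names here (`orphaned_citations`, the sample report text) are illustrative.

```python
import re


def orphaned_citations(report_text: str, evidence_pmids: set[str]) -> set[str]:
    """Return PMIDs cited in the report but absent from the collected Evidence."""
    cited = set(re.findall(r"\[PMID:\s*(\d+)\]", report_text))
    return cited - evidence_pmids


report = "CoQ10 improved fatigue [PMID: 12345]; LDN also helped [PMID: 99999]."
evidence = {"12345"}
assert orphaned_citations(report, evidence) == {"99999"}  # hallucinated ID caught
```

A test then simply asserts the returned set is empty; any non-empty result fails the run and names the hallucinated IDs.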
---
## Checklist for Implementation
- [ ] Set up `tests/` directory structure
- [ ] Configure `pytest` and `vcrpy`
- [ ] Create `tests/fixtures/` for mock data (PubMed XMLs)
- [ ] Write first unit test for `ResearchState`