Phase 3 Implementation Spec: Judge Vertical Slice

Goal: Implement the "Brain" of the agent, the component that evaluates evidence quality.
Philosophy: "Structured Output or Bust."
Prerequisite: Phase 2 complete (all search tests passing).


1. The Slice Definition

This slice covers:

  1. Input: A user question + a list of Evidence (from Phase 2).
  2. Process (sketched in code after this list):
    • Construct a prompt with the evidence.
    • Call LLM (PydanticAI / OpenAI / Anthropic).
    • Force JSON structured output.
  3. Output: A JudgeAssessment object.
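
In code, those three steps collapse into a single async call (a minimal sketch; every name here is implemented later in this phase):

"""Sketch of the slice, end to end; all names are defined in the sections below."""
from src.agent_factory.judges import JudgeHandler
from src.utils.models import Evidence, JudgeAssessment


async def judge_slice(question: str, evidence: list[Evidence]) -> JudgeAssessment:
    handler = JudgeHandler()  # wraps a PydanticAI Agent (Section 4)
    # Prompt construction, the LLM call, and JSON enforcement all happen inside assess().
    return await handler.assess(question, evidence)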

Files to Create:

  • src/utils/models.py - Add JudgeAssessment models (extend from Phase 2)
  • src/prompts/judge.py - Judge prompt templates
  • src/agent_factory/judges.py - JudgeHandler with PydanticAI
  • tests/unit/agent_factory/test_judges.py - Unit tests

2. Models (Add to src/utils/models.py)

The output schema must be strict for reliable structured output.

"""Add these models to src/utils/models.py (after Evidence models from Phase 2)."""
from pydantic import BaseModel, Field
from typing import List, Literal


class AssessmentDetails(BaseModel):
    """Detailed assessment of evidence quality."""

    mechanism_score: int = Field(
        ...,
        ge=0,
        le=10,
        description="How well does the evidence explain the mechanism? 0-10"
    )
    mechanism_reasoning: str = Field(
        ...,
        min_length=10,
        description="Explanation of mechanism score"
    )
    clinical_evidence_score: int = Field(
        ...,
        ge=0,
        le=10,
        description="Strength of clinical/preclinical evidence. 0-10"
    )
    clinical_reasoning: str = Field(
        ...,
        min_length=10,
        description="Explanation of clinical evidence score"
    )
    drug_candidates: List[str] = Field(
        default_factory=list,
        description="List of specific drug candidates mentioned"
    )
    key_findings: List[str] = Field(
        default_factory=list,
        description="Key findings from the evidence"
    )


class JudgeAssessment(BaseModel):
    """Complete assessment from the Judge."""

    details: AssessmentDetails
    sufficient: bool = Field(
        ...,
        description="Is evidence sufficient to provide a recommendation?"
    )
    confidence: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Confidence in the assessment (0-1)"
    )
    recommendation: Literal["continue", "synthesize"] = Field(
        ...,
        description="continue = need more evidence, synthesize = ready to answer"
    )
    next_search_queries: List[str] = Field(
        default_factory=list,
        description="If continue, what queries to search next"
    )
    reasoning: str = Field(
        ...,
        min_length=20,
        description="Overall reasoning for the recommendation"
    )
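
Because every field carries constraints (ge/le ranges, min_length, a Literal for recommendation), malformed LLM output fails validation instead of leaking downstream. A quick sanity check once the models are in place (assumes pydantic v2, which the rest of the stack already uses):

"""Hypothetical check: the schema rejects out-of-range or too-short fields."""
from pydantic import ValidationError

from src.utils.models import JudgeAssessment

bad_payload = {
    "details": {
        "mechanism_score": 42,           # violates le=10
        "mechanism_reasoning": "short",  # violates min_length=10
        "clinical_evidence_score": 5,
        "clinical_reasoning": "Preclinical data only, no human trials yet.",
    },
    "sufficient": False,
    "confidence": 0.2,
    "recommendation": "continue",
    "reasoning": "More clinical evidence is needed before synthesizing.",
}

try:
    JudgeAssessment.model_validate(bad_payload)
except ValidationError as exc:
    print(exc)  # reports both constraint violations in details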

3. Prompt Engineering (src/prompts/judge.py)

We treat prompts as code. They should be versioned and clean.

"""Judge prompts for evidence assessment."""
from typing import List
from src.utils.models import Evidence


SYSTEM_PROMPT = """You are an expert drug repurposing research judge.

Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to recommend drug candidates for a given condition.

## Evaluation Criteria

1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
   - 0-3: No clear mechanism, speculative
   - 4-6: Some mechanistic insight, but gaps exist
   - 7-10: Clear, well-supported mechanism of action

2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
   - 0-3: No clinical data, only theoretical
   - 4-6: Preclinical or early clinical data
   - 7-10: Strong clinical evidence (trials, meta-analyses)

3. **Sufficiency**: Evidence is sufficient when:
   - Combined scores >= 12 AND
   - At least one specific drug candidate identified AND
   - Clear mechanistic rationale exists

## Output Rules

- Always output valid JSON matching the schema
- Be conservative: only recommend "synthesize" when truly confident
- If continuing, suggest specific, actionable search queries
- Never hallucinate drug names or findings not in the evidence
"""


def format_user_prompt(question: str, evidence: List[Evidence]) -> str:
    """
    Format the user prompt with question and evidence.

    Args:
        question: The user's research question
        evidence: List of Evidence objects from search

    Returns:
        Formatted prompt string
    """
    evidence_text = "\n\n".join(
        f"### Evidence {i+1}\n"
        f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
        f"**URL**: {e.citation.url}\n"
        f"**Date**: {e.citation.date}\n"
        f"**Content**:\n"
        # Truncate long evidence so the prompt stays within a reasonable size budget.
        f"{e.content[:1500] + '...' if len(e.content) > 1500 else e.content}"
        for i, e in enumerate(evidence)
    )

    return f"""## Research Question
{question}

## Available Evidence ({len(evidence)} sources)

{evidence_text}

## Your Task

Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates.
Respond with a JSON object matching the JudgeAssessment schema.
"""


def format_empty_evidence_prompt(question: str) -> str:
    """
    Format prompt when no evidence was found.

    Args:
        question: The user's research question

    Returns:
        Formatted prompt string
    """
    return f"""## Research Question
{question}

## Available Evidence

No evidence was found from the search.

## Your Task

Since no evidence was found, recommend search queries that might yield better results.
Set sufficient=False and recommendation="continue".
Suggest 3-5 specific search queries.
"""

4. JudgeHandler Implementation (src/agent_factory/judges.py)

Using PydanticAI for structured output with retry logic.

"""Judge handler for evidence assessment using PydanticAI."""
import os
from typing import List
import structlog
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.models.anthropic import AnthropicModel

from src.utils.models import Evidence, JudgeAssessment, AssessmentDetails
from src.utils.config import settings
from src.prompts.judge import SYSTEM_PROMPT, format_user_prompt, format_empty_evidence_prompt

logger = structlog.get_logger()


def get_model():
    """Get the LLM model based on configuration."""
    provider = getattr(settings, "llm_provider", "openai")

    if provider == "anthropic":
        return AnthropicModel(
            model_name=getattr(settings, "anthropic_model", "claude-3-5-sonnet-20241022"),
            api_key=os.getenv("ANTHROPIC_API_KEY"),
        )
    else:
        return OpenAIModel(
            model_name=getattr(settings, "openai_model", "gpt-4o"),
            api_key=os.getenv("OPENAI_API_KEY"),
        )


class JudgeHandler:
    """
    Handles evidence assessment using an LLM with structured output.

    Uses PydanticAI to ensure responses match the JudgeAssessment schema.
    """

    def __init__(self, model=None):
        """
        Initialize the JudgeHandler.

        Args:
            model: Optional PydanticAI model. If None, uses config default.
        """
        self.model = model or get_model()
        self.agent = Agent(
            model=self.model,
            result_type=JudgeAssessment,
            system_prompt=SYSTEM_PROMPT,
            retries=3,
        )

    async def assess(
        self,
        question: str,
        evidence: List[Evidence],
    ) -> JudgeAssessment:
        """
        Assess evidence and determine if it's sufficient.

        Args:
            question: The user's research question
            evidence: List of Evidence objects from search

        Returns:
            JudgeAssessment with evaluation results

        Raises:
            JudgeError: If assessment fails after retries
        """
        logger.info(
            "Starting evidence assessment",
            question=question[:100],
            evidence_count=len(evidence),
        )

        # Format the prompt based on whether we have evidence
        if evidence:
            user_prompt = format_user_prompt(question, evidence)
        else:
            user_prompt = format_empty_evidence_prompt(question)

        try:
            # Run the agent with structured output
            result = await self.agent.run(user_prompt)
            assessment = result.data

            logger.info(
                "Assessment complete",
                sufficient=assessment.sufficient,
                recommendation=assessment.recommendation,
                confidence=assessment.confidence,
            )

            return assessment

        except Exception as e:
            logger.error("Assessment failed", error=str(e))
            # Return a safe default assessment on failure
            return self._create_fallback_assessment(question, str(e))

    def _create_fallback_assessment(
        self,
        question: str,
        error: str,
    ) -> JudgeAssessment:
        """
        Create a fallback assessment when LLM fails.

        Args:
            question: The original question
            error: The error message

        Returns:
            Safe fallback JudgeAssessment
        """
        return JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=0,
                mechanism_reasoning="Assessment failed due to LLM error",
                clinical_evidence_score=0,
                clinical_reasoning="Assessment failed due to LLM error",
                drug_candidates=[],
                key_findings=[],
            ),
            sufficient=False,
            confidence=0.0,
            recommendation="continue",
            next_search_queries=[
                f"{question} mechanism",
                f"{question} clinical trials",
                f"{question} drug candidates",
            ],
            reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.",
        )


class MockJudgeHandler:
    """
    Mock JudgeHandler for testing without LLM calls.

    Use this in unit tests to avoid API calls.
    """

    def __init__(self, mock_response: JudgeAssessment | None = None):
        """
        Initialize with optional mock response.

        Args:
            mock_response: The assessment to return. If None, uses default.
        """
        self.mock_response = mock_response
        self.call_count = 0
        self.last_question = None
        self.last_evidence = None

    async def assess(
        self,
        question: str,
        evidence: List[Evidence],
    ) -> JudgeAssessment:
        """Return the mock response."""
        self.call_count += 1
        self.last_question = question
        self.last_evidence = evidence

        if self.mock_response:
            return self.mock_response

        # Default mock response
        return JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=7,
                mechanism_reasoning="Mock assessment - good mechanism evidence",
                clinical_evidence_score=6,
                clinical_reasoning="Mock assessment - moderate clinical evidence",
                drug_candidates=["Drug A", "Drug B"],
                key_findings=["Finding 1", "Finding 2"],
            ),
            sufficient=len(evidence) >= 3,
            confidence=0.75,
            recommendation="synthesize" if len(evidence) >= 3 else "continue",
            next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [],
            reasoning="Mock assessment for testing purposes",
        )

5. TDD Workflow

Test File: tests/unit/agent_factory/test_judges.py

"""Unit tests for JudgeHandler."""
import pytest
from unittest.mock import AsyncMock, MagicMock, patch

from src.utils.models import (
    Evidence,
    Citation,
    JudgeAssessment,
    AssessmentDetails,
)


class TestJudgeHandler:
    """Tests for JudgeHandler."""

    @pytest.mark.asyncio
    async def test_assess_returns_assessment(self):
        """JudgeHandler should return JudgeAssessment from LLM."""
        from src.agent_factory.judges import JudgeHandler

        # Create mock assessment
        mock_assessment = JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=8,
                mechanism_reasoning="Strong mechanistic evidence",
                clinical_evidence_score=7,
                clinical_reasoning="Good clinical support",
                drug_candidates=["Metformin"],
                key_findings=["Neuroprotective effects"],
            ),
            sufficient=True,
            confidence=0.85,
            recommendation="synthesize",
            next_search_queries=[],
            reasoning="Evidence is sufficient for synthesis",
        )

        # Mock the PydanticAI agent
        mock_result = MagicMock()
        mock_result.data = mock_assessment

        with patch("src.agent_factory.judges.Agent") as mock_agent_class:
            mock_agent = AsyncMock()
            mock_agent.run = AsyncMock(return_value=mock_result)
            mock_agent_class.return_value = mock_agent

            # Pass a dummy model so get_model() never builds a real client (no API key needed)
            handler = JudgeHandler(model=MagicMock())
            # Replace the agent with our mock
            handler.agent = mock_agent

            evidence = [
                Evidence(
                    content="Metformin shows neuroprotective properties...",
                    citation=Citation(
                        source="pubmed",
                        title="Metformin in AD",
                        url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                        date="2024-01-01",
                    ),
                )
            ]

            result = await handler.assess("metformin alzheimer", evidence)

            assert result.sufficient is True
            assert result.recommendation == "synthesize"
            assert result.confidence == 0.85
            assert "Metformin" in result.details.drug_candidates

    @pytest.mark.asyncio
    async def test_assess_empty_evidence(self):
        """JudgeHandler should handle empty evidence gracefully."""
        from src.agent_factory.judges import JudgeHandler

        mock_assessment = JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=0,
                mechanism_reasoning="No evidence to assess",
                clinical_evidence_score=0,
                clinical_reasoning="No evidence to assess",
                drug_candidates=[],
                key_findings=[],
            ),
            sufficient=False,
            confidence=0.0,
            recommendation="continue",
            next_search_queries=["metformin alzheimer mechanism"],
            reasoning="No evidence found, need to search more",
        )

        mock_result = MagicMock()
        mock_result.data = mock_assessment

        with patch("src.agent_factory.judges.Agent") as mock_agent_class:
            mock_agent = AsyncMock()
            mock_agent.run = AsyncMock(return_value=mock_result)
            mock_agent_class.return_value = mock_agent

            handler = JudgeHandler(model=MagicMock())  # dummy model: no real API key required
            handler.agent = mock_agent

            result = await handler.assess("metformin alzheimer", [])

            assert result.sufficient is False
            assert result.recommendation == "continue"
            assert len(result.next_search_queries) > 0

    @pytest.mark.asyncio
    async def test_assess_handles_llm_failure(self):
        """JudgeHandler should return fallback on LLM failure."""
        from src.agent_factory.judges import JudgeHandler

        with patch("src.agent_factory.judges.Agent") as mock_agent_class:
            mock_agent = AsyncMock()
            mock_agent.run = AsyncMock(side_effect=Exception("API Error"))
            mock_agent_class.return_value = mock_agent

            handler = JudgeHandler(model=MagicMock())  # dummy model: no real API key required
            handler.agent = mock_agent

            evidence = [
                Evidence(
                    content="Some content",
                    citation=Citation(
                        source="pubmed",
                        title="Title",
                        url="url",
                        date="2024",
                    ),
                )
            ]

            result = await handler.assess("test question", evidence)

            # Should return fallback, not raise
            assert result.sufficient is False
            assert result.recommendation == "continue"
            assert "failed" in result.reasoning.lower()


class TestMockJudgeHandler:
    """Tests for MockJudgeHandler."""

    @pytest.mark.asyncio
    async def test_mock_handler_returns_default(self):
        """MockJudgeHandler should return default assessment."""
        from src.agent_factory.judges import MockJudgeHandler

        handler = MockJudgeHandler()

        evidence = [
            Evidence(
                content="Content 1",
                citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
            ),
            Evidence(
                content="Content 2",
                citation=Citation(source="web", title="T2", url="u2", date="2024"),
            ),
        ]

        result = await handler.assess("test", evidence)

        assert handler.call_count == 1
        assert handler.last_question == "test"
        assert len(handler.last_evidence) == 2
        assert result.details.mechanism_score == 7

    @pytest.mark.asyncio
    async def test_mock_handler_custom_response(self):
        """MockJudgeHandler should return custom response when provided."""
        from src.agent_factory.judges import MockJudgeHandler

        custom_assessment = JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=10,
                mechanism_reasoning="Custom reasoning",
                clinical_evidence_score=10,
                clinical_reasoning="Custom clinical",
                drug_candidates=["CustomDrug"],
                key_findings=["Custom finding"],
            ),
            sufficient=True,
            confidence=1.0,
            recommendation="synthesize",
            next_search_queries=[],
            reasoning="Custom assessment",
        )

        handler = MockJudgeHandler(mock_response=custom_assessment)
        result = await handler.assess("test", [])

        assert result.details.mechanism_score == 10
        assert result.details.drug_candidates == ["CustomDrug"]

    @pytest.mark.asyncio
    async def test_mock_handler_insufficient_with_few_evidence(self):
        """MockJudgeHandler should recommend continue with < 3 evidence."""
        from src.agent_factory.judges import MockJudgeHandler

        handler = MockJudgeHandler()

        # Only 2 pieces of evidence
        evidence = [
            Evidence(
                content="Content",
                citation=Citation(source="pubmed", title="T", url="u", date="2024"),
            ),
            Evidence(
                content="Content 2",
                citation=Citation(source="web", title="T2", url="u2", date="2024"),
            ),
        ]

        result = await handler.assess("test", evidence)

        assert result.sufficient is False
        assert result.recommendation == "continue"
        assert len(result.next_search_queries) > 0
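
The sample Evidence objects above get repetitive; a shared pytest fixture is one way to trim that (a sketch; the conftest.py location, fixture name, and sample content are assumptions, not part of the spec):

"""Sketch for tests/unit/agent_factory/conftest.py."""
import pytest

from src.utils.models import Citation, Evidence


@pytest.fixture
def sample_evidence() -> list[Evidence]:
    """Two pieces of illustrative evidence, enough for most JudgeHandler tests."""
    return [
        Evidence(
            content="Metformin shows neuroprotective properties...",
            citation=Citation(
                source="pubmed",
                title="Metformin in AD",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-01",
            ),
        ),
        Evidence(
            content="Additional mechanistic discussion of metformin in AD models.",
            citation=Citation(
                source="web",
                title="Metformin review",
                url="https://example.org/metformin-review",
                date="2024",
            ),
        ),
    ]

Tests that opt in simply take sample_evidence as an argument instead of building Evidence inline.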

6. Dependencies

Add to pyproject.toml:

[project]
dependencies = [
    # ... existing deps ...
    "pydantic-ai>=0.0.16",
    "openai>=1.0.0",
    "anthropic>=0.18.0",
]
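
After uv sync, a quick way to confirm the new packages resolved (a hedged sketch; uses only importlib.metadata from the standard library):

"""Smoke test: the three new dependencies are installed and resolvable."""
from importlib.metadata import version

for pkg in ("pydantic-ai", "openai", "anthropic"):
    print(pkg, version(pkg))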

7. Configuration (src/utils/config.py)

Add LLM configuration:

"""Add to src/utils/config.py."""
from pydantic_settings import BaseSettings
from typing import Literal


class Settings(BaseSettings):
    """Application settings."""

    # LLM Configuration
    llm_provider: Literal["openai", "anthropic"] = "openai"
    openai_model: str = "gpt-4o"
    anthropic_model: str = "claude-3-5-sonnet-20241022"

    # API Keys (loaded from environment)
    openai_api_key: str | None = None
    anthropic_api_key: str | None = None
    ncbi_api_key: str | None = None

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"


settings = Settings()
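
Because these are pydantic settings, switching providers is an environment change rather than a code change (env var names match field names case-insensitively by default). A quick check, assuming the fields above:

"""Hypothetical check that provider selection follows the environment."""
import os

os.environ["LLM_PROVIDER"] = "anthropic"  # or put it in .env

from src.utils.config import Settings

settings = Settings()
print(settings.llm_provider)     # -> "anthropic"
print(settings.anthropic_model)  # -> "claude-3-5-sonnet-20241022"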

8. Implementation Checklist

  • Add AssessmentDetails and JudgeAssessment models to src/utils/models.py
  • Create src/prompts/__init__.py (empty, for package)
  • Create src/prompts/judge.py with prompt templates
  • Create src/agent_factory/__init__.py with exports (see the sketch after this checklist)
  • Implement src/agent_factory/judges.py with JudgeHandler
  • Update src/utils/config.py with LLM settings
  • Create tests/unit/agent_factory/__init__.py
  • Write tests in tests/unit/agent_factory/test_judges.py
  • Run uv run pytest tests/unit/agent_factory/ -v; ALL TESTS MUST PASS
  • Commit: git commit -m "feat: phase 3 judge slice complete"
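
A minimal sketch of the src/agent_factory/__init__.py exports (exact contents are up to you; only the two handler classes exist in this phase):

"""src/agent_factory package exports (sketch)."""
from src.agent_factory.judges import JudgeHandler, MockJudgeHandler

__all__ = ["JudgeHandler", "MockJudgeHandler"]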

9. Definition of Done

Phase 3 is COMPLETE when:

  1. All unit tests pass: uv run pytest tests/unit/agent_factory/ -v
  2. JudgeHandler can assess evidence and return structured output
  3. Graceful degradation: if LLM fails, returns safe fallback
  4. MockJudgeHandler works for testing without API calls
  5. Can run this in Python REPL:
import asyncio
import os
from src.utils.models import Evidence, Citation
from src.agent_factory.judges import JudgeHandler, MockJudgeHandler

# Test with mock (no API key needed)
async def test_mock():
    handler = MockJudgeHandler()
    evidence = [
        Evidence(
            content="Metformin shows neuroprotective effects in AD models",
            citation=Citation(
                source="pubmed",
                title="Metformin and Alzheimer's",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-01",
            ),
        ),
    ]
    result = await handler.assess("metformin alzheimer", evidence)
    print(f"Sufficient: {result.sufficient}")
    print(f"Recommendation: {result.recommendation}")
    print(f"Drug candidates: {result.details.drug_candidates}")

asyncio.run(test_mock())

# Test with real LLM (requires API key)
async def test_real():
    os.environ["OPENAI_API_KEY"] = "your-key-here"  # Or set in .env
    handler = JudgeHandler()
    evidence = [
        Evidence(
            content="Metformin shows neuroprotective effects in AD models...",
            citation=Citation(
                source="pubmed",
                title="Metformin and Alzheimer's",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-01",
            ),
        ),
    ]
    result = await handler.assess("metformin alzheimer", evidence)
    print(f"Sufficient: {result.sufficient}")
    print(f"Confidence: {result.confidence}")
    print(f"Reasoning: {result.reasoning}")

# asyncio.run(test_real())  # Uncomment with valid API key

Proceed to Phase 4 ONLY after all checkboxes are complete.