VibecoderMcSwaggins committed on
Commit
24a5878
·
2 Parent(s): ecbc47b ec3d7dc

Merge main: Phase 5 + Phase 6-8 doc revisions

docs/architecture/overview.md CHANGED
@@ -63,53 +63,58 @@ Using existing approved drugs to treat NEW diseases they weren't originally desi
63
 
64
  ## System Architecture
65
 
66
- ### High-Level Design
67
 
68
  ```
69
- User Question
70
  ↓
71
- Research Agent (Orchestrator)
72
  ↓
73
- Search Loop:
74
- 1. Query Tools (PubMed, Web, Clinical Trials)
75
- 2. Gather Evidence
76
- 3. Judge Quality ("Do we have enough?")
77
- 4. If NO β†’ Refine query, search more
78
- 5. If YES β†’ Synthesize findings
79
  ↓
80
- Research Report with Citations
81
  ```
82
 
83
  ### Key Components
84
 
85
- 1. **Research Agent (Orchestrator)**
86
- - Manages the research process
87
- - Plans search strategies
88
- - Coordinates tools
89
- - Tracks token budget and iterations
90
-
91
- 2. **Tools**
92
- - PubMed Search (biomedical papers)
93
- - Web Search (general medical info)
94
- - Clinical Trials Database
95
- - Drug Information APIs
96
- - (Future: Protein databases, pathways)
97
-
98
- 3. **Judge System**
99
- - LLM-based quality assessment
100
- - Evaluates: "Do we have enough evidence?"
101
- - Criteria: Coverage, reliability, citation quality
102
-
103
- 4. **Break Conditions**
104
- - Token budget cap (cost control)
105
- - Max iterations (time control)
106
- - Judge says "sufficient evidence" (quality control)
107
-
108
- 5. **Gradio UI**
109
- - Simple text input for questions
110
- - Real-time progress display
111
- - Formatted research report output
112
- - Source citations and links
113
 
114
  ---
115
 
@@ -275,37 +280,31 @@ httpx = "^0.27"
275
 
276
  ## Success Criteria
277
 
278
- ### Minimum Viable Product (MVP) - Days 1-3
279
- **MUST HAVE for working demo:**
280
  - [x] User can ask drug repurposing question
281
- - [ ] Agent searches PubMed (async)
282
- - [ ] Agent searches web (Brave/DuckDuckGo)
283
- - [ ] LLM judge evaluates evidence quality
284
- - [ ] System respects token budget (50K tokens max)
285
- - [ ] Output includes drug candidates + citations
286
- - [ ] Works end-to-end for demo query: "Long COVID fatigue"
287
- - [ ] Gradio UI with streaming progress
288
-
289
- ### Hackathon Submission - Days 4-5
290
- **Required for all tracks:**
291
- - [ ] Gradio UI deployed on HuggingFace Spaces
292
- - [ ] 3 example queries working and tested
293
- - [ ] This architecture documentation
294
- - [ ] Demo video (2-3 min) showing workflow
295
- - [ ] README with setup instructions
296
-
297
- **Track-Specific:**
298
- - [ ] **Gradio Track**: Streaming UI, progress indicators, modern design
299
- - [ ] **MCP Track**: PubMed tool as MCP server (reusable by others)
300
- - [ ] **Modal Track**: GPU inference option (stretch)
301
-
302
- ### Stretch Goals - Day 6+
303
- **Nice-to-have if time permits:**
304
- - [ ] Modal integration for local LLM fallback
305
- - [ ] Clinical trials database search
306
- - [ ] Checkpoint/resume functionality
307
- - [ ] OpenFDA drug safety lookup
308
- - [ ] PDF export of research reports
309
 
310
  ### What's EXPLICITLY Out of Scope
311
  **NOT building (to stay focused):**
 
63
 
64
  ## System Architecture
65
 
66
+ ### High-Level Design (Phases 1-8)
67
 
68
  ```
69
+ User Query
70
  ↓
71
+ Gradio UI (Phase 4)
72
  ↓
73
+ Magentic Manager (Phase 5) ← LLM-powered coordinator
74
+ ├── SearchAgent (Phase 2+5) ←→ PubMed + Web + VectorDB (Phase 6)
75
+ ├── HypothesisAgent (Phase 7) ←→ Mechanistic Reasoning
76
+ ├── JudgeAgent (Phase 3+5) ←→ Evidence Assessment
77
+ └── ReportAgent (Phase 8) ←→ Final Synthesis
 
78
  ↓
79
+ Structured Research Report
80
  ```
81
 
82
  ### Key Components
83
 
84
+ 1. **Magentic Manager (Orchestrator)**
85
+ - LLM-powered multi-agent coordinator
86
+ - Dynamic planning and agent selection
87
+ - Built-in stall detection and replanning
88
+ - Microsoft Agent Framework integration
89
+
90
+ 2. **SearchAgent (Phase 2+5+6)**
91
+ - PubMed E-utilities search
92
+ - DuckDuckGo web search
93
+ - Semantic search via ChromaDB (Phase 6)
94
+ - Evidence deduplication
95
+
96
+ 3. **HypothesisAgent (Phase 7)**
97
+ - Generates Drug → Target → Pathway → Effect hypotheses
98
+ - Guides targeted searches
99
+ - Scientific reasoning about mechanisms
100
+
101
+ 4. **JudgeAgent (Phase 3+5)**
102
+ - LLM-based evidence assessment
103
+ - Mechanism score + Clinical score
104
+ - Recommends continue/synthesize
105
+ - Generates refined search queries
106
+
107
+ 5. **ReportAgent (Phase 8)**
108
+ - Structured scientific reports
109
+ - Executive summary, methodology
110
+ - Hypotheses tested with evidence counts
111
+ - Proper citations and limitations
112
+
113
+ 6. **Gradio UI (Phase 4)**
114
+ - Chat interface for questions
115
+ - Real-time progress via events
116
+ - Mode toggle (Simple/Magentic)
117
+ - Formatted markdown output
118
 
119
  ---
120
 
 
280
 
281
  ## Success Criteria
282
 
283
+ ### Phase 1-5 (MVP) ✅ COMPLETE
284
+ **Completed in ONE DAY:**
285
  - [x] User can ask drug repurposing question
286
+ - [x] Agent searches PubMed (async)
287
+ - [x] Agent searches web (DuckDuckGo)
288
+ - [x] LLM judge evaluates evidence quality
289
+ - [x] System respects token budget and iterations
290
+ - [x] Output includes drug candidates + citations
291
+ - [x] Works end-to-end for demo query
292
+ - [x] Gradio UI with streaming progress
293
+ - [x] Magentic multi-agent orchestration
294
+ - [x] 38 unit tests passing
295
+ - [x] CI/CD pipeline green
296
+
297
+ ### Hackathon Submission ✅ COMPLETE
298
+ - [x] Gradio UI deployed on HuggingFace Spaces
299
+ - [x] Example queries working and tested
300
+ - [x] Architecture documentation
301
+ - [x] README with setup instructions
302
+
303
+ ### Phase 6-8 (Enhanced)
304
+ **Specs ready for implementation:**
305
+ - [ ] Embeddings & Semantic Search (Phase 6)
306
+ - [ ] Hypothesis Agent (Phase 7)
307
+ - [ ] Report Agent (Phase 8)
 
308
 
309
  ### What's EXPLICITLY Out of Scope
310
  **NOT building (to stay focused):**
docs/implementation/06_phase_embeddings.md ADDED
@@ -0,0 +1,409 @@
1
+ # Phase 6 Implementation Spec: Embeddings & Semantic Search
2
+
3
+ **Goal**: Add vector search for semantic evidence retrieval.
4
+ **Philosophy**: "Find what you mean, not just what you type."
5
+ **Prerequisite**: Phase 5 complete (Magentic working)
6
+
7
+ ---
8
+
9
+ ## 1. Why Embeddings?
10
+
11
+ Current limitation: **Keyword-only search misses semantically related papers.**
12
+
13
+ Example problem:
14
+ - User searches: "metformin alzheimer"
15
+ - PubMed returns: Papers with exact keywords
16
+ - MISSED: Papers about "AMPK activation neuroprotection" (same mechanism, different words)
17
+
18
+ With embeddings:
19
+ - Embed the query AND all evidence
20
+ - Find semantically similar papers even without keyword match
21
+ - Deduplicate by meaning, not just URL
22
+
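+ A minimal sketch of the idea (using the same `all-MiniLM-L6-v2` model adopted below; scores are illustrative):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("all-MiniLM-L6-v2")
+ query = model.encode("metformin alzheimer")
+ related = model.encode("AMPK activation neuroprotection")
+ unrelated = model.encode("the weather is sunny today")
+
+ # The mechanistically related text scores far higher than the unrelated one,
+ # despite sharing zero keywords with the query.
+ print(util.cos_sim(query, related), util.cos_sim(query, unrelated))
+ ```
+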
23
+ ---
24
+
25
+ ## 2. Architecture
26
+
27
+ ### Current (Phase 5)
28
+ ```
29
+ Query → SearchAgent → PubMed/Web (keyword) → Evidence
30
+ ```
31
+
32
+ ### Phase 6
33
+ ```
34
+ Query → Embed(Query) → SearchAgent
35
+ ├── PubMed/Web (keyword) → Evidence
36
+ └── VectorDB (semantic) → Related Evidence
37
+ ↑
38
+ Evidence → Embed → Store
39
+ ```
40
+
41
+ ### Shared Context Enhancement
42
+ ```python
43
+ # Current
44
+ evidence_store = {"current": []}
45
+
46
+ # Phase 6
47
+ evidence_store = {
48
+ "current": [], # Raw evidence
49
+ "embeddings": {}, # URL -> embedding vector
50
+ "vector_index": None, # ChromaDB collection
51
+ }
52
+ ```
53
+
54
+ ---
55
+
56
+ ## 3. Technology Choice
57
+
58
+ ### ChromaDB (Recommended)
59
+ - **Free**, open-source, local-first
60
+ - No API keys, no cloud dependency
61
+ - Supports sentence-transformers out of the box
62
+ - Perfect for hackathon (no infra setup)
63
+
64
+ ### Embedding Model
65
+ - `sentence-transformers/all-MiniLM-L6-v2` (fast, good quality)
66
+ - Or `BAAI/bge-small-en-v1.5` (better quality, still fast)
67
+
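+ Either model can be passed to the `EmbeddingService` defined below via its `model_name` argument:
+
+ ```python
+ service = EmbeddingService()                          # default: all-MiniLM-L6-v2
+ service = EmbeddingService("BAAI/bge-small-en-v1.5")  # higher quality, still fast
+ ```
+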
68
+ ---
69
+
70
+ ## 4. Implementation
71
+
72
+ ### 4.1 Dependencies
73
+
74
+ Add to `pyproject.toml`:
75
+ ```toml
76
+ [project.optional-dependencies]
77
+ embeddings = [
78
+ "chromadb>=0.4.0",
79
+ "sentence-transformers>=2.2.0",
80
+ ]
81
+ ```
82
+
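+ Install the extra locally with `pip install -e ".[embeddings]"` (or your package manager's equivalent) before running the code below.
+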
83
+ ### 4.2 Embedding Service (`src/services/embeddings.py`)
84
+
85
+ > **CRITICAL: Async Pattern Required**
86
+ >
87
+ > `sentence-transformers` is synchronous and CPU-bound. Running it directly in async code
88
+ > will **block the event loop**, freezing the UI and halting all concurrent operations.
89
+ >
90
+ > **Solution**: Use `asyncio.run_in_executor()` to offload to thread pool.
91
+ > This pattern already exists in `src/tools/websearch.py:28-34`.
92
+
93
+ ```python
94
+ """Embedding service for semantic search.
95
+
96
+ IMPORTANT: All public methods are async to avoid blocking the event loop.
97
+ The sentence-transformers model is CPU-bound, so we use run_in_executor().
98
+ """
99
+ import asyncio
100
+ from typing import List
101
+
102
+ import chromadb
103
+ from sentence_transformers import SentenceTransformer
104
+
105
+
106
+ class EmbeddingService:
107
+ """Handles text embedding and vector storage.
108
+
109
+ All embedding operations run in a thread pool to avoid blocking
110
+ the async event loop. See src/tools/websearch.py for the pattern.
111
+ """
112
+
113
+ def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
114
+ self._model = SentenceTransformer(model_name)
115
+ self._client = chromadb.Client() # In-memory for hackathon
116
+ self._collection = self._client.create_collection(
117
+ name="evidence",
118
+ metadata={"hnsw:space": "cosine"}
119
+ )
120
+
121
+ # ─────────────────────────────────────────────────────────────────
122
+ # Sync internal methods (run in thread pool)
123
+ # ─────────────────────────────────────────────────────────────────
124
+
125
+ def _sync_embed(self, text: str) -> List[float]:
126
+ """Synchronous embedding - DO NOT call directly from async code."""
127
+ return self._model.encode(text).tolist()
128
+
129
+ def _sync_batch_embed(self, texts: List[str]) -> List[List[float]]:
130
+ """Batch embedding for efficiency - DO NOT call directly from async code."""
131
+ return [e.tolist() for e in self._model.encode(texts)]
132
+
133
+ # ─────────────────────────────────────────────────────────────────
134
+ # Async public methods (safe for event loop)
135
+ # ─────────────────────────────────────────────────────────────────
136
+
137
+ async def embed(self, text: str) -> List[float]:
138
+ """Embed a single text (async-safe).
139
+
140
+ Uses run_in_executor to avoid blocking the event loop.
141
+ """
142
+ loop = asyncio.get_running_loop()
143
+ return await loop.run_in_executor(None, self._sync_embed, text)
144
+
145
+ async def embed_batch(self, texts: List[str]) -> List[List[float]]:
146
+ """Batch embed multiple texts (async-safe, more efficient)."""
147
+ loop = asyncio.get_running_loop()
148
+ return await loop.run_in_executor(None, self._sync_batch_embed, texts)
149
+
150
+ async def add_evidence(self, evidence_id: str, content: str, metadata: dict) -> None:
151
+ """Add evidence to vector store (async-safe)."""
152
+ embedding = await self.embed(content)
153
+ # ChromaDB operations are fast, but wrap for consistency
154
+ loop = asyncio.get_running_loop()
155
+ await loop.run_in_executor(
156
+ None,
157
+ lambda: self._collection.add(
158
+ ids=[evidence_id],
159
+ embeddings=[embedding],
160
+ metadatas=[metadata],
161
+ documents=[content]
162
+ )
163
+ )
164
+
165
+ async def search_similar(self, query: str, n_results: int = 5) -> List[dict]:
166
+ """Find semantically similar evidence (async-safe)."""
167
+ query_embedding = await self.embed(query)
168
+
169
+ loop = asyncio.get_running_loop()
170
+ results = await loop.run_in_executor(
171
+ None,
172
+ lambda: self._collection.query(
173
+ query_embeddings=[query_embedding],
174
+ n_results=n_results
175
+ )
176
+ )
177
+
178
+ # Handle empty results gracefully
179
+ if not results["ids"] or not results["ids"][0]:
180
+ return []
181
+
182
+ return [
183
+ {"id": id, "content": doc, "metadata": meta, "distance": dist}
184
+ for id, doc, meta, dist in zip(
185
+ results["ids"][0],
186
+ results["documents"][0],
187
+ results["metadatas"][0],
188
+ results["distances"][0]
189
+ )
190
+ ]
191
+
192
+ async def deduplicate(self, new_evidence: List, threshold: float = 0.9) -> List:
193
+ """Remove semantically duplicate evidence (async-safe)."""
194
+ unique = []
195
+ for evidence in new_evidence:
196
+ similar = await self.search_similar(evidence.content, n_results=1)
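+ # Keep only items whose nearest stored neighbor is farther than (1 - threshold):
+ # for threshold=0.9, a cosine distance <= 0.1 marks a semantic duplicate.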
197
+ if not similar or similar[0]["distance"] > (1 - threshold):
198
+ unique.append(evidence)
199
+ await self.add_evidence(
200
+ evidence_id=evidence.citation.url,
201
+ content=evidence.content,
202
+ metadata={"source": evidence.citation.source}
203
+ )
204
+ return unique
205
+ ```
206
+
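+ A short usage sketch of the service in an async context (IDs and text illustrative):
+
+ ```python
+ import asyncio
+
+ async def main() -> None:
+     service = EmbeddingService()
+     await service.add_evidence(
+         evidence_id="pmid:12345",
+         content="Metformin activates AMPK in hepatocytes.",
+         metadata={"source": "pubmed"},
+     )
+     hits = await service.search_similar("AMPK activators", n_results=1)
+     print(hits[0]["id"], round(hits[0]["distance"], 3))
+
+ asyncio.run(main())
+ ```
+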
207
+ ### 4.3 Enhanced SearchAgent (`src/agents/search_agent.py`)
208
+
209
+ Update SearchAgent to use embeddings. **Note**: All embedding calls are `await`ed:
210
+
211
+ ```python
212
+ class SearchAgent(BaseAgent):
213
+ def __init__(
214
+ self,
215
+ search_handler: SearchHandlerProtocol,
216
+ evidence_store: dict,
217
+ embedding_service: EmbeddingService | None = None, # NEW
218
+ ):
219
+ # ... existing init ...
220
+ self._embeddings = embedding_service
221
+
222
+ async def run(self, messages, *, thread=None, **kwargs) -> AgentRunResponse:
223
+ # ... extract query ...
224
+
225
+ # Execute keyword search
226
+ result = await self._handler.execute(query, max_results_per_tool=10)
227
+
228
+ # Semantic deduplication (NEW) - ALL CALLS ARE AWAITED
229
+ if self._embeddings:
230
+ # Deduplicate by semantic similarity (async-safe)
231
+ unique_evidence = await self._embeddings.deduplicate(result.evidence)
232
+
233
+ # Also search for semantically related evidence (async-safe)
234
+ related = await self._embeddings.search_similar(query, n_results=5)
235
+
236
+ # Merge related evidence not already in results
237
+ existing_urls = {e.citation.url for e in unique_evidence}
238
+ for item in related:
239
+ if item["id"] not in existing_urls:
240
+ # Reconstruct Evidence from stored data
241
+ # ... merge logic ...
242
+
243
+ # ... rest of method ...
244
+ ```
245
+
246
+ ### 4.4 Semantic Expansion in Orchestrator
247
+
248
+ The MagenticOrchestrator can use embeddings to expand queries:
249
+
250
+ ```python
251
+ # In task instruction
252
+ task = f"""Research drug repurposing opportunities for: {query}
253
+
254
+ The system has semantic search enabled. When evidence is found:
255
+ 1. Related concepts will be automatically surfaced
256
+ 2. Duplicates are removed by meaning, not just URL
257
+ 3. Use the surfaced related concepts to refine searches
258
+ """
259
+ ```
260
+
261
+ ### 4.5 HuggingFace Spaces Deployment
262
+
263
+ > **⚠️ Important for HF Spaces**
264
+ >
265
+ > `sentence-transformers` downloads models (~500MB) to `~/.cache` on first use.
266
+ > HuggingFace Spaces have **ephemeral storage** - the cache is wiped on restart.
267
+ > This causes slow cold starts and bandwidth usage.
268
+
269
+ **Solution**: Pre-download the model in your Dockerfile:
270
+
271
+ ```dockerfile
272
+ # In Dockerfile
273
+ FROM python:3.11-slim
274
+
275
+ # Set cache directory
276
+ ENV HF_HOME=/app/.cache
277
+ ENV TRANSFORMERS_CACHE=/app/.cache
278
+
279
+ # Pre-download the embedding model during build
280
+ RUN pip install sentence-transformers && \
281
+ python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
282
+
283
+ # ... rest of Dockerfile
284
+ ```
285
+
286
+ **Alternative**: Use environment variable to specify persistent path:
287
+
288
+ ```yaml
289
+ # In HF Spaces settings or app.yaml
290
+ env:
291
+ - name: HF_HOME
292
+ value: /data/.cache # Persistent volume
293
+ ```
294
+
295
+ ---
296
+
297
+ ## 5. Directory Structure After Phase 6
298
+
299
+ ```
300
+ src/
301
+ ├── services/ # NEW
302
+ │ ├── __init__.py
303
+ │ └── embeddings.py
304
+ ├── agents/
305
+ │ ├── search_agent.py
306
+ │ └── judge_agent.py
307
+ └── ...
308
+ ```
309
+
310
+ ---
311
+
312
+ ## 6. Tests
313
+
314
+ ### 6.1 Unit Tests (`tests/unit/services/test_embeddings.py`)
315
+
316
+ > **Note**: All tests are async since the EmbeddingService methods are async.
317
+
318
+ ```python
319
+ """Unit tests for EmbeddingService."""
320
+ import pytest
321
+ from src.services.embeddings import EmbeddingService
322
+
323
+
324
+ class TestEmbeddingService:
325
+ @pytest.mark.asyncio
326
+ async def test_embed_returns_vector(self):
327
+ """Embedding should return a float vector."""
328
+ service = EmbeddingService()
329
+ embedding = await service.embed("metformin diabetes")
330
+ assert isinstance(embedding, list)
331
+ assert len(embedding) > 0
332
+ assert all(isinstance(x, float) for x in embedding)
333
+
334
+ @pytest.mark.asyncio
335
+ async def test_similar_texts_have_close_embeddings(self):
336
+ """Semantically similar texts should have similar embeddings."""
337
+ service = EmbeddingService()
338
+ e1 = await service.embed("metformin treats diabetes")
339
+ e2 = await service.embed("metformin is used for diabetes treatment")
340
+ e3 = await service.embed("the weather is sunny today")
341
+
342
+ # Cosine similarity helper
343
+ from numpy import dot
344
+ from numpy.linalg import norm
345
+ cosine = lambda a, b: dot(a, b) / (norm(a) * norm(b))
346
+
347
+ # Similar texts should be closer
348
+ assert cosine(e1, e2) > cosine(e1, e3)
349
+
350
+ @pytest.mark.asyncio
351
+ async def test_batch_embed_efficient(self):
352
+ """Batch embedding should be more efficient than individual calls."""
353
+ service = EmbeddingService()
354
+ texts = ["text one", "text two", "text three"]
355
+
356
+ # Batch embed
357
+ batch_results = await service.embed_batch(texts)
358
+ assert len(batch_results) == 3
359
+ assert all(isinstance(e, list) for e in batch_results)
360
+
361
+ @pytest.mark.asyncio
362
+ async def test_add_and_search(self):
363
+ """Should be able to add evidence and search for similar."""
364
+ service = EmbeddingService()
365
+ await service.add_evidence(
366
+ evidence_id="test1",
367
+ content="Metformin activates AMPK pathway",
368
+ metadata={"source": "pubmed"}
369
+ )
370
+
371
+ results = await service.search_similar("AMPK activation drugs", n_results=1)
372
+ assert len(results) == 1
373
+ assert "AMPK" in results[0]["content"]
374
+
375
+ @pytest.mark.asyncio
376
+ async def test_search_similar_empty_collection(self):
377
+ """Search on empty collection should return empty list, not error."""
378
+ service = EmbeddingService()
379
+ results = await service.search_similar("anything", n_results=5)
380
+ assert results == []
381
+ ```
382
+
383
+ ---
384
+
385
+ ## 7. Definition of Done
386
+
387
+ Phase 6 is **COMPLETE** when:
388
+
389
+ 1. `EmbeddingService` implemented with ChromaDB
390
+ 2. SearchAgent uses embeddings for deduplication
391
+ 3. Semantic search surfaces related evidence
392
+ 4. All unit tests pass
393
+ 5. Integration test shows improved recall (finds related papers)
394
+
395
+ ---
396
+
397
+ ## 8. Value Delivered
398
+
399
+ | Before (Phase 5) | After (Phase 6) |
400
+ |------------------|-----------------|
401
+ | Keyword-only search | Semantic + keyword search |
402
+ | URL-based deduplication | Meaning-based deduplication |
403
+ | Miss related papers | Surface related concepts |
404
+ | Exact match required | Fuzzy semantic matching |
405
+
406
+ **Real example improvement:**
407
+ - Query: "metformin alzheimer"
408
+ - Before: Only papers mentioning both words
409
+ - After: Also finds "AMPK neuroprotection", "biguanide cognitive", etc.
docs/implementation/07_phase_hypothesis.md ADDED
@@ -0,0 +1,630 @@
1
+ # Phase 7 Implementation Spec: Hypothesis Agent
2
+
3
+ **Goal**: Add an agent that generates scientific hypotheses to guide targeted searches.
4
+ **Philosophy**: "Don't just find evidenceβ€”understand the mechanisms."
5
+ **Prerequisite**: Phase 6 complete (Embeddings working)
6
+
7
+ ---
8
+
9
+ ## 1. Why Hypothesis Agent?
10
+
11
+ Current limitation: **Search is reactive, not hypothesis-driven.**
12
+
13
+ Current flow:
14
+ 1. User asks about "metformin alzheimer"
15
+ 2. Search finds papers
16
+ 3. Judge says "need more evidence"
17
+ 4. Search again with slightly different keywords
18
+
19
+ With Hypothesis Agent:
20
+ 1. User asks about "metformin alzheimer"
21
+ 2. Search finds initial papers
22
+ 3. **Hypothesis Agent analyzes**: "Evidence suggests metformin β†’ AMPK activation β†’ autophagy β†’ amyloid clearance"
23
+ 4. Search can now target: "metformin AMPK", "autophagy neurodegeneration", "amyloid clearance drugs"
24
+
25
+ **Key insight**: Scientific research is hypothesis-driven. The agent should think like a researcher.
26
+
27
+ ---
28
+
29
+ ## 2. Architecture
30
+
31
+ ### Current (Phase 6)
32
+ ```
33
+ User Query β†’ Magentic Manager
34
+ ├── SearchAgent → Evidence
35
+ └── JudgeAgent → Sufficient? → Synthesize/Continue
36
+ ```
37
+
38
+ ### Phase 7
39
+ ```
40
+ User Query β†’ Magentic Manager
41
+ ├── SearchAgent → Evidence
42
+ ├── HypothesisAgent → Mechanistic Hypotheses ← NEW
43
+ └── JudgeAgent → Sufficient? → Synthesize/Continue
44
+ ↑
45
+ Uses hypotheses to guide next search
46
+ ```
47
+
48
+ ### Shared Context Enhancement
49
+ ```python
50
+ evidence_store = {
51
+ "current": [],
52
+ "embeddings": {},
53
+ "vector_index": None,
54
+ "hypotheses": [], # NEW: Generated hypotheses
55
+ "tested_hypotheses": [], # NEW: Hypotheses with supporting/contradicting evidence
56
+ }
57
+ ```
58
+
59
+ ---
60
+
61
+ ## 3. Hypothesis Model
62
+
63
+ ### 3.1 Data Model (`src/utils/models.py`)
64
+
65
+ ```python
66
+ class MechanismHypothesis(BaseModel):
67
+ """A scientific hypothesis about drug mechanism."""
68
+
69
+ drug: str = Field(description="The drug being studied")
70
+ target: str = Field(description="Molecular target (e.g., AMPK, mTOR)")
71
+ pathway: str = Field(description="Biological pathway affected")
72
+ effect: str = Field(description="Downstream effect on disease")
73
+ confidence: float = Field(ge=0, le=1, description="Confidence in hypothesis")
74
+ supporting_evidence: list[str] = Field(
75
+ default_factory=list,
76
+ description="PMIDs or URLs supporting this hypothesis"
77
+ )
78
+ contradicting_evidence: list[str] = Field(
79
+ default_factory=list,
80
+ description="PMIDs or URLs contradicting this hypothesis"
81
+ )
82
+ search_suggestions: list[str] = Field(
83
+ default_factory=list,
84
+ description="Suggested searches to test this hypothesis"
85
+ )
86
+
87
+ def to_search_queries(self) -> list[str]:
88
+ """Generate search queries to test this hypothesis."""
89
+ return [
90
+ f"{self.drug} {self.target}",
91
+ f"{self.target} {self.pathway}",
92
+ f"{self.pathway} {self.effect}",
93
+ *self.search_suggestions
94
+ ]
95
+ ```
96
+
97
+ ### 3.2 Hypothesis Assessment
98
+
99
+ ```python
100
+ class HypothesisAssessment(BaseModel):
101
+ """Assessment of evidence against hypotheses."""
102
+
103
+ hypotheses: list[MechanismHypothesis]
104
+ primary_hypothesis: MechanismHypothesis | None = Field(
105
+ description="Most promising hypothesis based on current evidence"
106
+ )
107
+ knowledge_gaps: list[str] = Field(
108
+ description="What we don't know yet"
109
+ )
110
+ recommended_searches: list[str] = Field(
111
+ description="Searches to fill knowledge gaps"
112
+ )
113
+ ```
114
+
115
+ ---
116
+
117
+ ## 4. Implementation
118
+
119
+ ### 4.0 Text Utilities (`src/utils/text_utils.py`)
120
+
121
+ > **Why These Utilities?**
122
+ >
123
+ > The original spec used arbitrary truncation (`evidence[:10]` and `content[:300]`).
124
+ > This loses important information randomly. These utilities provide:
125
+ > 1. **Sentence-aware truncation** - cuts at sentence boundaries, not mid-word
126
+ > 2. **Diverse evidence selection** - uses embeddings to select varied evidence (MMR)
127
+
128
+ ```python
129
+ """Text processing utilities for evidence handling."""
130
+ from typing import TYPE_CHECKING
131
+
132
+ if TYPE_CHECKING:
133
+ from src.services.embeddings import EmbeddingService
134
+ from src.utils.models import Evidence
135
+
136
+
137
+ def truncate_at_sentence(text: str, max_chars: int = 300) -> str:
138
+ """Truncate text at sentence boundary, preserving meaning.
139
+
140
+ Args:
141
+ text: The text to truncate
142
+ max_chars: Maximum characters (default 300)
143
+
144
+ Returns:
145
+ Text truncated at last complete sentence within limit
146
+ """
147
+ if len(text) <= max_chars:
148
+ return text
149
+
150
+ # Find truncation point
151
+ truncated = text[:max_chars]
152
+
153
+ # Look for sentence endings: . ! ? followed by space or end
154
+ for sep in ['. ', '! ', '? ', '.\n', '!\n', '?\n']:
155
+ last_sep = truncated.rfind(sep)
156
+ if last_sep > max_chars // 2: # Don't truncate too aggressively
157
+ return text[:last_sep + 1].strip()
158
+
159
+ # Fallback: find last period
160
+ last_period = truncated.rfind('.')
161
+ if last_period > max_chars // 2:
162
+ return text[:last_period + 1].strip()
163
+
164
+ # Last resort: truncate at word boundary
165
+ last_space = truncated.rfind(' ')
166
+ if last_space > 0:
167
+ return text[:last_space].strip() + "..."
168
+
169
+ return truncated + "..."
170
+
171
+
172
+ async def select_diverse_evidence(
173
+ evidence: list["Evidence"],
174
+ n: int,
175
+ query: str,
176
+ embeddings: "EmbeddingService | None" = None
177
+ ) -> list["Evidence"]:
178
+ """Select n most diverse and relevant evidence items.
179
+
180
+ Uses Maximal Marginal Relevance (MMR) when embeddings available,
181
+ falls back to relevance_score sorting otherwise.
182
+
183
+ Args:
184
+ evidence: All available evidence
185
+ n: Number of items to select
186
+ query: Original query for relevance scoring
187
+ embeddings: Optional EmbeddingService for semantic diversity
188
+
189
+ Returns:
190
+ Selected evidence items, diverse and relevant
191
+ """
192
+ if not evidence:
193
+ return []
194
+
195
+ if n >= len(evidence):
196
+ return evidence
197
+
198
+ # Fallback: sort by relevance score if no embeddings
199
+ if embeddings is None:
200
+ return sorted(
201
+ evidence,
202
+ key=lambda e: e.relevance_score,
203
+ reverse=True
204
+ )[:n]
205
+
206
+ # MMR: Maximal Marginal Relevance for diverse selection
207
+ # Score = λ * relevance - (1-λ) * max_similarity_to_selected
208
+ lambda_param = 0.7 # Balance relevance vs diversity
209
+
210
+ # Get query embedding
211
+ query_emb = await embeddings.embed(query)
212
+
213
+ # Get all evidence embeddings
214
+ evidence_embs = await embeddings.embed_batch([e.content for e in evidence])
215
+
216
+ # Compute relevance scores (cosine similarity to query)
217
+ from numpy import dot
218
+ from numpy.linalg import norm
219
+ cosine = lambda a, b: float(dot(a, b) / (norm(a) * norm(b)))
220
+
221
+ relevance_scores = [cosine(query_emb, emb) for emb in evidence_embs]
222
+
223
+ # Greedy MMR selection
224
+ selected_indices: list[int] = []
225
+ remaining = set(range(len(evidence)))
226
+
227
+ for _ in range(n):
228
+ best_score = float('-inf')
229
+ best_idx = -1
230
+
231
+ for idx in remaining:
232
+ # Relevance component
233
+ relevance = relevance_scores[idx]
234
+
235
+ # Diversity component: max similarity to already selected
236
+ if selected_indices:
237
+ max_sim = max(
238
+ cosine(evidence_embs[idx], evidence_embs[sel])
239
+ for sel in selected_indices
240
+ )
241
+ else:
242
+ max_sim = 0
243
+
244
+ # MMR score
245
+ mmr_score = lambda_param * relevance - (1 - lambda_param) * max_sim
246
+
247
+ if mmr_score > best_score:
248
+ best_score = mmr_score
249
+ best_idx = idx
250
+
251
+ if best_idx >= 0:
252
+ selected_indices.append(best_idx)
253
+ remaining.remove(best_idx)
254
+
255
+ return [evidence[i] for i in selected_indices]
256
+ ```
257
+
258
+ ### 4.1 Hypothesis Prompts (`src/prompts/hypothesis.py`)
259
+
260
+ ```python
261
+ """Prompts for Hypothesis Agent."""
262
+ from src.utils.text_utils import truncate_at_sentence, select_diverse_evidence
263
+
264
+ SYSTEM_PROMPT = """You are a biomedical research scientist specializing in drug repurposing.
265
+
266
+ Your role is to generate mechanistic hypotheses based on evidence.
267
+
268
+ A good hypothesis:
269
+ 1. Proposes a MECHANISM: Drug → Target → Pathway → Effect
270
+ 2. Is TESTABLE: Can be supported or refuted by literature search
271
+ 3. Is SPECIFIC: Names actual molecular targets and pathways
272
+ 4. Generates SEARCH QUERIES: Helps find more evidence
273
+
274
+ Example hypothesis format:
275
+ - Drug: Metformin
276
+ - Target: AMPK (AMP-activated protein kinase)
277
+ - Pathway: mTOR inhibition → autophagy activation
278
+ - Effect: Enhanced clearance of amyloid-beta in Alzheimer's
279
+ - Confidence: 0.7
280
+ - Search suggestions: ["metformin AMPK brain", "autophagy amyloid clearance"]
281
+
282
+ Be specific. Use actual gene/protein names when possible."""
283
+
284
+
285
+ async def format_hypothesis_prompt(
286
+ query: str,
287
+ evidence: list,
288
+ embeddings=None
289
+ ) -> str:
290
+ """Format prompt for hypothesis generation.
291
+
292
+ Uses smart evidence selection instead of arbitrary truncation.
293
+
294
+ Args:
295
+ query: The research query
296
+ evidence: All collected evidence
297
+ embeddings: Optional EmbeddingService for diverse selection
298
+ """
299
+ # Select diverse, relevant evidence (not arbitrary first 10)
300
+ selected = await select_diverse_evidence(
301
+ evidence, n=10, query=query, embeddings=embeddings
302
+ )
303
+
304
+ # Format with sentence-aware truncation
305
+ evidence_text = "\n".join([
306
+ f"- **{e.citation.title}** ({e.citation.source}): {truncate_at_sentence(e.content, 300)}"
307
+ for e in selected
308
+ ])
309
+
310
+ return f"""Based on the following evidence about "{query}", generate mechanistic hypotheses.
311
+
312
+ ## Evidence ({len(selected)} papers selected for diversity)
313
+ {evidence_text}
314
+
315
+ ## Task
316
+ 1. Identify potential drug targets mentioned in the evidence
317
+ 2. Propose mechanism hypotheses (Drug β†’ Target β†’ Pathway β†’ Effect)
318
+ 3. Rate confidence based on evidence strength
319
+ 4. Suggest searches to test each hypothesis
320
+
321
+ Generate 2-4 hypotheses, prioritized by confidence."""
322
+ ```
323
+
324
+ ### 4.2 Hypothesis Agent (`src/agents/hypothesis_agent.py`)
325
+
326
+ ```python
327
+ """Hypothesis agent for mechanistic reasoning."""
328
+ from collections.abc import AsyncIterable
329
+ from typing import TYPE_CHECKING, Any
330
+
331
+ from agent_framework import (
332
+ AgentRunResponse,
333
+ AgentRunResponseUpdate,
334
+ AgentThread,
335
+ BaseAgent,
336
+ ChatMessage,
337
+ Role,
338
+ )
339
+ from pydantic_ai import Agent
340
+
341
+ from src.prompts.hypothesis import SYSTEM_PROMPT, format_hypothesis_prompt
342
+ from src.utils.config import settings
343
+ from src.utils.models import Evidence, HypothesisAssessment
344
+
345
+ if TYPE_CHECKING:
346
+ from src.services.embeddings import EmbeddingService
347
+
348
+
349
+ class HypothesisAgent(BaseAgent):
350
+ """Generates mechanistic hypotheses based on evidence."""
351
+
352
+ def __init__(
353
+ self,
354
+ evidence_store: dict[str, list[Evidence]],
355
+ embedding_service: "EmbeddingService | None" = None, # NEW: for diverse selection
356
+ ) -> None:
357
+ super().__init__(
358
+ name="HypothesisAgent",
359
+ description="Generates scientific hypotheses about drug mechanisms to guide research",
360
+ )
361
+ self._evidence_store = evidence_store
362
+ self._embeddings = embedding_service # Used for MMR evidence selection
363
+ self._agent = Agent(
364
+ model=settings.llm_provider, # Uses configured LLM
365
+ output_type=HypothesisAssessment,
366
+ system_prompt=SYSTEM_PROMPT,
367
+ )
368
+
369
+ async def run(
370
+ self,
371
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
372
+ *,
373
+ thread: AgentThread | None = None,
374
+ **kwargs: Any,
375
+ ) -> AgentRunResponse:
376
+ """Generate hypotheses based on current evidence."""
377
+ # Extract query
378
+ query = self._extract_query(messages)
379
+
380
+ # Get current evidence
381
+ evidence = self._evidence_store.get("current", [])
382
+
383
+ if not evidence:
384
+ return AgentRunResponse(
385
+ messages=[ChatMessage(
386
+ role=Role.ASSISTANT,
387
+ text="No evidence available yet. Search for evidence first."
388
+ )],
389
+ response_id="hypothesis-no-evidence",
390
+ )
391
+
392
+ # Generate hypotheses with diverse evidence selection
393
+ # NOTE: format_hypothesis_prompt is now async
394
+ prompt = await format_hypothesis_prompt(
395
+ query, evidence, embeddings=self._embeddings
396
+ )
397
+ result = await self._agent.run(prompt)
398
+ assessment = result.output
399
+
400
+ # Store hypotheses in shared context
401
+ existing = self._evidence_store.get("hypotheses", [])
402
+ self._evidence_store["hypotheses"] = existing + assessment.hypotheses
403
+
404
+ # Format response
405
+ response_text = self._format_response(assessment)
406
+
407
+ return AgentRunResponse(
408
+ messages=[ChatMessage(role=Role.ASSISTANT, text=response_text)],
409
+ response_id=f"hypothesis-{len(assessment.hypotheses)}",
410
+ additional_properties={"assessment": assessment.model_dump()},
411
+ )
412
+
413
+ def _format_response(self, assessment: HypothesisAssessment) -> str:
414
+ """Format hypothesis assessment as markdown."""
415
+ lines = ["## Generated Hypotheses\n"]
416
+
417
+ for i, h in enumerate(assessment.hypotheses, 1):
418
+ lines.append(f"### Hypothesis {i} (Confidence: {h.confidence:.0%})")
419
+ lines.append(f"**Mechanism**: {h.drug} β†’ {h.target} β†’ {h.pathway} β†’ {h.effect}")
420
+ lines.append(f"**Suggested searches**: {', '.join(h.search_suggestions)}\n")
421
+
422
+ if assessment.primary_hypothesis:
423
+ lines.append(f"### Primary Hypothesis")
424
+ h = assessment.primary_hypothesis
425
+ lines.append(f"{h.drug} β†’ {h.target} β†’ {h.pathway} β†’ {h.effect}\n")
426
+
427
+ if assessment.knowledge_gaps:
428
+ lines.append("### Knowledge Gaps")
429
+ for gap in assessment.knowledge_gaps:
430
+ lines.append(f"- {gap}")
431
+
432
+ if assessment.recommended_searches:
433
+ lines.append("\n### Recommended Next Searches")
434
+ for search in assessment.recommended_searches:
435
+ lines.append(f"- `{search}`")
436
+
437
+ return "\n".join(lines)
438
+
439
+ def _extract_query(self, messages) -> str:
440
+ """Extract query from messages."""
441
+ if isinstance(messages, str):
442
+ return messages
443
+ elif isinstance(messages, ChatMessage):
444
+ return messages.text or ""
445
+ elif isinstance(messages, list):
446
+ for msg in reversed(messages):
447
+ if isinstance(msg, ChatMessage) and msg.role == Role.USER:
448
+ return msg.text or ""
449
+ elif isinstance(msg, str):
450
+ return msg
451
+ return ""
452
+
453
+ async def run_stream(
454
+ self,
455
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
456
+ *,
457
+ thread: AgentThread | None = None,
458
+ **kwargs: Any,
459
+ ) -> AsyncIterable[AgentRunResponseUpdate]:
460
+ """Streaming wrapper."""
461
+ result = await self.run(messages, thread=thread, **kwargs)
462
+ yield AgentRunResponseUpdate(
463
+ messages=result.messages,
464
+ response_id=result.response_id
465
+ )
466
+ ```
467
+
468
+ ### 4.3 Update MagenticOrchestrator
469
+
470
+ Add HypothesisAgent to the workflow:
471
+
472
+ ```python
473
+ # In MagenticOrchestrator.__init__
474
+ self._hypothesis_agent = HypothesisAgent(self._evidence_store)
475
+
476
+ # In workflow building
477
+ workflow = (
478
+ MagenticBuilder()
479
+ .participants(
480
+ searcher=search_agent,
481
+ hypothesizer=self._hypothesis_agent, # NEW
482
+ judge=judge_agent,
483
+ )
484
+ .with_standard_manager(...)
485
+ .build()
486
+ )
487
+
488
+ # Update task instruction
489
+ task = f"""Research drug repurposing opportunities for: {query}
490
+
491
+ Workflow:
492
+ 1. SearchAgent: Find initial evidence from PubMed and web
493
+ 2. HypothesisAgent: Generate mechanistic hypotheses (Drug β†’ Target β†’ Pathway β†’ Effect)
494
+ 3. SearchAgent: Use hypothesis-suggested queries for targeted search
495
+ 4. JudgeAgent: Evaluate if evidence supports hypotheses
496
+ 5. Repeat until confident or max rounds
497
+
498
+ Focus on:
499
+ - Identifying specific molecular targets
500
+ - Understanding mechanism of action
501
+ - Finding supporting/contradicting evidence for hypotheses
502
+ """
503
+ ```
504
+
505
+ ---
506
+
507
+ ## 5. Directory Structure After Phase 7
508
+
509
+ ```
510
+ src/
511
+ β”œβ”€β”€ agents/
512
+ β”‚ β”œβ”€β”€ search_agent.py
513
+ β”‚ β”œβ”€β”€ judge_agent.py
514
+ β”‚ └── hypothesis_agent.py # NEW
515
+ β”œβ”€β”€ prompts/
516
+ β”‚ β”œβ”€β”€ judge.py
517
+ β”‚ └── hypothesis.py # NEW
518
+ β”œβ”€β”€ services/
519
+ β”‚ └── embeddings.py
520
+ └── utils/
521
+ └── models.py # Updated with hypothesis models
522
+ ```
523
+
524
+ ---
525
+
526
+ ## 6. Tests
527
+
528
+ ### 6.1 Unit Tests (`tests/unit/agents/test_hypothesis_agent.py`)
529
+
530
+ ```python
531
+ """Unit tests for HypothesisAgent."""
532
+ import pytest
533
+ from unittest.mock import AsyncMock, MagicMock, patch
534
+
535
+ from src.agents.hypothesis_agent import HypothesisAgent
536
+ from src.utils.models import Citation, Evidence, HypothesisAssessment, MechanismHypothesis
537
+
538
+
539
+ @pytest.fixture
540
+ def sample_evidence():
541
+ return [
542
+ Evidence(
543
+ content="Metformin activates AMPK, which inhibits mTOR signaling...",
544
+ citation=Citation(
545
+ source="pubmed",
546
+ title="Metformin and AMPK",
547
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
548
+ date="2023"
549
+ )
550
+ )
551
+ ]
552
+
553
+
554
+ @pytest.fixture
555
+ def mock_assessment():
556
+ return HypothesisAssessment(
557
+ hypotheses=[
558
+ MechanismHypothesis(
559
+ drug="Metformin",
560
+ target="AMPK",
561
+ pathway="mTOR inhibition",
562
+ effect="Reduced cancer cell proliferation",
563
+ confidence=0.75,
564
+ search_suggestions=["metformin AMPK cancer", "mTOR cancer therapy"]
565
+ )
566
+ ],
567
+ primary_hypothesis=None,
568
+ knowledge_gaps=["Clinical trial data needed"],
569
+ recommended_searches=["metformin clinical trial cancer"]
570
+ )
571
+
572
+
573
+ @pytest.mark.asyncio
574
+ async def test_hypothesis_agent_generates_hypotheses(sample_evidence, mock_assessment):
575
+ """HypothesisAgent should generate mechanistic hypotheses."""
576
+ store = {"current": sample_evidence, "hypotheses": []}
577
+
578
+ with patch("src.agents.hypothesis_agent.Agent") as MockAgent:
579
+ mock_result = MagicMock()
580
+ mock_result.output = mock_assessment
581
+ MockAgent.return_value.run = AsyncMock(return_value=mock_result)
582
+
583
+ agent = HypothesisAgent(store)
584
+ response = await agent.run("metformin cancer")
585
+
586
+ assert "AMPK" in response.messages[0].text
587
+ assert len(store["hypotheses"]) == 1
588
+
589
+
590
+ @pytest.mark.asyncio
591
+ async def test_hypothesis_agent_no_evidence():
592
+ """HypothesisAgent should handle empty evidence gracefully."""
593
+ store = {"current": [], "hypotheses": []}
594
+ agent = HypothesisAgent(store)
595
+
596
+ response = await agent.run("test query")
597
+
598
+ assert "No evidence" in response.messages[0].text
599
+ ```
600
+
601
+ ---
602
+
603
+ ## 7. Definition of Done
604
+
605
+ Phase 7 is **COMPLETE** when:
606
+
607
+ 1. `MechanismHypothesis` and `HypothesisAssessment` models implemented
608
+ 2. `HypothesisAgent` generates hypotheses from evidence
609
+ 3. Hypotheses stored in shared context
610
+ 4. Search queries generated from hypotheses
611
+ 5. Magentic workflow includes HypothesisAgent
612
+ 6. All unit tests pass
613
+
614
+ ---
615
+
616
+ ## 8. Value Delivered
617
+
618
+ | Before (Phase 6) | After (Phase 7) |
619
+ |------------------|-----------------|
620
+ | Reactive search | Hypothesis-driven search |
621
+ | Generic queries | Mechanism-targeted queries |
622
+ | No scientific reasoning | Drug β†’ Target β†’ Pathway β†’ Effect |
623
+ | Judge says "need more" | Hypothesis says "search for X to test Y" |
624
+
625
+ **Real example improvement:**
626
+ - Query: "metformin alzheimer"
627
+ - Before: "metformin alzheimer mechanism", "metformin brain"
628
+ - After: "metformin AMPK activation", "AMPK autophagy neurodegeneration", "autophagy amyloid clearance"
629
+
630
+ The search becomes **scientifically targeted** rather than keyword variations.
docs/implementation/08_phase_report.md ADDED
@@ -0,0 +1,854 @@
1
+ # Phase 8 Implementation Spec: Report Agent
2
+
3
+ **Goal**: Generate structured scientific reports with proper citations and methodology.
4
+ **Philosophy**: "Research isn't complete until it's communicated clearly."
5
+ **Prerequisite**: Phase 7 complete (Hypothesis Agent working)
6
+
7
+ ---
8
+
9
+ ## 1. Why Report Agent?
10
+
11
+ Current limitation: **Synthesis is basic markdown, not a scientific report.**
12
+
13
+ Current output:
14
+ ```
15
+ ## Drug Repurposing Analysis
16
+ ### Drug Candidates
17
+ - Metformin
18
+ ### Key Findings
19
+ - Some findings
20
+ ### Citations
21
+ 1. [Paper 1](url)
22
+ ```
23
+
24
+ With Report Agent:
25
+ ```
26
+ ## Executive Summary
27
+ One-paragraph summary for busy readers...
28
+
29
+ ## Research Question
30
+ Clear statement of what was investigated...
31
+
32
+ ## Methodology
33
+ - Sources searched: PubMed, DuckDuckGo
34
+ - Date range: ...
35
+ - Inclusion criteria: ...
36
+
37
+ ## Hypotheses Tested
38
+ 1. Metformin → AMPK → neuroprotection (Supported: 7 papers, Contradicted: 2)
39
+
40
+ ## Findings
41
+ ### Mechanistic Evidence
42
+ ...
43
+ ### Clinical Evidence
44
+ ...
45
+
46
+ ## Limitations
47
+ - Only English language papers
48
+ - Abstract-level analysis only
49
+
50
+ ## Conclusion
51
+ ...
52
+
53
+ ## References
54
+ Properly formatted citations...
55
+ ```
56
+
57
+ ---
58
+
59
+ ## 2. Architecture
60
+
61
+ ### Phase 8 Addition
62
+ ```
63
+ Evidence + Hypotheses + Assessment
64
+ ↓
65
+ Report Agent
66
+ ↓
67
+ Structured Scientific Report
68
+ ```
69
+
70
+ ### Report Generation Flow
71
+ ```
72
+ 1. JudgeAgent says "synthesize"
73
+ 2. Magentic Manager selects ReportAgent
74
+ 3. ReportAgent gathers:
75
+ - All evidence from shared context
76
+ - All hypotheses (supported/contradicted)
77
+ - Assessment scores
78
+ 4. ReportAgent generates structured report
79
+ 5. Final output to user
80
+ ```
81
+
82
+ ---
83
+
84
+ ## 3. Report Model
85
+
86
+ ### 3.1 Data Model (`src/utils/models.py`)
87
+
88
+ ```python
89
+ class ReportSection(BaseModel):
90
+ """A section of the research report."""
91
+ title: str
92
+ content: str
93
+ citations: list[str] = Field(default_factory=list)
94
+
95
+
96
+ class ResearchReport(BaseModel):
97
+ """Structured scientific report."""
98
+
99
+ title: str = Field(description="Report title")
100
+ executive_summary: str = Field(
101
+ description="One-paragraph summary for quick reading",
102
+ min_length=100,
103
+ max_length=500
104
+ )
105
+ research_question: str = Field(description="Clear statement of what was investigated")
106
+
107
+ methodology: ReportSection = Field(description="How the research was conducted")
108
+ hypotheses_tested: list[dict] = Field(
109
+ description="Hypotheses with supporting/contradicting evidence counts"
110
+ )
111
+
112
+ mechanistic_findings: ReportSection = Field(
113
+ description="Findings about drug mechanisms"
114
+ )
115
+ clinical_findings: ReportSection = Field(
116
+ description="Findings from clinical/preclinical studies"
117
+ )
118
+
119
+ drug_candidates: list[str] = Field(description="Identified drug candidates")
120
+ limitations: list[str] = Field(description="Study limitations")
121
+ conclusion: str = Field(description="Overall conclusion")
122
+
123
+ references: list[dict] = Field(
124
+ description="Formatted references with title, authors, source, URL"
125
+ )
126
+
127
+ # Metadata
128
+ sources_searched: list[str] = Field(default_factory=list)
129
+ total_papers_reviewed: int = 0
130
+ search_iterations: int = 0
131
+ confidence_score: float = Field(ge=0, le=1)
132
+
133
+ def to_markdown(self) -> str:
134
+ """Render report as markdown."""
135
+ sections = [
136
+ f"# {self.title}\n",
137
+ f"## Executive Summary\n{self.executive_summary}\n",
138
+ f"## Research Question\n{self.research_question}\n",
139
+ f"## Methodology\n{self.methodology.content}\n",
140
+ ]
141
+
142
+ # Hypotheses
143
+ sections.append("## Hypotheses Tested\n")
144
+ for h in self.hypotheses_tested:
145
+ status = "βœ… Supported" if h.get("supported", 0) > h.get("contradicted", 0) else "⚠️ Mixed"
146
+ sections.append(
147
+ f"- **{h['mechanism']}** ({status}): "
148
+ f"{h.get('supported', 0)} supporting, {h.get('contradicted', 0)} contradicting\n"
149
+ )
150
+
151
+ # Findings
152
+ sections.append(f"## Mechanistic Findings\n{self.mechanistic_findings.content}\n")
153
+ sections.append(f"## Clinical Findings\n{self.clinical_findings.content}\n")
154
+
155
+ # Drug candidates
156
+ sections.append("## Drug Candidates\n")
157
+ for drug in self.drug_candidates:
158
+ sections.append(f"- **{drug}**\n")
159
+
160
+ # Limitations
161
+ sections.append("## Limitations\n")
162
+ for lim in self.limitations:
163
+ sections.append(f"- {lim}\n")
164
+
165
+ # Conclusion
166
+ sections.append(f"## Conclusion\n{self.conclusion}\n")
167
+
168
+ # References
169
+ sections.append("## References\n")
170
+ for i, ref in enumerate(self.references, 1):
171
+ sections.append(
172
+ f"{i}. {ref.get('authors', 'Unknown')}. "
173
+ f"*{ref.get('title', 'Untitled')}*. "
174
+ f"{ref.get('source', '')} ({ref.get('date', '')}). "
175
+ f"[Link]({ref.get('url', '#')})\n"
176
+ )
177
+
178
+ # Metadata footer
179
+ sections.append("\n---\n")
180
+ sections.append(
181
+ f"*Report generated from {self.total_papers_reviewed} papers "
182
+ f"across {self.search_iterations} search iterations. "
183
+ f"Confidence: {self.confidence_score:.0%}*"
184
+ )
185
+
186
+ return "\n".join(sections)
187
+ ```
188
+
189
+ ---
190
+
191
+ ## 4. Implementation
192
+
193
+ ### 4.0 Citation Validation (`src/utils/citation_validator.py`)
194
+
195
+ > **🚨 CRITICAL: Why Citation Validation?**
196
+ >
197
+ > LLMs frequently **hallucinate** citations - inventing paper titles, authors, and URLs
198
+ > that don't exist. For a medical research tool, fake citations are **dangerous**.
199
+ >
200
+ > This validation layer ensures every reference in the report actually exists
201
+ > in the collected evidence.
202
+
203
+ ```python
204
+ """Citation validation to prevent LLM hallucination.
205
+
206
+ CRITICAL: Medical research requires accurate citations.
207
+ This module validates that all references exist in collected evidence.
208
+ """
209
+ import logging
210
+ from typing import TYPE_CHECKING
211
+
212
+ if TYPE_CHECKING:
213
+ from src.utils.models import Evidence, ResearchReport
214
+
215
+ logger = logging.getLogger(__name__)
216
+
217
+
218
+ def validate_references(
219
+ report: "ResearchReport",
220
+ evidence: list["Evidence"]
221
+ ) -> "ResearchReport":
222
+ """Ensure all references actually exist in collected evidence.
223
+
224
+ CRITICAL: Prevents LLM hallucination of citations.
225
+
226
+ Args:
227
+ report: The generated research report
228
+ evidence: All evidence collected during research
229
+
230
+ Returns:
231
+ Report with only valid references (hallucinated ones removed)
232
+ """
233
+ # Build set of valid URLs from evidence
234
+ valid_urls = {e.citation.url for e in evidence}
235
+ valid_titles = {e.citation.title.lower() for e in evidence}
236
+
237
+ validated_refs = []
238
+ removed_count = 0
239
+
240
+ for ref in report.references:
241
+ ref_url = ref.get("url", "")
242
+ ref_title = ref.get("title", "").lower()
243
+
244
+ # Check if URL matches collected evidence
245
+ if ref_url in valid_urls:
246
+ validated_refs.append(ref)
247
+ # Fallback: check title match (URLs might differ slightly)
248
+ elif ref_title and any(ref_title in t or t in ref_title for t in valid_titles):
249
+ validated_refs.append(ref)
250
+ else:
251
+ removed_count += 1
252
+ logger.warning(
253
+ f"Removed hallucinated reference: '{ref.get('title', 'Unknown')}' "
254
+ f"(URL: {ref_url[:50]}...)"
255
+ )
256
+
257
+ if removed_count > 0:
258
+ logger.info(
259
+ f"Citation validation removed {removed_count} hallucinated references. "
260
+ f"{len(validated_refs)} valid references remain."
261
+ )
262
+
263
+ # Update report with validated references
264
+ report.references = validated_refs
265
+ return report
266
+
267
+
268
+ def build_reference_from_evidence(evidence: "Evidence") -> dict:
269
+ """Build a properly formatted reference from evidence.
270
+
271
+ Use this to ensure references match the original evidence exactly.
272
+ """
273
+ return {
274
+ "title": evidence.citation.title,
275
+ "authors": evidence.citation.authors or ["Unknown"],
276
+ "source": evidence.citation.source,
277
+ "date": evidence.citation.date or "n.d.",
278
+ "url": evidence.citation.url,
279
+ }
280
+ ```
281
+
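+ A minimal sketch of the validator catching a fabricated entry (the `report` and `evidence` objects are assumed from the code above):
+
+ ```python
+ report.references = [
+     build_reference_from_evidence(evidence[0]),                   # real, kept
+     {"title": "Invented Paper", "url": "https://example.com/x"},  # hallucinated, removed
+ ]
+ report = validate_references(report, evidence)
+ assert len(report.references) == 1
+ ```
+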
282
+ ### 4.1 Report Prompts (`src/prompts/report.py`)
283
+
284
+ ```python
285
+ """Prompts for Report Agent."""
286
+ from src.utils.text_utils import truncate_at_sentence, select_diverse_evidence
287
+
288
+ SYSTEM_PROMPT = """You are a scientific writer specializing in drug repurposing research reports.
289
+
290
+ Your role is to synthesize evidence and hypotheses into a clear, structured report.
291
+
292
+ A good report:
293
+ 1. Has a clear EXECUTIVE SUMMARY (one paragraph, key takeaways)
294
+ 2. States the RESEARCH QUESTION clearly
295
+ 3. Describes METHODOLOGY (what was searched, how)
296
+ 4. Evaluates HYPOTHESES with evidence counts
297
+ 5. Separates MECHANISTIC and CLINICAL findings
298
+ 6. Lists specific DRUG CANDIDATES
299
+ 7. Acknowledges LIMITATIONS honestly
300
+ 8. Provides a balanced CONCLUSION
301
+ 9. Includes properly formatted REFERENCES
302
+
303
+ Write in scientific but accessible language. Be specific about evidence strength.
304
+
305
+ ─────────────────────────────────────────────────────────────────────────────
306
+ 🚨 CRITICAL CITATION REQUIREMENTS 🚨
307
+ ─────────────────────────────────────────────────────────────────────────────
308
+
309
+ You MUST follow these rules for the References section:
310
+
311
+ 1. You may ONLY cite papers that appear in the Evidence section above
312
+ 2. Every reference URL must EXACTLY match a provided evidence URL
313
+ 3. Do NOT invent, fabricate, or hallucinate any references
314
+ 4. Do NOT modify paper titles, authors, dates, or URLs
315
+ 5. If unsure about a citation, OMIT it rather than guess
316
+ 6. Copy URLs exactly as provided - do not create similar-looking URLs
317
+
318
+ VIOLATION OF THESE RULES PRODUCES DANGEROUS MISINFORMATION.
319
+ ─────────────────────────────────────────────────────────────────────────────"""
320
+
321
+
322
+ async def format_report_prompt(
323
+ query: str,
324
+ evidence: list,
325
+ hypotheses: list,
326
+ assessment: dict,
327
+ metadata: dict,
328
+ embeddings=None
329
+ ) -> str:
330
+ """Format prompt for report generation.
331
+
332
+ Includes full evidence details for accurate citation.
333
+ """
334
+ # Select diverse evidence (not arbitrary truncation)
335
+ selected = await select_diverse_evidence(
336
+ evidence, n=20, query=query, embeddings=embeddings
337
+ )
338
+
339
+ # Include FULL citation details for each evidence item
340
+ # This helps the LLM create accurate references
341
+ evidence_summary = "\n".join([
342
+ f"- **Title**: {e.citation.title}\n"
343
+ f" **URL**: {e.citation.url}\n"
344
+ f" **Authors**: {', '.join(e.citation.authors or ['Unknown'])}\n"
345
+ f" **Date**: {e.citation.date or 'n.d.'}\n"
346
+ f" **Source**: {e.citation.source}\n"
347
+ f" **Content**: {truncate_at_sentence(e.content, 200)}\n"
348
+ for e in selected
349
+ ])
350
+
351
+ hypotheses_summary = "\n".join([
352
+ f"- {h.drug} β†’ {h.target} β†’ {h.pathway} β†’ {h.effect} (Confidence: {h.confidence:.0%})"
353
+ for h in hypotheses
354
+ ]) if hypotheses else "No hypotheses generated yet."
355
+
356
+ return f"""Generate a structured research report for the following query.
357
+
358
+ ## Original Query
359
+ {query}
360
+
361
+ ## Evidence Collected ({len(selected)} papers, selected for diversity)
362
+
363
+ {evidence_summary}
364
+
365
+ ## Hypotheses Generated
366
+ {hypotheses_summary}
367
+
368
+ ## Assessment Scores
369
+ - Mechanism Score: {assessment.get('mechanism_score', 'N/A')}/10
370
+ - Clinical Evidence Score: {assessment.get('clinical_score', 'N/A')}/10
371
+ - Overall Confidence: {assessment.get('confidence', 0):.0%}
372
+
373
+ ## Metadata
374
+ - Sources Searched: {', '.join(metadata.get('sources', []))}
375
+ - Search Iterations: {metadata.get('iterations', 0)}
376
+
377
+ Generate a complete ResearchReport with all sections filled in.
378
+
379
+ REMINDER: Only cite papers from the Evidence section above. Copy URLs exactly."""
380
+ ```
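+ The two helpers imported at the top of this module (`truncate_at_sentence`, `select_diverse_evidence`) live in `src/utils/text_utils.py` but are not specified in this phase. A minimal sketch of plausible implementations follows; the async `embeddings.embed(texts)` interface and the MMR-style selection are assumptions, not part of this spec:
+
+ ```python
+ """Sketch of src/utils/text_utils.py helpers (assumed, not prescribed)."""
+ import math
+ import re
+
+
+ def truncate_at_sentence(text: str, max_chars: int) -> str:
+     """Clip text to max_chars, preferring the last full sentence boundary."""
+     if len(text) <= max_chars:
+         return text
+     clipped = text[:max_chars]
+     match = re.match(r"(?s).*[.!?]", clipped)  # greedy: ends at last ., !, or ?
+     return match.group(0) if match else clipped.rsplit(" ", 1)[0] + "..."
+
+
+ def _cosine(a: list[float], b: list[float]) -> float:
+     dot = sum(x * y for x, y in zip(a, b))
+     norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
+     return dot / norm if norm else 0.0
+
+
+ async def select_diverse_evidence(evidence, n, query, embeddings=None):
+     """Greedy max-marginal-relevance selection; falls back to first-n."""
+     if embeddings is None or len(evidence) <= n:
+         return evidence[:n]
+     vectors = await embeddings.embed([e.content for e in evidence])  # assumed API
+     query_vec = (await embeddings.embed([query]))[0]
+     chosen: list[int] = []
+     remaining = list(range(len(evidence)))
+     while remaining and len(chosen) < n:
+
+         def mmr(i: int) -> float:
+             relevance = _cosine(vectors[i], query_vec)
+             redundancy = max((_cosine(vectors[i], vectors[j]) for j in chosen), default=0.0)
+             return relevance - 0.5 * redundancy  # trade relevance against novelty
+
+         best = max(remaining, key=mmr)
+         chosen.append(best)
+         remaining.remove(best)
+     return [evidence[i] for i in chosen]
+ ```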
381
+
382
+ ### 4.2 Report Agent (`src/agents/report_agent.py`)
383
+
384
+ ```python
385
+ """Report agent for generating structured research reports."""
386
+ from collections.abc import AsyncIterable
387
+ from typing import TYPE_CHECKING, Any
388
+
389
+ from agent_framework import (
390
+ AgentRunResponse,
391
+ AgentRunResponseUpdate,
392
+ AgentThread,
393
+ BaseAgent,
394
+ ChatMessage,
395
+ Role,
396
+ )
397
+ from pydantic_ai import Agent
398
+
399
+ from src.prompts.report import SYSTEM_PROMPT, format_report_prompt
400
+ from src.utils.citation_validator import validate_references # CRITICAL
401
+ from src.utils.config import settings
402
+ from src.utils.models import Evidence, MechanismHypothesis, ResearchReport
403
+
404
+ if TYPE_CHECKING:
405
+ from src.services.embeddings import EmbeddingService
406
+
407
+
408
+ class ReportAgent(BaseAgent):
409
+ """Generates structured scientific reports from evidence and hypotheses."""
410
+
411
+ def __init__(
412
+ self,
413
+ evidence_store: dict[str, list[Evidence]],
414
+ embedding_service: "EmbeddingService | None" = None, # For diverse selection
415
+ ) -> None:
416
+ super().__init__(
417
+ name="ReportAgent",
418
+ description="Generates structured scientific research reports with citations",
419
+ )
420
+ self._evidence_store = evidence_store
421
+ self._embeddings = embedding_service
422
+ self._agent = Agent(
423
+ model=settings.llm_provider,
424
+ output_type=ResearchReport,
425
+ system_prompt=SYSTEM_PROMPT,
426
+ )
427
+
428
+ async def run(
429
+ self,
430
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
431
+ *,
432
+ thread: AgentThread | None = None,
433
+ **kwargs: Any,
434
+ ) -> AgentRunResponse:
435
+ """Generate research report."""
436
+ query = self._extract_query(messages)
437
+
438
+ # Gather all context
439
+ evidence = self._evidence_store.get("current", [])
440
+ hypotheses = self._evidence_store.get("hypotheses", [])
441
+ assessment = self._evidence_store.get("last_assessment", {})
442
+
443
+ if not evidence:
444
+ return AgentRunResponse(
445
+ messages=[ChatMessage(
446
+ role=Role.ASSISTANT,
447
+ text="Cannot generate report: No evidence collected."
448
+ )],
449
+ response_id="report-no-evidence",
450
+ )
451
+
452
+ # Build metadata
453
+ metadata = {
454
+ "sources": list(set(e.citation.source for e in evidence)),
455
+ "iterations": self._evidence_store.get("iteration_count", 0),
456
+ }
457
+
458
+ # Generate report (format_report_prompt is now async)
459
+ prompt = await format_report_prompt(
460
+ query=query,
461
+ evidence=evidence,
462
+ hypotheses=hypotheses,
463
+ assessment=assessment,
464
+ metadata=metadata,
465
+ embeddings=self._embeddings,
466
+ )
467
+
468
+ result = await self._agent.run(prompt)
469
+ report = result.output
470
+
471
+ # ═══════════════════════════════════════════════════════════════════
472
+ # 🚨 CRITICAL: Validate citations to prevent hallucination
473
+ # ═══════════════════════════════════════════════════════════════════
474
+ report = validate_references(report, evidence)
475
+
476
+ # Store validated report
477
+ self._evidence_store["final_report"] = report
478
+
479
+ # Return markdown version
480
+ return AgentRunResponse(
481
+ messages=[ChatMessage(role=Role.ASSISTANT, text=report.to_markdown())],
482
+ response_id="report-complete",
483
+ additional_properties={"report": report.model_dump()},
484
+ )
485
+
486
+ def _extract_query(self, messages) -> str:
487
+ """Extract query from messages."""
488
+ if isinstance(messages, str):
489
+ return messages
490
+ elif isinstance(messages, ChatMessage):
491
+ return messages.text or ""
492
+ elif isinstance(messages, list):
493
+ for msg in reversed(messages):
494
+ if isinstance(msg, ChatMessage) and msg.role == Role.USER:
495
+ return msg.text or ""
496
+ elif isinstance(msg, str):
497
+ return msg
498
+ return ""
499
+
500
+ async def run_stream(
501
+ self,
502
+ messages: str | ChatMessage | list[str] | list[ChatMessage] | None = None,
503
+ *,
504
+ thread: AgentThread | None = None,
505
+ **kwargs: Any,
506
+ ) -> AsyncIterable[AgentRunResponseUpdate]:
507
+ """Streaming wrapper."""
508
+ result = await self.run(messages, thread=thread, **kwargs)
509
+ yield AgentRunResponseUpdate(
510
+ messages=result.messages,
511
+ response_id=result.response_id
512
+ )
513
+ ```
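+ `validate_references` (imported above from `src.utils.citation_validator`) is not listed in this spec, but its contract is pinned down by the unit tests in Β§6.1: a reference survives only if its URL exactly matches a collected evidence URL. A minimal sketch consistent with those tests, assuming references are plain dicts as in the test fixtures:
+
+ ```python
+ """Sketch of src/utils/citation_validator.py (behavior inferred from the tests)."""
+ from src.utils.models import Evidence, ResearchReport
+
+
+ def validate_references(report: ResearchReport, evidence: list[Evidence]) -> ResearchReport:
+     """Drop every reference whose URL does not exactly match an evidence URL."""
+     allowed_urls = {e.citation.url for e in evidence}
+     kept = [ref for ref in report.references if ref.get("url") in allowed_urls]
+     return report.model_copy(update={"references": kept})
+ ```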
514
+
515
+ ### 4.3 Update MagenticOrchestrator
516
+
517
+ Add ReportAgent as the final synthesis step:
518
+
519
+ ```python
520
+ # In MagenticOrchestrator.__init__
521
+ self._report_agent = ReportAgent(self._evidence_store)
522
+
523
+ # In workflow building
524
+ workflow = (
525
+ MagenticBuilder()
526
+ .participants(
527
+ searcher=search_agent,
528
+ hypothesizer=hypothesis_agent,
529
+ judge=judge_agent,
530
+ reporter=self._report_agent, # NEW
531
+ )
532
+ .with_standard_manager(...)
533
+ .build()
534
+ )
535
+
536
+ # Update task instruction
537
+ task = f"""Research drug repurposing opportunities for: {query}
538
+
539
+ Workflow:
540
+ 1. SearchAgent: Find evidence from PubMed and web
541
+ 2. HypothesisAgent: Generate mechanistic hypotheses
542
+ 3. SearchAgent: Targeted search based on hypotheses
543
+ 4. JudgeAgent: Evaluate evidence sufficiency
544
+ 5. If sufficient β†’ ReportAgent: Generate structured research report
545
+ 6. If not sufficient β†’ Repeat from step 1 with refined queries
546
+
547
+ The final output should be a complete research report with:
548
+ - Executive summary
549
+ - Methodology
550
+ - Hypotheses tested
551
+ - Mechanistic and clinical findings
552
+ - Drug candidates
553
+ - Limitations
554
+ - Conclusion with references
555
+ """
556
+ ```
557
+
558
+ ---
559
+
560
+ ## 5. Directory Structure After Phase 8
561
+
562
+ ```
563
+ src/
564
+ β”œβ”€β”€ agents/
565
+ β”‚ β”œβ”€β”€ search_agent.py
566
+ β”‚ β”œβ”€β”€ judge_agent.py
567
+ β”‚ β”œβ”€β”€ hypothesis_agent.py
568
+ β”‚ └── report_agent.py # NEW
569
+ β”œβ”€β”€ prompts/
570
+ β”‚ β”œβ”€β”€ judge.py
571
+ β”‚ β”œβ”€β”€ hypothesis.py
572
+ β”‚ └── report.py # NEW
573
+ β”œβ”€β”€ services/
574
+ β”‚ └── embeddings.py
575
+ └── utils/
576
+ β”œβ”€β”€ citation_validator.py # Validates references against evidence (Β§4.2)
+ β”œβ”€β”€ text_utils.py # truncate_at_sentence, select_diverse_evidence
+ └── models.py # Updated with report models
577
+ ```
578
+
579
+ ---
580
+
581
+ ## 6. Tests
582
+
583
+ ### 6.1 Unit Tests (`tests/unit/agents/test_report_agent.py`)
584
+
585
+ ```python
586
+ """Unit tests for ReportAgent."""
587
+ import pytest
588
+ from unittest.mock import AsyncMock, MagicMock, patch
589
+
590
+ from src.agents.report_agent import ReportAgent
591
+ from src.utils.models import (
592
+ Citation, Evidence, MechanismHypothesis,
593
+ ResearchReport, ReportSection
594
+ )
595
+
596
+
597
+ @pytest.fixture
598
+ def sample_evidence():
599
+ return [
600
+ Evidence(
601
+ content="Metformin activates AMPK...",
602
+ citation=Citation(
603
+ source="pubmed",
604
+ title="Metformin mechanisms",
605
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
606
+ date="2023",
607
+ authors=["Smith J", "Jones A"]
608
+ )
609
+ )
610
+ ]
611
+
612
+
613
+ @pytest.fixture
614
+ def sample_hypotheses():
615
+ return [
616
+ MechanismHypothesis(
617
+ drug="Metformin",
618
+ target="AMPK",
619
+ pathway="mTOR inhibition",
620
+ effect="Neuroprotection",
621
+ confidence=0.8,
622
+ search_suggestions=[]
623
+ )
624
+ ]
625
+
626
+
627
+ @pytest.fixture
628
+ def mock_report():
629
+ return ResearchReport(
630
+ title="Drug Repurposing Analysis: Metformin for Alzheimer's",
631
+ executive_summary="This report analyzes metformin as a potential...",
632
+ research_question="Can metformin be repurposed for Alzheimer's disease?",
633
+ methodology=ReportSection(
634
+ title="Methodology",
635
+ content="Searched PubMed and web sources..."
636
+ ),
637
+ hypotheses_tested=[
638
+ {"mechanism": "Metformin β†’ AMPK β†’ neuroprotection", "supported": 5, "contradicted": 1}
639
+ ],
640
+ mechanistic_findings=ReportSection(
641
+ title="Mechanistic Findings",
642
+ content="Evidence suggests AMPK activation..."
643
+ ),
644
+ clinical_findings=ReportSection(
645
+ title="Clinical Findings",
646
+ content="Limited clinical data available..."
647
+ ),
648
+ drug_candidates=["Metformin"],
649
+ limitations=["Abstract-level analysis only"],
650
+ conclusion="Metformin shows promise...",
651
+ references=[],
652
+ sources_searched=["pubmed", "web"],
653
+ total_papers_reviewed=10,
654
+ search_iterations=3,
655
+ confidence_score=0.75
656
+ )
657
+
658
+
659
+ @pytest.mark.asyncio
660
+ async def test_report_agent_generates_report(
661
+ sample_evidence, sample_hypotheses, mock_report
662
+ ):
663
+ """ReportAgent should generate structured report."""
664
+ store = {
665
+ "current": sample_evidence,
666
+ "hypotheses": sample_hypotheses,
667
+ "last_assessment": {"mechanism_score": 8, "clinical_score": 6}
668
+ }
669
+
670
+ with patch("src.agents.report_agent.Agent") as MockAgent:
671
+ mock_result = MagicMock()
672
+ mock_result.output = mock_report
673
+ MockAgent.return_value.run = AsyncMock(return_value=mock_result)
674
+
675
+ agent = ReportAgent(store)
676
+ response = await agent.run("metformin alzheimer")
677
+
678
+ assert "Executive Summary" in response.messages[0].text
679
+ assert "Methodology" in response.messages[0].text
680
+ assert "References" in response.messages[0].text
681
+
682
+
683
+ @pytest.mark.asyncio
684
+ async def test_report_agent_no_evidence():
685
+ """ReportAgent should handle empty evidence gracefully."""
686
+ store = {"current": [], "hypotheses": []}
687
+ agent = ReportAgent(store)
688
+
689
+ response = await agent.run("test query")
690
+
691
+ assert "Cannot generate report" in response.messages[0].text
692
+
693
+
694
+ # ═══════════════════════════════════════════════════════════════════════════
695
+ # 🚨 CRITICAL: Citation Validation Tests
696
+ # ═══════════════════════════════════════════════════════════════════════════
697
+
698
+ def test_report_agent_removes_hallucinated_citations(sample_evidence):
+ """validate_references should remove citations whose URLs are not in the evidence."""
701
+ from src.utils.citation_validator import validate_references
702
+
703
+ # Create report with mix of valid and hallucinated references
704
+ report_with_hallucinations = ResearchReport(
705
+ title="Test Report",
706
+ executive_summary="This is a test report for citation validation...",
707
+ research_question="Testing citation validation",
708
+ methodology=ReportSection(title="Methodology", content="Test"),
709
+ hypotheses_tested=[],
710
+ mechanistic_findings=ReportSection(title="Mechanistic", content="Test"),
711
+ clinical_findings=ReportSection(title="Clinical", content="Test"),
712
+ drug_candidates=["TestDrug"],
713
+ limitations=["Test limitation"],
714
+ conclusion="Test conclusion",
715
+ references=[
716
+ # Valid reference (matches sample_evidence)
717
+ {
718
+ "title": "Metformin mechanisms",
719
+ "url": "https://pubmed.ncbi.nlm.nih.gov/12345/",
720
+ "authors": ["Smith J", "Jones A"],
721
+ "date": "2023",
722
+ "source": "pubmed"
723
+ },
724
+ # HALLUCINATED reference (URL doesn't exist in evidence)
725
+ {
726
+ "title": "Fake Paper That Doesn't Exist",
727
+ "url": "https://fake-journal.com/made-up-paper",
728
+ "authors": ["Hallucinated A"],
729
+ "date": "2024",
730
+ "source": "fake"
731
+ },
732
+ # Another HALLUCINATED reference
733
+ {
734
+ "title": "Invented Research",
735
+ "url": "https://pubmed.ncbi.nlm.nih.gov/99999999/",
736
+ "authors": ["NotReal B"],
737
+ "date": "2025",
738
+ "source": "pubmed"
739
+ }
740
+ ],
741
+ sources_searched=["pubmed"],
742
+ total_papers_reviewed=1,
743
+ search_iterations=1,
744
+ confidence_score=0.5
745
+ )
746
+
747
+ # Validate - should remove hallucinated references
748
+ validated_report = validate_references(report_with_hallucinations, sample_evidence)
749
+
750
+ # Only the valid reference should remain
751
+ assert len(validated_report.references) == 1
752
+ assert validated_report.references[0]["title"] == "Metformin mechanisms"
753
+ assert "Fake Paper" not in str(validated_report.references)
754
+
755
+
756
+ def test_citation_validator_handles_empty_references():
757
+ """Citation validator should handle reports with no references."""
758
+ from src.utils.citation_validator import validate_references
759
+
760
+ report = ResearchReport(
761
+ title="Empty Refs Report",
762
+ executive_summary="This report has no references...",
763
+ research_question="Testing empty refs",
764
+ methodology=ReportSection(title="Methodology", content="Test"),
765
+ hypotheses_tested=[],
766
+ mechanistic_findings=ReportSection(title="Mechanistic", content="Test"),
767
+ clinical_findings=ReportSection(title="Clinical", content="Test"),
768
+ drug_candidates=[],
769
+ limitations=[],
770
+ conclusion="Test",
771
+ references=[], # Empty!
772
+ sources_searched=[],
773
+ total_papers_reviewed=0,
774
+ search_iterations=0,
775
+ confidence_score=0.0
776
+ )
777
+
778
+ validated = validate_references(report, [])
779
+ assert validated.references == []
780
+ ```
781
+
782
+ ---
783
+
784
+ ## 7. Definition of Done
785
+
786
+ Phase 8 is **COMPLETE** when:
787
+
788
+ 1. `ResearchReport` model implemented with all sections
789
+ 2. `ReportAgent` generates structured reports
790
+ 3. Reports include methodology and citations that survive `validate_references` (no fabricated URLs)
791
+ 4. Magentic workflow uses ReportAgent for final synthesis
792
+ 5. Report renders as clean markdown
793
+ 6. All unit tests pass
794
+
795
+ ---
796
+
797
+ ## 8. Value Delivered
798
+
799
+ | Before (Phase 7) | After (Phase 8) |
800
+ |------------------|-----------------|
801
+ | Basic synthesis | Structured scientific report |
802
+ | Simple bullet points | Executive summary + methodology |
803
+ | List of citations | Formatted references |
804
+ | No methodology | Clear research process |
805
+ | No limitations | Honest limitations section |
806
+
807
+ **Sample output comparison:**
808
+
809
+ Before:
810
+ ```
811
+ ## Analysis
812
+ - Metformin might help
813
+ - Found 5 papers
814
+ [Link 1] [Link 2]
815
+ ```
816
+
817
+ After:
818
+ ```
819
+ # Drug Repurposing Analysis: Metformin for Alzheimer's Disease
820
+
821
+ ## Executive Summary
822
+ Analysis of 15 papers suggests metformin may provide neuroprotection
823
+ through AMPK activation. Mechanistic evidence is strong (8/10),
824
+ while clinical evidence is moderate (6/10)...
825
+
826
+ ## Methodology
827
+ Systematic search of PubMed and web sources using queries...
828
+
829
+ ## Hypotheses Tested
830
+ - βœ… Metformin β†’ AMPK β†’ neuroprotection (7 supporting, 2 contradicting)
831
+
832
+ ## References
833
+ 1. Smith J, Jones A. *Metformin mechanisms*. Nature (2023). [Link](...)
834
+ ```
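+ The rendering above is produced by `ResearchReport.to_markdown()`, which this spec calls but never defines. A minimal sketch over the fields used throughout this document (the exact formatting and the `_bullets` helper are assumptions):
+
+ ```python
+ def _bullets(items) -> str:
+     return "\n".join(f"- {item}" for item in items) if items else "None."
+
+
+ def to_markdown(self) -> str:  # method on ResearchReport
+     refs = "\n".join(
+         f"{i}. {', '.join(r.get('authors', []))}. *{r.get('title')}* "
+         f"({r.get('date', 'n.d.')}). [Link]({r.get('url')})"
+         for i, r in enumerate(self.references, 1)
+     ) or "No verifiable references."
+     sections = [
+         f"# {self.title}",
+         f"## Executive Summary\n{self.executive_summary}",
+         f"## {self.methodology.title}\n{self.methodology.content}",
+         f"## Hypotheses Tested\n{_bullets(self.hypotheses_tested)}",
+         f"## {self.mechanistic_findings.title}\n{self.mechanistic_findings.content}",
+         f"## {self.clinical_findings.title}\n{self.clinical_findings.content}",
+         f"## Drug Candidates\n{_bullets(self.drug_candidates)}",
+         f"## Limitations\n{_bullets(self.limitations)}",
+         f"## Conclusion\n{self.conclusion}",
+         f"## References\n{refs}",
+     ]
+     return "\n\n".join(sections)
+ ```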
835
+
836
+ ---
837
+
838
+ ## 9. Complete Magentic Architecture (Phases 5-8)
839
+
840
+ ```
841
+ User Query
842
+ ↓
843
+ Gradio UI
844
+ ↓
845
+ Magentic Manager (LLM Coordinator)
846
+ β”œβ”€β”€ SearchAgent ←→ PubMed + Web + VectorDB
847
+ β”œβ”€β”€ HypothesisAgent ←→ Mechanistic Reasoning
848
+ β”œβ”€β”€ JudgeAgent ←→ Evidence Assessment
849
+ └── ReportAgent ←→ Final Synthesis
850
+ ↓
851
+ Structured Research Report
852
+ ```
853
+
854
+ **This matches Mario's diagram**, with practical agents that add real value to drug repurposing research.
docs/implementation/roadmap.md CHANGED
@@ -115,26 +115,96 @@ tests/
115
 
116
  ---
117
 
118
- ### **Phase 5: Magentic Integration (OPTIONAL - Post-MVP)**
119
 
120
  *Goal: Upgrade orchestrator to use Microsoft Agent Framework patterns.*
121
 
122
- - [ ] Wrap SearchHandler as `AgentProtocol` (SearchAgent) with strict protocol compliance.
123
- - [ ] Wrap JudgeHandler as `AgentProtocol` (JudgeAgent) with strict protocol compliance.
124
- - [ ] Implement `MagenticOrchestrator` using `MagenticBuilder`.
125
- - [ ] Create factory pattern for switching implementations.
126
  - **Deliverable**: Same API, better multi-agent orchestration engine.
127
 
128
- **NOTE**: Only implement Phase 5 if time permits after MVP is shipped.

129
 
130
  ---
131
 
132
  ## Spec Documents
133
 
134
- 1. **[Phase 1 Spec: Foundation](01_phase_foundation.md)**
135
- 2. **[Phase 2 Spec: Search Slice](02_phase_search.md)**
136
- 3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)**
137
- 4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)**
138
- 5. **[Phase 5 Spec: Magentic Integration](05_phase_magentic.md)** *(Optional)*

139
 
140
- *Start by reading Phase 1 Spec to initialize the repo.*
 
115
 
116
  ---
117
 
118
+ ### **Phase 5: Magentic Integration** βœ… COMPLETE
119
 
120
  *Goal: Upgrade orchestrator to use Microsoft Agent Framework patterns.*
121
 
122
+ - [x] Wrap SearchHandler as `AgentProtocol` (SearchAgent) with strict protocol compliance.
123
+ - [x] Wrap JudgeHandler as `AgentProtocol` (JudgeAgent) with strict protocol compliance.
124
+ - [x] Implement `MagenticOrchestrator` using `MagenticBuilder`.
125
+ - [x] Create factory pattern for switching implementations.
126
  - **Deliverable**: Same API, better multi-agent orchestration engine.
127
 
128
+ ---
129
+
130
+ ### **Phase 6: Embeddings & Semantic Search**
131
+
132
+ *Goal: Add vector search for semantic evidence retrieval.*
133
+
134
+ - [ ] Implement `EmbeddingService` with ChromaDB (see the sketch after this list).
135
+ - [ ] Add semantic deduplication to SearchAgent.
136
+ - [ ] Enable semantic search for related evidence.
137
+ - [ ] Store embeddings in shared context.
138
+ - **Deliverable**: Find semantically related papers, not just keyword matches.
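+
+ A minimal sketch of what the `EmbeddingService` item could look like with ChromaDB (the class shape, method names, and default embedding function are assumptions, not the final API):
+
+ ```python
+ """Sketch of a ChromaDB-backed EmbeddingService (interface assumed)."""
+ import chromadb
+ from chromadb.utils import embedding_functions
+
+
+ class EmbeddingService:
+     def __init__(self, collection_name: str = "evidence") -> None:
+         self._client = chromadb.Client()  # in-memory; use PersistentClient to persist
+         self._embed_fn = embedding_functions.DefaultEmbeddingFunction()
+         self._collection = self._client.get_or_create_collection(collection_name)
+
+     async def embed(self, texts: list[str]) -> list[list[float]]:
+         """Embed raw texts (the interface Phase 8's diverse selection assumes)."""
+         return self._embed_fn(texts)
+
+     def add(self, doc_id: str, content: str, metadata: dict | None = None) -> None:
+         """Store evidence for semantic retrieval and deduplication."""
+         self._collection.add(
+             ids=[doc_id],
+             documents=[content],
+             metadatas=[metadata] if metadata else None,
+         )
+
+     def search(self, query: str, n_results: int = 5) -> list[str]:
+         """Return the most semantically similar stored documents."""
+         result = self._collection.query(query_texts=[query], n_results=n_results)
+         return result["documents"][0]
+ ```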
139
+
140
+ ---
141
+
142
+ ### **Phase 7: Hypothesis Agent**
143
+
144
+ *Goal: Generate scientific hypotheses to guide targeted searches.*
145
+
146
+ - [ ] Implement `MechanismHypothesis` and `HypothesisAssessment` models (hypothesis model sketched after this list).
147
+ - [ ] Implement `HypothesisAgent` for mechanistic reasoning.
148
+ - [ ] Add hypothesis-driven search queries.
149
+ - [ ] Integrate into Magentic workflow.
150
+ - **Deliverable**: Drug β†’ Target β†’ Pathway β†’ Effect hypotheses that guide research.
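+
+ The hypothesis model's shape is already visible in the Phase 8 test fixtures; a sketch with that field set (the validation bounds are an assumption):
+
+ ```python
+ """Sketch of the Phase 7 hypothesis model (fields mirror Phase 8 fixtures)."""
+ from pydantic import BaseModel, Field
+
+
+ class MechanismHypothesis(BaseModel):
+     """Drug β†’ Target β†’ Pathway β†’ Effect chain with a confidence estimate."""
+
+     drug: str
+     target: str
+     pathway: str
+     effect: str
+     confidence: float = Field(ge=0.0, le=1.0)
+     search_suggestions: list[str] = Field(default_factory=list)
+ ```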
151
+
152
+ ---
153
+
154
+ ### **Phase 8: Report Agent**
155
+
156
+ *Goal: Generate structured scientific reports with proper citations.*
157
+
158
+ - [ ] Implement `ResearchReport` model with all sections.
159
+ - [ ] Implement `ReportAgent` for synthesis.
160
+ - [ ] Include methodology, limitations, formatted references.
161
+ - [ ] Integrate as final synthesis step in Magentic workflow.
162
+ - **Deliverable**: Publication-quality research reports.
163
+
164
+ ---
165
+
166
+ ## Complete Architecture (Phases 1-8)
167
+
168
+ ```
169
+ User Query
170
+ ↓
171
+ Gradio UI (Phase 4)
172
+ ↓
173
+ Magentic Manager (Phase 5)
174
+ β”œβ”€β”€ SearchAgent (Phase 2+5) ←→ PubMed + Web + VectorDB (Phase 6)
175
+ β”œβ”€β”€ HypothesisAgent (Phase 7) ←→ Mechanistic Reasoning
176
+ β”œβ”€β”€ JudgeAgent (Phase 3+5) ←→ Evidence Assessment
177
+ └── ReportAgent (Phase 8) ←→ Final Synthesis
178
+ ↓
179
+ Structured Research Report
180
+ ```
181
 
182
  ---
183
 
184
  ## Spec Documents
185
 
186
+ 1. **[Phase 1 Spec: Foundation](01_phase_foundation.md)** βœ…
187
+ 2. **[Phase 2 Spec: Search Slice](02_phase_search.md)** βœ…
188
+ 3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)** βœ…
189
+ 4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)** βœ…
190
+ 5. **[Phase 5 Spec: Magentic Integration](05_phase_magentic.md)** βœ…
191
+ 6. **[Phase 6 Spec: Embeddings & Semantic Search](06_phase_embeddings.md)**
192
+ 7. **[Phase 7 Spec: Hypothesis Agent](07_phase_hypothesis.md)**
193
+ 8. **[Phase 8 Spec: Report Agent](08_phase_report.md)**
194
+
195
+ ---
196
+
197
+ ## Progress Summary
198
+
199
+ | Phase | Status | Deliverable |
200
+ |-------|--------|-------------|
201
+ | Phase 1: Foundation | βœ… COMPLETE | CI-ready repo with uv/pytest |
202
+ | Phase 2: Search | βœ… COMPLETE | PubMed + Web search |
203
+ | Phase 3: Judge | βœ… COMPLETE | LLM evidence assessment |
204
+ | Phase 4: UI & Loop | βœ… COMPLETE | Working Gradio app |
205
+ | Phase 5: Magentic | βœ… COMPLETE | Multi-agent orchestration |
206
+ | Phase 6: Embeddings | πŸ“ SPEC READY | Semantic search |
207
+ | Phase 7: Hypothesis | πŸ“ SPEC READY | Mechanistic reasoning |
208
+ | Phase 8: Report | πŸ“ SPEC READY | Structured reports |
209
 
210
+ *Phases 1-5 completed in ONE DAY. Phases 6-8 specs ready for implementation.*