# Hospital Copilot — Development Log

**Hackathon:** Gemma 4 for Good  
**Team:** Ricky (fredrickandoh17@gmail.com)  
**Stack:** Python · Gradio · Gemma 4 · faster-whisper · ChromaDB · SQLite  
**Started:** 2026-05-16

---

## Project Goal

Build an AI clinical assistant that listens to doctor-patient consultations and automatically produces:
- Live transcription of the conversation
- Structured symptom extraction (symptoms, medications, duration, allergies, follow-up actions)
- SOAP notes grounded with real ICD-10 codes and drug dosages
- Plain-language patient summary
- Structured patient records saved to a local database

**Why:** Reduce doctor burnout from paperwork, improve care quality, and support healthcare workers in low-resource settings like Ghana.

---

## Architecture Overview

```
Microphone
  └─► faster-whisper (STT, local CPU)       → raw transcript
        └─► Gemma 4 26B cloud (speaker labelling) → Doctor:/Patient: transcript
              ├─► Gemma 4 E2B via Ollama (symptom JSON)  → local CPU
              └─► ChromaDB + MiniLM (RAG retrieval)      → ICD-10 codes + drug info
                    └─► Gemma 4 26B cloud (SOAP note, patient summary)
                          └─► SQLite (patients, sessions, notes, symptoms)
                                └─► Gradio UI
```

---

## Features Implemented

### Core Pipeline
| Feature | Status | Implementation |
|---|---|---|
| Live mic transcription | ✅ | faster-whisper `small` model, 3s chunks, VAD filter |
| Speaker diarization | ✅ | Gemma 4 post-hoc Doctor:/Patient: labelling |
| Symptom extraction | ✅ | Gemma 4 E2B via Ollama — JSON: chief complaint, symptoms, duration, severity, medications, allergies, vitals, history, follow-up actions |
| RAG ICD-10 retrieval | ✅ | ChromaDB + all-MiniLM-L6-v2, 90+ Ghana-relevant codes |
| RAG drug grounding | ✅ | ChromaDB, 40+ WHO Essential Medicines with dosages |
| SOAP note generation | ✅ | Gemma 4 26B cloud, RAG context injected into prompt |
| Patient summary | ✅ | Gemma 4 26B cloud, plain English |
| Patient records (SQLite) | ✅ | patients, sessions, notes, symptoms tables |
| Patient registration | ✅ | Name, DOB, gender, phone |
| Records viewer | ✅ | Load any patient's most recent session |

### Translation (Twi/Akan)
| Status | Note |
|---|---|
| ⏸️ Paused | Gemma 4 returned 500 INTERNAL errors on Twi translation. Suspected root cause: Twi is a low-resource language and Gemma 4 is not purpose-built for it. Decision: switch to NLLB-200 (Meta's No Language Left Behind model), which was trained specifically on low-resource languages including Akan/Twi. Deferred until the core pipeline is stable. |

### Gemma 4 Advanced Features (Added 2026-05-18)
| Feature | Status | Implementation |
|---|---|---|
| **Reasoning mode (thinking)** | ✅ | `ThinkingConfig(thinking_budget=2048, include_thoughts=False)` on SOAP generation — Gemma 4 reasons step-by-step internally before writing the note |
| **Function calling (symptom extraction)** | ✅ | `FunctionDeclaration` schema with `FunctionCallingMode.ANY` — guaranteed valid structured output, no JSON parsing |
| **Multimodal image/document analysis** | ✅ | `Part.from_bytes()` with lab result / prescription images — extracted findings injected into SOAP context |

---

## Technical Decisions

### 1. Multi-agent Gemma 4 architecture
**Decision:** Use multiple specialised Gemma 4 instances rather than one large model for everything.  
**Reasoning:** Different tasks have different speed/accuracy requirements:
- Symptom extraction: needs to be fast, structured JSON → small local model (E2B)
- SOAP notes: needs medical reasoning and long output → large cloud model (26B)
- Speaker labelling: needs language understanding → cloud model
- Embeddings: needs speed, runs every session → lightweight MiniLM locally

### 2. Local vs cloud split
**Decision:** Run small models locally (Ollama E2B, Whisper, MiniLM, ChromaDB), large inference on cloud API.  
**Reasoning:** The development machine has no GPU. CPU-only local inference is viable for small quantised models (Q4_K_M gemma4:e2b runs at ~5-10 tok/s). Large models (26B+) are impractical on CPU, so the cloud API serves them at acceptable latency.

### 3. RAG with ChromaDB + MiniLM
**Decision:** Use local vector store over calling the cloud model with full knowledge base in prompt.  
**Reasoning:**
- Injecting 70k ICD-10 codes into every prompt would exceed context limits and cost tokens
- Local ChromaDB persists to disk, so after the first build there is no re-indexing and no network round-trip per query
- MiniLM-L6-v2 (~80MB) gives good semantic similarity for medical terms on CPU
- Retrieves top-5 most relevant codes per consultation — keeps prompt tight and accurate
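
The retrieval step described above could look roughly like the sketch below. It is not the project's `rag/retriever.py`: the collection name `icd10`, the entry layout, and the helper names are assumptions made for illustration. The ChromaDB and sentence-transformers imports are kept inside the function so the formatting helper can be used without them.

```python
def build_collection(path: str, entries: list[dict]):
    """Create or reopen a persistent collection of ICD-10 descriptions.

    `entries` is assumed to look like
    [{"code": "B54", "description": "Unspecified malaria"}, ...].
    """
    import chromadb
    from chromadb.utils import embedding_functions

    embed = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )
    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection("icd10", embedding_function=embed)
    if collection.count() == 0:  # only embed on the first build
        collection.add(
            ids=[e["code"] for e in entries],
            documents=[e["description"] for e in entries],
            metadatas=[{"code": e["code"]} for e in entries],
        )
    return collection


def format_rag_context(results: dict) -> str:
    """Turn a single-query ChromaDB result into bullet lines for the prompt."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    return "\n".join(f"- {m['code']}: {d}" for m, d in zip(metas, docs))


def retrieve_codes(collection, query: str, k: int = 5) -> str:
    """Return the top-k most similar codes as prompt-ready text."""
    return format_rag_context(collection.query(query_texts=[query], n_results=k))
```

Only the short bullet list returned by `retrieve_codes()` is injected into the SOAP prompt, which is what keeps the prompt small regardless of how many codes are indexed.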

### 4. Gradio over Streamlit
**Decision:** Use Gradio for the UI.  
**Reasoning:** Gradio has better support for streaming, audio, and timer-based polling. Streamlit's re-run model makes real-time transcript updates difficult. Gradio's `gr.Timer` makes 2-second polling trivial.
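
As a rough illustration of the polling approach, the sketch below pairs a thread-safe buffer (filled by the transcription thread) with a `gr.Timer` that refreshes a textbox every 2 seconds. `TranscriptBuffer`, `BUFFER`, and `build_ui` are hypothetical names, not taken from `app.py`.

```python
import threading


class TranscriptBuffer:
    """Collects transcript chunks from a background thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._parts = []

    def append(self, text: str) -> None:
        with self._lock:
            self._parts.append(text.strip())

    def read(self) -> str:
        with self._lock:
            return " ".join(p for p in self._parts if p)


BUFFER = TranscriptBuffer()


def build_ui():
    """Build a minimal Blocks app that polls the buffer every 2 seconds."""
    import gradio as gr  # imported lazily so the buffer is usable without Gradio

    with gr.Blocks() as demo:
        transcript = gr.Textbox(label="Live transcript", lines=10)
        timer = gr.Timer(2)  # fires a tick event every 2 seconds
        timer.tick(fn=BUFFER.read, outputs=transcript)
    return demo
```

Because the UI only reads from the buffer on each tick, the transcription thread never needs to touch Gradio components directly.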

### 5. Gemma 4 reasoning mode — temperature requirement
**Decision:** Set `temperature=1.0` when `thinking_config` is enabled, not `0.3`.  
**Reasoning:** Google's API requires `temperature=1.0` when using `ThinkingConfig`; lower values raise an error. The internal reasoning pass stabilises the final answer, so output quality is not degraded despite the higher temperature. Added a graceful fallback: if the model doesn't support thinking (e.g. an older model version), retry without `thinking_config`.
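
The fallback can be sketched independently of the SDK. Here `generate` is a hypothetical callable that wraps the actual API call and accepts a plain config dict; the thinking values mirror the ones listed in this log, and `0.3` is assumed to be the non-thinking default.

```python
from typing import Callable, Tuple

THINKING_CONFIG = {
    "temperature": 1.0,  # lower values were rejected with thinking enabled
    "thinking_config": {"thinking_budget": 2048, "include_thoughts": False},
}
PLAIN_CONFIG = {"temperature": 0.3}  # assumed default without thinking


def generate_soap(generate: Callable[[str, dict], str], prompt: str) -> Tuple[str, bool]:
    """Try with thinking enabled, then retry without it if the call fails.

    Returns (text, used_thinking).
    """
    try:
        return generate(prompt, THINKING_CONFIG), True
    except Exception:
        # e.g. a model version that does not support thinking
        return generate(prompt, PLAIN_CONFIG), False
```

Returning the `used_thinking` flag makes it easy to log how often the fallback path is taken.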

### 6. Function calling mode = ANY
**Decision:** Use `FunctionCallingMode.ANY` (force the model to always call the function) rather than `AUTO`.  
**Reasoning:** `AUTO` mode allows the model to optionally use the function or just return text — unreliable for extraction tasks. `ANY` mode guarantees the model returns a structured function call every time, eliminating the JSON parse errors we had with the prompt-based approach.
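
A minimal sketch of forced function calling with the `google-genai` SDK follows. The function name `record_symptoms` and the reduced field list are assumptions; the real schema in `cloud_agents.py` covers more fields. The mode is passed as the string `"ANY"`, which the SDK maps to its function-calling mode enum.

```python
# Hypothetical, reduced parameter schema for the extraction function.
SYMPTOM_PARAMETERS = {
    "type": "OBJECT",
    "properties": {
        "chief_complaint": {"type": "STRING"},
        "symptoms": {"type": "ARRAY", "items": {"type": "STRING"}},
        "duration": {"type": "STRING"},
        "medications": {"type": "ARRAY", "items": {"type": "STRING"}},
        "allergies": {"type": "ARRAY", "items": {"type": "STRING"}},
    },
    "required": ["chief_complaint", "symptoms"],
}


def extract_symptoms_cloud(client, model: str, transcript: str) -> dict:
    """Ask the model to answer only via a function call and return its args."""
    from google.genai import types  # lazy import; requires google-genai

    declaration = types.FunctionDeclaration(
        name="record_symptoms",
        description="Record the clinical facts stated in a consultation transcript.",
        parameters=SYMPTOM_PARAMETERS,
    )
    config = types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=[declaration])],
        tool_config=types.ToolConfig(
            function_calling_config=types.FunctionCallingConfig(mode="ANY")
        ),
    )
    response = client.models.generate_content(
        model=model, contents=transcript, config=config
    )
    calls = response.function_calls or []
    # Defensive guard: return an empty dict if no call came back.
    return dict(calls[0].args or {}) if calls else {}
```

The arguments arrive as an already-parsed dict, so the caller never handles raw JSON text.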

### 7. Symptom extraction: local first, cloud fallback
**Decision:** Keep Gemma 4 E2B (Ollama, local) as primary for symptom extraction, cloud function calling as fallback.  
**Reasoning:** Preserves the "local AI, privacy-preserving" story for the hackathon. Cloud fallback ensures reliability when Ollama returns malformed JSON or fails. Both paths return the same dict structure.
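
The local-first routing might be sketched as follows. `local_fn` and `cloud_fn` stand in for the Ollama and cloud extractors, and `REQUIRED_KEYS` is an assumed minimum for treating the local output as usable.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"chief_complaint", "symptoms"}  # assumed minimum fields


def parse_symptom_json(raw: str) -> Optional[dict]:
    """Return the parsed dict, or None if the text is not usable JSON."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
        return data
    return None


def extract_symptoms(
    transcript: str,
    local_fn: Callable[[str], str],
    cloud_fn: Callable[[str], dict],
) -> dict:
    """Prefer the local model; use the cloud path on failure or bad JSON."""
    try:
        parsed = parse_symptom_json(local_fn(transcript))
    except Exception:
        parsed = None  # e.g. Ollama not running or timing out
    return parsed if parsed is not None else cloud_fn(transcript)
```

Both branches return a dict with the same keys, so downstream code does not need to know which path produced it.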

### 8. Transcript repair before downstream processing
**Problem:** faster-whisper `small` on CPU makes errors: it mishears medical terms, drops punctuation, and produces run-on sentences. Downstream models (symptom extraction, SOAP generation) produce lower-quality output when given a garbled transcript.  
**Decision:** Add a `clean_and_label_transcript()` step using Gemma 4 cloud that repairs ASR errors and labels speakers in a single API call. It runs after `stop_consultation()` and before any downstream processing.  
**What it fixes:** Incorrect drug names, missing punctuation, filler words (um/uh), run-on sentences, garbled medical terminology.  
**What it preserves:** All clinical facts: symptoms, medications, durations, dosages. It must never add or invent information.  
**Why one call:** Combining repair and labelling saves one API round-trip and is cheaper than two separate calls.

### 9. Speaker diarization: Gemma 4 post-hoc vs pyannote-audio
**Decision:** Use Gemma 4 cloud to infer Doctor/Patient labels from transcript text.  
**Reasoning:**
- `pyannote-audio` requires a Hugging Face account, model license acceptance, and token setup
- For a hackathon demo, Gemma 4 inference from linguistic context is good enough
- Doctors and patients have very different speech patterns (questions vs symptom descriptions) that Gemma 4 reliably distinguishes
- Can always upgrade to pyannote later

### 10. SQLite for storage
**Decision:** Local SQLite over PostgreSQL or cloud database.  
**Reasoning:** Desktop app, no server, no network dependency. SQLite is reliable, zero-config, and sufficient for demo-scale data. Schema: patients → sessions → notes + symptoms.
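
The patients → sessions → notes + symptoms layout could be expressed roughly as below. Column names are assumptions (the patient fields follow the registration form: name, DOB, gender, phone); the actual schema lives in `database/db.py`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS patients (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dob TEXT,
    gender TEXT,
    phone TEXT
);
CREATE TABLE IF NOT EXISTS sessions (
    id INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patients(id),
    started_at TEXT DEFAULT CURRENT_TIMESTAMP,
    transcript TEXT
);
CREATE TABLE IF NOT EXISTS notes (
    id INTEGER PRIMARY KEY,
    session_id INTEGER NOT NULL REFERENCES sessions(id),
    soap_note TEXT,
    patient_summary TEXT
);
CREATE TABLE IF NOT EXISTS symptoms (
    id INTEGER PRIMARY KEY,
    session_id INTEGER NOT NULL REFERENCES sessions(id),
    payload_json TEXT
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def latest_session_id(conn: sqlite3.Connection, patient_id: int):
    """Return the id of the patient's most recent session, or None."""
    row = conn.execute(
        "SELECT id FROM sessions WHERE patient_id = ? ORDER BY id DESC LIMIT 1",
        (patient_id,),
    ).fetchone()
    return row[0] if row else None
```

`latest_session_id()` corresponds to what the records viewer needs today; listing all sessions for a patient would be the same query without the `LIMIT`.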

### 11. Whisper model: small over base
**Decision:** Upgrade from `base` to `small` Whisper model.  
**Reasoning:** `base` had poor accuracy on real speech, especially medical terminology. `small` is ~4x more accurate on medical vocabulary and still runs acceptably on CPU (~2-3x slower than base but real-time viable with 3-second chunking). `medium` was considered but too slow for live demo.
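
For reference, 3-second chunking with the `small` model might look like the sketch below. It assumes 16 kHz mono audio and `int8` quantisation on CPU; the helper names are illustrative and not taken from `transcription/transcriber.py`.

```python
SAMPLE_RATE = 16_000   # assumed capture rate (Whisper expects 16 kHz mono)
CHUNK_SECONDS = 3


def split_chunks(samples, sample_rate: int = SAMPLE_RATE, seconds: int = CHUNK_SECONDS):
    """Split a sample sequence into consecutive fixed-length chunks."""
    size = sample_rate * seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def load_model(size: str = "small"):
    """Load a CPU-only faster-whisper model (int8 is an assumed setting)."""
    from faster_whisper import WhisperModel  # lazy import

    return WhisperModel(size, device="cpu", compute_type="int8")


def transcribe_chunk(model, chunk) -> str:
    """Transcribe one chunk (a float32 NumPy array) with the VAD filter on."""
    segments, _info = model.transcribe(chunk, vad_filter=True)
    return " ".join(seg.text.strip() for seg in segments)
```

Each 3-second chunk holds 48,000 samples at 16 kHz; the last chunk of a recording is simply shorter.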

---

## Issues Encountered & Resolutions

### Issue 1: `google-generativeai` deprecated
**Error:** `FutureWarning: All support for the google.generativeai package has ended`  
**Root cause:** Google deprecated the old `google-generativeai` SDK in favour of `google-genai`  
**Resolution:** Replaced `google-generativeai` with `google-genai>=1.0.0` in requirements. Updated `cloud_agents.py` to use `from google import genai` and `genai.Client()` pattern.

### Issue 2: Wrong Gemma 4 cloud model name
**Error:** `404 NOT_FOUND: models/gemma-4-27b-it is not found`  
**Root cause:** Model name `gemma-4-27b-it` does not exist on Google AI Studio API.  
**Resolution:** Listed available models via API (`client.models.list()`). Correct names are:
- `gemma-4-26b-a4b-it` (26B MoE, faster)
- `gemma-4-31b-it` (31B dense, most capable)

Updated the default in `cloud_agents.py` and `.env`.

### Issue 3: Twi translation 500 INTERNAL error
**Error:** `500 INTERNAL: Internal error encountered` on `translate_to_twi()`  
**Root cause:** Gemma 4 struggles with Twi (Akan) — a low-resource language with limited training data. The model likely has insufficient Twi coverage to translate medical content reliably, causing server-side failures.  
**Resolution (temporary):** Removed Twi translation from the pipeline. Added try/except guards around all cloud agent calls so one failure doesn't break the entire `generate_notes()` flow.  
**Planned fix:** Integrate NLLB-200 (`facebook/nllb-200-distilled-600M`) — Meta's purpose-built model for 200 low-resource languages including Akan/Twi.
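
The guard pattern can be sketched as a small helper that runs each agent call independently and records failures instead of raising. `run_guarded` and the step names are hypothetical; the real guards sit inside `generate_notes()`.

```python
from typing import Callable, Dict, Tuple


def run_guarded(steps: Dict[str, Callable[[], str]]) -> Tuple[dict, dict]:
    """Run every step; collect outputs and error messages separately."""
    outputs, errors = {}, {}
    for name, step in steps.items():
        try:
            outputs[name] = step()
        except Exception as exc:
            # One failing agent must not abort the remaining steps.
            errors[name] = f"{type(exc).__name__}: {exc}"
    return outputs, errors
```

With this shape, a failing translation step leaves the SOAP note and patient summary intact, and the UI can show the collected error messages next to the missing panel.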

### Issue 4: Ollama version too old for Gemma 4
**Error:** `Error: pull model manifest: 412: The model you are attempting to pull requires a newer version of Ollama`  
**Root cause:** System Ollama was v0.19.0. Gemma 4 requires a newer version.  
**Resolution:** Reinstalled Ollama via the official install script (`curl -fsSL https://ollama.com/install.sh | sh`), then restarted the service with `sudo systemctl restart ollama`. Note: Linux package managers (snap, apt) may ship outdated Ollama versions, so prefer the official script.

### Issue 5: `chromadb.PersistentClient | None` TypeError
**Error:** `TypeError: unsupported operand type(s) for |: 'function' and 'NoneType'`  
**Root cause:** `chromadb.PersistentClient` is a factory function, not a class. An `X | None` annotation is evaluated when the module is loaded, and the `|` operator is not defined between a function and `None`, so the expression raises.  
**Resolution:** Added `from __future__ import annotations` to `rag/retriever.py` — this makes all annotations lazy (strings at runtime), bypassing the evaluation issue.
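
The failure can be reproduced without ChromaDB. In the sketch below, `persistent_client` is a stand-in factory function, and a quoted annotation is used to show the lazy behaviour: `from __future__ import annotations` applies the same string treatment to every annotation in the file.

```python
def persistent_client(path: str) -> dict:
    """Stand-in for a factory function such as chromadb.PersistentClient."""
    return {"path": path}


def eager_annotation_fails() -> bool:
    """Return True if evaluating `function | None` raises TypeError."""
    try:
        persistent_client | None  # what an eagerly evaluated annotation does
    except TypeError:
        return True
    return False


# A quoted annotation is stored as a string and never evaluated here.
_client: "persistent_client | None" = None


def get_client(path: str = "./chroma") -> "persistent_client | None":
    """Create the client on first use and reuse it afterwards."""
    global _client
    if _client is None:
        _client = persistent_client(path)
    return _client
```

An alternative that keeps annotations meaningful to type checkers is to annotate with the returned client type rather than the factory function itself.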

### Issue 6: White empty boxes in UI (RAG panels)
**Issue:** `gr.Markdown` components rendered as white boxes on dark Gradio theme, even when empty.  
**Root cause:** Gradio's default light background on Markdown components clashes with the dark theme. Empty panels had no content but still showed as white rectangles.  
**Resolution:** Moved RAG panels (ICD-10, Drug Reference, Symptoms) into `gr.Accordion` components. Accordions collapse when not needed and have theme-consistent styling. Also added CSS `background: transparent` for markdown panels.

### Issue 7: Gemma 4 image input, wrong contents structure
**Error:** `500 INTERNAL` then `Part.from_text() takes 1 positional argument but 2 were given`  
**Root cause:** Two sequential mistakes in the multimodal contents format:
  1. The first attempt wrapped parts in `types.Content(role="user", parts=[...])`, which is not needed
  2. The second called `types.Part.from_text(IMAGE_PROMPT)` with a positional argument; as the error message indicates, the method only accepts `text` as a keyword argument

**Resolution:** Per the Gemma 4 guide at philschmid.de/gemma-4-gemini-api, the simplest correct format is a plain list mixing `Part.from_bytes()` and a raw string:
```python
contents=[
    types.Part.from_bytes(data=file_bytes, mime_type=mime_type),
    IMAGE_PROMPT,   # plain string, not Part.from_text()
]
```
All Gemma 4 models (including 26B and 31B) are fully multimodal. The initial 500 error was caused by the wrong content structure, not a model limitation.

### Issue 8: pyannote-audio abandoned in favour of Gemma 4
**Decision made:** Started implementing pyannote-audio for speaker diarization, then stopped.  
**Reason:** Gemma 4 post-hoc labelling was confirmed to be sufficient for the demo. pyannote requires a Hugging Face account, model license acceptance, and a heavy torch dependency. For medical conversations, language-based inference can even be more reliable, because it uses *context* (doctors ask questions, patients describe symptoms) rather than the raw audio signal, which can fail when two speakers have similar voices.

### Issue 9: Gradio CSS parameter deprecation warning
**Warning:** `UserWarning: The parameters have been moved from the Blocks constructor to the launch() method`  
**Root cause:** Gradio 6.0 moved the `css` parameter from `gr.Blocks(css=...)` to `demo.launch(css=...)`.  
**Resolution:** Moved `css=CSS` to `demo.launch(...)`.

### Issue 10: uv installing to wrong Python version
**Issue:** `chromadb` and `sentence-transformers` installed but not importable from venv.  
**Root cause:** The venv was created with Python 3.11 (via uv) but system also has Python 3.12. Running `uv pip install` without specifying the environment installed to the wrong location.  
**Resolution:** Used `VIRTUAL_ENV=/path/to/.venv uv pip install ...` to target the correct venv, or used `/path/to/.venv/bin/python -m pip install ...`.

---

## What Was Considered and Rejected

| Option | Rejected because |
|---|---|
| Streamlit UI | Real-time transcript polling is awkward in Streamlit's re-run model |
| PostgreSQL storage | Overkill for desktop demo; SQLite is zero-config |
| pyannote-audio diarization | Requires HF account + model license; too much setup for hackathon timeline |
| Full 70k ICD-10 dataset | Too large to embed in demo time; curated Ghana-relevant subset is more impactful |
| Running everything on cloud API | Wanted to demonstrate hybrid local+cloud multi-agent architecture |
| Whisper `large-v3` | Too slow on CPU for real-time; `small` is the sweet spot |
| Gemma 4 for Twi translation | Low-resource language; model returned 500 errors. NLLB-200 is the right tool |

---

## Remaining Work / Roadmap

- [ ] **Twi translation via NLLB-200** — integrate `facebook/nllb-200-distilled-600M` locally
- [ ] **PDF export** — export SOAP note + patient summary as printable PDF (fpdf2 already in deps)
- [ ] **Multi-session history** — view all past sessions for a patient, not just the most recent
- [ ] **Upgrade to Whisper `medium`** if demo machine is fast enough
- [ ] **ICD-10 code expansion** — add full 70k code dataset for production use
- [ ] **MedGemma** — self-host `medgemma-4b-it` or `medgemma-27b-it` for higher-accuracy medical image analysis
- [ ] **Long-context patient history** — load all previous session notes into SOAP prompt for longitudinal care reasoning

---

## File Structure

```
hosptial_copilot/
├── app.py                          Main Gradio app + UI
├── agents/
│   ├── cloud_agents.py             Gemma 4 cloud: SOAP, summary, speaker labelling
│   └── symptom_agent.py            Gemma 4 E2B local: symptom JSON extraction
├── transcription/
│   └── transcriber.py              faster-whisper live mic streaming
├── rag/
│   ├── retriever.py                ChromaDB + MiniLM embedding + retrieval
│   └── data/
│       ├── icd10_common.json       90+ ICD-10 codes (Ghana-relevant)
│       └── essential_medicines.json 40+ WHO Essential Medicines
├── database/
│   └── db.py                       SQLite schema + helpers
├── requirements.txt
├── .env.example
├── .gitignore
├── README.md
└── DEVLOG.md                       This file
```

---

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `GEMINI_API_KEY` | — | Google AI Studio API key (required) |
| `WHISPER_MODEL` | `small` | Whisper model size: tiny/base/small/medium/large-v3 |
| `OLLAMA_MODEL` | `gemma4:e2b` | Local Ollama model for symptom extraction |
| `CLOUD_MODEL` | `gemma-4-26b-a4b-it` | Google AI Studio model name |
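
One way to read these variables with the defaults from the table is sketched below. The `Settings` dataclass and `load_settings()` are illustrative names, not code from the project; loading the `.env` file itself is assumed to happen elsewhere.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    whisper_model: str = "small"
    ollama_model: str = "gemma4:e2b"
    cloud_model: str = "gemma-4-26b-a4b-it"


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is required")
    return Settings(
        gemini_api_key=key,
        whisper_model=env.get("WHISPER_MODEL", "small"),
        ollama_model=env.get("OLLAMA_MODEL", "gemma4:e2b"),
        cloud_model=env.get("CLOUD_MODEL", "gemma-4-26b-a4b-it"),
    )
```

Failing fast on a missing `GEMINI_API_KEY` surfaces the problem at startup rather than on the first cloud call.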