Spaces:
Sleeping
Sleeping
File size: 16,502 Bytes
c32bf13 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 | # Hospital Copilot β Development Log
**Hackathon:** Gemma 4 for Good
**Team:** Ricky (fredrickandoh17@gmail.com)
**Stack:** Python Β· Gradio Β· Gemma 4 Β· faster-whisper Β· ChromaDB Β· SQLite
**Started:** 2026-05-16
---
## Project Goal
Build an AI clinical assistant that listens to doctor-patient consultations and automatically produces:
- Live transcription of the conversation
- Structured symptom extraction (symptoms, medications, duration, allergies, follow-up actions)
- SOAP notes grounded with real ICD-10 codes and drug dosages
- Plain-language patient summary
- Structured patient records saved to a local database
**Why:** Reduce doctor burnout from paperwork, improve care quality, and support healthcare workers in low-resource settings like Ghana.
---
## Architecture Overview
```
Microphone
βββΊ faster-whisper (STT, local CPU) β raw transcript
βββΊ Gemma 4 26B cloud (speaker labelling) β Doctor:/Patient: transcript
βββΊ Gemma 4 E2B via Ollama (symptom JSON) β local CPU
βββΊ ChromaDB + MiniLM (RAG retrieval) β ICD-10 codes + drug info
βββΊ Gemma 4 26B cloud (SOAP note, patient summary)
βββΊ SQLite (patients, sessions, notes, symptoms)
βββΊ Gradio UI
```
---
## Features Implemented
### Core Pipeline
| Feature | Status | Implementation |
|---|---|---|
| Live mic transcription | β
| faster-whisper `small` model, 3s chunks, VAD filter |
| Speaker diarization | β
| Gemma 4 post-hoc Doctor:/Patient: labelling |
| Symptom extraction | β
| Gemma 4 E2B via Ollama β JSON: chief complaint, symptoms, duration, severity, medications, allergies, vitals, history, follow-up actions |
| RAG ICD-10 retrieval | β
| ChromaDB + all-MiniLM-L6-v2, 90+ Ghana-relevant codes |
| RAG drug grounding | β
| ChromaDB, 40+ WHO Essential Medicines with dosages |
| SOAP note generation | β
| Gemma 4 26B cloud, RAG context injected into prompt |
| Patient summary | β
| Gemma 4 26B cloud, plain English |
| Patient records (SQLite) | β
| patients, sessions, notes, symptoms tables |
| Patient registration | β
| Name, DOB, gender, phone |
| Records viewer | β
| Load any patient's most recent session |
### Translation (Twi/Akan)
| Status | Note |
|---|---|
| βΈοΈ Paused | Gemma 4 returned 500 INTERNAL errors on Twi translation. Identified root cause: Twi is a low-resource language and Gemma 4 is not purpose-built for it. Decision: implement NLLB-200 (Meta's No Language Left Behind model) which was specifically trained on Akan/Twi. Deferred until core pipeline is stable. |
### Gemma 4 Advanced Features (Added 2026-05-18)
| Feature | Status | Implementation |
|---|---|---|
| **Reasoning mode (thinking)** | β
| `ThinkingConfig(thinking_budget=2048, include_thoughts=False)` on SOAP generation β Gemma 4 reasons step-by-step internally before writing the note |
| **Function calling (symptom extraction)** | β
| `FunctionDeclaration` schema with `FunctionCallingMode.ANY` β guaranteed valid structured output, no JSON parsing |
| **Multimodal image/document analysis** | β
| `Part.from_bytes()` with lab result / prescription images β extracted findings injected into SOAP context |
---
## Technical Decisions
### 1. Multi-agent Gemma 4 architecture
**Decision:** Use multiple specialised Gemma 4 instances rather than one large model for everything.
**Reasoning:** Different tasks have different speed/accuracy requirements:
- Symptom extraction: needs to be fast, structured JSON β small local model (E2B)
- SOAP notes: needs medical reasoning and long output β large cloud model (26B)
- Speaker labelling: needs language understanding β cloud model
- Embeddings: needs speed, runs every session β lightweight MiniLM locally
### 2. Local vs cloud split
**Decision:** Run small models locally (Ollama E2B, Whisper, MiniLM, ChromaDB), large inference on cloud API.
**Reasoning:** User has no GPU. CPU-only local inference is viable for small quantised models (Q4_K_M gemma4:e2b runs at ~5-10 tok/s). Large models (26B+) are impractical on CPU β cloud API provides them at acceptable latency.
### 3. RAG with ChromaDB + MiniLM
**Decision:** Use local vector store over calling the cloud model with full knowledge base in prompt.
**Reasoning:**
- Injecting 70k ICD-10 codes into every prompt would exceed context limits and cost tokens
- Local ChromaDB persists to disk, zero latency after first build
- MiniLM-L6-v2 (~80MB) gives good semantic similarity for medical terms on CPU
- Retrieves top-5 most relevant codes per consultation β keeps prompt tight and accurate
### 4. Gradio over Streamlit
**Decision:** Use Gradio for the UI.
**Reasoning:** Gradio has better support for streaming, audio, and timer-based polling. Streamlit's re-run model makes real-time transcript updates difficult. Gradio's `gr.Timer` makes 2-second polling trivial.
### 5. Gemma 4 reasoning mode β temperature requirement
**Decision:** Set `temperature=1.0` when `thinking_config` is enabled, not `0.3`.
**Reasoning:** Google's API requires temperature=1.0 when using ThinkingConfig β lower values raise an error. The thinking process itself introduces determinism so output quality is not degraded. Added graceful fallback: if the model doesn't support thinking (e.g. older model version), retry without `thinking_config`.
### 6. Function calling mode = ANY
**Decision:** Use `FunctionCallingMode.ANY` (force the model to always call the function) rather than `AUTO`.
**Reasoning:** `AUTO` mode allows the model to optionally use the function or just return text β unreliable for extraction tasks. `ANY` mode guarantees the model returns a structured function call every time, eliminating the JSON parse errors we had with the prompt-based approach.
### 7. Symptom extraction: local first, cloud fallback
**Decision:** Keep Gemma 4 E2B (Ollama, local) as primary for symptom extraction, cloud function calling as fallback.
**Reasoning:** Preserves the "local AI, privacy-preserving" story for the hackathon. Cloud fallback ensures reliability when Ollama returns malformed JSON or fails. Both paths return the same dict structure.
### 8. Transcript repair before downstream processing
**Problem:** faster-whisper `small` on CPU makes errors β mishears medical terms, missing punctuation, run-on sentences. Downstream models (symptom extraction, SOAP generation) produce lower quality output when given a garbled transcript.
**Decision:** Add a `clean_and_label_transcript()` step using Gemma 4 cloud that simultaneously repairs ASR errors AND labels speakers in one API call. This runs after `stop_consultation()` before any downstream processing.
**What it fixes:** Incorrect drug names, missing punctuation, filler words (um/uh), run-on sentences, garbled medical terminology.
**What it preserves:** All clinical facts β symptoms, medications, durations, dosages. Never adds or invents information.
**Why one call:** Combining repair + labelling saves one API round-trip and is cheaper than two separate calls.
### 9. Speaker diarization: Gemma 4 post-hoc vs pyannote-audio
**Decision:** Use Gemma 4 cloud to infer Doctor/Patient labels from transcript text.
**Reasoning:**
- `pyannote-audio` requires HuggingFace account, model license acceptance, and token setup
- For a hackathon demo, Gemma 4 inference from linguistic context is good enough
- Doctors and patients have very different speech patterns (questions vs symptom descriptions) that Gemma 4 reliably distinguishes
- Can always upgrade to pyannote later
### 6. SQLite for storage
**Decision:** Local SQLite over PostgreSQL or cloud database.
**Reasoning:** Desktop app, no server, no network dependency. SQLite is reliable, zero-config, and sufficient for demo-scale data. Schema: patients β sessions β notes + symptoms.
### 7. Whisper model: small over base
**Decision:** Upgrade from `base` to `small` Whisper model.
**Reasoning:** `base` had poor accuracy on real speech, especially medical terminology. `small` is ~4x more accurate on medical vocabulary and still runs acceptably on CPU (~2-3x slower than base but real-time viable with 3-second chunking). `medium` was considered but too slow for live demo.
---
## Issues Encountered & Resolutions
### Issue 1: `google-generativeai` deprecated
**Error:** `FutureWarning: All support for the google.generativeai package has ended`
**Root cause:** Google deprecated the old `google-generativeai` SDK in favour of `google-genai`
**Resolution:** Replaced `google-generativeai` with `google-genai>=1.0.0` in requirements. Updated `cloud_agents.py` to use `from google import genai` and `genai.Client()` pattern.
### Issue 2: Wrong Gemma 4 cloud model name
**Error:** `404 NOT_FOUND: models/gemma-4-27b-it is not found`
**Root cause:** Model name `gemma-4-27b-it` does not exist on Google AI Studio API.
**Resolution:** Listed available models via API (`client.models.list()`). Correct names are:
- `gemma-4-26b-a4b-it` (26B MoE, faster)
- `gemma-4-31b-it` (31B dense, most capable)
Updated default in `cloud_agents.py` and `.env`.
### Issue 3: Twi translation 500 INTERNAL error
**Error:** `500 INTERNAL: Internal error encountered` on `translate_to_twi()`
**Root cause:** Gemma 4 struggles with Twi (Akan) β a low-resource language with limited training data. The model likely has insufficient Twi coverage to translate medical content reliably, causing server-side failures.
**Resolution (temporary):** Removed Twi translation from the pipeline. Added try/except guards around all cloud agent calls so one failure doesn't break the entire `generate_notes()` flow.
**Planned fix:** Integrate NLLB-200 (`facebook/nllb-200-distilled-600M`) β Meta's purpose-built model for 200 low-resource languages including Akan/Twi.
### Issue 4: Ollama version too old for Gemma 4
**Error:** `Error: pull model manifest: 412: The model you are attempting to pull requires a newer version of Ollama`
**Root cause:** System Ollama was v0.19.0. Gemma 4 requires a newer version.
**Resolution:** Reinstall Ollama via the official install script: `curl -fsSL https://ollama.com/install.sh | sh` then `sudo systemctl restart ollama`. Note: Linux package managers (snap, apt) ship outdated Ollama versions β always use the curl script.
### Issue 5: `chromadb.PersistentClient | None` TypeError
**Error:** `TypeError: unsupported operand type(s) for |: 'function' and 'NoneType'`
**Root cause:** `chromadb.PersistentClient` is a factory function, not a class. Using it in a `X | None` type annotation evaluates at runtime and fails.
**Resolution:** Added `from __future__ import annotations` to `rag/retriever.py` β this makes all annotations lazy (strings at runtime), bypassing the evaluation issue.
### Issue 6: White empty boxes in UI (RAG panels)
**Issue:** `gr.Markdown` components rendered as white boxes on dark Gradio theme, even when empty.
**Root cause:** Gradio's default light background on Markdown components clashes with the dark theme. Empty panels had no content but still showed as white rectangles.
**Resolution:** Moved RAG panels (ICD-10, Drug Reference, Symptoms) into `gr.Accordion` components. Accordions collapse when not needed and have theme-consistent styling. Also added CSS `background: transparent` for markdown panels.
### Issue 9: Gemma 4 image input β wrong contents structure
**Error:** `500 INTERNAL` then `Part.from_text() takes 1 positional argument but 2 were given`
**Root cause:** Two sequential mistakes in the multimodal contents format:
1. First attempt wrapped parts in `types.Content(role="user", parts=[...])` β not needed
2. Used `types.Part.from_text(IMAGE_PROMPT)` β this method does not exist in the SDK
**Resolution:** Per official Gemma 4 docs (philschmid.de/gemma-4-gemini-api), the correct format is a plain list mixing `Part.from_bytes()` and a raw string:
```python
contents=[
types.Part.from_bytes(data=file_bytes, mime_type=mime_type),
IMAGE_PROMPT, # plain string, not Part.from_text()
]
```
All Gemma 4 models (including 26B and 31B) are fully multimodal. The initial 500 error was caused by the wrong content structure, not a model limitation.
### Issue 10: pyannote-audio abandoned in favour of Gemma 4
**Decision made:** Started implementing pyannote-audio for speaker diarization, then stopped.
**Reason:** User confirmed Gemma 4 post-hoc labelling is sufficient for the demo. pyannote requires HuggingFace account, model license acceptance, and heavy torch dependency. Gemma 4 language-based inference is actually more reliable for medical conversations because it uses *context* (doctors ask questions, patients describe symptoms) rather than raw audio signal (which can fail when two speakers have similar voices).
### Issue 10: Gradio CSS parameter deprecation warning
**Warning:** `UserWarning: The parameters have been moved from the Blocks constructor to the launch() method`
**Root cause:** Gradio 6.0 moved `css` parameter from `gr.Blocks(css=...)` to `demo.launch(css=...)`.
**Resolution:** Moved `css=CSS` to `demo.launch(...)`.
### Issue 8: uv installing to wrong Python version
**Issue:** `chromadb` and `sentence-transformers` installed but not importable from venv.
**Root cause:** The venv was created with Python 3.11 (via uv) but system also has Python 3.12. Running `uv pip install` without specifying the environment installed to the wrong location.
**Resolution:** Used `VIRTUAL_ENV=/path/to/.venv uv pip install ...` to target the correct venv, or used `/path/to/.venv/bin/python -m pip install ...`.
---
## What Was Considered and Rejected
| Option | Rejected because |
|---|---|
| Streamlit UI | Real-time transcript polling is awkward in Streamlit's re-run model |
| PostgreSQL storage | Overkill for desktop demo; SQLite is zero-config |
| pyannote-audio diarization | Requires HF account + model license; too much setup for hackathon timeline |
| Full 70k ICD-10 dataset | Too large to embed in demo time; curated Ghana-relevant subset is more impactful |
| Running everything on cloud API | Wanted to demonstrate hybrid local+cloud multi-agent architecture |
| Whisper `large-v3` | Too slow on CPU for real-time; `small` is the sweet spot |
| Gemma 4 for Twi translation | Low-resource language; model returned 500 errors. NLLB-200 is the right tool |
---
## Remaining Work / Roadmap
- [ ] **Twi translation via NLLB-200** β integrate `facebook/nllb-200-distilled-600M` locally
- [ ] **PDF export** β export SOAP note + patient summary as printable PDF (fpdf2 already in deps)
- [ ] **Multi-session history** β view all past sessions for a patient, not just the most recent
- [ ] **Upgrade to Whisper `medium`** if demo machine is fast enough
- [ ] **ICD-10 code expansion** β add full 70k code dataset for production use
- [ ] **MedGemma** β self-host `medgemma-4b-it` or `medgemma-27b-it` for higher-accuracy medical image analysis
- [ ] **Long-context patient history** β load all previous session notes into SOAP prompt for longitudinal care reasoning
---
## File Structure
```
hosptial_copilot/
βββ app.py Main Gradio app + UI
βββ agents/
β βββ cloud_agents.py Gemma 4 cloud: SOAP, summary, speaker labelling
β βββ symptom_agent.py Gemma 4 E2B local: symptom JSON extraction
βββ transcription/
β βββ transcriber.py faster-whisper live mic streaming
βββ rag/
β βββ retriever.py ChromaDB + MiniLM embedding + retrieval
β βββ data/
β βββ icd10_common.json 90+ ICD-10 codes (Ghana-relevant)
β βββ essential_medicines.json 40+ WHO Essential Medicines
βββ database/
β βββ db.py SQLite schema + helpers
βββ requirements.txt
βββ .env.example
βββ .gitignore
βββ README.md
βββ DEVLOG.md This file
```
---
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `GEMINI_API_KEY` | β | Google AI Studio API key (required) |
| `WHISPER_MODEL` | `small` | Whisper model size: tiny/base/small/medium/large-v3 |
| `OLLAMA_MODEL` | `gemma4:e2b` | Local Ollama model for symptom extraction |
| `CLOUD_MODEL` | `gemma-4-26b-a4b-it` | Google AI Studio model name |
|