# Hospital Copilot: Development Log

**Hackathon:** Gemma 4 for Good
**Team:** Ricky (fredrickandoh17@gmail.com)
**Stack:** Python · Gradio · Gemma 4 · faster-whisper · ChromaDB · SQLite
**Started:** 2026-05-16

---

## Project Goal

Build an AI clinical assistant that listens to doctor-patient consultations and automatically produces:

- Live transcription of the conversation
- Structured symptom extraction (symptoms, medications, duration, allergies, follow-up actions)
- SOAP notes grounded with real ICD-10 codes and drug dosages
- Plain-language patient summary
- Structured patient records saved to a local database

**Why:** Reduce doctor burnout from paperwork, improve care quality, and support healthcare workers in low-resource settings like Ghana.

---

## Architecture Overview

```
Microphone
  ├─► faster-whisper (STT, local CPU) → raw transcript
  ├─► Gemma 4 26B cloud (speaker labelling) → Doctor:/Patient: transcript
  ├─► Gemma 4 E2B via Ollama (symptom JSON, local CPU)
  ├─► ChromaDB + MiniLM (RAG retrieval) → ICD-10 codes + drug info
  ├─► Gemma 4 26B cloud (SOAP note, patient summary)
  ├─► SQLite (patients, sessions, notes, symptoms)
  └─► Gradio UI
```

---

## Features Implemented

### Core Pipeline

| Feature | Status | Implementation |
|---|---|---|
| Live mic transcription | ✅ | faster-whisper `small` model, 3s chunks, VAD filter |
| Speaker diarization | ✅ | Gemma 4 post-hoc Doctor:/Patient: labelling |
| Symptom extraction | ✅ | Gemma 4 E2B via Ollama → JSON: chief complaint, symptoms, duration, severity, medications, allergies, vitals, history, follow-up actions |
| RAG ICD-10 retrieval | ✅ | ChromaDB + all-MiniLM-L6-v2, 90+ Ghana-relevant codes |
| RAG drug grounding | ✅ | ChromaDB, 40+ WHO Essential Medicines with dosages |
| SOAP note generation | ✅ | Gemma 4 26B cloud, RAG context injected into prompt |
| Patient summary | ✅ | Gemma 4 26B cloud, plain English |
| Patient records (SQLite) | ✅ | patients, sessions, notes, symptoms tables |
| Patient registration | ✅ | Name, DOB, gender, phone |
| Records viewer | ✅ | Load any patient's most recent session |

### Translation (Twi/Akan)

| Status | Note |
|---|---|
| ⏸️ Paused | Gemma 4 returned `500 INTERNAL` errors on Twi translation. Likely root cause: Twi is a low-resource language and Gemma 4 is not purpose-built for it. Decision: switch to NLLB-200 (Meta's No Language Left Behind model), whose supported languages include Akan/Twi. Deferred until the core pipeline is stable. |

### Gemma 4 Advanced Features (Added 2026-05-18)

| Feature | Status | Implementation |
|---|---|---|
| **Reasoning mode (thinking)** | ✅ | `ThinkingConfig(thinking_budget=2048, include_thoughts=False)` on SOAP generation; Gemma 4 reasons step-by-step internally before writing the note |
| **Function calling (symptom extraction)** | ✅ | `FunctionDeclaration` schema with `FunctionCallingMode.ANY`: guaranteed valid structured output, no JSON parsing |
| **Multimodal image/document analysis** | ✅ | `Part.from_bytes()` with lab result / prescription images → extracted findings injected into SOAP context |

---

## Technical Decisions

### 1. Multi-agent Gemma 4 architecture

**Decision:** Use multiple specialised Gemma 4 instances rather than one large model for everything.

**Reasoning:** Different tasks have different speed/accuracy requirements:

- Symptom extraction: needs to be fast, structured JSON → small local model (E2B)
- SOAP notes: needs medical reasoning and long output → large cloud model (26B)
- Speaker labelling: needs language understanding → cloud model
- Embeddings: needs speed, runs every session → lightweight MiniLM locally

### 2. Local vs cloud split

**Decision:** Run small models locally (Ollama E2B, Whisper, MiniLM, ChromaDB), large inference on cloud API.

**Reasoning:** User has no GPU. CPU-only local inference is viable for small quantised models (Q4_K_M gemma4:e2b runs at ~5-10 tok/s). Large models (26B+) are impractical on CPU; the cloud API provides them at acceptable latency.

### 3. RAG with ChromaDB + MiniLM

**Decision:** Use a local vector store rather than sending the full knowledge base to the cloud model in every prompt.

**Reasoning:**

- Injecting 70k ICD-10 codes into every prompt would exceed context limits and cost tokens
- Local ChromaDB persists to disk, so after the first build there is no rebuild cost and no network round-trip
- MiniLM-L6-v2 (~80MB) gives good semantic similarity for medical terms on CPU
- Retrieving only the top-5 most relevant codes per consultation keeps the prompt tight and accurate

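To make the top-k retrieval step concrete, here is a minimal sketch of nearest-neighbour lookup by cosine similarity. It is not the project's retriever: the real pipeline uses ChromaDB with all-MiniLM-L6-v2 embeddings, whereas this uses hand-written three-dimensional toy vectors and plain Python. The names `cosine`, `top_k`, `toy_index` and the vector values are assumptions made up for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], index: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda key: cosine(query, index[key]), reverse=True)
    return ranked[:k]

# Toy "embeddings" (made up); a real index would hold MiniLM vectors.
toy_index = {
    "B54 Unspecified malaria": [0.9, 0.1, 0.0],
    "J18.9 Pneumonia, unspecified organism": [0.1, 0.9, 0.1],
    "I10 Essential (primary) hypertension": [0.0, 0.1, 0.9],
}
fever_query = [0.8, 0.3, 0.0]  # pretend embedding of "fever and chills"
best = top_k(fever_query, toy_index, k=2)
```

Only the entries in `best` would be pasted into the SOAP prompt, which is what keeps the prompt short regardless of how large the index grows.
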
### 4. Gradio over Streamlit

**Decision:** Use Gradio for the UI.

**Reasoning:** Gradio has better support for streaming, audio, and timer-based polling. Streamlit's re-run model makes real-time transcript updates difficult. Gradio's `gr.Timer` makes 2-second polling trivial.

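The polling approach needs some shared state that the transcription thread writes and the UI timer reads. The sketch below shows one way to do that with a lock-protected buffer. `TranscriptBuffer` and its methods are hypothetical names, not taken from `app.py`, and the Gradio wiring is left out so the block runs with the standard library alone.

```python
import threading

class TranscriptBuffer:
    """Thread-safe text buffer: the transcriber appends, the UI polls."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._chunks: list[str] = []

    def append(self, text: str) -> None:
        """Called from the transcription thread for each decoded chunk."""
        cleaned = text.strip()
        if cleaned:
            with self._lock:
                self._chunks.append(cleaned)

    def snapshot(self) -> str:
        """Called by the polling handler; returns the transcript so far."""
        with self._lock:
            return " ".join(self._chunks)

buffer = TranscriptBuffer()
buffer.append("Good morning, what brings you in today? ")
buffer.append("   ")  # empty chunks (e.g. silence) are ignored
buffer.append("I have had a fever for three days.")
current_text = buffer.snapshot()
```

In the app, a handler attached to the 2-second timer would presumably just return `buffer.snapshot()` to the transcript component.
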
### 5. Gemma 4 reasoning mode: temperature requirement

**Decision:** Set `temperature=1.0` when `thinking_config` is enabled, not `0.3`.

**Reasoning:** Google's API requires `temperature=1.0` when using `ThinkingConfig`; lower values raise an error. The thinking step appears to stabilise the final answer, so output quality was not noticeably degraded. Added a graceful fallback: if the model doesn't support thinking (e.g. an older model version), retry without `thinking_config`.

### 6. Function calling mode = ANY

**Decision:** Use `FunctionCallingMode.ANY` (force the model to always call the function) rather than `AUTO`.

**Reasoning:** `AUTO` mode lets the model either call the function or just return text, which is unreliable for extraction tasks. `ANY` mode guarantees the model returns a structured function call every time, eliminating the JSON parse errors we had with the prompt-based approach.

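As an illustration of what the forced function call has to carry, the sketch below writes the symptom fields as a JSON-Schema-style parameter description and normalises the returned arguments into one fixed shape. The field list comes from the feature table in this log; the exact key spellings, the `record_symptoms` function name, and `normalise_symptoms` are assumptions, and the actual `FunctionDeclaration` wiring is deliberately omitted.

```python
from typing import Any

# Field names follow the symptom extraction row in the feature table;
# the exact key spellings are an assumption.
LIST_FIELDS = ["symptoms", "medications", "allergies", "follow_up_actions"]
TEXT_FIELDS = ["chief_complaint", "duration", "severity", "vitals", "history"]

# JSON-Schema-style parameters for a hypothetical `record_symptoms`
# function that the model is forced to call.
SYMPTOM_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        **{name: {"type": "string"} for name in TEXT_FIELDS},
        **{name: {"type": "array", "items": {"type": "string"}} for name in LIST_FIELDS},
    },
    "required": ["chief_complaint", "symptoms"],
}

def normalise_symptoms(args: dict[str, Any]) -> dict[str, Any]:
    """Coerce function-call arguments into one fixed dict shape."""
    result: dict[str, Any] = {}
    for name in TEXT_FIELDS:
        result[name] = str(args.get(name) or "")
    for name in LIST_FIELDS:
        value = args.get(name) or []
        # Accept a single string where a list was expected.
        result[name] = [value] if isinstance(value, str) else [str(v) for v in value]
    return result

example = normalise_symptoms({"chief_complaint": "fever", "symptoms": "chills"})
```

Normalising after the call means downstream code can rely on every key being present, even when the model omits optional fields.
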
### 7. Symptom extraction: local first, cloud fallback

**Decision:** Keep Gemma 4 E2B (Ollama, local) as primary for symptom extraction, cloud function calling as fallback.

**Reasoning:** Preserves the "local AI, privacy-preserving" story for the hackathon. Cloud fallback ensures reliability when Ollama returns malformed JSON or fails. Both paths return the same dict structure.

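The local-first, cloud-fallback control flow can be sketched as below. This is an illustration under assumptions, not the code in `symptom_agent.py`: the two extractors are passed in as plain callables, and `extract_symptoms`, `broken_local` and `fake_cloud` are hypothetical names standing in for the real agents.

```python
import json
from typing import Callable

def extract_symptoms(
    transcript: str,
    local_extract: Callable[[str], str],
    cloud_extract: Callable[[str], dict],
) -> tuple[dict, str]:
    """Try the local model first; fall back to the cloud path on any failure.

    `local_extract` returns raw JSON text (as a prompt-based model would);
    `cloud_extract` returns an already-structured dict.
    """
    try:
        parsed = json.loads(local_extract(transcript))
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
        return parsed, "local"
    except Exception:  # malformed JSON, connection error, timeout, ...
        return cloud_extract(transcript), "cloud"

# Stand-ins for the real agents, for demonstration only.
def broken_local(_: str) -> str:
    return '{"symptoms": ["fever"'  # truncated JSON

def fake_cloud(_: str) -> dict:
    return {"symptoms": ["fever"]}

result, source = extract_symptoms("Patient: I have a fever.", broken_local, fake_cloud)
```

Returning the source label alongside the dict makes it easy to log how often the fallback fires during a demo.
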
### 8. Transcript repair before downstream processing

**Problem:** faster-whisper `small` on CPU makes errors: misheard medical terms, missing punctuation, run-on sentences. Downstream models (symptom extraction, SOAP generation) produce lower-quality output when given a garbled transcript.

**Decision:** Add a `clean_and_label_transcript()` step using Gemma 4 cloud that repairs ASR errors and labels speakers in a single API call. This runs after `stop_consultation()` and before any downstream processing.

**What it fixes:** Incorrect drug names, missing punctuation, filler words (um/uh), run-on sentences, garbled medical terminology.

**What it preserves:** All clinical facts: symptoms, medications, durations, dosages. The step must never add or invent information.

**Why one call:** Combining repair + labelling saves one API round-trip and is cheaper than two separate calls.

### 9. Speaker diarization: Gemma 4 post-hoc vs pyannote-audio

**Decision:** Use Gemma 4 cloud to infer Doctor/Patient labels from transcript text.

**Reasoning:**

- `pyannote-audio` requires a HuggingFace account, model license acceptance, and token setup
- For a hackathon demo, Gemma 4 inference from linguistic context is good enough
- Doctors and patients have very different speech patterns (questions vs symptom descriptions) that Gemma 4 reliably distinguishes
- Can always upgrade to pyannote later

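Once the transcript carries `Doctor:` / `Patient:` labels, downstream code can split it into turns. The sketch below shows one possible parser. `parse_turns` is a hypothetical helper, and the rule that unlabelled lines continue the previous turn is an assumption about how the labelled output might look.

```python
import re

SPEAKER_LINE = re.compile(r"^(Doctor|Patient):\s*(.*)$")

def parse_turns(labelled: str) -> list[tuple[str, str]]:
    """Split a 'Doctor:/Patient:' transcript into (speaker, text) turns.

    Lines without a label are treated as a continuation of the previous turn.
    """
    turns: list[tuple[str, str]] = []
    for raw in labelled.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns:
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}".strip())
    return turns

sample = """Doctor: What brings you in today?
Patient: I have had a headache for two days.
It gets worse at night.
Doctor: Any fever?"""
turns = parse_turns(sample)
```
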
### 10. SQLite for storage

**Decision:** Local SQLite over PostgreSQL or cloud database.

**Reasoning:** Desktop app, no server, no network dependency. SQLite is reliable, zero-config, and sufficient for demo-scale data. Schema: patients → sessions → notes + symptoms.

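A minimal sketch of the patients → sessions → notes + symptoms layout is shown below using the standard-library `sqlite3` module. The four table names come from this log; every column name, the in-memory database, and the sample rows are assumptions for illustration and may differ from `database/db.py`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE patients (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dob TEXT,
    gender TEXT,
    phone TEXT
);
CREATE TABLE sessions (
    id INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patients(id),
    started_at TEXT DEFAULT CURRENT_TIMESTAMP,
    transcript TEXT
);
CREATE TABLE notes (
    id INTEGER PRIMARY KEY,
    session_id INTEGER NOT NULL REFERENCES sessions(id),
    soap_note TEXT,
    patient_summary TEXT
);
CREATE TABLE symptoms (
    id INTEGER PRIMARY KEY,
    session_id INTEGER NOT NULL REFERENCES sessions(id),
    data_json TEXT
);
"""

conn = sqlite3.connect(":memory:")  # the app would use a file path instead
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(SCHEMA)

patient_id = conn.execute(
    "INSERT INTO patients (name, dob, gender, phone) VALUES (?, ?, ?, ?)",
    ("Test Patient", "1990-01-01", "F", "000"),
).lastrowid
session_id = conn.execute(
    "INSERT INTO sessions (patient_id, transcript) VALUES (?, ?)",
    (patient_id, "Doctor: ..."),
).lastrowid
conn.execute(
    "INSERT INTO notes (session_id, soap_note) VALUES (?, ?)", (session_id, "S: ...")
)

# Most recent session for a patient, as the records viewer would load it.
latest = conn.execute(
    "SELECT s.id, n.soap_note FROM sessions s "
    "LEFT JOIN notes n ON n.session_id = s.id "
    "WHERE s.patient_id = ? ORDER BY s.id DESC LIMIT 1",
    (patient_id,),
).fetchone()
```
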
### 11. Whisper model: small over base

**Decision:** Upgrade from `base` to `small` Whisper model.

**Reasoning:** `base` had poor accuracy on real speech, especially medical terminology. `small` is ~4x more accurate on medical vocabulary and still runs acceptably on CPU (~2-3x slower than `base`, but viable in real time with 3-second chunking). `medium` was considered but was too slow for a live demo.

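The 3-second chunking mentioned above can be sketched as a simple generator over a sample buffer. This is illustrative only: `iter_chunks` is a hypothetical helper, the audio is a plain Python list rather than a real microphone stream, and the 16 kHz rate is the sample rate Whisper models work with rather than a value read from `transcriber.py`.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000   # Hz; Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 3      # chunk length quoted in this log

def iter_chunks(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: float = CHUNK_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length chunks; the last one may be shorter."""
    size = int(sample_rate * chunk_seconds)
    if size <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(samples), size):
        yield samples[start:start + size]

# 7 seconds of silence -> chunks of 3 s, 3 s and 1 s.
audio = [0.0] * (7 * SAMPLE_RATE)
chunk_lengths = [len(chunk) / SAMPLE_RATE for chunk in iter_chunks(audio)]
```

Each chunk would then be handed to the transcriber, so the model only needs to keep up with 3 seconds of audio at a time.
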

---

## Issues Encountered & Resolutions

### Issue 1: `google-generativeai` deprecated

**Error:** `FutureWarning: All support for the google.generativeai package has ended`

**Root cause:** Google deprecated the old `google-generativeai` SDK in favour of `google-genai`.

**Resolution:** Replaced `google-generativeai` with `google-genai>=1.0.0` in requirements. Updated `cloud_agents.py` to use `from google import genai` and the `genai.Client()` pattern.

### Issue 2: Wrong Gemma 4 cloud model name

**Error:** `404 NOT_FOUND: models/gemma-4-27b-it is not found`

**Root cause:** Model name `gemma-4-27b-it` does not exist on the Google AI Studio API.

**Resolution:** Listed available models via the API (`client.models.list()`). Correct names are:

- `gemma-4-26b-a4b-it` (26B MoE, faster)
- `gemma-4-31b-it` (31B dense, most capable)

Updated the default in `cloud_agents.py` and `.env`.

### Issue 3: Twi translation 500 INTERNAL error

**Error:** `500 INTERNAL: Internal error encountered` on `translate_to_twi()`

**Root cause:** Gemma 4 struggles with Twi (Akan), a low-resource language with limited training data. The model likely has insufficient Twi coverage to translate medical content reliably, causing server-side failures.

**Resolution (temporary):** Removed Twi translation from the pipeline. Added try/except guards around all cloud agent calls so one failure doesn't break the entire `generate_notes()` flow.

**Planned fix:** Integrate NLLB-200 (`facebook/nllb-200-distilled-600M`), Meta's translation model covering 200 languages, many of them low-resource, including Akan/Twi.

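The try/except guards described in the temporary resolution could look roughly like the sketch below, where each output is produced independently so a failure in one step cannot take the others down. The signature of `generate_notes` here is hypothetical (the real function lives in the app and is not shown in this log), as are `guarded` and `failing_summary`.

```python
from typing import Callable

def guarded(step: Callable[[], str], fallback: str) -> str:
    """Run one pipeline step; return `fallback` text instead of raising."""
    try:
        return step()
    except Exception as exc:  # e.g. a 500 INTERNAL from the cloud API
        return f"{fallback} ({type(exc).__name__}: {exc})"

def generate_notes(
    soap_step: Callable[[], str],
    summary_step: Callable[[], str],
) -> dict[str, str]:
    """Each output is produced independently, so one failure is contained."""
    return {
        "soap_note": guarded(soap_step, "SOAP note unavailable"),
        "patient_summary": guarded(summary_step, "Summary unavailable"),
    }

def failing_summary() -> str:
    raise RuntimeError("500 INTERNAL")

notes = generate_notes(lambda: "S: fever for 3 days ...", failing_summary)
```
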
### Issue 4: Ollama version too old for Gemma 4

**Error:** `Error: pull model manifest: 412: The model you are attempting to pull requires a newer version of Ollama`

**Root cause:** System Ollama was v0.19.0. Gemma 4 requires a newer version.

**Resolution:** Reinstalled Ollama via the official install script, `curl -fsSL https://ollama.com/install.sh | sh`, then ran `sudo systemctl restart ollama`. Note: Linux package managers (snap, apt) can ship outdated Ollama versions, so prefer the curl script.

### Issue 5: `chromadb.PersistentClient | None` TypeError

**Error:** `TypeError: unsupported operand type(s) for |: 'function' and 'NoneType'`

**Root cause:** `chromadb.PersistentClient` is a factory function, not a class. Using it in an `X | None` type annotation evaluates the expression at runtime and fails.

**Resolution:** Added `from __future__ import annotations` to `rag/retriever.py`. This makes all annotations lazy (stored as strings at runtime), bypassing the evaluation issue.

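The failure and the fix can be reproduced without ChromaDB. In the sketch below, `persistent_client` is a hypothetical stand-in for a factory function, and a quoted annotation is used to show the effect; `from __future__ import annotations` applies the same deferred treatment to every annotation in a module, without the quotes. The `./chroma_db` path is made up.

```python
def persistent_client(path: str) -> dict:
    """Stand-in for a factory *function* such as the one in the error above."""
    return {"path": path}

# Evaluating `function | None` at runtime reproduces the TypeError.
try:
    persistent_client | None
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# A quoted annotation is stored as a string and never evaluated,
# so this definition succeeds even though the expression is invalid.
def get_client(cached: "persistent_client | None" = None) -> dict:
    return cached if cached is not None else persistent_client("./chroma_db")

client = get_client()
```
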
### Issue 6: White empty boxes in UI (RAG panels)

**Issue:** `gr.Markdown` components rendered as white boxes on the dark Gradio theme, even when empty.

**Root cause:** Gradio's default light background on Markdown components clashes with the dark theme. Empty panels had no content but still showed as white rectangles.

**Resolution:** Moved RAG panels (ICD-10, Drug Reference, Symptoms) into `gr.Accordion` components. Accordions collapse when not needed and have theme-consistent styling. Also added CSS `background: transparent` for markdown panels.

### Issue 7: Gemma 4 image input, wrong contents structure

**Error:** `500 INTERNAL`, then `Part.from_text() takes 1 positional argument but 2 were given`

**Root cause:** Two sequential mistakes in the multimodal contents format:

1. The first attempt wrapped parts in `types.Content(role="user", parts=[...])`, which is not needed
2. The second used `types.Part.from_text(IMAGE_PROMPT)` with a positional argument; as the error message shows, the method accepts its text only as a keyword argument (`text=...`)

**Resolution:** Per the Gemma 4 guide at philschmid.de/gemma-4-gemini-api, the simplest format is a plain list mixing `Part.from_bytes()` and a raw string:

```python
from google.genai import types

# file_bytes, mime_type and IMAGE_PROMPT are supplied by the caller.
contents = [
    types.Part.from_bytes(data=file_bytes, mime_type=mime_type),
    IMAGE_PROMPT,  # plain string; no Part wrapper needed
]
```

All Gemma 4 models (including 26B and 31B) are fully multimodal. The initial 500 error was caused by the wrong content structure, not a model limitation.

### Issue 8: pyannote-audio abandoned in favour of Gemma 4

**Decision made:** Started implementing pyannote-audio for speaker diarization, then stopped.

**Reason:** User confirmed Gemma 4 post-hoc labelling is sufficient for the demo. pyannote requires a HuggingFace account, model license acceptance, and a heavy torch dependency. For medical conversations, Gemma 4's language-based inference can also be more reliable, because it uses *context* (doctors ask questions, patients describe symptoms) rather than the raw audio signal, which can fail when two speakers have similar voices.

### Issue 9: Gradio CSS parameter deprecation warning

**Warning:** `UserWarning: The parameters have been moved from the Blocks constructor to the launch() method`

**Root cause:** Gradio 6.0 moved the `css` parameter from `gr.Blocks(css=...)` to `demo.launch(css=...)`.

**Resolution:** Moved `css=CSS` to `demo.launch(...)`.

### Issue 10: uv installing to wrong Python version

**Issue:** `chromadb` and `sentence-transformers` installed but not importable from the venv.

**Root cause:** The venv was created with Python 3.11 (via uv), but the system also has Python 3.12. Running `uv pip install` without specifying the environment installed to the wrong location.

**Resolution:** Used `VIRTUAL_ENV=/path/to/.venv uv pip install ...` to target the correct venv, or used `/path/to/.venv/bin/python -m pip install ...`.

---

## What Was Considered and Rejected

| Option | Rejected because |
|---|---|
| Streamlit UI | Real-time transcript polling is awkward in Streamlit's re-run model |
| PostgreSQL storage | Overkill for desktop demo; SQLite is zero-config |
| pyannote-audio diarization | Requires HF account + model license; too much setup for hackathon timeline |
| Full 70k ICD-10 dataset | Too large to embed in demo time; curated Ghana-relevant subset is more impactful |
| Running everything on cloud API | Wanted to demonstrate hybrid local+cloud multi-agent architecture |
| Whisper `large-v3` | Too slow on CPU for real-time; `small` is the sweet spot |
| Gemma 4 for Twi translation | Low-resource language; model returned 500 errors. NLLB-200 is the right tool |

---

## Remaining Work / Roadmap

- [ ] **Twi translation via NLLB-200**: integrate `facebook/nllb-200-distilled-600M` locally
- [ ] **PDF export**: export SOAP note + patient summary as printable PDF (fpdf2 already in deps)
- [ ] **Multi-session history**: view all past sessions for a patient, not just the most recent
- [ ] **Upgrade to Whisper `medium`** if demo machine is fast enough
- [ ] **ICD-10 code expansion**: add full 70k code dataset for production use
- [ ] **MedGemma**: self-host `medgemma-4b-it` or `medgemma-27b-it` for higher-accuracy medical image analysis
- [ ] **Long-context patient history**: load all previous session notes into SOAP prompt for longitudinal care reasoning

---

## File Structure

```
hosptial_copilot/
├── app.py                            Main Gradio app + UI
├── agents/
│   ├── cloud_agents.py               Gemma 4 cloud: SOAP, summary, speaker labelling
│   └── symptom_agent.py              Gemma 4 E2B local: symptom JSON extraction
├── transcription/
│   └── transcriber.py                faster-whisper live mic streaming
├── rag/
│   ├── retriever.py                  ChromaDB + MiniLM embedding + retrieval
│   └── data/
│       ├── icd10_common.json         90+ ICD-10 codes (Ghana-relevant)
│       └── essential_medicines.json  40+ WHO Essential Medicines
├── database/
│   └── db.py                         SQLite schema + helpers
├── requirements.txt
├── .env.example
├── .gitignore
├── README.md
└── DEVLOG.md                         This file
```

---

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `GEMINI_API_KEY` | (none) | Google AI Studio API key (required) |
| `WHISPER_MODEL` | `small` | Whisper model size: tiny/base/small/medium/large-v3 |
| `OLLAMA_MODEL` | `gemma4:e2b` | Local Ollama model for symptom extraction |
| `CLOUD_MODEL` | `gemma-4-26b-a4b-it` | Google AI Studio model name |

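One way to read these variables with their documented defaults is sketched below. The variable names and default values come from the table; the `Settings` dataclass and `load_settings` helper are hypothetical and not taken from the project code.

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    whisper_model: str
    ollama_model: str
    cloud_model: str

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables from the table, applying the documented defaults."""
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required")
    return Settings(
        gemini_api_key=api_key,
        whisper_model=env.get("WHISPER_MODEL", "small"),
        ollama_model=env.get("OLLAMA_MODEL", "gemma4:e2b"),
        cloud_model=env.get("CLOUD_MODEL", "gemma-4-26b-a4b-it"),
    )

# Passing a dict instead of os.environ keeps the example self-contained.
settings = load_settings({"GEMINI_API_KEY": "demo-key", "WHISPER_MODEL": "base"})
```

Failing fast on a missing API key surfaces the problem at startup rather than at the first cloud call.
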