Small Cuts
A deadpan narrator for your life, from small open models.
There's a scene early in The Invention of Lying where a calm, omniscient voice narrates the most ordinary human moments as if they were the only things that ever mattered. No drama. Just a flat, certain voice telling you what is happening, while it happens.
I wanted that voice. For real life. Running on hardware I own.
Small Cuts is a narrator for first-person moments. You point at something happening — pour the coffee, turn down the alley, miss the bus — and a small vision-language model watches one frame while a small text-to-speech voice speaks a single grounded, deadpan line back to you. One narrator. No menus, no "pick a director." You point; it tells you what it means, in the voice of a film that has decided your Tuesday is the plot.
It is not magic, and I'm not going to pretend it is. The line lands a beat after the moment — it narrates the recent past, not the future. That honesty turned out to be a feature, and it shaped the whole architecture.
The thing I'm proudest of isn't a model. It's a decision: Small Cuts has two different ways to make the exact same finished cut, on purpose.
Two paths, one artifact: a clip with a generated title, a spoken line, synced captions, and a library tile with a small badge telling you where it came from. The comparison is the product — embodied-and-private versus uploaded-and-verifiable — and it draws a hard line I cared about: the public never touches my hardware, and my hardware never exposes raw inference to the public. Finished cuts cross that boundary. Compute never does.
Ray-Ban Meta glasses ──frames──▶ home engine (small VLM + TTS) ──▶ narration in your ear
│
└──── finished cuts ────▶ the Space (watch · library)
judge's browser ──short video──▶ Modal GPU (Qwen3-VL-8B + Kokoro) ──▶ finished cut in the Space
The rule was: every model under 32B, and I wanted it to run off-grid. So:
Qwen/Qwen3-VL-8B-Instruct — 8B parameters, grounded captioning, comfortably under the cap.llama.cpp on a home node for the live loop. No cloud LLM anywhere in the path that matters.Small-and-local is harder than big-and-hosted, and that's the point. You can't paper over a confused model with a bigger one. You have to make an 8B model say something true about a single frame, fast enough that the line still feels like it belongs to the moment.
Grounding the narrator. An early prompt told the model to "find the story." It found stories, all right — beautiful, confident, completely invented ones. A small VLM under any sampling heat will happily narrate a dog that isn't there. The fix was discipline, not scale: a grounded prompt (v3, after a judged A/B test taught me v2 was lying) that asks for what's actually in the frame, low temperature, present tense, two sentences. The narrator got less poetic and a lot more honest. Good trade.
One clock. The Space replays each cut with a video, a voice track, captions, and a progress bar — and Gradio's default components each wanted to keep their own time. Everything drifted. I ended up building a custom player where a single hidden <audio> element is the master clock, and the video, captions, and progress all follow it. It's the least glamorous code in the project and the reason the replay doesn't feel broken.
The private/public seam. The glasses path is private and local; the Space is public and anonymous. Getting a finished cut from one to the other without leaving a path back to my machine meant the home node publishes finished artifacts to a public bucket, and the Space refreshes from a pushed event — not a polling loop reaching inward. I learned that lesson the expensive way early on. Now nothing public points at anything local. Ever.
The first narration of a cold session is slow — loading an 8B VLM is not free, and I won't pretend the very first line is snappy. And the live, during-the-moment version — where you'd hear short fragments every few seconds instead of one line per cut — is designed but not shipped. The honest blocker isn't the GPU: a grounded line, spoken, is several seconds of audio, and you can't pour forty seconds of narration into a three-second gap and call it live. The fix is a shorter "continuation" prompt, not a faster machine. That's the next cut.
Built by Carlos Crespo Macaya, with an AI toolchain riding shotgun — Claude (Opus) for design critique, Codex (GPT-5.x) for paired implementation, GLM for review, Gemini for eval — all pointed where I told them to point.
Try it in the Space: build-small-hackathon/small-cuts. Upload a short clip; it'll narrate it for you, in the only voice it has. Deadpan. Certain. A little too honest.
And that was the moment the reader decided to go build something small.
A deadpan narrator for your life, from small open models.
More from this author