The Invention of Narrating — building Small Cuts

Community Article

Published June 15, 2026

Upvote

Carlos Crespo

macayaven

build-small-hackathon

a deadpan voice for your life, on ≤32B models running off-grid
The two-path bet
The constraint that made it interesting
The parts that were harder than the model
What I'd tell you is still rough
What's next
a deadpan voice for your life, on ≤32B models running off-grid

There's a scene early in The Invention of Lying where a calm, omniscient voice narrates the most ordinary human moments as if they were the only things that ever mattered. No drama. Just a flat, certain voice telling you what is happening, while it happens.

I wanted that voice. For real life. Running on hardware I own.

Small Cuts is a narrator for first-person moments. You point at something happening — pour the coffee, turn down the alley, miss the bus — and a small vision-language model watches one frame while a small text-to-speech voice speaks a single grounded, deadpan line back to you. One narrator. No menus, no "pick a director." You point; it tells you what it means, in the voice of a film that has decided your Tuesday is the plot.

It is not magic, and I'm not going to pretend it is. The line lands a beat after the moment — it narrates the recent past, not the future. That honesty turned out to be a feature, and it shaped the whole architecture.

The two-path bet

The thing I'm proudest of isn't a model. It's a decision: Small Cuts has two different ways to make the exact same finished cut, on purpose.

Pieces + hints — the soul. You wear the glasses, tap Action!, walk through a moment, tap Cut!. Short clips stream to a private engine on the Mac on my desk; it narrates and speaks the line into your ear while the moment is still warm. This path is embodied, off-grid, and mine. It never leaves the house.
Whole video, one pass — the proof. A judge — or anyone — uploads a short clip in the public Space. It goes to a bounded cloud GPU, gets narrated in one pass, and the finished cut appears in the same theater. No glasses, no app, no access to my hardware required. This is how you verify the thing is real without me handing you my laptop.

Two paths, one artifact: a clip with a generated title, a spoken line, synced captions, and a library tile with a small badge telling you where it came from. The comparison is the product — embodied-and-private versus uploaded-and-verifiable — and it draws a hard line I cared about: the public never touches my hardware, and my hardware never exposes raw inference to the public. Finished cuts cross that boundary. Compute never does.

Ray-Ban Meta glasses ──frames──▶  home engine (small VLM + TTS)  ──▶  narration in your ear
                                          │
                                          └──── finished cuts ────▶  the Space (watch · library)

judge's browser ──short video──▶  Modal GPU (Qwen3-VL-8B + Kokoro)  ──▶  finished cut in the Space

The constraint that made it interesting

The rule was: every model under 32B, and I wanted it to run off-grid. So:

Narrator: Qwen/Qwen3-VL-8B-Instruct — 8B parameters, grounded captioning, comfortably under the cap.
Voice: Kokoro — a tiny, open, weirdly expressive TTS. One signature deadpan delivery.
Runtime: llama.cpp on a home node for the live loop. No cloud LLM anywhere in the path that matters.

Small-and-local is harder than big-and-hosted, and that's the point. You can't paper over a confused model with a bigger one. You have to make an 8B model say something true about a single frame, fast enough that the line still feels like it belongs to the moment.

The parts that were harder than the model

Grounding the narrator. An early prompt told the model to "find the story." It found stories, all right — beautiful, confident, completely invented ones. A small VLM under any sampling heat will happily narrate a dog that isn't there. The fix was discipline, not scale: a grounded prompt (v3, after a judged A/B test taught me v2 was lying) that asks for what's actually in the frame, low temperature, present tense, two sentences. The narrator got less poetic and a lot more honest. Good trade.

One clock. The Space replays each cut with a video, a voice track, captions, and a progress bar — and Gradio's default components each wanted to keep their own time. Everything drifted. I ended up building a custom player where a single hidden <audio> element is the master clock, and the video, captions, and progress all follow it. It's the least glamorous code in the project and the reason the replay doesn't feel broken.

The private/public seam. The glasses path is private and local; the Space is public and anonymous. Getting a finished cut from one to the other without leaving a path back to my machine meant the home node publishes finished artifacts to a public bucket, and the Space refreshes from a pushed event — not a polling loop reaching inward. I learned that lesson the expensive way early on. Now nothing public points at anything local. Ever.

What I'd tell you is still rough

The first narration of a cold session is slow — loading an 8B VLM is not free, and I won't pretend the very first line is snappy. And the live, during-the-moment version — where you'd hear short fragments every few seconds instead of one line per cut — is designed but not shipped. The honest blocker isn't the GPU: a grounded line, spoken, is several seconds of audio, and you can't pour forty seconds of narration into a three-second gap and call it live. The fix is a shorter "continuation" prompt, not a faster machine. That's the next cut.

What's next

Live micro-segments: hear the narrator during the take, not just after Cut! — paced so the audio never falls behind the moment.
A deployment-agnostic engine: the same code runs off-grid on my home node by default, and can burst to an ephemeral cloud GPU when I want a bigger machine for a demo. Off-grid is the identity; the cloud is the option.

Built by Carlos Crespo Macaya, with an AI toolchain riding shotgun — Claude (Opus) for design critique, Codex (GPT-5.x) for paired implementation, GLM for review, Gemini for eval — all pointed where I told them to point.

Try it in the Space: build-small-hackathon/small-cuts. Upload a short clip; it'll narrate it for you, in the only voice it has. Deadpan. Certain. A little too honest.

And that was the moment the reader decided to go build something small.

Spaces mentioned in this article 1

Signal Garden: A Game Engine That Keeps Mutating

June 16, 2026

Noteworthy

June 15, 2026

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

The Invention of Narrating — building Small Cuts

a deadpan voice for your life, on ≤32B models running off-grid The two-path bet The constraint that made it interesting The parts that were harder than the model What I'd tell you is still rough What's next a deadpan voice for your life, on ≤32B models running off-grid

The two-path bet

The constraint that made it interesting

The parts that were harder than the model

What I'd tell you is still rough

What's next

Spaces mentioned in this article 1

Signal Garden: A Game Engine That Keeps Mutating

Noteworthy

Community

Spaces mentioned in this article 1

a deadpan voice for your life, on ≤32B models running off-grid
The two-path bet
The constraint that made it interesting
The parts that were harder than the model
What I'd tell you is still rough
What's next
a deadpan voice for your life, on ≤32B models running off-grid