I Built an AI That Finished the Song I Quit On Ten Years Ago

Published June 15, 2026

I recorded a song in 2016. It was called Track0000 because I hadn't even named it yet. I got through the verse, started building toward the bridge, and just... stopped. No dramatic reason. I ran out of whatever I had that night. Saved the file, closed the DAW, never opened it again.

That was almost ten years ago. The file sat on a drive through two laptop migrations and a format scare. track0000_tony_winslow.mp3. Twenty-five seconds of something that could've been a song.

For the Build Small Hackathon, I wanted to see if a small model could finish it. Not generate a new song from a text description. Not "make something that sounds like this." Actually take my recording and continue it forward, in the same key, the same tempo, matching the feel of what I played. Continuation, not generation.

The model that made it possible

Stable Audio 3 Small Music is 0.6 billion parameters. Tiny by current standards. But it has one trick that nothing else at this size does: generate_diffusion_cond_inpaint. You give it a buffer of audio and a binary mask. Ones mean "keep this." Zeros mean "fill this in." So I put my clip at the front, masked everything after it, and told the model to paint music into the silence.

One call. 44.1 kHz stereo. Up to 120 seconds. No chaining 12-second windows together and watching it drift. No 32 kHz mono downgrade. The model hears where the song was going and keeps going.
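
The buffer-and-mask setup described above can be sketched in a few lines of plain Python. This is an illustration, not CODA's code: build_inpaint_inputs and its parameters are hypothetical names, the audio is treated as mono with one mask value per frame for simplicity, and the actual call to generate_diffusion_cond_inpaint is left out because its arguments aren't shown in this post.

```python
SAMPLE_RATE = 44_100   # from the text: 44.1 kHz
MAX_SECONDS = 120      # from the text: up to 120 seconds in one call


def build_inpaint_inputs(clip, total_seconds, sample_rate=SAMPLE_RATE):
    """Return (buffer, mask), one value per frame (hypothetical helper).

    buffer: the recorded clip followed by silence out to total_seconds.
    mask:   1.0 where the model must keep the audio, 0.0 where it fills in.
    """
    if total_seconds > MAX_SECONDS:
        raise ValueError("requested length exceeds the 120 s window")
    total_frames = int(total_seconds * sample_rate)
    if len(clip) >= total_frames:
        raise ValueError("clip leaves no room for a continuation")
    pad = total_frames - len(clip)
    buffer = list(clip) + [0.0] * pad          # clip at the front, silence after
    mask = [1.0] * len(clip) + [0.0] * pad     # ones = keep, zeros = paint
    return buffer, mask
```

In a real pipeline both would be tensors with a channel dimension, but the shape of the idea is the same: the clip sits at the front, and everything after it is open for the model to fill.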

I'd tried continuation before with MusicGen. You had to feed 12 seconds of context, generate 18 new seconds, take the last 12 of those as context for the next hop, and repeat. The drift compounded every cycle. Quiet inputs faded to nothing. That approach had about 800 lines of fragile chaining code. SA3's inpainting deleted all of it in one move.
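
To see why the drift compounds, it helps to lay out the hops. The sketch below only computes the schedule, using the 12-second context and 18-second step from the paragraph above; hop_schedule is a hypothetical helper, not the old chaining code, and it calls no model.

```python
def hop_schedule(clip_s, target_s, context_s=12.0, new_s=18.0):
    """List (context_start, context_end, new_end) in seconds for each hop."""
    hops = []
    end = clip_s
    while end < target_s:
        ctx_start = max(0.0, end - context_s)   # last 12 s of what exists so far
        new_end = min(target_s, end + new_s)    # 18 s of new audio per hop
        hops.append((ctx_start, end, new_end))
        end = new_end
    return hops


for hop in hop_schedule(25.0, 120.0):
    print(hop)
```

For a 25-second clip stretched to 120 seconds that is six hops, and from the second hop on the context window sits entirely inside generated audio. The model is conditioning on its own output, so every error carries forward.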

Three days of things going wrong

The first time I ran it on ZeroGPU, the output was a wall of broadband noise. Sounded like someone held a microphone next to a broken amplifier.

Turned out to be fp16 precision. SA3's autoencoder uses Snake activations internally: x + (1/(beta+1e-9)) * sin(alpha*x)^2. That reciprocal-times-sine-squared can shoot past fp16's ceiling (~65504) and overflow to infinity, and a single inf on decode turns your entire output into static. The fix came in three parts. Force the autoencoder to decode in fp32 while keeping the transformer in fast fp16. Pin the attention backend to MATH (some cards route fp16 SDPA through kernels with known NaN bugs). Disable TF32 accumulation so rounding error doesn't compound across eight diffusion steps.

Three separate fixes for one symptom. Took me days to untangle.
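
The Snake overflow is easy to reproduce without a GPU. The sketch below simulates fp16 rounding with the standard library's half-precision struct format and evaluates the Snake formula from above once in simulated fp16 and once in double precision (standing in for the fp32 path). The helper names and the beta value of 1e-5 are made up for illustration; they are not taken from SA3's weights.

```python
import math
import struct


def to_fp16(value):
    """Round a float to the nearest fp16 value; out-of-range becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        return math.copysign(math.inf, value)


def snake_fp16(x, alpha, beta, eps=1e-9):
    """Snake activation with every intermediate rounded to fp16."""
    scale = to_fp16(1.0 / to_fp16(to_fp16(beta) + to_fp16(eps)))
    s = to_fp16(math.sin(to_fp16(to_fp16(alpha) * to_fp16(x))))
    return to_fp16(to_fp16(x) + to_fp16(scale * to_fp16(s * s)))


def snake_ref(x, alpha, beta, eps=1e-9):
    """Same formula in double precision, standing in for the fp32 decode."""
    return x + (1.0 / (beta + eps)) * math.sin(alpha * x) ** 2


print(to_fp16(1e-9))                      # the epsilon itself underflows in fp16
print(snake_fp16(0.5, 1.0, 1e-5))         # reciprocal overflows
print(snake_ref(0.5, 1.0, 1e-5))          # large but finite
```

Two things show up here. The 1e-9 guard underflows to zero in fp16, so it protects nothing, and a small beta pushes the reciprocal past 65504. Once one value is inf, everything downstream of it is too.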

The seed that lied

Even after the precision fixes, I had a worse problem. I'd found a magic seed locally: seed 7 produced a beautiful take every single time on my machine. I pinned it. Deployed it. On the cloud, seed 7 produced a loud synth blast that scored 90 on my artifact meter where a normal draw scores about 3.

Same seed. Same code. Different torch build. The RNG plumbing changed underneath me and I'd been trusting a number that meant nothing outside my specific hardware.

So I stopped betting on seeds entirely. CODA now generates up to five candidates and keeps the best one, scored by a cheap artifact detector that catches the four ways an SA3 draw goes wrong: loud random bursts (spikiness way above median), silence collapse (overall RMS below floor), mid-tail dropout (a sustained quiet stretch in the middle while overall loudness reads fine), and dynamics collapse (crest factor below 3, meaning transients got smeared into mush). Most of the time the first or second draw is clean and it stops early. The synth-noise bug? Best-of-5 rejects that draw on sight. That's what actually killed it.
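
A stripped-down version of that idea looks something like this. It's a sketch under assumptions, not the scorer that ships in CODA: it returns named flags instead of a numeric score, it works on a mono list of samples, and every threshold except the crest-factor floor of 3 is a placeholder picked for illustration. artifact_flags, best_of_n, and the generate callback are hypothetical names.

```python
import math


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def artifact_flags(samples, sr, win_s=0.05, rms_floor=0.01, burst_ratio=8.0,
                   quiet_ratio=0.1, min_quiet_s=0.5, min_crest=3.0):
    """Return the failure modes a mono draw trips (empty list = clean)."""
    overall = rms(samples)
    if overall < rms_floor:                              # silence collapse
        return ["silence_collapse"]
    win = max(1, int(win_s * sr))
    levels = [rms(samples[i:i + win])
              for i in range(0, len(samples) - win + 1, win)] or [overall]
    flags = []
    median = sorted(levels)[len(levels) // 2]
    if max(levels) > burst_ratio * max(median, 1e-9):    # loud random burst
        flags.append("loud_burst")
    longest = run = 0
    for level in levels[1:-1]:                           # interior windows only
        run = run + 1 if level < quiet_ratio * overall else 0
        longest = max(longest, run)
    if longest * win_s >= min_quiet_s:                   # mid-tail dropout
        flags.append("mid_dropout")
    peak = max(abs(s) for s in samples)
    if peak / overall < min_crest:                       # dynamics collapse
        flags.append("dynamics_collapse")
    return flags


def best_of_n(generate, sr, max_draws=5):
    """Draw up to max_draws candidates; stop early on the first clean one."""
    best = None
    for attempt in range(max_draws):
        draw = generate(attempt)                         # caller supplies the model
        flags = artifact_flags(draw, sr)
        if not flags:
            return draw, attempt + 1
        if best is None or len(flags) < len(best[1]):
            best = (draw, flags)
    return best[0], max_draws
```

Note that a steady sine wave trips the crest-factor check (its peak-to-RMS ratio is about 1.41), which is the point: music with real transients sits well above 3, and a draw that doesn't has been smeared.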

The seam

Here's the part nobody thinks about until they hear the click.

Your recording ends. The AI's music begins. If those two pieces don't match in level, if there's a gap or a pop at the join, the whole illusion breaks. You hear "two clips glued together" instead of "one song."

The splicer does four things. It gain-matches the generated tail to your recording's loudness right at the boundary (bounded, so a whisper-quiet phone recording can't drag the continuation down to nothing). It applies an equal-power cosine/sine crossfade at the seam (equal-power, not equal-gain, because the two sides are uncorrelated, sequential content, and an equal-gain crossfade dips in loudness at the midpoint). It fades out with a cos-squared curve so the track ends like a song instead of getting cut mid-air. And it finishes with a peak-normalize that leaves a dB of headroom.

Your original recording is never touched. It plays back exactly as you recorded it, up to the seam. Then CODA takes over.
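
Those four steps can be sketched as small functions over mono sample lists. This is a minimal illustration, not CODA's splicer: the function names are hypothetical, the gain bounds of 0.5 to 2.0 (roughly plus or minus 6 dB) are placeholder values, and the crossfade assumes you already have two equal-length overlap regions to blend.

```python
import math


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def bounded_gain(reference, generated, min_gain=0.5, max_gain=2.0):
    """Gain that matches generated loudness to the reference, clamped."""
    gen_level = rms(generated)
    if gen_level == 0.0:
        return 1.0
    return min(max_gain, max(min_gain, rms(reference) / gen_level))


def equal_power_crossfade(outgoing, incoming):
    """Blend two equal-length overlaps with cos/sin gains (cos^2 + sin^2 = 1)."""
    n = len(outgoing)
    mixed = []
    for i in range(n):
        theta = (math.pi / 2) * (i + 0.5) / n
        mixed.append(outgoing[i] * math.cos(theta) + incoming[i] * math.sin(theta))
    return mixed


def fade_out_cos2(samples, fade_len):
    """Apply a cos-squared fade to the last fade_len samples."""
    out = list(samples)
    fade_len = min(fade_len, len(out))
    start = len(out) - fade_len
    for i in range(fade_len):
        t = (i + 1) / fade_len                  # reaches 1.0 on the final sample
        out[start + i] *= math.cos((math.pi / 2) * t) ** 2
    return out


def peak_normalize(samples, headroom_db=1.0):
    """Scale so the loudest sample sits headroom_db below full scale."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    target = 10 ** (-headroom_db / 20)
    return [s * target / peak for s in samples]
```

The clamp in bounded_gain is what keeps a very quiet recording from pulling the continuation down with it: the ratio might come out at 0.002, but the gain actually applied never drops below the floor.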

The moment it worked

I uploaded Track0000. Watched it detect D minor, 89 BPM, 4/4. Hit finish. Waited about ten seconds.

And there it was. My song, the one I walked away from in 2016, with a second half. The continuation picked up the harmonic pattern I'd been building and carried it somewhere I wouldn't have gone myself but that felt right. The seam was invisible. I listened to it three times in a row.

Ten years of that file doing nothing on a hard drive. A 0.6B model finished it in ten seconds.

What I learned building this

The flashy part is the model. The hard part is everything around it. Precision arithmetic on cloud GPUs. Candidate selection because generative models are stochastic and you can't ship the first draw blind. Splicing that respects the original recording. Analysis that works on phone-quality audio.

I spent more time on the artifact scorer than on any other single piece. And the artifact scorer doesn't use ML at all. It's signal processing: RMS windows, peak detection, crest factor. Sometimes the best tool for catching a model's mistakes is not another model.

Out of 489 entries in this hackathon, as far as I can tell, CODA is the only one working at the waveform level with an audio AI model. Every other audio entry pipes text through an LLM and adds speech synthesis. That's fine for what it is, but it's a different thing. CODA takes your actual samples and continues them. 0.6 billion parameters, one job, done properly.

The demo on the Space is my actual song. Press the button, listen, and bring your own unfinished clip if you have one. Most of us do.

Try CODA on Hugging Face Spaces

Built by Tony Winslow / Black Box Analytics for the Build Small Hackathon, June 2026.
