I just released Inflect-Nano-v1, an ultra-small 4.63 parameter text-to-speech model.
The main idea is simple: instead of only making the acoustic model tiny and relying on a larger external vocoder, Inflect-Nano-v1 keeps the complete text-to-waveform stack under 5M parameters.
Quick facts: - 4.63M total inference parameters - 3.46M acoustic model - 1.17M vocoder - 24 kHz audio - English-only - Single male voice - Runs locally with a simple PyTorch inference script
Why I made it: Most modern TTS models are much larger, and even many “small TTS” projects depend on a separate vocoder. I wanted to see how far a complete tiny TTS stack could be pushed while still producing usable speech.
It is not SOTA, and I am not trying to claim it competes with large TTS systems. The interesting part is the size-to-functionality ratio.
What works: It can generate arbitrary English speech locally, and the model is small enough to be interesting for:
- local voice assistants - embedded/edge experiments - browser or WASM-style TTS exploration - efficient inference research - tiny-model baselines
Limitations: The quality is still limited. It can sound robotic, stumble on difficult unseen text, and the vocoder is still a clear bottleneck. Long or unusual prompts are less reliable.
So I would frame this as a research/demo release, not a production TTS engine.
I’d love feedback from people interested in: - tiny speech models - vocoders - local TTS - efficient inference - embedded speech synthesis - improving small-model generalization
If people find it useful, I’m interested in putting more training budget into a stronger v2.
I just released Inflect-Nano-v1, an ultra-small 4.63 parameter text-to-speech model.
The main idea is simple: instead of only making the acoustic model tiny and relying on a larger external vocoder, Inflect-Nano-v1 keeps the complete text-to-waveform stack under 5M parameters.
Quick facts: - 4.63M total inference parameters - 3.46M acoustic model - 1.17M vocoder - 24 kHz audio - English-only - Single male voice - Runs locally with a simple PyTorch inference script
Why I made it: Most modern TTS models are much larger, and even many “small TTS” projects depend on a separate vocoder. I wanted to see how far a complete tiny TTS stack could be pushed while still producing usable speech.
It is not SOTA, and I am not trying to claim it competes with large TTS systems. The interesting part is the size-to-functionality ratio.
What works: It can generate arbitrary English speech locally, and the model is small enough to be interesting for:
- local voice assistants - embedded/edge experiments - browser or WASM-style TTS exploration - efficient inference research - tiny-model baselines
Limitations: The quality is still limited. It can sound robotic, stumble on difficult unseen text, and the vocoder is still a clear bottleneck. Long or unusual prompts are less reliable.
So I would frame this as a research/demo release, not a production TTS engine.
I’d love feedback from people interested in: - tiny speech models - vocoders - local TTS - efficient inference - embedded speech synthesis - improving small-model generalization
If people find it useful, I’m interested in putting more training budget into a stronger v2.