Kokoro-82M for Core ML -- Model Surgery for Apple Silicon
We took Kokoro-82M -- 82M parameters, Apache 2.0, trades blows with models 10x its size -- and cut it apart so each piece runs on the Apple Silicon processor that's best at that job.
~17x faster than real-time. On-device. Offline. No API keys. No network. No cents-per-character.
Just .mlpackage files and a Swift MLModel(contentsOf:) call.
- Source code, exporters, Swift runtime: github.com/mattmireles/kokoro-coreml
- Pre-converted models (this repo): huggingface.co/mattmireles/kokoro-coreml
Why surgery?
Apple Silicon isn't one chip. It's three -- CPU, GPU, and the Neural Engine (ANE) -- each built for different workloads. Most Core ML ports shove the whole model through and hope the scheduler figures it out. It doesn't. You end up on CPU wondering why your "Neural Engine model" is pegged on a single core.
We dissected Kokoro's TTS pipeline and made three deliberate cuts:
```
                  ┌──────────────────────────────┐
"Hello world" ───►│ DURATION MODEL               │
                  │ Transformers + LSTMs         │ ◄── CPU/GPU
                  │ Sequential, dynamic-length   │     Best at: branching,
                  └───────────┬──────────────────┘     variable sequences
                              │
                              ▼
                  ┌──────────────────────────────┐
                  │ ALIGNMENT (Swift / CPU)      │
                  │ Build matrix from durations  │ ◄── CPU
                  │ ~50 lines of code            │     Best at: small, complex
                  └───────────┬──────────────────┘     data-dependent logic
                              │
                              ▼
                  ┌──────────────────────────────┐
                  │ DECODER / VOCODER            │
                  │ Heavy convolutions + iSTFT   │ ◄── Neural Engine (ANE)
                  │ Fixed shapes, pure math      │     Best at: dense parallel
                  └───────────┬──────────────────┘     tensor operations
                              │
                              ▼
                        24 kHz Audio
```
The Neural Engine devours fixed-shape convolutions -- exactly the dense parallel math that dominates audio synthesis. But it chokes on dynamic shapes and data-dependent control flow. So we give the messy sequential stuff to the CPU, build alignment in plain Swift, and hand the ANE a clean, static tensor.
Redesign the inference pipeline, not the model.
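In Core ML, the split is expressed through `MLModelConfiguration.computeUnits`, which is the standard knob for steering a model toward the CPU/GPU or the ANE. A minimal sketch (the URLs are placeholders, not the repo's exact API):

```swift
import CoreML

// Placeholder paths -- substitute wherever you unpack the .mlpackage files.
let durationURL = URL(fileURLWithPath: "kokoro_duration.mlpackage")
let decoderURL  = URL(fileURLWithPath: "kokoro_decoder_only_3s.mlpackage")

// Duration model: sequential, dynamic-length -- keep it off the ANE.
let cpuConfig = MLModelConfiguration()
cpuConfig.computeUnits = .cpuAndGPU

// Decoder: fixed shapes, dense math -- let Core ML schedule it on the ANE.
let aneConfig = MLModelConfiguration()
aneConfig.computeUnits = .all

let duration = try MLModel(contentsOf: durationURL, configuration: cpuConfig)
let decoder  = try MLModel(contentsOf: decoderURL,  configuration: aneConfig)
```

`.all` lets the runtime pick the ANE when the graph qualifies; if you want to verify the decoder never silently falls back, profile it in Instruments under the Core ML template.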
Performance (M2 Ultra, warmed)
23.7 seconds of synthesized audio:
| Bucket | Wall time | RTF | Speed vs real-time |
|---|---|---|---|
| 5s | ~1.35s | 0.057 | 17x |
| 15s | ~1.41s | 0.060 | 17x |
| 30s | ~1.38s | 0.058 | 17x |
Where the time goes: ANE predict ~0.25--0.31s, CPU preprocessing ~0.15--0.17s, iSTFT ~0.02--0.03s, orchestration ~0.55--0.60s. Cold start ~2--3s, then under 1.5s per call. ~200MB per loaded model. 24 kHz mono PCM output.
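The table's numbers are related by one identity: real-time factor is wall time divided by audio duration, and the speedup is its reciprocal. A tiny sanity-check helper (names are ours, for illustration):

```swift
// Real-time factor: wall-clock seconds spent per second of audio produced.
// RTF < 1 means faster than real time; speedup = 1 / RTF.
func realTimeFactor(wallSeconds: Double, audioSeconds: Double) -> Double {
    wallSeconds / audioSeconds
}

// 30s-bucket row from the table: ~1.38s wall for 23.7s of audio.
let rtf = realTimeFactor(wallSeconds: 1.38, audioSeconds: 23.7)
print(rtf, 1.0 / rtf)   // ≈ 0.058 and ≈ 17x
```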
What's in the download
| File | What it does | Runs on |
|---|---|---|
| `kokoro_duration.mlpackage` | Phoneme durations + style embeddings from token IDs | CPU/GPU |
| `kokoro_decoder_only_3s.mlpackage` | Decoder/vocoder, ~3s audio | ANE (static shapes) |
| `kokoro_decoder_only_5s.mlpackage` | Decoder/vocoder, ~5s audio | ANE (static shapes) |
| `kokoro_decoder_only_10s.mlpackage` | Decoder/vocoder, ~10s audio | ANE (static shapes) |
| `kokoro_decoder_har_post_*s.mlpackage` | Post-harmonic stack (hybrid path) | ANE tail; hn-nsf on CPU |
Pick the smallest bucket that fits your predicted utterance. That's the whole strategy.
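That strategy is a few lines of Swift. A hypothetical helper (the function name and the chunking fallback are ours, not part of the repo's API):

```swift
// Decoder buckets shipped in this repo, in seconds of audio capacity.
let bucketSeconds: [Double] = [3, 5, 10]

/// Returns the smallest bucket that can hold the predicted utterance,
/// or nil if it exceeds the largest bucket and must be chunked
/// across multiple decoder calls.
func pickBucket(for predictedSeconds: Double) -> Double? {
    bucketSeconds.first { predictedSeconds <= $0 }
}
```

Smaller buckets mean less padding to synthesize and discard, so picking tightly is where the latency wins come from.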
Usage (Swift)
```swift
import CoreML

let duration = try MLModel(contentsOf: durationURL)
let decoder  = try MLModel(contentsOf: decoder3sURL)

// 1. Tokenize text → input_ids [1, 128]
// 2. duration.prediction(...) → pred_dur, t_en, ref_s_out
// 3. Build alignment matrix in Swift from pred_dur
// 4. asr = t_en @ alignment → [1, 512, 72]
// 5. decoder.prediction(...) → waveform [1, 43200]
// 6. AVAudioPCMBuffer @ 24 kHz → speaker
```
Full Swift glue (tokenizer, alignment builder, PCM playback) is in the GitHub repo. It's small. You can read it in an afternoon.
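The alignment matrix in step 3 is the simplest piece: expand per-phoneme durations into a one-hot tokens-by-frames matrix, so the matmul in step 4 repeats each text feature for its predicted duration. A minimal sketch, assuming plain nested arrays (the repo's builder works on MLMultiArray and may differ in detail):

```swift
/// Build a [tokens x frames] 0/1 alignment matrix: phoneme i occupies
/// durations[i] consecutive frames. Frames beyond the total predicted
/// duration stay zero (padding for the fixed-shape decoder bucket).
func buildAlignment(durations: [Int], frameCount: Int) -> [[Float]] {
    var matrix = Array(repeating: Array(repeating: Float(0), count: frameCount),
                       count: durations.count)
    var frame = 0
    for (token, dur) in durations.enumerated() {
        for _ in 0..<dur where frame < frameCount {
            matrix[token][frame] = 1
            frame += 1
        }
    }
    return matrix
}
```

Because each column has at most a single 1, `t_en @ alignment` is pure copying, which is why this stage costs almost nothing on the CPU.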
Tensor shapes (3s decoder bucket)
```
Inputs:
  asr       [1, 512, 72]   float16   text features × alignment
  F0_pred   [1, 144]       float16   pitch contour
  N_pred    [1, 144]       float16   noise/aperiodicity
  ref_s     [1, 256]       float16   voice embedding
Output:
  waveform  [1, 43200]     float16   3s @ 24 kHz
```
Everything is static and float16. No dynamic ops. No RangeDim. No non_zero kernels.
Requirements
- iOS 16+ / macOS 13+ (MLProgram + modern Core ML runtime)
- Apple Silicon (M1+) or A15+ for Neural Engine acceleration
- Runs on older chips too, just slower
License
Apache 2.0, inherited from Kokoro-82M. Ship it. Sell it. Fork it.
Credits
- @hexgrad -- Kokoro-82M weights, training, and the Apache release
- @yl4579 -- StyleTTS 2 architecture
- Apple's coremltools team -- for maintaining the PyTorch-to-Core ML path
Kokoro (心) -- Japanese for "heart."