Kokoro-82M for Core ML -- Model Surgery for Apple Silicon

We took Kokoro-82M -- 82M parameters, Apache 2.0, trades blows with models 10x its size -- and cut it apart so each piece runs on the Apple Silicon processor that's best at that job.

~17x faster than real-time. On-device. Offline. No API keys. No network. No cents-per-character.

Just .mlpackage files and a Swift MLModel(contentsOf:) call.

Source code, exporters, Swift runtime: github.com/mattmireles/kokoro-coreml
Pre-converted models (this repo): huggingface.co/mattmireles/kokoro-coreml

Why surgery?

Apple Silicon isn't one chip. It's three -- CPU, GPU, and the Neural Engine (ANE) -- each built for different workloads. Most Core ML ports shove the whole model through and hope the scheduler figures it out. It doesn't. You end up on CPU wondering why your "Neural Engine model" is pegged on a single core.

We dissected Kokoro's TTS pipeline and made three deliberate cuts:

                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  "Hello world"  ────▢  β”‚  DURATION MODEL                β”‚
                        β”‚  Transformers + LSTMs          β”‚  ◀── CPU/GPU
                        β”‚  Sequential, dynamic-length    β”‚      Best at: branching,
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      variable sequences
                                    β”‚
                                    β–Ό
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚  ALIGNMENT  (Swift / CPU)      β”‚
                        β”‚  Build matrix from durations   β”‚  ◀── CPU
                        β”‚  ~50 lines of code             β”‚      Best at: small, complex
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      data-dependent logic
                                    β”‚
                                    β–Ό
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚  DECODER / VOCODER             β”‚
                        β”‚  Heavy convolutions + iSTFT    β”‚  ◀── Neural Engine (ANE)
                        β”‚  Fixed shapes, pure math       β”‚      Best at: dense parallel
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      tensor operations
                                    β”‚
                                    β–Ό
                               24 kHz Audio

The Neural Engine devours fixed-shape convolutions -- exactly the dense parallel math that dominates audio synthesis. But it chokes on dynamic shapes and data-dependent control flow. So we give the messy sequential stuff to the CPU, build alignment in plain Swift, and hand the ANE a clean, static tensor.
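Core ML exposes this kind of pinning directly through MLModelConfiguration.computeUnits. A configuration sketch of the split described above (the durationURL/decoderURL names are placeholders; the repo's loader may wire this up differently):

```swift
import CoreML

// Duration model: sequential, dynamic-length work. Keep it off the ANE.
let durationConfig = MLModelConfiguration()
durationConfig.computeUnits = .cpuAndGPU

// Decoder/vocoder: fixed-shape, dense tensor math. Let the ANE take it.
let decoderConfig = MLModelConfiguration()
decoderConfig.computeUnits = .cpuAndNeuralEngine

// let duration = try MLModel(contentsOf: durationURL, configuration: durationConfig)
// let decoder  = try MLModel(contentsOf: decoderURL,  configuration: decoderConfig)
```

Without an explicit configuration, Core ML defaults to `.all` and the scheduler decides — which is exactly the "hope it figures it out" failure mode above.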

Redesign the inference pipeline, not the model.

Performance (M2 Ultra, warmed)

Each run synthesizes 23.7 seconds of audio:

Bucket   Wall time   RTF     Speed vs real-time
5s       ~1.35s      0.057   17x
15s      ~1.41s      0.060   17x
30s      ~1.38s      0.058   17x

Where the time goes: ANE predict ~0.25--0.31s, CPU preprocessing ~0.15--0.17s, iSTFT ~0.02--0.03s, orchestration ~0.55--0.60s. Cold start ~2--3s, then under 1.5s per call. ~200MB per loaded model. 24 kHz mono PCM output.
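RTF (real-time factor) here is wall-clock seconds spent per second of audio produced; lower is better, and 1/RTF is the speedup over real time. A trivial illustrative helper, using the 5s-bucket row above:

```swift
// Real-time factor: wall-clock time divided by audio duration.
// RTF < 1 means faster than real-time; speedup = 1 / RTF.
func realTimeFactor(wallSeconds: Double, audioSeconds: Double) -> Double {
    wallSeconds / audioSeconds
}

let rtf = realTimeFactor(wallSeconds: 1.35, audioSeconds: 23.7)
// rtf β‰ˆ 0.057, i.e. roughly 17x faster than real-time
```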

What's in the download

File                                  What it does                                          Runs on
kokoro_duration.mlpackage             Phoneme durations + style embeddings from token IDs   CPU/GPU
kokoro_decoder_only_3s.mlpackage      Decoder/vocoder, ~3s audio                            ANE (static shapes)
kokoro_decoder_only_5s.mlpackage      Decoder/vocoder, ~5s audio                            ANE (static shapes)
kokoro_decoder_only_10s.mlpackage     Decoder/vocoder, ~10s audio                           ANE (static shapes)
kokoro_decoder_har_post_*s.mlpackage  Post-harmonic stack (hybrid path)                     ANE tail; hn-nsf on CPU

Pick the smallest bucket that fits your predicted utterance. That's the whole strategy.
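That selection can be a few lines. A hypothetical picker (the bucket sizes match the table above; the duration estimate comes from the duration model):

```swift
// Static decoder buckets available, in seconds, smallest first.
let bucketSeconds: [Double] = [3, 5, 10]

/// Returns the smallest bucket that fits the predicted utterance length,
/// or the largest bucket if nothing fits (then chunk the text upstream).
func pickBucket(predictedSeconds: Double) -> Double {
    for bucket in bucketSeconds where predictedSeconds <= bucket {
        return bucket
    }
    return bucketSeconds.last!
}
```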

Usage (Swift)

import CoreML

let duration = try MLModel(contentsOf: durationURL)
let decoder  = try MLModel(contentsOf: decoder3sURL)

// 1. Tokenize text β†’ input_ids [1, 128]
// 2. duration.prediction(...) β†’ pred_dur, t_en, ref_s_out
// 3. Build alignment matrix in Swift from pred_dur
// 4. asr = t_en @ alignment β†’ [1, 512, 72]
// 5. decoder.prediction(...) β†’ waveform [1, 43200]
// 6. AVAudioPCMBuffer @ 24 kHz β†’ speaker
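Step 3 is the whole "alignment in plain Swift" story: expand per-token durations into a hard 0/1 [tokens Γ— frames] matrix. A minimal sketch (names illustrative; the repo's version also handles the bucket's exact frame count):

```swift
/// Expands per-token frame durations into a hard alignment matrix:
/// token i occupies durations[i] consecutive frames, left to right.
/// Frames beyond frameCount are truncated; unused frames stay zero (padding).
func buildAlignment(durations: [Int], frameCount: Int) -> [[Float]] {
    var matrix = Array(repeating: Array(repeating: Float(0), count: frameCount),
                       count: durations.count)
    var frame = 0
    for (token, dur) in durations.enumerated() {
        for _ in 0..<dur {
            guard frame < frameCount else { return matrix }  // bucket is full
            matrix[token][frame] = 1
            frame += 1
        }
    }
    return matrix
}
```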

Full Swift glue (tokenizer, alignment builder, PCM playback) is in the GitHub repo. It's small. You can read it in an afternoon.
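Step 4's projection (asr = t_en @ alignment) is a plain matrix product, shown here as a naive nested loop with the batch dimension dropped for clarity — in practice you'd hand this to Accelerate (vDSP/BLAS) rather than loop by hand:

```swift
/// Naive matrix product: a is [channels x tokens] encoder features,
/// b is the [tokens x frames] alignment matrix; result is [channels x frames].
func matmul(_ a: [[Float]], _ b: [[Float]]) -> [[Float]] {
    let rows = a.count, inner = b.count, cols = b[0].count
    var out = Array(repeating: Array(repeating: Float(0), count: cols), count: rows)
    for i in 0..<rows {
        for k in 0..<inner {
            for j in 0..<cols {
                out[i][j] += a[i][k] * b[k][j]
            }
        }
    }
    return out
}
```

Because the alignment matrix is 0/1 with one hot frame per column, this product effectively repeats each token's feature column for the duration of that token.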

Tensor shapes (3s decoder bucket)

Inputs:
  asr      [1, 512, 72]   float16   text features x alignment
  F0_pred  [1, 144]       float16   pitch contour
  N_pred   [1, 144]       float16   noise/aperiodicity
  ref_s    [1, 256]       float16   voice embedding

Output:
  waveform [1, 43200]     float16   3s @ 24kHz

Everything is static and float16. No dynamic ops. No RangeDim. No non_zero kernels.
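Static shapes mean the Swift side zero-pads (or truncates) every dynamic-length feature to the bucket's fixed size before calling the decoder. A hedged sketch of that padding step (helper name is illustrative):

```swift
/// Pads a dynamic-length feature vector with zeros to a fixed length,
/// or truncates it, so it always matches the decoder's static input shape.
func padToStatic(_ values: [Float], length: Int) -> [Float] {
    if values.count >= length {
        return Array(values.prefix(length))   // truncate to the bucket
    }
    return values + Array(repeating: 0, count: length - values.count)
}
```

For the 3s bucket above, F0_pred and N_pred would each be padded to 144 and the frame axis of asr to 72.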

Requirements

  • iOS 16+ / macOS 13+ (MLProgram + modern Core ML runtime)
  • Apple Silicon (M1+) or A15+ for Neural Engine acceleration
  • Runs on older chips too, just slower

License

Apache 2.0, inherited from Kokoro-82M. Ship it. Sell it. Fork it.

Credits

  • @hexgrad -- Kokoro-82M weights, training, and the Apache release
  • @yl4579 -- StyleTTS 2 architecture
  • Apple's coremltools team -- for maintaining the PyTorch-to-Core ML path

Kokoro (εΏƒ) -- Japanese for "heart."
