Kokoro-82M for Core ML -- Model Surgery for Apple Silicon

We took Kokoro-82M -- 82M parameters, Apache 2.0, trades blows with models 10x its size -- and cut it apart so each piece runs on the Apple Silicon processor that's best at that job.

~17x faster than real-time. On-device. Offline. No API keys. No network. No cents-per-character.

Just .mlpackage files and a Swift MLModel(contentsOf:) call.

Source code, exporters, Swift runtime: github.com/mattmireles/kokoro-coreml
Pre-converted models (this repo): huggingface.co/mattmireles/kokoro-coreml

Why surgery?

Apple Silicon isn't one chip. It's three -- CPU, GPU, and the Neural Engine (ANE) -- each built for different workloads. Most Core ML ports shove the whole model through and hope the scheduler figures it out. It doesn't. You end up on CPU wondering why your "Neural Engine model" is pegged on a single core.

We dissected Kokoro's TTS pipeline and made three deliberate cuts:

                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  "Hello world"  ────▢  β”‚  DURATION MODEL                β”‚
                        β”‚  Transformers + LSTMs          β”‚  ◀── CPU/GPU
                        β”‚  Sequential, dynamic-length    β”‚      Best at: branching,
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      variable sequences
                                    β”‚
                                    β–Ό
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚  ALIGNMENT  (Swift / CPU)      β”‚
                        β”‚  Build matrix from durations   β”‚  ◀── CPU
                        β”‚  ~50 lines of code             β”‚      Best at: small, complex
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      data-dependent logic
                                    β”‚
                                    β–Ό
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚  DECODER / VOCODER             β”‚
                        β”‚  Heavy convolutions + iSTFT    β”‚  ◀── Neural Engine (ANE)
                        β”‚  Fixed shapes, pure math       β”‚      Best at: dense parallel
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      tensor operations
                                    β”‚
                                    β–Ό
                               24 kHz Audio

The Neural Engine devours fixed-shape convolutions -- exactly the dense parallel math that dominates audio synthesis. But it chokes on dynamic shapes and data-dependent control flow. So we give the messy sequential stuff to the CPU, build alignment in plain Swift, and hand the ANE a clean, static tensor.
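Core ML exposes this kind of pinning directly through MLModelConfiguration.computeUnits. A configuration sketch of the split described above (the durationURL/decoderURL names are placeholders; the repo's loader may wire this up differently):

```swift
import CoreML

// Duration model: sequential, dynamic-length work. Keep it off the ANE.
let durationConfig = MLModelConfiguration()
durationConfig.computeUnits = .cpuAndGPU

// Decoder/vocoder: fixed-shape, dense tensor math. Let the ANE take it.
let decoderConfig = MLModelConfiguration()
decoderConfig.computeUnits = .cpuAndNeuralEngine

// let duration = try MLModel(contentsOf: durationURL, configuration: durationConfig)
// let decoder  = try MLModel(contentsOf: decoderURL,  configuration: decoderConfig)
```

Without an explicit configuration, Core ML defaults to `.all` and the scheduler decides — which is exactly the "hope it figures it out" failure mode above.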

Redesign the inference pipeline, not the model.

Performance (M2 Ultra, warmed)

Each run synthesizes 23.7 seconds of audio:

Bucket   Wall time   RTF     Speed vs real-time
5s       ~1.35s      0.057   17x
15s      ~1.41s      0.060   17x
30s      ~1.38s      0.058   17x

Where the time goes: ANE predict ~0.25--0.31s, CPU preprocessing ~0.15--0.17s, iSTFT ~0.02--0.03s, orchestration ~0.55--0.60s. Cold start ~2--3s, then under 1.5s per call. ~200MB per loaded model. 24 kHz mono PCM output.
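RTF (real-time factor) here is wall-clock seconds spent per second of audio produced; lower is better, and 1/RTF is the speedup over real time. A trivial illustrative helper, using the 5s-bucket row above:

```swift
// Real-time factor: wall-clock time divided by audio duration.
// RTF < 1 means faster than real-time; speedup = 1 / RTF.
func realTimeFactor(wallSeconds: Double, audioSeconds: Double) -> Double {
    wallSeconds / audioSeconds
}

let rtf = realTimeFactor(wallSeconds: 1.35, audioSeconds: 23.7)
// rtf β‰ˆ 0.057, i.e. roughly 17x faster than real-time
```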

What's in the download

File                                  What it does                                          Runs on
kokoro_duration.mlpackage             Phoneme durations + style embeddings from token IDs   CPU/GPU
kokoro_decoder_only_3s.mlpackage      Decoder/vocoder, ~3s audio                            ANE (static shapes)
kokoro_decoder_only_5s.mlpackage      Decoder/vocoder, ~5s audio                            ANE (static shapes)
kokoro_decoder_only_10s.mlpackage     Decoder/vocoder, ~10s audio                           ANE (static shapes)
kokoro_decoder_har_post_*s.mlpackage  Post-harmonic stack (hybrid path)                     ANE tail; hn-nsf on CPU

Pick the smallest bucket that fits your predicted utterance. That's the whole strategy.
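That selection can be a few lines. A hypothetical picker (the bucket sizes match the table above; the duration estimate comes from the duration model):

```swift
// Static decoder buckets available, in seconds, smallest first.
let bucketSeconds: [Double] = [3, 5, 10]

/// Returns the smallest bucket that fits the predicted utterance length,
/// or the largest bucket if nothing fits (then chunk the text upstream).
func pickBucket(predictedSeconds: Double) -> Double {
    for bucket in bucketSeconds where predictedSeconds <= bucket {
        return bucket
    }
    return bucketSeconds.last!
}
```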

Usage (Swift)

import CoreML

let duration = try MLModel(contentsOf: durationURL)
let decoder  = try MLModel(contentsOf: decoder3sURL)

// 1. Tokenize text β†’ input_ids [1, 128]
// 2. duration.prediction(...) β†’ pred_dur, t_en, ref_s_out
// 3. Build alignment matrix in Swift from pred_dur
// 4. asr = t_en @ alignment β†’ [1, 512, 72]
// 5. decoder.prediction(...) β†’ waveform [1, 43200]
// 6. AVAudioPCMBuffer @ 24 kHz β†’ speaker
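Step 3 is the whole "alignment in plain Swift" story: expand per-token durations into a hard 0/1 [tokens Γ— frames] matrix. A minimal sketch (names illustrative; the repo's version also handles the bucket's exact frame count):

```swift
/// Expands per-token frame durations into a hard alignment matrix:
/// token i occupies durations[i] consecutive frames, left to right.
/// Frames beyond frameCount are truncated; unused frames stay zero (padding).
func buildAlignment(durations: [Int], frameCount: Int) -> [[Float]] {
    var matrix = Array(repeating: Array(repeating: Float(0), count: frameCount),
                       count: durations.count)
    var frame = 0
    for (token, dur) in durations.enumerated() {
        for _ in 0..<dur {
            guard frame < frameCount else { return matrix }  // bucket is full
            matrix[token][frame] = 1
            frame += 1
        }
    }
    return matrix
}
```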

Full Swift glue (tokenizer, alignment builder, PCM playback) is in the GitHub repo. It's small. You can read it in an afternoon.
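Step 4's projection (asr = t_en @ alignment) is a plain matrix product, shown here as a naive nested loop with the batch dimension dropped for clarity — in practice you'd hand this to Accelerate (vDSP/BLAS) rather than loop by hand:

```swift
/// Naive matrix product: a is [channels x tokens] encoder features,
/// b is the [tokens x frames] alignment matrix; result is [channels x frames].
func matmul(_ a: [[Float]], _ b: [[Float]]) -> [[Float]] {
    let rows = a.count, inner = b.count, cols = b[0].count
    var out = Array(repeating: Array(repeating: Float(0), count: cols), count: rows)
    for i in 0..<rows {
        for k in 0..<inner {
            for j in 0..<cols {
                out[i][j] += a[i][k] * b[k][j]
            }
        }
    }
    return out
}
```

Because the alignment matrix is 0/1 with one hot frame per column, this product effectively repeats each token's feature column for the duration of that token.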

Tensor shapes (3s decoder bucket)

Inputs:
  asr      [1, 512, 72]   float16   text features x alignment
  F0_pred  [1, 144]       float16   pitch contour
  N_pred   [1, 144]       float16   noise/aperiodicity
  ref_s    [1, 256]       float16   voice embedding

Output:
  waveform [1, 43200]     float16   3s @ 24kHz

Everything is static and float16. No dynamic ops. No RangeDim. No non_zero kernels.
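Static shapes mean the Swift side zero-pads (or truncates) every dynamic-length feature to the bucket's fixed size before calling the decoder. A hedged sketch of that padding step (helper name is illustrative):

```swift
/// Pads a dynamic-length feature vector with zeros to a fixed length,
/// or truncates it, so it always matches the decoder's static input shape.
func padToStatic(_ values: [Float], length: Int) -> [Float] {
    if values.count >= length {
        return Array(values.prefix(length))   // truncate to the bucket
    }
    return values + Array(repeating: 0, count: length - values.count)
}
```

For the 3s bucket above, F0_pred and N_pred would each be padded to 144 and the frame axis of asr to 72.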

Requirements

  • iOS 16+ / macOS 13+ (MLProgram + modern Core ML runtime)
  • Apple Silicon (M1+) or A15+ for Neural Engine acceleration
  • Runs on older chips too, just slower

License

Apache 2.0, inherited from Kokoro-82M. Ship it. Sell it. Fork it.

Credits

  • @hexgrad -- Kokoro-82M weights, training, and the Apache release
  • @yl4579 -- StyleTTS 2 architecture
  • Apple's coremltools team -- for maintaining the PyTorch-to-Core ML path

Kokoro (εΏƒ) -- Japanese for "heart."
