VoxCPM-0.5B: Core AI (on-device, iPhone + Mac)
OpenBMB's VoxCPM-0.5B converted to Apple's Core AI engine, running fully on-device: iPhone (Apple Neural Engine / GPU, AOT-compiled) and Apple-silicon Mac. No network, no server.
VoxCPM is not a classic vocoder TTS: it pairs a MiniCPM4 language-model backbone with a LocDiT flow-matching diffusion head and an AudioVAE, generating speech through a continuous (token-rate) diffusion loop. This repo ships the whole stack as Core AI model bundles plus the small host-side glue the runtime needs.
- Output: 16 kHz mono
- License: Apache-2.0 (commercial-friendly), inherited from the base model
- Quantization: weight-only int8 on the two LM backbones (the size driver); the diffusion decoder, feature encoder, and AudioVAE stay fp16, since the continuous-feedback path is quantization-sensitive (the same split mlx-community/VoxCPM2 uses).
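To make "weight-only int8" concrete, here is a minimal sketch of symmetric per-row quantization. The source does not state the exact scheme the conversion scripts use (per-channel, per-block, symmetric or not), so `QuantizedRow`, `quantizeRow`, and `dequantizeRow` are hypothetical names chosen for illustration only. The point is that only the stored weights become int8; activations stay in floating point.

```swift
/// Hypothetical container: one weight row stored as int8 plus a float scale.
struct QuantizedRow {
    let values: [Int8]
    let scale: Float
}

/// Symmetric quantization: map the largest magnitude in the row to 127.
/// Assumption: per-row scaling; the real bundles may use a different granularity.
func quantizeRow(_ row: [Float]) -> QuantizedRow {
    let maxAbs = row.map { abs($0) }.max() ?? 0
    let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
    let values = row.map { w -> Int8 in
        let q = (w / scale).rounded()
        // Clamp before converting so the Int8 initializer cannot trap.
        return Int8(max(-127, min(127, q)))
    }
    return QuantizedRow(values: values, scale: scale)
}

/// Weights are expanded back to float at compute time; activations are never quantized.
func dequantizeRow(_ q: QuantizedRow) -> [Float] {
    q.values.map { Float($0) * q.scale }
}
```

The round-trip error per weight is bounded by half a quantization step (`scale / 2`), which is small for a single feed-forward pass but can compound in a path that feeds its own output back in, as the fp16 components do.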
Contents
| Path | What |
|---|---|
| macos/voxcpm_base_int8_decode_cl512/ | LM backbone (MiniCPM4, 24L), int8, static-KV decode; JIT .aimodel for Mac |
| macos/voxcpm_res_int8_decode_cl512/ | Residual LM (6L), int8 |
| macos/voxcpm_feat_decoder_fp16/ | LocDiT CFM diffusion decoder (10-step Euler + CFG, unrolled), fp16 |
| macos/voxcpm_feat_encoder_fp16/ | LocEnc + projection (per-frame feedback embed), fp16 |
| macos/voxcpm_vocoder_fp16_t12/ | AudioVAE decoder (DAC-style, 640× upsample), fp16 |
| ios/*.h18p.aimodelc/ | The same five bundles, AOT-compiled for iOS (h18p) |
| voxcpm_host_glue/ | Token-embedding table + dit/FSQ/stop-head weights (run host-side via Accelerate) |
| tokenizer/ | Llama tokenizer (tokenizer.json + config) |
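The feature-decoder row mentions a 10-step Euler solver with classifier-free guidance (CFG). In the shipped bundle that loop is unrolled inside the model, so the host never runs it. The sketch below only illustrates the general technique. The `Velocity` closure type, the `sampleEulerCFG` name, the uniform time grid, and the 0 to 1 integration direction are all assumptions, not details taken from the conversion scripts.

```swift
/// Hypothetical velocity-field signature: returns dx/dt at (x, t),
/// with or without the conditioning signal.
typealias Velocity = (_ x: [Float], _ t: Float, _ conditioned: Bool) -> [Float]

/// Fixed-step Euler integration of a flow-matching ODE with CFG.
/// Assumption: uniform steps from t = 0 to t = 1.
func sampleEulerCFG(start: [Float], steps: Int, cfgScale: Float,
                    velocity: Velocity) -> [Float] {
    precondition(steps > 0, "need at least one step")
    var x = start
    let dt = 1.0 / Float(steps)
    for i in 0..<steps {
        let t = Float(i) * dt
        let vCond = velocity(x, t, true)
        let vUncond = velocity(x, t, false)
        for j in x.indices {
            // CFG: extrapolate from the unconditional toward the conditional velocity.
            let v = vUncond[j] + cfgScale * (vCond[j] - vUncond[j])
            x[j] += dt * v
        }
    }
    return x
}
```

Each step costs two velocity evaluations (conditional and unconditional), so a 10-step solve amounts to 20 network passes per generated frame unless the two are batched together.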
The two prefill bundles are intentionally not shipped: prefill runs through the q=1 decode bundle instead. The model is causal, so feeding the prompt one token at a time gives the same result as a batched prefill, which also makes text length unbounded.
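The prefill-through-decode idea can be sketched as a plain loop over the prompt. `DecodeStep`, `prefill`, and `ToyCausalModel` are hypothetical stand-ins written for this illustration; they are not CoreAIKit types, and the real bundle's inputs and outputs are not described in this README.

```swift
/// Hypothetical stand-in for a q=1 decode bundle: consumes one token,
/// updates its internal cache, and returns the output for that position.
protocol DecodeStep {
    mutating func step(token: Int, position: Int) -> [Float]
}

/// Prefill by stepping: run the prompt through the decode path one token
/// at a time and keep only the last output, which seeds generation.
func prefill<M: DecodeStep>(_ model: inout M, tokens: [Int]) -> [Float]? {
    var last: [Float]? = nil
    for (position, token) in tokens.enumerated() {
        last = model.step(token: token, position: position)
    }
    return last
}

/// Toy causal model: the output at each position depends only on tokens
/// seen so far (here, their running mean), mimicking a growing KV cache.
struct ToyCausalModel: DecodeStep {
    var cache: [Float] = []
    mutating func step(token: Int, position: Int) -> [Float] {
        cache.append(Float(token))
        return [cache.reduce(0, +) / Float(cache.count)]
    }
}
```

The trade-off is latency: a prompt of N tokens costs N sequential decode calls instead of one batched call, in exchange for shipping fewer bundles.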
Usage
The easiest path is the coreai-audio app in coreai-model-zoo (the "Voice" tab), together with CoreAIKit:
import CoreAIKit
let tts = try await VoxCPMTTS(paths: .standard(artifactsRoot: modelRoot)) // macOS (.aimodel)
// let tts = try await VoxCPMTTS(paths: .aot(root: modelRoot, arch: "h18p")) // iOS (.aimodelc)
let pcm = try await tts.synthesize("On device speech synthesis, running entirely on your iPhone.")
// pcm: [Float] @ 16 kHz mono
The conversion scripts and the Swift host are in the zoo (conversion/voxcpm/) and CoreAIKit.
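To listen to the `[Float]` buffer that `synthesize` returns, one option is to wrap it in a 16-bit PCM WAV container. The helper below is a minimal sketch using only Foundation; `wavData(from:sampleRate:)` is a hypothetical name and is not part of CoreAIKit. It assumes samples are nominally in the range -1...1 and clamps anything outside it.

```swift
import Foundation

/// Hypothetical helper: encode mono float PCM as a 16-bit little-endian WAV file.
func wavData(from pcm: [Float], sampleRate: UInt32 = 16_000) -> Data {
    var data = Data()
    // Append any fixed-width integer in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    let bytesPerSample: UInt16 = 2
    let dataSize = UInt32(pcm.count) * UInt32(bytesPerSample)

    data.append(contentsOf: Array("RIFF".utf8))
    append(36 + dataSize)                        // remaining file size
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                           // fmt chunk size
    append(UInt16(1))                            // format tag: integer PCM
    append(UInt16(1))                            // channels: mono
    append(sampleRate)
    append(sampleRate * UInt32(bytesPerSample))  // byte rate
    append(bytesPerSample)                       // block align
    append(UInt16(16))                           // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(dataSize)
    for sample in pcm {
        let clamped = max(-1, min(1, sample))
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}
```

With the `pcm` value from the usage snippet above, `try wavData(from: pcm).write(to: outputURL)` would produce a playable file, where `outputURL` is any writable file URL you choose.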
Notes
- Plain TTS (fixed speaker). VoxCPM's voice-cloning branch is a follow-on.
- Per-step quality is fp16-equivalent (int8 LM cosine similarity > 0.999 vs the fp32 reference); whole-utterance output is natural speech.
- Community port, not an official Apple model.
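The per-step figure in the notes is a cosine similarity between the int8 output and a reference output. A minimal way to compute that metric is sketched below. `cosineSimilarity` is a hypothetical helper, and how the author's evaluation gathered and flattened the two output tensors is not stated here.

```swift
/// Cosine similarity between two equally sized vectors.
/// Returns 0 when either vector has zero norm.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have equal length")
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = (normA * normB).squareRoot()
    return denom > 0 ? dot / denom : 0
}
```

A value near 1 means the two outputs point in almost the same direction, regardless of overall scale. It does not by itself guarantee audio quality over a whole utterance, which is why the notes report that separately.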
Acknowledgements
OpenBMB / VoxCPM. Built on Apple's Core AI.