VoxCPM-0.5B: Core AI (on-device, iPhone + Mac)

OpenBMB's VoxCPM-0.5B converted to Apple's Core AI engine, running fully on-device on iPhone (Apple Neural Engine / GPU, AOT-compiled) and Apple-silicon Mac. No network, no server.

VoxCPM is not a classic vocoder TTS: it pairs a MiniCPM4 language-model backbone with a LocDiT flow-matching diffusion head and an AudioVAE, generating speech through a continuous (token-rate) diffusion loop. This repo ships the whole stack as Core AI model bundles plus the small host-side glue the runtime needs.

  • Output: 16 kHz mono
  • License: Apache-2.0 (commercial-friendly), inherited from the base model
  • Quantization: weight-only int8 on the two LM backbones (the size driver); the diffusion decoder, feature encoder, and AudioVAE stay fp16, because the continuous-feedback path is quantization-sensitive (the same split mlx-community/VoxCPM2 uses).
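
As a rough illustration of what "weight-only int8" means, the sketch below quantizes one weight row with a symmetric per-row scale and dequantizes it again. The scheme (symmetric, per-row absmax scaling) and the names `quantizeRow` / `dequantizeRow` are assumptions for illustration only; this card does not state which int8 scheme the conversion uses.

```swift
import Foundation

/// Symmetric int8 quantization of one weight row.
/// Hypothetical helper: the scheme used by the real conversion is not stated here.
func quantizeRow(_ row: [Float]) -> (q: [Int8], scale: Float) {
    let absMax = row.map { abs($0) }.max() ?? 0
    // Avoid a zero scale for an all-zero row.
    let scale: Float = absMax > 0 ? absMax / 127 : 1
    let q = row.map { Int8(max(-127, min(127, ($0 / scale).rounded()))) }
    return (q, scale)
}

/// Dequantize back to Float. Only weights are quantized; activations stay floating point.
func dequantizeRow(_ q: [Int8], scale: Float) -> [Float] {
    q.map { Float($0) * scale }
}

let weights: [Float] = [0.50, -1.27, 0.03, 0.99]
let (q, scale) = quantizeRow(weights)
let restored = dequantizeRow(q, scale: scale)
// Round-to-nearest keeps the per-weight error within half a quantization step.
let maxError = zip(weights, restored).map { abs($0 - $1) }.max() ?? 0
print(q, scale, maxError)
```

Storing one byte per weight instead of two is what shrinks the LM backbones; the fp16 components are left untouched in this picture.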

Contents

Path                                  What
macos/voxcpm_base_int8_decode_cl512/  LM backbone (MiniCPM4, 24L), int8, static-KV decode; JIT .aimodel for Mac
macos/voxcpm_res_int8_decode_cl512/   Residual LM (6L), int8
macos/voxcpm_feat_decoder_fp16/       LocDiT CFM diffusion decoder (10-step Euler + CFG, unrolled), fp16
macos/voxcpm_feat_encoder_fp16/       LocEnc + projection (per-frame feedback embed), fp16
macos/voxcpm_vocoder_fp16_t12/        AudioVAE decoder (DAC-style, 640× upsample), fp16
ios/*.h18p.aimodelc/                  The same five bundles, AOT-compiled for iOS (h18p)
voxcpm_host_glue/                     Token-embedding table + dit/FSQ/stop-head weights (run host-side via Accelerate)
tokenizer/                            Llama tokenizer (tokenizer.json + config)
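
For readers unfamiliar with the "10-step Euler + CFG" note on the diffusion decoder, here is a minimal sketch of fixed-step Euler integration with classifier-free guidance on a toy velocity field. The `Velocity` closure type, the guidance value, the time direction (0 to 1), and the guidance formula `uncond + guidance * (cond - uncond)` are assumptions based on the common form of the technique, not details read from the shipped bundle.

```swift
import Foundation

/// Hypothetical velocity field: (state, time, conditioned?) -> velocity.
typealias Velocity = ([Float], Float, Bool) -> [Float]

/// Fixed-step Euler integration with classifier-free guidance (CFG).
func eulerCFG(start: [Float], steps: Int, guidance: Float, velocity: Velocity) -> [Float] {
    var x = start
    let dt = 1 / Float(steps)
    for i in 0..<steps {
        let t = Float(i) * dt
        let cond = velocity(x, t, true)
        let uncond = velocity(x, t, false)
        for j in x.indices {
            // Common CFG form: extrapolate away from the unconditional prediction.
            let v = uncond[j] + guidance * (cond[j] - uncond[j])
            x[j] += dt * v
        }
    }
    return x
}

// Toy field: the conditional branch has constant velocity 1, the unconditional branch 0.
let toy: Velocity = { x, _, conditioned in x.map { _ in conditioned ? 1 : 0 } }
let out = eulerCFG(start: [0, 0], steps: 10, guidance: 2, velocity: toy)
print(out)
```

"Unrolled" in the table suggests the fixed number of steps is baked into the model graph rather than driven by a host-side loop like this one; the sketch only shows the arithmetic each step performs.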

The two prefill bundles are intentionally not shipped: prefill runs through the q=1 decode bundle (it is causal, so step-by-step == batched), which also makes text length unbounded.
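
The equivalence this relies on can be seen with a toy causal layer. In the sketch below, each output is the mean of all inputs up to its position (a stand-in for causal attention, not the real MiniCPM4 layer); running it once over the whole sequence and running it one token at a time with a growing cache give the same result. All names here are hypothetical.

```swift
import Foundation

/// Toy causal layer: each output is the mean of all inputs up to and including
/// its position (uniform causal attention). Stand-in for a real LM layer.
func batchedCausal(_ inputs: [Float]) -> [Float] {
    inputs.indices.map { i in
        inputs[0...i].reduce(0, +) / Float(i + 1)
    }
}

/// The same layer run one token at a time (q=1) with a growing cache.
struct StepDecoder {
    var cache: [Float] = []

    mutating func step(_ input: Float) -> Float {
        cache.append(input)
        return cache.reduce(0, +) / Float(cache.count)
    }
}

let tokens: [Float] = [1, 2, 3, 4]
let batched = batchedCausal(tokens)

var decoder = StepDecoder()
// "Prefill" expressed as repeated single-token decode steps.
let stepwise = tokens.map { decoder.step($0) }
print(batched, stepwise)
```

The trade-off is latency: feeding the prompt token by token costs one model call per token, in exchange for not shipping separate fixed-length prefill bundles.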

Usage

The easiest path is the coreai-audio app in coreai-model-zoo (the "Voice" tab), together with CoreAIKit:

import CoreAIKit

let tts = try await VoxCPMTTS(paths: .standard(artifactsRoot: modelRoot))   // macOS (.aimodel)
// let tts = try await VoxCPMTTS(paths: .aot(root: modelRoot, arch: "h18p")) // iOS (.aimodelc)
let pcm = try await tts.synthesize("On device speech synthesis, running entirely on your iPhone.")
// pcm: [Float] @ 16 kHz mono

The conversion scripts and the Swift host are in the zoo (conversion/voxcpm/) and CoreAIKit.
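
To listen to or save the result outside the app, the `[Float]` buffer at 16 kHz mono can be wrapped in a WAV container. The following is a minimal sketch using only Foundation; `wavData` is a hypothetical helper, not part of CoreAIKit, and it assumes the samples are in the range [-1, 1].

```swift
import Foundation

/// Encode mono Float samples in [-1, 1] as a 16-bit PCM WAV file in memory.
/// Hypothetical helper for illustration; not a CoreAIKit API.
func wavData(samples: [Float], sampleRate: UInt32 = 16_000) -> Data {
    let bytesPerSample: UInt32 = 2
    let dataSize = UInt32(samples.count) * bytesPerSample
    var data = Data()

    // Append a fixed-width integer in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(36 + dataSize)                 // size of everything after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                    // fmt chunk size
    append(UInt16(1))                     // format tag: linear PCM
    append(UInt16(1))                     // channels: mono
    append(sampleRate)
    append(sampleRate * bytesPerSample)   // byte rate
    append(UInt16(2))                     // block align (bytes per frame)
    append(UInt16(16))                    // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(dataSize)

    for s in samples {
        let clamped = max(-1, min(1, s))
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}

let wav = wavData(samples: [0, 0.5, -1])
print(wav.count)
```

The returned `Data` can then be written with `write(to:)` to any file URL the app is allowed to use.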

Notes

  • Plain TTS (fixed speaker). VoxCPM's voice-cloning branch is a follow-on.
  • Per-step quality is fp16-equivalent (cosine similarity > 0.999 for the int8 LM vs the fp32 reference); whole-utterance output is natural speech.
  • Community port, not an official Apple model.
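
The cosine-similarity check mentioned in the notes can be sketched as follows. The two vectors are made-up values standing in for a reference output and a quantized output; they are not measurements from this model, and `cosineSimilarity` is a hypothetical helper.

```swift
import Foundation

/// Cosine similarity between two equally sized vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for (x, y) in zip(a, b) {
        dot += x * y
        normA += x * x
        normB += y * y
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

// Made-up values standing in for a reference output and a quantized output.
let reference: [Float] = [0.12, -0.80, 0.33, 0.57]
let quantized: [Float] = [0.11, -0.81, 0.34, 0.57]
let similarity = cosineSimilarity(reference, quantized)
print(similarity > 0.999)
```

A per-step score like this measures direction agreement between outputs, which is why the note pairs it with a separate statement about whole-utterance quality.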

Acknowledgements

OpenBMB / VoxCPM. Built on Apple's Core AI.
