VoxCPM2 - 4-bit quantized

MLX port of openbmb/VoxCPM2 — a 2B-parameter multilingual TTS model with 48kHz studio-quality output, voice cloning, and voice design.

Only the LM layers are quantized to 4 bits; the VAE and DiT stay at full precision. This is the fastest and smallest variant, with minimal quality loss.

Features

30 languages — including English, Chinese, Indonesian, Japanese, Korean, and more
48kHz output — studio-quality audio
Voice Design — create voices from text descriptions (no reference audio needed)
Voice Cloning — clone any voice from a short audio reference
4 generation modes — zero-shot, continuation, reference cloning, combined

Usage

pip install mlx-audio

# Zero-shot
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit --text "Hello world" --verbose

# Voice design
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
  --text "Hello world" \
  --instruct "A young woman, gentle and sweet voice"

# Voice cloning
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
  --text "Hello world" \
  --ref_audio speaker.wav --ref_text "reference text"
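
When the same commands need to be issued from a script, for example to synthesize many lines of text, the argument list can be assembled in Python. The sketch below is a minimal illustration that uses only the flags shown above. The helper name `build_tts_command` is hypothetical and not part of mlx-audio.

```python
import shlex
import sys

MODEL = "mlx-community/VoxCPM2-4bit"


def build_tts_command(text, instruct=None, ref_audio=None, ref_text=None, verbose=False):
    """Assemble an argument list for `python -m mlx_audio.tts.generate`.

    Hypothetical helper: it only combines the flags shown in the commands above.
    """
    cmd = [sys.executable, "-m", "mlx_audio.tts.generate", "--model", MODEL, "--text", text]
    if instruct is not None:   # voice design from a text description
        cmd += ["--instruct", instruct]
    if ref_audio is not None:  # voice cloning from a reference clip
        cmd += ["--ref_audio", ref_audio]
    if ref_text is not None:   # text paired with the reference clip
        cmd += ["--ref_text", ref_text]
    if verbose:
        cmd.append("--verbose")
    return cmd


# Print the voice-design command instead of running it.
design = build_tts_command("Hello world", instruct="A young woman, gentle and sweet voice")
print(shlex.join(design))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute it, provided mlx-audio is installed. Building a list rather than a single string avoids shell-quoting problems with text that contains spaces or quotes.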

Python API

from mlx_audio.tts import load_model

model = load_model("mlx-community/VoxCPM2-4bit")

# Generate
for result in model.generate(
    text="Hello, this is VoxCPM2 on Apple Silicon.",
    inference_timesteps=7,
    cfg_value=2.0,
):
    print(f"Duration: {result.audio_duration}")
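
Because `model.generate` is consumed with a `for` loop, measuring how long a request takes means timing the whole loop rather than a single call. A minimal sketch follows. `timed_generate` is a hypothetical helper, not part of mlx-audio, and it assumes only that `generate` returns an iterable of results, as in the loop above.

```python
import time


def timed_generate(generate, **kwargs):
    """Exhaust a generate()-style iterator and return (results, elapsed_seconds).

    Hypothetical helper: `generate` is any callable that returns an iterable,
    for example `model.generate` from the snippet above.
    """
    start = time.perf_counter()
    results = list(generate(**kwargs))  # forces every result to be produced
    elapsed = time.perf_counter() - start
    return results, elapsed


# Usage with the loaded model (requires mlx-audio):
# results, seconds = timed_generate(
#     model.generate,
#     text="Hello, this is VoxCPM2 on Apple Silicon.",
#     inference_timesteps=7,
#     cfg_value=2.0,
# )
# print(f"{len(results)} result(s) in {seconds:.2f} s")
```

MLX evaluates arrays lazily, so the measured time is only meaningful if the audio has actually been computed by the time each result is yielded. That is an assumption here, not something this card states.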

Performance (Apple Silicon)

Variant  Size     RTF (7 timesteps)
bf16     4.96 GB  0.48x
8-bit    3.23 GB  0.85x
4-bit    2.30 GB  0.90x

RTF = Real-Time Factor, read here as seconds of audio generated per second of processing (higher is faster; >1.0 = faster than real time)
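
To make the table concrete, the sketch below works through the arithmetic. It assumes RTF is seconds of audio produced divided by seconds of wall-clock generation time, which is what the note that values above 1.0 are faster than real time implies. The function names are illustrative.

```python
def real_time_factor(audio_seconds, wall_seconds):
    """RTF under the assumed convention: audio produced per second of compute."""
    return audio_seconds / wall_seconds


def generation_time(audio_seconds, rtf):
    """Wall-clock seconds needed to produce `audio_seconds` of audio at a given RTF."""
    return audio_seconds / rtf


# Figures copied from the table above: (size in GB, RTF at 7 timesteps).
variants = {"bf16": (4.96, 0.48), "8-bit": (3.23, 0.85), "4-bit": (2.30, 0.90)}

base_size, base_rtf = variants["bf16"]
for name, (size_gb, rtf) in variants.items():
    print(
        f"{name}: {size_gb / base_size:.0%} of bf16 size, "
        f"{rtf / base_rtf:.2f}x bf16 speed, "
        f"{generation_time(10.0, rtf):.1f} s per 10 s of audio"
    )
```

Under that reading, the 4-bit variant is a little under half the size of bf16 and a little under twice as fast, while still slightly slower than real time on the measured hardware.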

Original Model

openbmb/VoxCPM2
Apache 2.0 License

Converted with mlx-audio.
