VoxCPM2 - 4-bit quantized

MLX port of openbmb/VoxCPM2, a 2B-parameter multilingual TTS model with 48kHz studio-quality output, voice cloning, and voice design.

Only the language-model (LM) layers are quantized to 4 bits; the VAE and DiT stay at full precision. This is the fastest and smallest of the three variants, with minimal quality loss.

Features

  • 30 languages, including English, Chinese, Indonesian, Japanese, Korean, and more
  • 48kHz output: studio-quality audio
  • Voice Design: create voices from text descriptions (no reference audio needed)
  • Voice Cloning: clone any voice from a short audio reference
  • 4 generation modes: zero-shot, continuation, reference cloning, combined

Usage

pip install mlx-audio

# Zero-shot
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit --text "Hello world" --verbose

# Voice design
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
  --text "Hello world" \
  --instruct "A young woman, gentle and sweet voice"

# Voice cloning
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
  --text "Hello world" \
  --ref_audio speaker.wav --ref_text "reference text"
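
When many lines need to be synthesized, it can help to assemble the commands above programmatically. The following is a minimal sketch, not part of mlx-audio: `build_command` is a hypothetical helper name, and it only uses the flags shown above (`--model`, `--text`, `--instruct`, `--ref_audio`, `--ref_text`).

```python
import shlex

MODEL = "mlx-community/VoxCPM2-4bit"


def build_command(text, instruct=None, ref_audio=None, ref_text=None, model=MODEL):
    """Return the argv list for one `python -m mlx_audio.tts.generate` call.

    Hypothetical helper: it only mirrors the flags shown in this card.
    """
    cmd = ["python", "-m", "mlx_audio.tts.generate", "--model", model, "--text", text]
    if instruct is not None:
        # Voice design: describe the voice in plain text
        cmd += ["--instruct", instruct]
    if ref_audio is not None:
        # Voice cloning: point at a reference recording
        cmd += ["--ref_audio", ref_audio]
    if ref_text is not None:
        # Transcript of the reference recording
        cmd += ["--ref_text", ref_text]
    return cmd


# Print a copy-pasteable shell line for the voice design example
design = build_command("Hello world", instruct="A young woman, gentle and sweet voice")
print(shlex.join(design))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems when the text contains quotes or spaces.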

Python API

from mlx_audio.tts import load_model

model = load_model("mlx-community/VoxCPM2-4bit")

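# Generate; the model yields one result object per generated segment
for result in model.generate(
    text="Hello, this is VoxCPM2 on Apple Silicon.",
    inference_timesteps=7,  # diffusion sampling steps (the benchmark below also uses 7)
    cfg_value=2.0,          # classifier-free guidance scale
):
    print(f"Duration: {result.audio_duration}")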
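
The loop above only prints the duration. To keep the audio, the samples have to be written out at the model's 48kHz rate. How the samples are exposed on `result` is not shown in this card, so the sketch below makes an assumption: it takes any iterable of mono floats in [-1.0, 1.0] and writes 16-bit PCM using only the Python standard library. `write_wav` is a hypothetical helper, not an mlx-audio function, and a synthetic tone stands in for generated audio.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 48_000  # this card states 48kHz output


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM; return the frame count."""
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return len(pcm) // 2


# Stand-in for generated audio: 0.1 s of a 440 Hz tone at half amplitude
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 10)]
out_path = os.path.join(tempfile.gettempdir(), "voxcpm2_example.wav")
frames = write_wav(out_path, tone)
print(f"Wrote {frames} frames to {out_path}")
```

With a real model, the float samples from each result would be passed in place of `tone`, after converting them to a plain Python list or NumPy array.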

Performance (Apple Silicon)

Variant   Size      RTF (7 timesteps)
bf16      4.96 GB   0.48x
8-bit     3.23 GB   0.85x
4-bit     2.30 GB   0.90x

RTF = Real-Time Factor, given here as a speed multiple: values above 1.0x are faster than real time, values below 1.0x are slower.
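
To make the table concrete, the sketch below turns each RTF into an estimated generation time for a 10-second clip and compares model sizes against bf16. It assumes RTF here means seconds of audio produced per second of compute, which is how the ">1.0 = faster" note reads; the card does not spell out the formula, so treat the numbers as rough estimates. `generation_seconds` is a name chosen for illustration.

```python
# Figures copied from the performance table above
VARIANTS = {
    "bf16": {"size_gb": 4.96, "rtf": 0.48},
    "8-bit": {"size_gb": 3.23, "rtf": 0.85},
    "4-bit": {"size_gb": 2.30, "rtf": 0.90},
}


def generation_seconds(audio_seconds, rtf):
    """Estimated wall-clock time, assuming rtf = audio seconds / compute seconds."""
    return audio_seconds / rtf


baseline_size = VARIANTS["bf16"]["size_gb"]
for name, v in VARIANTS.items():
    secs = generation_seconds(10.0, v["rtf"])
    size_ratio = v["size_gb"] / baseline_size
    print(f"{name}: ~{secs:.1f} s to generate 10 s of audio, "
          f"{size_ratio:.0%} of the bf16 size")
```

Under that assumption, the 4-bit variant needs roughly 11 seconds for 10 seconds of speech, against roughly 21 seconds for bf16, at a little under half the download size.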

Original Model

openbmb/VoxCPM2, converted with mlx-audio.
