VoxCPM2 - BFloat16 (full precision)

MLX port of openbmb/VoxCPM2, a 2B-parameter multilingual TTS model with 48kHz studio-quality output, voice cloning, and voice design.

Full-precision BFloat16 weights: best quality, largest size.

Features

  • 30 languages: English, Chinese, Indonesian, Japanese, Korean, and more
  • 48kHz output: studio-quality audio
  • Voice Design: create voices from text descriptions (no reference audio needed)
  • Voice Cloning: clone any voice from a short audio reference
  • 4 generation modes: zero-shot, continuation, reference cloning, combined

Usage

pip install mlx-audio

# Zero-shot
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-bf16 --text "Hello world" --verbose

# Voice design
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-bf16 \
  --text "Hello world" \
  --instruct "A young woman, gentle and sweet voice"

# Voice cloning
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-bf16 \
  --text "Hello world" \
  --ref_audio speaker.wav --ref_text "reference text"

Python API

from mlx_audio.tts import load_model

model = load_model("mlx-community/VoxCPM2-bf16")

# Generate
for result in model.generate(
    text="Hello, this is VoxCPM2 on Apple Silicon.",
    inference_timesteps=7,
    cfg_value=2.0,
):
    print(f"Duration: {result.audio_duration}")
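
To write generated samples to disk without extra dependencies, the stdlib wave module is enough. A minimal sketch follows; a synthetic 440 Hz sine stands in for model output, since the exact field holding the samples in mlx-audio's result object is not shown above.

```python
import math
import wave

# Synthetic 1-second 440 Hz tone at 48 kHz as a stand-in for model output;
# in practice you would use the samples yielded by model.generate instead.
sr = 48000
samples = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]

# Convert float samples in [-1, 1] to little-endian 16-bit PCM bytes.
pcm = b"".join(
    int(max(-1.0, min(1.0, s)) * 32767).to_bytes(2, "little", signed=True)
    for s in samples
)

# Write a mono WAV at the model's 48 kHz output rate.
with wave.open("out.wav", "wb") as f:
    f.setnchannels(1)   # mono
    f.setsampwidth(2)   # 16-bit
    f.setframerate(sr)  # 48 kHz
    f.writeframes(pcm)
```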

Performance (Apple Silicon)

Variant  Size     RTF (7 timesteps)
bf16     4.96 GB  0.48x
8-bit    3.23 GB  0.85x
4-bit    2.30 GB  0.90x

RTF = Real-Time Factor, here audio duration divided by generation time (higher is faster; >1.0 = faster than realtime)
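
Reading the table: under this RTF definition, estimated generation time is audio duration divided by RTF. A quick sketch using the figures above:

```python
# RTF figures from the table; RTF = audio duration / generation time,
# so generation time = audio duration / RTF.
rtf = {"bf16": 0.48, "8-bit": 0.85, "4-bit": 0.90}

audio_seconds = 10.0
for variant, r in rtf.items():
    gen_seconds = audio_seconds / r
    print(f"{variant}: ~{gen_seconds:.1f}s to generate {audio_seconds:.0f}s of audio")
# bf16: ~20.8s, 8-bit: ~11.8s, 4-bit: ~11.1s
```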

Original Model

Based on openbmb/VoxCPM2. Converted with mlx-audio.
