OmniVoice 🌍


OmniVoice is a massively multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. Built on a novel diffusion-language-model-style architecture, it delivers high-quality speech at high inference speed and supports both voice cloning and voice design.

Key Features

  • 600+ Languages Supported: The broadest language coverage among zero-shot TTS models.
  • Voice Cloning: State-of-the-art voice cloning quality from a short reference audio clip.
  • Voice Design: Control voices via assigned speaker attributes (gender, age, pitch, dialect/accent, whisper, etc.).
  • Fast Inference: Real-time factor (RTF) as low as 0.025 (40x faster than real time).
  • Diffusion Language Model Architecture: A clean, streamlined, and scalable design that delivers both quality and speed.
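
The RTF figure in the feature list can be read as a simple ratio: RTF is compute time divided by the duration of the audio produced, so the speed-up over real time is its reciprocal. The sketch below only illustrates that arithmetic; the helper names `speedup_from_rtf` and `synthesis_seconds` are hypothetical and are not part of the omnivoice library.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Return how many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate the compute time needed to produce `audio_seconds` of speech."""
    if audio_seconds < 0:
        raise ValueError("audio duration must be non-negative")
    return audio_seconds * rtf


if __name__ == "__main__":
    # Illustrative values only: the quoted best-case RTF and one minute of speech.
    print(f"speed-up: {speedup_from_rtf(0.025):.1f}x")
    print(f"compute time for 60 s of audio: {synthesis_seconds(60.0, 0.025):.2f} s")
```

At the quoted best-case RTF of 0.025, one minute of speech would take roughly 1.5 seconds of compute. Actual figures depend on hardware, precision, and text length.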

Sample Usage

To get started, install the omnivoice library:

pip install omnivoice

Python API

You can use OmniVoice for zero-shot voice cloning as follows:

from omnivoice import OmniVoice
import torch
import torchaudio

# Load the model
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16
)

# Generate audio
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
) # audio is a list of `torch.Tensor` with shape (1, T) at 24 kHz.

torchaudio.save("out.wav", audio[0], 24000)
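
Since the output is described as a `(1, T)` tensor at 24 kHz, the clip duration is `T / 24000` seconds, and a rough RTF for your own setup can be estimated by timing the call. The following is a minimal sketch under stated assumptions: `duration_seconds` and `measure_rtf` are hypothetical helpers, not part of the omnivoice API, and the `synthesize` argument is any zero-argument callable that runs generation and returns the sample count `T`.

```python
import time
from typing import Callable, Tuple

SAMPLE_RATE = 24_000  # Hz, taken from the usage comment in the example above


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a sample count T into a duration in seconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def measure_rtf(
    synthesize: Callable[[], int], sample_rate: int = SAMPLE_RATE
) -> Tuple[float, float]:
    """Time one call to `synthesize` and return (rtf, audio_seconds).

    `synthesize` is a hypothetical stand-in: it should run generation and
    return the number of output samples T.
    """
    start = time.perf_counter()
    num_samples = synthesize()
    elapsed = time.perf_counter() - start
    audio_s = duration_seconds(num_samples, sample_rate)
    if audio_s <= 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_s, audio_s


if __name__ == "__main__":
    # Dummy callable standing in for a real generation call; with the model
    # loaded as above, one might pass something like
    #   lambda: model.generate(text=..., ref_audio=..., ref_text=...)[0].shape[-1]
    rtf, audio_s = measure_rtf(lambda: 3 * SAMPLE_RATE)
    print(f"audio: {audio_s:.2f} s, RTF: {rtf:.6f}")
```

On a GPU, kernels run asynchronously, so a wall-clock measurement like this is only meaningful if the call has finished producing its output (for example, by calling `torch.cuda.synchronize()` before stopping the timer). The first call also tends to include warm-up cost, so averaging several runs usually gives a more representative number.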

Citation

@article{zhu2026omnivoice,
      title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
      author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
      journal={arXiv preprint arXiv:2604.00688},
      year={2026}
}