OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Paper: arXiv:2604.00688
OmniVoice is a massively multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. Built on a novel diffusion language model-style architecture, it delivers high-quality speech at high inference speed and supports both voice cloning and voice design.
To get started, install the omnivoice library:
pip install omnivoice
You can use OmniVoice for zero-shot voice cloning as follows:
from omnivoice import OmniVoice
import torch
import torchaudio

# Load the model
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Generate audio
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)  # audio is a list of `torch.Tensor` with shape (1, T) at 24 kHz.

torchaudio.save("out.wav", audio[0], 24000)
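If `torchaudio.save` is not usable in your environment, the generated samples can also be written with the Python standard library. The following is a minimal sketch, not part of the omnivoice library: the helper name `write_pcm16_wav` is hypothetical, and it assumes the samples are mono floats in the range [-1.0, 1.0] at the 24 kHz rate stated above (values outside that range are clipped).

```python
import struct
import wave

SAMPLE_RATE = 24_000  # sample rate stated in the generation example


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples as a 16-bit PCM WAV file.

    Assumption: samples are floats nominally in [-1.0, 1.0].
    Returns the clip duration in seconds.
    """
    # Clip to the valid range so the integer conversion cannot overflow.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    # Scale to signed 16-bit integers, packed little-endian as WAV requires.
    pcm = struct.pack(
        f"<{len(clipped)}h", *(int(round(s * 32767)) for s in clipped)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return len(clipped) / sample_rate
```

With the output of `model.generate`, a call might look like `write_pcm16_wav("out.wav", audio[0].squeeze(0).float().cpu().tolist())`, which flattens the (1, T) tensor into a plain list of floats before writing. The returned value is T / 24000, the clip length in seconds.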
@article{zhu2026omnivoice,
title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}