XTTSv2 Streaming ONNX
Streaming text-to-speech inference for XTTSv2 using ONNX Runtime — no PyTorch required.
This repository provides a complete, CPU-friendly, streaming TTS pipeline built on ONNX-exported XTTSv2 models. It replaces the original PyTorch inference path with pure Python/NumPy logic while preserving full compatibility with the XTTSv2 architecture.
Features
- Zero-shot voice cloning from a short (≤ 6 s) reference audio clip.
- Streaming audio output — audio chunks are yielded as they are generated, enabling low-latency playback.
- Pure ONNX Runtime + NumPy — no PyTorch dependency at inference time.
- INT8-quantised GPT model option for reduced memory footprint and faster CPU inference.
- Cross-fade chunk stitching for seamless audio across vocoder boundaries.
- Speed control via linear interpolation of GPT latents.
- Multilingual support — 17 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi).
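The cross-fade stitching listed above can be pictured with a small NumPy sketch. It assumes a plain linear fade over a fixed overlap and is not the repository's crossfade_chunks() implementation; the helper name linear_crossfade and the overlap length are hypothetical.

```python
import numpy as np

def linear_crossfade(prev_tail: np.ndarray, next_chunk: np.ndarray) -> np.ndarray:
    """Blend the tail of the previous chunk into the head of the next one.

    Hypothetical helper: prev_tail holds the last n samples of the previous
    chunk, and the first n samples of next_chunk cover the same time span.
    """
    n = len(prev_tail)
    if n == 0 or n > len(next_chunk):
        raise ValueError("overlap must be non-empty and no longer than the next chunk")
    fade_in = np.linspace(0.0, 1.0, n, dtype=np.float32)
    fade_out = 1.0 - fade_in
    blended = prev_tail * fade_out + next_chunk[:n] * fade_in
    return np.concatenate([blended, next_chunk[n:]])

# Two constant signals make the ramp easy to inspect.
prev_tail = np.ones(5, dtype=np.float32)
next_chunk = np.zeros(8, dtype=np.float32)
stitched = linear_crossfade(prev_tail, next_chunk)
```

With these inputs the overlap ramps linearly from the old signal to the new one, so there is no step at the chunk boundary.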
Architecture Overview
XTTSv2 is composed of four main neural network components, each exported as a separate ONNX model:
| Component | ONNX File | Description |
|---|---|---|
| Conditioning Encoder | conditioning_encoder.onnx | Six 16-head attention layers + Perceiver Resampler. Compresses a reference mel-spectrogram into 32 × 1024 conditioning latents. |
| Speaker Encoder | speaker_encoder.onnx | H/ASP speaker verification network. Extracts a 512-dim speaker embedding from 16 kHz audio. |
| GPT-2 Decoder | gpt_model.onnx / gpt_model_int8.onnx | 30-layer, 1024-dim decoder-only transformer with KV-cache. Autoregressively predicts VQ-VAE audio codes conditioned on text tokens and conditioning latents. |
| HiFi-GAN Vocoder | hifigan_vocoder.onnx | 26M-parameter neural vocoder. Converts GPT-2 hidden states + speaker embedding into a 24 kHz waveform. |
Pre-exported embedding tables (text, mel, positional) are stored as .npy files in the embeddings/ directory.
┌─────────────┐   mel @ 22 kHz   ┌──────────────────────┐
│  Reference  │ ───────────────► │ Conditioning Encoder │──► cond_latents [1,32,1024]
│  Audio Clip │                  └──────────────────────┘
│             │  audio @ 16 kHz  ┌──────────────────────┐
│             │ ───────────────► │   Speaker Encoder    │──► speaker_emb [1,512,1]
└─────────────┘                  └──────────────────────┘
┌──────────┐   BPE tokens   ┌──────────────────────────────────────────┐
│   Text   │ ─────────────► │  GPT-2 Decoder (autoregressive + KV$)    │──► latents [1,T,1024]
└──────────┘                │  prefix = [cond | text+pos | start_mel]  │
                            └──────────────────────────────────────────┘
                                                  │
                                                  ▼
                                       ┌─────────────────────┐
                                       │  HiFi-GAN Vocoder   │──► waveform @ 24 kHz
                                       │  (+ speaker_emb)    │
                                       └─────────────────────┘
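The prefix layout in the diagram, [cond | text+pos | start_mel], can be sketched in NumPy as follows. The arrays are random stand-ins with the shapes listed for the embeddings/ directory, the token ids are made up, and START_MEL_ID = 1024 is an assumption about where the start code sits in the 1026-entry audio vocabulary. The real pipeline may also wrap the text in start/stop text tokens, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 1024

# Random stand-ins with the shapes listed for the embeddings/ directory.
text_embedding = rng.standard_normal((6681, EMBED_DIM), dtype=np.float32)
text_pos_embedding = rng.standard_normal((404, EMBED_DIM), dtype=np.float32)
mel_embedding = rng.standard_normal((1026, EMBED_DIM), dtype=np.float32)
mel_pos_embedding = rng.standard_normal((608, EMBED_DIM), dtype=np.float32)

cond_latents = rng.standard_normal((1, 32, EMBED_DIM), dtype=np.float32)
text_ids = np.array([12, 345, 678, 90])  # hypothetical BPE token ids
START_MEL_ID = 1024                      # assumed index of the start code

# Token embedding plus positional embedding for each text position.
text_part = text_embedding[text_ids] + text_pos_embedding[: len(text_ids)]
# The start-of-audio code sits at mel position 0.
start_part = mel_embedding[[START_MEL_ID]] + mel_pos_embedding[:1]

# Concatenate along the sequence axis: 32 cond + 4 text + 1 start = 37 positions.
prefix = np.concatenate([cond_latents, text_part[None], start_part[None]], axis=1)
print(prefix.shape)  # (1, 37, 1024)
```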
Repository Structure
.
├── README.md # This file
├── requirements.txt # Python dependencies
├── xtts_streaming_pipeline.py # Top-level streaming TTS pipeline
├── xtts_onnx_orchestrator.py # Low-level ONNX AR loop orchestrator
├── xtts_tokenizer.py # BPE tokenizer wrapper
├── zh_num2words.py # Chinese number-to-words utility
├── xtts_onnx/ # ONNX models & assets
│ ├── metadata.json # Model architecture metadata
│ ├── vocab.json # BPE vocabulary
│ ├── mel_stats.npy # Per-channel mel normalisation stats
│ ├── conditioning_encoder.onnx # Conditioning encoder
│ ├── speaker_encoder.onnx # H/ASP speaker encoder
│ ├── gpt_model.onnx # GPT-2 decoder (FP32)
│ ├── gpt_model_int8.onnx # GPT-2 decoder (INT8 quantised)
│ ├── hifigan_vocoder.onnx # HiFi-GAN vocoder
│ └── embeddings/ # Pre-exported embedding tables
│ ├── mel_embedding.npy # [1026, 1024] audio code embeddings
│ ├── mel_pos_embedding.npy # [608, 1024] mel positional embeddings
│ ├── text_embedding.npy # [6681, 1024] BPE text embeddings
│ └── text_pos_embedding.npy # [404, 1024] text positional embeddings
├── audio_ref/ # Reference audio clips for voice cloning
└── audio_synth/ # Directory for synthesised output
Installation
Prerequisites
- Python ≥ 3.10
- A C compiler may be needed for some dependencies (e.g. tokenizers).
Install dependencies
pip install -r requirements.txt
Clone from Hugging Face Hub
# Install Git LFS (required for large model files)
git lfs install
# Clone the repository
git clone https://huggingface.co/pltobing/XTTSv2-Streaming-ONNX
cd XTTSv2-Streaming-ONNX
Quick Start
Streaming TTS (command-line)
python -u xtts_streaming_pipeline.py \
    --model_dir xtts_onnx/ \
    --vocab_path xtts_onnx/vocab.json \
    --mel_norms_path xtts_onnx/mel_stats.npy \
    --ref_audio audio_ref/male_stewie.mp3 \
    --language en \
    --output output_streaming.wav
Python API
import numpy as np
import soundfile as sf

from xtts_streaming_pipeline import StreamingTTSPipeline

# Initialise the pipeline
pipeline = StreamingTTSPipeline(
    model_dir="xtts_onnx/",
    vocab_path="xtts_onnx/vocab.json",
    mel_norms_path="xtts_onnx/mel_stats.npy",
    use_int8_gpt=True,   # Use INT8-quantised GPT for faster CPU inference
    num_threads_gpt=4,   # Adjust to your CPU core count
)

# Compute speaker conditioning (one-time per speaker)
gpt_cond_latent, speaker_embedding = pipeline.get_conditioning_latents(
    "audio_ref/male_stewie.mp3"
)

# Stream synthesis
all_chunks = []
for audio_chunk in pipeline.inference_stream(
    text="Hello, this is a streaming text-to-speech demo.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20,   # AR tokens per vocoder call
    speed=1.0,              # 1.0 = normal speed
):
    all_chunks.append(audio_chunk)
    # In a real application, you would play or stream each chunk here.

# Concatenate all chunks into a single waveform
full_audio = np.concatenate(all_chunks, axis=0)

# Save to file (the vocoder output is 24 kHz)
sf.write("output.wav", full_audio, 24000)
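Instead of collecting every chunk before saving, the chunks can be written to disk as they arrive. The sketch below uses only the standard-library wave module and assumes each chunk is a 1-D float array in [-1, 1] at 24 kHz. Both write_chunks_to_wav and fake_stream are hypothetical names; fake_stream stands in for pipeline.inference_stream() so the example runs without the models.

```python
import os
import tempfile
import wave

import numpy as np

def write_chunks_to_wav(path, chunks, sample_rate=24000):
    """Append float chunks in [-1, 1] to a 16-bit mono WAV as they arrive."""
    total = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(sample_rate)
        for chunk in chunks:
            pcm = (np.clip(chunk, -1.0, 1.0) * 32767.0).astype("<i2")
            wf.writeframes(pcm.tobytes())
            total += len(pcm)
    return total

def fake_stream(n_chunks=3, chunk_len=2400):
    """Stand-in generator for pipeline.inference_stream(); yields a 440 Hz tone."""
    offset = 0
    for _ in range(n_chunks):
        t = (np.arange(chunk_len) + offset) / 24000.0
        yield (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)
        offset += chunk_len

out_path = os.path.join(tempfile.gettempdir(), "stream_demo.wav")
n_written = write_chunks_to_wav(out_path, fake_stream())
```

In a real application the same loop body could hand each chunk to an audio device or a network socket instead of a file.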
Configuration
SamplingConfig
Control the autoregressive token sampling behaviour:
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.75 | Softmax temperature. Lower = more deterministic. |
| top_k | 50 | Keep only the top-k most probable tokens. |
| top_p | 0.85 | Nucleus sampling cumulative probability threshold. |
| repetition_penalty | 10.0 | Penalise previously generated tokens. |
| do_sample | True | True = multinomial sampling; False = greedy argmax. |
from xtts_onnx_orchestrator import SamplingConfig

sampling = SamplingConfig(
    temperature=0.65,
    top_k=30,
    top_p=0.90,
    repetition_penalty=10.0,
    do_sample=True,
)

for chunk in pipeline.inference_stream(text, "en", cond, spk, sampling=sampling):
    ...
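To make the roles of temperature, top_k and top_p concrete, here is a minimal NumPy sketch of the usual filtering order (temperature, then top-k, then nucleus). It illustrates the general technique under that assumption and is not taken from the orchestrator; filter_logits is a hypothetical name.

```python
import numpy as np

def filter_logits(logits, temperature=0.75, top_k=50, top_p=0.85):
    """Return a probability vector after temperature, top-k and top-p filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Top-k: mask everything below the k-th largest logit.
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    # Softmax, shifted by the max for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest high-probability set whose mass reaches top_p.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    keep = cumulative - probs[order] < top_p  # mass accumulated before each token
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = probs[order[keep]]
    return filtered / filtered.sum()

rng = np.random.default_rng(0)
probs = filter_logits([2.0, 1.0, 0.5, -1.0, -3.0], temperature=1.0, top_k=3, top_p=0.8)
token = rng.choice(probs.size, p=probs)
```

In this toy case top-k keeps three tokens and top-p then drops the third, so sampling only ever returns token 0 or token 1.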
GPTConfig
Model architecture parameters are automatically loaded from metadata.json. Key fields:
| Parameter | Value | Description |
|---|---|---|
| n_layer | 30 | Number of GPT-2 transformer layers |
| embed_dim | 1024 | Hidden dimension |
| num_heads | 16 | Number of attention heads |
| head_dim | 64 | Per-head dimension |
| num_audio_tokens | 1026 | Audio vocabulary (1024 VQ codes + start + stop) |
| perceiver_output_len | 32 | Conditioning latent sequence length |
| max_gen_mel_tokens | 605 | Maximum generated audio tokens |
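These values also give a rough idea of how large the KV-cache can grow. The arithmetic below assumes an FP32 cache, batch size 1, and one key tensor plus one value tensor per layer; the actual cache layout in the exported model may differ, and kv_cache_mib is a hypothetical helper.

```python
n_layer, num_heads, head_dim, embed_dim = 30, 16, 64, 1024
perceiver_output_len, max_gen_mel_tokens = 32, 605

# The heads partition the hidden dimension exactly.
assert num_heads * head_dim == embed_dim

BYTES_FP32 = 4
# Keys and values: one [num_heads, head_dim] tensor each, per layer, per position.
kv_bytes_per_position = 2 * n_layer * num_heads * head_dim * BYTES_FP32

def kv_cache_mib(num_text_tokens: int, num_generated: int = max_gen_mel_tokens) -> float:
    """Rough upper bound for cond latents + text + generated codes, in MiB."""
    positions = perceiver_output_len + num_text_tokens + num_generated
    return positions * kv_bytes_per_position / 2**20

print(kv_bytes_per_position)       # 245760
print(round(kv_cache_mib(50), 1))  # 161.0
```

Under these assumptions, a 50-token sentence generated to the maximum length needs roughly 161 MiB of cache.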
Module Reference
xtts_streaming_pipeline.py
Top-level streaming pipeline.
| Class / Function | Description |
|---|---|
| StreamingTTSPipeline | Main pipeline class. Owns sessions, tokenizer, orchestrator. |
| StreamingTTSPipeline.get_conditioning_latents() | Extract GPT conditioning + speaker embedding from reference audio. |
| StreamingTTSPipeline.inference_stream() | Generator that yields audio chunks for a text segment. |
| StreamingTTSPipeline.time_scale_gpt_latents_numpy() | Linearly time-scale GPT latents for speed control. |
| wav_to_mel_cloning_numpy() | Compute normalised log-mel spectrogram (NumPy, 22 kHz). |
| crossfade_chunks() | Cross-fade consecutive vocoder waveform chunks. |
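Speed control by linear time-scaling of the latents can be sketched as below. This is one plausible reading of the description, assuming latents shaped [1, T, C] and that speed > 1 shortens the sequence; it is not the code behind time_scale_gpt_latents_numpy(), and time_scale_latents is a hypothetical name.

```python
import numpy as np

def time_scale_latents(latents: np.ndarray, speed: float) -> np.ndarray:
    """Linearly resample [1, T, C] latents to round(T / speed) frames."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    _, t_in, _ = latents.shape
    t_out = max(1, int(round(t_in / speed)))
    # Fractional source position for each output frame.
    pos = np.linspace(0.0, t_in - 1, t_out)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, t_in - 1)
    w = (pos - lo)[None, :, None].astype(latents.dtype)
    # Weighted mix of the two neighbouring source frames.
    return latents[:, lo] * (1 - w) + latents[:, hi] * w

latents = np.arange(8, dtype=np.float32).reshape(1, 4, 2)  # T = 4 frames
faster = time_scale_latents(latents, speed=2.0)            # 2 frames
slower = time_scale_latents(latents, speed=0.5)            # 8 frames
```

Because the vocoder emits a fixed number of samples per latent frame, fewer frames give shorter (faster) audio and more frames give longer (slower) audio.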
xtts_onnx_orchestrator.py
Low-level ONNX autoregressive loop.
| Class / Function | Description |
|---|---|
| ONNXSessionManager | Loads and manages all ONNX sessions + embedding tables. |
| XTTSOrchestratorONNX | Drives the GPT-2 AR loop with KV-cache and logits processing. |
| GPTConfig | Model architecture hyper-parameters (from metadata.json). |
| SamplingConfig | Token sampling hyper-parameters. |
| apply_repetition_penalty() | NumPy repetition penalty on logits. |
| apply_temperature() | Temperature scaling on logits. |
| apply_top_k() | Top-k filtering on logits. |
| apply_top_p() | Nucleus (top-p) filtering on logits. |
| numpy_softmax() | Numerically-stable softmax in NumPy. |
| numpy_multinomial() | Inverse-CDF multinomial sampling. |
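Three of the helpers above (repetition penalty, stable softmax, inverse-CDF sampling) are short enough to sketch in NumPy. The versions below are illustrative stand-ins with hypothetical names, not the module's own functions. The penalty follows the convention used by the Hugging Face transformers repetition-penalty processor (divide positive logits, multiply negative ones), which is assumed here rather than confirmed for this repository.

```python
import numpy as np

def repetition_penalty(logits, previous_ids, penalty):
    """Divide positive logits (multiply negative ones) of already-seen tokens."""
    out = np.array(logits, dtype=np.float64)
    seen = np.unique(np.asarray(previous_ids, dtype=int))
    out[seen] = np.where(out[seen] > 0, out[seen] / penalty, out[seen] * penalty)
    return out

def stable_softmax(logits):
    """Softmax with the maximum subtracted to avoid overflow in exp()."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_inverse_cdf(probs, u):
    """Return the first index whose cumulative probability exceeds u in [0, 1)."""
    cdf = np.cumsum(probs)
    return int(min(np.searchsorted(cdf, u, side="right"), len(probs) - 1))

logits = np.array([4.0, 2.0, -1.0])
penalised = repetition_penalty(logits, previous_ids=[0, 2], penalty=2.0)
probs = stable_softmax(penalised)
token = sample_inverse_cdf(probs, u=np.random.default_rng(0).random())
```

Here tokens 0 and 2 were already generated, so their logits move from 4.0 to 2.0 and from -1.0 to -2.0, which leaves tokens 0 and 1 equally likely.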
Performance Notes
- stream_chunk_size controls the latency–quality trade-off: smaller values yield audio sooner but run the vocoder more often (on all accumulated latents).
- Thread count (num_threads_gpt) should be tuned to your CPU. Start with the number of physical cores.
- First call to get_conditioning_latents() is an expensive step (resampling + mel computation + encoder inference). Cache the results for repeated synthesis with the same speaker.
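One simple way to cache the conditioning results is an .npz file per speaker. The sketch below assumes the two outputs are NumPy arrays with the shapes shown in the architecture diagram. load_or_compute_conditioning is a hypothetical helper, and fake_compute stands in for the call to pipeline.get_conditioning_latents().

```python
import os
import tempfile

import numpy as np

def load_or_compute_conditioning(cache_path, compute_fn):
    """Return (gpt_cond_latent, speaker_embedding), reusing a .npz cache if present."""
    if os.path.exists(cache_path):
        with np.load(cache_path) as data:
            return data["gpt_cond_latent"], data["speaker_embedding"]
    gpt_cond_latent, speaker_embedding = compute_fn()
    np.savez(cache_path, gpt_cond_latent=gpt_cond_latent,
             speaker_embedding=speaker_embedding)
    return gpt_cond_latent, speaker_embedding

calls = []

def fake_compute():
    """Stand-in for pipeline.get_conditioning_latents(ref_audio)."""
    calls.append(1)
    return (np.zeros((1, 32, 1024), dtype=np.float32),
            np.zeros((1, 512, 1), dtype=np.float32))

cache_path = os.path.join(tempfile.mkdtemp(), "speaker_cache.npz")
first = load_or_compute_conditioning(cache_path, fake_compute)   # computes and saves
second = load_or_compute_conditioning(cache_path, fake_compute)  # loads from disk
```

With the real pipeline, compute_fn could be something like lambda: pipeline.get_conditioning_latents("audio_ref/male_stewie.mp3").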
License
This project is licensed under the CC-BY-NC-ND-4.0 license.
Created by: Patrick Lumbantobing, Vertox-AI
Copyright (c) 2026 Vertox-AI. All rights reserved.
This work is licensed under the Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License.
To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/
Acknowledgements
- Coqui AI for the original XTTSv2 model and training recipe.
- XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model (Casanova et al., 2024).
- ONNX Runtime for high-performance cross-platform inference.