XTTSv2 Streaming ONNX

Streaming text-to-speech inference for XTTSv2 using ONNX Runtime — no PyTorch required.

This repository provides a complete, CPU-friendly, streaming TTS pipeline built on ONNX-exported XTTSv2 models. It replaces the original PyTorch inference path with pure Python/NumPy logic while preserving full compatibility with the XTTSv2 architecture.


Features

  • Zero-shot voice cloning from a short (≤ 6 s) reference audio clip.
  • Streaming audio output — audio chunks are yielded as they are generated, enabling low-latency playback.
  • Pure ONNX Runtime + NumPy — no PyTorch dependency at inference time.
  • INT8-quantised GPT model option for reduced memory footprint and faster CPU inference.
  • Cross-fade chunk stitching for seamless audio across vocoder boundaries.
  • Speed control via linear interpolation of GPT latents.
  • Multilingual support — 17 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi).
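
The speed control listed above works by linearly resampling the GPT latent sequence along its time axis before vocoding. A minimal NumPy sketch of the idea — the function name and exact resampling details here are illustrative; the pipeline's own `StreamingTTSPipeline.time_scale_gpt_latents_numpy()` may differ:

```python
import numpy as np

def time_scale_latents(latents: np.ndarray, speed: float) -> np.ndarray:
    """Linearly resample GPT latents [1, T, 1024] along the time axis.

    speed > 1.0 yields fewer latent frames (faster speech),
    speed < 1.0 yields more frames (slower speech).
    """
    T = latents.shape[1]
    new_T = max(1, int(round(T / speed)))
    # Fractional source positions for each output frame.
    src = np.linspace(0.0, T - 1.0, new_T)
    lo = np.floor(src).astype(np.int64)
    hi = np.minimum(lo + 1, T - 1)
    frac = (src - lo)[None, :, None]   # broadcast over batch and channels
    return latents[:, lo] * (1.0 - frac) + latents[:, hi] * frac
```

Because the vocoder produces a fixed number of samples per latent frame, shortening the latent sequence shortens the audio proportionally.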

Architecture Overview

XTTSv2 is composed of four main neural network components, each exported as a separate ONNX model:

| Component | ONNX File | Description |
|-----------|-----------|-------------|
| Conditioning Encoder | `conditioning_encoder.onnx` | Six 16-head attention layers + Perceiver Resampler. Compresses a reference mel-spectrogram into 32 × 1024 conditioning latents. |
| Speaker Encoder | `speaker_encoder.onnx` | H/ASP speaker-verification network. Extracts a 512-dim speaker embedding from 16 kHz audio. |
| GPT-2 Decoder | `gpt_model.onnx` / `gpt_model_int8.onnx` | 30-layer, 1024-dim decoder-only transformer with KV-cache. Autoregressively predicts VQ-VAE audio codes conditioned on text tokens and conditioning latents. |
| HiFi-GAN Vocoder | `hifigan_vocoder.onnx` | 26M-parameter neural vocoder. Converts GPT-2 hidden states + the speaker embedding into a 24 kHz waveform. |

Pre-exported embedding tables (text, mel, positional) are stored as .npy files in the embeddings/ directory.

```text
┌─────────────┐    mel @ 22 kHz   ┌──────────────────────┐
│  Reference  │ ────────────────► │ Conditioning Encoder │──► cond_latents [1,32,1024]
│  Audio Clip │                   └──────────────────────┘
│             │   audio @ 16 kHz  ┌──────────────────────┐
│             │ ────────────────► │   Speaker Encoder    │──► speaker_emb  [1,512,1]
└─────────────┘                   └──────────────────────┘

┌──────────┐   BPE tokens   ┌────────────────────────────────────────────┐
│   Text   │ ─────────────► │ GPT-2 Decoder (autoregressive + KV-cache)  │──► latents [1,T,1024]
└──────────┘                │  prefix = [cond | text+pos | start_mel]    │
                            └────────────────────────────────────────────┘
                                       │
                                       ▼
                            ┌──────────────────────┐
                            │   HiFi-GAN Vocoder   │──► waveform @ 24 kHz
                            │   (+ speaker_emb)    │
                            └──────────────────────┘
```
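
The GPT prefix shown in the diagram, `[cond | text+pos | start_mel]`, is a plain concatenation along the sequence axis. A hedged sketch with dummy arrays — the start-of-mel token id and the exact layout are assumptions inferred from the metadata (1024 VQ codes followed by start/stop tokens), not verified against the export:

```python
import numpy as np

EMBED_DIM = 1024
START_MEL_ID = 1024  # assumption: ids 0-1023 are VQ codes, then start/stop

# Dummy stand-ins for the real encoder outputs and embedding tables.
cond_latents = np.zeros((1, 32, EMBED_DIM), dtype=np.float32)   # [1, 32, 1024]
mel_emb = np.zeros((1026, EMBED_DIM), dtype=np.float32)         # [1026, 1024]
text_embedded = np.zeros((1, 11, EMBED_DIM), dtype=np.float32)  # text + pos

start_mel = mel_emb[START_MEL_ID][None, None, :]                # [1, 1, 1024]
prefix = np.concatenate([cond_latents, text_embedded, start_mel], axis=1)
# prefix feeds the first GPT-2 forward pass; subsequent AR steps append
# one predicted audio-token embedding at a time, reusing the KV-cache.
```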

Repository Structure

```text
.
├── README.md                         # This file
├── requirements.txt                  # Python dependencies
├── xtts_streaming_pipeline.py        # Top-level streaming TTS pipeline
├── xtts_onnx_orchestrator.py         # Low-level ONNX AR loop orchestrator
├── xtts_tokenizer.py                 # BPE tokenizer wrapper
├── zh_num2words.py                   # Chinese number-to-words utility
├── xtts_onnx/                        # ONNX models & assets
│   ├── metadata.json                 # Model architecture metadata
│   ├── vocab.json                    # BPE vocabulary
│   ├── mel_stats.npy                 # Per-channel mel normalisation stats
│   ├── conditioning_encoder.onnx     # Conditioning encoder
│   ├── speaker_encoder.onnx          # H/ASP speaker encoder
│   ├── gpt_model.onnx                # GPT-2 decoder (FP32)
│   ├── gpt_model_int8.onnx           # GPT-2 decoder (INT8 quantised)
│   ├── hifigan_vocoder.onnx          # HiFi-GAN vocoder
│   └── embeddings/                   # Pre-exported embedding tables
│       ├── mel_embedding.npy         # [1026, 1024] audio code embeddings
│       ├── mel_pos_embedding.npy     # [608, 1024]  mel positional embeddings
│       ├── text_embedding.npy        # [6681, 1024] BPE text embeddings
│       └── text_pos_embedding.npy    # [404, 1024]  text positional embeddings
├── audio_ref/                        # Reference audio clips for voice cloning
└── audio_synth/                      # Directory for synthesised output
```

Installation

Prerequisites

  • Python ≥ 3.10
  • A C compiler may be needed for some dependencies (e.g. tokenizers).

Install dependencies

```bash
pip install -r requirements.txt
```

Clone from Hugging Face Hub

```bash
# Install Git LFS (required for large model files)
git lfs install

# Clone the repository
git clone https://huggingface.co/pltobing/XTTSv2-Streaming-ONNX
cd XTTSv2-Streaming-ONNX
```

Quick Start

Streaming TTS (command-line)

```bash
python -u xtts_streaming_pipeline.py \
    --model_dir xtts_onnx/ \
    --vocab_path xtts_onnx/vocab.json \
    --mel_norms_path xtts_onnx/mel_stats.npy \
    --ref_audio audio_ref/male_stewie.mp3 \
    --language en \
    --output output_streaming.wav
```

Python API

```python
import numpy as np
import soundfile as sf

from xtts_streaming_pipeline import StreamingTTSPipeline

# Initialise the pipeline
pipeline = StreamingTTSPipeline(
    model_dir="xtts_onnx/",
    vocab_path="xtts_onnx/vocab.json",
    mel_norms_path="xtts_onnx/mel_stats.npy",
    use_int8_gpt=True,    # Use INT8-quantised GPT for faster CPU inference
    num_threads_gpt=4,    # Adjust to your CPU core count
)

# Compute speaker conditioning (one-time per speaker)
gpt_cond_latent, speaker_embedding = pipeline.get_conditioning_latents(
    "audio_ref/male_stewie.mp3"
)

# Stream synthesis
all_chunks = []
for audio_chunk in pipeline.inference_stream(
    text="Hello, this is a streaming text-to-speech demo.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20,  # AR tokens per vocoder call
    speed=1.0,             # 1.0 = normal speed
):
    all_chunks.append(audio_chunk)
    # In a real application, you would play or stream each chunk here.

# Concatenate all chunks into a single waveform and save it
full_audio = np.concatenate(all_chunks, axis=0)
sf.write("output.wav", full_audio, 24000)
```

Configuration

SamplingConfig

Control the autoregressive token sampling behaviour:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `temperature` | 0.75 | Softmax temperature. Lower = more deterministic. |
| `top_k` | 50 | Keep only the top-k most probable tokens. |
| `top_p` | 0.85 | Nucleus sampling cumulative probability threshold. |
| `repetition_penalty` | 10.0 | Penalise previously generated tokens. |
| `do_sample` | True | `True` = multinomial sampling; `False` = greedy argmax. |

```python
from xtts_onnx_orchestrator import SamplingConfig

sampling = SamplingConfig(
    temperature=0.65,
    top_k=30,
    top_p=0.90,
    repetition_penalty=10.0,
    do_sample=True,
)

for chunk in pipeline.inference_stream(text, "en", cond, spk, sampling=sampling):
    ...
```

GPTConfig

Model architecture parameters are automatically loaded from metadata.json. Key fields:

| Parameter | Value | Description |
|-----------|-------|-------------|
| `n_layer` | 30 | Number of GPT-2 transformer layers |
| `embed_dim` | 1024 | Hidden dimension |
| `num_heads` | 16 | Number of attention heads |
| `head_dim` | 64 | Per-head dimension |
| `num_audio_tokens` | 1026 | Audio vocabulary (1024 VQ codes + start + stop) |
| `perceiver_output_len` | 32 | Conditioning latent sequence length |
| `max_gen_mel_tokens` | 605 | Maximum generated audio tokens |
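
These numbers make it easy to estimate the FP32 KV-cache footprint: each cached position stores one key and one value vector per layer. The sequence-length bound below (conditioning + max text positions + max mel tokens) is an assumption read off this table and the embedding shapes, not taken from the code:

```python
n_layer, num_heads, head_dim = 30, 16, 64
bytes_fp32 = 4

# Per cached position: a key and a value, across all layers and heads.
per_token = n_layer * 2 * num_heads * head_dim * bytes_fp32
print(per_token)            # 245760 bytes = 240 KiB per position

# Rough upper bound on sequence length: 32 cond + 404 text + 605 mel tokens.
max_seq = 32 + 404 + 605
total_mib = per_token * max_seq / 2**20
print(round(total_mib, 1))  # 244.0 -> roughly 244 MiB for a full FP32 cache
```

This is one reason the INT8 GPT variant is attractive on CPU: the weights shrink, and a smaller working set is friendlier to the cache hierarchy.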

Module Reference

xtts_streaming_pipeline.py

Top-level streaming pipeline.

| Class / Function | Description |
|------------------|-------------|
| `StreamingTTSPipeline` | Main pipeline class. Owns sessions, tokenizer, orchestrator. |
| `StreamingTTSPipeline.get_conditioning_latents()` | Extracts GPT conditioning latents + speaker embedding from reference audio. |
| `StreamingTTSPipeline.inference_stream()` | Generator that yields audio chunks for a text segment. |
| `StreamingTTSPipeline.time_scale_gpt_latents_numpy()` | Linearly time-scales GPT latents for speed control. |
| `wav_to_mel_cloning_numpy()` | Computes the normalised log-mel spectrogram (NumPy, 22 kHz). |
| `crossfade_chunks()` | Cross-fades consecutive vocoder waveform chunks. |
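
The core idea behind `crossfade_chunks()` is a short linear fade across the region where consecutive vocoder outputs overlap. A minimal sketch — the overlap length and fade shape here are illustrative, not necessarily what the pipeline uses:

```python
import numpy as np

def crossfade(prev: np.ndarray, nxt: np.ndarray, overlap: int) -> np.ndarray:
    """Blend the tail of `prev` into the head of `nxt` with a linear ramp."""
    fade_in = np.linspace(0.0, 1.0, overlap, dtype=np.float32)
    mixed = prev[-overlap:] * (1.0 - fade_in) + nxt[:overlap] * fade_in
    return np.concatenate([prev[:-overlap], mixed, nxt[overlap:]])
```

Without a crossfade, small discontinuities at vocoder boundaries are audible as clicks; the ramp trades a few milliseconds of overlap for seamless audio.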

xtts_onnx_orchestrator.py

Low-level ONNX autoregressive loop.

| Class / Function | Description |
|------------------|-------------|
| `ONNXSessionManager` | Loads and manages all ONNX sessions + embedding tables. |
| `XTTSOrchestratorONNX` | Drives the GPT-2 AR loop with KV-cache and logits processing. |
| `GPTConfig` | Model architecture hyper-parameters (from `metadata.json`). |
| `SamplingConfig` | Token sampling hyper-parameters. |
| `apply_repetition_penalty()` | NumPy repetition penalty on logits. |
| `apply_temperature()` | Temperature scaling on logits. |
| `apply_top_k()` | Top-k filtering on logits. |
| `apply_top_p()` | Nucleus (top-p) filtering on logits. |
| `numpy_softmax()` | Numerically stable softmax in NumPy. |
| `numpy_multinomial()` | Inverse-CDF multinomial sampling. |
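
The logits-processing helpers above compose into a standard sampling step: filter the logits, normalise, then draw a token. Hedged NumPy sketches of three of them — simplified relative to whatever the module actually implements:

```python
import numpy as np

def numpy_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def apply_top_k(logits: np.ndarray, k: int) -> np.ndarray:
    """Mask everything outside the k highest logits to -inf."""
    out = np.full_like(logits, -np.inf)
    keep = np.argpartition(logits, -k)[-k:]
    out[keep] = logits[keep]
    return out

def numpy_multinomial(probs: np.ndarray, rng: np.random.Generator) -> int:
    """Inverse-CDF sampling: locate a uniform draw in the cumulative sum."""
    return int(np.searchsorted(np.cumsum(probs), rng.random()))

# One sampling step: filter, normalise, draw.
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
probs = numpy_softmax(apply_top_k(logits, k=3))
token = numpy_multinomial(probs, rng)
```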

Performance Notes

  • stream_chunk_size controls the latency–quality trade-off: smaller values yield audio sooner but run the vocoder more often (on all accumulated latents).
  • Thread count (num_threads_gpt) should be tuned to your CPU. Start with the number of physical cores.
  • First call to get_conditioning_latents() is an expensive step (resampling + mel computation + encoder inference). Cache the results for repeated synthesis with the same speaker.
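
For reference, the intra-op thread count of an ONNX Runtime session is set through `SessionOptions`. A minimal configuration sketch — it is an assumption that the pipeline's `num_threads_gpt` maps onto something like this, and running it requires the model files from the repo layout above:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4   # start with the number of physical cores
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# INT8 GPT model for faster CPU inference.
sess = ort.InferenceSession(
    "xtts_onnx/gpt_model_int8.onnx",
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
```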

License

This project is licensed under CC BY-NC-ND 4.0.

Created by: Patrick Lumbantobing, Vertox-AI
Copyright (c) 2026 Vertox-AI. All rights reserved.

This work is licensed under the Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License.
To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/

Acknowledgements

This model is based on coqui/XTTS-v2, the original XTTSv2 release by Coqui AI.