LatinX TTS

LatinX is a multilingual, voice-preserving text-to-speech model component designed for cascaded speech-to-speech translation research. Given input text and a target language code, it generates 16 kHz speech. It can also condition generation on a short reference audio sample when the user wants to preserve speaker characteristics.

This Hugging Face repository hosts the public checkpoint files used by the latinx-inference repository. The release is intended primarily as a reproducibility artifact for the associated research paper, not as a production voice-cloning service or a polished general-purpose TTS package.

Model details

Model name: LatinX TTS
Paper: LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization
Authors: Luis Felipe Chary and Miguel Arjona Ramirez
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Supported language codes: pt, en, es, fr, it, ro
Output format: mono 16 kHz waveform
Inference code: https://github.com/luischary/latinx-inference

Intended use

LatinX is intended for non-commercial research and academic use, including:

reproducibility and evaluation of the associated paper;
multilingual text-to-speech experiments;
cascaded speech-to-speech translation research;
voice-preserving synthesis studies with consenting reference speakers;
analysis of multilingual speech-generation and alignment methods.

The model is best understood as a research artifact. Users should expect to run the accompanying inference scripts rather than load this repository directly as a generic transformers pipeline.

Out-of-scope use

Do not use LatinX for:

impersonation or deceptive voice cloning;
generating speech in another person's voice without consent;
fraud, harassment, misinformation, social engineering, or other harmful speech-generation scenarios;
commercial products or services;
any use that violates applicable laws, privacy rights, publicity rights, institutional rules, or ethical guidelines.

Architecture overview

The public inference stack includes:

text normalization, including numeric value expansion via bundled num_to_text rules in the inference repository;
grapheme-to-phoneme conversion;
phoneme tokenization;
autoregressive acoustic-token generation;
VQ acoustic autoencoder decoding;
HiFi-GAN waveform synthesis.

This model repository contains the large checkpoint artifacts. Small tokenizer/configuration assets and the executable inference code are kept in the GitHub repository.

Expected checkpoint files:

g2p.pt
ar_decoder.pt
hifi.pt
vq_autoencoder.pt

How to use

On Linux or macOS, clone the inference repository and install its requirements in a clean virtual environment:

git clone https://github.com/luischary/latinx-inference.git
cd latinx-inference
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

On Windows PowerShell:

git clone https://github.com/luischary/latinx-inference.git
cd latinx-inference
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

The inference code automatically resolves and downloads the checkpoint files from this Hugging Face repository when they are not already cached.
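
If you want to confirm that the four checkpoints are in place before an offline run, a check along these lines may help. It is a minimal sketch, not part of the inference repository: `missing_checkpoints` and `prefetch_checkpoints` are hypothetical helpers, the repo id `LuisChary/LatinX-TTS` is taken from this model page, and the assumption that the files sit flat in one directory may not match how the inference code caches them.

```python
from pathlib import Path
from typing import List

# File names listed under "Expected checkpoint files" in this card.
EXPECTED_CHECKPOINTS = ["g2p.pt", "ar_decoder.pt", "hifi.pt", "vq_autoencoder.pt"]


def missing_checkpoints(directory: str) -> List[str]:
    """Return the expected checkpoint names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


def prefetch_checkpoints(repo_id: str = "LuisChary/LatinX-TTS") -> List[str]:
    """Download each checkpoint into the Hugging Face cache and return local paths.

    Assumption: the files live at the top level of the model repository.
    Requires `pip install huggingface_hub`; imported lazily so that
    `missing_checkpoints` works without it.
    """
    from huggingface_hub import hf_hub_download

    return [hf_hub_download(repo_id=repo_id, filename=name) for name in EXPECTED_CHECKPOINTS]
```

Calling `missing_checkpoints("some/dir")` returns an empty list when all four files are present.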

Minimal example

python scripts/run_inference.py \
  --text "Olá, este é um teste curto de síntese de fala." \
  --lang pt \
  --output-dir inference_outputs

Multiple languages

python scripts/run_inference.py \
  --text "Olá, este é um teste curto de síntese de fala." --lang pt \
  --text "Hello, this is a short speech synthesis test." --lang en \
  --text "Hola, esta es una prueba corta de síntesis de voz." --lang es \
  --text "Bonjour, ceci est un court test de synthèse vocale." --lang fr \
  --text "Ciao, questo è un breve test di sintesi vocale." --lang it \
  --text "Bună, acesta este un scurt test de sinteză vocală." --lang ro \
  --output-dir inference_outputs \
  --seed 1234
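
When the text/language pairs come from a file or another program, the same command can be assembled programmatically. The sketch below uses only the flags shown above (`--text`, `--lang`, `--output-dir`, `--seed`) and assumes, as the example suggests, that repeated `--text`/`--lang` pairs are matched in order. `build_command` is a hypothetical helper, not part of the repository.

```python
import sys
from typing import List, Optional, Sequence, Tuple

# Language codes listed under "Supported language codes" in this card.
SUPPORTED_LANGS = {"pt", "en", "es", "fr", "it", "ro"}


def build_command(
    items: Sequence[Tuple[str, str]],
    output_dir: str = "inference_outputs",
    seed: Optional[int] = None,
) -> List[str]:
    """Build an argv list for scripts/run_inference.py from (lang, text) pairs."""
    if not items:
        raise ValueError("at least one (lang, text) pair is required")
    cmd = [sys.executable, "scripts/run_inference.py"]
    for lang, text in items:
        # Fail early on codes the model card does not list as supported.
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language code: {lang!r}")
        cmd += ["--text", text, "--lang", lang]
    cmd += ["--output-dir", output_dir]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


# Usage, from the root of the cloned latinx-inference repository:
#   import subprocess
#   subprocess.run(build_command([("pt", "Olá."), ("en", "Hello.")], seed=1234), check=True)
```

Passing an argv list to `subprocess.run` avoids shell quoting problems with accented characters and punctuation in the text.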

Reference-audio conditioning

For voice-preserving use, provide a reference audio sample from a speaker who has given appropriate consent:

python scripts/run_inference.py \
  --text "Olá, este é um teste curto de síntese de fala." \
  --lang pt \
  --cond-audio-path /path/to/reference_audio.wav \
  --output-dir inference_outputs

Generated files are written as output_0.wav, output_1.wav, etc. inside the output directory.
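
A quick way to sanity-check the results is to read the header of each generated file and confirm it matches the documented mono 16 kHz format. This sketch uses only the standard-library `wave` module and therefore assumes the files are PCM-encoded WAV; if the script writes floating-point WAV, `wave` will raise an error and a different audio reader would be needed. `describe_outputs` and `check_format` are hypothetical helpers.

```python
import wave
from pathlib import Path
from typing import Dict, Tuple


def describe_outputs(output_dir: str) -> Dict[str, Tuple[int, int, float]]:
    """Map each output_*.wav name to (channels, sample_rate_hz, duration_seconds)."""
    info = {}
    for path in sorted(Path(output_dir).glob("output_*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            info[path.name] = (wav.getnchannels(), rate, wav.getnframes() / rate)
    return info


def check_format(output_dir: str, channels: int = 1, sample_rate: int = 16000) -> bool:
    """Return True if every output file has the expected channel count and rate.

    Note: a directory with no output_*.wav files trivially passes.
    """
    return all(
        (ch, sr) == (channels, sample_rate)
        for ch, sr, _ in describe_outputs(output_dir).values()
    )
```

The same header check can be applied to a reference recording before passing it to `--cond-audio-path`, to see its channel count and duration.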

Windows notes

On Windows, TorchCodec/Torchaudio may require FFmpeg dynamic libraries (.dll files) when saving or loading audio. Static or essentials FFmpeg builds may not be sufficient.

Recommended setup:

Download a Release Full Shared FFmpeg build, such as ffmpeg-release-full-shared.7z, from Gyan.dev.
Extract it to a stable directory, for example C:\ffmpeg_shared.
If needed, point FFMPEG_SHARED_PATH to either the extracted root or the bin directory that contains files such as avcodec*.dll, avformat*.dll, and avutil*.dll.

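Command Prompt example:

set FFMPEG_SHARED_PATH=C:\ffmpeg_shared\bin
python scripts\run_inference.py --text "Olá, este é um teste curto." --lang pt --output-dir inference_outputs

PowerShell example (in PowerShell, set does not define an environment variable, so use $env: instead):

$env:FFMPEG_SHARED_PATH = 'C:\ffmpeg_shared\bin'
python scripts\run_inference.py --text "Olá, este é um teste curto." --lang pt --output-dir inference_outputs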

Git Bash example:

export FFMPEG_SHARED_PATH='C:/ffmpeg_shared/bin'
# or
export FFMPEG_SHARED_PATH='/c/ffmpeg_shared/bin'
python scripts/run_inference.py --text "Olá, este é um teste curto." --lang pt --output-dir inference_outputs

The inference script also checks common Windows locations automatically and skips directories that do not look like shared FFmpeg builds.
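
To see ahead of time whether a directory would qualify, a check like the following can help. It is a sketch based only on the description above (a shared build exposes `avcodec*.dll`, `avformat*.dll`, and `avutil*.dll`, either in the given directory or in its `bin` subdirectory). It is not the inference script's actual detection logic, and `looks_like_shared_build` and `find_shared_ffmpeg_bin` are hypothetical names.

```python
import os
from pathlib import Path
from typing import Optional

# DLL name patterns mentioned in this card for shared FFmpeg builds.
REQUIRED_DLL_PATTERNS = ("avcodec*.dll", "avformat*.dll", "avutil*.dll")


def looks_like_shared_build(directory: Path) -> bool:
    """True if `directory` contains at least one DLL matching each pattern."""
    return directory.is_dir() and all(
        any(directory.glob(pattern)) for pattern in REQUIRED_DLL_PATTERNS
    )


def find_shared_ffmpeg_bin(path: Optional[str] = None) -> Optional[Path]:
    """Resolve a shared-FFmpeg DLL directory from `path` or FFMPEG_SHARED_PATH.

    Accepts either the extracted root or its bin directory; returns None when
    neither looks like a shared build.
    """
    raw = path or os.environ.get("FFMPEG_SHARED_PATH")
    if not raw:
        return None
    base = Path(raw)
    for candidate in (base, base / "bin"):
        if looks_like_shared_build(candidate):
            return candidate
    return None
```

A `None` result for a directory you expected to work usually means the download was a static or essentials build rather than a shared one.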

GPU and Flash Attention notes

The default requirements are intended as a simple baseline. For GPU inference, install torch, torchaudio, and torchvision using the official PyTorch installation selector for your operating system, NVIDIA driver, and CUDA runtime; then install the remaining requirements from the repository.

Flash Attention is optional. It can improve performance in compatible CUDA/PyTorch environments, but inference does not depend on it, on Windows or on any other platform.
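
Before a GPU run, it can be useful to confirm what the active environment actually provides. The sketch below reports whether PyTorch is installed, whether it sees a CUDA device, and whether a Flash Attention package is importable. It assumes Flash Attention is installed under the usual module name `flash_attn`, which the inference repository may or may not rely on. `environment_report` is a hypothetical helper.

```python
import importlib.util
from typing import Dict


def environment_report() -> Dict[str, object]:
    """Summarize PyTorch, CUDA, and Flash Attention availability."""
    report: Dict[str, object] = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        # Assumption: Flash Attention is provided by the `flash_attn` package.
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the report works without PyTorch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report
```

Running `print(environment_report())` inside the activated virtual environment shows at a glance whether a CPU-only PyTorch build was installed by mistake.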

Training and alignment

Training and alignment details are described in the associated paper, LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization. This public release focuses on inference and does not include the original training pipeline.

Limitations

Voice preservation depends on the quality, duration, recording conditions, and suitability of the reference audio.
The model was trained and aligned primarily for conditional generation. Reference-audio conditioning is recommended for voice-preserving use and may be more stable than unconditioned generation.
Unconditioned generation is useful for quick tests, but it can be more prone to repetition or unstable outputs.
Pronunciation and prosody quality may vary across languages, speakers, domains, and text styles.
Numeric normalization is applied before G2P, but unusual formats, abbreviations, names, acronyms, code-switching, or mixed-language text may still be mispronounced.
The current inference script targets short utterances. It does not implement long-form sentence splitting, chunking, crossfading, or concatenation.
Compatibility statements are based on observed tests and are not a guarantee for every PyTorch/CUDA/driver/FFmpeg combination.
The model is released for research and non-commercial use only.
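
Because the inference script targets short utterances, longer passages have to be split before synthesis. One simple workaround, sketched below, is to break the text at sentence-ending punctuation and pass each piece as its own `--text` entry, which yields one output file per piece. The splitter is deliberately naive (it will mis-split abbreviations such as "Dr."), joining the resulting audio is left to the user, and both `split_sentences` and the `max_chars` default are hypothetical, not values taken from the paper or the repository.

```python
import re
from typing import List


def split_sentences(text: str, max_chars: int = 200) -> List[str]:
    """Split after '.', '!', '?', or '…' followed by whitespace, then wrap long pieces.

    A single word longer than `max_chars` is kept whole rather than cut.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]
    chunks: List[str] = []
    for sentence in sentences:
        # Greedily wrap overlong sentences at word boundaries.
        current = ""
        for word in sentence.split():
            if current and len(current) + 1 + len(word) > max_chars:
                chunks.append(current)
                current = word
            else:
                current = f"{current} {word}".strip()
        if current:
            chunks.append(current)
    return chunks
```

Each returned piece can then be supplied as a repeated `--text`/`--lang` pair, keeping the same language code for all pieces of one passage.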

Ethical considerations

Because LatinX can synthesize speech with reference-audio conditioning, users must obtain appropriate consent before using a person's voice. Generated speech should be clearly disclosed as synthetic when there is any possibility of confusion.

This model must not be used for impersonation, deception, fraud, harassment, misinformation, non-consensual voice cloning, or other harmful speech-generation scenarios. Users are responsible for complying with applicable laws, privacy rights, publicity rights, institutional policies, and ethical review requirements.

License

This model repository, the associated checkpoint files, and the corresponding inference repository are released under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).

Human-readable summary: https://creativecommons.org/licenses/by-nc/4.0/
Legal code: https://creativecommons.org/licenses/by-nc/4.0/legalcode
SPDX identifier: CC-BY-NC-4.0

Citation

If you use LatinX in academic work, please cite:

@misc{chary2025latinx,
  title={LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization},
  author={Luis Felipe Chary and Miguel Arjona Ramirez},
  year={2025},
  eprint={2509.05863},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.05863}
}

Paper for LuisChary/LatinX-TTS: "LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization" (arXiv 2509.05863, published Sep 6, 2025)