LatinX TTS
LatinX is a multilingual, voice-preserving text-to-speech model component designed for cascaded speech-to-speech translation research. Given input text and a target language code, it generates 16 kHz speech. It can also condition generation on a short reference audio sample when the user wants to preserve speaker characteristics.
This Hugging Face repository hosts the public checkpoint files used by the latinx-inference repository. The release is intended primarily as a reproducibility artifact for the associated research paper, not as a production voice-cloning service or a polished general-purpose TTS package.
Model details
- Model name: LatinX TTS
- Paper: LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization
- Authors: Luis Felipe Chary and Miguel Arjona Ramirez
- License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- Supported language codes: pt, en, es, fr, it, ro
- Output format: mono 16 kHz waveform
- Inference code: https://github.com/luischary/latinx-inference
Intended use
LatinX is intended for non-commercial research and academic use, including:
- reproducibility and evaluation of the associated paper;
- multilingual text-to-speech experiments;
- cascaded speech-to-speech translation research;
- voice-preserving synthesis studies with consenting reference speakers;
- analysis of multilingual speech-generation and alignment methods.
The model is best understood as a research artifact. Users should expect to run the accompanying inference scripts rather than load this repository directly as a generic transformers pipeline.
Out-of-scope use
Do not use LatinX for:
- impersonation or deceptive voice cloning;
- generating speech in another person's voice without consent;
- fraud, harassment, misinformation, social engineering, or other harmful speech-generation scenarios;
- commercial products or services;
- any use that violates applicable laws, privacy rights, publicity rights, institutional rules, or ethical guidelines.
Architecture overview
The public inference stack includes:
- text normalization, including numeric value expansion via bundled num_to_text rules in the inference repository;
- grapheme-to-phoneme conversion;
- phoneme tokenization;
- autoregressive acoustic-token generation;
- VQ acoustic autoencoder decoding;
- HiFi-GAN waveform synthesis.
This model repository contains the large checkpoint artifacts. Small tokenizer/configuration assets and the executable inference code are kept in the GitHub repository.
Expected checkpoint files:
- g2p.pt
- ar_decoder.pt
- hifi.pt
- vq_autoencoder.pt
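If the checkpoints are downloaded or copied by hand, a quick presence check can catch an incomplete set before inference starts. The following is a minimal sketch, assuming all four files sit flat in one local directory; that layout, and the helper name `missing_checkpoints`, are illustrative assumptions and not part of the inference repository.

```python
from pathlib import Path

# File names come from the expected checkpoint list above.
# The flat directory layout is an assumption made for this sketch.
EXPECTED_CHECKPOINTS = ("g2p.pt", "ar_decoder.pt", "hifi.pt", "vq_autoencoder.pt")


def missing_checkpoints(checkpoint_dir):
    """Return the expected checkpoint names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]
```

Calling `missing_checkpoints("/path/to/checkpoints")` returns an empty list when every expected file is present.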
How to use
Clone the inference repository and install its requirements in a clean virtual environment:
git clone https://github.com/luischary/latinx-inference.git
cd latinx-inference
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
On Windows PowerShell:
git clone https://github.com/luischary/latinx-inference.git
cd latinx-inference
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
The inference code automatically resolves and downloads the checkpoint files from this Hugging Face repository when they are not already cached.
Minimal example
python scripts/run_inference.py \
--text "Olá, este é um teste curto de síntese de fala." \
--lang pt \
--output-dir inference_outputs
Multiple languages
python scripts/run_inference.py \
--text "Olá, este é um teste curto de síntese de fala." --lang pt \
--text "Hello, this is a short speech synthesis test." --lang en \
--text "Hola, esta es una prueba corta de síntesis de voz." --lang es \
--text "Bonjour, ceci est un court test de synthèse vocale." --lang fr \
--text "Ciao, questo è un breve test di sintesi vocale." --lang it \
--text "Bună, acesta este un scurt test de sinteză vocală." --lang ro \
--output-dir inference_outputs \
--seed 1234
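When the list of utterances is generated programmatically, the same command can be assembled from Python. This is a sketch that uses only the flags shown above (--text, --lang, --output-dir, --seed); the helper names `build_command` and `run_inference` are hypothetical and are not provided by the inference repository.

```python
import subprocess
import sys

# Language codes listed in the model details of this card.
SUPPORTED_LANGS = {"pt", "en", "es", "fr", "it", "ro"}


def build_command(utterances, output_dir, seed=None):
    """Build an argument list for scripts/run_inference.py.

    utterances is a list of (text, lang) pairs; each pair becomes one
    --text/--lang pair, in the order given.
    """
    if not utterances:
        raise ValueError("at least one (text, lang) pair is required")
    cmd = [sys.executable, "scripts/run_inference.py"]
    for text, lang in utterances:
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language code: {lang!r}")
        cmd += ["--text", text, "--lang", lang]
    cmd += ["--output-dir", str(output_dir)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


def run_inference(utterances, output_dir, seed=None):
    """Run the script from the repository root; raises if it exits non-zero."""
    subprocess.run(build_command(utterances, output_dir, seed), check=True)
```

Passing the arguments as a list avoids shell quoting problems with accented characters and punctuation in the input text.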
Reference-audio conditioning
For voice-preserving use, provide a reference audio sample from a speaker who has given appropriate consent:
python scripts/run_inference.py \
--text "Olá, este é um teste curto de síntese de fala." \
--lang pt \
--cond-audio-path /path/to/reference_audio.wav \
--output-dir inference_outputs
Generated files are written as output_0.wav, output_1.wav, etc. inside the output directory.
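To confirm that the generated files match the stated output format (mono, 16 kHz), the WAV headers can be inspected with the Python standard library. This is a sketch under one assumption: that the files are uncompressed PCM WAV, which is what the `wave` module reads. If the script writes a format that `wave` does not support, `wave.open` raises `wave.Error` and a third-party audio library would be needed instead. The helper names are illustrative.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.getnframes()
    return channels, rate, frames / rate


def check_outputs(output_dir, expected_rate=16000):
    """Map each output_*.wav file name to True if it is mono at expected_rate."""
    results = {}
    for path in sorted(Path(output_dir).glob("output_*.wav")):
        channels, rate, _ = describe_wav(path)
        results[path.name] = channels == 1 and rate == expected_rate
    return results
```

For example, `check_outputs("inference_outputs")` returns one entry per generated file.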
Windows notes
On Windows, TorchCodec/Torchaudio may require FFmpeg dynamic libraries (.dll files) when saving or loading audio. Static or essentials FFmpeg builds may not be sufficient.
Recommended setup:
- Download a Release Full Shared FFmpeg build, such as ffmpeg-release-full-shared.7z, from Gyan.dev.
- Extract it to a stable directory, for example C:\ffmpeg_shared.
- If needed, point FFMPEG_SHARED_PATH to either the extracted root or the bin directory that contains files such as avcodec*.dll, avformat*.dll, and avutil*.dll.
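Command Prompt example:
set FFMPEG_SHARED_PATH=C:\ffmpeg_shared\bin
python scripts\run_inference.py --text "Olá, este é um teste curto." --lang pt --output-dir inference_outputs
PowerShell example:
$env:FFMPEG_SHARED_PATH = "C:\ffmpeg_shared\bin"
python scripts\run_inference.py --text "Olá, este é um teste curto." --lang pt --output-dir inference_outputs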
Git Bash example:
export FFMPEG_SHARED_PATH='C:/ffmpeg_shared/bin'
# or
export FFMPEG_SHARED_PATH='/c/ffmpeg_shared/bin'
python scripts/run_inference.py --text "Olá, este é um teste curto." --lang pt --output-dir inference_outputs
The inference script also checks common Windows locations automatically and skips directories that do not look like shared FFmpeg builds.
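Before setting FFMPEG_SHARED_PATH, it can help to verify that a directory really holds a shared build. The sketch below approximates such a check using the DLL name patterns mentioned above, and accepts either the extracted root or its bin directory. The inference script's actual detection logic is not documented here, so this is an illustration, not a copy of it; `find_shared_ffmpeg_bin` is a hypothetical helper name.

```python
from pathlib import Path

# DLL name patterns taken from the setup notes above.
# The real checks performed by the inference script may differ.
REQUIRED_DLL_PATTERNS = ("avcodec*.dll", "avformat*.dll", "avutil*.dll")


def find_shared_ffmpeg_bin(candidate):
    """Return the directory holding the FFmpeg DLLs, or None if none qualifies.

    candidate may be the extracted root or its bin directory.
    """
    root = Path(candidate)
    for directory in (root, root / "bin"):
        if directory.is_dir() and all(
            any(directory.glob(pattern)) for pattern in REQUIRED_DLL_PATTERNS
        ):
            return directory
    return None
```

A static or essentials build lacks these DLLs, so the function returns None for it.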
GPU and Flash Attention notes
The default requirements are intended as a simple baseline. For GPU inference, install torch, torchaudio, and torchvision using the official PyTorch installation selector for your operating system, NVIDIA driver, and CUDA runtime; then install the remaining requirements from the repository.
Flash Attention is optional. It can improve performance in compatible CUDA/PyTorch environments, but it should not be treated as a universal Windows requirement.
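A quick way to see what the current environment offers is to probe for the relevant packages. This sketch only reports whether torch is importable, whether CUDA is visible to it, and whether a package named flash_attn is installed. How the inference script selects a device or enables Flash Attention is not described here, so treat the result as diagnostic information only; `describe_environment` is an illustrative name.

```python
import importlib.util


def describe_environment():
    """Report which optional acceleration pieces are available."""
    info = {"torch": False, "cuda": False, "flash_attn": False}
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the check works without torch installed

        info["torch"] = True
        info["cuda"] = bool(torch.cuda.is_available())
    # find_spec locates the package without importing it.
    info["flash_attn"] = importlib.util.find_spec("flash_attn") is not None
    return info
```

If "cuda" is False after installing a CUDA build of PyTorch, the usual suspects are a driver mismatch or a CPU-only wheel.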
Training and alignment
LatinX is described in the associated paper:
LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization
For training and alignment details, refer to the paper. This public release focuses on inference and does not include the original training pipeline.
Limitations
- Voice preservation depends on the quality, duration, recording conditions, and suitability of the reference audio.
- The model was trained and aligned primarily for conditional generation. Reference-audio conditioning is recommended for voice-preserving use and may be more stable than unconditioned generation.
- Unconditioned generation is useful for quick tests, but it can be more prone to repetition or unstable outputs.
- Pronunciation and prosody quality may vary across languages, speakers, domains, and text styles.
- Numeric normalization is applied before G2P, but unusual formats, abbreviations, names, acronyms, code-switching, or mixed-language text may still be mispronounced.
- The current inference script targets short utterances. It does not implement long-form sentence splitting, chunking, crossfading, or concatenation.
- Compatibility statements are based on observed tests and are not a guarantee for every PyTorch/CUDA/driver/FFmpeg combination.
- The model is released for research and non-commercial use only.
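Because the inference script targets short utterances, longer text has to be split by the caller. The following is a naive sketch of one way to do that: split at sentence-ending punctuation, then pack consecutive sentences into chunks of limited length. The 200-character default is an arbitrary assumption, since this card does not define "short", and the regular expression does not handle abbreviations such as "Dr." or language-specific punctuation.

```python
import re


def split_into_utterances(text, max_chars=200):
    """Split text into sentence-aligned chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as its own --text argument with the same --lang value. Joining the resulting audio files is a separate step that this sketch does not cover.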
Ethical considerations
Because LatinX can synthesize speech with reference-audio conditioning, users must obtain appropriate consent before using a person's voice. Generated speech should be clearly disclosed as synthetic when there is any possibility of confusion.
This model must not be used for impersonation, deception, fraud, harassment, misinformation, non-consensual voice cloning, or other harmful speech-generation scenarios. Users are responsible for complying with applicable laws, privacy rights, publicity rights, institutional policies, and ethical review requirements.
License
This model repository, the associated checkpoint files, and the corresponding inference repository are released under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
- Human-readable summary: https://creativecommons.org/licenses/by-nc/4.0/
- Legal code: https://creativecommons.org/licenses/by-nc/4.0/legalcode
- SPDX identifier: CC-BY-NC-4.0
Citation
If you use LatinX in academic work, please cite:
@misc{chary2025latinx,
title={LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization},
author={Luis Felipe Chary and Miguel Arjona Ramirez},
year={2025},
eprint={2509.05863},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.05863}
}