TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment

Text-Acoustic Dual-Alignment Large Language Model

TADA is a unified speech-language model that synchronizes speech and text into a single, cohesive stream via 1:1 alignment. By leveraging a novel tokenizer and architectural design, TADA achieves high-fidelity synthesis and generation with a fraction of the computational overhead required by traditional models.

⭐️ arxiv: https://arxiv.org/abs/2602.23068
⭐️ demo: https://huggingface.co/spaces/HumeAI/tada
⭐️ github: https://github.com/HumeAI/tada
⭐️ blog post: https://www.hume.ai/blog/opensource-tada

Key Features

1:1 Token Alignment: Unlike standard models, TADA’s tokenizer encodes audio into a sequence of vectors that perfectly matches the number of text tokens.
Dynamic Duration Synthesis: As a TTS model, it generates the full speech segment for a text token in a single autoregressive step, regardless of length. This eliminates the need for fixed-frame-rate processing.
Dual-Stream Generation: In speech-language modeling mode, it generates a text token and the speech for the preceding token simultaneously, maintaining the same context length and minimal overhead compared to text-only generation.
Efficiency & Reliability: TADA delivers superior expressiveness and natural flow while significantly reducing the computational cost associated with fixed audio frame rates.
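
The one-step offset described under Dual-Stream Generation can be pictured with a toy layout. This is a minimal sketch, not TADA's implementation: `dual_stream_layout` and the `speech(...)` placeholder strings are hypothetical names used only for illustration, and the extra trailing step that emits the last token's speech is an assumption.

```python
from typing import List, Optional, Tuple


def dual_stream_layout(tokens: List[str]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Pair each step's text token with the speech slot of the preceding token.

    Hypothetical illustration only: real speech slots would hold acoustic
    vectors, not strings.
    """
    steps = []
    # Assumption: one extra step at the end carries the final token's speech.
    for t in range(len(tokens) + 1):
        text = tokens[t] if t < len(tokens) else None
        speech = f"speech({tokens[t - 1]})" if t > 0 else None
        steps.append((text, speech))
    return steps


layout = dual_stream_layout(["Please", "call", "Stella"])
for step, (text, speech) in enumerate(layout):
    print(step, text, speech)
```

Under this layout the number of steps grows with the number of text tokens rather than with the audio duration, which is the property the feature list describes.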

How It Works

The Tokenization Schema

TADA unifies modalities by ensuring that for every word or subword token, there is exactly one corresponding speech vector. This synchronized stream allows the model to "understand" the precise timing of speech relative to text.
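
To make the "one speech vector per text token" idea concrete, the sketch below collapses a variable number of audio frames into a single vector for each token. It is an illustration under stated assumptions, not the TADA tokenizer: mean pooling stands in for whatever the real encoder does, and `pool_frames_per_token`, the frame values, and the token spans are all hypothetical.

```python
from typing import List, Sequence, Tuple


def pool_frames_per_token(
    frames: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Average the frame vectors inside each token's [start, end) span.

    Mean pooling is a stand-in chosen for simplicity; it only demonstrates
    that the output length equals the number of tokens.
    """
    pooled = []
    for start, end in spans:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid span ({start}, {end})")
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled


# Five hypothetical 2-dimensional frames aligned to two tokens.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
spans = [(0, 2), (2, 5)]  # token 0 covers frames 0-1, token 1 covers frames 2-4
vectors = pool_frames_per_token(frames, spans)
print(len(vectors), vectors)
```

However long each token lasts in the audio, the result has exactly one vector per token, so the speech and text sequences stay the same length.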

Dynamic Autoregression

Most TTS models require a fixed number of steps to produce one second of audio (e.g., 50 frames per second). TADA breaks this constraint:

Each autoregressive step covers one text token.
The model dynamically determines the duration and prosody for that specific token.
This results in a more natural flow and eliminates transcript hallucination.
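
The difference in step counts can be worked through with simple arithmetic. The sketch below compares a fixed 50 frames-per-second decoder with one step per text token; the 6-second duration and 20-token transcript are hypothetical numbers, and the function names are illustrative rather than part of the TADA API.

```python
import math


def fixed_rate_steps(duration_s: float, frames_per_second: int = 50) -> int:
    """Autoregressive steps when every audio frame needs its own step."""
    return math.ceil(duration_s * frames_per_second)


def token_aligned_steps(num_text_tokens: int) -> int:
    """Autoregressive steps when each step covers one text token."""
    return num_text_tokens


# Hypothetical utterance: 6 seconds of audio, 20 text tokens.
fixed = fixed_rate_steps(6.0)
aligned = token_aligned_steps(20)
print(f"fixed-rate: {fixed} steps, token-aligned: {aligned} steps")
print(f"reduction: {fixed / aligned:.1f}x")
```

For this made-up utterance the fixed-rate decoder takes 300 steps and the token-aligned one takes 20; the actual ratio depends on speaking rate and tokenization.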

Installation

From the GitHub repo

pip install git+https://github.com/HumeAI/tada.git

From source

pip install -e .

Models

We provide the following model checkpoints:

Model	Base Model	HuggingFace Hub
TADA-1B	Llama 3.2 1B	`HumeAI/tada-1b`
TADA-3B-ml	Llama 3.2 3B	`HumeAI/tada-3b-ml`

All models use the same encoder (HumeAI/tada-codec) and can be loaded using the same API.

Evaluation

Run Inference

Text-to-speech

import torch
import torchaudio

from tada.modules.encoder import Encoder
from tada.modules.tada import TadaForCausalLM

# Load the shared encoder and a TADA checkpoint onto the GPU.
device = "cuda"
encoder = Encoder.from_pretrained("HumeAI/tada-codec", subfolder="encoder").to(device)
model = TadaForCausalLM.from_pretrained("HumeAI/tada-1b").to(device)

# Load the reference audio clip and move it to the same device.
audio, sample_rate = torchaudio.load("samples/ljspeech.wav")
audio = audio.to(device)

# Encode the reference audio together with its transcript to build the prompt.
prompt_text = "The examination and testimony of the experts, enabled the commission to conclude that five shots may have been fired."
prompt = encoder(
    audio, text=[prompt_text], sample_rate=sample_rate
)

# Synthesize speech for the new text, conditioned on the prompt.
output = model.generate(
    prompt=prompt,
    text="Please call Stella. Ask her to bring these things with her from the store.",
)

Speech continuation

Pass num_extra_steps to generate a text and speech continuation of the prompt:

output = model.generate(
    prompt=prompt,
    num_extra_steps=50
)

📚 Citation

If you use this project in your research, please cite our paper:

@article{dang2026tada,
  title={TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment},
  author={Dang, Trung and Rao, Sharath and Gupta, Ananya and Gagne, Christopher and Tzirakis, Panagiotis and Baird, Alice and Cłapa, Jakub Piotr and Chin, Peter and Cowen, Alan},
  journal={arXiv preprint arXiv:2602.23068},
  year={2026}
}

Contact

Hume AI is an empathic AI research company. We research the datasets, tools, and models needed to give empathy to AI models to serve human wellbeing. If you're interested in any of our product or research collaborations, please reach out to us at hello@hume.ai
