X-Codec2 (Transformers-native)

The X-Codec2 model was proposed in Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis.

X-Codec2 is a neural audio codec designed to improve speech synthesis and general audio generation for large language model (LLM) pipelines. It extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized, enabling efficient and high-fidelity audio representation.

About its architecture:

Unified Semantic-Acoustic Tokenization: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre).
Single-Stage Feature Scalar Quantization (FSQ): Unlike the multi-layer residual VQ in most approaches (e.g., DAC, EnCodec, X-Codec, Mimi), X-Codec2 uses a single-layer of Feature Scalar Quantization (FSQ) for stability and compatibility with causal, autoregressive LLMs.
Transformer-Friendly Design: The 1D token structure of X-Codec2 naturally aligns with the autoregressive modeling in LLMs like LLaMA, improving training efficiency and downstream compatibility.

This model was contributed by Eric Bezzam and Steven Zheng. The original modeling code can be found here, while their training code is here.

Setup

X-Codec2 is supported natively in 🤗 Transformers. Until it is part of an official Transformers release, install from source:

pip install git+https://github.com/huggingface/transformers

Usage example

Here is a quick example of how to encode and decode an audio using this model:

from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, AutoModel

model_id = "HKUSTAudio/xcodec2-hf"
model = AutoModel.from_pretrained(model_id, device_map="auto")
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio = dataset[0]["audio"]["array"]
inputs = feature_extractor(audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(
    model.device, model.dtype
)
print("Input waveform shape:", inputs["input_values"].shape)
# Input waveform shape: torch.Size([1, 1, 93760])

# encoder and decoder
audio_codes = model.encode(**inputs).audio_codes
print("Audio codes shape:", audio_codes.shape)
# Audio codes shape: torch.Size([1, 1, 293])
audio_values = model.decode(audio_codes).audio_values
print("Audio values shape:", audio_values.shape)
# Audio values shape: torch.Size([1, 1, 93760])

# Equivalently, you can do encoding and decoding in one step
model_output = model(**inputs)
audio_codes = model_output.audio_codes
audio_values = model_output.audio_values

Batch processing

Unlike the original release, this implementation also supports batched inputs.

from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, AutoModel

batch_size = 2
model_id = "HKUSTAudio/xcodec2-hf"
model = AutoModel.from_pretrained(model_id, device_map="auto")
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audios = [dataset[i]["audio"]["array"] for i in range(batch_size)]
inputs = feature_extractor(audio=audios, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(
    model.device, model.dtype
)
print("Input waveform shape:", inputs["input_values"].shape)
# Input waveform shape: torch.Size([2, 1, 93760])

# encoder and decoder
encoder_output = model.encode(**inputs)
audio_codes = encoder_output.audio_codes
print("Audio codes shape:", audio_codes.shape)
# Audio codes shape: torch.Size([2, 1, 293])
audio_values = model.decode(audio_codes).audio_values
print("Audio values shape:", audio_values.shape)
# Audio values shape: torch.Size([2, 1, 93760])

# Equivalently, you can do encoding and decoding in one step
model_output = model(**inputs)
audio_codes = model_output.audio_codes
audio_values = model_output.audio_values

Speed-up with `torch.compile`

You can speed up inference with torch.compile. The first few calls will be slower due to compilation overhead, but subsequent calls will be faster.

On an A100, we observed a speed-up of ~1.35 for a batch size of 4 (script).

import torch
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, AutoModel

batch_size = 4
model_id = "HKUSTAudio/xcodec2-hf"
model = AutoModel.from_pretrained(model_id, device_map="auto")
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audios = [dataset[i]["audio"]["array"] for i in range(batch_size)]
inputs = feature_extractor(
    audio=audios, sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt"
).to(model.device, model.dtype)

compiled_model = torch.compile(model, fullgraph=True)

# Warmup (includes compilation on first call)
for _ in range(10):
    with torch.inference_mode():
        _ = compiled_model(**inputs)

with torch.inference_mode():
    output = compiled_model(**inputs)
print("Audio values shape:", output.audio_values.shape)

Downloads last month: -

Safetensors

Model size

0.6B params

Tensor type

F32

Paper for bezzam/xcodec2-hf

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Paper • 2502.04128 • Published Feb 6, 2025 • 28