---
base_model:
- Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
pipeline_tag: text-generation
tags:
- audio
- speech
- audio-codec
- neural-audio-codec
- spoken-language-modeling
- codec-superb
- qwen3
datasets:
- librispeech_asr
metrics:
- perplexity
- pesq
- stoi
---
# LLM-Codec
LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio
tokens that are both reconstructable and easier for autoregressive language
models to predict.
- Model: https://huggingface.co/voidful/llm-codec
- Code: https://github.com/voidful/llm-codec
- Usage reference: https://github.com/voidful/Codec-SUPERB
## Model Description
Most neural audio codecs are trained for waveform reconstruction. Spoken
language models, however, consume codec tokens with a next-token prediction
objective. This mismatch can make acoustically valid variation appear as token
uncertainty to the language model.
LLM-Codec adapts a codec with language-model-facing objectives while keeping the
deployed codec interface unchanged. The model is trained with:
- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens
from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired
text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps discrete forward
tokens while enabling gradients to flow to the codec encoder (sketched below).
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex
STFT, VQ, GAN, and feature matching losses.
The deployed codec does not require the auxiliary FTP heads.
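As a minimal sketch of the straight-through pattern behind the Gumbel bridge: the code below is not the repository's implementation, and the shapes, embedding size, and the way one-hot codes index the embedding table are illustrative assumptions. Only the hard Gumbel-Softmax with soft gradients is the technique named above.
```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 8 frames of encoder logits over the 20,480-token audio vocabulary.
logits = torch.randn(8, 20480, requires_grad=True)  # stand-in for codec encoder output

# hard=True returns one-hot (discrete) codes in the forward pass while the
# backward pass uses the soft Gumbel-Softmax gradients (straight-through estimator).
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)

# The one-hot codes can index an audio-token embedding table, so losses computed
# downstream backpropagate into the codec encoder. The hidden size is a guess.
embedding_table = torch.nn.Embedding(20480, 2560)
audio_hidden = one_hot @ embedding_table.weight  # (8, 2560), differentiable w.r.t. logits

audio_hidden.sum().backward()
print(logits.grad is not None)  # True: gradients reach the encoder logits
```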
## Intended Use
This model is intended for research and development in:
- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB style codec evaluation
- speech token analysis and ablation studies
It is not a full text-to-speech system by itself. For speech generation, use the
codec as the tokenizer/decoder inside a separate speech language modeling
pipeline.
## Out-of-Scope Use
Do not use this model for:
- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content
## Installation
The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.
```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```
If your environment supports editable installs, you can also install the package directly:
```bash
pip install -e .
```
## Quick Start
Load LLM-Codec through the Codec-SUPERB codec registry:
```python
from SoundCodec import codec
print(codec.list_codec())
model = codec.load_codec("llmcodec")
```
Encode and reconstruct one audio file:
```python
from SoundCodec import codec
import torchaudio
import soundfile as sf
model = codec.load_codec("llmcodec")
waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}
units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)
result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)
sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```
## Batch Usage
Codec-SUPERB also provides batch APIs:
```python
from SoundCodec import codec
import torchaudio
model = codec.load_codec("llmcodec")
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []
for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })
batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)
results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```
For better throughput, group audio samples with similar lengths before batching.
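A minimal way to do this, reusing `data_list` and `model` from the batch example above, is to sort the items by waveform length and process fixed-size chunks; the chunk size of 8 is an arbitrary illustrative choice, not a tuned value.
```python
# Sort items by waveform length so each batch holds similarly sized audio,
# then run the batch API over fixed-size chunks.
data_list.sort(key=lambda item: len(item["audio"]["array"]))

batch_size = 8  # illustrative choice
for start in range(0, len(data_list), batch_size):
    chunk = data_list[start:start + batch_size]
    results = model.batch_synth(chunk, local_save=False)
    for item in results:
        print(item["unit"].shape, item["audio"]["array"].shape)
```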
## Codec-SUPERB Evaluation
To evaluate LLM-Codec with Codec-SUPERB-tiny:
```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
--dataset voidful/codec-superb-tiny
PYTHONPATH=. python3 scripts/benchmarking.py \
--dataset datasets/voidful/codec-superb-tiny_synth \
--models llmcodec
```
## Model Files
The model repository provides:
- codec weights as `llm-codec.pt`
- a tokenizer extended with `<CODEC_*>` audio tokens
- Qwen-compatible model artifacts containing trained audio-token embeddings
The codec uses 20,480 audio tokens with the canonical token format:
```text
<CODEC_0>, <CODEC_1>, ..., <CODEC_20479>
```
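To make the format concrete, the short sketch below converts a sequence of integer unit ids into these token strings and back. The example ids are made up, and whether extracted units need any flattening before this mapping is an assumption, not documented behaviour.
```python
# Hypothetical sketch: map unit ids to <CODEC_*> token strings and back.
unit_ids = [17, 503, 20479]

codec_tokens = [f"<CODEC_{i}>" for i in unit_ids]
print(codec_tokens)  # ['<CODEC_17>', '<CODEC_503>', '<CODEC_20479>']

# Recover the integer ids from the token strings.
recovered = [int(tok[len("<CODEC_"):-1]) for tok in codec_tokens]
assert recovered == unit_ids
```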
## Training Data
The codec was trained on LibriSpeech `train-clean-100` with paired transcripts.
The validation split used during training is LibriSpeech `validation`.
Because training is speech-centric and transcript-supervised, performance may be
weaker on non-English speech, conversational speech, music, environmental audio,
or audio with strong noise and overlap.
## Training Procedure
Base components:
- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds
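These settings determine how many audio tokens each training segment contains:
```python
token_rate_hz = 50    # audio tokens per second
segment_seconds = 4   # training segment length
tokens_per_segment = token_rate_hz * segment_seconds
print(tokens_per_segment)  # 200 audio tokens per 4-second segment
```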
Losses:
- reconstruction mel loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature matching losses
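The Semantic Alignment cosine term listed above can be sketched in a few lines. The sketch assumes utterance-level pooled hidden states from the frozen LLM for the audio-token pass and the paired-text pass; the pooling, batch size, and hidden size are assumptions, and the contrastive/memory-bank term is omitted.
```python
import torch
import torch.nn.functional as F

# Hypothetical pooled hidden states from the frozen LLM: one vector per
# utterance for the audio-token pass and for the paired-text pass.
audio_hidden = torch.randn(16, 2560)  # (batch, hidden); shapes are assumptions
text_hidden = torch.randn(16, 2560)

# Cosine alignment loss: pull each audio representation toward its paired text.
sa_cosine_loss = (1.0 - F.cosine_similarity(audio_hidden, text_hidden, dim=-1)).mean()
print(sa_cosine_loss.item())
```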
## Evaluation Results
### Token Learnability
SALMon speech coherence accuracy after token-level LM training:
| Tokenizer | Overall accuracy |
| --- | ---: |
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |
Token-level perplexity on LibriSpeech after 3 epochs of LM training:
| Tokenizer | Eval loss | Perplexity |
| --- | ---: | ---: |
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |
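The perplexity column is the exponential of the token-level eval loss, which can be checked directly from the rounded losses:
```python
import math

# Perplexity = exp(eval loss); small differences versus the table come from
# the losses being rounded to two decimal places.
print(round(math.exp(8.44)))   # close to 4,617 (LLM-Codec)
print(round(math.exp(11.91)))  # close to 148,122 (WavTok-L)
```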
### Reconstruction Quality
Codec-SUPERB-tiny speech reconstruction:
| Model | Mel distance (↓) | STFT distance (↓) | PESQ (↑) | STOI (↑) |
| --- | ---: | ---: | ---: | ---: |
| AUV base | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |
## Limitations
- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot;
they accompany the codec/tokenizer workflow.
## Citation
```bibtex
@article{chung2026llm,
title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
journal={arXiv preprint arXiv:2604.17852},
note = {Model and code available at https://github.com/voidful/llm-codec},
year={2026}
}
```
If you use the Codec-SUPERB interface or benchmark, please also cite
Codec-SUPERB:
```bibtex
@inproceedings{wu-etal-2024-codec,
title = {Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
author = {Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
year = {2024},
url = {https://aclanthology.org/2024.findings-acl.616},
doi = {10.18653/v1/2024.findings-acl.616},
pages = {10330--10348}
}
```