---
base_model:
- Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
pipeline_tag: text-generation
tags:
- audio
- speech
- audio-codec
- neural-audio-codec
- spoken-language-modeling
- codec-superb
- qwen3
datasets:
- librispeech_asr
metrics:
- perplexity
- pesq
- stoi
---
# LLM-Codec
LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio
tokens that are both reconstructable and easier for autoregressive language
models to predict.
- Model: https://huggingface.co/voidful/llm-codec
- Code: https://github.com/voidful/llm-codec
- Usage reference: https://github.com/voidful/Codec-SUPERB
## Model Description
Most neural audio codecs are trained for waveform reconstruction. Spoken
language models, however, consume codec tokens with a next-token prediction
objective. This mismatch can make acoustically valid variation appear as token
uncertainty to the language model.
LLM-Codec adapts a codec with language-model-facing objectives while keeping the
deployed codec interface unchanged. The model is trained with:
- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens
from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired
text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps discrete forward
tokens while enabling gradients to flow to the codec encoder (sketched below).
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex
STFT, VQ, GAN, and feature matching losses.
The deployed codec does not require the auxiliary FTP heads.
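As a minimal sketch of the straight-through pattern behind the Gumbel bridge: the code below is not the repository's implementation, and the shapes, embedding size, and the way one-hot codes index the embedding table are illustrative assumptions. Only the hard Gumbel-Softmax with soft gradients is the technique named above.
```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 8 frames of encoder logits over the 20,480-token audio vocabulary.
logits = torch.randn(8, 20480, requires_grad=True)  # stand-in for codec encoder output

# hard=True returns one-hot (discrete) codes in the forward pass while the
# backward pass uses the soft Gumbel-Softmax gradients (straight-through estimator).
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)

# The one-hot codes can index an audio-token embedding table, so losses computed
# downstream backpropagate into the codec encoder. The hidden size is a guess.
embedding_table = torch.nn.Embedding(20480, 2560)
audio_hidden = one_hot @ embedding_table.weight  # (8, 2560), differentiable w.r.t. logits

audio_hidden.sum().backward()
print(logits.grad is not None)  # True: gradients reach the encoder logits
```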
## Intended Use
This model is intended for research and development in:
- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB style codec evaluation
- speech token analysis and ablation studies
It is not a full text-to-speech system by itself. For speech generation, use the
codec as the tokenizer/decoder inside a separate speech language modeling
pipeline.
## Out-of-Scope Use
Do not use this model for:
- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content
## Installation
The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.
```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```
If your environment supports editable installs, you can also install the package directly:
```bash
pip install -e .
```
## Quick Start
Load LLM-Codec through the Codec-SUPERB codec registry:
```python
from SoundCodec import codec
print(codec.list_codec())
model = codec.load_codec("llmcodec")
```
Encode and reconstruct one audio file:
```python
from SoundCodec import codec
import torchaudio
import soundfile as sf
model = codec.load_codec("llmcodec")
waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}
units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)
result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)
sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```
## Batch Usage
Codec-SUPERB also provides batch APIs:
```python
from SoundCodec import codec
import torchaudio
model = codec.load_codec("llmcodec")
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []
for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })
batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)
results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```
For better throughput, group audio samples with similar lengths before batching.
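A minimal way to do this, reusing `data_list` and `model` from the batch example above, is to sort the items by waveform length and process fixed-size chunks; the chunk size of 8 is an arbitrary illustrative choice, not a tuned value.
```python
# Sort items by waveform length so each batch holds similarly sized audio,
# then run the batch API over fixed-size chunks.
data_list.sort(key=lambda item: len(item["audio"]["array"]))

batch_size = 8  # illustrative choice
for start in range(0, len(data_list), batch_size):
    chunk = data_list[start:start + batch_size]
    results = model.batch_synth(chunk, local_save=False)
    for item in results:
        print(item["unit"].shape, item["audio"]["array"].shape)
```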
## Codec-SUPERB Evaluation
To evaluate LLM-Codec with Codec-SUPERB-tiny:
```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
--dataset voidful/codec-superb-tiny
PYTHONPATH=. python3 scripts/benchmarking.py \
--dataset datasets/voidful/codec-superb-tiny_synth \
--models llmcodec
```
## Model Files
The model repository provides:
- codec weights as `llm-codec.pt`
- a tokenizer extended with `<CODEC_*>` audio tokens
- Qwen-compatible model artifacts containing trained audio-token embeddings
The codec uses 20,480 audio tokens with the canonical token format:
```text
<CODEC_0>, <CODEC_1>, ..., <CODEC_20479>
```
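To make the format concrete, the short sketch below converts a sequence of integer unit ids into these token strings and back. The example ids are made up, and whether extracted units need any flattening before this mapping is an assumption, not documented behaviour.
```python
# Hypothetical sketch: map unit ids to <CODEC_*> token strings and back.
unit_ids = [17, 503, 20479]

codec_tokens = [f"<CODEC_{i}>" for i in unit_ids]
print(codec_tokens)  # ['<CODEC_17>', '<CODEC_503>', '<CODEC_20479>']

# Recover the integer ids from the token strings.
recovered = [int(tok[len("<CODEC_"):-1]) for tok in codec_tokens]
assert recovered == unit_ids
```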
## Training Data
The codec was trained on LibriSpeech `train-clean-100` with paired transcripts.
The validation split used during training is LibriSpeech `validation`.
Because training is speech-centric and transcript-supervised, performance may be
weaker on non-English speech, conversational speech, music, environmental audio,
or audio with strong noise and overlap.
## Training Procedure
Base components:
- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds
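These settings determine how many audio tokens each training segment contains:
```python
token_rate_hz = 50    # audio tokens per second
segment_seconds = 4   # training segment length
tokens_per_segment = token_rate_hz * segment_seconds
print(tokens_per_segment)  # 200 audio tokens per 4-second segment
```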
Losses:
- reconstruction mel loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature matching losses
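The Semantic Alignment cosine term listed above can be sketched in a few lines. The sketch assumes utterance-level pooled hidden states from the frozen LLM for the audio-token pass and the paired-text pass; the pooling, batch size, and hidden size are assumptions, and the contrastive/memory-bank term is omitted.
```python
import torch
import torch.nn.functional as F

# Hypothetical pooled hidden states from the frozen LLM: one vector per
# utterance for the audio-token pass and for the paired-text pass.
audio_hidden = torch.randn(16, 2560)  # (batch, hidden); shapes are assumptions
text_hidden = torch.randn(16, 2560)

# Cosine alignment loss: pull each audio representation toward its paired text.
sa_cosine_loss = (1.0 - F.cosine_similarity(audio_hidden, text_hidden, dim=-1)).mean()
print(sa_cosine_loss.item())
```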
## Evaluation Results
### Token Learnability
SALMon speech coherence accuracy after token-level LM training:
| Tokenizer | Overall accuracy |
| --- | ---: |
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |
Token-level perplexity on LibriSpeech after 3 epochs of LM training:
| Tokenizer | Eval loss | Perplexity |
| --- | ---: | ---: |
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |
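The perplexity column is the exponential of the token-level eval loss, which can be checked directly from the rounded losses:
```python
import math

# Perplexity = exp(eval loss); small differences versus the table come from
# the losses being rounded to two decimal places.
print(round(math.exp(8.44)))   # close to 4,617 (LLM-Codec)
print(round(math.exp(11.91)))  # close to 148,122 (WavTok-L)
```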
### Reconstruction Quality
Codec-SUPERB-tiny speech reconstruction:
| Model | Mel distance (↓) | STFT distance (↓) | PESQ (↑) | STOI (↑) |
| --- | ---: | ---: | ---: | ---: |
| AUV base | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |
## Limitations
- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot;
they accompany the codec/tokenizer workflow.
## Citation
```bibtex
@article{chung2026llm,
title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
journal={arXiv preprint arXiv:2604.17852},
note = {Model and code available at https://github.com/voidful/llm-codec},
year={2026}
}
```
If you use the Codec-SUPERB interface or benchmark, please also cite
Codec-SUPERB:
```bibtex
@inproceedings{wu-etal-2024-codec,
title = {Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
author = {Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
year = {2024},
url = {https://aclanthology.org/2024.findings-acl.616},
doi = {10.18653/v1/2024.findings-acl.616},
pages = {10330--10348}
}
```