---
base_model:
- Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
pipeline_tag: text-generation
tags:
- audio
- speech
- audio-codec
- neural-audio-codec
- spoken-language-modeling
- codec-superb
- qwen3
datasets:
- librispeech_asr
metrics:
- perplexity
- pesq
- stoi
---

# LLM-Codec

LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio
tokens that are both reconstructable and easier for autoregressive language
models to predict.

Model: https://huggingface.co/voidful/llm-codec

Code: https://github.com/voidful/llm-codec

Usage reference: https://github.com/voidful/Codec-SUPERB

## Model Description

Most neural audio codecs are trained for waveform reconstruction. Spoken
language models, however, consume codec tokens with a next-token prediction
objective. This mismatch can make acoustically valid variation appear as token
uncertainty to the language model.

LLM-Codec adapts a codec with language-model-facing objectives while keeping the
deployed codec interface unchanged. The model is trained with:

- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens
  from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired
  text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps discrete forward
  tokens while enabling gradients to flow to the codec encoder.
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex
  STFT, VQ, GAN, and feature matching losses.

The deployed codec does not require the auxiliary FTP heads.
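The Gumbel bridge above can be sketched in a few lines of PyTorch. This is a minimal illustration of the general straight-through hard Gumbel-Softmax technique, not the released training code; the frame count, embedding size, and loss are toy values chosen for the example.

```python
import torch
import torch.nn.functional as F

def gumbel_bridge(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Hard Gumbel-Softmax: one-hot tokens in the forward pass, while the
    straight-through estimator routes gradients through the soft softmax."""
    return F.gumbel_softmax(logits, tau=tau, hard=True)

# Toy example: 2 frames over a 20,480-entry audio vocabulary.
logits = torch.randn(2, 20480, requires_grad=True)
one_hot = gumbel_bridge(logits)

# Look the hard tokens up in an embedding table and backpropagate:
# gradients reach `logits` (and, in the real model, the codec encoder).
embeddings = torch.randn(20480, 8)
loss = (one_hot @ embeddings).pow(2).mean()
loss.backward()
```

Because the forward pass emits exact one-hot vectors, the downstream LLM always sees discrete tokens, yet the encoder still receives a usable gradient signal.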

## Intended Use

This model is intended for research and development in:

- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB style codec evaluation
- speech token analysis and ablation studies

It is not a full text-to-speech system by itself. For speech generation, use the
codec as the tokenizer/decoder inside a separate speech language modeling
pipeline.

## Out-of-Scope Use

Do not use this model for:

- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content

## Installation

The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.

```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```

Alternatively, if your environment supports editable installs:

```bash
pip install -e .
```

## Quick Start

Load LLM-Codec through the Codec-SUPERB codec registry:

```python
from SoundCodec import codec

print(codec.list_codec())
model = codec.load_codec("llmcodec")
```

Encode and reconstruct one audio file:

```python
from SoundCodec import codec
import torchaudio
import soundfile as sf

model = codec.load_codec("llmcodec")

waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}

units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)

result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)

sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```

## Batch Usage

Codec-SUPERB also provides batch APIs:

```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []

for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })

batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)

results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```

For better throughput, group audio samples with similar lengths before batching.
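The length-grouping suggestion can be implemented with a simple sort-and-chunk pass. `bucket_by_length` is a hypothetical helper written for this card, not part of the Codec-SUPERB API:

```python
def bucket_by_length(items, key, batch_size):
    """Sort items by length, then chunk into batches, so each batch holds
    similar-length audio and padding overhead stays low."""
    ordered = sorted(items, key=key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# Toy clips keyed by sample count: short clips pair up, long clips pair up.
clips = [{"id": "a", "n": 16000}, {"id": "b", "n": 48000},
         {"id": "c", "n": 15500}, {"id": "d", "n": 47000}]
batches = bucket_by_length(clips, key=lambda d: d["n"], batch_size=2)
# -> [[c, a], [d, b]]
```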

## Codec-SUPERB Evaluation

To evaluate LLM-Codec with Codec-SUPERB-tiny:

```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
  --dataset voidful/codec-superb-tiny

PYTHONPATH=. python3 scripts/benchmarking.py \
  --dataset datasets/voidful/codec-superb-tiny_synth \
  --models llmcodec
```

## Model Files

The model repository provides:

- codec weights as `llm-codec.pt`
- a tokenizer extended with `<CODEC_*>` audio tokens
- Qwen-compatible model artifacts containing trained audio-token embeddings

The codec uses 20,480 audio tokens with the canonical token format:

```text
<CODEC_0>, <CODEC_1>, ..., <CODEC_20479>
```
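Mapping integer unit IDs to these token strings is a direct string render. `units_to_tokens` is a hypothetical helper shown for illustration, not a function shipped with the model:

```python
def units_to_tokens(units):
    """Render integer codec units as the canonical <CODEC_i> token strings."""
    return [f"<CODEC_{u}>" for u in units]

units_to_tokens([0, 7, 20479])
# -> ['<CODEC_0>', '<CODEC_7>', '<CODEC_20479>']
```

Because these strings are registered in the extended tokenizer, the resulting sequence can be fed to the LLM tokenizer like ordinary text.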

## Training Data

The codec was trained on LibriSpeech `train-clean-100` with paired transcripts;
the LibriSpeech `validation` split was used for validation during training.

Because training is speech-centric and transcript-supervised, performance may be
weaker on non-English speech, conversational speech, music, environmental audio,
or audio with strong noise and overlap.

## Training Procedure

Base components:

- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds

Losses:

- reconstruction mel loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature matching losses
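One of the losses above, the multi-resolution STFT loss, can be sketched as an average L1 distance between spectrogram magnitudes at several analysis resolutions. The resolutions below are common illustrative choices, not necessarily the exact settings used in training:

```python
import torch

def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram at one analysis resolution."""
    window = torch.hann_window(n_fft)
    return torch.stft(x, n_fft, hop_length=hop, window=window,
                      return_complex=True).abs()

def multi_resolution_stft_loss(pred, target,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average L1 distance between STFT magnitudes across resolutions."""
    loss = 0.0
    for n_fft, hop in resolutions:
        loss = loss + (stft_mag(pred, n_fft, hop)
                       - stft_mag(target, n_fft, hop)).abs().mean()
    return loss / len(resolutions)

x = torch.randn(1, 16000)
print(multi_resolution_stft_loss(x, x).item())  # 0.0 for identical signals
```

Using several window sizes balances time and frequency resolution, which is why codec training typically sums this loss over multiple STFT configurations.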

## Evaluation Results

### Token Learnability

SALMon speech coherence accuracy after token-level LM training:

| Tokenizer | Overall accuracy |
| --- | ---: |
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |

Token-level perplexity on LibriSpeech after 3 epochs of LM training:

| Tokenizer | Eval loss | Perplexity |
| --- | ---: | ---: |
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |
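The perplexity column follows directly from the eval loss: token-level perplexity is the exponential of the mean cross-entropy loss, so the tabulated values agree with `exp(loss)` up to rounding of the reported loss.

```python
import math

def perplexity(eval_loss: float) -> float:
    """Token-level perplexity is exp(mean cross-entropy loss in nats)."""
    return math.exp(eval_loss)

perplexity(8.44)   # ~4.6k, matching the LLM-Codec row up to loss rounding
```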

### Reconstruction Quality

Codec-SUPERB-tiny speech reconstruction:

| Model | Mel distance (lower is better) | STFT distance (lower is better) | PESQ (higher is better) | STOI (higher is better) |
| --- | ---: | ---: | ---: | ---: |
| AUV base | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |

## Limitations

- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot;
  they accompany the codec/tokenizer workflow.

## Citation

```bibtex
@article{chung2026llm,
  title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
  author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2604.17852},
  note = {Model and code available at https://github.com/voidful/llm-codec},
  year={2026}
}
```

If you use the Codec-SUPERB interface or benchmark, please also cite
Codec-SUPERB:

```bibtex
@inproceedings{wu-etal-2024-codec,
  title = {Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author = {Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  year = {2024},
  url = {https://aclanthology.org/2024.findings-acl.616},
  doi = {10.18653/v1/2024.findings-acl.616},
  pages = {10330--10348}
}
```