---
license: other
license_name: license-term-of-stabletoken
language:
- en
- zh
tags:
- speech tokenizer
---
# StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)

**StableToken** is a noise-robust semantic speech tokenizer that learns discrete speech representations and achieves state-of-the-art token stability in noisy environments.

[Paper](https://arxiv.org/abs/2509.22220) | [GitHub](https://github.com/Tencent/StableToken)

For code and further details, see the [GitHub repository](https://github.com/Tencent/StableToken).

## Model Details

| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| BPS (Bits Per Second) | 325 |

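The BPS value follows from the other two rows if each frame emits exactly one token drawn from a single codebook. That reading is an assumption; it matches the numbers in the table but is not stated there. The helper name `bits_per_second` is hypothetical and only illustrates the arithmetic.

```python
import math


def bits_per_second(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook tokenizer: tokens per second times bits per token."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * bits_per_token


# 25 tokens/s * log2(8192) = 25 * 13 bits
stabletoken_bps = bits_per_second(25, 8192)
print(stabletoken_bps)
```

Under the same assumption, a 10-second clip would yield about 250 tokens (25 per second).
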
## Quick Start

To use StableToken, clone the official repository and install its dependencies.

### Installation

```bash
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```

### Inference

```python
# Run from the repository root so that the `src` package is importable.
import os
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/StableToken")

# Load Tokenizer (moved to the GPU and set to eval mode)
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))

# Load Decoder
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda"
)

# 2. Tokenize (takes a list of audio paths; [0] selects the tokens of the first file)
tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct a waveform from the tokens
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```

## Performance

StableToken achieves **60% lower UED** (Unit Edit Distance) than the best existing supervised semantic tokenizers.

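As a rough illustration of the metric, the sketch below assumes UED is the Levenshtein distance between the token sequences of a clean utterance and its noisy counterpart, normalized by the clean sequence length and reported as a percentage. The exact normalization is an assumption, and the function names and token values are hypothetical; this is not the repository's evaluation code.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def unit_edit_distance(clean_tokens, noisy_tokens):
    """UED in percent, normalized by the clean sequence length (assumed convention)."""
    return 100.0 * edit_distance(clean_tokens, noisy_tokens) / max(len(clean_tokens), 1)


# Hypothetical token IDs: noise changes one of five tokens.
clean = [5, 9, 9, 2, 7]
noisy = [5, 9, 3, 2, 7]
print(unit_edit_distance(clean, noisy))
```

A perfectly stable tokenizer would give 0 for every clean/noisy pair, so lower values indicate better noise robustness.
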
### Noise Robustness (UED ↓)

| Model | Frame Rate | Codebook Size | UED (%, ↓) |
|:---|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 16,384 | 31.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 4,096 | 26.17 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 6,561 | 38.66 |
| **StableToken** | 25Hz | 8,192 | **10.17** |

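The headline reduction can be checked against the UED column above. The sketch assumes that the baselines in this table are the "best existing" tokenizers the claim refers to, and it takes the lowest baseline UED as the reference point.

```python
# UED values (%) copied from the table above.
ued = {
    "GLM-4-Voice-Tokenizer": 31.10,
    "S3 Tokenizer": 26.17,
    "CosyVoice2": 38.66,
    "StableToken": 10.17,
}

# Pick the strongest baseline (lowest UED), excluding StableToken itself.
baselines = {name: value for name, value in ued.items() if name != "StableToken"}
best_name = min(baselines, key=baselines.get)

# Relative reduction of StableToken's UED with respect to that baseline.
reduction = 100 * (1 - ued["StableToken"] / baselines[best_name])
print(f"{best_name}: {reduction:.1f}% lower UED")
```

With these numbers the relative reduction against the S3 Tokenizer is about 61%, consistent with the rounded 60% figure.
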
### Reconstruction Quality

Measurements on LibriSpeech (LS) and SEED benchmarks.

| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | **3.99** | **4.16** | 4.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| **StableToken** | 25Hz | 325 | **3.84** | **7.99** | **3.44** | **2.62** | **4.09** | 3.83 | 4.01 | **4.18** |

## Citation

```bibtex
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```

## License

This project is licensed under the [License Term of StableToken](LICENSE).