---
license: other
license_name: license-term-of-stabletoken
language:
- en
- zh
tags:
- speech tokenizer
---

# StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)

**StableToken** is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.

📄 [Paper](https://arxiv.org/abs/2509.22220) | 💻 [GitHub](https://github.com/Tencent/StableToken)

For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/StableToken).

## Model Details

| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| BPS (Bits Per Second) | 325 |

## Quick Start

To use StableToken, clone the official repository and install its dependencies.

### Installation

```bash
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```

### Inference

```python
import os
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor

from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/StableToken")

# Load Tokenizer
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))

# Load Decoder
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda"
)

# 2. Tokenize
tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```

## Performance

StableToken achieves **60% lower UED** (Unit Edit Distance) than the best existing supervised semantic tokenizers.

### Noise Robustness (UED ↓)

| Model | Frame Rate | Codebook Size | UED (%, ↓) |
|:---|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 16,384 | 31.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 4,096 | 26.17 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 6,561 | 38.66 |
| **StableToken** | 25Hz | 8,192 | **10.17** 🏆 |

### Reconstruction Quality

Measurements on LibriSpeech (LS) and SEED benchmarks.

| Model | Frame Rate | BPS | WER (↓) LS-clean | WER (↓) LS-other | WER (↓) SEED-en | WER (↓) SEED-zh | MOS (↑) LS-clean | MOS (↑) LS-other | MOS (↑) SEED-en | MOS (↑) SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | **3.99** | **4.16** | 4.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| **StableToken** | 25Hz | 325 | **3.84** | **7.99** | **3.44** | **2.62** | **4.09** | 3.83 | 4.01 | **4.18** |

## Citation

```bibtex
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```

## License

This project is licensed under the [License Term of StableToken](LICENSE).
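
## Notes on BPS and UED

The BPS and UED figures reported in this card can be sanity-checked with a few lines of Python. The sketch below is illustrative only and is not taken from the StableToken repository. The helper names (`bits_per_second`, `edit_distance`, `unit_edit_distance`) are hypothetical. It assumes that BPS is the frame rate multiplied by the whole number of bits needed to index one codebook entry, and that UED is the Levenshtein distance between the token sequences of a clean and a noisy version of the same utterance, divided by the clean sequence length. The paper's exact normalization and aggregation over a test set may differ.

```python
from typing import Sequence


def bits_per_second(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook tokenizer: frames per second x bits per token."""
    # Whole bits needed to index one of `codebook_size` entries (rounded up).
    bits_per_token = (codebook_size - 1).bit_length()
    return frame_rate_hz * bits_per_token


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of `a`
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if tokens match
            ))
        prev = curr
    return prev[-1]


def unit_edit_distance(clean: Sequence[int], noisy: Sequence[int]) -> float:
    """UED in percent under the assumed definition (normalized by clean length)."""
    if len(clean) == 0:
        raise ValueError("clean token sequence must not be empty")
    return 100.0 * edit_distance(clean, noisy) / len(clean)


if __name__ == "__main__":
    # 25 Hz frame rate and an 8,192-entry codebook, as listed under Model Details.
    print(bits_per_second(25, 8192))
    # Toy token sequences: one of four tokens changes under noise.
    print(unit_edit_distance([1, 2, 3, 4], [1, 9, 3, 4]))
```

In practice, `clean` and `noisy` would be the token lists returned by `extract_speech_token` for an utterance and its noise-corrupted copy. A lower UED means the tokenizer's output changes less when noise is added.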