Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders
Abstract
Sparse autoencoders trained on language model representations reveal interpretable features for speech synthesis that can be manipulated to control linguistic and prosodic attributes.
Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
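To make the two mechanisms named in the abstract concrete, here is a minimal NumPy sketch of a BatchTopK encoder and of one common way to steer through an SAE latent. It is not the authors' code: the weight names (`W_enc`, `W_dec`, `b_enc`, `b_dec`), the pre-encoder bias subtraction, and the "clamp one latent and add back the reconstruction error" steering rule are all assumptions chosen for illustration, and the paper may use a different intervention.

```python
import numpy as np

def batch_topk_encode(x, W_enc, b_enc, b_dec, k):
    """Encode a batch of residual vectors with a BatchTopK rule.

    Instead of keeping k latents per token, keep the k * batch_size largest
    activations across the whole batch and zero out the rest.
    """
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)  # ReLU activations
    n_keep = k * x.shape[0]
    flat = pre.ravel()
    if n_keep < flat.size:
        # Indices of the n_keep largest entries over the flattened batch.
        keep = np.argpartition(flat, -n_keep)[-n_keep:]
        mask = np.zeros(flat.size, dtype=bool)
        mask[keep] = True
        flat = np.where(mask, flat, 0.0)
    return flat.reshape(pre.shape)

def decode(z, W_dec, b_dec):
    """Map sparse latents back to the residual stream."""
    return z @ W_dec + b_dec

def steer(x, W_enc, b_enc, W_dec, b_dec, k, feature, value):
    """Clamp one latent to `value` (an assumed steering rule).

    The SAE reconstruction error is added back, so everything the SAE does
    not explain is left untouched and only the chosen feature changes.
    """
    z = batch_topk_encode(x, W_enc, b_enc, b_dec, k)
    error = x - decode(z, W_dec, b_dec)
    z_steered = z.copy()
    z_steered[:, feature] = value
    return decode(z_steered, W_dec, b_dec) + error

if __name__ == "__main__":
    # Toy sizes and random weights, purely hypothetical.
    rng = np.random.default_rng(0)
    d_model, n_features, batch, k = 8, 32, 4, 3
    x = rng.normal(size=(batch, d_model))
    W_enc = rng.normal(size=(d_model, n_features))
    W_dec = rng.normal(size=(n_features, d_model))
    b_enc, b_dec = np.zeros(n_features), np.zeros(d_model)
    z = batch_topk_encode(x, W_enc, b_enc, b_dec, k)
    print("active latents in batch:", np.count_nonzero(z))
    x_steered = steer(x, W_enc, b_enc, W_dec, b_dec, k, feature=5, value=2.0)
    print("steered residual shape:", x_steered.shape)
```

Under this rule the steered residual differs from the original only along the decoder row of the chosen feature, which is why an intervention of this kind can change one attribute while leaving the rest of the representation alone.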
Community
Bringing SAEs to text-to-speech models!
Currently, control over TTS models such as CosyVoice3 is limited to prompts or predefined tags. We found that model generations can be precisely edited by steering SAE features.
We also analyze these features: some are audio-only, others activate only on text, and some activate on both text and audio. Additionally, we introduce an autointerp pipeline for all of them.
We plan to publish the SAE weights and code soon!
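Until the official code is available, the split into text-only, audio-only, and shared features mentioned above can be illustrated with a small sketch. This is a hypothetical heuristic, not the authors' pipeline: the function name `modality_labels`, the activation-mass criterion, and the `lo`/`hi` thresholds are all assumptions.

```python
import numpy as np

def modality_labels(acts, is_speech, lo=0.1, hi=0.9):
    """Label each SAE feature by where its activation mass falls.

    acts:      (n_tokens, n_features) non-negative feature activations
    is_speech: (n_tokens,) True for speech-token positions, False for text
    lo, hi:    hypothetical thresholds on the speech share of activation mass
    """
    acts = np.asarray(acts, dtype=float)
    is_speech = np.asarray(is_speech, dtype=bool)
    total = acts.sum(axis=0)              # mass over all positions
    speech = acts[is_speech].sum(axis=0)  # mass on speech positions only
    labels = []
    for t, s in zip(total, speech):
        if t == 0:
            labels.append("dead")   # feature never fired on this data
        elif s / t >= hi:
            labels.append("audio")
        elif s / t <= lo:
            labels.append("text")
        else:
            labels.append("both")
    return labels

if __name__ == "__main__":
    # Two text positions followed by two speech positions (toy data).
    acts = [[1, 0, 1, 0],
            [1, 0, 0, 0],
            [0, 2, 1, 0],
            [0, 2, 0, 0]]
    print(modality_labels(acts, [False, False, True, True]))
```

A label of this kind could then decide which evidence an auto-interp step looks at for each feature: text-prefix context, speech clips, or both.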