Qwen3-TTS VoiceDesign - T5

A fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign focused on expressive prompt following: emotion, pace, and affect controllability under free-form English voice descriptions. The model trades a small amount of intelligibility headroom for substantially more expressive output. Prompts that ask for sad, whispered, projected, sarcastic, or bedtime-storyteller delivery come out noticeably closer to the description than they do with the base model.

  • Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
  • Method: LoRA on the Talker's attention + MLP projections, merged back into the base weights
  • Training data: EARS (rich-style multi-speaker reads) + Expresso (high-quality expressive performances) with free-form natural-language captions
  • Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec

This repo is self-contained: it ships the merged transformer weights, the audio codec (speech_tokenizer/), the tokenizer, and all configs. No other HF repo needs to be downloaded at inference time.

Quick start

Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):

pip install qwen-tts transformers torch soundfile

Generate a clip:

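from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the merged checkpoint; PEFT is not required because the adapter is already merged.
wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t5")

# `text` is what gets spoken; `instruct` is the free-form voice description.
wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)
# Write the first returned waveform at the sample rate reported by the model.
sf.write("out.wav", wavs[0], sr)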

A ready-to-run version with three example prompts is provided at example_inference.py.

The instruct prompt format

The instruct field is free-form English describing the voice. The training distribution covers:

  • gender: "a male/female speaker", "a deep-voiced narrator"
  • pitch: "high/medium/low pitched", "deep", "thin and high"
  • speed: "slowly", "at a brisk pace", "at a moderate tempo"
  • affect / emotion: "happy", "angry", "sad", "whispered", "sarcastic", "projected"
  • scene / persona: "a bedtime storyteller", "a news anchor", "a sports announcer at the climax of a play"

Example prompts:

A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks softly with a sad tone, low energy, almost whispering.
An older male narrator reads a bedtime story slowly, with warmth.
A high-pitched announcer projects an exciting headline at a fast pace.
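
The attribute categories above can also be composed programmatically. The sketch below is illustrative only: build_instruct and its parameters are hypothetical names, not part of qwen_tts, and the sentence template is an assumption rather than the caption format used in training (training captions were free-form prose). It places the gender first, which is the mitigation suggested under Known limitations for gender drift on strong-emotion prompts.

```python
def build_instruct(gender, emotion=None, pace=None, pitch=None, persona=None):
    """Compose a free-form voice description that leads with the gender.

    Hypothetical helper for illustration; the wording template is an assumption.
    """
    # Subject first, e.g. "A male speaker" or "A female news anchor".
    subject = f"A {gender} {persona}" if persona else f"A {gender} speaker"

    # Optional descriptive traits, inserted after the subject.
    traits = []
    if pitch:
        traits.append(f"{pitch} pitched")
    if emotion:
        traits.append(emotion)

    sentence = subject
    if traits:
        sentence += ", " + " and ".join(traits) + ","
    sentence += " speaks"
    if pace:
        sentence += f" {pace}"
    return sentence + "."


if __name__ == "__main__":
    print(build_instruct("male", emotion="sad", pace="slowly"))
    print(build_instruct("female", persona="news anchor", pitch="high", pace="at a brisk pace"))
```

The returned string is what would be passed as the instruct argument of generate_voice_design. Hand-written descriptions remain the closer match to the training distribution, so treat the helper as a convenience for batch experiments.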

How the adapter was trained

This adapter follows a corrected training protocol designed to fix four silent issues common in earlier naive recipes for VoiceDesign:

  1. Dual-track input layout. Training-time inputs_embeds is built as the exact element-wise sum of text-track and codec-track embeddings used by the VoiceDesign path of Qwen3TTSForConditionalGeneration.generate, including the 5-position English think-prefix on the codec track. This matches inference exactly, instead of approximating it with a chat-templated prompt + boundary switch.
  2. Single-shift loss. Labels are computed manually as F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100). The labels= argument is never passed into the wrapped forward, avoiding the double-shift that occurs when PEFT's wrapped CausalLMLoss adds its own internal shift on top of the collator's.
  3. Conservative LR for LoRA on a 1.7 B base. Peak LR 2.0e-5, cosine schedule, with min_lr_ratio=0.2 so the late-training LR stays high enough to keep learning rather than plateauing.
  4. No sub-talker loss with a frozen Code Predictor. The sub-talker auxiliary loss is disabled (weight=0.0) when the Code Predictor isn't part of the LoRA scope, because enabling that loss while the Code Predictor is frozen is known to corrupt training.
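
The single-shift loss from item 2 can be sketched as follows. This is a minimal illustration, not the training script: the function name single_shift_loss is hypothetical, and it assumes logits of shape [batch, seq, vocab] and codec_0_labels of shape [batch, seq]. Because F.cross_entropy expects the class dimension second, the sketch flattens both tensors before the call.

```python
import torch
import torch.nn.functional as F


def single_shift_loss(logits, codec_0_labels, ignore_index=-100):
    """Next-token cross-entropy with exactly one shift.

    Assumed shapes: logits [batch, seq, vocab], codec_0_labels [batch, seq].
    """
    # Position t predicts the label at t + 1, so drop the last logit
    # and the first label. This is the only shift applied.
    shifted_logits = logits[:, :-1, :]
    shifted_labels = codec_0_labels[:, 1:]

    # Flatten to [N, vocab] and [N]; positions labelled ignore_index are skipped.
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
        ignore_index=ignore_index,
    )


if __name__ == "__main__":
    labels = torch.tensor([[-100, 3, 1, 2]])
    logits = torch.full((1, 4, 5), -20.0)
    # Make position t confidently predict labels[t + 1].
    for t in range(3):
        logits[0, t, labels[0, t + 1]] = 20.0
    print(single_shift_loss(logits, labels).item())
```

Since the labels are never handed to the wrapped forward, no second shift can be applied behind the caller's back; the alignment is fully visible in the two slicing lines.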

The adapter is LoRA r=16, α=32, dropout=0.05 on the Talker's q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj projections only. The Code Predictor and audio codec are frozen end-to-end. Training data combines EARS (clean multi-speaker reads with style descriptors) and Expresso (high-quality expressive performances at 48 kHz, downsampled to 24 kHz to match the base's native rate). Captions are free-form natural-language prose, one canonical caption per clip, with no templated descriptions.
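
The learning-rate schedule from item 3 of the protocol (peak 2.0e-5, cosine decay, min_lr_ratio=0.2) can be sketched as below. Two assumptions are made here: min_lr_ratio is interpreted as a floor equal to ratio × peak, which is the common convention, and no warmup phase is modelled because none is described. The function name cosine_lr and the total_steps value are hypothetical.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2.0e-5, min_lr_ratio=0.2):
    """Cosine decay from peak_lr down to peak_lr * min_lr_ratio (assumed floor)."""
    min_lr = peak_lr * min_lr_ratio
    # Clamp progress to [0, 1] so steps past the end stay at the floor.
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for step in (0, 500, 1000):
        print(step, cosine_lr(step, total))
```

With these settings the rate never drops below 4.0e-6, one fifth of the peak, so late-training updates remain large enough to keep moving the adapter weights.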

The final adapter (~19 M parameters, ~77 MB at fp32) was permanently merged into the Talker weights for this repo so inference does not require PEFT.
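
Merging a LoRA adapter folds the low-rank update into each targeted weight matrix, W' = W + (α / r) · B · A, which is why the merged checkpoint no longer needs PEFT at inference. The toy sketch below shows that identity on plain Python lists to stay dependency-free. It is not the merge code used for this repo; the helper names are hypothetical, and the tiny rank-1 matrices are made up, chosen so that α / r = 2, the same ratio as the reported α=32, r=16.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def merge_lora(weight, lora_A, lora_B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(lora_A)
    scale = alpha / r
    delta = matmul(lora_B, lora_A)  # [out, r] @ [r, in] -> [out, in]
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    W = [[1, 0], [0, 1]]   # base weight, [out, in]
    A = [[1, 2]]           # LoRA A, [r, in] with r = 1
    B = [[3], [4]]         # LoRA B, [out, r]
    x = [[1, 1]]           # one input row

    merged = merge_lora(W, A, B, alpha=2)

    # Forward pass with the merged weight: y = x @ W'^T
    y_merged = matmul(x, transpose(merged))

    # Forward pass with base weight plus a separate adapter branch.
    base = matmul(x, transpose(W))
    branch = matmul(matmul(x, transpose(A)), transpose(B))
    y_split = [[b + 2.0 * d for b, d in zip(base[0], branch[0])]]

    print(merged, y_merged, y_split)
```

Both forward passes give the same output, so the merged weights reproduce the adapter's behavior without a separate adapter branch at run time.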

Strengths

  • Better emotion + affect rendering on prompts that ask for it (e.g. whispered, sad, projected, bedtime-storyteller) versus the base model.
  • Better persona / scene composition: prompts that combine an emotion with a scene (a stern parent, an excited sports announcer) come through more clearly.
  • No identity drift on neutral prompts. Plain "a clear neutral voice" prompts produce output that is acoustically close to the base model; the adapter doesn't "color" everything.

Known limitations

  • Gender drift on strong-emotion prompts. Some sad_male, sad_female, and fear_female prompts can render with the wrong-gender timbre. Root cause: the training corpora's emotion-axis coverage is concentrated on a handful of speakers, so strongly emotional descriptions act partially as speaker-identity cues. Mitigation in the prompt: lead with the gender ("A male speaker, sad and quiet, …") rather than the emotion.
  • Slight robotic tone on extreme prompts. A small number of fear_male_normal_slow and similar prompts produce flatter prosody than the base. Trade-off accepted in exchange for the broader expressive lift.
  • English only. All training and evaluation used English prompts and English text. The base model supports 10 languages; they are untouched but not validated against this adapter's modified CB-0 distribution.
  • Research / non-commercial use only; see License.

License

  • Base model weights (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign): Apache 2.0.
  • Training data:
    • EARS: CC BY-NC-SA 4.0 (research / non-commercial).
    • Expresso: CC BY-NC 4.0 (research / non-commercial).

Because both training corpora carry non-commercial restrictions, the derived model effectively inherits a CC BY-NC-SA 4.0 constraint: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus.
