Qwen3-TTS VoiceDesign — T5

A fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign focused on expressive prompt following: controllable emotion, pace, and affect under free-form English voice descriptions. The model trades a small amount of intelligibility headroom for substantially more expressive output. For prompts that ask for a sad, whispered, projected, sarcastic, or bedtime-storyteller delivery (among others), the output is noticeably closer to the description than what the base model produces.

Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
Method: LoRA on the Talker's attention + MLP projections, merged back into the base weights
Training data: EARS (rich-style multi-speaker reads) + Expresso (high-quality expressive performances) with free-form natural-language captions
Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec

This repo is self-contained — it ships the merged transformer weights, the audio codec (speech_tokenizer/), the tokenizer, and all configs. No other HF repo needs to be downloaded at inference time.

Quick start

Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):

pip install qwen-tts transformers torch soundfile

Generate a clip:

from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the merged checkpoint; PEFT is not required at inference time.
wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t5")

# `text` is what gets spoken; `instruct` describes how the voice should sound.
wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)
# Write the first returned waveform at the returned sample rate.
sf.write("out.wav", wavs[0], sr)

A ready-to-run version with three example prompts is provided in example_inference.py.
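When choosing max_new_tokens, it can help to translate the token budget into seconds of audio. The sketch below is a rough estimate only. It assumes that each generated step corresponds to one codec frame and that the frame rate is the nominal 12 Hz from the model name; the codec's exact frame rate may differ. The helper names and the margin parameter are hypothetical and are not part of qwen-tts.

```python
import math

# Assumption: nominal frame rate taken from the "12Hz" in the model name.
NOMINAL_FRAME_RATE_HZ = 12.0
# Output sample rate stated in this card.
SAMPLE_RATE_HZ = 24_000


def max_audio_seconds(max_new_tokens: int,
                      frame_rate_hz: float = NOMINAL_FRAME_RATE_HZ) -> float:
    """Upper bound on clip length if each generated step is one codec frame."""
    return max_new_tokens / frame_rate_hz


def tokens_for_seconds(seconds: float,
                       frame_rate_hz: float = NOMINAL_FRAME_RATE_HZ,
                       margin: float = 1.25) -> int:
    """Suggest a max_new_tokens value with headroom so speech is not cut off."""
    return math.ceil(seconds * frame_rate_hz * margin)


if __name__ == "__main__":
    # Budget used in the quick-start call above.
    print(max_audio_seconds(600))
    # Budget suggestion for a clip of roughly ten seconds.
    print(tokens_for_seconds(10))
```

Under these assumptions, the quick-start value of 600 leaves room for well under a minute of speech, which is ample for single sentences.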

The `instruct` prompt format

The instruct field is free-form English describing the voice. The training distribution covers:

gender — "a male/female speaker", "a deep-voiced narrator"
pitch — "high/medium/low pitched", "deep", "thin and high"
speed — "slowly", "at a brisk pace", "at a moderate tempo"
affect / emotion — "happy", "angry", "sad", "whispered", "sarcastic", "projected"
scene / persona — "a bedtime storyteller", "a news anchor", "a sports announcer at the climax of a play"

Example prompts:

A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks softly with a sad tone, low energy, almost whispering.
An older male narrator reads a bedtime story slowly, with warmth.
A high-pitched announcer projects an exciting headline at a fast pace.
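
The attributes listed above can also be assembled programmatically. The following is a minimal sketch of a hypothetical helper, build_instruct, which is not part of qwen-tts. It places the gender first, which is the prompt-side mitigation suggested under Known limitations. The model was trained on free-form prose, so hand-written descriptions remain the primary interface; this is only a convenience for batch generation.

```python
from typing import Optional


def build_instruct(gender: str,
                   emotion: Optional[str] = None,
                   pitch: Optional[str] = None,
                   pace: Optional[str] = None,
                   persona: Optional[str] = None) -> str:
    """Compose an instruct string that names the speaker's gender first.

    Hypothetical helper: the phrasing is one plausible wording,
    not a format required by the model.
    """
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")

    parts = [f"A {gender} speaker"]
    if emotion:
        parts.append(emotion)                      # e.g. "sad and quiet"
    if pitch:
        parts.append(f"{pitch} pitched")           # e.g. "low pitched"
    if pace:
        parts.append(f"speaking {pace}")           # e.g. "speaking slowly"
    if persona:
        parts.append(f"in the style of {persona}") # e.g. "a news anchor"
    return ", ".join(parts) + "."


if __name__ == "__main__":
    print(build_instruct("male", emotion="sad and quiet", pace="slowly"))
```

The returned string can be passed as the instruct argument of generate_voice_design.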

How the adapter was trained

This adapter follows a corrected training protocol designed to fix four issues that fail silently and are common in earlier naive fine-tuning recipes for VoiceDesign:

Dual-track input layout. Training-time inputs_embeds is built with the same element-wise sum of text-track and codec-track embeddings that Qwen3TTSForConditionalGeneration.generate uses on its VoiceDesign path, including the 5-position English think-prefix on the codec track. This matches inference exactly instead of approximating it with a chat-templated prompt plus a boundary switch.
Single-shift loss. The loss is computed manually as F.cross_entropy(logits[:, :-1], codec_0_labels[:, 1:], ignore_index=-100), with the tensors reshaped as F.cross_entropy requires. The labels= argument is never passed into the wrapped forward, which avoids the double shift that occurs when PEFT's wrapped CausalLMLoss adds its own internal shift on top of the collator's.
Conservative LR for LoRA on a 1.7B base. Peak LR is 2.0e-5 on a cosine schedule with min_lr_ratio=0.2, so the LR floor is 20% of peak (4.0e-6) and late training keeps learning rather than plateauing.
No sub-talker loss with a frozen Code Predictor. The sub-talker auxiliary loss is disabled (weight=0.0) when the Code Predictor is not part of the LoRA scope. Enabling that loss while the Code Predictor is frozen is known to corrupt training.
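
The single-shift loss and the learning-rate floor from the list above can be sketched as follows. This is an illustration under stated assumptions, not the training code used for this adapter. The function names are hypothetical, logits are assumed to have shape (batch, seq_len, vocab), labels are assumed to be unshifted with -100 on ignored positions, and the schedule omits warmup because the card does not describe one.

```python
import math

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100


def single_shift_loss(logits: torch.Tensor,
                      codec_0_labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy with exactly one shift.

    logits:         (batch, seq_len, vocab), from a forward pass called
                    WITHOUT labels=, so no internal shift has been applied.
    codec_0_labels: (batch, seq_len), unshifted, -100 where ignored.
    """
    shifted_logits = logits[:, :-1, :]      # position t predicts ...
    shifted_labels = codec_0_labels[:, 1:]  # ... the label at t + 1
    # F.cross_entropy expects (N, vocab) and (N,), so flatten batch and time.
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 2.0e-5, min_lr_ratio: float = 0.2) -> float:
    """Cosine decay from peak_lr down to peak_lr * min_lr_ratio (no warmup)."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


if __name__ == "__main__":
    toy_logits = torch.randn(2, 6, 11)
    toy_labels = torch.randint(0, 11, (2, 6))
    toy_labels[:, 0] = IGNORE_INDEX  # e.g. a prompt position
    print(single_shift_loss(toy_logits, toy_labels).item())
    print(cosine_lr(0, 1000), cosine_lr(1000, 1000))
```

If the shifted labels were instead passed as labels= to a forward that shifts internally, each position would be scored against the token two steps ahead, and the loss would still decrease enough to hide the mistake. Computing the loss outside the forward keeps the alignment explicit.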

The adapter is LoRA r=16, α=32, dropout=0.05 on the Talker's q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj projections only. The Code Predictor and audio codec are frozen end-to-end. Training data combines EARS (clean multi-speaker reads with style descriptors) and Expresso (high-quality expressive performances at 48 kHz, downsampled to 24 kHz to match the base's native rate). Captions are free-form natural-language prose, one canonical caption per clip — no templated descriptions.

The final adapter (~19 M parameters, ~77 MB at fp32) was permanently merged into the Talker weights for this repo so inference does not require PEFT.
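
The adapter hyperparameters and the size estimate above can be written down as a short sketch. The keyword names match PEFT's LoraConfig, but the helper functions are hypothetical, and the layer dimensions in the example are placeholders rather than the Talker's real shapes. Restricting the adapter to the Talker (and excluding same-named modules elsewhere) is not shown here and would need extra scoping.

```python
# Projections adapted by LoRA, as listed in this card.
LORA_TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # attention
    "gate_proj", "up_proj", "down_proj",     # MLP
]

LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": LORA_TARGET_MODULES,
}


def make_lora_config():
    """Build a PEFT LoraConfig from the values above.

    peft is imported lazily so the constants can be inspected without it.
    """
    from peft import LoraConfig
    return LoraConfig(**LORA_KWARGS)


def lora_scaling(r: int, lora_alpha: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r


def lora_params_per_linear(in_features: int, out_features: int, r: int) -> int:
    """Parameters added to one linear layer: A is (r, in), B is (out, r)."""
    return r * (in_features + out_features)


def size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Storage in megabytes (10**6 bytes); 4 bytes per parameter is fp32."""
    return num_params * bytes_per_param / 1e6


if __name__ == "__main__":
    print(lora_scaling(LORA_KWARGS["r"], LORA_KWARGS["lora_alpha"]))
    # Placeholder dimensions, NOT the Talker's actual hidden size.
    print(lora_params_per_linear(2048, 2048, r=16))
    # About 19 M parameters at fp32, in line with the rounded figure above.
    print(size_mb(19_000_000))
```

Because the adapter has already been merged, none of this is needed for inference; it is only relevant when reproducing or extending the fine-tune.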

Strengths

Better emotion + affect rendering on prompts that ask for it (e.g. whispered, sad, projected, bedtime-storyteller) versus the base model.
Better persona / scene composition — prompts that combine an emotion with a scene (a stern parent, an excited sports announcer) come through more clearly.
No identity drift on neutral prompts. Plain "a clear neutral voice" prompts produce output that is acoustically close to the base model — the adapter doesn't "color" everything.

Known limitations

Gender drift on strong-emotion prompts. Some sad_male, sad_female, and fear_female prompts can render with the wrong-gender timbre. Root cause: the training corpora's emotion-axis coverage is concentrated on a handful of speakers, so strongly emotional descriptions act partially as speaker-identity cues. Mitigation in the prompt: lead with the gender ("A male speaker, sad and quiet, …") rather than the emotion.
Slight robotic tone on extreme prompts. A small number of fear_male_normal_slow and similar prompts produce flatter prosody than the base. This trade-off was accepted in exchange for the broader expressive lift.
English only. All training and evaluation used English prompts and English text. The base model supports 10 languages; the others were not deliberately changed, but they have not been validated against this adapter's modified codebook-0 (CB-0) distribution.
Research / non-commercial use only — see license.

License

Base model weights (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign): Apache 2.0.
Training data:
- EARS: CC BY-NC-SA 4.0 (research / non-commercial).
- Expresso: CC BY-NC 4.0 (research / non-commercial).

Because both training corpora carry non-commercial restrictions, the derived model effectively inherits a CC BY-NC-SA 4.0 constraint: it is free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially licensed corpus.

References

Base model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
Inference library: qwen-tts on PyPI
EARS dataset: Effortless and Realistic Speech Dataset
Expresso dataset: ylacombe/expresso
