FroST-4B: Frozen-State Conversational Speech Model
FroST-4B is a conversational speech model: a frozen Qwen3-4B writes a text reply, and its hidden states drive an autoregressive speech head that emits XY_Tokenizer codes, which are decoded to 24 kHz audio. One input → text reply + spoken audio.
- Backbone: Qwen3-4B (frozen; a LoRA r=32 adapter is baked in)
- Speech head: ~341M params, delay-pattern, 8 codebooks (XY_Tokenizer)
- Conditioning: hidden states tapped at layer 24 → speech head (cross-attention)
- Trainable at train time: LoRA + projector + speech head only (base + lm_head frozen)
- Codec dependency: OpenMOSS-Team/XY_Tokenizer_TTSD_V0_hf (downloaded at load)
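
The "delay-pattern, 8 codebooks" entry above can be illustrated with a small sketch. It assumes the common schedule in which codebook k is shifted right by k steps, so the head can predict one code per codebook at each autoregressive step. The exact schedule FroST uses is not stated here, and the function names and the `PAD` value are hypothetical, not part of the `frost` package.

```python
from typing import List

PAD = -1  # hypothetical padding id; the model's real pad token is not documented here


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps: codes[k][t] moves to delayed[k][t + k]."""
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = [[pad] * total_steps for _ in range(num_codebooks)]
    for k, row in enumerate(codes):
        for t, code in enumerate(row):
            delayed[k][t + k] = code
    return delayed


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay_pattern, recovering time-aligned frames for the codec."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - (num_codebooks - 1)
    return [delayed[k][k:k + num_frames] for k in range(num_codebooks)]


# Toy example: 8 codebooks, 3 frames; code value encodes (codebook, frame).
codes = [[k * 10 + t for t in range(3)] for k in range(8)]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
```

Under this assumed schedule, a clip of T frames takes T + 7 decoding steps, and the delay has to be undone before the 8 code streams are passed to the codec decoder.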
Usage
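from frost import FroST

# Load the model; the XY_Tokenizer codec is downloaded at load time.
frost = FroST.from_pretrained("apxrv/frost-4b")
# One input yields a text reply plus spoken audio.
out = frost.chat("hi, how are you?", system="You are warm and upbeat.")
print(out.text)
out.save("reply.wav")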
Trained for 1,500 steps. See the project repo for training/eval code.