FroST-4B — Frozen-State Conversational Speech Model

FroST-4B is a conversational speech model. A frozen Qwen3-4B backbone writes the text reply, and the backbone's hidden states drive an autoregressive speech head that emits XY_Tokenizer codes, which are then decoded to 24 kHz audio. One input yields both a text reply and spoken audio.

Backbone: Qwen3-4B (frozen; a LoRA r=32 adapter is baked in)
Speech head: ~341M params, delay-pattern, 8 codebooks (XY_Tokenizer)
Conditioning: hidden states tapped at layer 24 → speech head (cross-attention)
Trained parameters: LoRA adapter, projector, and speech head only (base weights and lm_head stay frozen)
Codec dependency: OpenMOSS-Team/XY_Tokenizer_TTSD_V0_hf (downloaded at load)
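
The speech head above is listed as "delay-pattern, 8 codebooks", but the card does not spell out the exact scheme. The sketch below assumes the common layout in which codebook k is shifted right by k frames, so that at each step the head predicts one code per codebook with the finer codebooks lagging the coarser ones. The names `apply_delay`, `undo_delay`, and `PAD` are hypothetical and are not taken from the FroST code.

```python
from typing import List

PAD = -1  # hypothetical padding id, chosen for illustration only


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k frames; each output row has T + K - 1 entries."""
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    width = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        right_pad = width - k - num_frames
        delayed.append([pad] * k + list(row) + [pad] * right_pad)
    return delayed


def undo_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay by slicing each row back to its aligned frames."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# 8 codebooks, 4 frames of dummy codes (value = 10 * codebook + frame)
codes = [[10 * k + t for t in range(4)] for k in range(8)]
delayed = apply_delay(codes)
restored = undo_delay(delayed)
```

Under this assumption, the generated sequence is K - 1 steps longer than the audio frame count, and the delay has to be undone before the codes are handed to the codec decoder.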

Usage

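from frost import FroST
# Load the model (the XY_Tokenizer codec is downloaded at load)
frost = FroST.from_pretrained("apxrv/frost-4b")
# One call returns the text reply and the spoken audio
out = frost.chat("hi, how are you?", system="You are warm and upbeat.")
print(out.text)
out.save("reply.wav")  # write the spoken reply to disk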
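
To sanity-check a saved reply, the file's sample rate and duration can be read with Python's standard `wave` module. This is a minimal sketch that assumes `out.save` writes a standard PCM WAV file, which the card does not confirm; a float-encoded WAV would need a different reader. `wav_info` is a hypothetical helper, and the silent stand-in file is there only so that the sketch runs without the model.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 24_000  # sample rate stated in this card


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


# Stand-in for reply.wav: one second of 16-bit mono silence.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav.setframerate(EXPECTED_RATE)
    wav.writeframes(b"\x00\x00" * EXPECTED_RATE)

rate, channels, seconds = wav_info(demo_path)
print(rate, channels, seconds)
```

Pointing `wav_info` at `reply.wav` instead of the stand-in file should report a rate of 24000 if the assumption about the output format holds.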

Trained for 1,500 steps. See the project repo for training and evaluation code.
