Musci-ASR-2.4B

Musci-ASR-2.4B is an English speech-to-text model that pairs a Qwen3 language-model backbone with a Qwen3-Omni-MoE audio encoder. It was trained on public English ASR corpora and then tuned with reinforcement learning on the Open ASR Leaderboard training splits. The model has about 2.4B parameters in total and is distributed as a single bfloat16 safetensors shard (~4.84 GB).
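As a quick sanity check on these figures, the shard size and the parameter count can be related through the 2 bytes that bfloat16 uses per weight. The sketch below assumes 1 GB = 10^9 bytes and ignores the small safetensors header; the helper names `params_from_size` and `size_from_params` are hypothetical and not part of the repository.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each weight in 16 bits


def params_from_size(size_gb: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Estimate the parameter count from a checkpoint size (assumes 1 GB = 1e9 bytes)."""
    return size_gb * 1e9 / bytes_per_param


def size_from_params(n_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Estimate the checkpoint size in GB from a parameter count."""
    return n_params * bytes_per_param / 1e9


estimate = params_from_size(4.84)
print(f"{estimate / 1e9:.2f}B parameters")  # 2.42B parameters
```

Under these assumptions a 4.84 GB shard corresponds to roughly 2.42B weights, which agrees with the stated ~2.4B total.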

Inference

import librosa
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_class_from_dynamic_module

REPO = "Musci-research/Musci-ASR-2.4B"
DEVICE = "cuda:0"

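# Load the weights in bfloat16. trust_remote_code=True is needed because the
# repo ships its own model code.
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)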

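# The processor classes are defined in the repo's processing_Musci module,
# so they are loaded dynamically rather than imported from transformers.
MusciProcessor = get_class_from_dynamic_module("processing_Musci.MusciProcessor", REPO)
MelConfig      = get_class_from_dynamic_module("processing_Musci.MelConfig", REPO)
# Mel settings match the audio frontend: 16 kHz, 128 mel bins, n_fft=400, hop_length=160.
mel_cfg = MelConfig(mel_sr=16000, mel_dim=128, mel_n_fft=400, mel_hop_length=160)
processor = MusciProcessor(tokenizer, config=mel_cfg, enable_time_marker=False)
processor.load_template(hf_hub_download(REPO, "chat_template_default.py"))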

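# librosa resamples to 16 kHz and downmixes to mono by default.
waveform, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio=waveform, return_tensors="pt").to(DEVICE)
# Cast the audio features to the model dtype (bfloat16 here).
inputs["audio_data"] = inputs["audio_data"].to(model.dtype)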

# Greedy decoding: no sampling and a single beam.
with torch.no_grad():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        eos_token_id=[processor.end_token_id],
    )

# Keep only the newly generated ids by dropping the prompt prefix.
new_ids = out_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
print(transcript)
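The slicing step near the end of the snippet relies on the usual behaviour of causal language models in transformers, where `generate` returns the prompt ids followed by the newly generated ids. A minimal sketch of that idea with plain Python lists is shown below; the token ids and the helper name `strip_prompt` are made up for illustration.

```python
def strip_prompt(out_ids, prompt_len):
    """Drop the first prompt_len ids from every row, keeping only generated ids."""
    return [row[prompt_len:] for row in out_ids]


# Hypothetical ids: three prompt tokens followed by two generated tokens.
prompt = [101, 102, 103]
out_ids = [prompt + [7, 8]]

new_ids = strip_prompt(out_ids, len(prompt))
print(new_ids)  # [[7, 8]]
```

Decoding only the stripped ids keeps the prompt template out of the transcript.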

Audio frontend

Sample rate: 16 kHz
Features: Whisper log-mel filterbank (n_mels=128, n_fft=400, hop_length=160)
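To make these settings concrete, the sketch below converts them into window and hop durations and an approximate frame count for a clip. It assumes roughly one frame per hop; the exact count depends on how the processor pads the signal, which is not stated here. The function name `frontend_timing` is hypothetical.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # analysis window length in samples
HOP_LENGTH = 160      # stride between consecutive frames in samples


def frontend_timing(duration_s: float) -> dict:
    """Derive timing figures for the log-mel frontend from its settings."""
    n_samples = int(round(duration_s * SAMPLE_RATE))
    return {
        "window_ms": 1000 * N_FFT / SAMPLE_RATE,
        "hop_ms": 1000 * HOP_LENGTH / SAMPLE_RATE,
        "frames_per_second": SAMPLE_RATE / HOP_LENGTH,
        # Assumption: about one frame per hop; padding may shift this by one.
        "approx_frames": n_samples // HOP_LENGTH,
    }


print(frontend_timing(30.0))
```

With these values each frame covers a 25 ms window, frames are spaced 10 ms apart, and a 30-second clip yields about 3000 frames of 128 mel bins each.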

License

apache-2.0.
