Musci-ASR-2.4B
An English speech-to-text model that pairs a Qwen3 language-model backbone with a
Qwen3-Omni-MoE audio encoder. It was trained on public English ASR corpora and tuned with
reinforcement learning on the Open ASR Leaderboard training splits. The model has about 2.4B
parameters in total and is distributed as a single bfloat16 safetensors shard (~4.84 GB).
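As a rough sanity check on the stated size: bfloat16 stores each parameter in 2 bytes, so about 2.4B parameters should occupy a little under 5 GB. The sketch below uses only the rounded figures from this card. The exact parameter count is not given here, so the result approximates the 4.84 GB shard rather than reproducing it.

```python
# Rough size estimate from the rounded figures in this card.
BYTES_PER_BF16 = 2        # bfloat16 is a 16-bit format
approx_params = 2.4e9     # "~2.4B" as stated above (rounded, not exact)

approx_size_gb = approx_params * BYTES_PER_BF16 / 1e9  # decimal gigabytes
print(f"{approx_size_gb:.2f} GB")  # 4.80 GB
```

The small gap to 4.84 GB is consistent with a parameter count slightly above the rounded 2.4B.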
Inference
import librosa
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_class_from_dynamic_module

REPO = "Musci-research/Musci-ASR-2.4B"
DEVICE = "cuda:0"

# Load the model in bfloat16; the repository ships custom code, hence trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)

# The processor and mel config classes are defined in the repository's processing_Musci module.
MusciProcessor = get_class_from_dynamic_module("processing_Musci.MusciProcessor", REPO)
MelConfig = get_class_from_dynamic_module("processing_Musci.MelConfig", REPO)
mel_cfg = MelConfig(mel_sr=16000, mel_dim=128, mel_n_fft=400, mel_hop_length=160)
processor = MusciProcessor(tokenizer, config=mel_cfg, enable_time_marker=False)
processor.load_template(hf_hub_download(REPO, "chat_template_default.py"))

# librosa resamples to 16 kHz and downmixes to mono by default.
waveform, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio=waveform, return_tensors="pt").to(DEVICE)
# Match the audio features to the model's dtype (bfloat16).
inputs["audio_data"] = inputs["audio_data"].to(model.dtype)

# Greedy decoding: no sampling, single beam.
with torch.no_grad():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        eos_token_id=[processor.end_token_id],
    )

# Keep only the newly generated tokens, dropping the prompt.
new_ids = out_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
print(transcript)
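The slice on `out_ids` in the snippet above implies that `generate` returns the prompt tokens followed by the newly generated ones, so the prompt has to be dropped before decoding. A minimal illustration with plain Python lists follows. The token ids are made up, and `strip_prompt` is a hypothetical helper, not part of the repository.

```python
def strip_prompt(output_ids, prompt_len):
    """Drop the first `prompt_len` ids from every sequence in a batch."""
    return [seq[prompt_len:] for seq in output_ids]

# Made-up ids: a 3-token prompt followed by 4 generated tokens.
prompt = [101, 102, 103]
generated = [7, 8, 9, 2]
output_ids = [prompt + generated]  # batch of one sequence

new_ids = strip_prompt(output_ids, len(prompt))
print(new_ids)  # [[7, 8, 9, 2]]
```

With tensors, the same operation is the column slice `out_ids[:, prompt_len:]` used in the inference code.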
Audio frontend
- Sample rate: 16 kHz
- Features: Whisper log-mel filterbank with n_mels=128, n_fft=400, hop_length=160
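The frontend settings above translate into time units as follows. This is a small sketch of the arithmetic only. `approx_num_frames` is a hypothetical helper, the exact frame count depends on the processor's padding, and any further downsampling inside the audio encoder is not covered here.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # samples per analysis window
HOP_LENGTH = 160      # samples between consecutive frames

window_ms = 1000 * N_FFT / SAMPLE_RATE        # 25.0 ms window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE      # 10.0 ms hop
frames_per_second = SAMPLE_RATE / HOP_LENGTH  # 100.0 mel frames per second

def approx_num_frames(duration_s: float) -> int:
    """Approximate mel-frame count for a clip; ignores padding details."""
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH

print(window_ms, hop_ms, frames_per_second, approx_num_frames(30.0))
# 25.0 10.0 100.0 3000
```

Each frame carries 128 mel bins, so a 30-second clip yields a feature matrix of roughly 128 x 3000 before the encoder.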
License
Apache-2.0