---
language: en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech-to-text
- asr
- speech
- english
- qwen3
- audio
- reinforcement-learning
datasets:
- openslr/librispeech_asr
- speechcolab/gigaspeech
- mozilla-foundation/common_voice_17_0
- facebook/voxpopuli
- LIUM/tedlium
- edinburghcstr/ami
- anton-l/earnings22
- kensho/spgispeech
metrics:
- wer
model-index:
- name: Musci-ASR-2.4B
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Open ASR Leaderboard
      type: hf-audio/esb-datasets-test-only-sorted
    metrics:
    - type: wer
      value: 5.44
      name: Average WER
license: apache-2.0
---

# Musci-ASR-2.4B

Musci-ASR-2.4B is an English speech-to-text model that pairs a Qwen3-1.7B-base language-model backbone with a Qwen3-Omni-MoE audio encoder. A gated-MLP adapter projects audio features into the language-model embedding space. The model is trained on public English ASR corpora and fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits.

The model has approximately 2.4B parameters and is distributed as a single `bfloat16` safetensors shard of approximately 4.84 GB.

## Model Details

- **Developed by:** Musci Research
- **Model type:** Automatic Speech Recognition / speech-to-text model
- **Language:** English
- **License:** Apache-2.0
- **Library:** Transformers
- **Backbone:** Qwen3-1.7B-base, 28 layers, hidden size 2048
- **Audio encoder:** Qwen3-Omni-MoE audio encoder
- **Adapter:** Gated-MLP adapter, hidden size 8192
- **Parameter size:** approximately 2.4B
- **Checkpoint format:** `bfloat16` safetensors

## Intended Use

This model is intended for English automatic speech recognition, including transcription of English speech audio for research and evaluation purposes.

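
For evaluation use, the metric reported in this card's metadata is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition, assuming plain lowercasing and whitespace tokenization. The function name `word_error_rate` is illustrative and not part of this repository, and the Open ASR Leaderboard applies its own text normalization, which this sketch does not reproduce, so scores computed this way are not directly comparable to the reported figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split stands in for real text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(f"{100 * word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.2f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.
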
## Inference

```python
import librosa
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_class_from_dynamic_module

REPO = "Musci-research/Musci-ASR-2.4B"
DEVICE = "cuda:0"

# Load the model in bfloat16 and switch to inference mode.
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)

# Load the custom processor classes shipped with the repository.
MusciProcessor = get_class_from_dynamic_module("processing_Musci.MusciProcessor", REPO)
MelConfig = get_class_from_dynamic_module("processing_Musci.MelConfig", REPO)

# Log-mel settings; these match the Audio Frontend section below.
mel_cfg = MelConfig(
    mel_sr=16000,
    mel_dim=128,
    mel_n_fft=400,
    mel_hop_length=160,
)
processor = MusciProcessor(tokenizer, config=mel_cfg, enable_time_marker=False)
processor.load_template(hf_hub_download(REPO, "chat_template_default.py"))

# Load the audio, resampled to 16 kHz.
waveform, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio=waveform, return_tensors="pt").to(DEVICE)
# Cast the audio features to the model dtype (bfloat16).
inputs["audio_data"] = inputs["audio_data"].to(model.dtype)

# Greedy decoding: no sampling, single beam.
with torch.no_grad():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        eos_token_id=[processor.end_token_id],
    )

# Drop the prompt tokens and decode only the newly generated ones.
new_ids = out_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
print(transcript)
```

## Audio Frontend

- **Sample rate:** 16 kHz
- **Features:** Whisper log-mel filterbank
- **Mel bins:** 128
- **FFT size:** 400
- **Hop length:** 160

## Training

The model was trained on public English ASR corpora and fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits.

## Limitations

The model is designed for English ASR. It may perform worse on non-English speech, heavy accents, noisy recordings, overlapping speakers, far-field audio, domain-specific terminology, or audio conditions that differ significantly from the training and evaluation data.
The output should be manually reviewed before use in high-stakes settings.

## Citation

```bibtex
@misc{musci_asr_2025,
  title        = {{Musci-ASR-2.4B}},
  author       = {{Musci Research}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Musci-research/Musci-ASR-2.4B}}
}
```

## License

This model is released under the Apache-2.0 license.