---
license: apache-2.0
library_name: mlx
tags:
- mlx
- speech-to-text
- asr
- robust-asr
- qwen3-asr
base_model:
- zhifeixie/Mega-ASR
- Qwen/Qwen3-ASR-1.7B
language:
- en
- zh
pipeline_tag: automatic-speech-recognition
---

# Mega-ASR-bf16

This model was converted to MLX format from [`zhifeixie/Mega-ASR`](https://huggingface.co/zhifeixie/Mega-ASR) (built on [`Qwen/Qwen3-ASR-1.7B`](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)) using [mlx-audio](https://github.com/Blaizzy/mlx-audio).

Mega-ASR is a **robustness layer over Qwen3-ASR-1.7B**: a tiny audio-quality **router** classifies each utterance as clean or degraded and switches a dense **LoRA adapter** in or out of the base weights at inference. Degraded audio runs the LoRA (robust) path; clean audio runs the unmodified base path. This yields large WER reductions on noisy and far-field speech while leaving clean-speech accuracy unchanged.

> The base weights are stored as **dense bf16** on purpose: Mega-ASR adds fp32 LoRA deltas to the base at inference, so the base cannot be quantized without losing the runtime router/LoRA switching.
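
The switching described above can be illustrated with a toy sketch. It uses the standard LoRA formulation, in which the effective weight is the base matrix plus a scaled low-rank product, and a made-up threshold router. The function names, the `quality_score` input, the threshold, and the tiny matrices are all hypothetical. The real router is a learned classifier, and this card does not document how Mega-ASR applies its deltas internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(base, lora_a, lora_b, scale, degraded):
    """Return the weight matrix used for one utterance.

    Clean audio: the base matrix, untouched.
    Degraded audio: base + scale * (lora_b @ lora_a), a dense delta
    with the same shape as the base matrix.
    """
    if not degraded:
        return base
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


def route(quality_score, threshold=0.5):
    """Hypothetical router: low quality scores count as degraded audio."""
    return quality_score < threshold


# Toy 2x2 base weight with a rank-1 adapter (B is 2x1, A is 1x2).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, -0.5]]

clean = effective_weight(base, lora_a, lora_b, scale=2.0, degraded=route(0.9))
noisy = effective_weight(base, lora_a, lora_b, scale=2.0, degraded=route(0.2))
print("clean path:", clean)
print("robust path:", noisy)
```

Because the delta is added directly to floating-point base weights, the base has to stay in a format that supports that addition, which is the reason given above for shipping dense bf16 rather than a quantized checkpoint.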

## Use with mlx-audio

```bash
pip install mlx-audio
```

```python
from mlx_audio.stt import load

# Load the converted model by its Hub repository id.
model = load("mlx-community/Mega-ASR-bf16")

# Transcribe a local audio file, passing the spoken language as a language code.
result = model.generate("audio.wav", language="en")
print(result.text)
```

CLI:

```bash
python -m mlx_audio.stt.generate --model mlx-community/Mega-ASR-bf16 --audio audio.wav
```

The router decides per utterance automatically; no flags are needed.
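
Since routing happens per utterance, a batch of files can mix clean and noisy recordings. Below is a minimal sketch of a batch helper that relies only on the `generate(path, language=...)` call and the `.text` attribute shown above. The helper name `transcribe_files` and the file names are hypothetical.

```python
def transcribe_files(model, paths, language="en"):
    """Transcribe each path with the given model and return {path: text}."""
    transcripts = {}
    for path in paths:
        # Each call is one utterance, so the router picks a path per file.
        result = model.generate(path, language=language)
        transcripts[path] = result.text
    return transcripts


# Usage, assuming mlx-audio is installed and the audio files exist:
#   from mlx_audio.stt import load
#   model = load("mlx-community/Mega-ASR-bf16")
#   for path, text in transcribe_files(model, ["a.wav", "b.wav"]).items():
#       print(path, text)
```

Taking the model as an argument means it is loaded once and reused across files.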

## Validation

This conversion reproduces the robustness gains published in the paper. Word Error Rate (WER, lower is better) on the real **NOIZEUS** corpus (8 noise types × 4 SNR levels × 30 utterances, run on Apple Silicon):

| SNR | Base (Qwen3-ASR) | Mega-ASR (robust) | Paper base | Paper robust |
|---|---:|---:|---:|---:|
| 0 dB | 23.35 | 20.61 | 23.97 | 19.80 |
| 5 dB | 8.47 | 6.51 | - | - |
| 10 dB | 3.31 | 2.17 | 3.41 | 2.79 |
| 15 dB | 2.12 | 0.83 | - | - |
| **Overall** | **9.31** | **7.53** | **9.45** | **7.52** |

Overall robust WER is **7.53 vs the paper's 7.52**. Relative to the Qwen3-ASR baseline measured here (9.31), that is a reduction of about 19%, in line with the roughly 20% implied by the paper's figures (9.45 to 7.52). On clean read speech (FLEURS) the model matches plain Qwen3-ASR, as intended.
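
The summary figures can be checked against the table with a few lines of arithmetic. The sketch below copies the measured WER values and assumes the overall row is the unweighted mean of the four SNR rows. That assumption is consistent with the equal utterance count per SNR level, but the card does not state it.

```python
# Measured WER values copied from the table above, keyed by SNR in dB.
measured_base = {0: 23.35, 5: 8.47, 10: 3.31, 15: 2.12}
measured_robust = {0: 20.61, 5: 6.51, 10: 2.17, 15: 0.83}


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


def relative_reduction(base, robust):
    """Fraction of the baseline WER removed by the robust path."""
    return (base - robust) / base


overall_base = mean(measured_base.values())
overall_robust = mean(measured_robust.values())

print(f"overall base WER:   {overall_base:.2f}")
print(f"overall robust WER: {overall_robust:.2f}")
print(f"relative reduction, this run: {relative_reduction(overall_base, overall_robust):.1%}")
print(f"relative reduction, paper:    {relative_reduction(9.45, 7.52):.1%}")

# Per-SNR view: the relative gain from the robust path at each noise level.
for snr in sorted(measured_base):
    gain = relative_reduction(measured_base[snr], measured_robust[snr])
    print(f"{snr:>2} dB: {gain:.1%}")
```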

## License & attribution

Apache-2.0. Built on [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (adapter + router) and [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) (base).