---
license: apache-2.0
library_name: mlx
tags:
- mlx
- speech-to-text
- asr
- robust-asr
- qwen3-asr
base_model:
- zhifeixie/Mega-ASR
- Qwen/Qwen3-ASR-1.7B
language:
- en
- zh
pipeline_tag: automatic-speech-recognition
---
# Mega-ASR-bf16
This model was converted to MLX format from [`zhifeixie/Mega-ASR`](https://huggingface.co/zhifeixie/Mega-ASR) (built on [`Qwen/Qwen3-ASR-1.7B`](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)) using [mlx-audio](https://github.com/Blaizzy/mlx-audio).
Mega-ASR is a **robustness layer over Qwen3-ASR-1.7B**: a tiny audio-quality **router** classifies each utterance as clean or degraded and switches a dense **LoRA adapter** in or out of the base weights at inference. Degraded audio runs the LoRA (robust) path; clean audio runs the unmodified base path. This recovers large WER gains on noisy/far-field speech while leaving clean-speech accuracy unchanged.
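The switching idea can be pictured with a small, dependency-free sketch. This is not the Mega-ASR or mlx-audio implementation: the function names, the score threshold, the rank-1 factors, and the scale value are all hypothetical, chosen only to show how a per-utterance router decision selects between the untouched base weight and the base plus a LoRA delta.

```python
# Illustrative sketch only: hypothetical names and values, not the Mega-ASR code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def route(degradation_score, threshold=0.5):
    """Hypothetical router rule: scores above the threshold count as degraded."""
    return degradation_score > threshold


def effective_weight(base, lora_a, lora_b, scale, degraded):
    """Return the weight matrix used for one utterance.

    base:   (out x in) base weight
    lora_b: (out x r) and lora_a: (r x in) low-rank factors
    degraded: the router's binary decision for this utterance
    """
    if not degraded:
        return base  # clean audio: base weights are used untouched
    delta = matmul(lora_b, lora_a)  # (out x in) LoRA update
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]  # rank-1 factors for a toy 2x2 layer
lora_a = [[0.5, -0.5]]

clean_w = effective_weight(base, lora_a, lora_b, scale=2.0, degraded=route(0.1))
noisy_w = effective_weight(base, lora_a, lora_b, scale=2.0, degraded=route(0.9))
print(clean_w)  # [[1.0, 0.0], [0.0, 1.0]]
print(noisy_w)  # [[2.0, -1.0], [2.0, -1.0]]
```

Because the clean path returns the base weight unchanged, clean-speech output in this toy setup is identical to the base model's by construction.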
> The base weights are stored as **dense bf16** on purpose: Mega-ASR adds fp32 LoRA deltas to the base at inference, so the base cannot be quantized without losing the runtime router/LoRA switching.
## Use with mlx-audio
```bash
pip install mlx-audio
```
```python
from mlx_audio.stt import load
model = load("mlx-community/Mega-ASR-bf16")
result = model.generate("audio.wav", language="en")
print(result.text)
```
CLI:
```bash
python -m mlx_audio.stt.generate --model mlx-community/Mega-ASR-bf16 --audio audio.wav
```
The router decides per-utterance automatically; no flags needed.
## Validation
Reproduces the paper's published robustness gains. Word Error Rate (WER, %) on the real **NOIZEUS** corpus (8 noise types × 4 SNR levels × 30 utterances, run on Apple Silicon):
| SNR | base (Qwen3-ASR) | Mega-ASR (robust) | paper base | paper robust |
|---|---:|---:|---:|---:|
| 0 dB | 23.35 | 20.61 | 23.97 | 19.80 |
| 5 dB | 8.47 | 6.51 | n/a | n/a |
| 10 dB | 3.31 | 2.17 | 3.41 | 2.79 |
| 15 dB | 2.12 | 0.83 | n/a | n/a |
| **overall** | **9.31** | **7.53** | **9.45** | **7.52** |
Overall robust WER is **7.53 vs. the paper's 7.52**, reproducing the roughly 20% relative reduction over the Qwen3-ASR baseline (9.31 to 7.53 in this run, about 19%). On clean read speech (FLEURS) the model matches plain Qwen3-ASR, as intended.
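For readers who want to check the arithmetic, here is a minimal sketch of how a word error rate and a relative reduction can be computed. It is not the evaluation script behind the table: it assumes plain whitespace tokenization with no text normalization, pools errors over all reference words, and uses toy sentence pairs rather than NOIZEUS data. The helper names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER in percent over (reference, hypothesis) pairs, pooled over all words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return 100.0 * errors / words


def relative_reduction(baseline, improved):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# Toy pairs for illustration, not NOIZEUS data.
pairs = [("the birch canoe slid", "the birch canoe slid"),
         ("glue the sheet", "glue a sheet now")]
print(round(corpus_wer(pairs), 2))                # 28.57 (2 errors over 7 words)
print(round(relative_reduction(9.31, 7.53), 1))   # 19.1
```

The second printed value applies the same formula to the overall row of the table above.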
## License & attribution
Apache-2.0. Built on [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (adapter + router) and [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) (base).