	---
	license: apache-2.0
	library_name: mlx
	tags:
	- mlx
	- speech-to-text
	- asr
	- robust-asr
	- qwen3-asr
	base_model:
	- zhifeixie/Mega-ASR
	- Qwen/Qwen3-ASR-1.7B
	language:
	- en
	- zh
	pipeline_tag: automatic-speech-recognition
	---

	# Mega-ASR-bf16

	This model was converted to MLX format from [`zhifeixie/Mega-ASR`](https://huggingface.co/zhifeixie/Mega-ASR) (built on [`Qwen/Qwen3-ASR-1.7B`](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)) using [mlx-audio](https://github.com/Blaizzy/mlx-audio).

	Mega-ASR is a robustness layer over Qwen3-ASR-1.7B. A tiny audio-quality router classifies each utterance as clean or degraded and switches a dense LoRA adapter in or out of the base weights at inference: degraded audio runs the LoRA (robust) path, and clean audio runs the unmodified base path. This yields large WER reductions on noisy and far-field speech while leaving clean-speech accuracy unchanged.

	> The base weights are stored as dense bf16 on purpose: Mega-ASR adds fp32 LoRA deltas to the base at inference, so the base cannot be quantized without losing the runtime router/LoRA switching.
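
The switching described above can be pictured as adding a low-rank update to a base weight matrix only when the router flags an utterance as degraded. The sketch below is a minimal pure-Python illustration under stated assumptions, not the actual Mega-ASR or mlx-audio code: `effective_weight`, `lora_a`, `lora_b`, `scale`, and the boolean `degraded` flag are hypothetical names, and the standard LoRA form `W + scale * (B @ A)` is assumed.

```python
# Illustrative only: hypothetical names, standard LoRA update W + scale * (B @ A).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(base, lora_b, lora_a, scale, degraded):
    """Return the weight matrix to use for one utterance.

    Clean audio keeps the base weights untouched; degraded audio adds the
    low-rank LoRA delta on top of them.
    """
    if not degraded:
        return base
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 base weight
lora_b = [[1.0], [2.0]]          # 2x1 factor (rank 1)
lora_a = [[0.5, 0.25]]           # 1x2 factor (rank 1)

clean = effective_weight(base, lora_b, lora_a, scale=2.0, degraded=False)
robust = effective_weight(base, lora_b, lora_a, scale=2.0, degraded=True)
```

Because the delta is added to unquantized weights at run time, the base has to stay in a dense format, which is the reason given above for shipping bf16.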

	## Use with mlx-audio

	```bash
	pip install mlx-audio
	```

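	```python
	from mlx_audio.stt import load

	# Load the converted MLX weights by Hugging Face repository ID.
	model = load("mlx-community/Mega-ASR-bf16")
	# Transcribe a local audio file, passing the spoken language as a hint.
	result = model.generate("audio.wav", language="en")
	print(result.text)
	```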

	CLI:

	```bash
	python -m mlx_audio.stt.generate --model mlx-community/Mega-ASR-bf16 --audio audio.wav
	```

	The router decides per-utterance automatically; no flags needed.

	## Validation

	This conversion reproduces the paper's published robustness gains. Word Error Rate (WER, in %) on the real NOIZEUS corpus (8 noise types × 4 SNR levels × 30 utterances), measured on Apple Silicon:

	| SNR | base (Qwen3-ASR) | Mega-ASR (robust) | paper base | paper robust |
	|---|---:|---:|---:|---:|
	| 0 dB | 23.35 | 20.61 | 23.97 | 19.80 |
	| 5 dB | 8.47 | 6.51 | n/a | n/a |
	| 10 dB | 3.31 | 2.17 | 3.41 | 2.79 |
	| 15 dB | 2.12 | 0.83 | n/a | n/a |
	| overall | 9.31 | 7.53 | 9.45 | 7.52 |
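
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the transcript, divided by the number of reference words. The following is a minimal sketch of that definition, not the scoring script behind the table: the function name `word_error_rate` is hypothetical, text normalization is assumed to be lowercasing plus whitespace splitting, and the sentence pair is a made-up example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word and one substituted word against 8 reference words.
wer = 100 * word_error_rate(
    "the birch canoe slid on the smooth planks",
    "the birch canoe slid on smooth plants",
)
print(f"WER: {wer:.2f}%")
```

Real evaluations usually apply more careful normalization (punctuation, numerals, casing) before scoring, so absolute numbers from this sketch may differ from published ones.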

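	Overall robust WER is 7.53 versus the paper's 7.52. That is a relative reduction of about 19% over the measured Qwen3-ASR baseline (9.31), in line with the roughly 20% implied by the paper's figures (9.45 to 7.52). On clean read speech (FLEURS) the model matches plain Qwen3-ASR, as intended.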
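
The summary figures can be checked directly from the table. The sketch below assumes the overall row is the unweighted mean of the four SNR rows, which matches the reported values to two decimals; the original scoring may pool errors differently. The helper names are hypothetical.

```python
# WER values (%) copied from the table above, keyed by SNR in dB.
measured_base = {0: 23.35, 5: 8.47, 10: 3.31, 15: 2.12}
measured_robust = {0: 20.61, 5: 6.51, 10: 2.17, 15: 0.83}


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


def relative_reduction(baseline, improved):
    """Relative WER reduction in percent."""
    return 100 * (baseline - improved) / baseline


base_avg = mean(measured_base.values())
robust_avg = mean(measured_robust.values())

measured_gain = relative_reduction(base_avg, robust_avg)
paper_gain = relative_reduction(9.45, 7.52)  # paper's overall row

print(f"measured: {base_avg:.2f} -> {robust_avg:.2f} ({measured_gain:.1f}% relative)")
print(f"paper:    9.45 -> 7.52 ({paper_gain:.1f}% relative)")
```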

	## License & attribution

	Apache-2.0. Built on [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (adapter + router) and [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) (base).