---
language:
- en
- zh
license: apache-2.0
pipeline_tag: automatic-speech-recognition
datasets:
- zhifeixie/Voices-in-the-Wild-2M
tags:
- automatic-speech-recognition
- speech-recognition
- audio
- robust-asr
- qwen3-asr
---

# Mega-ASR: Towards In-the-wild^2 Speech Recognition

[**Paper**](https://huggingface.co/papers/2605.19833) | [**Project Page**](https://xzf-thu.github.io/Mega-ASR/) | [**Code**](https://github.com/xzf-thu/Mega-ASR)

Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.

The release contains the Qwen3-ASR-1.7B foundation model files, Mega-ASR adaptation weights, and an audio quality router. The router decides whether to use the robust Mega-ASR path or the base recognition path for each input, which helps preserve clean-speech recognition quality while improving robustness on degraded speech.

## Model Details

- **Model name:** Mega-ASR
- **Task:** Automatic speech recognition
- **Backbone:** Qwen3-ASR-1.7B
- **Primary use case:** In-the-wild ASR under challenging acoustic conditions
- **Default decoding:** Greedy decoding
- **Default max new tokens:** 256 in the Mega-ASR inference wrapper
- **Router:** Audio quality classifier with a default threshold of 0.5
- **License:** Apache-2.0
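
The router threshold listed above can be read as a simple decision rule. The sketch below is illustrative only: it assumes the router emits a score in [0, 1] where higher means more degraded audio, and that a score equal to the threshold goes to the robust path. The names `RouteDecision` and `choose_route` are hypothetical and are not part of the Mega-ASR codebase; the actual wrapper may define the score and the tie-breaking differently.

```python
from dataclasses import dataclass


@dataclass
class RouteDecision:
    path: str     # "robust" (Mega-ASR path) or "base" (backbone path)
    score: float  # router score that produced the decision


def choose_route(degraded_prob: float, threshold: float = 0.5) -> RouteDecision:
    """Pick a recognition path from a router score in [0, 1].

    Assumption: higher scores mean the audio looks more degraded.
    """
    if not 0.0 <= degraded_prob <= 1.0:
        raise ValueError("router score must be in [0, 1]")
    # Scores at or above the threshold are sent to the robust path.
    path = "robust" if degraded_prob >= threshold else "base"
    return RouteDecision(path=path, score=degraded_prob)


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, choose_route(score).path)
```

Under these assumptions, raising the threshold sends more inputs to the base path, and lowering it sends more inputs to the robust path.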

## Repository Contents

```text
Mega-ASR/
├── Qwen3-ASR-1.7B/              # Backbone model, tokenizer, processor, and generation config
├── mega-asr-merged/             # Mega-ASR adaptation weights used by the inference wrapper
├── audio_quality_router/        # Audio quality router checkpoint
└── README.md                    # Model card
```

## Intended Use

Mega-ASR is intended for speech-to-text transcription of real-world audio, especially audio affected by compound acoustic distortions. Example scenarios include far-field recording, environmental noise, reverberation, low-quality microphones, compression artifacts, partial signal corruption, and mixed acoustic conditions.

## Quick Start

### Installation

Install the Mega-ASR codebase and dependencies:

```bash
git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR

conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt
```

### Python Usage

```python
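from MegaASR.model.megaASR import MegaASR

# Point the wrapper at the backbone directory and the router checkpoint
# from the downloaded release (paths are relative to the repository root).
model = MegaASR(
    model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B",
    router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt",
    routing_enabled=True,  # let the router choose between the robust and base paths
)

# Transcribe one file; return_route=True requests the routing decision as well.
result = model.infer("/path/to/audio.wav", return_route=True)
print(result)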
```

## Training Summary

Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning (A2S-SFT) on the **Voices-in-the-Wild-2M** dataset, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.

## Evaluation

Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, and in-the-wild compound acoustic scenarios. The recommended evaluation metrics are:

- **WER** for English and whitespace-tokenized languages
- **CER** for Chinese and character-based evaluation
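
As a reference for what these two metrics measure, here is a minimal sketch that computes both from a Levenshtein edit distance: WER over whitespace-separated words, CER over non-whitespace characters. It applies no text normalization (casing, punctuation, number formatting), so its numbers may differ from those of the repository's evaluation script, whose normalization is not described in this card. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    print(wer("the cat sat", "the bat sat"))
    print(cer("你好世界", "你好世"))
```

Both rates are normalized by the reference length, so a hypothesis with many insertions can score above 1.0.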

The Mega-ASR repository includes an evaluation script:

```bash
python src/MegaASR/eval/evaluate_wer.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.jsonl
```

## Citation

If you use Mega-ASR, please cite the project:

```bibtex
@misc{xie2026megaasrinthewild2speechrecognition,
      title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation},
      author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao},
      year={2026},
      eprint={2605.19833},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.19833},
}
```

## Acknowledgements

Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of the public speech and audio datasets used in the project.