Mega-ASR: Towards In-the-wild² Speech Recognition
via scaling up real-world acoustic simulation
You’ll come back to Mega-ASR — after finding the rest fail in the real world.

Zhifei Xie1*, Kaiyu Pang3*, Haobin Zhang2*, Deheng Ye1,†, Xiaobin Hu2,†
Shuicheng Yan2,†, Chunyan Miao1,†

1NTU    2NUS    3Shanghai AI Lab

We introduce Mega-ASR, the first foundation ASR model to target full-scenario robust speech recognition in the wild through systematic training on 7 atomic acoustic conditions and 54 compound acoustic scenarios. Built on 2.6M training samples covering noise, far-field speech, obstruction, echo and reverberation, recording artifacts, electronic distortion, and transmission dropout, Mega-ASR uses A2S-SFT followed by DG-WGPO-based RL to achieve gains of up to nearly 30% over leading open- and closed-source SOTA models in challenging acoustic environments. Mega-ASR is fully open.

Project teaser figure

ASR Robustness Across Challenging Conditions

Mega-ASR is trained on data spanning 7 atomic acoustic effects (reverberation, echo, additive noise, far-field, frequency dropout, bandwidth limitation, and clipping distortion) and 54 compound environmental scenarios built on top of them. It then uses reinforcement learning to simultaneously handle the semantic-reconstruction and detail-recovery challenges that arise at different WER regimes.
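
As a rough illustration of how atomic effects can be stacked into a compound scenario, the following Python sketch chains three simple time-domain effects: a single echo, additive white noise at a target SNR, and hard clipping. It is only a toy. The function names, the parameter values, and the time-domain formulation are assumptions made for this example and are not the Mega-ASR simulation pipeline.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Additive white Gaussian noise scaled to a target SNR in dB."""
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def add_echo(signal, delay, gain):
    """Single echo: add a delayed, attenuated copy of the signal."""
    out = list(signal)
    for i in range(delay, len(signal)):
        out[i] += gain * signal[i - delay]
    return out


def clip(signal, limit):
    """Hard clipping distortion: saturate samples at +/- limit."""
    return [max(-limit, min(limit, x)) for x in signal]


def compose(signal, effects):
    """Apply atomic effects in order to form a compound scenario."""
    for effect in effects:
        signal = effect(signal)
    return signal


# Hypothetical compound scenario: echo, then noise, then clipping,
# applied to one second of a 440 Hz tone standing in for speech.
rng = random.Random(0)
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
         for n in range(sample_rate)]
compound = compose(clean, [
    lambda s: add_echo(s, delay=1600, gain=0.4),
    lambda s: add_noise(s, snr_db=10, rng=rng),
    lambda s: clip(s, limit=0.5),
])
```

Keeping each effect as a plain function from signal to signal means a compound scenario is just an ordered list of atomic effects, so new combinations can be described without new code.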

Systems compared: Qwen3-ASR (Qwen Team), Mega-ASR [Ours], Gemini-3-Pro (Google DeepMind), Seed-ASR (Seed), Whisper (OpenAI)

Empty Output Recovery
Ground truth: "…and said to him let us go and eat some honey. Whose honey? inquired Kobay cautiously. My father's, Soongoora replied. Oh, all right, I'm with you, said the tortoise eagerly, and away they went."
Qwen3-ASR (WER 100.0): <empty>
Mega-ASR [Ours] (WER 47.1): "He said to him let's go and eat some honey. It's honey? he inquired very cautiously. My father is Superabundant — oh, all right, I will, he said to her eagerly, and away they went."
Gemini-3-Pro (WER 86.1): "But tell me, that's how she met my father's sister. Oh, alright. I wish… I really…"
Seed-ASR (WER 85.3): "My father is. Oh, all right, I wish you can."
Whisper (WER 92.5): "…to him… some honey… oh yeah…"

Long-Utterance Semantic Recovery
Ground truth: "To waste, I skip forty years, said the baker in tears, and proceed without further remark to the day when you took me aboard of your ship to help you in hunting the snark."
Qwen3-ASR (WER 64.7): "I skipped 40 years. Second day in here. Ever since you left, I've been a monk. To the day when you took me aboard a ship to help you, I was a scout."
Mega-ASR [Ours] (WER 5.9): "To witness, I skip forty years, said the baker in tears, and proceeded without further remark to the day when you took me aboard of your ship to help you in hunting the snark."
Gemini-3-Pro (WER 64.7): "I spent forty years at sea and never seen a rougher than the day that you took me aboard your ship. I'll be with you always in the stars."
Seed-ASR (WER 38.2): "To wait. I skip forty years. Saturday and years. And proceed without further remark. To the day when you took me on board your ship. I can help you with losing the weight."
Whisper (WER 71.5): "I skip forty years… to the day you took me on a ship… to hunt the shark."

Babble Noise & Hallucination
Ground truth: "The friendly gang left the drug store."
Qwen3-ASR (WER 57.1): "It's a friendly gang. That's the drug gang."
Mega-ASR [Ours] (WER 0.0): "The friendly gang left the drug store."
Gemini-3-Pro (WER 42.9): "Friendly gang left the drugs."
Seed-ASR (WER 28.6): "The friendly gang left the drugstore."
Whisper (WER 62.3): "A friendly young man left the drug store."

Restaurant Noise Recovery
Ground truth: "The set of china hit the floor with a crash."
Qwen3-ASR (WER 40.0): "The bed is fine. It hit the floor with a crash."
Mega-ASR [Ours] (WER 0.0): "The set of china hit the floor with a crash."
Gemini-3-Pro (WER 100.0): "He said it's fine I hit the forward slash."
Seed-ASR (WER 20.0): "The sound of china hits the floor with a crash."
Whisper (WER 55.0): "The chef of China hit the floor with a clash."

Financial Entity Recovery
Ground truth: "Among export-led electrical and computer makers, Japan Victor Company fell fifty to two thousand three hundred twenty."
Qwen3-ASR (WER 38.9): "Among export-led computer makers, Japan VictorNet sold fifty-two thousand three hundred fifty."
Mega-ASR [Ours] (WER 11.1): "Among export-led computer makers, Japan Victor Company fell fifty to two thousand three hundred twenty."
Gemini-3-Pro (WER 35.7): "Among export-led computer makers, Japan Victor Co. fell 50 to 2,350 yen."
Seed-ASR (WER 50.0): "Among export-led in computer makers, Japan Victor Company sell 50 to 2300 unit."
Whisper (WER 66.7): "Among exporters, computer makers in Japan victor companies sold fifty…"

Phrase Recovery
Ground truth: "Has exposure really been reduced?"
Qwen3-ASR (WER 40.0): "Has exposure really done you?"
Mega-ASR [Ours] (WER 0.0): "Has exposure really been reduced."
Gemini-3-Pro (WER 80.0): "Has the closure really affected you?"
Seed-ASR (WER 60.0): "Has exposure to beauty products."
Whisper (WER 78.5): "Have those who really been refused?"

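The WER figures above follow the usual definition: word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal Python version of that definition. The text normalization (lowercasing, stripping punctuation) is an assumption, since the page does not state which normalization was used for its scores.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def wer(reference, hypothesis):
    """Word error rate in percent, via word-level Levenshtein distance."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


reference = "The friendly gang left the drug store."
score = wer(reference, "The friendly gang left the drugstore.")
```

Under this normalization, "The friendly gang left the drugstore." costs one substitution and one deletion against a 7-word reference, which gives 2/7, or about 28.6, the value listed for that transcript above. An empty hypothesis scores 100.0.
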
Scaling up and Post-training in Mega-ASR

Toward next-generation robust ASR: ASR-in-the-wild²

Voices-in-the-wild-2M 2M items

Voices-in-the-wild-2M dataset composition
2.4M Items
7 meta-effects
54 compound scenarios
11k Hours

Voices-in-the-wild-2M is a large-scale ASR dataset built from 7 canonical meta-scenarios and 54 newly constructed compound scenarios, synthesized through a spectral-manipulation-based simulation pipeline with an agentic check for physical plausibility. We further calibrate the difficulty distribution and filter out samples above 70% WER, yielding training data that is both genuinely in-the-wild and stable to learn from.
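
The 70% WER cut-off can be pictured as a simple filter over samples scored by some reference recognizer. The Python sketch below is illustrative only: the `Sample` record, its field names, and the 10-point difficulty buckets are hypothetical, and the page does not describe how the scoring or calibration is actually carried out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sample:
    audio_id: str        # hypothetical identifier
    scenario: str        # atomic or compound scenario label
    baseline_wer: float  # WER in percent from an assumed reference recognizer


def filter_by_wer(samples, max_wer=70.0):
    """Keep samples whose baseline WER does not exceed the threshold."""
    return [s for s in samples if s.baseline_wer <= max_wer]


def difficulty_histogram(samples, width=10):
    """Count samples per WER bucket, keyed by the bucket's lower bound."""
    return Counter(int(s.baseline_wer // width) * width for s in samples)


samples = [
    Sample("a", "additive noise", 12.5),
    Sample("b", "echo + far-field", 70.0),
    Sample("c", "clipping + dropout", 93.0),
]
kept = filter_by_wer(samples)
histogram = difficulty_histogram(kept)
```

The histogram is one simple way to inspect the difficulty distribution that remains after filtering.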

Training Recipe A2S-SFT → DG-WGPO

Mega-ASR training pipeline diagram

Mega-ASR is trained with Acoustic-to-Semantic Progressive SFT (A2S-SFT), which builds up the ability to extract and recover semantics under heavy acoustic perturbation. On top of the resulting Mega-ASR-Base, we apply Dual-Granularity WER-Gated Policy Optimization (DG-WGPO), an RL stage that fuses a token-level refinement reward with a sentence-level reconstruction reward — keeping the learning signal effective even when WER exceeds 30%.
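
The exact DG-WGPO reward is not given on this page, so the following Python sketch shows only one possible reading of a WER-gated, dual-granularity reward. All of it is an assumption made for illustration: the hard gate at 30% WER, the token-level term `1 - WER`, and the bag-of-words F1 used as a stand-in for a sentence-level reconstruction score.

```python
from collections import Counter


def token_level_reward(wer_pct):
    """Fine-grained term: rewards each word-level error removed."""
    return max(0.0, 1.0 - wer_pct / 100.0)


def sentence_level_reward(ref_words, hyp_words):
    """Coarse term: bag-of-words F1, a stand-in for semantic overlap."""
    if not ref_words or not hyp_words:
        return 0.0
    overlap = sum((Counter(ref_words) & Counter(hyp_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_words)
    recall = overlap / len(ref_words)
    return 2 * precision * recall / (precision + recall)


def gated_reward(wer_pct, ref_words, hyp_words, gate_pct=30.0):
    """Assumed gate: token-level below the threshold, sentence-level above."""
    if wer_pct <= gate_pct:
        return token_level_reward(wer_pct)
    return sentence_level_reward(ref_words, hyp_words)


low_wer = gated_reward(10.0, ["a", "b", "c", "d"], ["a", "b", "c", "x"])
high_wer = gated_reward(50.0, ["a", "b", "c", "d"], ["a", "b", "x"])
```

The point of the gate in this sketch is that a heavily corrupted hypothesis still gets a graded signal from word overlap, while a nearly correct one is scored on its remaining word errors.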

Results

Mega-ASR is evaluated across three benchmark families — classical academic test sets, robustness benchmarks, and our own in-the-wild compound benchmark.

01 · Robust ASR Benchmarks (results figure)
02 · Voices-in-the-wild-Bench (results figure)
03 · Traditional Benchmarks (results figure)

Experience Mega-ASR

Mega-ASR excels in every environment.


ASR is a mountain — we are still at the foot of it, but we will keep climbing.