Voices-in-the-Wild-Bench Leaderboard

Dataset Summary

Voices-in-the-Wild-Bench contains 5,000 speech examples: 3,500 synthetic samples and 1,500 real-recorded samples from 16 speakers. The data is split evenly between Chinese and English.

5,000 total samples

3,500 synthetic speech samples

1,500 real-recorded speech samples

8 acoustic categories

Leaderboard

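Results broken down by acoustic scenario. Scores are error rates (CER for Chinese, WER for English); lower is better. Real. and Sim. denote real-recorded and synthetic speech, respectively. Column abbreviations: Far. (far-field), Obst. (obstructed), Echo. (echo), Record. (recording), Elc.Dis. (distortion), Trans.Drop. (dropout).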

Model	Noise		Far.		Obst.		Echo.		Record.		Elc.Dis.		Trans.Drop.		Mixed
Model	Real.	Sim.	Real.	Sim.	Real.	Sim.	Real.	Sim.	Real.	Sim.	Real.	Sim.	Real.	Sim.	Real.	Sim.
Closed-source models
Gemini3-Flash	7.63	10.61	5.14	1.90	3.73	2.65	8.75	14.86	8.38	19.85	3.15	7.56	5.47	7.65	7.99	9.62
Seed-ASR	8.21	8.11	3.06	3.19	3.10	2.76	16.55	18.21	18.48	23.33	3.89	5.71	7.97	7.46	6.88	9.29
GPT-4o-trans.	13.19	45.78	1.87	2.39	1.57	2.77	15.62	28.76	13.37	22.60	3.70	8.43	8.76	7.71	5.62	11.00
Open-source models
Whisper-L-v3	16.57	18.19	3.38	6.85	3.06	6.01	25.34	39.87	18.33	31.81	3.74	8.77	7.04	8.05	8.91	14.79
Qwen2.5-Omni	11.92	17.88	2.35	2.44	2.40	2.08	20.01	32.64	13.71	30.09	2.46	5.96	6.34	5.88	6.40	10.29
Kimi-Audio	35.10	14.59	2.71	1.92	2.49	1.64	24.00	26.58	8.73	18.09	1.83	2.78	4.54	6.33	4.44	6.19
Qwen3-ASR	7.51	9.52	2.23	1.54	1.73	1.27	10.40	14.61	9.57	19.42	1.54	3.41	4.16	4.19	3.30	5.39
Our model
Mega-ASR	6.33	8.26	2.35	1.61	1.62	1.23	8.62	12.59	7.65	14.21	1.71	3.72	2.59	2.62	2.73	4.57
Mega-ASR w/ router	6.12	8.09	2.33	1.69	1.80	1.41	8.66	12.22	6.91	13.23	1.60	3.35	2.72	2.88	2.63	4.53
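
To compare models at a glance, one option is an unweighted average over the eight scenarios, computed separately for the Real. and Sim. columns. The sketch below parses one tab-separated row of the table above. The helper names `split_row` and `macro_average` are illustrative and not part of the repository, and the result is not a sample-weighted overall score, because the table does not give per-column sample counts.

```python
from statistics import mean

# Scenario order as it appears in the leaderboard header.
SCENARIOS = ["Noise", "Far.", "Obst.", "Echo.", "Record.",
             "Elc.Dis.", "Trans.Drop.", "Mixed"]


def split_row(row: str):
    """Parse one tab-separated leaderboard row.

    Returns (model name, Real. scores, Sim. scores), each score list in
    SCENARIOS order. Columns alternate Real., Sim. for every scenario.
    """
    name, *cells = row.split("\t")
    scores = [float(cell) for cell in cells]
    if len(scores) != 2 * len(SCENARIOS):
        raise ValueError(f"expected {2 * len(SCENARIOS)} scores, got {len(scores)}")
    return name, scores[0::2], scores[1::2]


def macro_average(row: str):
    """Unweighted mean error rate over scenarios, for Real. and Sim. columns."""
    name, real, sim = split_row(row)
    return name, mean(real), mean(sim)


# Row copied from the table above.
row = ("Mega-ASR\t6.33\t8.26\t2.35\t1.61\t1.62\t1.23\t8.62\t12.59"
       "\t7.65\t14.21\t1.71\t3.72\t2.59\t2.62\t2.73\t4.57")
name, real_avg, sim_avg = macro_average(row)
print(f"{name}: Real. {real_avg:.2f}, Sim. {sim_avg:.2f}")
```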

Reproduction

Use scripts/run_inference.py to generate predictions and scripts/evaluate_predictions.py to compute Chinese CER and English WER. Mega-ASR is the public name for the merged_v2 model used in the paper.
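
For reference, CER and WER are both edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length (in characters for CER, in words for WER). The following is a minimal sketch of that definition, not the code in scripts/evaluate_predictions.py. It assumes corpus-level aggregation, reports percentages, and applies no text normalization, so its numbers may differ from the official script.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edits divided by total reference tokens, in percent."""
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("references contain no tokens")
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * edits / total


def wer(refs: List[str], hyps: List[str]) -> float:
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate([r.split() for r in refs], [h.split() for h in hyps])


def cer(refs: List[str], hyps: List[str]) -> float:
    """Character error rate: whitespace is removed before comparing."""
    def chars(text: str) -> List[str]:
        return list("".join(text.split()))
    return error_rate([chars(r) for r in refs], [chars(h) for h in hyps])


print(f"WER: {wer(['the cat sat'], ['the cat sit']):.2f}")
print(f"CER: {cer(['今天天气很好'], ['今天天汽很好']):.2f}")
```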

The full benchmark is available on Hugging Face. The repository includes data/examples.jsonl and eight local audio clips for lightweight smoke tests.

Example Audio

Noise

Background noise and additive interference.

Far-field

Distant microphone capture.

Obstructed

Physical or spectral obstruction.

Distortion

Clipping and signal degradation.

Recording

Channel and device artifacts.

Echo

Echo-heavy reverberant speech.

Dropout

Missing or discontinuous speech.

Mixed

Combined acoustic perturbations.

Model Wrappers

CLI name	Backend	Checkpoint
`whisper-large-v3`	Transformers	`openai/whisper-large-v3`
`canary-1b-v2`	NVIDIA NeMo	`nvidia/canary-1b-v2`
`parakeet-tdt-0.6b-v3`	NVIDIA NeMo	`nvidia/parakeet-tdt-0.6b-v3`
`qwen3-asr-1.7b`	Qwen ASR runtime	`Qwen/Qwen3-ASR-1.7B`
`kimi-audio`	Kimi-Audio runtime	`--model-path` or `KIMI_AUDIO_MODEL_PATH`
`step-audio-2-mini`	Step-Audio2 runtime	`--model-path` or `STEP_AUDIO2_MODEL_PATH`
`mega-asr`	Qwen ASR runtime	`--model-path /path/to/Mega-ASR`
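
The table lists two ways to point a wrapper at a local checkpoint: the `--model-path` flag or a model-specific environment variable. The sketch below shows one plausible way to resolve the path. It assumes the flag takes precedence over the environment variable, and the `--model` flag and the function names are hypothetical, so check scripts/run_inference.py for the actual behavior.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence

# Environment variables taken from the wrapper table above.
ENV_VARS = {
    "kimi-audio": "KIMI_AUDIO_MODEL_PATH",
    "step-audio-2-mini": "STEP_AUDIO2_MODEL_PATH",
}


def resolve_model_path(
    model: str,
    cli_path: Optional[str],
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the checkpoint path for a wrapper, or None if none is set.

    Assumed precedence: an explicit --model-path wins, then the
    model-specific environment variable.
    """
    if cli_path:
        return cli_path
    env_var = ENV_VARS.get(model)
    if env_var is not None:
        return environ.get(env_var) or None
    return None


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Model path resolution sketch")
    # "--model" is a hypothetical flag name used only for this illustration.
    parser.add_argument("--model", required=True)
    parser.add_argument("--model-path", default=None)
    return parser.parse_args(argv)


args = parse_args(["--model", "kimi-audio"])
path = resolve_model_path(
    args.model, args.model_path, {"KIMI_AUDIO_MODEL_PATH": "/models/kimi-audio"}
)
print(path)
```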

Categories

Category	Samples	Description
`noise`	500	Background noise and additive acoustic interference.
`far_field`	500	Distant microphone and reverberant capture conditions.
`obstructed`	500	Speech affected by physical or spectral obstruction.
`distortion`	500	Clipping, nonlinear distortion, and signal degradation.
`recording`	500	Recording coloration, channel effects, and related artifacts.
`echo`	500	Echo-heavy and reverberant speech.
`dropout`	500	Missing, repeated, or discontinuous speech segments.
`mixed`	1,500	Combinations of multiple acoustic conditions.
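
As a quick consistency check, the per-category counts above should add up to the 5,000 samples reported in the dataset summary. The snippet below is an illustrative sketch using the numbers from the table; the dictionary name is arbitrary.

```python
# Sample counts copied from the category table above.
CATEGORY_SAMPLES = {
    "noise": 500,
    "far_field": 500,
    "obstructed": 500,
    "distortion": 500,
    "recording": 500,
    "echo": 500,
    "dropout": 500,
    "mixed": 1500,
}

total = sum(CATEGORY_SAMPLES.values())
assert total == 5000, f"unexpected total: {total}"

# Fraction of the benchmark contributed by each category.
share = {name: count / total for name, count in CATEGORY_SAMPLES.items()}
for name, fraction in share.items():
    print(f"{name:<12}{fraction:.0%}")
```

Because `mixed` holds three times as many samples as any other category, a sample-weighted average over the whole benchmark leans toward the mixed condition more than an unweighted average over categories does.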
