Voices-in-the-Wild-Bench

A bilingual benchmark for robust speech understanding under realistic acoustic conditions, including noise, far-field capture, obstruction, recording artifacts, echo, dropout, and mixed perturbations.

Robust ASR | ZH / EN | 5,000 samples | CER + WER

Dataset Summary

Voices-in-the-Wild-Bench contains 5,000 speech examples: 3,500 synthetic samples and 1,500 real-recorded samples from 16 speakers. The examples are split evenly between Chinese and English.

Total samples: 5,000
Synthetic speech: 3,500
Real-recorded speech: 1,500
Acoustic categories: 8

Leaderboard

Results broken down by acoustic scenario. Scores are error rates; lower is better. Real. and Sim. denote real-recorded and synthetic speech, respectively.

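| Model | Noise Real. | Noise Sim. | Far. Real. | Far. Sim. | Obst. Real. | Obst. Sim. | Echo. Real. | Echo. Sim. | Record. Real. | Record. Sim. | Elc.Dis. Real. | Elc.Dis. Sim. | Trans.Drop. Real. | Trans.Drop. Sim. | Mixed Real. | Mixed Sim. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source models | | | | | | | | | | | | | | | | |
| Gemini3-Flash | 7.63 | 10.61 | 5.14 | 1.90 | 3.73 | 2.65 | 8.75 | 14.86 | 8.38 | 19.85 | 3.15 | 7.56 | 5.47 | 7.65 | 7.99 | 9.62 |
| Seed-ASR | 8.21 | 8.11 | 3.06 | 3.19 | 3.10 | 2.76 | 16.55 | 18.21 | 18.48 | 23.33 | 3.89 | 5.71 | 7.97 | 7.46 | 6.88 | 9.29 |
| GPT-4o-trans. | 13.19 | 45.78 | 1.87 | 2.39 | 1.57 | 2.77 | 15.62 | 28.76 | 13.37 | 22.60 | 3.70 | 8.43 | 8.76 | 7.71 | 5.62 | 11.00 |
| Open-source models | | | | | | | | | | | | | | | | |
| Whisper-L-v3 | 16.57 | 18.19 | 3.38 | 6.85 | 3.06 | 6.01 | 25.34 | 39.87 | 18.33 | 31.81 | 3.74 | 8.77 | 7.04 | 8.05 | 8.91 | 14.79 |
| Qwen2.5-Omni | 11.92 | 17.88 | 2.35 | 2.44 | 2.40 | 2.08 | 20.01 | 32.64 | 13.71 | 30.09 | 2.46 | 5.96 | 6.34 | 5.88 | 6.40 | 10.29 |
| Kimi-Audio | 35.10 | 14.59 | 2.71 | 1.92 | 2.49 | 1.64 | 24.00 | 26.58 | 8.73 | 18.09 | 1.83 | 2.78 | 4.54 | 6.33 | 4.44 | 6.19 |
| Qwen3-ASR | 7.51 | 9.52 | 2.23 | 1.54 | 1.73 | 1.27 | 10.40 | 14.61 | 9.57 | 19.42 | 1.54 | 3.41 | 4.16 | 4.19 | 3.30 | 5.39 |
| Our model | | | | | | | | | | | | | | | | |
| Mega-ASR | 6.33 | 8.26 | 2.35 | 1.61 | 1.62 | 1.23 | 8.62 | 12.59 | 7.65 | 14.21 | 1.71 | 3.72 | 2.59 | 2.62 | 2.73 | 4.57 |
| Mega-ASR w/ router | 6.12 | 8.09 | 2.33 | 1.69 | 1.80 | 1.41 | 8.66 | 12.22 | 6.91 | 13.23 | 1.60 | 3.35 | 2.72 | 2.88 | 2.63 | 4.53 |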

Reproduction

Use scripts/run_inference.py to generate predictions and scripts/evaluate_predictions.py to compute Chinese CER and English WER. Mega-ASR is the public name for the merged_v2 model used in the paper.
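
For readers who want to check the two metrics by hand, here is a minimal sketch of character error rate (CER) and word error rate (WER). It is not the logic in scripts/evaluate_predictions.py. The text normalization (whitespace stripping for Chinese, lowercasing and whitespace splitting for English) and the corpus-level aggregation are assumptions, and the names `edit_distance`, `error_rate`, `cer_tokens`, and `wer_tokens` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(pairs, tokenize):
    """Total edits divided by total reference tokens over (ref, hyp) pairs."""
    edits = 0
    ref_len = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        edits += edit_distance(ref_tokens, hyp_tokens)
        ref_len += len(ref_tokens)
    return edits / ref_len if ref_len else 0.0


def cer_tokens(text):
    # Assumed normalization: one token per non-whitespace character.
    return [ch for ch in text if not ch.isspace()]


def wer_tokens(text):
    # Assumed normalization: lowercase, split on whitespace.
    return text.lower().split()


# One substituted character out of six reference characters.
zh_cer = error_rate([("今天天气很好", "今天天汽很好")], cer_tokens)
# One deleted word and one substituted word out of four reference words.
en_wer = error_rate([("turn on the light", "turn the lights")], wer_tokens)
```

Both functions return a fraction rather than a percentage. Real evaluation scripts often apply further normalization (punctuation removal, number formatting), so small differences from published scores are to be expected with this sketch.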

The full benchmark is available on Hugging Face. The repository includes data/examples.jsonl and eight local audio clips for lightweight smoke tests.
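
Before running a model on the smoke-test subset, it can help to confirm that the manifest parses. The sketch below loads a JSONL file and reports which field names occur. It makes no assumption about the schema of data/examples.jsonl, which is not documented here, and the helper names `load_jsonl` and `field_coverage` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Parse one JSON value per non-empty line; report the line on failure."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return records


def field_coverage(records):
    """Count how many records contain each top-level field name."""
    counts = Counter()
    for rec in records:
        if isinstance(rec, dict):
            counts.update(rec.keys())
    return counts
```

Calling `field_coverage(load_jsonl("data/examples.jsonl"))` from the repository root lists every field with the number of records that carry it, which makes a missing or misspelled field easy to spot.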

Example Audio

Noise

Background noise and additive interference.

Far-field

Distant microphone capture.

Obstructed

Physical or spectral obstruction.

Distortion

Clipping and signal degradation.

Recording

Channel and device artifacts.

Echo

Echo-heavy reverberant speech.

Dropout

Missing or discontinuous speech.

Mixed

Combined acoustic perturbations.

Model Wrappers

| CLI name | Backend | Checkpoint |
|---|---|---|
| whisper-large-v3 | Transformers | openai/whisper-large-v3 |
| canary-1b-v2 | NVIDIA NeMo | nvidia/canary-1b-v2 |
| parakeet-tdt-0.6b-v3 | NVIDIA NeMo | nvidia/parakeet-tdt-0.6b-v3 |
| qwen3-asr-1.7b | Qwen ASR runtime | Qwen/Qwen3-ASR-1.7B |
| kimi-audio | Kimi-Audio runtime | --model-path or KIMI_AUDIO_MODEL_PATH |
| step-audio-2-mini | Step-Audio2 runtime | --model-path or STEP_AUDIO2_MODEL_PATH |
| mega-asr | Qwen ASR runtime | --model-path /path/to/Mega-ASR |
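
Several wrappers in the table above take a local checkpoint through `--model-path` or an environment variable. The sketch below shows one way such a lookup could work. It is an illustration, not the repository's code: the function name `resolve_model_path` is hypothetical, and the precedence (explicit flag value first, environment variable second) is an assumption.

```python
import os

# Environment variables named in the wrapper table, keyed by CLI name.
ENV_VARS = {
    "kimi-audio": "KIMI_AUDIO_MODEL_PATH",
    "step-audio-2-mini": "STEP_AUDIO2_MODEL_PATH",
}


def resolve_model_path(model, cli_path=None, env=None):
    """Return the checkpoint path for a wrapper that needs a local model.

    cli_path is the value given to --model-path, if any.
    Assumed precedence: the flag wins over the environment variable.
    """
    env = os.environ if env is None else env
    if cli_path:
        return cli_path
    var = ENV_VARS.get(model)
    if var and env.get(var):
        return env[var]
    hint = f" or set {var}" if var else ""
    raise ValueError(f"{model}: pass --model-path{hint}")
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.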

Categories

| Category | Samples | Description |
|---|---|---|
| noise | 500 | Background noise and additive acoustic interference. |
| far_field | 500 | Distant microphone and reverberant capture conditions. |
| obstructed | 500 | Speech affected by physical or spectral obstruction. |
| distortion | 500 | Clipping, nonlinear distortion, and signal degradation. |
| recording | 500 | Recording coloration, channel effects, and related artifacts. |
| echo | 500 | Echo-heavy and reverberant speech. |
| dropout | 500 | Missing, repeated, or discontinuous speech segments. |
| mixed | 1,500 | Combinations of multiple acoustic conditions. |
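
The per-category counts above add up to the 5,000 total (seven categories of 500 plus 1,500 mixed). A short sketch of a sanity check for a downloaded copy follows. It assumes you can obtain one category label per sample; how the benchmark stores that label is not specified here, and `count_mismatches` is a hypothetical helper.

```python
from collections import Counter

# Expected sample counts, copied from the category table.
EXPECTED = {
    "noise": 500,
    "far_field": 500,
    "obstructed": 500,
    "distortion": 500,
    "recording": 500,
    "echo": 500,
    "dropout": 500,
    "mixed": 1500,
}


def count_mismatches(labels):
    """Map each category whose count differs to (expected, observed)."""
    observed = Counter(labels)
    names = sorted(set(EXPECTED) | set(observed))
    return {
        name: (EXPECTED.get(name, 0), observed.get(name, 0))
        for name in names
        if EXPECTED.get(name, 0) != observed.get(name, 0)
    }


# The table is internally consistent with the stated total.
assert sum(EXPECTED.values()) == 5000
```

An empty result means the labels match the table; any entry shows the expected and observed counts for a category that is off, including labels that do not appear in the table at all.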