Dataset Summary
Voices-in-the-Wild-Bench contains 5,000 speech examples: 3,500 synthetic samples and 1,500 real-recorded samples from 16 speakers. The examples are split evenly between Chinese and English.
Leaderboard
Results broken down by acoustic scenario. Scores are error rates; lower is better. Real. and Sim. denote real-recorded and synthetic speech, respectively.
| Model | Noise Real. | Noise Sim. | Far. Real. | Far. Sim. | Obst. Real. | Obst. Sim. | Echo. Real. | Echo. Sim. | Record. Real. | Record. Sim. | Elc.Dis. Real. | Elc.Dis. Sim. | Trans.Drop. Real. | Trans.Drop. Sim. | Mixed Real. | Mixed Sim. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source models | ||||||||||||||||
| Gemini3-Flash | 7.63 | 10.61 | 5.14 | 1.90 | 3.73 | 2.65 | 8.75 | 14.86 | 8.38 | 19.85 | 3.15 | 7.56 | 5.47 | 7.65 | 7.99 | 9.62 |
| Seed-ASR | 8.21 | 8.11 | 3.06 | 3.19 | 3.10 | 2.76 | 16.55 | 18.21 | 18.48 | 23.33 | 3.89 | 5.71 | 7.97 | 7.46 | 6.88 | 9.29 |
| GPT-4o-trans. | 13.19 | 45.78 | 1.87 | 2.39 | 1.57 | 2.77 | 15.62 | 28.76 | 13.37 | 22.60 | 3.70 | 8.43 | 8.76 | 7.71 | 5.62 | 11.00 |
| Open-source models | ||||||||||||||||
| Whisper-L-v3 | 16.57 | 18.19 | 3.38 | 6.85 | 3.06 | 6.01 | 25.34 | 39.87 | 18.33 | 31.81 | 3.74 | 8.77 | 7.04 | 8.05 | 8.91 | 14.79 |
| Qwen2.5-Omni | 11.92 | 17.88 | 2.35 | 2.44 | 2.40 | 2.08 | 20.01 | 32.64 | 13.71 | 30.09 | 2.46 | 5.96 | 6.34 | 5.88 | 6.40 | 10.29 |
| Kimi-Audio | 35.10 | 14.59 | 2.71 | 1.92 | 2.49 | 1.64 | 24.00 | 26.58 | 8.73 | 18.09 | 1.83 | 2.78 | 4.54 | 6.33 | 4.44 | 6.19 |
| Qwen3-ASR | 7.51 | 9.52 | 2.23 | 1.54 | 1.73 | 1.27 | 10.40 | 14.61 | 9.57 | 19.42 | 1.54 | 3.41 | 4.16 | 4.19 | 3.30 | 5.39 |
| Our model | ||||||||||||||||
| Mega-ASR | 6.33 | 8.26 | 2.35 | 1.61 | 1.62 | 1.23 | 8.62 | 12.59 | 7.65 | 14.21 | 1.71 | 3.72 | 2.59 | 2.62 | 2.73 | 4.57 |
| Mega-ASR w/ router | 6.12 | 8.09 | 2.33 | 1.69 | 1.80 | 1.41 | 8.66 | 12.22 | 6.91 | 13.23 | 1.60 | 3.35 | 2.72 | 2.88 | 2.63 | 4.53 |
Reproduction
Use scripts/run_inference.py to generate predictions and scripts/evaluate_predictions.py to compute Chinese CER and English WER. Mega-ASR is the public name for the merged_v2 model used in the paper.
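For readers who want to sanity-check a handful of predictions by hand, the sketch below shows how CER and WER are conventionally defined: the edit distance between reference and hypothesis, divided by the reference length. It is not the code in scripts/evaluate_predictions.py; the function names are hypothetical, and text normalization (casing, punctuation, number formatting) is deliberately left out, so its numbers may differ from the official script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: tokens are non-whitespace characters."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Both functions return a fraction; multiply by 100 to express the result as a percentage.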
The repository includes data/examples.jsonl and eight local audio clips for lightweight smoke tests.
Example Audio
Background noise and additive interference.
Distant microphone capture.
Physical or spectral obstruction.
Clipping and signal degradation.
Channel and device artifacts.
Echo-heavy reverberant speech.
Missing or discontinuous speech.
Combined acoustic perturbations.
Model Wrappers
| CLI name | Backend | Checkpoint |
|---|---|---|
| whisper-large-v3 | Transformers | openai/whisper-large-v3 |
| canary-1b-v2 | NVIDIA NeMo | nvidia/canary-1b-v2 |
| parakeet-tdt-0.6b-v3 | NVIDIA NeMo | nvidia/parakeet-tdt-0.6b-v3 |
| qwen3-asr-1.7b | Qwen ASR runtime | Qwen/Qwen3-ASR-1.7B |
| kimi-audio | Kimi-Audio runtime | --model-path or KIMI_AUDIO_MODEL_PATH |
| step-audio-2-mini | Step-Audio2 runtime | --model-path or STEP_AUDIO2_MODEL_PATH |
| mega-asr | Qwen ASR runtime | --model-path /path/to/Mega-ASR |
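For kimi-audio and step-audio-2-mini, the Checkpoint column lists two ways to supply a local checkpoint: the --model-path flag or an environment variable. The sketch below illustrates one plausible way such a lookup could work. The function name `resolve_model_path` is hypothetical, and the precedence (flag first, then environment variable) is an assumption rather than documented behavior of the wrappers.

```python
import os

# Environment variable names taken from the table above. Wrappers without an
# entry here are assumed to accept a local path only through --model-path.
ENV_VARS = {
    "kimi-audio": "KIMI_AUDIO_MODEL_PATH",
    "step-audio-2-mini": "STEP_AUDIO2_MODEL_PATH",
}


def resolve_model_path(cli_name, model_path=None, environ=None):
    """Return the checkpoint path for a wrapper (hypothetical helper).

    Assumed precedence: an explicit --model-path value wins, then the
    wrapper's environment variable, otherwise an error is raised.
    """
    environ = os.environ if environ is None else environ
    if model_path:
        return model_path
    env_var = ENV_VARS.get(cli_name)
    if env_var and environ.get(env_var):
        return environ[env_var]
    raise ValueError(f"no model path provided for {cli_name!r}")
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.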
Categories
| Category | Samples | Description |
|---|---|---|
| noise | 500 | Background noise and additive acoustic interference. |
| far_field | 500 | Distant microphone and reverberant capture conditions. |
| obstructed | 500 | Speech affected by physical or spectral obstruction. |
| distortion | 500 | Clipping, nonlinear distortion, and signal degradation. |
| recording | 500 | Recording coloration, channel effects, and related artifacts. |
| echo | 500 | Echo-heavy and reverberant speech. |
| dropout | 500 | Missing, repeated, or discontinuous speech segments. |
| mixed | 1,500 | Combinations of multiple acoustic conditions. |
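The per-category counts above add up to the 5,000 examples stated in the summary. A download can be checked against them with a short script like the sketch below. It assumes a JSONL manifest of the full dataset in which each record stores its category under a `category` key; that field name and the helper names are assumptions, not part of the documented schema.

```python
import json
from collections import Counter

# Sample counts copied from the Categories table.
EXPECTED_COUNTS = {
    "noise": 500,
    "far_field": 500,
    "obstructed": 500,
    "distortion": 500,
    "recording": 500,
    "echo": 500,
    "dropout": 500,
    "mixed": 1500,
}


def count_categories(lines, field="category"):
    """Count records per category from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            counts[json.loads(line)[field]] += 1
    return counts


def find_mismatches(counts, expected=EXPECTED_COUNTS):
    """Map each category whose count differs to (found, expected)."""
    return {
        name: (counts.get(name, 0), want)
        for name, want in expected.items()
        if counts.get(name, 0) != want
    }
```

Because `count_categories` accepts any iterable of lines, an open file handle can be passed directly. Note that data/examples.jsonl is described as a smoke-test subset, so it is not expected to match the full counts.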