New ASR Leaderboard Reveals Key Model Tradeoffs
November 21, 2025 · 3 min read
Automatic speech recognition (ASR) benchmarks have long focused on short English clips, but real-world applications demand more. The Open ASR Leaderboard addresses this by also evaluating multilingual performance and throughput, both crucial for tasks like transcribing meetings and podcasts, and lets developers and researchers compare systems fairly across diverse use cases. As of November 21, 2025, the leaderboard has become a standard reference, comparing 60 models from 18 organizations across 11 datasets.
In English transcription, models that pair Conformer encoders with large language model (LLM) decoders achieve the highest accuracy. For instance, NVIDIA's Canary-Qwen-2.5B, IBM's Granite-Speech-3.3-8B, and Microsoft's Phi-4-Multimodal-Instruct post the lowest word error rates (WER), suggesting that LLM decoders meaningfully boost transcription accuracy. NVIDIA's Fast Conformer, a faster encoder variant used in models like Canary and Parakeet, offers a roughly 2x speed improvement without sacrificing much accuracy.
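WER, the accuracy metric used throughout the leaderboard, is the word-level edit distance between a reference transcript and a model's output, divided by the reference length. A minimal sketch (illustrative only; leaderboards typically normalize text first and use a library such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution / 6 words ≈ 0.167
```

Lower is better: a WER of 0.05 means roughly one word in twenty is wrong.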
However, these LLM-based decoders are slower than simpler alternatives, highlighting a tradeoff between accuracy and speed. Efficiency on the leaderboard is measured by inverse real-time factor (RTFx), the ratio of audio duration to processing time; higher values mean faster inference. CTC and TDT decoders deliver 10 to 100 times greater throughput, though with slightly higher error rates, making them well suited to real-time or batch transcription of long recordings such as lectures and podcasts.
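The RTFx metric is straightforward to compute; a minimal sketch, assuming wall-clock time is measured over the full transcription run:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per
    second of compute. RTFx > 1 means faster than real time."""
    return audio_seconds / processing_seconds

# A model that transcribes a 60-minute recording in 90 seconds
# runs at 40x real time:
print(rtfx(3600, 90))  # 40.0
```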
Multilingual ASR presents another layer of complexity, with OpenAI's Whisper Large v3 serving as a strong baseline supporting 99 languages. Fine-tuned versions such as Distil-Whisper and CrisperWhisper often excel in English-only tasks, showing how specialization through tuning can enhance performance. Yet, this focus on English reduces multilingual coverage, illustrating the classic tradeoff between generalization and specialization. Self-supervised systems like Meta's Massively Multilingual Speech (MMS) and Omnilingual ASR support over 1,000 languages but lag in accuracy compared to language-specific encoders.
Community-driven efforts complement the main leaderboard, with specialized benchmarks for languages like Arabic and Russian. The Open Universal Arabic ASR Leaderboard evaluates models on Modern Standard Arabic and dialects, addressing challenges such as dialectal variation and diglossia. Similarly, the Russian ASR Leaderboard focuses on Russian phonology and morphology, encouraging dataset sharing and transparent comparisons. These initiatives align with the broader goal of improving ASR for under-resourced languages through open collaboration.
For long-form audio, closed-source systems currently outperform open ones, possibly due to domain tuning or serving optimizations. Among open models, Whisper Large v3 leads in accuracy, but CTC-based Conformers like NVIDIA's Parakeet CTC 1.1B offer far higher throughput, with an RTFx of 2793.75 versus 68.56 for Whisper. The cost is a modest WER increase, from 6.43% to 6.68%, and the limitation that Parakeet is English-only. Despite the closed-source lead, long-form ASR remains a promising area for open-source innovation.
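The quoted RTFx figures make the tradeoff concrete: Parakeet processes audio roughly 40 times faster for about a quarter-point more WER.

```python
# Throughput ratio implied by the RTFx values quoted above.
parakeet_rtfx = 2793.75
whisper_rtfx = 68.56
speedup = parakeet_rtfx / whisper_rtfx
print(f"{speedup:.1f}x faster")  # 40.7x faster
```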
The Open ASR Leaderboard continues to evolve, with plans to add more languages, models, and datasets. It serves as a transparent, community-driven benchmark, referenced by other leaderboards in areas like Russian and Arabic ASR, as well as speech deepfake detection. Contributions are encouraged via GitHub pull requests, fostering ongoing improvements in the field. As ASR technology advances rapidly, this benchmark will help track progress in both performance and efficiency.