Voice AI Faces Accuracy vs. Experience Tradeoff

March 24, 2026 · 3 min read

The quality of conversational voice agents has remained frustratingly difficult to measure, with existing evaluation methods failing to capture what truly matters in real-world interactions. When you call an airline's automated system to rebook a flight, you need both the correct outcome and a smooth conversation: mishearing a confirmation code or overwhelming you with spoken options can render even perfect task completion useless. This fundamental gap has persisted because current frameworks evaluate accuracy and experience separately, missing how the two dimensions interact in practice.

The authors report that their new EVA framework is the first to jointly score task success and conversational experience through complete, multi-turn spoken conversations. They found a consistent tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa. This reveals that optimizing for one dimension often comes at the expense of the other, a critical insight invisible to benchmarks that measure only whether tasks get completed.

EVA uses a bot-to-bot audio architecture that simulates realistic conversations. A user simulator plays the role of a caller using high-quality text-to-speech, while the voice agent being evaluated must invoke appropriate tools and reach verifiable end states. The framework includes validators to ensure conversations are correctly executed without human annotation, and supports both cascade architectures and audio-native models like speech-to-speech systems.
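The loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not EVA's actual API: the class names (`UserSimulator`, `VoiceAgent`), the text-only turn exchange standing in for audio, and the end-state check are all assumptions made for clarity.

```python
# Hypothetical sketch of a bot-to-bot evaluation loop. Names and logic are
# illustrative assumptions, not EVA's real interface; audio is stubbed as text.

class UserSimulator:
    """Plays the caller, emitting one scripted utterance per turn."""
    def __init__(self, script):
        self.script = list(script)

    def next_utterance(self):
        return self.script.pop(0) if self.script else None

class VoiceAgent:
    """Stand-in for the system under test: maps utterances to state changes."""
    def __init__(self):
        self.state = {}

    def respond(self, utterance):
        if utterance.startswith("rebook"):
            self.state["booking"] = utterance.split()[-1]  # record new flight
            return "Your flight has been rebooked."
        return "How can I help you?"

def run_conversation(simulator, agent, max_turns=10):
    """Drive the dialogue until the script is exhausted; return final state."""
    for _ in range(max_turns):
        utterance = simulator.next_utterance()
        if utterance is None:
            break
        agent.respond(utterance)
    return agent.state

def validate(final_state, expected):
    """Programmatic end-state check -- no human annotation needed."""
    return all(final_state.get(k) == v for k, v in expected.items())

sim = UserSimulator(["hello", "rebook flight AA123"])
state = run_conversation(sim, VoiceAgent())
print(validate(state, {"booking": "AA123"}))  # True
```

The key design point carried over from the description is the validator: because success is defined as reaching a verifiable end state, the check is a plain function over the agent's final state rather than a human judgment.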

EVA produces two primary scores: EVA-A for accuracy and EVA-X for experience. Accuracy measures not just task completion but also whether information was communicated correctly and policies were followed faithfully. Experience evaluates whether interactions were natural, concise, and appropriately timed for spoken dialogue. The framework also includes diagnostic metrics to identify specific failure modes in components like speech recognition or synthesis.
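To make the two axes concrete, here is a toy aggregation of the rubric items named above. The specific items, the binary accuracy checks, and the equal weighting are assumptions for illustration only; the paper's actual scoring formulas may differ.

```python
# Hypothetical aggregation of the two EVA axes. Rubric items and equal
# weighting are illustrative assumptions, not the paper's formulas.

def eva_accuracy(task_completed, info_correct, policy_followed):
    """EVA-A sketch: mean of three binary accuracy checks."""
    return sum([task_completed, info_correct, policy_followed]) / 3

def eva_experience(naturalness, conciseness, timing):
    """EVA-X sketch: mean of judge scores, each in [0, 1]."""
    return (naturalness + conciseness + timing) / 3

a = eva_accuracy(task_completed=True, info_correct=True, policy_followed=False)
x = eva_experience(naturalness=0.9, conciseness=0.6, timing=0.8)
print(round(a, 2), round(x, 2))  # 0.67 0.77
```

Keeping the two scores separate, rather than collapsing them into one number, is what lets the tradeoff between accuracy and experience show up at all.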

Results from evaluating 20 systems show that no single configuration dominates on both the accuracy and experience axes. The authors identified named entity transcription as a dominant failure mode, where a single misheard character can cascade into authentication failures and complete conversation breakdowns. Multi-step workflows, particularly rebooking flights while preserving ancillary services, proved especially challenging across all configurations.
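The named-entity failure mode is easy to see when authentication relies on exact-match lookup: one ASR substitution in a confirmation code fails the lookup outright. The booking data and code below are invented for illustration.

```python
# Illustration of the named-entity failure mode (hypothetical data):
# a single ASR character substitution breaks exact-match authentication.

bookings = {"X7QK29": "Flight AA123, seat 14C"}

def authenticate(heard_code):
    """Exact-match lookup: any one-character transcription error fails."""
    return bookings.get(heard_code)

print(authenticate("X7QK29"))  # Flight AA123, seat 14C
print(authenticate("X7GK29"))  # None -- 'Q' misheard as 'G'
```

A failed lookup early in the call then cascades: the agent cannot retrieve the booking, so every downstream step of the rebooking workflow breaks down.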

The field currently lacks frameworks that evaluate voice agent quality as an integrated whole. Existing efforts like AudioBench and VoiceBench assess speech understanding in isolation, while others like FD-Bench analyze conversational dynamics separately from task-oriented tool use. More recent benchmarks like VoiceAgentBench evaluate agentic capabilities but not within complete conversational workflows from initial request through final resolution.

Several limitations are important to acknowledge. The current release covers only 50 English-language scenarios in the airline domain and may not generalize to other contexts. The user simulator may not perfectly replicate real caller behaviors like disfluencies or emotions. Additionally, LLM-as-judge metrics carry inherent biases and may favor certain response styles independent of quality.

The authors plan to expand EVA with additional domain datasets, more complex scenarios, and robustness testing under diverse conditions. They also intend to address current gaps like prosodic quality assessment and affect-aware evaluation of how agents respond to user distress. This ongoing development aims to provide a more comprehensive assessment of voice agent capabilities as the field evolves.