How Targeted Evals Shape AI Agent Behavior

March 27, 2026 · 3 min read

In the rapidly evolving field of AI agents, a common pitfall is the blind accumulation of evaluations, leading to systems that excel in tests but falter in practice. The developers behind Deep Agents, an open-source, model-agnostic harness, recognized this early on. They argue that every evaluation acts as a vector, subtly shifting an agent's behavior over time, making thoughtful design paramount. This insight forms the core of their reasoning: targeted evals that mirror production behaviors are more effective than vast, generic test suites.

Their key finding is that a curated approach to evaluations, focusing on specific, high-impact behaviors, yields agents that are both correct and efficient in real-world scenarios. By cataloging essential production tasks, such as multi-file retrieval or sequential tool calls, they avoid the illusion of improvement from irrelevant benchmarks. This ensures that evals directly influence the agent's development, applying consistent pressure toward desired outcomes without unnecessary complexity or expense.

Methodologically, the team operationalizes this through a systematic curation process. They trace every eval run to a shared LangSmith project, enabling collaborative analysis and fixes. For instance, interactions with Open SWE, their coding agent, are traced and converted into evals to prevent recurring mistakes. They integrate with external benchmarks like BFCL and Harbor for coding tasks, while also crafting custom evals from scratch to test isolated behaviors. This diversity in eval structure, from single-step prompts to multi-turn simulations, allows for comprehensive assessment across different scenarios.
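To make the range of eval structures concrete, here is a minimal sketch of how curated cases might be represented. The `EvalCase` schema, field names, and example prompts are hypothetical illustrations, not the Deep Agents team's actual format; the point is that a single-step prompt and a multi-turn simulation can share one shape, differing only in how many turns they carry.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One curated eval, e.g. derived from a production trace (hypothetical schema)."""
    name: str
    turns: list[str]                 # single-step evals have one turn; simulations have several
    expected: str                    # target behavior to assert against
    tags: list[str] = field(default_factory=list)

# A single-step prompt testing one isolated behavior...
single_step = EvalCase(
    name="sequential-tool-calls",
    turns=["List the repo's Python files, then count their lines."],
    expected="calls the file-listing tool before the line-counting tool",
    tags=["tools"],
)

# ...and a multi-turn simulation of the kind one might reconstruct from a coding-agent trace.
multi_turn = EvalCase(
    name="bugfix-followup",
    turns=["Fix the failing test in utils.py", "Now add a regression test."],
    expected="edits utils.py before adding the new test",
    tags=["coding", "multi-turn"],
)

print(len(single_step.turns), len(multi_turn.turns))  # 1 2
```

Keeping both kinds of case in one schema makes it easy to run the whole catalog through the same harness while tagging subsets for targeted experiments.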

Analysis reveals a dual focus on correctness and efficiency. Initially, models are evaluated for correctness using custom assertions, exact matching, or LLM-as-a-judge, depending on the task. Once correctness is established, efficiency metrics come into play, such as solve rate and latency ratio, which account for steps, tool calls, and time. An illustrative example compares an ideal trajectory of 4 steps and 8 seconds with an inefficient but correct one of 6 steps and 14 seconds, highlighting how inefficiencies increase cost and failure risk. These metrics distill complex runs into actionable numbers for model comparison.
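The efficiency metrics above can be sketched in a few lines. This is a minimal illustration using the post's own numbers (4 steps / 8 seconds ideal versus 6 steps / 14 seconds actual); the `Trajectory` type and the exact metric definitions are assumptions for the sake of the example, not the team's implementation.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: int       # agent steps, including tool calls
    seconds: float   # wall-clock time for the run
    correct: bool    # did the run pass its correctness check?

def solve_rate(runs: list[Trajectory]) -> float:
    """Fraction of runs that were correct."""
    return sum(r.correct for r in runs) / len(runs)

def latency_ratio(actual: Trajectory, ideal: Trajectory) -> float:
    """How much slower the actual run was than the ideal trajectory (1.0 = ideal)."""
    return actual.seconds / ideal.seconds

ideal = Trajectory(steps=4, seconds=8.0, correct=True)
actual = Trajectory(steps=6, seconds=14.0, correct=True)

print(latency_ratio(actual, ideal))  # 1.75: correct, but 75% slower than ideal
```

Both runs count toward the solve rate, yet the latency ratio surfaces the inefficiency that would otherwise hide behind a passing correctness check.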

In context, this approach fits into the broader problem of building reliable agents for diverse applications like Fleet and Open SWE. By using pytest with GitHub Actions in CI, evals run in reproducible environments, and tagging subsets allows cost-effective, targeted experiments. The open-source nature of Deep Agents encourages community involvement, with plans to expand the eval suite and explore open-source LLMs. This framework helps teams avoid the temptation of adding thousands of tests that don't reflect production needs.
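A pytest-based eval of this kind might look like the sketch below. The `run_agent` stub, the `retrieval` marker name, and the assertions are all hypothetical stand-ins; the pattern to note is that markers let CI (or a developer) run a cheap, targeted subset with something like `pytest -m retrieval` instead of the whole suite.

```python
import pytest

def run_agent(prompt: str) -> dict:
    """Stub standing in for a real agent invocation (hypothetical)."""
    return {"answer": "DATABASE_URL is set in config/prod.env", "steps": 4}

@pytest.mark.retrieval  # tag so CI can run just this subset, e.g. `pytest -m retrieval`
def test_multi_file_retrieval():
    result = run_agent("Where is DATABASE_URL configured?")
    assert "config/prod.env" in result["answer"]   # correctness assertion
    assert result["steps"] <= 6                    # efficiency budget on agent steps
```

In a real setup the marker would be registered in `pytest.ini` and the GitHub Actions workflow would select markers per job, keeping expensive multi-turn evals out of every commit's CI run.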

Limitations include the difficulty of defining ideal trajectories for open-ended tasks, where approximations based on the best-performing model are used and refined over time. Additionally, while targeted evals save money, they require ongoing maintenance and shared responsibility to remain effective. The reliance on traces for analysis, though powerful, can be resource-intensive, necessitating tools like Polly or Insights at scale. These constraints highlight areas for future improvement as agent systems evolve.