
Open AI Models Now Match Closed Systems for Real Tasks

April 03, 2026 · 4 min read


A significant shift is occurring in the artificial intelligence landscape as open-weight large language models demonstrate they can perform core agent tasks with reliability matching expensive proprietary systems. Recent evaluations from Deep Agents show models like GLM-5 from z.ai and MiniMax M2.7 achieve similar performance to closed frontier models on practical operations including file handling, tool calling, and instruction following. This development means developers building AI agents now have viable alternatives to costly proprietary models that have dominated the field.

These results aren't surprising given the progress tracked through established benchmarks like SWE-Rebench and Terminal Bench 2.0, which have shown steady improvements in open-model capabilities. What's significant is that for developers deploying agents in production environments, open models now offer the consistency and predictability needed for real-world workflows. The reliability of tool calling and instruction following has reached a threshold where these models can be trusted with practical applications rather than just experimental demonstrations.

The evaluation methodology involved seven distinct categories covering fundamental agent capabilities: file operations, tool use, retrieval, conversation, memory, summarization, and unit tests. Each test included both hard-fail checks determining correctness and efficiency assertions measuring how economically models reached solutions. The researchers reported four key metrics, with step ratio and tool call ratio providing insight into how efficiently models accomplish tasks beyond simple correctness.
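The two efficiency metrics can be read as simple ratios against a known-good run. A minimal sketch, assuming the metrics are computed relative to a reference trajectory (the function names and the reference-trajectory framing are illustrative assumptions, not Deep Agents' actual implementation):

```python
# Hypothetical sketch of the efficiency metrics described above.
# "Reference" values stand in for a known-good solution path;
# a ratio of 1.0 means the model was as economical as that path.

def step_ratio(steps_taken: int, reference_steps: int) -> float:
    """Agent steps used relative to the reference solution."""
    return steps_taken / reference_steps

def tool_call_ratio(tool_calls_made: int, reference_tool_calls: int) -> float:
    """Tool calls issued relative to what a correct run needs."""
    return tool_calls_made / reference_tool_calls

# A run that passed its hard-fail checks but took 9 steps where 6 suffice:
print(step_ratio(9, 6))        # 1.5
print(tool_call_ratio(4, 4))   # 1.0
```

Separating correctness (hard-fail checks) from economy (ratios above 1.0) is what lets the evaluation distinguish a model that merely finishes from one that finishes without wasted tool calls.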

Performance data reveals compelling advantages beyond just capability matching. For latency-sensitive applications, open models demonstrate significant speed advantages, with GLM-5 on Baseten averaging 0.65 seconds latency and 70 tokens per second compared to 2.56 seconds and 34 tokens per second for Claude Opus 4.6. This performance gap represents a substantial disadvantage for closed models in interactive products where response time directly impacts user experience.
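A quick back-of-the-envelope calculation shows what those numbers mean end to end, using the simple model that total response time is first-token latency plus output length divided by throughput (the 500-token response length is an assumption for illustration):

```python
# Rough end-to-end response time from the figures above:
# time ≈ first-token latency + output_tokens / throughput

def response_time(latency_s: float, tokens_per_s: float, output_tokens: int) -> float:
    return latency_s + output_tokens / tokens_per_s

glm5 = response_time(0.65, 70, 500)   # GLM-5 on Baseten
opus = response_time(2.56, 34, 500)   # Claude Opus 4.6
print(f"{glm5:.1f}s vs {opus:.1f}s")  # 7.8s vs 17.3s
```

For a typical chat-length response, the open model finishes in well under half the time, which is the difference users actually feel.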

Cost considerations make the case even stronger for many development teams. An application processing 10 million tokens daily would cost approximately $250 per day using Opus 4.6 versus just $12 per day for MiniMax M2.7, translating to about $87,000 in annual savings. These financial advantages become particularly relevant for high-throughput workloads, where closed frontier models can run 8-10 times more expensive than their open counterparts.
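The savings figure is straightforward arithmetic on the per-day prices cited above (the 365-day year is an assumption):

```python
# Daily costs from the article, at ~10M tokens/day
opus_per_day = 250     # USD, Claude Opus 4.6
minimax_per_day = 12   # USD, MiniMax M2.7

annual_savings = (opus_per_day - minimax_per_day) * 365
print(annual_savings)  # 86870, i.e. roughly $87,000/year
```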

The Deep Agents harness plays a crucial role in making these models practically usable by absorbing differences in context windows, tool-calling formats, and failure modes across various models. This standardization allows developers to focus on building applications rather than managing model-specific implementations. The same open model is often available through multiple providers, enabling teams to select infrastructure matching their specific constraints around latency, cost, or deployment preferences.

A particularly innovative feature enables runtime model swapping without restarting agents, allowing patterns like using frontier models for planning phases followed by cheaper open models for execution. This hybrid approach combines the strengths of different model types while optimizing for both performance and cost. The Deep Agents CLI supports this capability through slash commands that let developers switch models mid-session based on task requirements.
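The plan-with-frontier, execute-with-open pattern can be sketched in a few lines. The `Agent` class and model identifiers below are illustrative assumptions, not the Deep Agents API; only the overall pattern (swapping models mid-session while conversation state persists) comes from the article:

```python
# Minimal sketch of mid-session model swapping. Hypothetical Agent
# class: in a real harness, the conversation history would survive
# the swap while only the backing model changes.
from dataclasses import dataclass, field

@dataclass
class Agent:
    model: str
    history: list = field(default_factory=list)  # persists across swaps

    def run(self, prompt: str) -> str:
        self.history.append(prompt)
        return f"[{self.model}] handled: {prompt}"

agent = Agent(model="claude-opus-4.6")   # frontier model for planning
plan = agent.run("Break the task into steps")

agent.model = "minimax-m2.7"             # swap mid-session, state intact
result = agent.run("Execute step 1")
print(result)   # [minimax-m2.7] handled: Execute step 1
```

The design choice that makes this work is keeping conversation state in the harness rather than in the model session, so switching backends is just a configuration change.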

These developments represent more than just technical achievements—they signal a maturing ecosystem where open models have crossed from experimental tools to production-ready solutions. With providers like Baseten, Fireworks, Groq, OpenRouter, and Ollama offering optimized inference infrastructure, teams can achieve latency and throughput levels beyond what most could manage independently. The availability of these models through multiple channels gives developers flexibility in deployment strategies while maintaining consistent performance.

As the AI landscape continues evolving, these results suggest a future where model selection becomes more nuanced, balancing factors like cost, latency, and task requirements rather than defaulting to the most expensive options. The open-weight approach broadens access to capable AI systems while fostering innovation through community development and transparent evaluation methodologies that measure what truly matters for practical applications.