Alibaba's Qwen 3.6-Plus Beats Claude at Coding

April 20, 2026 · 2 min read

TL;DR

Qwen 3.6-Plus outscores Anthropic's Claude on Terminal-Bench 2.0, showing how fast Chinese AI is closing the gap with US frontier labs.

Alibaba has released Qwen 3.6-Plus, a next-generation multimodal large language model and the first Chinese-built model to outperform Anthropic's Claude on a major agentic coding benchmark. On Terminal-Bench 2.0, a leading evaluation for autonomous terminal-based coding tasks, Qwen 3.6-Plus scored 61.6 against 59.3 for Claude Opus 4.5. The 2.3-point lead is narrow but symbolically significant, and it signals the accelerating pace of AI development outside the United States.

The achievement, while modest in absolute terms, carries outsized strategic weight. For years, Anthropic's Claude has been considered the gold standard for agentic coding — the ability to autonomously decompose complex programming problems, write and execute code, and iterate through debugging cycles with minimal human intervention. On SWE-bench Verified, another widely watched benchmark, Qwen 3.6-Plus scored 78.8, trailing Claude's 80.9 by just 2.1 points — the narrowest margin ever recorded between a Qwen model and Claude's top tier. Taken together, the results suggest the performance gap between leading American and Chinese frontier models is closing rapidly.

Qwen 3.6-Plus ships with a native one-million-token context window, enabling it to ingest entire code repositories in a single pass. That capability is increasingly table stakes among frontier models but remains critical for enterprise adoption. Alibaba reports inference throughput of 158 tokens per second, roughly three times Claude's speed on comparable tasks. If independently validated, that figure could make the model particularly attractive for latency-sensitive production workloads. The model also features native multimodal capabilities, including the ability to generate front-end web pages directly from screenshots, design mockups, and text prompts.
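To make the throughput claim concrete, here is a rough back-of-the-envelope sketch of what 158 tokens per second would mean for response times. The 158 figure is Alibaba's own and has not been independently validated. The baseline rate is simply that figure divided by three, taken from the "roughly three times" comparison rather than from any measurement. The function and constant names are illustrative.

```python
# Vendor-reported decode throughput quoted in the article; not independently validated.
QWEN_TOKENS_PER_SECOND = 158.0
# Assumed baseline implied by "roughly three times"; not a measured figure.
IMPLIED_BASELINE_TOKENS_PER_SECOND = QWEN_TOKENS_PER_SECOND / 3.0


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream output_tokens at a steady decode rate.

    Ignores prompt processing, time to first token, and network overhead.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    for tokens in (500, 2_000, 8_000):
        fast = generation_seconds(tokens, QWEN_TOKENS_PER_SECOND)
        slow = generation_seconds(tokens, IMPLIED_BASELINE_TOKENS_PER_SECOND)
        print(f"{tokens:>5} tokens: {fast:6.1f} s vs {slow:6.1f} s")
```

Under these assumptions, a 2,000-token response would stream in about 13 seconds rather than about 38. Real-world latency also depends on prompt length and serving conditions, which this sketch leaves out.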

Beyond coding, the model demonstrates strong performance in document understanding and visual reasoning. On OmniDocBench v1.5, Qwen 3.6-Plus posted a score of 91.2, surpassing both Claude and Google's Gemini 3 Pro, which each scored 87.7. On the RealWorldQA visual reasoning benchmark, it scored 85.4 versus Claude's 77.0. These results position the model as a serious contender across multiple modalities, not just code generation.

However, the release is not without caveats. Independent evaluations have flagged a 26.5 percent hallucination rate in the model's code-reasoning claims, and it achieved only a 43.3 percent success rate on hidden security benchmarks. Both weaknesses could limit its suitability for safety-critical applications. Alibaba has not disclosed the model's parameter count, making it difficult for outside researchers to fully assess its architectural innovations or efficiency gains.

The model is available immediately, free of charge during a preview period, via OpenRouter and through Alibaba Cloud's Bailian platform. Production pricing on Bailian is set at 2 yuan (approximately $0.29) per million input tokens and 12 yuan (approximately $1.74 at the same exchange rate) per million output tokens, pricing that significantly undercuts Western competitors. Notably, Alibaba has optimized Qwen 3.6-Plus for compatibility with the Claude Code, OpenClaw, Qwen Code, and Cline agent frameworks, a deliberate move to position the model as a drop-in replacement for Claude in existing developer workflows.
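For readers who want to turn the quoted rates into a budget, the following is a minimal sketch of the arithmetic. It uses only the production prices stated above. The exchange rate is the one implied by the article's "2 yuan (approximately $0.29)" conversion, which is an assumption, not a live rate. Function and constant names are illustrative and are not part of any Alibaba or OpenRouter API.

```python
# Production prices quoted in the article (yuan per million tokens).
INPUT_YUAN_PER_MILLION = 2.0
OUTPUT_YUAN_PER_MILLION = 12.0
# Assumed rate implied by "2 yuan (approximately $0.29)"; not a live exchange rate.
USD_PER_YUAN = 0.29 / 2.0


def estimate_cost_yuan(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in yuan of one workload at the quoted production prices."""
    total = (input_tokens * INPUT_YUAN_PER_MILLION
             + output_tokens * OUTPUT_YUAN_PER_MILLION)
    return total / 1_000_000


def yuan_to_usd(yuan: float) -> float:
    """Convert yuan to US dollars at the assumed rate above."""
    return yuan * USD_PER_YUAN


if __name__ == "__main__":
    # Hypothetical workload: a large repository prompt and a modest reply.
    cost = estimate_cost_yuan(input_tokens=800_000, output_tokens=20_000)
    print(f"{cost:.2f} yuan (about ${yuan_to_usd(cost):.2f})")
```

For the hypothetical workload shown, 800,000 input tokens and 20,000 output tokens come to 1.84 yuan, or about $0.27 at the assumed rate. Because output tokens cost six times as much as input tokens, long generations dominate the bill more quickly than long prompts.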

The release arrives amid a broader reorganization of Alibaba's AI operations into a unified division dubbed the "Alibaba Token Hub," reflecting intensifying domestic competition from ByteDance and DeepSeek. Qwen 3.6-Plus will be integrated across Alibaba's ecosystem, including the Wukong enterprise AI platform, Qwen App, and Model Studio. With support for over 100 languages and aggressive pricing, Alibaba is clearly betting that developer adoption — not just benchmark supremacy — will determine the next phase of the global AI race.

For Anthropic, the results are a reminder that its lead in agentic coding, once considered commanding, is no longer assured. While Claude retains advantages on SWE-bench and arguably in safety and reliability, the trajectory is unmistakable: the frontier is getting crowded, and the margin for differentiation is shrinking with every model generation.