Google's newest Flash model underperforms older rivals on Android coding tests while charging developers the highest per-run price on the leaderboard.
Google's newest AI model for developers costs more than any rival on the Android coding leaderboard — and scored lower than its own predecessor.
Updated Android Bench results published Sunday show Gemini 3.5 Flash finishing sixth overall with a score of 63.7. Android Authority reports the model averaged 355.9 tokens per benchmark run, producing a cost of $147.1 per run — the highest figure on the chart. OpenAI's GPT 5.5 topped the rankings with 74. Gemini 3.1 Pro Preview, Google's older model, matched GPT 5.4 at 72.4, outscoring its own successor by more than eight points.
The Flash brand carries a specific meaning in Google's product hierarchy: fast, efficient, affordable. Gemini 3.5 Flash undermines all three at once. Android Bench evaluates concrete Android development tasks, the kind of work where developers make daily API choices and where cost-per-run numbers feed directly into production budgets.
The score gap
At Google I/O 2026, the company presented Gemini 3.5 Flash as its most capable Flash model yet, citing improved coding support, stronger agent workflows, and better handling of complex multi-step tasks. According to Android Authority, this is the model's first appearance in the Android Bench rankings, meaning the initial public data point from a domain-specific coding benchmark directly contradicts the I/O pitch.
Two Claude Opus models from Anthropic also placed ahead of Gemini 3.5 Flash. In all, five models outperformed Google's newest release: GPT 5.5, GPT 5.4, Gemini 3.1 Pro Preview, and the two Opus models. The token efficiency problem is distinct from raw accuracy: a model consuming 355.9 tokens on average while scoring 63.7 suggests expensive computation that isn't translating into better answers. For AI deployments running inference at scale, that arithmetic matters more than stage announcements.
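The arithmetic behind those comparisons can be sketched in a few lines of Python. This is an illustration built only on the figures reported above; the "cost per score point" ratio is an assumed convenience metric, not something Android Bench itself publishes, and models whose cost is not reported here are left out of the cost math.

```python
# Scores as reported in this article (Android Bench, via Android Authority).
SCORES = {
    "GPT 5.5": 74.0,
    "GPT 5.4": 72.4,
    "Gemini 3.1 Pro Preview": 72.4,
    "Gemini 3.5 Flash": 63.7,
}

# Only Gemini 3.5 Flash's per-run cost is given in the article.
FLASH = "Gemini 3.5 Flash"
FLASH_COST_PER_RUN_USD = 147.1


def score_gap(model: str, baseline: str = FLASH) -> float:
    """Points by which `model` leads `baseline`, rounded to one decimal."""
    return round(SCORES[model] - SCORES[baseline], 1)


def cost_per_point(cost_usd: float, score: float) -> float:
    """Hypothetical value metric: dollars spent per benchmark point."""
    return round(cost_usd / score, 2)


# Lead of each rival over Gemini 3.5 Flash.
gaps = {model: score_gap(model) for model in SCORES if model != FLASH}

# Dollars per benchmark point for Gemini 3.5 Flash.
flash_cost_per_point = cost_per_point(FLASH_COST_PER_RUN_USD, SCORES[FLASH])

if __name__ == "__main__":
    for model, gap in gaps.items():
        print(f"{model}: +{gap} points over {FLASH}")
    print(f"{FLASH}: ${flash_cost_per_point} per score point")
```

A team comparing providers could fill in per-run costs for the other models from the leaderboard and rank them by the same ratio, though any single ratio like this flattens differences in task mix and latency.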
The competitive picture
This result lands in a market where the evaluation of AI developer tools is growing more rigorous. Benchmark transparency has become a selling point: engineering teams at larger companies now publish internal evaluations before committing to a model provider. Android Bench is one such structured review: domain-specific, cost-aware, and difficult to game with general-capability improvements.
Globally, competition in the AI coding market is intensifying. Swarajya Mag reported this week that Bengaluru-based Sarvam AI closed $234 million in a Series B round at a $1.5 billion valuation, with HCLTech leading the investment. Sarvam is specifically targeting coding and agentic use cases, the same categories where Gemini 3.5 Flash underperformed, and its inference platform now handles 10 million API calls a day. The arrival of more focused, cost-aware competitors adds pressure on incumbents that lead on brand recognition but trail on benchmarks.
The Claude Opus models that outperformed Flash come from Anthropic, which faces its own turbulence. ZDNET reported this weekend that the company pulled its Fable 5 and Mythos 5 models globally after a US government export directive cited national security concerns over a jailbreak technique. Despite that disruption, Anthropic's Opus line remained available and placed above Google's newest Flash release.
Google faces a specific credibility problem here. Flash was never the premium tier — it was the efficient tier. Pricing it above everything on the leaderboard while it finishes sixth creates an awkward conversation with developers who allocate API budgets based on value per token. One benchmark debut is not a final verdict, and Google has room to close the gap in future updates. But the first read from Android Bench is hard to reframe.
Whether Gemini 3.5 Flash's claimed strengths in agent workflows and complex multi-step tasks can justify the cost premium is a question developers will answer by running their own workloads. What Android Bench establishes is that for the specific work Android developers do every day, Google is currently charging the most and delivering the sixth-best result.
FAQ
What is Android Bench?
A benchmark leaderboard that evaluates how well AI models perform Android development tasks, reporting both a performance score and an average cost per run based on token consumption.
Why did Gemini 3.5 Flash score lower than Gemini 3.1 Pro Preview?
The benchmark data does not explain the technical reason. It shows the older Pro Preview model scoring 72.4 against Gemini 3.5 Flash's 63.7, with the newer model consuming significantly more tokens per run without producing better results.
How much does Gemini 3.5 Flash cost per benchmark run?
$147.1 on average, based on 355.9 average tokens per run — the highest cost of any model in the current Android Bench rankings.
Which model topped the Android Bench leaderboard?
OpenAI's GPT 5.5, with a score of 74, followed by GPT 5.4 and Gemini 3.1 Pro Preview, both at 72.4.








