Gemini 3.5 Flash Places Sixth, Costs Most on Android Bench

June 15, 2026 · 3 min read

TL;DR

Google's newest Flash model underperforms older rivals on Android coding tests while charging developers the highest per-run price on the leaderboard.

Google's newest AI model for developers costs more than any rival on the Android coding leaderboard — and scored lower than its own predecessor.

Updated Android Bench results published Sunday show Gemini 3.5 Flash finishing sixth overall with a score of 63.7. Android Authority reports the model averaged 355.9 tokens per benchmark run, producing a cost of $147.1 per run — the highest figure on the chart. OpenAI's GPT 5.5 topped the rankings with 74. Gemini 3.1 Pro Preview, Google's older model, matched GPT 5.4 at 72.4, outscoring its own successor by more than eight points.

The Flash brand carries a specific meaning in Google's product hierarchy: fast, efficient, affordable. Gemini 3.5 Flash undercuts all three at once. Android Bench evaluates concrete Android development tasks — the kind of work where developers make daily API choices and where cost-per-run numbers feed directly into production budgets.

The score gap

At Google I/O 2026, the company presented Gemini 3.5 Flash as its most capable Flash model yet, citing improved coding support, stronger agent workflows, and better handling of complex multi-step tasks. According to Android Authority, this is the model's first appearance in the Android Bench rankings, meaning the initial public data point from a domain-specific coding benchmark directly contradicts the I/O pitch.

Two Claude Opus models from Anthropic also placed ahead of Gemini 3.5 Flash. In total, five models outperformed Google's newest release: GPT 5.5, GPT 5.4, Gemini 3.1 Pro Preview, and the two Opus models. The token efficiency problem is distinct from raw accuracy: a model consuming 355.9 tokens on average while scoring 63.7 suggests expensive computation that isn't translating into better answers. For artificial intelligence deployments running inference at scale, that arithmetic matters more than stage announcements.
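To make that arithmetic concrete, here is a minimal Python sketch using only the figures reported above. The `Result` class and the helper names are illustrative, not part of Android Bench. The Claude Opus models are left out because their scores are not given here, and the cost field stays empty for every model except Gemini 3.5 Flash for the same reason.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    model: str
    score: float
    # Average cost per run in USD; None where this article reports no figure.
    cost_per_run: Optional[float] = None


# Figures as reported in this article (Claude Opus scores are not given).
RESULTS = [
    Result("GPT 5.5", 74.0),
    Result("GPT 5.4", 72.4),
    Result("Gemini 3.1 Pro Preview", 72.4),
    Result("Gemini 3.5 Flash", 63.7, cost_per_run=147.1),
]


def gap_to(results, model, reference):
    """Score points by which `reference` leads `model`, rounded to one decimal."""
    by_name = {r.model: r for r in results}
    return round(by_name[reference].score - by_name[model].score, 1)


def cost_per_point(result):
    """Dollars spent per benchmark point, or None if no cost was reported."""
    if result.cost_per_run is None:
        return None
    return result.cost_per_run / result.score


flash = RESULTS[-1]
print(gap_to(RESULTS, "Gemini 3.5 Flash", "Gemini 3.1 Pro Preview"))
print(gap_to(RESULTS, "Gemini 3.5 Flash", "GPT 5.5"))
print(round(cost_per_point(flash), 2))
```

A ratio like cost per point only becomes a comparison once the same cost figure is available for the other models, which the leaderboard itself would have to supply.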

The competitive picture

This result lands in a market where the evaluation of AI developer tools is growing more rigorous. Benchmark transparency has become a selling point: engineering teams at larger companies now publish internal evaluations before committing to a model provider. Android Bench is one such structured review: domain-specific, cost-aware, and difficult to game with general-capability improvements.

Globally, the AI coding market is intensifying. Swarajya Mag reported this week that Bengaluru-based Sarvam AI closed $234 million in a Series B round at a $1.5 billion valuation, with HCLTech leading the investment. Sarvam is specifically targeting coding and agentic use cases, the same categories where Gemini 3.5 Flash underperformed, and its inference platform now handles 10 million API calls a day. The arrival of more focused, cost-aware competitors adds pressure on incumbents that lead on brand recognition but trail on benchmarks.

The Claude Opus models that outperformed Flash come from Anthropic, which faces its own turbulence. ZDNET reported this weekend that the company pulled its Fable 5 and Mythos 5 models globally after a US government export directive cited national security concerns over a jailbreak technique. Despite that disruption, Anthropic's Opus line remained available and placed above Google's newest Flash release.

Google faces a specific credibility problem here. Flash was never the premium tier — it was the efficient tier. Pricing it above everything on the leaderboard while it finishes sixth creates an awkward conversation with developers who allocate API budgets based on value per token. One benchmark debut is not a final verdict, and Google has room to close the gap in future updates. But the first read from Android Bench is hard to reframe.

Whether Gemini 3.5 Flash's claimed strengths in agent workflows and multimodal tasks can justify the cost premium is a question developers will answer by running their own workloads. What Android Bench establishes is that for the specific work Android developers do every day, Google is currently charging the most and delivering the sixth-best result.
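For teams that want to run that comparison on their own workloads, the underlying calculation is simple: multiply logged token counts by the provider's per-token price and average across runs. The sketch below assumes separate input and output prices quoted per million tokens. All function names and every number in the example are hypothetical placeholders, not actual rates for Gemini or any other model.

```python
def cost_per_run(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD of one run, given prices quoted per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


def average_cost(runs, input_price_per_m, output_price_per_m):
    """Mean cost across logged runs; each run is (input_tokens, output_tokens)."""
    if not runs:
        raise ValueError("need at least one logged run")
    costs = [
        cost_per_run(inp, out, input_price_per_m, output_price_per_m)
        for inp, out in runs
    ]
    return sum(costs) / len(costs)


# Hypothetical token logs and prices, for illustration only.
logged_runs = [(1_000_000, 500_000), (2_000_000, 1_000_000)]
print(average_cost(logged_runs, input_price_per_m=2.0, output_price_per_m=10.0))
```

Running the same logs through each candidate model's price sheet, alongside a task-level pass rate, gives a team its own version of the cost and score columns.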

FAQ

What is Android Bench?
A benchmark leaderboard that evaluates how well AI models perform Android development tasks, reporting both a performance score and an average cost per run based on token consumption.

Why did Gemini 3.5 Flash score lower than Gemini 3.1 Pro Preview?
The benchmark data does not explain the technical reason. It shows the older Pro Preview model scoring 72.4 against Gemini 3.5 Flash's 63.7, with the newer model consuming significantly more tokens per run without producing better results.

How much does Gemini 3.5 Flash cost per benchmark run?
$147.1 on average, based on 355.9 average tokens per run — the highest cost of any model in the current Android Bench rankings.

Which model topped the Android Bench leaderboard?
OpenAI's GPT 5.5, with a score of 74, followed by GPT 5.4 and Gemini 3.1 Pro Preview, both at 72.4.
