Google shipped Gemini 3.5 Flash on May 19, 2026 at Google I/O, and the headline claim is a real one: a Flash-tier model now beats last generation's flagship. Google DeepMind's chief technologist Koray Kavukcuoglu put it plainly — 3.5 Flash "outperforms our latest frontier model, 3.1 Pro, on nearly all the benchmarks."
That is genuinely notable. The industry's working assumption has been that the smartest models are the slowest and most expensive, and the "Flash" or "mini" tier is where you go when you need cheap and fast and can accept being dumber. Gemini 3.5 Flash scrambles that.
But the honest version of this story has two asterisks: it does not beat 3.1 Pro at everything, and the "Flash" brand no longer means "cheap." Here is the full picture.
What Launched
Gemini 3.5 Flash is generally available the same day it was announced — no preview period — across the Gemini app, AI Mode in Google Search, Google Antigravity, the Gemini API (AI Studio and Android Studio), Vertex AI, and Gemini Enterprise.
| Spec | Gemini 3.5 Flash |
|---|---|
| API model ID | gemini-3.5-flash |
| Context window | 1M input tokens / 64K output tokens |
| Knowledge cutoff | January 2026 |
| Modalities | Text, image, speech, video in; text out. Reasoning model with a thinking-budget mode. |
| Output speed | ~280 tokens/sec (independently measured by Artificial Analysis, #2 of all models tracked) |
| API price | $1.50 input / $9.00 output / $0.15 cached input, per 1M tokens |
A second model, Gemini 3.5 Pro, is not out yet. Google says it is "already being used internally" with a rollout expected next month. No specs or benchmarks for Pro have been published.
The Benchmarks, and What They Actually Mean
Google led the launch with four benchmark numbers. Here is what each one is, so the scores mean something:
| Benchmark | What it tests | 3.5 Flash score |
|---|---|---|
| Terminal-Bench 2.1 | Real coding-agent quality — inspect a sandboxed environment, edit files, run commands, recover from errors, finish a task end-to-end | 76.2% |
| MCP Atlas | Tool-use competency against real Model Context Protocol servers, with multi-hop reasoning (one tool's output feeds the next) | 83.6% |
| CharXiv Reasoning | Reading and reasoning about real scientific charts and diagrams (not toy visual quizzes) | 84.2% |
| GDPval-AA | Performance on economically valuable real-world work across many occupations, scored as an Elo rating | 1656 Elo |
In plain English: on the percentage benchmarks, 75 to 85% is the current frontier band for these hard 2026 agentic suites, so 76 to 84% is genuinely strong. The Elo number is the eye-catcher — 1656 on GDPval-AA puts 3.5 Flash in the top cluster of all models, just behind GPT-5.4 at extra-high effort.
Flash vs last-gen Pro, head to head
This is the news. Google's published numbers, reproduced by independent trackers:
| Benchmark | 3.5 Flash | 3.1 Pro | Winner |
|---|---|---|---|
| Terminal-Bench 2.1 (coding agent) | 76.2% | 70.3% | Flash +5.9 |
| MCP Atlas (tool use) | 83.6% | 78.2% | Flash +5.4 |
| CharXiv Reasoning (multimodal) | 84.2% | 83.3% | Flash +0.9 |
| Finance Agent v2 | 57.9% | 43.0% | Flash +14.9 |
| GDPval-AA (real-world work, Elo) | 1656 | 1314 | Flash +342 |
A Flash-tier model beating last year's flagship by 342 Elo points on real-world work, while running about 4x faster, is the kind of result that resets expectations.
The Asterisk: Read the Benchmark Menu
Here is the part most launch-day coverage skipped. Google chose those four headline benchmarks deliberately — they are all agentic, tool-use, and multimodal tests. Google did not lead with the hardest pure-reasoning benchmarks. When you look at those, the story changes:
| Benchmark | 3.5 Flash | 3.1 Pro | Winner |
|---|---|---|---|
| Humanity's Last Exam (hard reasoning) | 40.2% | 44.4% | 3.1 Pro +4.2 |
| ARC-AGI-2 (abstract reasoning) | 72.1% | 77.1% | 3.1 Pro +5.0 |
So the accurate headline is not "Flash beats Pro at everything." It is: a Flash-tier model now beats last generation's flagship at *the work most people actually deploy agents for* — coding, tool use, multimodal understanding, economically valuable tasks — while still trailing it on the hardest abstract-reasoning tests.
That nuance is not a knock on Gemini 3.5 Flash. It is a more useful truth than the marketing line. If your workload is agentic coding or document processing, 3.5 Flash is a genuine upgrade over 3.1 Pro. If your workload is frontier-grade abstract reasoning, last year's Pro is still ahead — and so are this year's actual flagships.
Get the Weekly IT + AI Roundup
What changed this week in NinjaOne, ServiceNow, CrowdStrike, and AI. One email, every Monday.
No spam, unsubscribe anytime. Privacy Policy
Where It Actually Ranks
On the Artificial Analysis Intelligence Index — the most-cited independent composite "how smart is it" score — Gemini 3.5 Flash lands at 55. That is up 9 points from Gemini 3 Flash, and good for roughly 7th of about 147 models tracked. But it is not the top:
| Model | AA Intelligence Index v4 |
|---|---|
| GPT-5.5 (xhigh) | 60 |
| Claude Opus 4.7 (max effort) | 57 |
| Gemini 3.1 Pro | ~57 |
| Gemini 3.5 Flash | 55 |
| Grok 4.3 (high) | 53 |
So Gemini 3.5 Flash is flagship-*adjacent*, not flagship-beating. GPT-5.5 and Claude Opus 4.7 are still smarter on the composite. The real story is the *combination*: 3.5 Flash lands within 2 to 5 points of the flagships while running several times faster and costing a fraction as much. Artificial Analysis describes it as sitting on the "speed-intelligence Pareto frontier" — you cannot get more intelligence at that speed, and you cannot get more speed at that intelligence.
When Google said the model "tops the right quadrant" of the Artificial Analysis index, that is the intelligence-vs-speed chart they mean. High intelligence, very high speed.
The Price Story: "Flash" Does Not Mean "Cheap" Anymore
Google's announcement said Gemini 3.5 Flash costs "less than half the cost of other frontier models." That holds up — against US flagships:
| Model | Input / 1M | Output / 1M |
|---|---|---|
| Gemini 3.5 Flash | $1.50 | $9.00 |
| GPT-5.5 | $5.00 | $30.00 |
| Claude Opus 4.7 | $5.00 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Grok 4.3 | $1.25 | $2.50 |
| Gemini 3 Flash (prior gen) | $0.50 | $3.00 |
Against GPT-5.5 and Claude Opus 4.7, Gemini 3.5 Flash is roughly 65 to 70% cheaper. On a blended basis it runs about one-third the cost of those flagships. That is real, and for high-volume workloads it matters.
But notice the bottom two rows. Gemini 3.5 Flash is 3x the price of the previous Gemini 3 Flash ($1.50/$9.00 vs $0.50/$3.00). And Grok 4.3's output tokens cost $2.50 against Flash's $9.00.
Independent developer Simon Willison flagged this on launch day, noting that "all three of the major AI labs are starting to probe the price tolerance of their API customers." The top-voted Hacker News thread was titled, bluntly, *"It's discouraging to see Google price Gemini 3.5 Flash at 3x the cost of Gemini 3 Flash."* Artificial Analysis found that running its full benchmark suite on 3.5 Flash cost about 5.5x more than on Gemini 3 Flash — because the model both costs more per token *and* burns more reasoning tokens.
The takeaway for buyers: "Flash" used to be the budget tier. It is now a mid-priced tier that happens to be fast. Cheap only when your comparison is a flagship.
Who Should Use Gemini 3.5 Flash
This is a buyer's-guide question, and there is no single right answer. Pick by workload:
- ●Choose Gemini 3.5 Flash for high-volume agentic and multimodal workloads — coding agents, document and chart processing, tool-use pipelines — where speed and cost matter and you do not need the absolute top of the intelligence index. Its ~280 tokens/sec and 1M context make it strong for long autonomous runs.
- ●Choose GPT-5.5 when you need the highest raw intelligence on the hardest reasoning tasks.
- ●Choose Claude Opus 4.7 for the deepest production coding — it still leads SWE-bench.
- ●Choose Grok 4.3 when output cost is the dominant constraint — its $2.50 output price undercuts everyone.
- ●Choose Gemini 3 Flash (the old one) if it is still available and your task is simple — it is one-third the price and may be all you need.
The honest summary: no model wins everything. Gemini 3.5 Flash wins the "fast, capable, and not flagship-priced" quadrant, and it wins it convincingly.
What's Next
Gemini 3.5 Pro lands next month. Google has not published specs, but it shares the coding-and-agentic focus of 3.5 Flash. If 3.5 Flash already matches last year's Pro, the question for 3.5 Pro is whether it can take the Artificial Analysis intelligence crown back from GPT-5.5 — which would make it the first Google model to do so this cycle.
For now, Gemini 3.5 Flash is exactly what it claims to be, with the asterisks intact: a fast, mid-priced model that does the agentic work of last year's flagship, ships in everything Google makes, and quietly costs three times what the last Flash did. Read the benchmark menu, check the price against your actual alternative, and it is an easy model to recommend for the workloads it is built for.
Sources
- ●Google: A new era of intelligence with Gemini 3.5
- ●Artificial Analysis: Gemini 3.5 Flash model page
- ●Simon Willison: Gemini 3.5 Flash
- ●TechCrunch: Google bets its next AI wave on agents, not chatbots
- ●VentureBeat: Gemini 3.5 Flash and enterprise AI costs
- ●The New Stack: Gemini 3.5 Flash beats the frontier models
- ●Hacker News: Gemini 3.5 Flash pricing discussion