Back to Phantom Notes
AI Models

Claude Opus 4.8 Is the New #1 AI Model: Benchmarks, Pricing, and What Changed

May 29, 20267 min readBy T.W. Ghost
ClaudeOpus 4.8AnthropicAI ModelsBenchmarksCodingAgentsClaude Code

Release Summary

Anthropic made Claude Opus 4.8 generally available on May 28, 2026. The headline is not a single benchmark, it is a position: Opus 4.8 retook the #1 spot on the Artificial Analysis Intelligence Index v4.0 with a score of 61.4, moving back ahead of GPT-5.5 (60.2) and its own Opus 4.7 (57.3). The intelligence crown had been GPT-5.5's since April. It is Anthropic's again.

Pricing is unchanged from 4.7. The model ID is claude-opus-4-8, available across Claude products, the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

The short version: same price, a clear lead on coding and real-economic-work benchmarks, a model that is measurably more honest about its own mistakes, and a new parallel-agent capability inside Claude Code.


The Headline: Back to #1

The Artificial Analysis Intelligence Index is an aggregate of ten evaluations, spanning GDPval-AA, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt, and Tau-squared Bench Telecom. No single test decides it, which is what makes the top spot meaningful.

ModelAA Intelligence Index v4.0
Claude Opus 4.861.4
GPT-5.5 (xhigh)60.2
Claude Opus 4.757.3
Gemini 3.1 Pro57.2

The 1.2-point gap over GPT-5.5 is narrow. It reflects consistent outperformance across a broad mix rather than dominance in any one category. The clearest single signal sits inside that index: on GDPval-AA, the benchmark built around real economic-value knowledge work, Opus 4.8 scores 1,890 Elo, a full 121 points ahead of GPT-5.5 in second place. A gap that size implies roughly a two-in-three head-to-head win rate on practical work tasks.


Benchmarks

Here is how Opus 4.8 lands against its predecessor and the two competing flagships. All numbers are from Anthropic's release reporting and Artificial Analysis.

BenchmarkOpus 4.8Opus 4.7GPT-5.5Gemini 3.1 Pro
SWE-bench Verified (coding)88.6%87.6%-80.6%
SWE-bench Pro (coding)69.2%64.3%58.6%54.2%
Terminal-Bench 2.1 (CLI)74.6%66.1%78.2%70.3%
OSWorld-Verified (computer use)83.4%82.8%78.7%76.2%
Humanity's Last Exam (with tools)57.9%54.7%52.2%51.4%
GPQA Diamond (science)93.6%94.2%-94.3%
GDPval-AA (Elo, economic work)1,8901,7531,7691,314
Online-Mind2Web (browser agent)84%---

Three things stand out.

SWE-bench Pro is the number that matters most for working developers. It went from 64.3% to 69.2%, a near-five-point jump, and it is almost eleven points clear of GPT-5.5. SWE-bench Verified, the more saturated benchmark, ticked up to 88.6%. The harder the coding test, the larger Claude's lead.

GDPval-AA is the quiet blockbuster. Coding benchmarks measure coding. GDPval-AA measures whether a model can do the kind of mixed knowledge work that actually pays salaries. A 121-Elo lead there is the strongest evidence that the #1 ranking is not a quirk of one eval.

The honest caveats are real. GPQA Diamond actually dipped slightly, from 94.2% to 93.6%, so on raw graduate-science knowledge Gemini 3.1 Pro still edges it. And Terminal-Bench 2.1 (74.6%) still trails GPT-5.5 (78.2%), so for pure shell and DevOps automation, GPT-5.5 remains the one to beat. Anthropic shipped a model that is broadly ahead, not universally ahead.


Pricing

Unchanged from Opus 4.7, which is the part that makes the upgrade easy to justify.

  • Input: $5 per million tokens
  • Output: $25 per million tokens
  • Context: 1M input tokens, 128K output tokens
  • Fast mode (optional): $10 input / $50 output per million tokens, for roughly 2.5x the throughput

Get the Weekly IT + AI Roundup

What changed this week in NinjaOne, ServiceNow, CrowdStrike, and AI. One email, every Monday.

No spam, unsubscribe anytime. Privacy Policy

The 1M-token context window stays at standard pricing with no long-context premium. Prompt caching and batch API discounts continue to apply.


What's New

Parallel-subagent dynamic workflows in Claude Code. This is the flagship feature. For large-scale problems, Claude Code can now spin up parallel subagents that fan work out and report back, rather than grinding through a long task in a single thread. Think codebase-wide migrations, broad audits, or multi-file refactors that one context window would struggle to hold. It is the agent-orchestration pattern productized.

Effort control on claude.ai. End users, not just API developers, can now trade response quality against speed directly in the web app. The high-effort setting routes more reasoning at the problem, the lower setting answers faster.

Mid-task system messages on the Messages API. Developers can now inject new system-level instructions partway through a task, so an agent's guardrails or goals can be updated without restarting the conversation. This is a meaningful primitive for long-running agents.


The Honesty Upgrade

The most interesting non-benchmark claim is about self-review. Anthropic reports that Opus 4.8 is roughly four times less likely than Opus 4.7 to let flaws in code it has written pass unflagged.

That is a specific, useful behavior. The failure mode it targets is the one every developer using AI has hit: the model writes code, the code has a subtle bug, and the model presents it as finished without noticing. A model that catches its own flaws four times more often is a model you have to babysit less. Combined with the SWE-bench Pro gains, this is the quiet reason the day-to-day coding experience feels different, not just the headline scores.


Should You Upgrade?

For most users, the answer is straightforward, because the price did not move.

  • Coding and agentic work: Yes. Bigger SWE-bench Pro lead, better self-review, parallel subagents in Claude Code. This is the clearest win.
  • Knowledge work and research: Yes. The GDPval-AA and Humanity's Last Exam gains are exactly the practical-reasoning improvements that show up in real tasks.
  • Shell and DevOps automation: Maybe. GPT-5.5 still leads Terminal-Bench 2.1, so if your workload is mostly CLI orchestration, test both.
  • Pure science Q&A: Marginal. GPQA dipped slightly, so this is the one area where 4.7 or Gemini 3.1 Pro is not clearly behind.

Existing Opus 4.7 code paths remain valid, and 4.7 is not immediately deprecated. For direct API use, point requests at claude-opus-4-8. In the desktop app and Claude Code, the model selector lists it once the app is updated. Remember that the VS Code extension bundles its own Claude Code binary separate from any npm-installed CLI, so updating one does not update the other.


What to Watch Next

  • Does the GDPval-AA lead translate to real workflows? A 121-Elo gap on a knowledge-work benchmark is impressive, but the proof is whether teams feel it on actual deliverables over the next few weeks.
  • How far do parallel subagents scale? Dynamic workflows are powerful, but they multiply token spend. The interesting question is where the cost-versus-speed tradeoff lands for everyday tasks versus large migrations.
  • How quickly does GPT-5.5 respond? A 1.2-point Index lead is the kind of margin that flips with the next point release. OpenAI will not sit at #2 quietly.

Opus 4.8 is available now. The factual case for upgrading is the cleanest it has been in a while: same price as 4.7, the #1 spot on the aggregate Intelligence Index, the widest coding lead Anthropic has held, and a model that is genuinely better at catching its own mistakes.


*Not sure which model fits your workflow? Our model comparison breaks down Claude, ChatGPT, Gemini, and Grok side by side, or take the quiz for a personalized match.*