
GPT-5.5 Ships at Double the Price: The Benchmark Picture Is More Mixed Than OpenAI's Marketing

April 23, 2026 · 8 min read · By T.W. Ghost
OpenAI · GPT-5.5 · GPT-5 · AI Models · Claude Opus 4.7 · Gemini 3.1 Pro · Codex · Agentic AI

Release Summary

OpenAI shipped GPT-5.5 on April 23, 2026. Two variants ship at once:

  • GPT-5.5 (standard) — Plus, Pro, Business, Enterprise in ChatGPT and Codex. $5 per million input tokens, $30 per million output.
  • GPT-5.5 Pro (higher performance, longer thinking) — Pro, Business, Enterprise only. $30 per million input, $180 per million output.

Free-tier users: no date announced. API access: "very soon." Codex gets a 1M token context window and an optional fast-mode at 2.5x cost.

The headline pitch is "a new class of intelligence" and "the next step toward a new way of getting work done on a computer." Translation: OpenAI is doubling down on agentic workflows — GPT-5.5 is designed to plan, use tools, check its work, and iterate without asking for help between steps.

The price is the story that deserves more attention than OpenAI wants it to get. API pricing doubled vs GPT-5.4 ($2.50/$15 → $5/$30 per million input/output tokens). The Pro tier at $30/$180 runs 6x Claude Opus 4.7's input price and 7.2x its output price on the Anthropic API. OpenAI is tiering up and letting the cheaper options stay around for buyers who can't absorb the jump.


The Benchmark Picture (Honestly)

The marketing focuses on what GPT-5.5 wins. Here's the full scorecard.

Where GPT-5.5 wins

Benchmark                                GPT-5.5   Claude Opus 4.7   Gemini 3.1 Pro
Artificial Analysis Intelligence Index   60        57                57
Terminal-Bench 2.0 (shell)               82.7%     69.4%             68.5%
FrontierMath Tier 4                      35.4%     22.9%             16.7%
FrontierMath Tier 1-3 (Pro)              52.4%     n/a               n/a
BrowseComp (Pro variant)                 90.1%     85.9%             n/a

Terminal-Bench is the clean win — shell automation and scripting is GPT-5.5's strongest suit by a wide margin. FrontierMath Tier 4 shows real reasoning depth. The Intelligence Index lead is 3 points — meaningful but narrow.

Where GPT-5.5 loses

Benchmark                      GPT-5.5   Claude Opus 4.7   Gemini 3.1 Pro
SWE-Bench Pro (complex code)   58.6%     64.3%             n/a
MCP Atlas (tool use)           75.3%     79.1%             78.2%

SWE-Bench Pro is the single most relevant benchmark for real-world coding assistants. Claude Opus 4.7 still wins it by ~6 points. If your job is "write and refactor production code," GPT-5.5 is not the obvious upgrade.

MCP Atlas measures how well a model uses external tools — the entire premise of "agentic workflows." GPT-5.5 posts 75.3% against Opus 4.7's 79.1% and Gemini 3.1 Pro's 78.2%. OpenAI is pitching agentic capability while scoring *last* on the most direct agentic benchmark. That is a credibility issue the marketing does not address.
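Pulling both tables into one scorecard makes the margins easier to scan. A quick sketch using the numbers above (cells the source doesn't report are marked `None`; the helper function is illustrative):

```python
# Benchmark scores from the tables above; None marks unreported cells.
SCORES = {
    #                      GPT-5.5  Opus 4.7  Gemini 3.1 Pro
    "Terminal-Bench 2.0":  (82.7,   69.4,     68.5),
    "FrontierMath Tier 4": (35.4,   22.9,     16.7),
    "SWE-Bench Pro":       (58.6,   64.3,     None),
    "MCP Atlas":           (75.3,   79.1,     78.2),
}

def margin_vs_best_rival(bench: str) -> float:
    """GPT-5.5's margin over its best-scoring rival (negative = it loses)."""
    gpt, *rivals = SCORES[bench]
    best_rival = max(r for r in rivals if r is not None)
    return round(gpt - best_rival, 1)

for bench in SCORES:
    print(f"{bench:22s} {margin_vs_best_rival(bench):+.1f} pts")
```

Run it and the pattern is the article's thesis in four lines: double-digit wins on terminal work and frontier math, losses on exactly the two benchmarks closest to everyday agentic coding.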

The hallucination footnote nobody is quoting

On AA-Omniscience, GPT-5.5 posts the highest-ever accuracy at 57% — and carries an 86% hallucination rate.

Read that again. Highest accuracy on the benchmark, worst hallucination rate on the benchmark: when GPT-5.5 doesn't know an answer, it fabricates one instead of abstaining 86% of the time. This is a known failure mode of RLHF-optimized models, and it got *worse* in 5.5, not better.

For coding assistants, this matters less than for research assistants. For customer-facing agents, it is a real problem.
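A common mitigation for customer-facing agents is to force an explicit abstention path plus a self-check pass, rather than trusting the model's default confidence. A minimal sketch, assuming only a `generate` callable that maps a prompt string to a reply string (the prompt wording is illustrative, not any vendor's API):

```python
ABSTAIN = "I don't know"

def answer_or_abstain(generate, question: str) -> str:
    """Ask the model to abstain when unsure, then verify with a second pass.

    `generate` is any callable mapping a prompt string to a reply string.
    """
    prompt = (
        f"Answer the question, or reply exactly '{ABSTAIN}' if you are "
        f"not certain.\n\nQuestion: {question}"
    )
    draft = generate(prompt).strip()
    if draft == ABSTAIN:
        return ABSTAIN

    # Second pass: ask the model to audit its own draft before we trust it.
    check = generate(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Reply 'yes' only if the draft is fully supported, else 'no'."
    ).strip().lower()
    return draft if check.startswith("yes") else ABSTAIN
```

This trades some recall for precision, which is usually the right trade when a wrong answer reaches a customer.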


What "New Class of Intelligence" Actually Means

Stripped of marketing, OpenAI's claim is that GPT-5.5 can collapse multi-step workflows. Where you used to chain three or four GPT-5.4 calls to get through a complex task, GPT-5.5 can handle it in one or two passes because it plans, tool-switches, self-checks, and continues without handoffs.

Practically, the capabilities list is:

  • Extended coding across longer contexts without losing focus
  • Computer use for software operation and interface navigation
  • Real spreadsheet and document manipulation
  • Multi-step web research with reading, cross-referencing, and citation
  • Tool orchestration across sessions without explicit routing

Three of those five have been shipping in Claude Opus 4.7 and Gemini 3.1 Pro for months. The differentiator is less "capability" and more "capability packaged into a product buyer's story." OpenAI's framing — one model that handles everything, longer runs without handoffs — is a product-marketing play, not a technical leap.

Specifically: OpenAI is not showing demos of tasks GPT-5.4 couldn't complete. They are showing the *same* tasks completed with fewer intermediate prompts. That's valuable for productivity, but "new class of intelligence" is a marketing frame, not an honest benchmark claim.
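The workflow-collapse claim is easiest to see as code. Instead of you orchestrating three or four calls, the model drives a plan/act/check loop and decides when it is done. A rough sketch of that loop (the `model` and `tools` interfaces here are hypothetical, not OpenAI's actual API):

```python
def run_agentic_task(model, tools: dict, task: str, max_steps: int = 20):
    """Plan/act/check loop: the model picks tools and decides when to stop.

    `model(history)` returns a dict like {"action": name, "args": ...}
    or {"action": "finish", "result": ...}. Both interfaces are
    illustrative, not a real SDK.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model(history)
        if step["action"] == "finish":
            return step["result"]
        # Model chose a tool; run it and feed the observation back.
        observation = tools[step["action"]](*step.get("args", ()))
        history.append({"role": "tool", "content": observation})
    raise RuntimeError("task did not converge within max_steps")
```

The loop itself is trivial; the pitch is that GPT-5.5 needs fewer iterations of it, and fewer human nudges between iterations, than GPT-5.4 did.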



Pricing Shock vs the Competitive Set

Model                       Input ($/1M)   Output ($/1M)   Source
GPT-5.4 (still available)   $2.50          $15             OpenAI
GPT-5.5 standard            $5             $30             OpenAI (2x 5.4)
GPT-5.5 Pro                 $30            $180            OpenAI
Claude Opus 4.7             $5             $25             Anthropic
Claude Sonnet 4.6           $2.50          $12             Anthropic
Gemini 3.1 Pro              $2.50          $15             Google

GPT-5.5 standard prices in line with Claude Opus 4.7 on input ($5), runs slightly higher on output ($30 vs $25). GPT-5.5 Pro at $30/$180 is a different product — a reasoning-heavy tier that competes with Opus 4.7 in capability but at 6x the cost on input. You pay for longer thinking budgets.

The business logic is clear: OpenAI is segmenting users who would have paid $200/month for Pro into people who will now pay per-token for very expensive reasoning. Enterprise budgets handle this fine. Individual developers will feel it.
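To make the per-token math concrete, here is a minimal sketch that turns the pricing table above into a monthly bill estimate (prices come from the table; the 200M/40M token workload is illustrative):

```python
# Per-million-token prices (input, output) in USD, from the table above.
PRICES = {
    "gpt-5.4": (2.50, 15.0),
    "gpt-5.5": (5.0, 30.0),
    "gpt-5.5-pro": (30.0, 180.0),
    "claude-opus-4.7": (5.0, 25.0),
    "claude-sonnet-4.6": (2.50, 12.0),
    "gemini-3.1-pro": (2.50, 15.0),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly cost in USD for a given token volume."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Illustrative workload: 200M input tokens, 40M output tokens per month.
for model in PRICES:
    print(f"{model:18s} ${monthly_cost(model, 200_000_000, 40_000_000):,.2f}")
```

At that volume, GPT-5.5 standard lands within a couple hundred dollars of Opus 4.7, while the Pro tier runs roughly 6x either of them. The jump is in the reasoning tier, not the baseline.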


When to Use GPT-5.5

Use GPT-5.5 if:

  • You live in the terminal. Terminal-Bench 2.0 at 82.7% is a real advantage for shell automation, DevOps scripting, CI workflows.
  • You do research-heavy work and BrowseComp Pro's 90.1% web research score is worth the Pro pricing. (It is, if you were previously chaining three research prompts to get one clean answer.)
  • You do advanced math or quantitative analysis where FrontierMath performance translates to your work.
  • You're already on ChatGPT Plus or Pro and GPT-5.5 comes included. Testing it costs nothing.

Stay on Claude Opus 4.7 if:

  • Your primary use is production code review, refactoring, and complex codebases. SWE-Bench Pro 64.3% vs 58.6% is a real, measurable gap in favor of Claude.
  • You care about tool-use quality. MCP Atlas 79.1% vs 75.3% means fewer brittle tool calls.
  • You care about hallucination rate. Claude's defaults are meaningfully more conservative, which matters for customer-facing agents.
  • You're on Claude Max or API with prompt caching and already have the workflow dialed in.

Stay on Gemini 3.1 Pro if:

  • You're in the Google Workspace ecosystem (Docs/Sheets/Calendar native integration).
  • You need long-context grounding over large document corpora — Gemini's needle-in-haystack story remains best-in-class.
  • You want near-frontier capability at the lowest price: $2.50/$15 matches GPT-5.4, half the price of GPT-5.5 standard, with an Intelligence Index score only 3 points behind the leader.

Use both if your work is mixed. Most serious teams are already running at least two models and routing by task.
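Routing by task can start as something as simple as a keyword table in front of your client code. A toy sketch that encodes the guidance above (the routing rules and model names are illustrative; tune them to your own workload):

```python
# Crude task router reflecting the guidance above; rules are illustrative.
ROUTES = [
    ({"shell", "terminal", "devops", "ci"},     "gpt-5.5"),
    ({"math", "proof", "research"},             "gpt-5.5-pro"),
    ({"refactor", "review", "codebase", "bug"}, "claude-opus-4.7"),
    ({"sheets", "docs", "long-context"},        "gemini-3.1-pro"),
]

def pick_model(task: str, default: str = "claude-sonnet-4.6") -> str:
    """Return the first route whose keyword set intersects the task words."""
    words = set(task.lower().split())
    for keywords, model in ROUTES:
        if keywords & words:
            return model
    return default  # cheap default for everything else

print(pick_model("refactor the billing codebase"))  # claude-opus-4.7
print(pick_model("write a ci shell script"))        # gpt-5.5
```

In production you would route on embeddings or a cheap classifier model rather than keywords, but the shape is the same: the expensive model only sees the tasks it actually wins.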


The Strategic Read

GPT-5.5 is OpenAI's response to a specific market fact: enterprise buyers love Claude for code and Gemini for research, and OpenAI has been bleeding both markets for two quarters. The answer is not to win on any single benchmark but to bundle "agentic workflow" into a narrative that makes GPT-5.5 the default for "one model that does everything."

It may well work. Buyers who want to standardize on one model for simplicity will pick the one with the strongest "whole-stack" story, not the one that wins SWE-Bench by 6 points or MCP Atlas by 4.

But the benchmark truth is unchanged:

  • Best for code: Claude Opus 4.7
  • Best for tools/agents in practice: Claude Opus 4.7 on MCP Atlas, Gemini 3.1 Pro close second
  • Best for shell/terminal: GPT-5.5
  • Best for math and research: GPT-5.5 Pro (if you can absorb the price)
  • Best for cost: Gemini 3.1 Pro or Claude Sonnet 4.6
  • Worst hallucination rate on record: GPT-5.5

Pick accordingly, and don't let the "new class of intelligence" framing pick for you.


Which AI Should You Use?

If this breakdown left you uncertain which model fits your actual workflow, the 2-minute quiz at llmmatchmaker.com/quiz matches you to the right AI based on what you're trying to make — not which vendor shouted loudest today.

The best model is the one that gets your work done with the fewest revisions. That varies by the job, not by the launch day.

