AI Models

OpenAI's Voice Triple-Launch: GPT-Realtime-2, Translate, and Streaming Whisper

May 10, 2026 · 7 min read · By T.W. Ghost
OpenAI · GPT-Realtime · Whisper · Voice AI · Translation · Realtime API · AI Agents

OpenAI just shipped three voice models in one announcement on May 7, 2026, and the bigger story is the unbundling. Where xAI launched a single flagship voice agent two weeks earlier, OpenAI split the voice problem into three distinct products and shipped each one as its own model.

You no longer pick "a voice model." You pick the right one for the job.

| Model | Job | Price |
| --- | --- | --- |
| GPT-Realtime-2 | Voice agent that reasons and uses tools | $32 / 1M audio input tokens, $64 / 1M output tokens |
| GPT-Realtime-Translate | Live two-way speech translation | $0.034 per minute |
| GPT-Realtime-Whisper | Streaming speech-to-text | $0.017 per minute |

All three drop into the existing Realtime API. The interesting one to look at first is the agent.


GPT-Realtime-2: GPT-5 Reasoning in a Voice Loop

This is OpenAI's first voice model with GPT-5-class reasoning baked in. The headline feature is that you can now control how hard the model thinks, in five steps:

Reasoning effort levels: minimal, low, medium, high, xhigh (default: low)

That knob is new to voice. With GPT-Realtime-1.5 there was no control at all: the model either handled the request or it did not. With Realtime-2, you can dial up reasoning for hard turns and stay snappy for "what time is it."
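If Realtime-2 keeps the existing Realtime API convention of configuring a session via a `session.update` event, the knob might look like the sketch below. This is an assumption: the announcement names the five levels, but the exact wire format and the `reasoning.effort` field name are guesses.

```python
# Hypothetical sketch: per-session reasoning effort on the Realtime API.
# The "reasoning": {"effort": ...} field name is a guess -- OpenAI names
# the five levels, but the wire format is not documented in this post.
import json

session_update = {
    "type": "session.update",            # standard Realtime API event type
    "session": {
        "model": "gpt-realtime-2",
        "reasoning": {"effort": "low"},  # minimal | low | medium | high | xhigh
    },
}

def set_effort(ws, level: str) -> None:
    """Re-send session.update mid-call to dial effort up for a hard turn."""
    ws.send(json.dumps({
        "type": "session.update",
        "session": {"reasoning": {"effort": level}},
    }))
```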

OpenAI publishes two evals to show the gain:

  • Big Bench Audio: GPT-Realtime-2 (high) scores 15.2% higher than GPT-Realtime-1.5
  • Audio MultiChallenge: GPT-Realtime-2 (xhigh) scores 13.8% higher than GPT-Realtime-1.5

The other production-relevant changes:

  • Context window: 32K → 128K. Long support calls and multi-step agentic flows now actually fit.
  • Preambles. The agent can say "let me check that" or "one moment" before the real answer, so users hear something during reasoning.
  • Parallel tool calls with audible narration. It can fire multiple tools at once and say "checking your calendar" or "looking that up" while it works (sketched after this list).
  • Recovery behavior. Instead of failing silently, it can say "I'm having trouble with that right now" and continue the conversation.
  • Stronger domain vocabulary. Healthcare terms, proper nouns, and account numbers come through more reliably.
  • Controllable tone. Calm for issue resolution, empathetic for frustrated users, upbeat for confirmations.
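For the tool-calling items above, here is a hedged sketch of what wiring tools into a Realtime-2 session could look like. The function-tool shape follows the existing Realtime API; the `check_calendar` and `search_listings` tools are made up for illustration, and whether narration needs explicit prompting or happens automatically is an assumption.

```python
# Sketch: declaring tools a Realtime-2 agent may call in parallel.
# Tool shape follows the existing Realtime API function-tool format.
# Both tools below are hypothetical, invented for illustration.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "Narrate briefly while tools run, e.g. 'checking your calendar'. "
            "If a tool fails, say so and keep the conversation going."
        ),
        "tools": [
            {
                "type": "function",
                "name": "check_calendar",      # hypothetical tool
                "description": "Look up the user's calendar for a date range.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "start": {"type": "string"},
                        "end": {"type": "string"},
                    },
                    "required": ["start", "end"],
                },
            },
            {
                "type": "function",
                "name": "search_listings",     # hypothetical tool
                "description": "Search real-estate listings by criteria.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        ],
    },
}
```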

The Zillow case study is the one to pay attention to. On their hardest adversarial benchmark, after prompt optimization Realtime-2 hit 95% call success vs 69% on Realtime-1.5, a 26-point lift. Zillow specifically called out Fair Housing compliance robustness, which is the kind of regulatory guardrail that quietly kills voice deployments in regulated industries.


GPT-Realtime-Translate: 70 In, 13 Out

The translation model is a different shape than the agent. It does one thing: keep two speakers in sync across languages, in real time, while they talk naturally.

  • 70+ input languages
  • 13 output languages
  • $0.034 per minute
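A minimal sketch of what starting a translation session could look like. The model name comes from the announcement; the `output_language` field is hypothetical, and the real API may auto-detect the input side rather than take it as a parameter.

```python
# Minimal sketch: a two-way translation session. The model name is from
# the announcement; "output_language" is a hypothetical field name, and
# input-language auto-detection across the 70+ languages is assumed.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-translate",
        "output_language": "de",  # hypothetical: one of the 13 output languages
    },
}

# At $0.034/min, a 10-minute bilingual support call costs about $0.34.
```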

Deutsche Telekom is using it for multilingual customer support so callers can speak the language they are most comfortable in. Vimeo is using it to translate product education videos live as they play, removing the need to produce localized versions ahead of time.

The most interesting datapoint comes from BolnaAI in India:

*"In our evals across Hindi, Tamil, and Telugu, GPT-Realtime-Translate delivered 12.5% lower Word Error Rates than any other model we tested."*

>

Prateek Sachan, CTO, BolnaAI

Regional language phonetics is where most translation models fall apart. A 12.5% WER reduction on Indian languages specifically is a notable claim from a customer, not a vendor.

The other angle worth noticing: at $0.034/min, two-way live translation costs less than half what a basic voice agent costs. That changes which apps can afford to be multilingual by default.


GPT-Realtime-Whisper: Streaming STT

The third model is the one developers have been asking for since the original Whisper shipped: streaming transcription at production-grade latency.

  • $0.017 per minute
  • Transcribes audio as the speaker talks, not after they stop
  • Native fit for live captioning, meeting notes that update in real time, support workflows that build a transcript as the conversation happens

This is the cheapest of the three. At under two cents per minute, it is effectively a commodity layer for any product that needs to know what was said. Combine it with GPT-Realtime-2 for products that need to listen continuously even when the agent is not actively responding.
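If the new model slots into the Realtime API's existing transcription-session shape, streaming audio in and getting partial transcripts back could look roughly like this. The `input_audio_buffer.append` event and `input_audio_transcription` field exist in the current API; pointing them at the new model name is the assumption.

```python
# Sketch: streaming STT over the Realtime API. The event and session field
# exist in the current API; "gpt-realtime-whisper" as the transcription
# model is the assumption here.
import base64
import json

session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {"model": "gpt-realtime-whisper"},
        "turn_detection": {"type": "server_vad"},  # server segments speech
    },
}

def stream_chunk(ws, pcm16_bytes: bytes) -> None:
    """Append one chunk of microphone audio; partial transcripts arrive
    as server events while the speaker is still talking."""
    ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    }))
```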


The Three Patterns OpenAI Is Selling

Read together, the three models map to three voice product patterns OpenAI explicitly names:

  • Voice-to-action. User describes what they want, the system reasons and uses tools to do it. Zillow's "find me homes within my BuyAbility, avoid busy streets, and schedule a tour for Saturday" is the canonical example.
  • Systems-to-voice. Software turns context into proactive spoken guidance. The travel-app example: "Your inbound flight is delayed, but you can still make your connection. I found the new gate."
  • Voice-to-voice. AI keeps live conversations flowing across languages, contexts, or speakers. Deutsche Telekom's multilingual support is the example here.

Most real products will combine all three. Priceline is reportedly building toward a future where a traveler manages an entire trip by voice: searching, booking, replanning after delays, and translating ground-side conversations.


Pricing Math: Per-Minute Reality

OpenAI prices Realtime-2 per token, which makes it hard to compare against per-minute models. At typical speaking rates (about 1,500 input audio tokens per minute and 750 output tokens per minute), the all-in conversational rate works out to roughly:

| Model | Approx. all-in per-minute |
| --- | --- |
| GPT-Realtime-2 | ~$0.10/min |
| GPT-Realtime-Translate | $0.034/min |
| GPT-Realtime-Whisper | $0.017/min |
| Grok Voice Think Fast 1.0 | $0.05/min |

That puts GPT-Realtime-2 at roughly 2x the per-minute cost of Grok Voice Think Fast 1.0, which xAI launched two weeks earlier. OpenAI did not publish a τ-voice Bench score for Realtime-2, so a direct quality comparison is not possible yet, but the pricing gap is real.
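The ~$0.10/min figure is easy to sanity-check from the published token prices and the speaking-rate assumptions above:

```python
# Back-of-envelope check of the ~$0.10/min figure for GPT-Realtime-2.
input_price, output_price = 32.0, 64.0   # $ per 1M audio tokens
in_tok_min, out_tok_min = 1_500, 750     # rough tokens per spoken minute

per_minute = (in_tok_min / 1e6) * input_price + (out_tok_min / 1e6) * output_price
print(f"${per_minute:.3f}/min")          # $0.096/min -> roughly $0.10
```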


When Each One Wins

Practical guidance based on what has been shipped:

Pick GPT-Realtime-2 if you need:

  • 128K context for long, multi-turn flows
  • Tunable reasoning effort per turn
  • Strong domain vocabulary (healthcare, finance, legal proper nouns)
  • Tight integration with the rest of the OpenAI stack (Codex, Agents SDK, EU data residency)

Pick GPT-Realtime-Translate if you need:

  • Two-way live translation across 70+ input languages
  • Specifically: Indian regional phonetics, multilingual EU support, live media translation

Pick GPT-Realtime-Whisper if you need:

  • Streaming transcription only, no agent behavior
  • Live captions, real-time meeting notes, support call transcripts

Look elsewhere if you need:

  • Lowest per-minute cost on a smart agent (Grok Voice Think Fast 1.0 is roughly half the price)
  • Published τ-voice Bench scores for a direct quality comparison

Bottom Line

OpenAI's voice strategy is clearer now: instead of one model that does everything, three models that each do one thing well, all on the same API. That is a developer-friendly bet, and the reasoning-effort knob in particular is a real differentiator that the per-minute competitors do not have.

The pricing puts Realtime-2 at a premium versus Grok Voice Think Fast 1.0 on per-minute math, but Translate and Whisper are aggressively cheap and do not have direct equivalents from xAI. For most production stacks the right answer is going to be a mix.

The voice race in 2026 is no longer about who has the smartest model. It is about who has the right tool for each turn of a conversation. OpenAI just shipped three.