VoiceAgentGuide

OpenAI Releases Three Realtime Voice Models for Developer Apps

The launch lands as a meaningful step for tool-using voice agents and a routine refresh for everything else. gpt-realtime-2 is the only one of the three that genuinely changes what one model can do inside a real-time loop. Live translation and streaming speech-to-text have been competitive markets for a while; OpenAI's offerings here are credible, but not category-defining.

What OpenAI shipped on May 7

The three models all live in OpenAI's Realtime API and are billed per audio token or per minute, depending on the model.

gpt-realtime-2 is the one most builders will actually integrate. It carries GPT-5-class reasoning into a voice loop, expands the context window to 128K tokens (up from 32K), and supports parallel tool calls with model narration covering the dead air. On Big Bench Audio at the high reasoning tier it scores 96.6%, up from 81.4% on the prior generation, a 15.2 percentage-point jump. Two new voices, Cedar and Marin, are exclusive to this release. Pricing runs $32 per 1M audio input tokens, $0.40 per 1M cached input tokens, and $64 per 1M audio output tokens.

gpt-realtime-translate does live speech-to-speech translation across 70+ input languages into 13 output languages, and keeps pace with the speaker. It bills at $0.034 per minute.

gpt-realtime-whisper is a streaming transcription model that emits tokens as the speaker talks. It bills at $0.017 per minute.
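
For the two per-minute models, the bill is a straight multiplication. The sketch below uses the rates quoted above and assumes billing is linear in call duration; whether OpenAI rounds to whole minutes or seconds is not stated here, so treat the function as an estimate. The names `RATE_PER_MINUTE` and `per_minute_cost` are illustrative only.

```python
# Per-minute rates quoted in this article, in USD.
RATE_PER_MINUTE = {
    "gpt-realtime-translate": 0.034,
    "gpt-realtime-whisper": 0.017,
}


def per_minute_cost(model: str, seconds: float) -> float:
    """Estimated cost in USD, ASSUMING linear billing with no rounding."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return RATE_PER_MINUTE[model] * (seconds / 60.0)


if __name__ == "__main__":
    # Estimate a 5-minute (300 second) call on each model.
    for model in RATE_PER_MINUTE:
        print(f"{model}: ${per_minute_cost(model, 300):.3f} per 5-minute call")
```

Under that assumption a 5-minute call costs about $0.17 on gpt-realtime-translate and about $0.085 on gpt-realtime-whisper.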

The Audio MultiChallenge benchmark also moved, but only at the highest reasoning tier: gpt-realtime-2 scores 48.5% at xhigh vs 34.7% for the prior generation, a 13.8 percentage-point gain. That gain is real, but reaching for it costs latency and output tokens. More on the tier choice below.

OpenAI also cites Zillow seeing a 26-point lift in call-success rate on its hardest adversarial benchmark, from 69% to 95%, after moving to gpt-realtime-2. That's a partner-disclosed number on a partner-defined benchmark, so weight it accordingly, but it's one of the few public datapoints that grounds the abstract reasoning gains in a real call flow.

The reasoning-effort knob is the lever you'll actually tune

The single most useful new control surface is the reasoning-effort tier. gpt-realtime-2 exposes five values (minimal, low, medium, high, xhigh), with low set as the default explicitly to keep latency tight.

Most teams should default to low for general turn-taking and barge-in handling. Bump to medium only when you've registered a tool call that returns structured data the model has to interpret on the fly, like a CRM lookup or a calendar query. Reach for high or xhigh only on accuracy-critical evaluations where you can afford the extra round-trip time and output tokens.

The +13.8 percentage-point MultiChallenge gain at xhigh is the strongest argument for going higher, but it's the kind of gain that only matters in contexts where one bad turn breaks the call. Most production voice agents are not that. Set the tier at the session level, not the request level, and watch for the easy mistake of leaving xhigh on after a debugging session and shipping with it as the default. Output-token bills compound fast.
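
The tier guidance above can be written down as a small policy function, which makes the default explicit and harder to override by accident. This is a sketch of the rule of thumb in this article, not an OpenAI API: the function names and flags are hypothetical, and how the chosen value is passed into a Realtime session is deliberately left out, so check the API reference for the actual field.

```python
from typing import Literal

# The five tier names listed in this article.
Effort = Literal["minimal", "low", "medium", "high", "xhigh"]


def pick_reasoning_effort(
    interprets_structured_tool_output: bool = False,
    accuracy_critical: bool = False,
    allow_xhigh: bool = False,
) -> Effort:
    """Map a flow's characteristics to a tier, following the article's advice."""
    if accuracy_critical:
        # xhigh must be opted into explicitly; otherwise stop at high.
        return "xhigh" if allow_xhigh else "high"
    if interprets_structured_tool_output:
        # e.g. a CRM lookup or calendar query the model must interpret.
        return "medium"
    # General turn-taking and barge-in handling.
    return "low"


def check_production_default(effort: Effort) -> None:
    """Refuse to ship xhigh as a default left over from debugging."""
    if effort == "xhigh":
        raise ValueError("xhigh should not be the production default tier")
```

Calling `check_production_default` in a deploy check is one way to catch the leftover-debug-setting mistake before it reaches the bill.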

Pricing math for a 5-minute voice agent call

Per-token rates abstract away from real cost. A representative number helps.

Take a 5-minute customer-service call and assume a plausible shape: roughly 30k audio input tokens (caller speech plus system prompts and any tool-call inputs across turns) and 10k audio output tokens (model speech plus tool-call narration).

At $32 per 1M input and $64 per 1M output, that works out to about $0.96 input plus $0.64 output, or $1.60 per call before any caching. That's the worst case.

Once the cache warms up, the numbers improve fast. Cached input tokens bill at $0.40 per 1M, 80x cheaper than fresh input. With about 60% of input tokens hitting the cache after a few turns (system prompt, persistent session state, conversation history), the call pays full price on about 12k input tokens ($0.38) and the cached rate on the other 18k (under $0.01). Input cost falls to around $0.39, and the total lands near $1.03 per 5-minute call.
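
The arithmetic above is easy to rerun for your own call shape. The following is a minimal sketch using the gpt-realtime-2 rates quoted in this article; the function name `call_cost` and the single `cache_hit_rate` parameter are illustrative simplifications, since real cache behaviour varies turn by turn.

```python
# gpt-realtime-2 rates quoted in this article, in USD per 1M audio tokens.
USD_PER_M_INPUT = 32.00
USD_PER_M_CACHED_INPUT = 0.40
USD_PER_M_OUTPUT = 64.00


def call_cost(input_tokens: int, output_tokens: int, cache_hit_rate: float = 0.0) -> float:
    """Estimated cost in USD for one call.

    cache_hit_rate is the fraction of input tokens billed at the cached rate
    (a simplifying assumption: one average rate for the whole call).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    input_cost = (fresh * USD_PER_M_INPUT + cached * USD_PER_M_CACHED_INPUT) / 1_000_000
    output_cost = output_tokens * USD_PER_M_OUTPUT / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    print(f"{call_cost(30_000, 10_000):.2f}")       # 1.60
    print(f"{call_cost(30_000, 10_000, 0.6):.2f}")  # 1.03
```

Note that the output side is untouched by caching: the $0.64 of output in this example is a floor no matter how high the hit rate goes.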

That number is higher than running a dedicated stack of streaming STT, a smaller LLM, and a specialist TTS for the same call. What you buy with the premium is collapsing three vendor relationships into one, parallel tool calls in a single audio loop, and built-in interruption handling. Worth it when the integration overhead matters more than the absolute cost. Worth less when you've already standardised a reliable component stack.

The practical implication is to cache aggressively, or the per-call math gets ugly fast. The 80x gap between fresh and cached input is not a rounding error.
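
One way to turn "cache aggressively" into a target is to invert the cost formula and ask what hit rate a given per-call budget requires. The sketch below assumes the same rates and the same single-average-hit-rate simplification used in the pricing example; `required_cache_hit_rate` is a hypothetical helper, not part of any SDK.

```python
from typing import Optional

# gpt-realtime-2 rates quoted in this article, in USD per 1M audio tokens.
USD_PER_M_INPUT = 32.00
USD_PER_M_CACHED_INPUT = 0.40
USD_PER_M_OUTPUT = 64.00


def required_cache_hit_rate(
    input_tokens: int, output_tokens: int, budget_usd: float
) -> Optional[float]:
    """Smallest cache hit rate (0..1) that keeps one call within budget_usd.

    Returns 0.0 if the call fits the budget with no caching at all, and None
    if even a 100% hit rate cannot reach the budget.
    """
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    output_cost = output_tokens * USD_PER_M_OUTPUT / 1_000_000
    uncached_input_cost = input_tokens * USD_PER_M_INPUT / 1_000_000
    # Saving if every input token were billed at the cached rate.
    max_saving = input_tokens * (USD_PER_M_INPUT - USD_PER_M_CACHED_INPUT) / 1_000_000
    needed = (uncached_input_cost + output_cost - budget_usd) / max_saving
    if needed <= 0:
        return 0.0
    if needed > 1:
        return None
    return needed


if __name__ == "__main__":
    # 30k input / 10k output tokens, aiming for $1.00 per call.
    print(required_cache_hit_rate(30_000, 10_000, 1.00))
```

For the example call, a $1.00 budget needs a hit rate of roughly 63%, and any budget below the $0.64 output cost is unreachable through caching alone.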

Where this changes your stack and where it doesn't

The strongest fit for gpt-realtime-2 is the tool-using voice agent. Think customer-service bots that look up an account in a CRM and explain what they find, or scheduling agents that confirm a slot and send a calendar invite. Round-trip latency and reasoning quality dominate, and the new parallel tool calls plus preambles smooth over multi-step gaps without the line going silent.

The weaker fits are predictable. Anything that depends on sub-100ms cold-start TTS still belongs to a specialist that wins on raw first-byte time. Anything dominated by long read-aloud output with no tool calls is cheaper on a dedicated TTS at a fraction of the per-token rate. And on translation, incumbents like Cartesia and Deepgram already cover the most common language pairs at competitive accuracy, with lower per-minute prices for the simpler ones; gpt-realtime-translate's coverage is good, but $0.034 per minute is not the cheapest option.

If keeping voice, STT, and TTS as swappable components matters to you, so you can change pieces as the market shifts or new pricing curves drop, an open-source telephony orchestrator like Dograh keeps the rest of the stack vendor-neutral.

An honest framing of the launch is that it makes one specific shape of voice agent meaningfully better. It does not retire the specialists. Pick on shape, not on hype.

Pick one tool-call-heavy flow you ship today, clone it onto gpt-realtime-2 with reasoning-effort low, and measure end-to-end latency and cost-per-call across 50 calls against your current stack before deciding. The numbers usually settle the argument.
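
The comparison suggested above only settles anything if both stacks are summarized the same way. Here is a minimal sketch of such a summary, assuming you have already logged end-to-end latency and cost for each call; the `CallSample` type, the field names, and the choice of median, p95, and mean are illustrative, not a prescribed methodology.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallSample:
    latency_ms: float  # end-to-end latency you measured for the call
    cost_usd: float    # cost you computed or were billed for the call


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: List[CallSample]) -> Dict[str, float]:
    """Median and p95 latency plus mean cost for one stack's trial calls."""
    if not samples:
        raise ValueError("need at least one sample")
    latencies = [s.latency_ms for s in samples]
    costs = [s.cost_usd for s in samples]
    return {
        "calls": len(samples),
        "latency_p50_ms": statistics.median(latencies),
        "latency_p95_ms": percentile(latencies, 95),
        "mean_cost_usd": statistics.fmean(costs),
    }
```

Run `summarize` once on the 50 calls from your current stack and once on the 50 from gpt-realtime-2, then compare the two dictionaries side by side. Looking at p95 as well as the median matters for voice, because the slow tail is what callers notice.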

Glossary

Audio token
The billing unit for gpt-realtime-2 in the Realtime API. An audio token represents a small chunk of speech and is billed separately from text tokens. At the call shape used in this article (roughly 30k audio input tokens per 5-minute call), 1M audio input tokens covers about 33 such calls, though the real figure depends on speech density and prompt size.
Preamble
A short filler phrase the model speaks while a tool call is running, so the line does not go silent. Examples include 'let me check that' or 'one second'. Practitioners use these to mask latency during multi-step tool execution.
Reasoning-effort tier
The five-step control on gpt-realtime-2 (minimal / low / medium / high / xhigh) that trades latency and output-token cost for reasoning quality. Default is low. Set it at the session level, not per request.
Cached input tokens
Audio input the API has already processed in a recent session window, billed at $0.40 per 1M tokens instead of $32 per 1M for fresh input. The roughly 80x gap makes cache hit rate a primary cost lever.

Frequently asked questions

What's the difference between gpt-realtime-2 and the original gpt-realtime?
gpt-realtime-2 carries GPT-5-class reasoning into a voice model, expands the context window from 32K to 128K tokens, supports parallel tool calls, and adds two new voices (Cedar and Marin). On Big Bench Audio at the high reasoning tier it scores 96.6% vs the prior generation's 81.4%.
How much does a 5-minute call on gpt-realtime-2 actually cost?
Roughly $1.60 per call before caching, assuming about 30k audio input tokens at $32/1M and 10k audio output tokens at $64/1M. With a cache hit rate of about 60% on input (cached at $0.40/1M, 80x cheaper than fresh), the same call lands near $1.03. Cache hit rate is the primary cost lever.
Should I switch from a dedicated streaming STT to gpt-realtime-whisper?
Only if your stack benefits from collapsing transcription into the same Realtime API session as the rest of your agent. At $0.017 per minute, gpt-realtime-whisper is competitive on cost, but a dedicated streaming STT will usually still win on raw first-word latency for one-shot transcription tasks outside a conversation loop.
Does gpt-realtime-2 replace a separate TTS voice, or do I still need one?
gpt-realtime-2 produces speech directly, including the new Cedar and Marin voices, so a separate TTS is not required for tool-using conversation flows. For long read-aloud output with no tool calls, a dedicated TTS at a fraction of the per-token rate remains cheaper.
What does the reasoning-effort setting actually do, and which tier should I default to?
The reasoning-effort tier (minimal / low / medium / high / xhigh) trades latency and output-token cost for reasoning quality. Default to low for general turn-taking and barge-in handling. Bump to medium only when a tool call returns structured data the model has to interpret on the fly.
Is gpt-realtime-translate good enough to replace a human interpreter for live calls?
For routine business and customer-support conversations across the supported 70+ input and 13 output languages, gpt-realtime-translate at $0.034 per minute is a credible replacement for an interpreter on most flows. It is not yet a fit for legal proceedings, medical consent, or any setting where an interpretation error carries serious downstream cost.
What is gpt-realtime-2?
gpt-realtime-2 is OpenAI's voice model with GPT-5-class reasoning, released on May 7, 2026. It runs inside OpenAI's Realtime API with a 128K token context window, supports parallel tool calls during a live conversation, and ships with two new voices, Cedar and Marin.
How much does the OpenAI Realtime API cost per minute?
For gpt-realtime-2, costs are per audio token rather than per minute, working out to roughly $1 to $1.60 for a typical 5-minute customer-service call depending on cache hit rate. The two ancillary models bill per minute directly, with gpt-realtime-translate at $0.034 per minute and gpt-realtime-whisper at $0.017 per minute.
When did OpenAI release gpt-realtime-2?
OpenAI released gpt-realtime-2 alongside gpt-realtime-translate and gpt-realtime-whisper on May 7, 2026, in the Realtime API. All three are available to developers through the API and OpenAI Playground at launch.
