Source URL: https://simonwillison.net/2025/Sep/1/introducing-gpt-realtime/#atom-everything
Source: Simon Willison’s Weblog
Title: Introducing gpt-realtime
Feedly Summary: Introducing gpt-realtime
Released a few days ago (August 28th), gpt-realtime is OpenAI’s new “most advanced speech-to-speech model”. It looks like this is a replacement for the older gpt-4o-realtime-preview model that was released last October.
This is a slightly confusing release. The previous realtime model was clearly described as a variant of GPT-4o, sharing the same October 2023 training cut-off date as that model.
I had expected that gpt-realtime might be a GPT-5 relative, but its training cut-off is still October 2023, whereas GPT-5’s is September 2024.
gpt-realtime also shares the relatively low limits of gpt-4o-realtime-preview: a 32,000 token context window and a 4,096 token maximum output.
The only reference I found to GPT-5 in the documentation for the new model was a note saying "Ambiguity and conflicting instructions degrade performance, similar to GPT-5."
The usage tips for gpt-realtime have a few surprises:
- **Iterate relentlessly.** Small wording changes can make or break behavior.
  Example: Swapping “inaudible” → “unintelligible” improved noisy input handling. […]
- **Convert non-text rules to text:** The model responds better to clearly written text.
  Example: Instead of writing, "IF x > 3 THEN ESCALATE", write, "IF MORE THAN THREE FAILURES THEN ESCALATE."
There are a whole lot more prompting tips in the new Realtime Prompting Guide.
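To make those tips concrete, here's a minimal sketch of passing plainly written instructions to a Realtime session over WebSocket. The endpoint, headers and `session.update` event shape follow the earlier gpt-4o-realtime-preview API; whether gpt-realtime's GA API uses exactly the same shape is an assumption on my part, not something confirmed in the announcement.

```python
# Minimal sketch: configure a Realtime session with plainly written rules.
# URL, headers and event shape follow the gpt-4o-realtime-preview WebSocket
# API; gpt-realtime's GA endpoint may differ.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

# Per the prompting tips: spell out rules as clear text rather than
# pseudo-code, and prefer wording that has tested well in noisy audio
# ("unintelligible" over "inaudible").
INSTRUCTIONS = (
    "You are a phone support agent. "
    "IF MORE THAN THREE FAILURES THEN ESCALATE to a human operator. "
    "If the caller is unintelligible, politely ask them to repeat themselves."
)

async def main():
    # additional_headers is named extra_headers in older websockets releases
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": INSTRUCTIONS},
        }))
        event = json.loads(await ws.recv())
        print(event["type"])  # expect session.created, then session.updated

asyncio.run(main())
```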
OpenAI list several key improvements to gpt-realtime including the ability to configure it with a list of MCP servers, "better instruction following" and the ability to send it images.
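The announcement doesn't spell out the MCP session schema here, but as a sketch it might look something like the following. The field names (server_label, server_url, require_approval) are borrowed from the MCP tool configuration in OpenAI's Responses API and should be treated as assumptions rather than the confirmed gpt-realtime schema:

```python
# Hypothetical session configuration attaching a remote MCP server to a
# Realtime session. Field names mirror the MCP tool shape in OpenAI's
# Responses API; treat them as assumptions, not the confirmed schema.
session_config = {
    "type": "session.update",
    "session": {
        "instructions": "You are a voice agent that can look things up via MCP.",
        "tools": [
            {
                "type": "mcp",
                "server_label": "deepwiki",                    # hypothetical label
                "server_url": "https://mcp.deepwiki.com/mcp",  # example public MCP server
                "require_approval": "never",
            }
        ],
    },
}
```

This would be sent over the same WebSocket connection as the instructions example above.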
My biggest confusion came from the pricing page, which lists separate pricing for using the Realtime API with gpt-realtime and GPT-4o mini. This suggests to me that the old gpt-4o-mini-realtime-preview model is still available, despite it no longer being listed on the OpenAI models page.
gpt-4o-mini-realtime-preview is a lot cheaper:
| Model | Token Type | Input ($/1M) | Cached Input ($/1M) | Output ($/1M) |
|---|---|---|---|---|
| gpt-realtime | Text | $4.00 | $0.40 | $16.00 |
| gpt-realtime | Audio | $32.00 | $0.40 | $64.00 |
| gpt-realtime | Image | $5.00 | $0.50 | – |
| gpt-4o-mini-realtime-preview | Text | $0.60 | $0.30 | $2.40 |
| gpt-4o-mini-realtime-preview | Audio | $10.00 | $0.30 | $20.00 |
The mini model also has a much longer 128,000 token context window.
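Using the per-million-token prices from the table, here's a quick back-of-the-envelope comparison for a hypothetical audio-heavy session; the token counts are invented purely for illustration:

```python
# Rough cost comparison based on the pricing table above (per 1M tokens).
# The session token counts below are made-up illustrative numbers.
PRICES = {  # (audio input, cached input, audio output) in $ per 1M tokens
    "gpt-realtime": (32.00, 0.40, 64.00),
    "gpt-4o-mini-realtime-preview": (10.00, 0.30, 20.00),
}

audio_in, cached_in, audio_out = 20_000, 0, 10_000  # hypothetical session

for model, (p_in, p_cached, p_out) in PRICES.items():
    cost = (audio_in * p_in + cached_in * p_cached + audio_out * p_out) / 1_000_000
    print(f"{model}: ${cost:.2f}")
# gpt-realtime: $1.28
# gpt-4o-mini-realtime-preview: $0.40
```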
Tags: audio, realtime, ai, openai, generative-ai, llms, llm-pricing, multi-modal-output, llm-release
AI Summary and Description: Yes
Summary: The text introduces OpenAI’s new speech-to-speech model, gpt-realtime, and outlines its key features and pricing details while clarifying its position relative to previous models like GPT-4o. This advancement in AI technology has implications for professionals in AI and cloud computing, particularly in the realms of model efficiency and multi-modal capabilities.
Detailed Description:
The release of gpt-realtime by OpenAI, touted as the “most advanced speech-to-speech model,” represents a significant advancement in generative AI capabilities. The introduction of this model is noteworthy for security and compliance professionals who are following developments in AI technology, particularly its implications concerning security, efficiency, and usage models.
Key points include:
– **Model Transition**: gpt-realtime replaces the earlier gpt-4o-realtime-preview model. There is some ambiguity about this transition: despite being a new model, gpt-realtime retains the October 2023 training cut-off of its GPT-4o-based predecessor rather than GPT-5’s September 2024 cut-off.
– **Context Tokens and Outputs**: gpt-realtime maintains the same limitations (32,000 context tokens and 4,096 maximum output tokens) as its predecessor, indicating a focus on refining performance rather than significantly expanding capabilities.
– **Performance Insights**: The documentation for gpt-realtime notes that its performance can degrade under ambiguity, a challenge noted to be similar to that of GPT-5, which hints at ongoing performance optimization.
– **Usage Tips**:
– Small prompt adjustments can have a notable impact on model behavior. For example, minor wording changes can drastically change how the model handles noisy input, highlighting the importance of precise prompting.
– There are suggestions to convert non-text rules to clearly articulated text to enhance model responsiveness.
– **MCP Server Configuration**: The ability to configure the model with a list of Model Context Protocol (MCP) servers indicates adaptability and potential integration with existing tools and architectures.
– **Pricing Structure**: A compelling aspect surrounding gpt-realtime is the pricing difference from the gpt-4o-mini-realtime-preview model. The pricing structures highlight a clear distinction in cost associated with text, audio, and image processing. The new model is significantly more expensive than its predecessor, suggesting a premium on its advanced capabilities.
Implications for professionals in security and compliance include:
– Monitoring the evolving capabilities of AI models like gpt-realtime is essential for safeguarding sensitive information as AI integration expands.
– Understanding pricing models and resource allocation when deploying these tools can effectively influence IT budgeting and financial planning for companies investing in AI technology.
– The ability to handle multi-modal inputs (text, audio, image) raises considerations for data compliance and security measures that must adapt to varied input modalities.
The gpt-realtime model illustrates ongoing advancements in generative AI and highlights critical areas for professionals in the fields of AI, cloud computing, and security to explore.