Source URL: https://simonwillison.net/2025/Sep/1/introducing-gpt-realtime/#atom-everything
Source: Simon Willison’s Weblog
Title: Introducing gpt-realtime
Feedly Summary: Introducing gpt-realtime
Released a few days ago (August 28th), gpt-realtime is OpenAI’s new “most advanced speech-to-speech model”. It looks like this is a replacement for the older gpt-4o-realtime-preview model that was released last October.
This is a slightly confusing release. The previous realtime model was clearly described as a variant of GPT-4o, sharing the same October 2023 training cut-off date as that model.
I had expected that gpt-realtime might be a GPT-5 relative, but its training cut-off is still October 2023, whereas GPT-5’s is September 2024.
gpt-realtime also shares the relatively low limits of gpt-4o-realtime-preview: a 32,000 token context window and a 4,096 token maximum output.
The only reference I found to GPT-5 in the documentation for the new model was a note saying "Ambiguity and conflicting instructions degrade performance, similar to GPT-5."
The usage tips for gpt-realtime have a few surprises:
- **Iterate relentlessly.** Small wording changes can make or break behavior.
  Example: Swapping “inaudible” → “unintelligible” improved noisy input handling. […]
- **Convert non-text rules to text:** The model responds better to clearly written text.
  Example: Instead of writing, "IF x > 3 THEN ESCALATE", write, "IF MORE THAN THREE FAILURES THEN ESCALATE."
There are a whole lot more prompting tips in the new Realtime Prompting Guide.
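To make those tips concrete, here's a minimal sketch of passing plainly written instructions to a Realtime session over WebSocket. The endpoint, headers and `session.update` event shape follow the earlier gpt-4o-realtime-preview API; whether gpt-realtime's GA API uses exactly the same shape is an assumption on my part, not something confirmed in the announcement.

```python
# Minimal sketch: configure a Realtime session with plainly written rules.
# URL, headers and event shape follow the gpt-4o-realtime-preview WebSocket
# API; gpt-realtime's GA endpoint may differ.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

# Per the prompting tips: spell out rules as clear text rather than
# pseudo-code, and prefer wording that has tested well in noisy audio
# ("unintelligible" over "inaudible").
INSTRUCTIONS = (
    "You are a phone support agent. "
    "IF MORE THAN THREE FAILURES THEN ESCALATE to a human operator. "
    "If the caller is unintelligible, politely ask them to repeat themselves."
)

async def main():
    # additional_headers is named extra_headers in older websockets releases
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": INSTRUCTIONS},
        }))
        event = json.loads(await ws.recv())
        print(event["type"])  # expect session.created, then session.updated

asyncio.run(main())
```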
OpenAI list several key improvements to gpt-realtime including the ability to configure it with a list of MCP servers, "better instruction following" and the ability to send it images.
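The announcement doesn't spell out the MCP session schema here, but as a sketch it might look something like the following. The field names (server_label, server_url, require_approval) are borrowed from the MCP tool configuration in OpenAI's Responses API and should be treated as assumptions rather than the confirmed gpt-realtime schema:

```python
# Hypothetical session configuration attaching a remote MCP server to a
# Realtime session. Field names mirror the MCP tool shape in OpenAI's
# Responses API; treat them as assumptions, not the confirmed schema.
session_config = {
    "type": "session.update",
    "session": {
        "instructions": "You are a voice agent that can look things up via MCP.",
        "tools": [
            {
                "type": "mcp",
                "server_label": "deepwiki",                    # hypothetical label
                "server_url": "https://mcp.deepwiki.com/mcp",  # example public MCP server
                "require_approval": "never",
            }
        ],
    },
}
```

This would be sent over the same WebSocket connection as the instructions example above.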
My biggest confusion came from the pricing page, which lists separate pricing for using the Realtime API with gpt-realtime and GPT-4o mini. This suggests to me that the old gpt-4o-mini-realtime-preview model is still available, despite it no longer being listed on the OpenAI models page.
gpt-4o-mini-realtime-preview is a lot cheaper:
| Model | Token Type | Input ($/1M) | Cached Input ($/1M) | Output ($/1M) |
|---|---|---|---|---|
| gpt-realtime | Text | $4.00 | $0.40 | $16.00 |
| gpt-realtime | Audio | $32.00 | $0.40 | $64.00 |
| gpt-realtime | Image | $5.00 | $0.50 | – |
| gpt-4o-mini-realtime-preview | Text | $0.60 | $0.30 | $2.40 |
| gpt-4o-mini-realtime-preview | Audio | $10.00 | $0.30 | $20.00 |
The mini model also has a much longer 128,000 token context window.
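Using the per-million-token prices from the table, here's a quick back-of-the-envelope comparison for a hypothetical audio-heavy session; the token counts are invented purely for illustration:

```python
# Rough cost comparison based on the pricing table above (per 1M tokens).
# The session token counts below are made-up illustrative numbers.
PRICES = {  # (audio input, cached input, audio output) in $ per 1M tokens
    "gpt-realtime": (32.00, 0.40, 64.00),
    "gpt-4o-mini-realtime-preview": (10.00, 0.30, 20.00),
}

audio_in, cached_in, audio_out = 20_000, 0, 10_000  # hypothetical session

for model, (p_in, p_cached, p_out) in PRICES.items():
    cost = (audio_in * p_in + cached_in * p_cached + audio_out * p_out) / 1_000_000
    print(f"{model}: ${cost:.2f}")
# gpt-realtime: $1.28
# gpt-4o-mini-realtime-preview: $0.40
```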
Tags: audio, realtime, ai, openai, generative-ai, llms, llm-pricing, multi-modal-output, llm-release
AI Summary and Description: Yes
Summary: The text introduces OpenAI’s new speech-to-speech model, gpt-realtime, and outlines its key features and pricing details while clarifying its position relative to previous models like GPT-4o. This advancement in AI technology has implications for professionals in AI and cloud computing, particularly in the realms of model efficiency and multi-modal capabilities.
Detailed Description:
The release of gpt-realtime by OpenAI, touted as the “most advanced speech-to-speech model,” represents a significant advancement in generative AI capabilities. The introduction of this model is noteworthy for security and compliance professionals who are following developments in AI technology, particularly its implications concerning security, efficiency, and usage models.
Key points include:
– **Model Transition**: gpt-realtime replaces the earlier gpt-4o-realtime-preview model. There is some ambiguity about this transition: despite being a new model, gpt-realtime retains the October 2023 training cut-off of its GPT-4o-based predecessor rather than GPT-5’s September 2024 cut-off.
– **Context Tokens and Outputs**: gpt-realtime maintains the same limitations (32,000 context tokens and 4,096 maximum output tokens) as its predecessor, indicating a focus on refining performance rather than significantly expanding capabilities.
– **Performance Insights**: The documentation for gpt-realtime notes that its performance can degrade under ambiguity, a challenge noted to be similar to that of GPT-5, which hints at ongoing performance optimization.
– **Usage Tips**:
– Small prompt adjustments can have a notable impact on model behavior. For example, minor wording changes can drastically change how the model handles noisy input, highlighting the importance of precise prompting.
– There are suggestions to convert non-text rules to clearly articulated text to enhance model responsiveness.
– **MCP Server Configuration**: The ability to configure the model with a list of Model Context Protocol (MCP) servers indicates adaptability and potential integration with existing tools and architectures.
– **Pricing Structure**: A compelling aspect surrounding gpt-realtime is the pricing difference from the gpt-4o-mini-realtime-preview model. The pricing structures highlight a clear distinction in cost associated with text, audio, and image processing. The new model is significantly more expensive than its predecessor, suggesting a premium on its advanced capabilities.
Implications for professionals in security and compliance include:
– Monitoring the evolving capabilities of AI models like gpt-realtime is essential for safeguarding sensitive information as AI integration expands.
– Understanding pricing models and resource allocation when deploying these tools can effectively influence IT budgeting and financial planning for companies investing in AI technology.
– The ability to handle multi-modal inputs (text, audio, image) raises considerations for data compliance and security measures that must adapt to varied input modalities.
The gpt-realtime model illustrates ongoing advancements in generative AI and highlights critical areas for professionals in the fields of AI, cloud computing, and security to explore.