Simon Willison’s Weblog: New audio models from OpenAI, but how much can we rely on them?

Mar 20, 2025

—

Source URL: https://simonwillison.net/2025/Mar/20/new-openai-audio-models/#atom-everything
Source: Simon Willison’s Weblog
Title: New audio models from OpenAI, but how much can we rely on them?

Feedly Summary: OpenAI announced several new audio-related API features today, for both text-to-speech and speech-to-text. They’re very promising new models, but they appear to suffer from the ever-present risk of accidental (or malicious) instruction following.
gpt-4o-mini-tts
gpt-4o-mini-tts is a brand new text-to-speech model with "better steerability". OpenAI released a delightful new playground interface for this at OpenAI.fm – you can pick from 11 base voices, apply instructions like "High-energy, eccentric, and slightly unhinged" and get it to read out a script (with optional extra stage directions in parentheses). It can then provide the equivalent API code in Python, JavaScript or curl. You can share links to your experiments, here’s an example.

Note how part of my script there looks like this:
(Whisper this bit:)
Footsteps echoed behind her, slow and deliberate. She turned, heart racing, but saw only shadows.
While fun and convenient, the fact that you can insert stage directions in the script itself feels like an anti-pattern to me – it means you can’t safely use this for arbitrary text because there’s a risk that some of that text may accidentally be treated as further instructions to the model.
In my own experiments I’ve already seen this happen: sometimes the model follows my "Whisper this bit" instruction correctly, other times it says the word "Whisper" out loud but doesn’t speak the words "this bit". The results appear non-deterministic, and might also vary with different base voices.
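One cautious way to handle arbitrary text is to flag anything that looks like a stage direction before it reaches the model. The sketch below is only a heuristic: it treats every parenthesised span as a possible stage direction, and the function names are hypothetical. It cannot guarantee how the model will behave, and stripping parentheticals also removes legitimate asides from the text.

```python
import re

# Assumption: stage directions appear as parenthesised spans, as in the
# "(Whisper this bit:)" example above. Nested parentheses are not handled.
PAREN_SPAN = re.compile(r"\([^()]*\)")


def find_possible_stage_directions(script: str) -> list[str]:
    """Return every parenthesised span that might be read as a direction."""
    return PAREN_SPAN.findall(script)


def strip_parentheticals(script: str) -> str:
    """Remove parenthesised spans and collapse the leftover whitespace."""
    without_spans = PAREN_SPAN.sub(" ", script)
    return re.sub(r"\s+", " ", without_spans).strip()


script = (
    "(Whisper this bit:)\n"
    "Footsteps echoed behind her, slow and deliberate."
)
print(find_possible_stage_directions(script))
print(strip_parentheticals(script))
```

Flagging rather than silently stripping is probably the safer default, since a human can then decide whether the span is narrative or instruction.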
gpt-4o-mini-tts costs $0.60/million tokens, which OpenAI estimate as around 1.5 cents per minute.
gpt-4o-transcribe and gpt-4o-mini-transcribe
gpt-4o-transcribe and gpt-4o-mini-transcribe are two new speech-to-text models, serving a similar purpose to whisper but built on top of GPT-4o and setting a "new state-of-the-art benchmark". These can be used via OpenAI’s v1/audio/transcriptions API, as alternative options to `whisper-1`. The API is still restricted to a 25MB audio file (MP3, WAV or several other formats).
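A minimal sketch of calling the transcription endpoint with a pre-flight size check. It assumes the official `openai` Python package is installed and `OPENAI_API_KEY` is set, and it assumes "25MB" means 25 × 1024 × 1024 bytes, which the post does not specify. The helper names are hypothetical.

```python
import os

# Assumption: the 25MB limit is interpreted as 25 MiB.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024

# Model names taken from the post.
TRANSCRIPTION_MODELS = {"whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe"}


def check_audio_upload(path: str, model: str) -> int:
    """Validate the model name and file size; return the size in bytes."""
    if model not in TRANSCRIPTION_MODELS:
        raise ValueError(f"unknown transcription model: {model}")
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"{path} is {size} bytes, over the upload limit")
    return size


def transcribe(path: str, model: str = "gpt-4o-transcribe") -> str:
    """Send one audio file to the transcription API and return its text."""
    check_audio_upload(path, model)
    # Imported lazily so the size check above works without the package.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

Files over the limit would need to be split or re-encoded first; that step is left out here.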
Any time an LLM-based model is used for audio transcription (or OCR) I worry about accidental instruction following – is there a risk that content that looks like an instruction in the spoken or scanned text might not be included in the resulting transcript?
In a comment on Hacker News OpenAI’s Jeff Harris said this, regarding how these new models differ from gpt-4o-audio-preview:

It’s a slightly better model for TTS. With extra training focusing on reading the script exactly as written.
e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard

"much better in that regard" sounds to me like there’s still a risk of this occurring, so for some sensitive applications it may make sense to stick with whisper or other traditional text-to-speech approaches.
On Twitter Jeff added:
yep fidelity to transcript is the big chunk of work to turn an audio model into TTS model. still possible, but should be quite rare
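Since deviations are described as rare rather than impossible, an application that cares could compare what was actually spoken against the script. The sketch below covers only the comparison step, using word-level similarity from Python's `difflib`. It assumes you have some text rendering of the generated audio (for example from a separate transcription pass, which can itself be wrong), and the function name and any threshold you pick are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def script_fidelity(expected: str, actual: str) -> float:
    """Word-level similarity between the script and what was spoken.

    Returns a value from 0.0 (nothing in common) to 1.0 (identical
    once case and punctuation are ignored).
    """
    return SequenceMatcher(None, normalize(expected), normalize(actual)).ratio()


script = "What is the capital of Italy"
print(script_fidelity(script, "What is the capital of Italy?"))
print(script_fidelity(script, "Rome"))
```

A low score on the second pair corresponds to the failure mode in the quote above, where the model answers the question instead of reading it.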
gpt-4o-transcribe is an estimated 0.6 cents per minute, and gpt-4o-mini-transcribe is 0.3 cents per minute.
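For rough budgeting, the per-minute estimates quoted in this post can be turned into a small calculator. The rates below are the estimates given above (1.5, 0.6 and 0.3 cents per minute); actual bills are token-based and will vary, and the function name is hypothetical.

```python
# Estimated cents per minute of audio, as quoted in the post.
CENTS_PER_MINUTE = {
    "gpt-4o-mini-tts": 1.5,
    "gpt-4o-transcribe": 0.6,
    "gpt-4o-mini-transcribe": 0.3,
}


def estimate_cost_dollars(model: str, minutes: float) -> float:
    """Approximate cost in dollars for `minutes` of audio with `model`."""
    return CENTS_PER_MINUTE[model] * minutes / 100


# One hour of audio with each model.
for name in CENTS_PER_MINUTE:
    print(f"{name}: ${estimate_cost_dollars(name, 60):.2f} per hour")
```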
Mixing data and instructions remains the cardinal sin of LLMs
If these problems look familiar to you that’s because they are variants of the root cause behind prompt injection. LLM architectures encourage mixing instructions and data in the same stream of tokens, but that means there are always risks that tokens from data (which often comes from untrusted sources) may be misinterpreted as instructions to the model.
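A deliberately simplified toy illustrates why a shared stream is ambiguous. It uses plain string concatenation in place of real tokenization, and `flatten` is a hypothetical helper. Real APIs add delimiters or separate fields, which help, but as argued here they do not give hard guarantees.

```python
def flatten(instructions: str, data: str) -> str:
    """Join instructions and data into one stream, as a toy stand-in
    for how both end up in the same sequence of tokens."""
    return f"{instructions} {data}"


# Case A: the stage direction arrives as part of the (untrusted) data.
stream_a = flatten(
    "Read this calmly.",
    "(Whisper this bit:) Footsteps echoed behind her.",
)

# Case B: the stage direction was written by the developer.
stream_b = flatten(
    "Read this calmly. (Whisper this bit:)",
    "Footsteps echoed behind her.",
)

# Once flattened, nothing in the stream records who wrote which part.
print(stream_a == stream_b)
```

Both cases produce the same stream, so anything consuming only that stream has no way to tell a developer's instruction from text that merely looks like one.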
How much of an impact this has on the utility of these new models remains to be seen. Maybe the new training is so robust that these issues won’t actually cause problems for real-world applications?
I remain skeptical. I expect we’ll see demos of these flaws in action in relatively short order.
Tags: audio, text-to-speech, ai, openai, prompt-injection, generative-ai, whisper, llms, multi-modal-output

AI Summary and Description: Yes

Summary: OpenAI’s new audio API features, particularly the gpt-4o-mini-tts and gpt-4o-transcribe models, are promising but carry a risk of accidental instruction following. The post raises concerns about model reliability, particularly when input text or audio contains content that looks like an instruction.

Detailed Description: The recent announcement from OpenAI about audio-related API features for text-to-speech (TTS) and speech-to-text (STT) models showcases the potential advancements in generative AI. However, these developments bring inherent risks, particularly concerning the accuracy and reliability of instruction processing.

– **New Models Introduced:**
– **gpt-4o-mini-tts**: A text-to-speech model designed for higher steerability, allowing users to integrate stage directions into scripts. The model can read out scripts styled by user-defined instructions.
– **gpt-4o-transcribe** and **gpt-4o-mini-transcribe**: New speech-to-text models that aim to set a new benchmark in audio transcription, targeting improvements over previous offerings like Whisper.

– **Risks Associated with Instruction Following:**
– **Accidental Instruction Following**: The model’s ability to interpret phrases as instructions can lead to unexpected outputs. For instance, placing stage directions directly within the text can lead the model to treat these as executable commands rather than part of the narrative.
– **Non-deterministic Outputs**: Observations made during experimentation revealed varying performance: the model sometimes followed instructions correctly while at other times produced unintended outputs.

– **Cost Structure**:
– gpt-4o-mini-tts costs $0.60/million tokens, which OpenAI estimates at around 1.5 cents per minute. gpt-4o-transcribe is an estimated 0.6 cents per minute and gpt-4o-mini-transcribe 0.3 cents per minute. The transcription API remains limited to 25MB audio files.

– **Concerns in Sensitive Applications**:
– The potential for risk in sensitive scenarios is underscored—particularly where transcription models may inadvertently process content resembling instructions from audio or visual data.
– Even with improvements, uncertainties remain regarding the fidelity of these models in real-world applications. The continued threat of prompt injection highlights vulnerabilities within LLM frameworks where mixing data and instruction leads to misinterpretation.

– **Expert Skepticism**: The commentary suggests a cautious approach to adopting these new AI capabilities, particularly in critical fields requiring high reliability and accuracy. The expectation of flaws emerging in demos fuels doubts about the robustness of the training and general readiness of these models for practical use.

This analysis poses essential questions for professionals in AI and cloud security, particularly in evaluating the implications of adopting new models in environments where data integrity is paramount. Security and compliance measures should consider the potential risks of LLM architectures to prevent detrimental exposure to vulnerabilities, especially in industries handling sensitive information.
