Simon Willison’s Weblog: New OpenAI feature: Predicted Outputs

Source URL: https://simonwillison.net/2024/Nov/4/predicted-outputs/
Source: Simon Willison’s Weblog
Title: New OpenAI feature: Predicted Outputs

Feedly Summary: New OpenAI feature: Predicted Outputs
Interesting new ability of the OpenAI API – the first time I’ve seen this from any vendor.
If you know your prompt is mostly going to return the same content – you’re requesting an edit to some existing code, for example – you can now send that content as a “prediction” and have GPT-4o or GPT-4o mini use that to accelerate the returned result.
OpenAI’s documentation says:

When providing a prediction, any tokens provided that are not part of the final completion are charged at completion token rates.

I initially misunderstood this as meaning you got a price reduction on top of the latency improvement, but that’s not the case: in the best case it returns faster and you pay nothing beyond the expected cost of the prompt, but the more the output differs from your prediction, the more extra tokens you’ll be billed for.
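A request with a prediction is just a normal chat completion request plus a `prediction` field. A minimal sketch of the request body (the code snippet and prompt here are illustrative, loosely following OpenAI’s documented example; with the `openai` Python client you would pass these same fields as keyword arguments to `client.chat.completions.create`):

```python
# Hypothetical "existing code" that we are asking the model to edit.
existing_code = """class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}"""

request_body = {
    "model": "gpt-4o-mini",
    "messages": [
        {
            "role": "user",
            "content": "Replace the username property with an email property. "
                       "Respond only with code.",
        },
        {"role": "user", "content": existing_code},
    ],
    # The prediction: content we expect the response to mostly match.
    # Tokens that match are "accepted"; tokens that don't are "rejected"
    # and billed at completion-token rates.
    "prediction": {"type": "content", "content": existing_code},
}
```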
I ran the example from the documentation both with and without the prediction and got these results. Without the prediction:
"usage": {
  "prompt_tokens": 150,
  "completion_tokens": 118,
  "total_tokens": 268,
  "completion_tokens_details": {
    "accepted_prediction_tokens": 0,
    "audio_tokens": null,
    "reasoning_tokens": 0,
    "rejected_prediction_tokens": 0
  }
}

That took 5.2 seconds and cost 0.1555 cents.
With the prediction:
"usage": {
  "prompt_tokens": 166,
  "completion_tokens": 226,
  "total_tokens": 392,
  "completion_tokens_details": {
    "accepted_prediction_tokens": 49,
    "audio_tokens": null,
    "reasoning_tokens": 0,
    "rejected_prediction_tokens": 107
  }
}

That took 3.3 seconds and cost 0.2675 cents.
OpenAI’s Steve Coffey confirms:

That’s right! If the prediction is 100% accurate, then you would see no cost difference. When the model diverges from your speculation, we do additional sampling to “discover” the net-new tokens, which is why we charge rejected tokens at completion time rates.
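The reported costs can be sanity-checked arithmetically. Assuming GPT-4o pricing at the time of the post ($2.50 per million input tokens, $10.00 per million output tokens – an assumption, not stated in the post), the two figures above fall out directly; note that the 226 completion tokens in the second run include the 107 rejected prediction tokens, which is why it cost more despite returning faster:

```python
# Assumed GPT-4o rates at the time: $2.50/M input, $10.00/M output.
INPUT_PER_TOKEN = 2.50 / 1_000_000
OUTPUT_PER_TOKEN = 10.00 / 1_000_000

def cost_cents(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in cents; rejected prediction tokens are already counted
    inside completion_tokens, per the usage blocks above."""
    dollars = prompt_tokens * INPUT_PER_TOKEN + completion_tokens * OUTPUT_PER_TOKEN
    return dollars * 100

print(cost_cents(150, 118))  # without prediction: 0.1555 cents
print(cost_cents(166, 226))  # with prediction: 0.2675 cents
```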

Via @OpenAIDevs
Tags: openai, llms, ai, generative-ai, llm-pricing

AI Summary and Description: Yes

Summary: The text discusses a new feature from OpenAI that allows users to provide predicted outputs when utilizing their API, which can speed up response times while maintaining cost efficiency. This is relevant for AI professionals, as it showcases advancements in AI model interactions and pricing strategies.

Detailed Description:
– The text introduces a new functionality in the OpenAI API, called “Predicted Outputs,” which enhances how users can interact with the AI model.
– The feature allows users to suggest expected content based on prior interactions. This is particularly beneficial for repetitive tasks, such as editing existing code.
– By submitting a prediction, users can expect faster responses from GPT-4o or GPT-4o mini.

Key Points:
– **Cost Structure**: Users are only charged based on actual completion tokens—resulting from the model’s response—while prediction tokens incur no additional fees if no divergence occurs. However, if the model generates different content than predicted, additional completion tokens are charged for those “rejected” tokens.
– **Performance Metrics**:
  – **Without Prediction**:
    – **Prompt Tokens**: 150
    – **Completion Tokens**: 118
    – **Total Tokens**: 268
    – **Time**: 5.2 seconds
    – **Cost**: 0.1555 cents
  – **With Prediction**:
    – **Prompt Tokens**: 166
    – **Completion Tokens**: 226
    – **Total Tokens**: 392
    – **Time**: 3.3 seconds
    – **Cost**: 0.2675 cents
– **Confirmation from OpenAI**: Steve Coffey clarifies that if the prediction aligns perfectly with the model’s output, there is no cost difference, but any divergence will lead to additional charges.

This advancement in AI technology not only enhances user experience by reducing latency but also provides insights into efficient pricing mechanisms that can benefit developers and organizations making use of AI tools. Understanding these changes is essential for security and compliance professionals who work with AI-based systems, as it impacts planning and budgeting for AI-powered projects.