Simon Willison’s Weblog: OpenAI’s new open weight (Apache 2) models are really good

Source URL: https://simonwillison.net/2025/Aug/5/gpt-oss/
Source: Simon Willison’s Weblog
Title: OpenAI’s new open weight (Apache 2) models are really good

Feedly Summary: The long promised OpenAI open weight models are here, and they are very impressive. They’re available under proper open source licenses – Apache 2.0 – and come in two sizes, 120B and 20B.
OpenAI’s own benchmarks are eyebrow-raising – emphasis mine:

The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The gpt-oss-20b model delivers similar results to OpenAI o3‑mini on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure.

o4-mini and o3-mini are really good proprietary models – I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes. That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM.
Both models are mixture-of-experts:

gpt-oss-120b activates 5.1B parameters per token, while gpt-oss-20b activates 3.6B. The models have 117b and 21b total parameters respectively.
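As a rough sanity check on why the 20B model is so comfortable on a laptop, here's a back-of-envelope calculation. It assumes roughly 4.25 bits per parameter for the MXFP4-quantized weights the models ship with – that quantization detail comes from OpenAI's release material rather than anything quoted above, so treat it as my assumption:

# Back-of-envelope memory estimate for gpt-oss-20b.
# Assumes ~4.25 bits/parameter (MXFP4-style quantization) across all 21B
# parameters - a simplification, since some layers are stored at higher precision.
total_params = 21e9
bits_per_param = 4.25
print(f"{total_params * bits_per_param / 8 / 1e9:.1f} GB")  # ~11.2 GB

That lines up with the memory usage I saw when running it locally (more on that below).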

Something that surprised me even more about the benchmarks was the scores for general knowledge challenges. I can just about believe they managed to train a strong reasoning model that fits in 20B parameters, but these models score highly on benchmarks like “GPQA Diamond (without tools) PhD-level science questions” too:

o3 — 83.3%
o4-mini — 81.4%
gpt-oss-120b — 80.1%
o3-mini — 77%
gpt-oss-20b — 71.5%

A lot of these benchmarks are edging towards saturated.
Running gpt-oss-20b on my Mac with LM Studio
There are already a bunch of different ways to run these models – OpenAI partnered with numerous organizations in advance of the release.
I decided to start with LM Studio.
I had to update to the most recent version of the app, then install the new model from their openai/gpt-oss-20b page.
First impressions: this is a really good model, and it somehow runs using just 11.72GB of my system RAM.
The model supports three reasoning efforts: low, medium and high. LM Studio makes those available via a dropdown.
Let’s try "Generate an SVG of a pelican riding a bicycle":
Pelican on reasoning=low
I started with low. It thought for 0.07 seconds and then output this (at 39 tokens a second):

Except… it output invalid SVG. One of the path elements looked like this:

<path d="
M150,250 <!-- rear wheel center -->
L300,120 <!-- top tube to front -->
L450,250 <!-- chain stays back to front -->
L300,350 <!-- seat stays down -->
Z"
fill="#e0e0e0" stroke="#555" stroke-width="4"/>

But you can’t put comments inside attributes like that. I fixed this to get the above image.
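Moving the comments outside the attribute is enough to make it valid – something like this (a sketch of the kind of fix required, rather than the exact edit):

<!-- bicycle frame: rear wheel, top tube, chain stays, seat stays -->
<path d="M150,250 L300,120 L450,250 L300,350 Z"
      fill="#e0e0e0" stroke="#555" stroke-width="4"/>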
Pelican on reasoning=medium
I tried again on medium. This time it thought for 4.44 seconds and gave me this, at 55 tokens/second:

Pelican on reasoning=high
And then on high. My first attempt here failed because it hit the default LM Studio 4096 context limit – so I reloaded the model and set the new limit to the maximum possible.
It jumped up to using 17.35GB RAM (I imagine it might use a lot more with a longer prompt) and after thinking for 5 minutes and 50 seconds (!) gave me this:

It thought really hard to get there. Here’s a snippet of that transcript:

But maybe we want seat below body bottom? Actually seat should be between body and pedals, not above body bottom. Wait, typical bicycle: seat post goes from bottom bracket upward to the seat. The seat sits on top of the seat tube (vertical). So the seat is above bottom bracket. In our representation, the seat is at y=250-255. Body center at 245, body bottom at 265. So body bottom is below seat. That means the pelican’s body is below the seat? Actually no: body bottom extends to 265; seat top is at 250; so seat sits above body bottom by about 15 units. That seems plausible because seat should be slightly above body, but body bottom is at 265 > seat top 255. […]

I’ve seen this from other local reasoning models too: they can really overthink this kind of problem! I don’t know if there’s any pelican SVG that’s worth waiting nearly six minutes for.
Space invaders with gpt-oss-20b
Given how long high took I switched back to medium for my next experiment:

Write an HTML and JavaScript page implementing space invaders

It thought for 10.78 seconds and produced this:

You can play that here.
It’s not the best I’ve seen – I was more impressed by GLM 4.5 Air – but it’s very competent for a model that only uses 12GB of my RAM (GLM 4.5 Air used 47GB).
Trying gpt-oss-120b via API providers
I don’t quite have the resources on my laptop to run the larger model. Thankfully it’s already being hosted by a number of different API providers.
OpenRouter already lists three – Fireworks, Groq and Cerebras.
Cerebras is fast, so I decided to try them first.
I installed the llm-cerebras plugin and ran the refresh command to ensure it had their latest models:
llm install -U llm-cerebras jsonschema
llm cerebras refresh
(Installing jsonschema worked around a warning message.)
Output:
Refreshed 10 Cerebras models:
- cerebras-deepseek-r1-distill-llama-70b
- cerebras-gpt-oss-120b
- cerebras-llama-3.3-70b
- cerebras-llama-4-maverick-17b-128e-instruct
- cerebras-llama-4-scout-17b-16e-instruct
- cerebras-llama3.1-8b
- cerebras-qwen-3-235b-a22b-instruct-2507
- cerebras-qwen-3-235b-a22b-thinking-2507
- cerebras-qwen-3-32b
- cerebras-qwen-3-coder-480b

Now:
llm -m cerebras-gpt-oss-120b \
  'Generate an SVG of a pelican riding a bicycle'
Cerebras runs the new model at between 2,000 and 4,000 tokens per second!
To my surprise this one had the same comments-in-attributes bug that we saw with oss-20b earlier. I fixed those and got this pelican:

That bug appears to be intermittent – it didn't show up in some of my other runs of the same prompt.
The llm-openrouter plugin also provides access to the models, balanced across the underlying providers. You can use that like so:
llm install llm-openrouter
llm keys set openrouter
# Paste API key here
llm -m openrouter/openai/gpt-oss-120b "Say hi"
llama.cpp is coming very shortly
The llama.cpp pull request for gpt-oss landed less than an hour ago. It’s worth browsing through the code – a lot of work went into supporting this new model, spanning 48 commits to 83 different files. Hopefully this will land in the llama.cpp Homebrew package within the next day or so, which should provide a convenient way to run the model via llama-server and friends.
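Once that happens, I expect something along these lines to work – this is a guess at the eventual incantation, since the Homebrew formula hasn't updated yet and the GGUF filename here is made up:

# Hypothetical: assumes the updated formula and a local GGUF conversion of the model
brew upgrade llama.cpp
llama-server -m gpt-oss-20b.gguf --port 8080

That would give you a local OpenAI-compatible server to point other tools at.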
gpt-oss:20b in Ollama
Ollama also have gpt-oss, requiring an update to their app.
I fetched that 14GB model like this:
ollama pull gpt-oss:20b
Now I can use it with the new Ollama native app, or access it from LLM like this:
llm install llm-ollama
llm -m gpt-oss:20b 'Hi'
This also appears to use around 13.26GB of system memory while running a prompt.
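Ollama also exposes its standard local HTTP API, so you can hit the model directly with curl – this is the stock Ollama endpoint, nothing gpt-oss specific:

# Query the locally running Ollama server (default port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Hi",
  "stream": false
}'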
OpenAI Harmony, a new format for prompt templates
One of the gnarliest parts of implementing harnesses for LLMs is handling the prompt template format.
Modern prompts are complicated beasts. They need to model user vs. assistant conversation turns, tool calls, reasoning traces, and an increasing number of other complex patterns.
openai/harmony is a brand new open source project from OpenAI (again, Apache 2) which implements a new response format that was created for the gpt-oss models. It’s clearly inspired by their new-ish Responses API.
The format is described in the new OpenAI Harmony Response Format cookbook document. It introduces some concepts that I’ve not seen in open weight models before:

system, developer, user, assistant and tool roles – many other models only use user and assistant, and sometimes system and tool.
Three different channels for output: final, analysis and commentary. Only the final channel is intended to be visible to users by default. analysis is for chain of thought and commentary is sometimes used for tools.

That channels concept has been present in ChatGPT for a few months, starting with the release of o3.
The details of the new tokens used by Harmony caught my eye:

Token           Purpose                    ID
<|start|>       Start of message header    200006
<|end|>         End of message             200007
<|message|>     Start of message content   200008
<|channel|>     Start of channel info      200005
<|constrain|>   Data type for tool call    200003
<|return|>      Stop after response        200002
<|call|>        Call a tool                200012

Those token IDs are particularly important. They are part of a new token vocabulary called o200k_harmony, which landed in OpenAI’s tiktoken tokenizer library this morning.
In the past I’ve seen models get confused by special tokens – try pasting <|end|> into a model and see what happens.
Having these special instruction tokens formally map to dedicated token IDs should hopefully be a whole lot more robust!
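To make that concrete, here's my rough sketch of what a rendered Harmony conversation looks like, pieced together from the cookbook document – the exact contents of the system header vary, so treat this as illustrative rather than definitive:

<|start|>system<|message|>...model identity, reasoning effort, valid channels...<|end|>
<|start|>developer<|message|># Instructions

Talk like a pirate!<|end|>
<|start|>user<|message|>Arrr, how be you?<|end|>
<|start|>assistant<|channel|>analysis<|message|>...chain of thought goes here...<|end|>
<|start|>assistant<|channel|>final<|message|>Ahoy! I be doin' just fine.<|return|>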
The Harmony repo itself includes a Rust library and a Python library (wrapping that Rust library) for working with the new format in a much more ergonomic way.
I tried one of their demos using uv run to turn it into a shell one-liner:
uv run --python 3.12 --with openai-harmony python -c '
from openai_harmony import *
from openai_harmony import DeveloperContent

enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
convo = Conversation.from_messages([
    Message.from_role_and_content(
        Role.SYSTEM,
        SystemContent.new(),
    ),
    Message.from_role_and_content(
        Role.DEVELOPER,
        DeveloperContent.new().with_instructions("Talk like a pirate!")
    ),
    Message.from_role_and_content(Role.USER, "Arrr, how be you?"),
])
tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
print(tokens)'
Which outputs:

[200006, 17360, 200008, 3575, 553, 17554, 162016, 11, 261, 4410, 6439, 2359, 22203, 656, 7788, 17527, 558, 87447, 100594, 25, 220, 1323, 19, 12, 3218, 279, 30377, 289, 25, 14093, 279, 2, 13888, 18403, 25, 8450, 11, 49159, 11, 1721, 13, 21030, 2804, 413, 7360, 395, 1753, 3176, 13, 200007, 200006, 77944, 200008, 2, 68406, 279, 37992, 1299, 261, 96063, 0, 200007, 200006, 1428, 200008, 8977, 81, 11, 1495, 413, 481, 30, 200007, 200006, 173781]

Note those token IDs like 200006 corresponding to the special tokens listed above.
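If you have tiktoken installed you should be able to round-trip a chunk of those IDs back to readable text – this assumes the new vocabulary is exposed through tiktoken's standard get_encoding() call under the o200k_harmony name, which I haven't verified myself:

import tiktoken

# Assumes a tiktoken release that includes the new o200k_harmony vocabulary
enc = tiktoken.get_encoding("o200k_harmony")

# The special tokens (200006, 200008, 200007) should decode back into the
# <|start|>/<|message|>/<|end|> markers wrapped around the user turn
print(enc.decode([200006, 1428, 200008, 8977, 81, 11, 1495, 413, 481, 30, 200007]))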
The open question for me: how good is tool calling?
There’s one aspect of these models that I haven’t explored in detail yet: tool calling. How tools work is clearly a big part of the new Harmony format, but the packages I’m using for my own LLM tool calling support need various tweaks and fixes before they work with that new mechanism.
Tool calling currently represents my biggest disappointment with local models that I’ve run on my own machine. I’ve been able to get them to perform simple single calls, but the state of the art these days is wildly more ambitious than that.
Systems like Claude Code can make dozens if not hundreds of tool calls over the course of a single session, each one adding more context and information to a single conversation with an underlying model.
My experience to date has been that local models are unable to handle these lengthy conversations. I’m not sure if that’s inherent to the limitations of my own machine, or if it’s something that the right model architecture and training could overcome.
OpenAI make big claims about the tool calling capabilities of these new models. I’m looking forward to seeing how well they perform in practice.
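Once the plugins catch up, the simplest possible test will look something like this – LLM's --functions option exposing a single Python function as a tool (illustrative only; I haven't confirmed this particular invocation against gpt-oss yet):

llm -m gpt-oss:20b --td --functions '
def multiply(x: int, y: int) -> int:
    """Multiply two numbers together."""
    return x * y
' 'What is 34234 times 213345?'

The --td flag dumps tool call debug output, so you can see whether the model actually invoked the function rather than guessing at the arithmetic.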
Competing with the Chinese open models
I’ve been writing a lot about the flurry of excellent open weight models released by Chinese AI labs over the past few months – all of them very capable and most of them under Apache 2 or MIT licenses.
Just last week I said:

Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.
I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively smoked them over the course of July. […]
I can’t help but wonder if part of the reason for the delay in release of OpenAI’s open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.

With the release of the gpt-oss models that statement no longer holds true. I’m waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it’s likely that OpenAI now offer the best available open weights models.
Tags: open-source, ai, openai, generative-ai, local-llms, llms, llm, llm-tool-use, cerebras, ollama, pelican-riding-a-bicycle, llm-reasoning, llm-release, lm-studio, space-invaders

AI Summary and Description: Yes

**Summary:** The text provides an in-depth look at the recently released open weight models from OpenAI (gpt-oss-120b and gpt-oss-20b), illustrating their capabilities, performance benchmarks, and insights into user experiences with running the models on various platforms. Of particular interest to security and compliance professionals is the mention of model robustness, potential bugs, open-source licensing, and implications for local and edge deployment scenarios.

**Detailed Description:**
The text discusses the release of OpenAI’s open weight models, which are significant contributions in the field of AI, especially given their competitive performance compared to existing proprietary models. These models have been made available under the Apache 2.0 license, fostering a collaborative environment for developers. Key points from the text include:

– **Model Specifications:**
– gpt-oss-120b contains 117 billion parameters and operates efficiently on high-resource systems, requiring an 80 GB GPU.
– gpt-oss-20b, designed for edge and local inference, can perform tasks on just 16 GB of memory.

– **Performance Benchmarks:**
– gpt-oss-120b demonstrates close performance to proprietary models (o4-mini) on core reasoning benchmarks.
– General knowledge challenges showed impressive scores, with gpt-oss-120b achieving 80.1%.

– **User Experiences:**
– Initial trials with LM Studio revealed both successes and limitations, particularly with the model’s reasoning capabilities and output quality.
– Bugs in output, such as improper SVG generation, highlighted the nuances involved in managing model prompts and outputs.

– **Integration and Deployment:**
– The gpt-oss models can be accessed via various platforms and plugins, reflecting a growing ecosystem for running large language models without necessitating substantial infrastructure.

– **OpenAI Harmony:**
– Introduction of the new OpenAI Harmony format for prompt templates aims to enhance response generation, implementing advanced channeling concepts that could improve interactions with models.

– **Comparative Analysis:**
– The text positions OpenAI’s open weight models favorably against recent Chinese models, noting a shift in the competitive landscape of open-source AI technologies.

– **Tool Calling and Model Limitations:**
– An exploration of tool calling capabilities points out challenges and potential limitations in local models compared to more advanced frameworks in proprietary systems.

In conclusion, the release of these open weight models invites substantial attention and presents numerous implications for security and compliance professionals concerning model robustness, deployment strategies, and the broader open-source AI landscape. As organizations begin to adopt these models, considerations around how they integrate with existing security frameworks and data governance practices will be critical.