Source URL: https://simonwillison.net/2025/Aug/15/inconsistent-performance/
Source: Simon Willison’s Weblog
Title: Open weight LLMs exhibit inconsistent performance across providers
Feedly Summary: Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model – OpenAI’s gpt-oss-120b – performs across different hosted providers.
The results showed some surprising differences. Here’s the one with the greatest variance, a run of the 2025 AIME (American Invitational Mathematics Examination) averaging 32 runs against each model, using gpt-oss-120b with a reasoning effort of “high”:
These are some varied results!
93.3%: Cerebras, Nebius Base, Fireworks, Deepinfra, Novita, Together.ai, vLLM 0.10.1
90.0%: Parasail
86.7%: Groq
83.3%: Amazon
80.0%: Azure
36.7%: CompactifAI
It looks like most of the providers that scored 93.3% were running models using the latest vLLM (with the exception of Cerebras who I believe have their own custom serving stack).
Microsoft Azure’s Lucas Pickup confirmed that Azure’s 80% score was caused by running an older vLLM, now fixed:
This is exactly it, it’s been fixed as of yesterday afternoon across all serving instances (of the hosted 120b service). Old vLLM commits that didn’t respect reasoning_effort, so all requests defaulted to medium.
No news yet on what went wrong with the AWS Bedrock version.
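To make that quote concrete, here is a rough sketch of a request that explicitly asks for high reasoning effort from an OpenAI-compatible endpoint. The base_url and model name are placeholders, and passing reasoning_effort via extra_body is just one way to get the field into the request body; whether it is honoured is entirely up to the serving stack.

```python
# Sketch: explicitly requesting high reasoning effort from an OpenAI-compatible
# endpoint. The base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "How many positive divisors does 360 have?"}],
    # extra_body merges extra fields into the JSON request body; per the quote
    # above, older vLLM builds ignored reasoning_effort and defaulted to medium.
    extra_body={"reasoning_effort": "high"},
)
print(response.choices[0].message.content)
```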
The challenge for customers of open weight models
As a customer of open weight model providers, this really isn’t something I wanted to have to think about!
It’s not really a surprise though. When running models myself I inevitably have to make choices – about which serving framework to use (I’m usually picking between GGUF/llama.cpp and MLX on my own Mac laptop) and which quantization size to use.
I know that quantization has an impact, but it’s difficult for me to quantify that effect.
With hosted models, it looks like even knowing which quantization a provider is running isn’t necessarily enough information to predict the model’s performance.
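As a rough sketch of how I’d even start to measure that locally – assuming llama-cpp-python and two hypothetical GGUF quantizations of the same model – something like this lets you eyeball the same prompt across quantization levels:

```python
# Sketch: run the same prompt against two quantizations of the same model with
# llama-cpp-python. The .gguf filenames are hypothetical placeholders.
from llama_cpp import Llama

PROMPT = "A bag holds 3 red and 5 blue marbles. What is the probability of drawing red?"

for model_path in ["model-Q4_K_M.gguf", "model-Q8_0.gguf"]:
    llm = Llama(model_path=model_path, n_ctx=4096, verbose=False)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
    )
    print(model_path, "->", result["choices"][0]["message"]["content"])
```

A single prompt proves nothing on its own, of course – quantifying the effect properly means something much closer to a full benchmark run.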
I see this situation as a general challenge for open weight models. They tend to be released as an opaque set of model weights plus loose instructions for running them on a single platform – if we are lucky! Most AI labs leave quantization and format conversions to the community and third-party providers.
There’s a lot that can go wrong. Tool calling is particularly vulnerable to these differences – models have been trained on specific tool-calling conventions, and if a provider doesn’t get these exactly right the results can be unpredictable but difficult to diagnose.
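To show the kind of thing that can silently break, here’s a minimal sketch of probing a provider’s tool-calling behaviour through an OpenAI-compatible endpoint. The endpoint URL, model name and get_weather tool are all illustrative placeholders, not anything a specific provider documents.

```python
# Sketch: check whether a provider returns a *structured* tool call rather than
# a plain-text answer that merely describes calling the tool.
import json
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print("tool:", call.function.name)
    print("arguments:", json.loads(call.function.arguments))
else:
    print("No structured tool call returned:", message.content)
```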
What would help enormously here would be some kind of conformance suite. If models were reliably deterministic this would be easy: publish a set of test cases and let providers (or their customers) run those to check the model’s implementation.
Models aren’t deterministic though, even at a temperature of 0. Maybe this new effort from Artificial Analysis is exactly what we need here, especially since running a full benchmark suite against a provider can be quite expensive in terms of token spend.
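In the meantime, a scrappy version of that idea is something any customer can run themselves: repeat each test case several times against a provider and track a pass rate rather than a single pass/fail. A minimal sketch, with a placeholder endpoint, model name and test case:

```python
# Sketch of a tiny conformance-style check: because outputs aren't
# deterministic, run each test case several times and report a pass rate.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

TEST_CASES = [
    {"prompt": "What is 17 * 24? Reply with just the number.", "expected": "408"},
]
RUNS = 8

for case in TEST_CASES:
    passes = 0
    for _ in range(RUNS):
        response = client.chat.completions.create(
            model="openai/gpt-oss-120b",
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        if case["expected"] in (response.choices[0].message.content or ""):
            passes += 1
    print(f"{case['prompt'][:40]!r}: {passes}/{RUNS} runs matched")
```

Even this toy version illustrates the cost problem: every extra repetition multiplies the token spend.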
Tags: ai, generative-ai, local-llms, llms, gpt-oss, artificial-analysis
AI Summary and Description: Yes
Summary: The text discusses a benchmark analysis of OpenAI’s gpt-oss-120b model across various hosted platforms, revealing significant performance variations. It highlights the complexities and challenges associated with using open weight models, particularly concerning the choice of serving frameworks and quantization impacts, while suggesting that a conformance suite might be beneficial for consistency.
Detailed Description: The passage sheds light on a recent benchmark analysis conducted by Artificial Analysis focused on OpenAI’s gpt-oss-120b model when deployed on different hosting providers. This analysis is crucial for professionals in AI and cloud computing, as it underscores the inherent performance discrepancies that can arise based on deployment conditions and model management.
Key Points:
– **Benchmark Findings**:
– The performance results show significant variance among hosted providers.
– Top-performing providers included Cerebras, Nebius Base, Fireworks, Deepinfra, Novita, and Together.ai, all scoring 93.3%.
– Microsoft Azure scored 80% due to using an outdated vLLM version, which has since been corrected.
– AWS Bedrock’s results remain unexplained at the moment.
– **Challenges with Open Weight Models**:
– Customers often face difficulties in selecting appropriate serving frameworks and making decisions regarding quantization without clear benchmarks.
– The opaque nature of many open weight models can complicate the understanding of their performance.
– **Tool Calling Vulnerabilities**:
– Models are sensitive to tool calling conventions; inaccuracies here can lead to unpredictable outcomes that are hard to diagnose.
– **Need for a Conformance Suite**:
– The author argues for the development of a conformance suite to standardize performance assessments across different implementations.
– Because model outputs are not fully deterministic even at a temperature of 0, such a suite would need repeated runs and benchmark-style scoring rather than one-off pass/fail tests.
– **Implications for AI and Cloud Security**:
– Understanding the variability in model performance is critical for ensuring effective AI deployments, particularly from a security and compliance perspective.
– The challenges presented can have broader implications for securing AI systems and ensuring they meet compliance standards.
This analysis carries significant implications for AI professionals and cloud computing specialists, emphasizing the need for better tools and frameworks to evaluate model performance, ultimately leading to improved reliability and security in AI applications.