Source URL: https://simonwillison.net/2025/Aug/15/inconsistent-performance/
Source: Simon Willison’s Weblog
Title: Open weight LLMs exhibit inconsistent performance across providers
Feedly Summary: Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model – OpenAI’s gpt-oss-120b – performs across different hosted providers.
The results showed some surprising differences. Here’s the one with the greatest variance, a run of the 2025 AIME (American Invitational Mathematics Examination) averaging 32 runs against each model, using gpt-oss-120b with a reasoning effort of “high”:
These are some varied results!
93.3%: Cerebras, Nebius Base, Fireworks, Deepinfra, Novita, Together.ai, vLLM 0.10.1
90.0%: Parasail
86.7%: Groq
83.3%: Amazon
80.0%: Azure
36.7%: CompactifAI
It looks like most of the providers that scored 93.3% were running models using the latest vLLM (with the exception of Cerebras who I believe have their own custom serving stack).
Microsoft Azure’s Lucas Pickup confirmed that Azure’s 80% score was caused by running an older vLLM, now fixed:
This is exactly it, it’s been fixed as of yesterday afternoon across all serving instances (of the hosted 120b service). Old vLLM commits that didn’t respect reasoning_effort, so all requests defaulted to medium.
No news yet on what went wrong with the AWS Bedrock version.
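To make that quote concrete, here is a rough sketch of a request that explicitly asks for high reasoning effort from an OpenAI-compatible endpoint. The base_url and model name are placeholders, and passing reasoning_effort via extra_body is just one way to get the field into the request body; whether it is honoured is entirely up to the serving stack.

```python
# Sketch: explicitly requesting high reasoning effort from an OpenAI-compatible
# endpoint. The base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "How many positive divisors does 360 have?"}],
    # extra_body merges extra fields into the JSON request body; per the quote
    # above, older vLLM builds ignored reasoning_effort and defaulted to medium.
    extra_body={"reasoning_effort": "high"},
)
print(response.choices[0].message.content)
```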
The challenge for customers of open weight models
As a customer of open weight model providers, this really isn’t something I wanted to have to think about!
It’s not really a surprise though. When running models myself I inevitably have to make choices – about which serving framework to use (I’m usually picking between GGUF/llama.cpp and MLX on my own Mac laptop) and which quantization size to use.
I know that quantization has an impact, but it’s difficult for me to quantify that effect.
With hosted models, it looks like even knowing which quantization a provider is running isn’t necessarily enough information to predict the model’s performance.
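As a rough sketch of how I’d even start to measure that locally – assuming llama-cpp-python and two hypothetical GGUF quantizations of the same model – something like this lets you eyeball the same prompt across quantization levels:

```python
# Sketch: run the same prompt against two quantizations of the same model with
# llama-cpp-python. The .gguf filenames are hypothetical placeholders.
from llama_cpp import Llama

PROMPT = "A bag holds 3 red and 5 blue marbles. What is the probability of drawing red?"

for model_path in ["model-Q4_K_M.gguf", "model-Q8_0.gguf"]:
    llm = Llama(model_path=model_path, n_ctx=4096, verbose=False)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
    )
    print(model_path, "->", result["choices"][0]["message"]["content"])
```

A single prompt proves nothing on its own, of course – quantifying the effect properly means something much closer to a full benchmark run.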
I see this situation as a general challenge for open weight models. They tend to be released as an opaque set of model weights plus loose instructions for running them on a single platform – if we are lucky! Most AI labs leave quantization and format conversions to the community and third-party providers.
There’s a lot that can go wrong. Tool calling is particularly vulnerable to these differences – models have been trained on specific tool-calling conventions, and if a provider doesn’t get these exactly right the results can be unpredictable but difficult to diagnose.
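To show the kind of thing that can silently break, here’s a minimal sketch of probing a provider’s tool-calling behaviour through an OpenAI-compatible endpoint. The endpoint URL, model name and get_weather tool are all illustrative placeholders, not anything a specific provider documents.

```python
# Sketch: check whether a provider returns a *structured* tool call rather than
# a plain-text answer that merely describes calling the tool.
import json
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print("tool:", call.function.name)
    print("arguments:", json.loads(call.function.arguments))
else:
    print("No structured tool call returned:", message.content)
```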
What would help enormously here would be some kind of conformance suite. If models were reliably deterministic this would be easy: publish a set of test cases and let providers (or their customers) run those to check the model’s implementation.
Models aren’t deterministic though, even at a temperature of 0. Maybe this new effort from Artificial Analysis is exactly what we need here, especially since running a full benchmark suite against a provider can be quite expensive in terms of token spend.
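In the meantime, a scrappy version of that idea is something any customer can run themselves: repeat each test case several times against a provider and track a pass rate rather than a single pass/fail. A minimal sketch, with a placeholder endpoint, model name and test case:

```python
# Sketch of a tiny conformance-style check: because outputs aren't
# deterministic, run each test case several times and report a pass rate.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

TEST_CASES = [
    {"prompt": "What is 17 * 24? Reply with just the number.", "expected": "408"},
]
RUNS = 8

for case in TEST_CASES:
    passes = 0
    for _ in range(RUNS):
        response = client.chat.completions.create(
            model="openai/gpt-oss-120b",
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        if case["expected"] in (response.choices[0].message.content or ""):
            passes += 1
    print(f"{case['prompt'][:40]!r}: {passes}/{RUNS} runs matched")
```

Even this toy version illustrates the cost problem: every extra repetition multiplies the token spend.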
Tags: ai, generative-ai, local-llms, llms, gpt-oss, artificial-analysis
AI Summary and Description: Yes
Summary: The text discusses a benchmark analysis of OpenAI’s gpt-oss-120b model across various hosted platforms, revealing significant performance variations. It highlights the complexities and challenges associated with using open weight models, particularly concerning the choice of serving frameworks and quantization impacts, while suggesting that a conformance suite might be beneficial for consistency.
Detailed Description: The passage sheds light on a recent benchmark analysis conducted by Artificial Analysis focused on OpenAI’s gpt-oss-120b model when deployed on different hosting providers. This analysis is crucial for professionals in AI and cloud computing, as it underscores the inherent performance discrepancies that can arise based on deployment conditions and model management.
Key Points:
– **Benchmark Findings**:
– The performance results show significant variance among hosted providers.
– Top-performing providers included Cerebras, Nebius Base, Fireworks, Deepinfra, Novita, and Together.ai, all scoring 93.3%.
– Microsoft Azure scored 80% due to using an outdated vLLM version, which has since been corrected.
– AWS Bedrock’s results remain unexplained at the moment.
– **Challenges with Open Weight Models**:
– Customers often face difficulties in selecting appropriate serving frameworks and making decisions regarding quantization without clear benchmarks.
– The opaque nature of many open weight models can complicate the understanding of their performance.
– **Tool Calling Vulnerabilities**:
– Models are sensitive to tool calling conventions; inaccuracies here can lead to unpredictable outcomes that are hard to diagnose.
– **Need for a Conformance Suite**:
– The author argues for the development of a conformance suite to standardize performance assessments across different implementations.
– Because model outputs are not fully deterministic even at a temperature of 0, such a suite would need repeated runs and benchmark-style scoring rather than one-off pass/fail tests.
– **Implications for AI and Cloud Security**:
– Understanding the variability in model performance is critical for ensuring effective AI deployments, particularly from a security and compliance perspective.
– The challenges presented can have broader implications for securing AI systems and ensuring they meet compliance standards.
This analysis carries significant implications for AI professionals and cloud computing specialists, emphasizing the need for better tools and frameworks to evaluate model performance, ultimately leading to improved reliability and security in AI applications.