Source URL: https://simonwillison.net/2025/Aug/17/gpt-oss-eval-suite/#atom-everything
Source: Simon Willison’s Weblog
Title: TIL: Running a gpt-oss eval suite against LM Studio on a Mac
Feedly Summary: TIL: Running a gpt-oss eval suite against LM Studio on a Mac
The other day I learned that OpenAI published a set of evals as part of their gpt-oss model release, described in their cookbook on Verifying gpt-oss implementations.
I decided to try and run that eval suite on my own MacBook Pro, against gpt-oss-20b running inside of LM Studio.
TLDR: once I had the model running inside LM Studio with a longer-than-default context limit, the following incantation ran an eval suite in around 3.5 hours:
mkdir /tmp/aime25_openai
OPENAI_API_KEY=x \
  uv run --python 3.13 --with 'gpt-oss[eval]' \
  python -m gpt_oss.evals \
    --base-url http://localhost:1234/v1 \
    --eval aime25 \
    --sampler chat_completions \
    --model openai/gpt-oss-20b \
    --reasoning-effort low \
    --n-threads 2
My new TIL breaks that command down in detail and walks through the underlying eval – AIME 2025, which asks 30 questions (8 times each) that are defined using the following format:
{“question": "Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.", "answer": "70"}
Tags: python, ai, til, openai, generative-ai, local-llms, llms, evals, uv, lm-studio, gpt-oss
AI Summary and Description: Yes
Summary: The text discusses running an evaluation suite from OpenAI’s gpt-oss on a Mac environment using LM Studio. It outlines a specific command and describes the AIME 2025 evaluation process, which poses mathematical questions to test the model’s capabilities. This insight is beneficial for AI practitioners looking to validate the performance and outputs of generative models.
Detailed Description: The content highlights an example of utilizing OpenAI’s gpt-oss model for evaluation purposes, showcasing practical steps and insights for developers and researchers in the AI field. Here are the major points worth noting:
– **Evaluation Suite Implementation**: The text introduces the gpt-oss eval suite, which OpenAI published to help verify that implementations of its gpt-oss models behave correctly.
– **Execution Environment**: The evaluation was conducted on a MacBook Pro, demonstrating how local infrastructure can be used for testing large language models (LLMs).
– **Command Breakdown**: A specific command is provided to facilitate running the evaluation suite, emphasizing how to set up the environment and execute the tests (a sketch for sanity-checking the local endpoint follows this list).
– Key command components include:
– Directory creation for evaluation outputs
– Setting the environment variable for OpenAI’s API key
– Use of the `uv` command for running Python applications
– Parameters that define the evaluation, such as model type, base URL, and reasoning effort.
– **AIME 2025 Evaluation**: The evaluation itself consists of a series of mathematical questions, providing insights into the model’s reasoning and problem-solving capabilities. It involves:
– 30 questions, each asked 8 times, to assess consistency and accuracy (a simplified scoring sketch follows this list).
– Example question format illustrates the complexity and logical reasoning required from the model.
– **Relevance for AI Professionals**: This practical example serves as a guide for AI security and LLM practitioners who want to assess the robustness of their models or need insights into evaluation methodologies.
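As a hypothetical complement to the command breakdown above, the following sketch checks that LM Studio's OpenAI-compatible server is up and serving the expected model before committing to a multi-hour run; the endpoint and model name come from the command, everything else is an assumption:

from openai import OpenAI

# A dummy key satisfies the client; LM Studio ignores it
client = OpenAI(base_url="http://localhost:1234/v1", api_key="x")

# List whatever models LM Studio is currently serving
served = [model.id for model in client.models.list().data]
print(served)

if "openai/gpt-oss-20b" not in served:
    print("Load gpt-oss-20b in LM Studio (and raise its context limit) before starting the eval")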
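And to make the AIME 2025 process above concrete, here is a rough, simplified sketch of the shape of such an eval loop. It is not the gpt_oss.evals implementation: the file name aime25.jsonl, the added answer-only instruction, and the exact-match scoring are all illustrative assumptions.

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="x")

def ask(question: str) -> str:
    # Send one question to the local model and return its reply text
    response = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": question + "\nGive only the final numeric answer."}],
    )
    return response.choices[0].message.content.strip()

# aime25.jsonl is a stand-in path; each line holds {"question": ..., "answer": ...}
with open("aime25.jsonl") as f:
    problems = [json.loads(line) for line in f]

scores = []
for problem in problems:
    # Ask each question 8 times, mirroring the real suite, and score by exact match on the answer string
    replies = [ask(problem["question"]) for _ in range(8)]
    scores.append(sum(reply == problem["answer"] for reply in replies) / 8)

print(f"Mean accuracy over {len(problems)} questions: {sum(scores) / len(scores):.2%}")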
Overall, the text encapsulates an important aspect of model validation, which is crucial for ensuring security and reliability in AI applications. This evaluation framework can assist professionals in identifying strengths and weaknesses in generative AI models, paving the way for improved model governance and compliance in future AI deployments.