Source URL: https://simonwillison.net/2025/Aug/17/gpt-oss-eval-suite/#atom-everything
Source: Simon Willison’s Weblog
Title: TIL: Running a gpt-oss eval suite against LM Studio on a Mac
Feedly Summary: TIL: Running a gpt-oss eval suite against LM Studio on a Mac
The other day I learned that OpenAI published a set of evals as part of their gpt-oss model release, described in their cookbook on Verifying gpt-oss implementations.
I decided to try and run that eval suite on my own MacBook Pro, against gpt-oss-20b running inside of LM Studio.
TLDR: once I had the model running inside LM Studio with a longer-than-default context limit, the following incantation ran an eval suite in around 3.5 hours:
mkdir /tmp/aime25_openai
OPENAI_API_KEY=x \
  uv run --python 3.13 --with 'gpt-oss[eval]' \
  python -m gpt_oss.evals \
    --base-url http://localhost:1234/v1 \
    --eval aime25 \
    --sampler chat_completions \
    --model openai/gpt-oss-20b \
    --reasoning-effort low \
    --n-threads 2
My new TIL breaks that command down in detail and walks through the underlying eval – AIME 2025, which asks 30 questions (8 times each) that are defined using the following format:
{“question": "Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.", "answer": "70"}
Tags: python, ai, til, openai, generative-ai, local-llms, llms, evals, uv, lm-studio, gpt-oss
AI Summary and Description: Yes
Summary: The text discusses running an evaluation suite from OpenAI’s gpt-oss on a Mac environment using LM Studio. It outlines a specific command and describes the AIME 2025 evaluation process, which poses mathematical questions to test the model’s capabilities. This insight is beneficial for AI practitioners looking to validate the performance and outputs of generative models.
Detailed Description: The content highlights an example of utilizing OpenAI’s gpt-oss model for evaluation purposes, showcasing practical steps and insights for developers and researchers in the AI field. Here are the major points worth noting:
– **Evaluation Suite Implementation**: The text introduces the gpt-oss eval suite, which OpenAI published to help verify that implementations of its gpt-oss models behave correctly.
– **Execution Environment**: The evaluation was conducted on a MacBook Pro, demonstrating how local infrastructure can be used for testing large language models (LLMs).
– **Command Breakdown**: A specific command is provided to facilitate running the evaluation suite, emphasizing how to set up the environment and execute the tests (a sketch for sanity-checking the local endpoint follows this list).
– Key command components include:
– Directory creation for evaluation outputs
– Setting the environment variable for OpenAI’s API key
– Use of the `uv` command for running Python applications
– Parameters that define the evaluation, such as model type, base URL, and reasoning effort.
– **AIME 2025 Evaluation**: The evaluation itself consists of a series of mathematical questions, providing insights into the model’s reasoning and problem-solving capabilities. It involves:
– 30 questions, each asked 8 times, to assess consistency and accuracy (a simplified scoring sketch follows this list).
– Example question format illustrates the complexity and logical reasoning required from the model.
– **Relevance for AI Professionals**: This practical example serves as a guide for AI security and LLM practitioners who want to assess the robustness of their models or need insights into evaluation methodologies.
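As a hypothetical complement to the command breakdown above, the following sketch checks that LM Studio's OpenAI-compatible server is up and serving the expected model before committing to a multi-hour run; the endpoint and model name come from the command, everything else is an assumption:

from openai import OpenAI

# A dummy key satisfies the client; LM Studio ignores it
client = OpenAI(base_url="http://localhost:1234/v1", api_key="x")

# List whatever models LM Studio is currently serving
served = [model.id for model in client.models.list().data]
print(served)

if "openai/gpt-oss-20b" not in served:
    print("Load gpt-oss-20b in LM Studio (and raise its context limit) before starting the eval")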
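And to make the AIME 2025 process above concrete, here is a rough, simplified sketch of the shape of such an eval loop. It is not the gpt_oss.evals implementation: the file name aime25.jsonl, the added answer-only instruction, and the exact-match scoring are all illustrative assumptions.

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="x")

def ask(question: str) -> str:
    # Send one question to the local model and return its reply text
    response = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": question + "\nGive only the final numeric answer."}],
    )
    return response.choices[0].message.content.strip()

# aime25.jsonl is a stand-in path; each line holds {"question": ..., "answer": ...}
with open("aime25.jsonl") as f:
    problems = [json.loads(line) for line in f]

scores = []
for problem in problems:
    # Ask each question 8 times, mirroring the real suite, and score by exact match on the answer string
    replies = [ask(problem["question"]) for _ in range(8)]
    scores.append(sum(reply == problem["answer"] for reply in replies) / 8)

print(f"Mean accuracy over {len(problems)} questions: {sum(scores) / len(scores):.2%}")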
Overall, the text encapsulates an important aspect of model validation, which is crucial for ensuring security and reliability in AI applications. This evaluation framework can assist professionals in identifying strengths and weaknesses in generative AI models, paving the way for improved model governance and compliance in future AI deployments.