Simon Willison’s Weblog: llama.cpp guide: running gpt-oss with llama.cpp

Source URL: https://simonwillison.net/2025/Aug/19/gpt-oss-with-llama-cpp/
Source: Simon Willison’s Weblog
Title: llama.cpp guide: running gpt-oss with llama.cpp

Feedly Summary:
Really useful official guide to running the OpenAI gpt-oss models using llama-server from llama.cpp – which provides an OpenAI-compatible localhost API and a neat web interface for interacting with the models.
TLDR version for macOS to run the smaller gpt-oss-20b model:
brew install llama.cpp
llama-server -hf ggml-org/gpt-oss-20b-GGUF \
  --ctx-size 0 --jinja -ub 2048 -b 2048 -ngl 99 -fa
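
For readers new to llama.cpp, here is my reading of what each flag does, based on llama-server's --help output (behavior can vary between versions, so verify against your install):

# -hf ggml-org/gpt-oss-20b-GGUF   fetch the model from this Hugging Face repo
# --ctx-size 0                    0 means use the model's full trained context length
# --jinja                         apply the chat template bundled with the model
# -ub 2048 / -b 2048              physical (micro) and logical batch sizes
# -ngl 99                         offload up to 99 layers to the GPU, i.e. all of them
# -fa                             enable Flash Attention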

This downloads a 12GB model file from ggml-org/gpt-oss-20b-GGUF on Hugging Face, stores it in ~/Library/Caches/llama.cpp/ and starts it running on port 8080 (llama-server's default).
You can then visit this URL to start interacting with the model:
http://localhost:8080/
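
Since llama-server also exposes OpenAI-compatible endpoints, a quick smoke test from the terminal might look like this (a minimal sketch assuming the setup above; the prompt is just an example):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'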

On my 64GB M2 MacBook Pro it runs at around 82 tokens/second.

The guide also includes notes for running on NVIDIA and AMD hardware.
Via @ggerganov
Tags: macos, ai, openai, generative-ai, local-llms, llms, llama-cpp, gpt-oss

AI Summary and Description: Yes

Summary: The text provides a guide for running OpenAI’s gpt-oss models using llama.cpp, offering insights into the deployment of AI models on local systems. This is relevant for professionals in AI and infrastructure security as it combines practical setup instructions with considerations for hardware compatibility.

Detailed Description:

The guide explains how to run OpenAI's gpt-oss models with llama-server, the server component of llama.cpp. It lays out a straightforward macOS setup process and includes details that are useful for security and compliance professionals deploying AI locally.

Key Points:
- **Installation Procedure**:
  - Using Homebrew to install llama.cpp.
  - Pointing llama-server at the gpt-oss-20b model, a sizeable 12GB download.
  - Caching the model file locally and serving it on a local port (8080 by default).

- **Performance Considerations**:
  - The model runs at approximately 82 tokens per second on a 64GB M2 MacBook Pro, which gives a sense of the hardware needed for responsive local inference.

- **Hardware Compatibility**:
  - The guide also covers running the setup on NVIDIA and AMD hardware, which matters when tuning performance or ensuring compatibility across different infrastructures; see the first sketch after this list.

- **Local API Access**:
  - llama-server exposes an OpenAI-compatible local API alongside a built-in web interface, so all interaction with the model stays on the local machine. That is a useful property for data privacy and compliance; see the second sketch after this list.
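
On the hardware point: the exact build steps are in the guide itself, but as a rough sketch (flag names per llama.cpp's own build documentation; AMD uses a HIP-based build instead, so consult the guide for specifics), a CUDA-enabled build from source looks something like this:

# Sketch of a CUDA build of llama.cpp; verify flags against the repo's docs
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release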
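
On local API access: because the endpoints mirror OpenAI's, existing OpenAI client libraries can be pointed at the local server so no data leaves the machine. A minimal sketch, assuming OpenAI's official clients (which honor the OPENAI_BASE_URL environment variable) and a llama-server started without an --api-key:

# Route an OpenAI client library to the local llama-server instead of api.openai.com
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=sk-local-placeholder   # any value works when no --api-key is set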

Overall, this guide is significant for professionals working in AI, cloud computing, and infrastructure security, as it exemplifies the integration of AI with local computing resources while emphasizing practical implementation details and hardware considerations.