Simon Willison’s Weblog: llama.cpp guide: running gpt-oss with llama.cpp

Source URL: https://simonwillison.net/2025/Aug/19/gpt-oss-with-llama-cpp/
Source: Simon Willison’s Weblog
Title: llama.cpp guide: running gpt-oss with llama.cpp

Feedly Summary:
Really useful official guide to running the OpenAI gpt-oss models using llama-server from llama.cpp – which provides an OpenAI-compatible localhost API and a neat web interface for interacting with the models.
TLDR version for macOS to run the smaller gpt-oss-20b model:
brew install llama.cpp
llama-server -hf ggml-org/gpt-oss-20b-GGUF \
  --ctx-size 0 --jinja -ub 2048 -b 2048 -ngl 99 -fa
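
For readers new to llama.cpp, here is my reading of what each flag does, based on llama-server's --help output (behavior can vary between versions, so verify against your install):

# -hf ggml-org/gpt-oss-20b-GGUF   fetch the model from this Hugging Face repo
# --ctx-size 0                    0 means use the model's full trained context length
# --jinja                         apply the chat template bundled with the model
# -ub 2048 / -b 2048              physical (micro) and logical batch sizes
# -ngl 99                         offload up to 99 layers to the GPU, i.e. all of them
# -fa                             enable Flash Attention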

This downloads a 12GB model file from ggml-org/gpt-oss-20b-GGUF on Hugging Face, stores it in ~/Library/Caches/llama.cpp/ and starts it running on port 8080 (llama-server's default).
You can then visit this URL to start interacting with the model:
http://localhost:8080/
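
Since llama-server also exposes OpenAI-compatible endpoints, a quick smoke test from the terminal might look like this (a minimal sketch assuming the setup above; the prompt is just an example):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'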

On my 64GB M2 MacBook Pro it runs at around 82 tokens/second.

The guide also includes notes for running on NVIDIA and AMD hardware.
Via @ggerganov
Tags: macos, ai, openai, generative-ai, local-llms, llms, llama-cpp, gpt-oss

AI Summary and Description: Yes

Summary: The text provides a guide for running OpenAI’s gpt-oss models using llama.cpp, offering insights into the deployment of AI models on local systems. This is relevant for professionals in AI and infrastructure security as it combines practical setup instructions with considerations for hardware compatibility.

Detailed Description:

The guide explains how to run OpenAI's gpt-oss models with llama-server, the server component of llama.cpp. It lays out a straightforward macOS setup process and includes details that are useful for security and compliance professionals deploying AI locally.

Key Points:
- **Installation Procedure**:
  - Using Homebrew to install llama.cpp.
  - Pointing llama-server at the gpt-oss-20b model, a sizeable 12GB download.
  - Caching the model file locally and serving it on a local port (8080 by default).

- **Performance Considerations**:
  - The model runs at approximately 82 tokens per second on a 64GB M2 MacBook Pro, which gives a sense of the hardware needed for responsive local inference.

- **Hardware Compatibility**:
  - The guide also covers running the setup on NVIDIA and AMD hardware, which matters when tuning performance or ensuring compatibility across different infrastructures; see the first sketch after this list.

- **Local API Access**:
  - llama-server exposes an OpenAI-compatible local API alongside a built-in web interface, so all interaction with the model stays on the local machine. That is a useful property for data privacy and compliance; see the second sketch after this list.
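
On the hardware point: the exact build steps are in the guide itself, but as a rough sketch (flag names per llama.cpp's own build documentation; AMD uses a HIP-based build instead, so consult the guide for specifics), a CUDA-enabled build from source looks something like this:

# Sketch of a CUDA build of llama.cpp; verify flags against the repo's docs
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release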
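
On local API access: because the endpoints mirror OpenAI's, existing OpenAI client libraries can be pointed at the local server so no data leaves the machine. A minimal sketch, assuming OpenAI's official clients (which honor the OPENAI_BASE_URL environment variable) and a llama-server started without an --api-key:

# Route an OpenAI client library to the local llama-server instead of api.openai.com
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=sk-local-placeholder   # any value works when no --api-key is set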

Overall, this guide is significant for professionals working in AI, cloud computing, and infrastructure security, as it exemplifies the integration of AI with local computing resources while emphasizing practical implementation details and hardware considerations.