Simon Willison’s Weblog: CompileBench: Can AI Compile 22-year-old Code?

Source URL: https://simonwillison.net/2025/Sep/22/compilebench/
Source: Simon Willison’s Weblog
Title: CompileBench: Can AI Compile 22-year-old Code?

Feedly Summary: CompileBench: Can AI Compile 22-year-old Code?
Interesting new LLM benchmark from Piotr Grabowski and Piotr Migdał: how well can different models handle compilation challenges such as cross-compiling curl for the ARM64 architecture?
This is one of my favorite applications of coding agent tools like Claude Code or Codex CLI: I no longer fear working through convoluted build processes for software I’m unfamiliar with because I’m confident an LLM will be able to brute-force figure out how to do it.
The benchmark on compilebench.com currently shows Claude Opus 4.1 Thinking in the lead, as the only model to solve 100% of problems (allowing three attempts). Claude Sonnet 4 Thinking and GPT-5 high both score 93%. The highest open weight model scores are DeepSeek 3.1 and Kimi K2 0905, both at 80%.
This chart showing performance against cost helps demonstrate the excellent value for money provided by GPT-5-mini:

The Gemini 2.5 family does surprisingly badly, solving just 60% of the problems. The benchmark authors note that:

When designing the benchmark we kept our benchmark harness and prompts minimal, avoiding model-specific tweaks. It is possible that Google models could perform better with a harness or prompt specifically hand-tuned for them, but this is against our principles in this benchmark.

The harness itself is available on GitHub. It’s written in Go – I had a poke around and found their core agentic loop in bench/agent.go – it builds on top of the OpenAI Go library and defines a single tool called run_terminal_cmd, described as “Execute a terminal command inside a bash shell”.
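Based on that description (and not on the harness’s actual source), a minimal Go sketch of a single-tool agentic loop might look like the following. The callModel placeholder, the Message and ToolCall types, and the example task are all hypothetical stand-ins – the real bench/agent.go builds on the OpenAI Go library’s own chat and tool-call types:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// Message and ToolCall are simplified stand-ins for the chat-completion
// types the real harness gets from the OpenAI Go library.
type ToolCall struct {
	Name    string // always "run_terminal_cmd" in this benchmark
	Command string // the shell command the model wants to run
}

type Message struct {
	Role      string // "system", "user", "assistant" or "tool"
	Content   string
	ToolCalls []ToolCall
}

// callModel is a placeholder for a chat-completion request that advertises
// the single run_terminal_cmd tool to the model and returns its reply.
func callModel(ctx context.Context, history []Message) (Message, error) {
	// The real harness makes an OpenAI Go library call here, passing the
	// run_terminal_cmd tool definition alongside the conversation history.
	return Message{}, fmt.Errorf("not implemented: plug in an LLM client")
}

// runTerminalCmd executes the model's command inside a bash shell and
// returns combined stdout/stderr, mirroring the tool's description.
func runTerminalCmd(ctx context.Context, command string) string {
	out, err := exec.CommandContext(ctx, "bash", "-c", command).CombinedOutput()
	if err != nil {
		return string(out) + "\nerror: " + err.Error()
	}
	return string(out)
}

// agentLoop keeps calling the model, executing any requested tool calls and
// feeding their output back, until the model stops asking for commands.
func agentLoop(ctx context.Context, systemPrompt, task string) error {
	history := []Message{
		{Role: "system", Content: systemPrompt},
		{Role: "user", Content: task},
	}
	for {
		reply, err := callModel(ctx, history)
		if err != nil {
			return err
		}
		history = append(history, reply)
		if len(reply.ToolCalls) == 0 {
			return nil // no more commands requested: the attempt is over
		}
		for _, tc := range reply.ToolCalls {
			output := runTerminalCmd(ctx, tc.Command)
			history = append(history, Message{Role: "tool", Content: output})
		}
	}
}

func main() {
	// Hypothetical invocation; the actual system prompt is quoted below.
	err := agentLoop(context.Background(),
		"You are a package-building specialist...",
		"Build curl as a static ARM64 binary")
	if err != nil {
		fmt.Println("agent stopped:", err)
	}
}
```

The interesting design choice is how little there is to it: one tool, a plain conversation history, and the model is left to figure out the build process command by command.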
The system prompts live in bench/container/environment.go and differ based on the operating system of the container. Here’s the system prompt for ubuntu-22.04-amd64:

You are a package-building specialist operating a Ubuntu 22.04 bash shell via one tool: run_terminal_cmd.
The current working directory of every run_terminal_cmd is /home/peter.
Execution rules:

Always pass non-interactive flags for any command that could prompt (e.g., -y, --yes, DEBIAN_FRONTEND=noninteractive).
Don’t include any newlines in the command.
You can use sudo.

If you encounter any errors or issues while doing the user’s request, you must fix them and continue the task.
At the end verify you did the user request correctly.
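Those execution rules map fairly directly onto how each tool call could be run on the harness side. Here’s a rough, hypothetical sketch (not the benchmark’s actual code) of a run_terminal_cmd executor that honours them, assuming a bash -c invocation, the /home/peter working directory from the prompt, and an arbitrary 10-minute timeout:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// execNonInteractive runs a single-line command the way the system prompt
// expects: inside bash, with apt told to be non-interactive, from the
// container's /home/peter working directory, and with a timeout so a hung
// build step cannot stall the whole attempt. The 10-minute limit is an
// assumption, not something the benchmark documents.
func execNonInteractive(parent context.Context, command string) (string, error) {
	ctx, cancel := context.WithTimeout(parent, 10*time.Minute)
	defer cancel()

	cmd := exec.CommandContext(ctx, "bash", "-c", command)
	cmd.Dir = "/home/peter" // working directory named in the system prompt
	cmd.Env = append(cmd.Environ(), "DEBIAN_FRONTEND=noninteractive")

	out, err := cmd.CombinedOutput()
	return string(out), err
}

func main() {
	out, err := execNonInteractive(context.Background(),
		"apt-get install -y build-essential && echo done")
	fmt.Println(out, err)
}
```

This assumes the Ubuntu 22.04 container layout described in the prompt; on a machine without /home/peter the chdir will simply fail and the error is returned to the caller.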

Via Hacker News
Tags: go, ai, prompt-engineering, generative-ai, llms, ai-assisted-programming, evals, coding-agents

AI Summary and Description: Yes

Summary: The text discusses a new benchmark for evaluating large language models (LLMs) in their ability to handle software compilation tasks, particularly focusing on how well different AIs can manage specialized coding challenges. This benchmarking underscores the practical application of AI in streamlining complex software processes and provides valuable insights for developers and AI practitioners.

Detailed Description:
The text provides an overview of CompileBench, a benchmark created by Piotr Grabowski and Piotr Migdał that measures the efficacy of various LLMs in compiling code, with a specific focus on cross-compiling for ARM64 architecture. Key insights from this benchmark are:

– **Performance Evaluation of LLMs**: The benchmark evaluates how different models perform on a set of compilation challenges, providing comparative results:
– **Claude Opus 4.1 Thinking**: Leads as the only model with a 100% problem-solving rate when allowed three attempts.
– **Claude Sonnet 4 Thinking and GPT-5 high**: Both achieved a 93% success rate.
– **DeepSeek 3.1 and Kimi K2 0905**: Both score an 80% success rate.
– **Gemini 2.5 family**: Performs poorly, with only 60% success.

– **Model Handling and Benchmark Design**: The authors intentionally kept the harness and prompts minimal and model-agnostic, avoiding model-specific tweaks. They acknowledge that some models (such as Google’s Gemini family) might score higher with a hand-tuned harness or prompt, but treat that as against the principles of the benchmark.

– **Implementation Details**: The harness exposes a single tool, `run_terminal_cmd`, which executes commands in a bash shell. The system prompt defines a controlled environment with clear execution rules: commands must use non-interactive flags, avoid newlines, may use sudo, and the model is told to fix any errors it hits and verify the result at the end.

– **Open Repository**: The harness for the benchmark is available on GitHub, fostering openness and collaboration within the AI community.

This benchmark is significant for AI practitioners and software developers, as it demonstrates how LLMs can facilitate software development processes, particularly in compiling complex codebases. The evaluation also informs decisions on which models to leverage based on their performance in real-world applications, highlighting considerations for efficiency and cost.

– **Practical Implications**:
– Reinforces the value of LLMs in programming tasks, particularly for unfamiliar software environments.
– Insights from benchmarking can guide future developments in AI-assisted programming tools.
– The focus on non-tweaked models promotes transparency in AI performance evaluations, facilitating better understanding and trust in LLM capabilities.

Overall, the CompileBench evaluation serves as a significant contribution to understanding how AI can aid in software engineering, providing a foundation for further research and application in coding assistance.