Simon Willison’s Weblog: CompileBench: Can AI Compile 22-year-old Code?

Source URL: https://simonwillison.net/2025/Sep/22/compilebench/
Source: Simon Willison’s Weblog
Title: CompileBench: Can AI Compile 22-year-old Code?

Feedly Summary: CompileBench: Can AI Compile 22-year-old Code?
Interesting new LLM benchmark from Piotr Grabowski and Piotr Migdał: how well can different models handle compilation challenges such as cross-compiling curl for the ARM64 architecture?
This is one of my favorite applications of coding agent tools like Claude Code or Codex CLI: I no longer fear working through convoluted build processes for software I’m unfamiliar with because I’m confident an LLM will be able to brute-force figure out how to do it.
The benchmark on compilebench.com currently shows Claude Opus 4.1 Thinking in the lead, as the only model to solve 100% of problems (allowing three attempts). Claude Sonnet 4 Thinking and GPT-5 high both score 93%. The highest open weight model scores are DeepSeek 3.1 and Kimi K2 0905, both at 80%.
This chart showing performance against cost helps demonstrate the excellent value for money provided by GPT-5-mini:

The Gemini 2.5 family does surprisingly badly, solving just 60% of the problems. The benchmark authors note that:

When designing the benchmark we kept our benchmark harness and prompts minimal, avoiding model-specific tweaks. It is possible that Google models could perform better with a harness or prompt specifically hand-tuned for them, but this is against our principles in this benchmark.

The harness itself is available on GitHub. It’s written in Go – I had a poke around and found their core agentic loop in bench/agent.go – it builds on top of the OpenAI Go library and defines a single tool called run_terminal_cmd, described as “Execute a terminal command inside a bash shell”.
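Based on that description (and not on the harness’s actual source), a minimal Go sketch of a single-tool agentic loop might look like the following. The callModel placeholder, the Message and ToolCall types, and the example task are all hypothetical stand-ins – the real bench/agent.go builds on the OpenAI Go library’s own chat and tool-call types:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// Message and ToolCall are simplified stand-ins for the chat-completion
// types the real harness gets from the OpenAI Go library.
type ToolCall struct {
	Name    string // always "run_terminal_cmd" in this benchmark
	Command string // the shell command the model wants to run
}

type Message struct {
	Role      string // "system", "user", "assistant" or "tool"
	Content   string
	ToolCalls []ToolCall
}

// callModel is a placeholder for a chat-completion request that advertises
// the single run_terminal_cmd tool to the model and returns its reply.
func callModel(ctx context.Context, history []Message) (Message, error) {
	// The real harness makes an OpenAI Go library call here, passing the
	// run_terminal_cmd tool definition alongside the conversation history.
	return Message{}, fmt.Errorf("not implemented: plug in an LLM client")
}

// runTerminalCmd executes the model's command inside a bash shell and
// returns combined stdout/stderr, mirroring the tool's description.
func runTerminalCmd(ctx context.Context, command string) string {
	out, err := exec.CommandContext(ctx, "bash", "-c", command).CombinedOutput()
	if err != nil {
		return string(out) + "\nerror: " + err.Error()
	}
	return string(out)
}

// agentLoop keeps calling the model, executing any requested tool calls and
// feeding their output back, until the model stops asking for commands.
func agentLoop(ctx context.Context, systemPrompt, task string) error {
	history := []Message{
		{Role: "system", Content: systemPrompt},
		{Role: "user", Content: task},
	}
	for {
		reply, err := callModel(ctx, history)
		if err != nil {
			return err
		}
		history = append(history, reply)
		if len(reply.ToolCalls) == 0 {
			return nil // no more commands requested: the attempt is over
		}
		for _, tc := range reply.ToolCalls {
			output := runTerminalCmd(ctx, tc.Command)
			history = append(history, Message{Role: "tool", Content: output})
		}
	}
}

func main() {
	// Hypothetical invocation; the actual system prompt is quoted below.
	err := agentLoop(context.Background(),
		"You are a package-building specialist...",
		"Build curl as a static ARM64 binary")
	if err != nil {
		fmt.Println("agent stopped:", err)
	}
}
```

The interesting design choice is how little there is to it: one tool, a plain conversation history, and the model is left to figure out the build process command by command.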
The system prompts live in bench/container/environment.go and differ based on the operating system of the container. Here’s the system prompt for ubuntu-22.04-amd64:

You are a package-building specialist operating a Ubuntu 22.04 bash shell via one tool: run_terminal_cmd.
The current working directory of every run_terminal_cmd is /home/peter.
Execution rules:

Always pass non-interactive flags for any command that could prompt (e.g., -y, --yes, DEBIAN_FRONTEND=noninteractive).
Don’t include any newlines in the command.
You can use sudo.

If you encounter any errors or issues while doing the user’s request, you must fix them and continue the task.
At the end verify you did the user request correctly.
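Those execution rules map fairly directly onto how each tool call could be run on the harness side. Here’s a rough, hypothetical sketch (not the benchmark’s actual code) of a run_terminal_cmd executor that honours them, assuming a bash -c invocation, the /home/peter working directory from the prompt, and an arbitrary 10-minute timeout:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// execNonInteractive runs a single-line command the way the system prompt
// expects: inside bash, with apt told to be non-interactive, from the
// container's /home/peter working directory, and with a timeout so a hung
// build step cannot stall the whole attempt. The 10-minute limit is an
// assumption, not something the benchmark documents.
func execNonInteractive(parent context.Context, command string) (string, error) {
	ctx, cancel := context.WithTimeout(parent, 10*time.Minute)
	defer cancel()

	cmd := exec.CommandContext(ctx, "bash", "-c", command)
	cmd.Dir = "/home/peter" // working directory named in the system prompt
	cmd.Env = append(cmd.Environ(), "DEBIAN_FRONTEND=noninteractive")

	out, err := cmd.CombinedOutput()
	return string(out), err
}

func main() {
	out, err := execNonInteractive(context.Background(),
		"apt-get install -y build-essential && echo done")
	fmt.Println(out, err)
}
```

This assumes the Ubuntu 22.04 container layout described in the prompt; on a machine without /home/peter the chdir will simply fail and the error is returned to the caller.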

Via Hacker News
Tags: go, ai, prompt-engineering, generative-ai, llms, ai-assisted-programming, evals, coding-agents

AI Summary and Description: Yes

Summary: The text discusses a new benchmark for evaluating large language models (LLMs) in their ability to handle software compilation tasks, particularly focusing on how well different AIs can manage specialized coding challenges. This benchmarking underscores the practical application of AI in streamlining complex software processes and provides valuable insights for developers and AI practitioners.

Detailed Description:
The text provides an overview of CompileBench, a benchmark created by Piotr Grabowski and Piotr Migdał that measures the efficacy of various LLMs in compiling code, with a specific focus on cross-compiling for ARM64 architecture. Key insights from this benchmark are:

– **Performance Evaluation of LLMs**: The benchmark evaluates how different models perform on a set of compilation challenges, providing comparative results:
– **Claude Opus 4.1 Thinking**: Leads as the only model with a 100% problem-solving rate when allowed three attempts.
– **Claude Sonnet 4 Thinking and GPT-5 high**: Both achieved a 93% success rate.
– **DeepSeek 3.1 and Kimi K2 0905**: Both score an 80% success rate.
– **Gemini 2.5 family**: Performs poorly, with only 60% success.

– **Model Handling and Benchmark Design**: The authors intentionally kept the harness and prompts minimal and model-agnostic, avoiding model-specific tweaks. They acknowledge that some models (such as Google’s Gemini family) might score higher with a hand-tuned harness or prompt, but treat that as against the principles of the benchmark.

– **Implementation Details**: The harness exposes a single tool, `run_terminal_cmd`, which executes commands in a bash shell. The system prompt defines a controlled environment with clear execution rules: commands must use non-interactive flags, avoid newlines, may use sudo, and the model is told to fix any errors it hits and verify the result at the end.

– **Open Repository**: The harness for the benchmark is available on GitHub, fostering openness and collaboration within the AI community.

This benchmark is significant for AI practitioners and software developers, as it demonstrates how LLMs can facilitate software development processes, particularly in compiling complex codebases. The evaluation also informs decisions on which models to leverage based on their performance in real-world applications, highlighting considerations for efficiency and cost.

– **Practical Implications**:
– Reinforces the value of LLMs in programming tasks, particularly for unfamiliar software environments.
– Insights from benchmarking can guide future developments in AI-assisted programming tools.
– The focus on non-tweaked models promotes transparency in AI performance evaluations, facilitating better understanding and trust in LLM capabilities.

Overall, the CompileBench evaluation serves as a significant contribution to understanding how AI can aid in software engineering, providing a foundation for further research and application in coding assistance.