Source URL: https://simonwillison.net/2025/Jul/19/openai-gold-medal-math-olympiad/#atom-everything
Source: Simon Willison’s Weblog
Title: OpenAI’s gold medal performance on the International Math Olympiad
OpenAI research scientist Alexander Wei:
I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs. […]
Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold!
HUGE congratulations to the team—Sheryl Hsu, Noam Brown, and the many giants whose shoulders we stood on—for turning this crazy dream into reality! I am lucky I get to spend late nights and early mornings working alongside the very best.
Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.
(Normally I would just link to the tweet, but in this case Alexander built a thread… and Twitter threads no longer work for linking as they’re only visible to users with an active Twitter account.)
Here’s Wikipedia on the International Mathematical Olympiad:
It is widely regarded as the most prestigious mathematical competition in the world. The first IMO was held in Romania in 1959. It has since been held annually, except in 1980. More than 100 countries participate. Each country sends a team of up to six students, plus one team leader, one deputy leader, and observers.
This year’s event is in Sunshine Coast, Australia. Here’s the web page for the event, which includes a button you can click to access a PDF of the six questions – maybe they don’t link to that document directly to discourage it from being indexed.
The first of the six questions is shown as an image in the original post.
Alexander shared the proofs produced by the model on GitHub. They’re in a slightly strange format – not quite MathML embedded in Markdown – which Alexander excuses since “it is very much an experimental model”.
The most notable thing about this is that the unnamed model achieved this score without using any tools. OpenAI’s Sebastien Bubeck emphasizes that here:
Just to spell it out as clearly as possible: a next-word prediction machine (because that’s really what it is here, no tools no nothing) just produced genuinely creative proofs for hard, novel math problems at a level reached only by an elite handful of pre‑college prodigies.
Tags: mathematics, ai, openai, generative-ai, llms, llm-reasoning
AI Summary and Description: Yes
Summary: OpenAI’s latest experimental LLM has achieved gold medal-level performance at the International Math Olympiad (IMO), a significant milestone for AI reasoning and problem-solving. The result highlights advances in general-purpose reinforcement learning and test-time compute scaling.
Detailed Description: The text discusses a remarkable accomplishment by an experimental reasoning large language model (LLM) developed by OpenAI, which solved the actual 2025 International Math Olympiad (IMO) problems under contest conditions. This success has broader implications for AI and machine learning, particularly for reasoning and creativity.
– Key Highlights:
– **Gold Medal Achievement**: The model achieved a score of 35 out of 42 points in the 2025 IMO, solving 5 out of 6 problems, thus earning gold medal recognition.
– **Grading Process**: The solutions were independently assessed by three former IMO medalists, showcasing the quality of the model’s proofs.
– **Innovative Approach**: The achievement came not from narrow, task-specific methods but from advances in general-purpose reinforcement learning and test-time compute scaling.
– **Future Developments**: OpenAI is expected to release GPT-5 soon, although the advanced mathematical capabilities of the IMO-competent model are not intended for immediate public release.
– **Nature of Recognition**: Notably, the model produced creative, original mathematical proofs without any external tools, challenging assumptions about the limits of machine creativity in high-level academic work.
This achievement signals a pivotal moment in the development of AI, demonstrating its potential not just in practical applications but also in academic and intellectual domains. For security and compliance professionals, such advanced AI capabilities warrant a reevaluation of security posture and risk management strategies before deploying these models in sensitive or critical areas.