Simon Willison’s Weblog: OpenAI O3 breakthrough high score on ARC-AGI-PUB

Source URL: https://simonwillison.net/2024/Dec/20/openai-o3-breakthrough/#atom-everything
Source: Simon Willison’s Weblog
Title: OpenAI O3 breakthrough high score on ARC-AGI-PUB

Feedly Summary: OpenAI O3 breakthrough high score on ARC-AGI-PUB
François Chollet is the co-founder of the ARC Prize and had advanced access to today’s o3 results. His article here is the most insightful coverage I’ve seen of o3, going beyond just the benchmark results to talk about what this all means for the field in general.
One fascinating detail: it cost $6,677 to run o3 in “high efficiency” mode against the 400 public ARC-AGI puzzles for a score of 82.8%, and an undisclosed amount of money to run the “low efficiency” mode model to score 91.5%. A note says:

o3 high-compute costs not available as pricing and feature availability is still TBD. The amount of compute was roughly 172x the low-compute configuration.

So we can get a ballpark estimate: 172 × $6,677 = $1,148,444!
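That figure is just the high-efficiency run's cost scaled by the quoted 172× compute multiplier, assuming cost scales linearly with compute. A one-line sanity check:

```python
# Ballpark: scale the "high efficiency" run's cost by the quoted 172x
# compute multiplier to estimate what the "low efficiency" run cost.
# Assumes cost scales linearly with compute, which is only a rough guess.
high_efficiency_cost = 6_677   # USD, for 82.8% on the 400 public tasks
compute_multiplier = 172

estimated_low_efficiency_cost = high_efficiency_cost * compute_multiplier
print(f"${estimated_low_efficiency_cost:,}")  # $1,148,444
```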
Here’s how François explains the likely mechanisms behind o3, which reminds me of how a brute-force chess computer might work.

For now, we can only speculate about the exact specifics of how o3 works. But o3’s core mechanism appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search. In the case of o3, the search is presumably guided by some kind of evaluator model. To note, Demis Hassabis hinted back in a June 2023 interview that DeepMind had been researching this very idea – this line of work has been a long time coming.
So while single-generation LLMs struggle with novelty, o3 overcomes this by generating and executing its own programs, where the program itself (the CoT) becomes the artifact of knowledge recombination. Although this is not the only viable approach to test-time knowledge recombination (you could also do test-time training, or search in latent space), it represents the current state-of-the-art as per these new ARC-AGI numbers.
Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of "programs" (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.
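Chollet's description of "deep learning-guided program search" can be sketched, very loosely, as a best-first search over partial chains of thought scored by an evaluator. Everything below is hypothetical: `expand` stands in for the base LLM proposing candidate next reasoning steps, and `evaluate` for the speculated evaluator model — nothing here reflects how o3 actually works.

```python
import heapq

# Toy sketch of evaluator-guided search over chains of thought (CoTs).
# The generator and evaluator are stand-ins for the base LLM and the
# evaluator model Chollet speculates about; this is illustration only.

def expand(cot):
    """Stand-in for the base LLM: propose candidate next reasoning steps."""
    return [cot + [step] for step in ("step-a", "step-b")]

def evaluate(cot):
    """Stand-in for the evaluator model: score a partial chain of thought."""
    return -sum(1 for step in cot if step == "step-b")  # prefer a-steps

def is_solution(cot):
    """Toy stopping condition standing in for 'the task is solved'."""
    return len(cot) >= 3

def cot_search(max_expansions=100):
    """Best-first search through the space of chains of thought.

    Keeping a frontier of partial CoTs means abandoned branches stay
    around to be revisited -- i.e. backtracking, which is one reason
    such a search can burn through enormous numbers of tokens.
    """
    frontier = [(0, 0, [])]  # (negated score, unique tiebreak, partial CoT)
    tiebreak = 0
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, cot = heapq.heappop(frontier)
        if is_solution(cot):
            return cot
        for child in expand(cot):
            tiebreak += 1
            heapq.heappush(frontier, (-evaluate(child), tiebreak, child))
    return None
```

In this toy version each expansion is two fixed strings; in the speculated real mechanism each expansion would be an LLM generation, which is where the tens of millions of tokens per task would go.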

I’m not sure if o3 (and o1 and similar models) even qualifies as an LLM any more – there’s clearly a whole lot more going on here than just next-token prediction.
On the question of whether o3 should qualify as AGI (whatever that might mean):

Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don’t think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).

Tags: inference-scaling, generative-ai, openai, o3, francois-chollet, ai, llms

AI Summary and Description: Yes

**Summary:** The text discusses OpenAI’s o3 results, highlighting the model’s strong performance on ARC-AGI puzzles through a novel mechanism of natural language program search. François Chollet’s commentary draws out the implications for the broader AI field, particularly for the definition and characteristics of Artificial General Intelligence (AGI) and the substantial compute resources these results required.

**Detailed Description:**
– **Overview of the o3 Breakthrough:**
– OpenAI’s o3 achieved a high score on the ARC-AGI-PUB benchmark, marking a notable advance in AI capabilities.
– François Chollet, co-founder of the ARC Prize, provides an in-depth analysis that goes beyond the benchmark results.

– **Cost and Compute Efficiency:**
– o3 was run in two modes:
– **High Efficiency (low-compute) Mode:** Cost approximately $6,677 to achieve a score of 82.8%.
– **Low Efficiency (high-compute) Mode:** Achieved a score of 91.5% at an undisclosed cost.
– The high-compute configuration reportedly used about 172 times the compute of the low-compute configuration, implying an estimated cost of roughly $1,148,444 (172 × $6,677).

– **Mechanisms Behind o3:**
– o3 appears to perform natural language program search and execution within token space.
– The search methodology is reminiscent of AlphaZero-style Monte-Carlo tree search, presumably guided by some kind of evaluator model.
– The system searches over possible “Chains of Thought” (CoTs) describing the steps to solve a task, rather than relying on a single generation.

– **Comparisons to AGI:**
– Although o3 overcomes single-generation LLMs’ struggles with novelty by generating and executing its own programs (the CoTs), it still does not meet the criteria for AGI.
– Simon Willison questions whether o3 (and o1 and similar models) even qualifies as an LLM, since there is clearly much more going on than next-token prediction.

– **Challenges Ahead:**
– Passing ARC-AGI does not mean o3 has achieved AGI; it still fails on some very easy tasks, indicating fundamental differences from human intelligence.
– The upcoming ARC-AGI-2 benchmark may still pose a significant challenge, potentially reducing o3’s score to under 30% even at high compute, while a smart human could still score over 95% with no training.

This analysis offers useful context for security and compliance professionals tracking the evolving capabilities of AI systems. In particular, the compute resources and operational costs involved underline the need for robust risk management strategies when deploying advanced AI technologies.