Simon Willison’s Weblog: Quoting Mark Zuckerberg

Source URL: https://simonwillison.net/2025/May/1/mark-zuckerberg/#atom-everything
Source: Simon Willison’s Weblog
Title: Quoting Mark Zuckerberg

Feedly Summary: You also mentioned the whole Chatbot Arena thing, which I think is interesting and points to the challenge around how you do benchmarking. How do you know what models are good for which things?
One of the things we’ve generally tried to do over the last year is anchor more of our models in our Meta AI product north star use cases. The issue with open source benchmarks, and any given thing like the LM Arena stuff, is that they’re often skewed toward a very specific set of use cases, which are often not actually what any normal person does in your product. […]
So we’re trying to anchor our north star on the product value that people report to us, what they say that they want, and what their revealed preferences are, and using the experiences that we have. Sometimes these benchmarks just don’t quite line up. I think a lot of them are quite easily gameable.
On the Arena you’ll see stuff like Sonnet 3.7, which is a great model, and it’s not near the top. It was relatively easy for our team to tune a version of Llama 4 Maverick that could be way at the top. But the version we released, the pure model, actually has no tuning for that at all, so it’s further down. So you just need to be careful with some of these benchmarks. We’re going to index primarily on the products.
— Mark Zuckerberg, on Dwarkesh Patel’s podcast
Tags: meta, generative-ai, llama, mark-zuckerberg, ai, chatbot-arena, llms

AI Summary and Description: Yes

Summary: The text discusses the challenges of benchmarking generative AI models, particularly in the context of how Meta evaluates models against its product goals. The focus is on aligning model outcomes with user needs and preferences, and on concerns about the reliability and gameability of existing benchmarks.

Detailed Description:
The excerpt speaks to significant considerations in generative AI and model evaluation, pertinent to professionals working in AI, cloud computing security, and software security.

– **Benchmarking Challenges**:
  – There is a growing concern about how models are benchmarked and the relevance of these benchmarks to actual user needs.
  – Existing frameworks, such as LM Arena, often cater to narrow use cases that may not reflect general user interactions or product applications.

– **Anchoring to User Needs**:
  – Meta’s strategy focuses on the “north star” of product use cases, which involves assessing what users truly value and prefer, rather than solely relying on predefined metrics.
  – This approach aims to fine-tune AI models based on genuine user feedback and revealed preferences, which could enhance the relevance and effectiveness of the models.

– **Gameability of Benchmarks**:
  – The discussion indicates a potential risk of benchmarks being manipulated or “gamed” to produce favorable results, which calls for caution among developers and evaluators (see the ranking sketch after this list).
  – The Sonnet 3.7 example highlights that even a well-regarded model may sit far from the top of a leaderboard, while a version of Llama 4 Maverick tuned specifically for the Arena could rank much higher, when the benchmark does not align with practical applications.
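
Chatbot Arena-style leaderboards rank models from pairwise human votes using Elo-style (Bradley-Terry) ratings, which is part of why tuning a model toward rater-pleasing outputs can move it up the board without any underlying capability gain. The sketch below is illustrative only: the model names and vote counts are hypothetical, and the real Arena fits ratings over its full vote history rather than applying a simple online Elo update.

```python
# Minimal sketch of how an Arena-style leaderboard turns pairwise human votes
# into a ranking, using an online Elo update. Model names and vote counts are
# hypothetical; the intuition is that whichever model raters *prefer*
# head-to-head rises, regardless of why they prefer it.

from collections import defaultdict

K = 32          # Elo update step size
BASE = 1000.0   # starting rating for every model

ratings = defaultdict(lambda: BASE)

def expected(r_a: float, r_b: float) -> float:
    """Probability that the first model wins, under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(winner: str, loser: str) -> None:
    """Apply a single pairwise human vote to the ratings."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_w)
    ratings[loser] -= K * (1.0 - e_w)

# Hypothetical votes: "tuned-model" wins 70 of 100 head-to-head matchups
# against "strong-model", purely on rater preference.
votes = [("tuned-model", "strong-model")] * 70 + [("strong-model", "tuned-model")] * 30
for w, l in votes:
    record_vote(w, l)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {rating:7.1f}")
```

Because every vote is just a head-to-head preference, anything that shifts rater preference (formatting, verbosity, tone) shifts the rating, which is the gameability concern raised above.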

– **Indexing on Product Outcomes**:
  – A shift towards indexing performance on actual product outcomes rather than raw benchmark scores may lead to more reliable evaluations of AI models.

– **Implications for AI Security**:
  – Understanding how model performance is assessed is crucial for transparency and security in AI applications, and it influences how companies strategize their AI development and deployment.

More broadly, this insight points to a trend in AI development where aligning models closely with user needs could drive innovation while also addressing security and operational efficacy.