Simon Willison’s Weblog: Quoting Andrew Ng

Source URL: https://simonwillison.net/2025/Apr/18/andrew-ng/
Source: Simon Willison’s Weblog
Title: Quoting Andrew Ng

Feedly Summary: To me, a successful eval meets the following criteria. Say, we currently have system A, and we might tweak it to get a system B:

If A works significantly better than B according to a skilled human judge, the eval should give A a significantly higher score than B.
If A and B have similar performance, their eval scores should be similar.

Whenever a pair of systems A and B contradicts these criteria, that is a sign the eval is in “error” and we should tweak it to make it rank A and B correctly.
— Andrew Ng
Tags: evals, llms, ai, generative-ai

AI Summary and Description: Yes

Summary: The text lays out criteria for judging whether an evaluation of AI systems is itself reliable: when comparing two systems (A and B), the eval scores should reflect their actual relative performance. This is particularly relevant for professionals who build AI development and evaluation pipelines, where trustworthy metrics are critical to ensuring accuracy and effectiveness.

Detailed Description: The content highlights the importance of reliable evaluation metrics for AI systems, focusing on the comparative analysis between two models, A and B. The criteria laid out serve as benchmarks to determine the success of the evaluation process.

- **Key Points:**
  - The eval must score system A higher than system B if A significantly outperforms B according to a knowledgeable reviewer.
  - If the performance of A and B is comparable, their eval scores should also align closely.
  - A discrepancy between eval scores and actual performance signals an error in the eval itself, prompting an adjustment to the eval.
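The consistency check described above can be sketched as a small function. This is a minimal illustration, not from the original post; the data layout (a list of dicts with `human`, `eval_a`, and `eval_b` keys) and the `score_margin` threshold for "significantly better" are assumptions chosen for the example:

```python
def find_eval_errors(pairs, score_margin=0.0):
    """Flag system pairs where the eval disagrees with a skilled human judge.

    pairs: list of dicts with hypothetical keys:
      'human'  -- 'A', 'B', or 'tie' (the human judge's verdict)
      'eval_a' -- eval score for system A
      'eval_b' -- eval score for system B
    score_margin: how far apart scores must be to count as
      "significantly" different (an assumed knob, not from the post).

    Returns the pairs that violate Ng's criteria, i.e. the cases
    where the eval is in "error" and should be tweaked.
    """
    errors = []
    for p in pairs:
        diff = p['eval_a'] - p['eval_b']
        if p['human'] == 'A' and diff <= score_margin:
            # Human says A is clearly better, but the eval disagrees.
            errors.append(p)
        elif p['human'] == 'B' and -diff <= score_margin:
            # Human says B is clearly better, but the eval disagrees.
            errors.append(p)
        elif p['human'] == 'tie' and abs(diff) > score_margin:
            # Human sees similar performance, but the scores diverge.
            errors.append(p)
    return errors
```

For example, a pair the judge calls a tie but the eval scores 0.5 vs. 0.9 would be flagged, while a pair where the judge prefers A and the eval also scores A higher would pass.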

More broadly, this emphasizes the importance of accuracy in AI evaluations, especially in fields like machine learning and generative AI, where continuous improvement and validation of models are critical for their success. Ensuring robust evaluative criteria is foundational for maintaining integrity and performance standards in AI systems, making this discussion very relevant for AI security professionals focused on model accuracy and reliability.