Simon Willison’s Weblog: Quoting Andrew Ng

Source URL: https://simonwillison.net/2025/Apr/18/andrew-ng/
Source: Simon Willison’s Weblog
Title: Quoting Andrew Ng

Feedly Summary: To me, a successful eval meets the following criteria. Say, we currently have system A, and we might tweak it to get a system B:

If A works significantly better than B according to a skilled human judge, the eval should give A a significantly higher score than B.
If A and B have similar performance, their eval scores should be similar.

Whenever a pair of systems A and B contradicts these criteria, that is a sign the eval is in “error” and we should tweak it to make it rank A and B correctly.
— Andrew Ng
Tags: evals, llms, ai, generative-ai

AI Summary and Description: Yes

Summary: The text lays out criteria for judging whether an evaluation of AI systems is itself reliable: when comparing two systems (A and B), the eval scores should reflect their actual relative performance. This is particularly relevant for professionals who build AI development and evaluation pipelines, where trustworthy metrics are critical to ensuring accuracy and effectiveness.

Detailed Description: The content highlights the importance of reliable evaluation metrics for AI systems, focusing on the comparative analysis between two models, A and B. The criteria laid out serve as benchmarks to determine the success of the evaluation process.

- **Key Points:**
  - The eval must score system A higher than system B if A significantly outperforms B according to a knowledgeable reviewer.
  - If the performance of A and B is comparable, their eval scores should also align closely.
  - A discrepancy between eval scores and actual performance signals an error in the eval itself, prompting an adjustment to the eval.
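The consistency check described above can be sketched as a small function. This is a minimal illustration, not from the original post; the data layout (a list of dicts with `human`, `eval_a`, and `eval_b` keys) and the `score_margin` threshold for "significantly better" are assumptions chosen for the example:

```python
def find_eval_errors(pairs, score_margin=0.0):
    """Flag system pairs where the eval disagrees with a skilled human judge.

    pairs: list of dicts with hypothetical keys:
      'human'  -- 'A', 'B', or 'tie' (the human judge's verdict)
      'eval_a' -- eval score for system A
      'eval_b' -- eval score for system B
    score_margin: how far apart scores must be to count as
      "significantly" different (an assumed knob, not from the post).

    Returns the pairs that violate Ng's criteria, i.e. the cases
    where the eval is in "error" and should be tweaked.
    """
    errors = []
    for p in pairs:
        diff = p['eval_a'] - p['eval_b']
        if p['human'] == 'A' and diff <= score_margin:
            # Human says A is clearly better, but the eval disagrees.
            errors.append(p)
        elif p['human'] == 'B' and -diff <= score_margin:
            # Human says B is clearly better, but the eval disagrees.
            errors.append(p)
        elif p['human'] == 'tie' and abs(diff) > score_margin:
            # Human sees similar performance, but the scores diverge.
            errors.append(p)
    return errors
```

For example, a pair the judge calls a tie but the eval scores 0.5 vs. 0.9 would be flagged, while a pair where the judge prefers A and the eval also scores A higher would pass.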

More broadly, this emphasizes the importance of accuracy in AI evaluations, especially in fields like machine learning and generative AI, where continuous improvement and validation of models are critical for their success. Ensuring robust evaluative criteria is foundational for maintaining integrity and performance standards in AI systems, making this discussion very relevant for AI security professionals focused on model accuracy and reliability.