Source URL: https://simonwillison.net/2025/Apr/1/pydantic-evals/#atom-everything
Source: Simon Willison’s Weblog
Title: Pydantic Evals
Feedly Summary: Pydantic Evals
Brand new package from the Pydantic AI team which directly tackles what I consider to be the single hardest problem in AI engineering: building evals to determine if your LLM-based system is working correctly and getting better over time.
The feature is described as “in beta” and comes with this very realistic warning:
Unlike unit tests, evals are an emerging art/science; anyone who claims to know for sure exactly how your evals should be defined can safely be ignored.
This code example from their documentation illustrates the relationship between the two key nouns – Cases and Datasets:
from pydantic_evals import Case, Dataset

case1 = Case(
    name="simple_case",
    inputs="What is the capital of France?",
    expected_output="Paris",
    metadata={"difficulty": "easy"},
)

dataset = Dataset(cases=[case1])
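Running the dataset against a task then produces a report. Here's a minimal sketch assuming the dataset defined above; evaluate_sync() and report.print() follow the pattern shown in the package documentation, and answer_capital() is a hypothetical stand-in for a real LLM call:

# Hypothetical task function standing in for a real LLM call
async def answer_capital(question: str) -> str:
    return "Paris"

# Run every case in the dataset against the task and print a summary report
report = dataset.evaluate_sync(answer_capital)
report.print(include_input=True, include_output=True)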
The library also supports custom evaluators, including LLM-as-a-judge:
Case(
    name="vegetarian_recipe",
    inputs=CustomerOrder(
        dish_name="Spaghetti Bolognese", dietary_restriction="vegetarian"
    ),
    expected_output=None,
    metadata={"focus": "vegetarian"},
    evaluators=(
        LLMJudge(
            rubric="Recipe should not contain meat or animal products",
        ),
    ),
)
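(In that snippet CustomerOrder is a user-defined Pydantic model describing the task input; the judge evaluator comes from the package's evaluators module, something like from pydantic_evals.evaluators import LLMJudge.)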
Cases and datasets can also be serialized to YAML.
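The round-trip presumably looks something like this sketch, assuming the to_file() and from_file() methods described in the package docs:

# Write the cases out to a YAML file...
dataset.to_file("capital_cases.yaml")

# ...and load them back into a Dataset later
loaded_dataset = Dataset.from_file("capital_cases.yaml")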
My first impressions are that this looks like a solid implementation of a sensible design. I’m looking forward to trying it out against a real project.
Tags: evals, python, pydantic, generative-ai, ai, llms
AI Summary and Description: Yes
Summary: The text discusses a new package from the Pydantic AI team designed to create evaluations (evals) for LLM-based systems. The tool addresses the challenge of determining whether an LLM-based system is working correctly and improving over time, while acknowledging that evals remain an emerging practice in AI engineering.
Detailed Description:
The text centers on the introduction of Pydantic Evals, a package intended to streamline the evaluation processes for systems utilizing large language models (LLMs). It highlights the complexities involved in defining effective evals, which are essential for assessing the functionality and advancement of AI systems.
Key Points:
– **Emerging Challenge**: The text identifies the difficulty in developing evals as a significant challenge for AI engineers, describing it as the “single hardest problem in AI engineering.”
– **Beta Feature**: The evals feature is currently in beta, underlining that the package is still under development and open to user feedback.
– **Comparison with Unit Tests**: Evals differ from traditional unit tests in software development because their definitions are not fixed and represent a more subjective and evolving discipline.
– **Code Example**: The document includes a Python code snippet illustrating the use of Cases and Datasets.
– **Case Definition**: A Case consists of inputs, expected outputs, and optional metadata. For instance, a case asking for the capital of France has the expected output “Paris.”
– **Dataset Structure**: Datasets group multiple Cases together; both Cases and Datasets can be serialized to YAML.
– **Custom Evaluators**: The package supports custom evaluators, allowing for unique evaluation metrics such as using an LLM as a judge to assess complex scenarios (e.g., ensuring a recipe meets dietary restrictions).
– **Design Impressions**: The author expresses positive initial impressions and eagerness to test the package with real projects, suggesting confidence in its practical application for AI evaluations.
Overall, Pydantic Evals gives AI practitioners a structured way to evaluate and improve LLM-based systems. Its support for custom evaluators, together with its candid acknowledgement that evals are still an emerging practice, may be particularly valuable for professionals in AI security and compliance who need dependable evaluation processes as part of their governance frameworks.