Source URL: https://simonwillison.net/2025/Apr/1/pydantic-evals/#atom-everything
Source: Simon Willison’s Weblog
Title: Pydantic Evals
Feedly Summary: Pydantic Evals
Brand new package from the Pydantic AI team which directly tackles what I consider to be the single hardest problem in AI engineering: building evals to determine if your LLM-based system is working correctly and getting better over time.
The feature is described as “in beta” and comes with this very realistic warning:
Unlike unit tests, evals are an emerging art/science; anyone who claims to know for sure exactly how your evals should be defined can safely be ignored.
This code example from their documentation illustrates the relationship between the two key nouns – Cases and Datasets:
from pydantic_evals import Case, Dataset

case1 = Case(
    name="simple_case",
    inputs="What is the capital of France?",
    expected_output="Paris",
    metadata={"difficulty": "easy"},
)

dataset = Dataset(cases=[case1])
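Running the dataset against a task then produces a report. Here's a minimal sketch assuming the dataset defined above; evaluate_sync() and report.print() follow the pattern shown in the package documentation, and answer_capital() is a hypothetical stand-in for a real LLM call:

# Hypothetical task function standing in for a real LLM call
async def answer_capital(question: str) -> str:
    return "Paris"

# Run every case in the dataset against the task and print a summary report
report = dataset.evaluate_sync(answer_capital)
report.print(include_input=True, include_output=True)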
The library also supports custom evaluators, including LLM-as-a-judge:
Case(
    name="vegetarian_recipe",
    inputs=CustomerOrder(
        dish_name="Spaghetti Bolognese", dietary_restriction="vegetarian"
    ),
    expected_output=None,
    metadata={"focus": "vegetarian"},
    evaluators=(
        LLMJudge(
            rubric="Recipe should not contain meat or animal products",
        ),
    ),
)
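(In that snippet CustomerOrder is a user-defined Pydantic model describing the task input; the judge evaluator comes from the package's evaluators module, something like from pydantic_evals.evaluators import LLMJudge.)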
Cases and datasets can also be serialized to YAML.
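The round-trip presumably looks something like this sketch, assuming the to_file() and from_file() methods described in the package docs:

# Write the cases out to a YAML file...
dataset.to_file("capital_cases.yaml")

# ...and load them back into a Dataset later
loaded_dataset = Dataset.from_file("capital_cases.yaml")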
My first impressions are that this looks like a solid implementation of a sensible design. I’m looking forward to trying it out against a real project.
Tags: evals, python, pydantic, generative-ai, ai, llms
AI Summary and Description: Yes
Summary: The text discusses a new package from the Pydantic AI team designed to create evaluations (evals) for LLM-based systems. The tool addresses the challenge of determining whether an LLM-based system is working correctly and improving over time, while acknowledging that evals remain an emerging practice in AI engineering.
Detailed Description:
The text centers on the introduction of Pydantic Evals, a package intended to streamline the evaluation processes for systems utilizing large language models (LLMs). It highlights the complexities involved in defining effective evals, which are essential for assessing the functionality and advancement of AI systems.
Key Points:
– **Emerging Challenge**: The text identifies the difficulty in developing evals as a significant challenge for AI engineers, describing it as the “single hardest problem in AI engineering.”
– **Beta Feature**: The evals feature is currently in beta, underlining that the package is still under development and open to user feedback.
– **Comparison with Unit Tests**: Evals differ from traditional unit tests in software development because their definitions are not fixed and represent a more subjective and evolving discipline.
– **Code Example**: The document includes a Python code snippet illustrating the use of Cases and Datasets.
– **Case Definition**: A Case consists of inputs, expected outputs, and optional metadata. For instance, a case asking for the capital of France has the expected output “Paris.”
– **Dataset Structure**: Datasets group multiple Cases together; both Cases and Datasets can be serialized to YAML.
– **Custom Evaluators**: The package supports custom evaluators, allowing for unique evaluation metrics such as using an LLM as a judge to assess complex scenarios (e.g., ensuring a recipe meets dietary restrictions).
– **Design Impressions**: The author expresses positive initial impressions and eagerness to test the package with real projects, suggesting confidence in its practical application for AI evaluations.
Overall, Pydantic Evals gives AI practitioners a structured way to evaluate and improve LLM-based systems. Its support for custom evaluators, together with its candid acknowledgement that evals are still an emerging practice, may be particularly valuable for professionals in AI security and compliance who need dependable evaluation processes as part of their governance frameworks.