Simon Willison’s Weblog: Exploring Promptfoo via Dave Guarino’s SNAP evals

Source URL: https://simonwillison.net/2025/Apr/24/exploring-promptfoo/#atom-everything
Source: Simon Willison’s Weblog
Title: Exploring Promptfoo via Dave Guarino’s SNAP evals

Feedly Summary: I used part three (here’s parts one and two) of Dave Guarino’s series on evaluating how well LLMs can answer questions about SNAP (aka food stamps) as an excuse to explore Promptfoo, an LLM eval tool.
SNAP (Supplemental Nutrition Assistance Program) is a very high stakes domain to be operating in, but also one where LLM-powered question answering can provide very material value to people who need help navigating the notoriously opaque system.
Dave’s evals focus on eligibility questions, which can get very complicated. One example: “In Texas, all SNAP applicants face asset limits. In California, the vast majority of people do not.”
Dave uses Promptfoo as his eval tool – a commercial open source tool (MIT licensed) which works from YAML files that can reference further data in external sources, including Google Sheets.
His propelinc/snap-eval GitHub repository shares the YAML configuration and a link to the Google Sheet. Some of the assertions are straightforward text comparisons:

question: Can someone who has a drug felony conviction be eligible for food stamps if they are in Kentucky? Answer with only one of: YES, NO, REFUSE.
expected: contains:YES

Others use the LLM as a judge pattern to evaluate the answer:

question: I just got a SNAP denial letter saying my income is too high. What are ALL my options?
expected: llm-rubric: should mention the option to appeal or do a fair hearing
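
For context, here’s a minimal sketch of how those two cases could be expressed as a Promptfoo YAML config. The prompt template and provider are my own illustrative choices, not taken from Dave’s repo – and since Promptfoo can reference external data sources, the tests could point at a Google Sheet URL instead of being listed inline:

# promptfooconfig.yaml – illustrative sketch, not Dave's actual config
prompts:
  - "{{question}}"   # each test case's question is substituted here
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "Can someone who has a drug felony conviction be eligible for food stamps if they are in Kentucky? Answer with only one of: YES, NO, REFUSE."
    assert:
      - type: contains        # simple text comparison
        value: "YES"
  - vars:
      question: "I just got a SNAP denial letter saying my income is too high. What are ALL my options?"
    assert:
      - type: llm-rubric      # LLM-as-judge grading
        value: should mention the option to appeal or do a fair hearing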

I tried running Dave’s eval suite on my own machine:
cd /tmp
git clone https://github.com/propelinc/snap-eval
cd snap-eval/illustrative-25-cases-04-23-25
export OPENAI_API_KEY="$(llm keys get openai)"
export ANTHROPIC_API_KEY="$(llm keys get anthropic)"
export GEMINI_API_KEY="$(llm keys get gemini)"
npx promptfoo@latest eval
I frequently use the llm keys get command to populate environment variables like this.
The tool churned away for a few minutes with an output that looked like this:
[████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 29% | ETA: 169s | 13/44 | anthropic:claude-
[████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 29% | ETA: 137s | 13/44 | google:gemini-2.0
[██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░] 34% | ETA: 128s | 15/44 | openai:gpt-4o-min
[██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░] 34% | ETA: 170s | 15/44 | google:gemini-2.5
[███████████████░░░░░░░░░░░░░░░░░░░░░░░░░] 37% | ETA: 149s | 16/43 | openai:gpt-4o-min

On completion it displayed the results in an ASCII-art table, followed by this summary of the results:
Successes: 78
Failures: 47
Errors: 50
Pass Rate: 44.57%
Eval tokens: 59,080 / Prompt tokens: 5,897 / Completion tokens: 53,183 / Cached tokens: 0 / Reasoning tokens: 38,272
Grading tokens: 8,981 / Prompt tokens: 8,188 / Completion tokens: 793 / Cached tokens: 0 / Reasoning tokens: 0
Total tokens: 68,061 (eval: 59,080 + Grading: 8,981)

Those 50 errors are because I set GEMINI_API_KEY when I should have set GOOGLE_API_KEY.
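Re-running after populating that variable instead should clear those errors:

export GOOGLE_API_KEY="$(llm keys get gemini)"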
I don’t know the exact cost, but for 5,897 input tokens and 53,183 output tokens even the most expensive model here (OpenAI o1) would cost $3.28 – and the real figure should be a lot lower than that, since most of the tokens went through much less expensive models.
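For reference, that $3.28 figure is consistent with o1’s list pricing at the time (an assumption worth double-checking: $15 per million input tokens and $60 per million output tokens):

 5,897 × $15 / 1,000,000 ≈ $0.09   (input)
53,183 × $60 / 1,000,000 ≈ $3.19   (output)
                   total ≈ $3.28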
Running npx promptfoo@latest view provides a much nicer way to explore the results – it starts a web server on port 15500 where you can browse the most recent eval along with any previous evals you have run.

It turns out those eval results are stored in a SQLite database in ~/.promptfoo/promptfoo.db, which means you can explore them with Datasette too.
I used sqlite-utils like this to inspect the schema:
sqlite-utils schema ~/.promptfoo/promptfoo.db
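
Since it’s a regular SQLite file, you can also point Datasette straight at it to browse the tables interactively:

datasette ~/.promptfoo/promptfoo.db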

I’ve been looking for a good eval tool for a while now. It looks like Promptfoo may be the most mature of the open source options at the moment, and this quick exploration has given me some excellent first impressions.
Tags: prompt-engineering, evals, generative-ai, ai, llms

AI Summary and Description: Yes

**Summary:** The text explores Promptfoo, a tool for evaluating Large Language Models (LLMs), in the context of assessing how well they answer questions about SNAP (Supplemental Nutrition Assistance Program). It illustrates an application of LLMs in a high-stakes domain, showcasing both the evaluation techniques and the operational challenges involved, which makes it relevant to professionals in AI, security, and compliance.

**Detailed Description:** The article discusses the following major points related to the evaluation of LLM performance using Promptfoo in the context of SNAP eligibility questions:

– **Context of Evaluation:**
– The evaluation is set in a high-stakes domain (SNAP) where accurate information can significantly impact individuals’ access to assistance.
– Eligibility questions can be intricate, which is exemplified by different asset limits across states like Texas and California.

– **Use of Promptfoo:**
– Promptfoo is a commercial open-source (MIT-licensed) tool for evaluating LLMs. It uses YAML files to organize evaluation criteria and can reference external data sources such as Google Sheets.
– The author points to the propelinc/snap-eval GitHub repository, which contains the YAML configuration and a link to an external Google Sheet holding the test cases.

– **Evaluation Methodology:**
– The evaluation process mixes straightforward text comparisons with LLM-as-judge rubric grading to assess the quality of responses.
– Examples include checking whether a model correctly answers an eligibility question about a drug felony conviction in Kentucky, and whether it surfaces the option to appeal after a denial letter.

– **Technical Execution:**
– The author describes the process of setting up and executing the evaluation on their own machine, including cloning the repository, setting environment variables for API keys, and running the evaluation command.
– The results showcase performance metrics such as the number of successes, failures, and errors, the overall pass rate across models, and token usage during the evaluation.

– **Insights on Tool Usage:**
– After evaluating the models, the author reviews the output in an easily navigable format via a web server, helping to visualize and explore results more comprehensively.
– Additionally, the results can be examined through SQLite, indicating the tool’s robustness and the potential for further analysis.

– **Implications for Professionals:**
– This evaluation highlights the importance of proper configuration (like setting the correct API key) in obtaining valid results from LLMs, emphasizing how minor oversights can lead to errors.
– Promptfoo’s functionality and ease of use could be of interest to security and compliance professionals looking for tools that assist in auditing and verification processes involving LLMs in sensitive domains.

In conclusion, the exploration of Promptfoo as a viable LLM evaluation tool underscores both the challenges and opportunities in applying AI to complex, real-world scenarios, emphasizing the need for thorough testing and validation in ensuring the reliability of AI systems.