Simon Willison’s Weblog: Building a SNAP LLM eval: part 1

Source URL: https://simonwillison.net/2025/Feb/12/building-a-snap-llm/#atom-everything
Source: Simon Willison’s Weblog
Title: Building a SNAP LLM eval: part 1

Feedly Summary: Building a SNAP LLM eval: part 1
Dave Guarino (previously) has been exploring using LLM-driven systems to help people apply for SNAP, the US Supplemental Nutrition Assistance Program (aka food stamps).
This is a domain which existing models know some things about, but which is full of critical details around things like eligibility criteria where accuracy really matters.
Domain-specific evals like this are still pretty rare. As Dave puts it:

There is also not a lot of public, easily digestible writing out there on building evals in specific domains. So one of our hopes in sharing this is that it helps others build evals for domains they know deeply.

Having robust evals addresses multiple challenges. The first is establishing how good the raw models are for a particular domain. A more important one is to help in developing additional systems on top of these models, where an eval is crucial for understanding if RAG or prompt engineering tricks are paying off.
Step 1 doesn’t involve writing any code at all:

Meaningful, real problem spaces inevitably have a lot of nuance. So in working on our SNAP eval, the first step has just been using lots of models — a lot. […]
Just using the models and taking notes on the nuanced “good”, “meh”, “bad!” is a much faster way to get to a useful starting eval set than writing or automating evals in code.
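
The post doesn't prescribe any tooling for this note-taking step. As a minimal sketch (the helper names and the JSONL format are assumptions, not anything from the post), those graded observations could be captured in Python like this:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class GradedObservation:
    """One hand-graded model response, captured while exploring."""
    model: str     # which model was used, e.g. "gpt-4o"
    prompt: str    # the SNAP question put to the model
    response: str  # what the model said
    grade: str     # "good", "meh", or "bad!"
    note: str      # why it earned that grade

def record(obs: GradedObservation, path: str = "snap_eval_notes.jsonl") -> None:
    """Append one observation to a JSONL file for later eval-building."""
    entry = asdict(obs)
    entry["timestamp"] = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record(GradedObservation(
    model="gpt-4o",
    prompt="Can a full-time college student qualify for SNAP?",
    response="...model output...",
    grade="meh",
    note="States the general student rule but omits the common exemptions.",
))
```

Each JSONL line is a candidate eval case: the “bad!” and “meh” entries are exactly the inputs worth turning into automated checks later.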

I’ve been complaining for a while that there isn’t nearly enough guidance about evals out there. This piece is an excellent step towards filling that gap.
Tags: evals, llms, ai, generative-ai

AI Summary and Description: Yes

Summary: The text discusses using LLM-driven systems to help people apply for SNAP (food stamps) and highlights the importance of domain-specific evaluations in AI. It describes a hands-on approach to building evals: using many models and taking detailed notes on their performance before writing any code, an approach useful to professionals applying AI to social welfare programs.

Detailed Description:

This text provides insight into the development of domain-specific evaluations for Large Language Models (LLMs), particularly in the context of the SNAP program. Below are the key takeaways:

– **Domain-Specific Evaluations**: The article notes how rare publicly available guidance on building evals for specific domains is, even though such evals are crucial for ensuring the accuracy and reliability of AI systems applied to nuanced areas such as social programs.

– **Importance of Accuracy**: In domains like SNAP, where eligibility criteria are critical, the accuracy of AI models is paramount. The text points out that existing models know something about the domain, but rigorous evaluation is needed to establish exactly where they fall short.

– **Iterative Approach to Evaluation**: Rather than jumping straight to coded evaluations, the author suggests starting hands-on: using a range of models and documenting their responses. Grading each response “good”, “meh”, or “bad!” builds a nuanced picture of model performance that can inform further development and tuning (a sketch of this appears after this list).

– **RAG and Prompt Engineering**: Evaluations serve not just to assess raw models but also to guide the systems built on top of them. Knowing whether retrieval-augmented generation (RAG) or prompt engineering techniques are actually paying off is vital for optimizing those systems (see the sketch after this list).

– **Guidance for Evaluation Creation**: The author expresses a need for more resources on domain-specific evaluations, indicating that sharing this information could help others build evaluations in their specific fields of interest.
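
As a concrete illustration of the iterative-grading and RAG points above, here is a hedged sketch (not the author’s actual harness; the case format, the substring grading rule, and the `call_model` / `retrieve_snap_docs` helpers are all assumptions) of how graded notes could seed an automated eval that compares a bare model against a RAG-augmented prompt:

```python
import json

# Hypothetical stand-ins; the post names no specific model client or retriever.
def call_model(prompt: str) -> str:
    raise NotImplementedError("wire up an LLM client here")

def retrieve_snap_docs(question: str) -> str:
    raise NotImplementedError("wire up a SNAP policy retriever here")

def grade(response: str, must_mention: list[str]) -> str:
    """Crude automated grade: 'good' if every required fact appears,
    'meh' if only some do, 'bad!' if none do."""
    hits = sum(1 for fact in must_mention if fact.lower() in response.lower())
    if hits == len(must_mention):
        return "good"
    return "meh" if hits else "bad!"

def run_eval(cases_path: str = "snap_eval_cases.jsonl") -> None:
    """Run each case twice, with and without retrieved context, and compare."""
    with open(cases_path, encoding="utf-8") as f:
        for line in f:
            case = json.loads(line)  # {"prompt": ..., "must_mention": [...]}
            baseline = call_model(case["prompt"])
            context = retrieve_snap_docs(case["prompt"])
            rag = call_model(f"Context:\n{context}\n\nQuestion: {case['prompt']}")
            print(case["prompt"][:60],
                  "| baseline:", grade(baseline, case["must_mention"]),
                  "| rag:", grade(rag, case["must_mention"]))
```

Even a substring check this crude shows whether retrieval moves answers from “meh” to “good”; the hand-graded cases, not the harness, are the hard-won asset.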

Overall, this text is significant for security and compliance professionals in AI, as it underscores the importance of rigorous evaluation processes in deploying AI systems responsibly, particularly in sensitive domains affecting public welfare. The insights provided could lead to more reliable AI applications in similar high-stakes contexts.