Source URL: https://hamel.dev/blog/posts/evals-faq/
Source: Hamel’s Blog
Title: LLM Eval FAQ
Our Course On AI Evals
I’m teaching a course on AI Evals with Shreya Shankar. Here are some of the most common questions we’ve been asked. We’ll be updating this list frequently.
Q: Is RAG dead?
Question: Should I avoid using RAG for my AI application after reading that “RAG is dead” for coding agents?
Many developers are confused about when and how to use RAG after reading articles claiming “RAG is dead.” Understanding what RAG actually means versus the narrow marketing definitions will help you make better architectural decisions for your AI applications.
The viral article claiming RAG is dead specifically argues against using naive vector database retrieval for autonomous coding agents, not RAG as a whole. This is a crucial distinction that many developers miss due to misleading marketing.
RAG simply means Retrieval-Augmented Generation – using retrieval to provide relevant context that improves your model’s output. The core principle remains essential: your LLM needs the right context to generate accurate answers. The question isn’t whether to use retrieval, but how to retrieve effectively.
For coding applications, naive vector similarity search often fails because code relationships are complex and contextual. Instead of abandoning retrieval entirely, modern coding assistants like Claude Code still use retrieval — they simply employ agentic search instead of relying solely on vector databases, similar to how human developers explore a codebase.
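To make "agentic search" concrete, here is a minimal sketch of one step such an assistant might take: scanning source files for query terms directly, rather than embedding the query into a vector database. The function name and behavior are illustrative assumptions, not Claude Code's actual implementation.

```python
import re
from pathlib import Path

def keyword_search(repo_root: str, query_terms: list[str], max_hits: int = 5) -> list[str]:
    """Hypothetical sketch of one agentic-retrieval step: grep-style scanning
    of source files for query terms, no vector database involved."""
    hits = []
    pattern = re.compile("|".join(re.escape(t) for t in query_terms), re.IGNORECASE)
    for path in sorted(Path(repo_root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern.search(line):
                hits.append(f"{path}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits
```

An agent would typically iterate: search, read the surrounding code, then search again with refined terms — a multi-hop loop rather than a single similarity lookup.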
You have multiple retrieval strategies available, ranging from simple keyword matching to embedding similarity to LLM-powered relevance filtering. The optimal approach depends on your specific use case, data characteristics, and performance requirements. Many production systems combine multiple strategies or use multi-hop retrieval guided by LLM agents.
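As a toy illustration of combining strategies, the sketch below blends exact keyword overlap with a similarity score. A bag-of-words cosine stands in for real embedding similarity; the function names and the `alpha` blending weight are assumptions for illustration only.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (a stand-in for embedding similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def hybrid_retrieve(query: str, docs: list[str], k: int = 3, alpha: float = 0.5) -> list[str]:
    """Rank documents by a blend of keyword overlap and (stand-in) embedding similarity."""
    q_terms = set(query.lower().split())
    def score(doc: str) -> float:
        keyword = len(q_terms & set(doc.lower().split())) / len(q_terms) if q_terms else 0.0
        return alpha * keyword + (1 - alpha) * bow_cosine(query, doc)
    return sorted(docs, key=score, reverse=True)[:k]
```

In production you would swap the bag-of-words cosine for real embeddings, and possibly add an LLM-powered relevance filter as a final reranking pass — the blending structure stays the same.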
Unfortunately, “RAG” has become a buzzword with no shared definition. Some people use it to mean any retrieval system; others restrict it to vector databases. Focus on the fundamental goal: getting your LLM the context it needs to succeed. Whether that’s through vector search, agentic exploration, or hybrid approaches is a product and engineering decision that requires understanding your users’ failure modes and usage patterns.
Rather than following categorical advice to avoid or embrace RAG, experiment with different retrieval approaches and measure what works best for your application.
Q: Can I use the same model for both the main task and evaluation?
For LLM-as-Judge selection, using the same model is fine because the judge performs a different task than your main LLM pipeline: a narrowly scoped binary classification. You don't need to worry about the judge being unable to do the main task, since it doesn't have to. Focus on achieving a high true positive rate (TPR) and true negative rate (TNR) with your judge rather than on avoiding the same model family.
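Measuring TPR and TNR against human labels is straightforward; a minimal sketch (the function name and label format are assumptions):

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare an LLM judge's pass/fail verdicts against human labels.
    TPR: fraction of human-labeled passes the judge also marks as pass.
    TNR: fraction of human-labeled failures the judge also marks as fail."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return {
        "tpr": tp / pos if pos else float("nan"),
        "tnr": tn / neg if neg else float("nan"),
    }
```

Tracking both rates matters: a judge that passes everything has perfect TPR but useless TNR, so neither number alone tells you the judge is aligned with your annotators.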
When selecting judge models, start with the most capable models available and work on aligning them with human annotators. You can optimize for cost later, once you've established reliable evaluation criteria.
Q: How much time should I spend on model selection?
Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”
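One lightweight way to gather that evidence is to tally failure modes from annotated traces before touching the model choice. The trace format and field names below are illustrative assumptions, not a prescribed schema:

```python
from collections import Counter

def failure_mode_counts(annotated_traces: list[dict]) -> list[tuple[str, int]]:
    """Tally failure modes from error-analysis annotations.
    Each trace dict is assumed to carry a 'failure_modes' list (empty if the
    trace passed). The ranked tally shows whether the model itself, retrieval,
    prompting, or something else dominates your failures."""
    counts = Counter(
        mode for trace in annotated_traces for mode in trace.get("failure_modes", [])
    )
    return counts.most_common()
```

If "retrieval missed the relevant document" tops the tally, swapping models is unlikely to help; if genuine reasoning errors dominate, model selection becomes worth the experiment.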