Tag: evals

  • Hacker News: Task-Specific LLM Evals That Do and Don’t Work

    Source URL: https://eugeneyan.com/writing/evals/
    Source: Hacker News
    Summary: The text presents a comprehensive overview of evaluation metrics for machine learning tasks, specifically focusing on classification, summarization, and translation within the context of large language models (LLMs). It highlights the…
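
    A minimal sketch of the kind of task-specific classification eval the article covers, computing precision and recall over a labeled set. The `classify` function and the examples are hypothetical stand-ins for a real LLM call and real data:

      # Hypothetical sketch: evaluating an LLM used as a binary classifier.
      def classify(text: str) -> int:
          """Stand-in for an LLM call that returns 1 (positive) or 0 (negative)."""
          return int("refund" in text.lower())

      labeled = [  # (input, gold label) pairs; made-up data
          ("I want a refund for this order", 1),
          ("Where is my package?", 0),
          ("Refund me now please", 1),
          ("Great product, thanks!", 0),
      ]

      tp = fp = fn = 0
      for text, gold in labeled:
          pred = classify(text)
          tp += int(pred == 1 and gold == 1)
          fp += int(pred == 1 and gold == 0)
          fn += int(pred == 0 and gold == 1)

      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      print(f"precision={precision:.2f} recall={recall:.2f}")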

  • Simon Willison’s Weblog: Quoting Ethan Mollick

    Source URL: https://simonwillison.net/2024/Dec/7/ethan-mollick/#atom-everything
    Source: Simon Willison’s Weblog
    Summary: A test of how seriously your firm is taking AI: when o1 (& the new Gemini) came out this week, were there assigned folks who immediately ran the model through internal, validated, firm-specific benchmarks to see how useful it was? Did you…

  • Hacker News: A statistical approach to model evaluations

    Source URL: https://www.anthropic.com/research/statistical-approach-to-model-evals
    Source: Hacker News
    Summary: The text discusses a new research paper that proposes statistical recommendations for the reporting of AI model evaluation results, focused on improving the rigor and reliability of assessments in AI research. It highlights…
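
    One of the paper's central recommendations is to report eval scores with standard errors rather than as bare point estimates. A minimal sketch of that idea, treating per-question pass/fail results as samples and applying the Central Limit Theorem; the scores here are made up:

      import math

      # Hypothetical per-question pass/fail results from one eval run.
      scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

      n = len(scores)
      mean = sum(scores) / n
      # Sample variance and standard error of the mean; by the CLT,
      # mean +/- 1.96 * sem is an approximate 95% confidence interval.
      var = sum((s - mean) ** 2 for s in scores) / (n - 1)
      sem = math.sqrt(var / n)
      print(f"score = {mean:.2f} +/- {1.96 * sem:.2f} (95% CI)")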

  • Simon Willison’s Weblog: Leaked system prompts from Vercel v0

    Source URL: https://simonwillison.net/2024/Nov/25/leaked-system-prompts-from-vercel-v0/#atom-everything
    Source: Simon Willison’s Weblog
    Summary: v0 is Vercel’s entry in the increasingly crowded LLM-assisted development market – chat with a bot and have that bot build a full application for you. They’ve been iterating on it since launching…

  • Simon Willison’s Weblog: yet-another-applied-llm-benchmark

    Source URL: https://simonwillison.net/2024/Nov/6/yet-another-applied-llm-benchmark/#atom-everything
    Source: Simon Willison’s Weblog
    Summary: Nicholas Carlini introduced this personal LLM benchmark suite back in February as a collection of over 100 automated tests he runs against new LLM models to evaluate their performance against the kinds of tasks he uses them for. There are two defining features…
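
    One of those defining features is that tests are written as small dataflow pipelines of chained stages. The mini-DSL below is an illustrative sketch of that pattern, not Carlini's actual API; the stage names and their fake implementations are hypothetical:

      # Illustrative pipeline-of-stages sketch (not the benchmark's real code).
      class Stage:
          def __init__(self, fn):
              self.fn = fn
              self.next = None

          def __rshift__(self, other: "Stage") -> "Stage":
              # Append `other` to the end of the chain; return the head
              # so the whole pipeline runs from the first stage.
              tail = self
              while tail.next:
                  tail = tail.next
              tail.next = other
              return self

          def run(self, value):
              out = self.fn(value)
              return self.next.run(out) if self.next else out

      # Hypothetical stages: call a model, run its code, check the output.
      llm_run = Stage(lambda prompt: "print('hello world')")    # fake LLM call
      python_run = Stage(lambda code: "hello world")            # fake code execution
      substring_eval = Stage(lambda out: "hello world" in out)  # pass/fail check

      pipeline = llm_run >> python_run >> substring_eval
      print(pipeline.run("Write a hello world program in Python"))  # True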

  • Simon Willison’s Weblog: Creating a LLM-as-a-Judge that drives business results

    Source URL: https://simonwillison.net/2024/Oct/30/llm-as-a-judge/#atom-everything
    Source: Simon Willison’s Weblog
    Summary: Hamel Husain’s sequel to Your AI product needs evals. This is packed with hard-won actionable advice. Hamel warns against using scores on a 1-5 scale, instead promoting an alternative he calls…
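
    The underlying pattern is an LLM judge that renders a simple pass/fail verdict alongside a written critique, rather than scoring on a 1-5 scale. A minimal sketch assuming an OpenAI-style chat client; the prompt, model choice, and `judge` helper are illustrative, not Hamel's implementation:

      # Illustrative LLM-as-a-judge sketch: binary verdict plus critique.
      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      JUDGE_PROMPT = """You are evaluating an AI assistant's answer.

      Question: {question}
      Answer: {answer}

      Write a short critique of the answer, then on the final line output
      exactly PASS or FAIL."""

      def judge(question: str, answer: str) -> tuple[bool, str]:
          response = client.chat.completions.create(
              model="gpt-4o",  # any capable judge model
              messages=[{
                  "role": "user",
                  "content": JUDGE_PROMPT.format(question=question, answer=answer),
              }],
          )
          text = response.choices[0].message.content.strip()
          critique, _, verdict = text.rpartition("\n")
          return verdict.strip() == "PASS", critique.strip()

      passed, critique = judge("What is 2 + 2?", "4")
      print(passed, critique)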