Tag: evals

  • Hacker News: Task-Specific LLM Evals That Do and Don’t Work

    Source URL: https://eugeneyan.com/writing/evals/
    Source: Hacker News
    Summary: The text presents a comprehensive overview of evaluation metrics for machine learning tasks, specifically focusing on classification, summarization, and translation within the context of large language models (LLMs). It highlights the…
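
    A minimal sketch of the kind of task-specific classification eval the article covers, computing precision and recall over a labeled set. The `classify` function and the examples are hypothetical stand-ins for a real LLM call and real data:

      # Hypothetical sketch: evaluating an LLM used as a binary classifier.
      def classify(text: str) -> int:
          """Stand-in for an LLM call that returns 1 (positive) or 0 (negative)."""
          return int("refund" in text.lower())

      labeled = [  # (input, gold label) pairs; made-up data
          ("I want a refund for this order", 1),
          ("Where is my package?", 0),
          ("Refund me now please", 1),
          ("Great product, thanks!", 0),
      ]

      tp = fp = fn = 0
      for text, gold in labeled:
          pred = classify(text)
          tp += int(pred == 1 and gold == 1)
          fp += int(pred == 1 and gold == 0)
          fn += int(pred == 0 and gold == 1)

      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      print(f"precision={precision:.2f} recall={recall:.2f}")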

  • Simon Willison’s Weblog: Quoting Ethan Mollick

    Source URL: https://simonwillison.net/2024/Dec/7/ethan-mollick/#atom-everything
    Source: Simon Willison’s Weblog
    Summary: A test of how seriously your firm is taking AI: when o1 (& the new Gemini) came out this week, were there assigned folks who immediately ran the model through internal, validated, firm-specific benchmarks to see how useful it was? Did you…

  • Hacker News: A statistical approach to model evaluations

    Source URL: https://www.anthropic.com/research/statistical-approach-to-model-evals
    Source: Hacker News
    Summary: The text discusses a new research paper that proposes statistical recommendations for the reporting of AI model evaluation results, focused on improving the rigor and reliability of assessments in AI research. It highlights…
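
    One of the paper's central recommendations is to report eval scores with standard errors rather than as bare point estimates. A minimal sketch of that idea, treating per-question pass/fail results as samples and applying the Central Limit Theorem; the scores here are made up:

      import math

      # Hypothetical per-question pass/fail results from one eval run.
      scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

      n = len(scores)
      mean = sum(scores) / n
      # Sample variance and standard error of the mean; by the CLT,
      # mean +/- 1.96 * sem is an approximate 95% confidence interval.
      var = sum((s - mean) ** 2 for s in scores) / (n - 1)
      sem = math.sqrt(var / n)
      print(f"score = {mean:.2f} +/- {1.96 * sem:.2f} (95% CI)")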

  • Simon Willison’s Weblog: Leaked system prompts from Vercel v0

    Source URL: https://simonwillison.net/2024/Nov/25/leaked-system-prompts-from-vercel-v0/#atom-everything
    Source: Simon Willison’s Weblog
    Summary: v0 is Vercel’s entry in the increasingly crowded LLM-assisted development market – chat with a bot and have that bot build a full application for you. They’ve been iterating on it since launching…

  • Simon Willison’s Weblog: yet-another-applied-llm-benchmark

    Source URL: https://simonwillison.net/2024/Nov/6/yet-another-applied-llm-benchmark/#atom-everything
    Source: Simon Willison’s Weblog
    Summary: Nicholas Carlini introduced this personal LLM benchmark suite back in February as a collection of over 100 automated tests he runs against new LLM models to evaluate their performance against the kinds of tasks he uses them for. There are two defining features…
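
    One of those defining features is that tests are written as small dataflow pipelines of chained stages. The mini-DSL below is an illustrative sketch of that pattern, not Carlini's actual API; the stage names and their fake implementations are hypothetical:

      # Illustrative pipeline-of-stages sketch (not the benchmark's real code).
      class Stage:
          def __init__(self, fn):
              self.fn = fn
              self.next = None

          def __rshift__(self, other: "Stage") -> "Stage":
              # Append `other` to the end of the chain; return the head
              # so the whole pipeline runs from the first stage.
              tail = self
              while tail.next:
                  tail = tail.next
              tail.next = other
              return self

          def run(self, value):
              out = self.fn(value)
              return self.next.run(out) if self.next else out

      # Hypothetical stages: call a model, run its code, check the output.
      llm_run = Stage(lambda prompt: "print('hello world')")    # fake LLM call
      python_run = Stage(lambda code: "hello world")            # fake code execution
      substring_eval = Stage(lambda out: "hello world" in out)  # pass/fail check

      pipeline = llm_run >> python_run >> substring_eval
      print(pipeline.run("Write a hello world program in Python"))  # True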

  • Simon Willison’s Weblog: Creating a LLM-as-a-Judge that drives business results

    Source URL: https://simonwillison.net/2024/Oct/30/llm-as-a-judge/#atom-everything
    Source: Simon Willison’s Weblog
    Summary: Hamel Husain’s sequel to Your AI product needs evals. This is packed with hard-won actionable advice. Hamel warns against using scores on a 1-5 scale, instead promoting an alternative he calls…
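
    The underlying pattern is an LLM judge that renders a simple pass/fail verdict alongside a written critique, rather than scoring on a 1-5 scale. A minimal sketch assuming an OpenAI-style chat client; the prompt, model choice, and `judge` helper are illustrative, not Hamel's implementation:

      # Illustrative LLM-as-a-judge sketch: binary verdict plus critique.
      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      JUDGE_PROMPT = """You are evaluating an AI assistant's answer.

      Question: {question}
      Answer: {answer}

      Write a short critique of the answer, then on the final line output
      exactly PASS or FAIL."""

      def judge(question: str, answer: str) -> tuple[bool, str]:
          response = client.chat.completions.create(
              model="gpt-4o",  # any capable judge model
              messages=[{
                  "role": "user",
                  "content": JUDGE_PROMPT.format(question=question, answer=answer),
              }],
          )
          text = response.choices[0].message.content.strip()
          critique, _, verdict = text.rpartition("\n")
          return verdict.strip() == "PASS", critique.strip()

      passed, critique = judge("What is 2 + 2?", "4")
      print(passed, critique)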