Tag: evals
-
Hacker News: Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals
Source URL: https://blog.skyvern.com/skyvern-2-0-state-of-the-art-web-navigation-with-85-8-on-webvoyager-eval/
Source: Hacker News
Title: Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: The text discusses the launch of Skyvern 2.0, an advanced autonomous web agent that achieves a benchmark score of 85.85% on the WebVoyager Eval. It details…
-
Simon Willison’s Weblog: Codestral 25.01
Source URL: https://simonwillison.net/2025/Jan/13/codestral-2501/
Source: Simon Willison’s Weblog
Title: Codestral 25.01
Feedly Summary: Codestral 25.01 Brand new code-focused model from Mistral. Unlike the first Codestral this one isn’t (yet) available as open weights. The model has a 256k token context – a new record for Mistral. The new model scored an impressive joint first place with…
-
Simon Willison’s Weblog: Quoting François Chollet
Source URL: https://simonwillison.net/2025/Jan/6/francois-chollet/#atom-everything
Source: Simon Willison’s Weblog
Title: Quoting François Chollet
Feedly Summary: I don’t think people really appreciate how simple ARC-AGI-1 was, and what solving it really means. It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar…
-
Irrational Exuberance: Wardley mapping the LLM ecosystem.
Source URL: https://lethain.com/wardley-llm-ecosystem/
Source: Irrational Exuberance
Title: Wardley mapping the LLM ecosystem.
Feedly Summary: In How should you adopt LLMs?, we explore how a theoretical ride sharing company, Theoretical Ride Sharing, should adopt Large Language Models (LLMs). Part of that strategy’s diagnosis depends on understanding the expected evolution of the LLM ecosystem, which we’ve built…
-
Hacker News: Task-Specific LLM Evals That Do and Don’t Work
Source URL: https://eugeneyan.com/writing/evals/
Source: Hacker News
Title: Task-Specific LLM Evals That Do and Don’t Work
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: The text presents a comprehensive overview of evaluation metrics for machine learning tasks, specifically focusing on classification, summarization, and translation within the context of large language models (LLMs). It highlights the…
-
Simon Willison’s Weblog: Quoting Ethan Mollick
Source URL: https://simonwillison.net/2024/Dec/7/ethan-mollick/#atom-everything
Source: Simon Willison’s Weblog
Title: Quoting Ethan Mollick
Feedly Summary: A test of how seriously your firm is taking AI: when o-1 (& the new Gemini) came out this week, were there assigned folks who immediately ran the model through internal, validated, firm-specific benchmarks to see how useful it as? Did you…
-
Hacker News: A statistical approach to model evaluations
Source URL: https://www.anthropic.com/research/statistical-approach-to-model-evals
Source: Hacker News
Title: A statistical approach to model evaluations
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: The text discusses a new research paper that proposes statistical recommendations for the reporting of AI model evaluation results, focused on improving the rigor and reliability of assessments in AI research. It highlights…