Tag: evals

  • Hacker News: Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals

    Source URL: https://blog.skyvern.com/skyvern-2-0-state-of-the-art-web-navigation-with-85-8-on-webvoyager-eval/ Source: Hacker News Title: Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals Feedly Summary: Comments AI Summary and Description: Yes Summary: The text discusses the launch of Skyvern 2.0, an advanced autonomous web agent that achieves a benchmark score of 85.85% on the WebVoyager Eval. It details…

  • Simon Willison’s Weblog: Codestral 25.01

    Source URL: https://simonwillison.net/2025/Jan/13/codestral-2501/ Source: Simon Willison’s Weblog Title: Codestral 25.01 Feedly Summary: Codestral 25.01 Brand new code-focused model from Mistral. Unlike the first Codestral this one isn’t (yet) available as open weights. The model has a 256k token context – a new record for Mistral. The new model scored an impressive joint first place with…

  • Simon Willison’s Weblog: Quoting François Chollet

    Source URL: https://simonwillison.net/2025/Jan/6/francois-chollet/#atom-everything Source: Simon Willison’s Weblog Title: Quoting François Chollet Feedly Summary: I don’t think people really appreciate how simple ARC-AGI-1 was, and what solving it really means. It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar…

  • Simon Willison’s Weblog: Weeknotes: Starting 2025 a little slow

    Source URL: https://simonwillison.net/2025/Jan/4/weeknotes/#atom-everything Source: Simon Willison’s Weblog Title: Weeknotes: Starting 2025 a little slow Feedly Summary: I published my review of 2024 in LLMs and then got into a fight with most of the internet over the phone microphone targeted ads conspiracy theory. In my last weeknotes I talked about how December in LLMs has…

  • Simon Willison’s Weblog: Things we learned out about LLMs in 2024

    Source URL: https://simonwillison.net/2024/Dec/31/llms-in-2024/#atom-everything Source: Simon Willison’s Weblog Title: Things we learned out about LLMs in 2024 Feedly Summary: A lot has happened in the world of Large Language Models over the course of 2024. Here’s a review of things we figured out about the field in the past twelve months, plus my attempt at identifying…

  • Irrational Exuberance: Wardley mapping the LLM ecosystem.

    Source URL: https://lethain.com/wardley-llm-ecosystem/ Source: Irrational Exuberance Title: Wardley mapping the LLM ecosystem. Feedly Summary: In How should you adopt LLMs?, we explore how a theoretical ride sharing company, Theoretical Ride Sharing, should adopt Large Language Models (LLMs). Part of that strategy’s diagnosis depends on understanding the expected evolution of the LLM ecosystem, which we’ve build…

  • Cloud Blog: Reach beyond the IDE with tools for Gemini Code Assist

    Source URL: https://cloud.google.com/blog/products/application-development/gemini-code-assist-launches-developer-early-access-for-tools/ Source: Cloud Blog Title: Reach beyond the IDE with tools for Gemini Code Assist Feedly Summary: One of the biggest areas of promise for generative AI is coding assistance — leveraging the power of large language models to help developers create or update application code with amazing speed and accuracy, dramatically boosting…

  • Hacker News: Task-Specific LLM Evals That Do and Don’t Work

    Source URL: https://eugeneyan.com/writing/evals/ Source: Hacker News Title: Task-Specific LLM Evals That Do and Don’t Work Feedly Summary: Comments AI Summary and Description: Yes Summary: The text presents a comprehensive overview of evaluation metrics for machine learning tasks, specifically focusing on classification, summarization, and translation within the context of large language models (LLMs). It highlights the…

  • Simon Willison’s Weblog: Quoting Ethan Mollick

    Source URL: https://simonwillison.net/2024/Dec/7/ethan-mollick/#atom-everything Source: Simon Willison’s Weblog Title: Quoting Ethan Mollick Feedly Summary: A test of how seriously your firm is taking AI: when o-1 (& the new Gemini) came out this week, were there assigned folks who immediately ran the model through internal, validated, firm-specific benchmarks to see how useful it as? Did you…

  • Hacker News: A statistical approach to model evaluations

    Source URL: https://www.anthropic.com/research/statistical-approach-to-model-evals Source: Hacker News Title: A statistical approach to model evaluations Feedly Summary: Comments AI Summary and Description: Yes Summary: The text discusses a new research paper that proposes statistical recommendations for the reporting of AI model evaluation results, focused on improving the rigor and reliability of assessments in AI research. It highlights…