Tag: evals

  • Simon Willison’s Weblog: Quoting Andrew Ng

    Source URL: https://simonwillison.net/2025/Apr/18/andrew-ng/ Source: Simon Willison’s Weblog Title: Quoting Andrew Ng Feedly Summary: To me, a successful eval meets the following criteria. Say, we currently have system A, and we might tweak it to get a system B: If A works significantly better than B according to a skilled human judge, the eval should give…

  • Simon Willison’s Weblog: Quoting Ted Sanders, OpenAI

    Source URL: https://simonwillison.net/2025/Apr/17/ted-sanders/ Source: Simon Willison’s Weblog Title: Quoting Ted Sanders, OpenAI Feedly Summary: Our hypothesis is that o4-mini is a much better model, but we’ll wait to hear feedback from developers. Evals only tell part of the story, and we wouldn’t want to prematurely deprecate a model that developers continue to find value in.…

  • AWS News Blog: AWS Weekly Review: Amazon S3 Express One Zone price cuts, Pixtral Large on Amazon Bedrock, Amazon Nova Sonic, and more (April 14, 2025)

    Source URL: https://aws.amazon.com/blogs/aws/aws-weekly-review-amazon-s3-express-one-zone-price-cuts-pixtral-large-on-amazon-bedrock-amazon-nova-sonic-and-more-april-14-2025/ Source: AWS News Blog Title: AWS Weekly Review: Amazon S3 Express One Zone price cuts, Pixtral Large on Amazon Bedrock, Amazon Nova Sonic, and more (April 14, 2025) Feedly Summary: The Amazon Web Services (AWS) Summit 2025 season launched this week, starting with the Paris Summit. These free events bring together the…

  • Simon Willison’s Weblog: Quoting Greg Kamradt

    Source URL: https://simonwillison.net/2025/Mar/25/greg-kamradt/ Source: Simon Willison’s Weblog Title: Quoting Greg Kamradt Feedly Summary: Today we’re excited to launch ARC-AGI-2 to challenge the new frontier. ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans. Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve…

  • Simon Willison’s Weblog: What’s new in the world of LLMs, for NICAR 2025

    Source URL: https://simonwillison.net/2025/Mar/8/nicar-llms/ Source: Simon Willison’s Weblog Title: What’s new in the world of LLMs, for NICAR 2025 Feedly Summary: I presented two sessions at the NICAR 2025 data journalism conference this year. The first was this one based on my review of LLMs in 2024, extended by several months to cover everything that’s happened…

  • Simon Willison’s Weblog: Leaked Windsurf prompt

    Source URL: https://simonwillison.net/2025/Feb/25/leaked-windsurf-prompt/ Source: Simon Willison’s Weblog Title: Leaked Windsurf prompt Feedly Summary: The Windsurf Editor is Codeium’s highly regarded entrant into the fork-of-VS-Code AI-enhanced IDE model first pioneered by Cursor (and by VS Code itself). I heard online that it had a quirky system prompt, and was able to replicate that…

  • Simon Willison’s Weblog: Aider Polyglot leaderboard results for Claude 3.7 Sonnet

    Source URL: https://simonwillison.net/2025/Feb/25/aider-polyglot-leaderboard/ Source: Simon Willison’s Weblog Title: Aider Polyglot leaderboard results for Claude 3.7 Sonnet Feedly Summary: Paul Gauthier’s Aider Polyglot benchmark is one of my favourite independent benchmarks for LLMs, partly because it focuses on code and partly because Paul is very responsive at evaluating…