Tag: evals

  • AWS News Blog: AWS Weekly Review: Amazon S3 Express One Zone price cuts, Pixtral Large on Amazon Bedrock, Amazon Nova Sonic, and more (April 14, 2025)

    Source URL: https://aws.amazon.com/blogs/aws/aws-weekly-review-amazon-s3-express-one-zone-price-cuts-pixtral-large-on-amazon-bedrock-amazon-nova-sonic-and-more-april-14-2025/ Source: AWS News Blog Title: AWS Weekly Review: Amazon S3 Express One Zone price cuts, Pixtral Large on Amazon Bedrock, Amazon Nova Sonic, and more (April 14, 2025) Feedly Summary: The Amazon Web Services (AWS) Summit 2025 season launched this week, starting with the Paris Summit. These free events bring together the…

  • Simon Willison’s Weblog: Quoting Greg Kamradt

    Source URL: https://simonwillison.net/2025/Mar/25/greg-kamradt/ Source: Simon Willison’s Weblog Title: Quoting Greg Kamradt Feedly Summary: Today we’re excited to launch ARC-AGI-2 to challenge the new frontier. ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans. Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve…

  • Simon Willison’s Weblog: What’s new in the world of LLMs, for NICAR 2025

    Source URL: https://simonwillison.net/2025/Mar/8/nicar-llms/ Source: Simon Willison’s Weblog Title: What’s new in the world of LLMs, for NICAR 2025 Feedly Summary: I presented two sessions at the NICAR 2025 data journalism conference this year. The first was this one based on my review of LLMs in 2024, extended by several months to cover everything that’s happened…

  • Simon Willison’s Weblog: Leaked Windsurf prompt

    Source URL: https://simonwillison.net/2025/Feb/25/leaked-windsurf-prompt/ Source: Simon Willison’s Weblog Title: Leaked Windsurf prompt Feedly Summary: The Windsurf Editor is Codeium’s highly regarded entrant into the fork-of-VS-Code AI-enhanced IDE model first pioneered by Cursor (and by VS Code itself). I heard online that it had a quirky system prompt, and was able to replicate that…

  • Simon Willison’s Weblog: Aider Polyglot leaderboard results for Claude 3.7 Sonnet

    Source URL: https://simonwillison.net/2025/Feb/25/aider-polyglot-leaderboard/ Source: Simon Willison’s Weblog Title: Aider Polyglot leaderboard results for Claude 3.7 Sonnet Feedly Summary: Paul Gauthier’s Aider Polyglot benchmark is one of my favourite independent benchmarks for LLMs, partly because it focuses on code and partly because Paul is very responsive at evaluating…

  • Hacker News: Show HN: Mastra – Open-source TypeScript agent framework

    Source URL: https://github.com/mastra-ai/mastra Source: Hacker News Title: Show HN: Mastra – Open-source TypeScript agent framework Feedly Summary: The text introduces Mastra, a TypeScript framework designed to facilitate the rapid development of AI applications. It emphasizes key functionalities such as LLM integration, agent systems, workflows, and retrieval-augmented generation…

  • Simon Willison’s Weblog: Building a SNAP LLM eval: part 1

    Source URL: https://simonwillison.net/2025/Feb/12/building-a-snap-llm/#atom-everything Source: Simon Willison’s Weblog Title: Building a SNAP LLM eval: part 1 Feedly Summary: Dave Guarino has been exploring using LLM-driven systems to help people apply for SNAP, the US Supplemental Nutrition Assistance Program (aka food stamps). This is a domain which existing models…