Tag: evals
-
Hacker News: Show HN: Mastra – Open-source TypeScript agent framework
Source URL: https://github.com/mastra-ai/mastra Source: Hacker News Title: Show HN: Mastra – Open-source TypeScript agent framework Feedly Summary: Comments AI Summary and Description: Yes Summary: The text introduces Mastra, a TypeScript framework designed to facilitate the rapid development of AI applications. It emphasizes key functionalities such as LLM model integration, agent systems, workflows, and retrieval-augmented generation…
-
Simon Willison’s Weblog: OpenAI o3-mini, now available in LLM
Source URL: https://simonwillison.net/2025/Jan/31/o3-mini/#atom-everything Source: Simon Willison’s Weblog Title: OpenAI o3-mini, now available in LLM Feedly Summary: o3-mini is out today. As with other o-series models it’s a slightly difficult one to evaluate – we now need to decide if a prompt is best run using GPT-4o, o1, o3-mini or (if we have access) o1 Pro.…
-
Hacker News: Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals
Source URL: https://blog.skyvern.com/skyvern-2-0-state-of-the-art-web-navigation-with-85-8-on-webvoyager-eval/ Source: Hacker News Title: Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals Feedly Summary: Comments AI Summary and Description: Yes Summary: The text discusses the launch of Skyvern 2.0, an advanced autonomous web agent that achieves a benchmark score of 85.85% on the WebVoyager Eval. It details…
-
Simon Willison’s Weblog: Codestral 25.01
Source URL: https://simonwillison.net/2025/Jan/13/codestral-2501/ Source: Simon Willison’s Weblog Title: Codestral 25.01 Feedly Summary: Codestral 25.01 Brand new code-focused model from Mistral. Unlike the first Codestral this one isn’t (yet) available as open weights. The model has a 256k token context – a new record for Mistral. The new model scored an impressive joint first place with…
-
Simon Willison’s Weblog: Quoting François Chollet
Source URL: https://simonwillison.net/2025/Jan/6/francois-chollet/#atom-everything Source: Simon Willison’s Weblog Title: Quoting François Chollet Feedly Summary: I don’t think people really appreciate how simple ARC-AGI-1 was, and what solving it really means. It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar…
-
Irrational Exuberance: Wardley mapping the LLM ecosystem.
Source URL: https://lethain.com/wardley-llm-ecosystem/ Source: Irrational Exuberance Title: Wardley mapping the LLM ecosystem. Feedly Summary: In How should you adopt LLMs?, we explore how a theoretical ride sharing company, Theoretical Ride Sharing, should adopt Large Language Models (LLMs). Part of that strategy’s diagnosis depends on understanding the expected evolution of the LLM ecosystem, which we’ve build…