Tag: evals
-
Hamel’s Blog: Selecting The Right AI Evals Tool
Source URL: https://hamel.dev/blog/posts/eval-tools/ Source: Hamel’s Blog Title: Selecting The Right AI Evals Tool Feedly Summary: Over the past year, I’ve focused heavily on AI Evals, both in my consulting work and teaching. A question I get constantly is, “What’s the best tool for evals?”. I’ve always resisted answering directly for two reasons. First, people focus…
-
Tomasz Tunguz: The Future of AI Data Architecture: How Enterprises Are Building the Next Generation Stack
Source URL: https://www.tomtunguz.com/future-ai-data-architecture-enterprise-stack/ Source: Tomasz Tunguz Title: The Future of AI Data Architecture: How Enterprises Are Building the Next Generation Stack Feedly Summary: The AI stack is still developing. Different companies experiment with various approaches, tools, and architectures as they figure out what works at scale. The complication is that patterns are beginning to coalesce…
-
Simon Willison’s Weblog: CompileBench: Can AI Compile 22-year-old Code?
Source URL: https://simonwillison.net/2025/Sep/22/compilebench/ Source: Simon Willison’s Weblog Title: CompileBench: Can AI Compile 22-year-old Code? Feedly Summary: CompileBench: Can AI Compile 22-year-old Code? Interesting new LLM benchmark from Piotr Grabowski and Piotr Migdał: how well can different models handle compilation challenges such as cross-compiling gucr for ARM64 architecture? This is one of my favorite applications of…
-
Simon Willison’s Weblog: TIL: Running a gpt-oss eval suite against LM Studio on a Mac
Source URL: https://simonwillison.net/2025/Aug/17/gpt-oss-eval-suite/#atom-everything Source: Simon Willison’s Weblog Title: TIL: Running a gpt-oss eval suite against LM Studio on a Mac Feedly Summary: TIL: Running a gpt-oss eval suite against LM Studio on a Mac The other day I learned that OpenAI published a set of evals as part of their gpt-oss model release, described in…
-
Simon Willison’s Weblog: Claude Opus 4.1
Source URL: https://simonwillison.net/2025/Aug/5/claude-opus-41/ Source: Simon Willison’s Weblog Title: Claude Opus 4.1 Feedly Summary: Claude Opus 4.1 Surprise new model from Anthropic today – Claude Opus 4.1, which they describe as “a drop-in replacement for Opus 4”. My favorite thing about this model is the version number – treating this as a .1 version increment looks…
-
Simon Willison’s Weblog: Using GitHub Spark to reverse engineer GitHub Spark
Source URL: https://simonwillison.net/2025/Jul/24/github-spark/ Source: Simon Willison’s Weblog Title: Using GitHub Spark to reverse engineer GitHub Spark Feedly Summary: GitHub Spark was released in public preview yesterday. It’s GitHub’s implementation of the prompt-to-app pattern also seen in products like Claude Artifacts, Lovable, Vercel v0, Val Town Townie and Fly.io’s Phoenix New. I wrote about Spark back…
-
Simon Willison’s Weblog: TimeScope: How Long Can Your Video Large Multimodal Model Go?
Source URL: https://simonwillison.net/2025/Jul/23/timescope/#atom-everything Source: Simon Willison’s Weblog Title: TimeScope: How Long Can Your Video Large Multimodal Model Go? Feedly Summary: TimeScope: How Long Can Your Video Large Multimodal Model Go? New open source benchmark for evaluating vision LLMs on how well they handle long videos: TimeScope probes the limits of long-video capabilities by inserting several…
-
Simon Willison’s Weblog: Frequently Asked Questions (And Answers) About AI Evals
Source URL: https://simonwillison.net/2025/Jul/3/faqs-about-ai-evals/#atom-everything Source: Simon Willison’s Weblog Title: Frequently Asked Questions (And Answers) About AI Evals Feedly Summary: Frequently Asked Questions (And Answers) About AI Evals Hamel Husain and Shreya Shankar have been running a paid, cohort-based course on AI Evals For Engineers & PMs over the past few months. Here Hamel collects answers to…
-
Simon Willison’s Weblog: microsoft/vscode-copilot-chat
Source URL: https://simonwillison.net/2025/Jun/30/vscode-copilot-chat/#atom-everything Source: Simon Willison’s Weblog Title: microsoft/vscode-copilot-chat Feedly Summary: microsoft/vscode-copilot-chat As promised at Build 2025 in May, Microsoft have released the GitHub Copilot Chat client for VS Code under an open source (MIT) license. So far this is just the extension that provides the chat component of Copilot, but the launch announcement promises…
-
Simon Willison’s Weblog: Trying out the new Gemini 2.5 model family
Source URL: https://simonwillison.net/2025/Jun/17/gemini-2-5/ Source: Simon Willison’s Weblog Title: Trying out the new Gemini 2.5 model family Feedly Summary: After many months of previews, Gemini 2.5 Pro and Flash have reached general availability with new, memorable model IDs: gemini-2.5-pro and gemini-2.5-flash. They are joined by a new preview model with an unmemorable name: gemini-2.5-flash-lite-preview-06-17 is a…