Tag: metrics

  • Simon Willison’s Weblog: Exploring Promptfoo via Dave Guarino’s SNAP evals

    Source URL: https://simonwillison.net/2025/Apr/24/exploring-promptfoo/#atom-everything
    Source: Simon Willison’s Weblog
    Title: Exploring Promptfoo via Dave Guarino’s SNAP evals
    Feedly Summary: I used part three (here’s parts one and two) of Dave Guarino’s series on evaluating how well LLMs can answer questions about SNAP (aka food stamps) as an excuse to explore Promptfoo, an LLM eval tool. SNAP (Supplemental…

  • Cloud Blog: DORA’s new report: Unlock generative AI in software development

    Source URL: https://cloud.google.com/blog/products/ai-machine-learning/sharing-new-dora-research-for-gen-ai-in-software-development/
    Source: Cloud Blog
    Title: DORA’s new report: Unlock generative AI in software development
    Feedly Summary: How is generative AI actually impacting developers’ daily work, team dynamics, and organizational outcomes? We’ve moved beyond simply asking if organizations are using AI, and instead are focusing on how they’re using it. That’s why we’re excited…

  • Slashdot: AI Secretly Helped Write California Bar Exam, Sparking Uproar

    Source URL: https://news.slashdot.org/story/25/04/23/2025217/ai-secretly-helped-write-california-bar-exam-sparking-uproar?utm_source=rss1.0mainlinkanon&utm_medium=feed
    Source: Slashdot
    Title: AI Secretly Helped Write California Bar Exam, Sparking Uproar
    Feedly Summary:
    AI Summary and Description: Yes
    Summary: The State Bar of California’s decision to use AI in generating questions for the February 2025 bar exam has sparked significant backlash from legal educators and test-takers. The controversy raises concerns about…

  • Simon Willison’s Weblog: llm-fragment-symbex

    Source URL: https://simonwillison.net/2025/Apr/23/llm-fragment-symbex/#atom-everything
    Source: Simon Willison’s Weblog
    Title: llm-fragment-symbex
    Feedly Summary: I released a new LLM fragment loader plugin that builds on top of my Symbex project. Symbex is a CLI tool I wrote that can run against a folder full of Python code and output functions, classes, methods or just their docstrings and…

  • The Cloudflare Blog: New year, no shutdowns: the Q1 2025 Internet disruption summary

    Source URL: https://blog.cloudflare.com/q1-2025-internet-disruption-summary/
    Source: The Cloudflare Blog
    Title: New year, no shutdowns: the Q1 2025 Internet disruption summary
    Feedly Summary: In Q1 2025, we observed Internet disruptions around the world caused by cable damage, power outages, natural disasters, fire, a cyberattack, and technical problems.
    AI Summary and Description: Yes
    Summary: The text provides a detailed…

  • Cloud Blog: Diving into the technology behind Google’s AI-era global network

    Source URL: https://cloud.google.com/blog/products/networking/google-global-network-technology-deep-dive/
    Source: Cloud Blog
    Title: Diving into the technology behind Google’s AI-era global network
    Feedly Summary: The unprecedented growth and unique challenges of AI applications are driving fundamental architectural changes to Google’s next-generation global network. The AI era brings an explosive surge in demand for network capacity, with novel traffic patterns characteristic of…

  • Simon Willison’s Weblog: OpenAI o3 and o4-mini System Card

    Source URL: https://simonwillison.net/2025/Apr/21/openai-o3-and-o4-mini-system-card/
    Source: Simon Willison’s Weblog
    Title: OpenAI o3 and o4-mini System Card
    Feedly Summary: I’m surprised to see a combined System Card for o3 and o4-mini in the same document – I’d expect to see these covered separately. The opening paragraph calls out the most interesting new…

  • Simon Willison’s Weblog: Quoting Andrew Ng

    Source URL: https://simonwillison.net/2025/Apr/18/andrew-ng/
    Source: Simon Willison’s Weblog
    Title: Quoting Andrew Ng
    Feedly Summary: To me, a successful eval meets the following criteria. Say, we currently have system A, and we might tweak it to get a system B: If A works significantly better than B according to a skilled human judge, the eval should give…