Tag: benchmark

  • Slashdot: OpenAI Says GPT-5 Stacks Up To Humans in a Wide Range of Jobs

    Source URL: https://slashdot.org/story/25/09/25/176219/openai-says-gpt-5-stacks-up-to-humans-in-a-wide-range-of-jobs?utm_source=rss1.0mainlinkanon&utm_medium=feed Source: Slashdot Title: OpenAI Says GPT-5 Stacks Up To Humans in a Wide Range of Jobs Feedly Summary: AI Summary and Description: Yes Summary: OpenAI has introduced GDPval, a new benchmark to assess the performance of its AI models against that of human professionals across various industries. The benchmark indicates that models…

  • Simon Willison’s Weblog: GPT-5-Codex

    Source URL: https://simonwillison.net/2025/Sep/23/gpt-5-codex/#atom-everything Source: Simon Willison’s Weblog Title: GPT-5-Codex Feedly Summary: GPT-5-Codex OpenAI half-relased this model earlier this month, adding it to their Codex CLI tool but not their API. Today they’ve fixed that – the new model can now be accessed as gpt-5-codex. It’s priced the same as regular GPT-5: $1.25/million input tokens, $10/million…

  • Simon Willison’s Weblog: Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action

    Source URL: https://simonwillison.net/2025/Sep/23/qwen3-vl/ Source: Simon Willison’s Weblog Title: Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action Feedly Summary: Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action I’ve been looking forward to this. Qwen 2.5 VL is one of the best available open weight vision LLMs, so I had high hopes for Qwen 3’s vision models. Firstly, we…

  • Simon Willison’s Weblog: CompileBench: Can AI Compile 22-year-old Code?

    Source URL: https://simonwillison.net/2025/Sep/22/compilebench/ Source: Simon Willison’s Weblog Title: CompileBench: Can AI Compile 22-year-old Code? Feedly Summary: CompileBench: Can AI Compile 22-year-old Code? Interesting new LLM benchmark from Piotr Grabowski and Piotr Migdał: how well can different models handle compilation challenges such as cross-compiling gucr for ARM64 architecture? This is one of my favorite applications of…

  • Simon Willison’s Weblog: Magistral 1.2

    Source URL: https://simonwillison.net/2025/Sep/19/magistral/ Source: Simon Willison’s Weblog Title: Magistral 1.2 Feedly Summary: Mistral quietly released two new models yesterday: Magistral Small 1.2 (Apache 2.0, 96.1 GB on Hugging Face) and Magistral Medium 1.2 (not open weights same as Mistral’s other “medium" models.) Despite being described as "minor updates" to the Magistral 1.1 models these have…

  • AWS News Blog: DeepSeek-V3.1 model now available in Amazon Bedrock

    Source URL: https://aws.amazon.com/blogs/aws/deepseek-v3-1-now-available-in-amazon-bedrock/ Source: AWS News Blog Title: DeepSeek-V3.1 model now available in Amazon Bedrock Feedly Summary: AWS launches DeepSeek-V3.1 as a fully managed models in Amazon Bedrock. DeepSeek-V3.1 is a hybrid open weight model that switches between thinking mode for detailed step-by-step analysis and non-thinking mode for faster responses. AI Summary and Description: Yes…

  • Cloud Blog: How Google Cloud’s AI tech stack powers today’s startups

    Source URL: https://cloud.google.com/blog/topics/startups/differentiated-ai-tech-stack-drives-startup-innovation-google-builders-forum/ Source: Cloud Blog Title: How Google Cloud’s AI tech stack powers today’s startups Feedly Summary: AI has accelerated startup innovation more than any technology since perhaps the internet itself, and we’ve been fortunate to have a front row seat to much of this innovation here at Google Cloud. Nine of the top…

  • Schneier on Security: Time-of-Check Time-of-Use Attacks Against LLMs

    Source URL: https://www.schneier.com/blog/archives/2025/09/time-of-check-time-of-use-attacks-against-llms.html Source: Schneier on Security Title: Time-of-Check Time-of-Use Attacks Against LLMs Feedly Summary: This is a nice piece of research: “Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents“.: Abstract: Large Language Model (LLM)-enabled agents are rapidly emerging across a wide range of applications, but their deployment introduces vulnerabilities with security implications.…