Tag: benchmarks

  • The Register: JetBrains backs open AI coding standard that could gnaw at VS Code dominance

    Source URL: https://www.theregister.com/2025/10/07/jetbrains_acp_vs_code/ Source: The Register Title: JetBrains backs open AI coding standard that could gnaw at VS Code dominance Feedly Summary: Google and Zed have already adopted ACP – will Microsoft now follow? JetBrains has joined Google and Zed Industries in adopting the fledgling Agent Client Protocol (ACP), a standard for how AI agents…

  • OpenAI : Disrupting malicious uses of AI: October 2025

    Source URL: https://openai.com/global-affairs/disrupting-malicious-uses-of-ai-october-2025 Source: OpenAI Title: Disrupting malicious uses of AI: October 2025 Feedly Summary: Discover how OpenAI is detecting and disrupting malicious uses of AI in our October 2025 report. Learn how we’re countering misuse, enforcing policies, and protecting users from real-world harms. AI Summary and Description: Yes Summary: The text discusses OpenAI’s initiatives…

  • Simon Willison’s Weblog: Two more Chinese pelicans

    Source URL: https://simonwillison.net/2025/Oct/1/two-pelicans/#atom-everything Source: Simon Willison’s Weblog Title: Two more Chinese pelicans Feedly Summary: Two new models from Chinese AI labs in the past few days. I tried them both out using llm-openrouter: DeepSeek-V3.2-Exp from DeepSeek. Announcement, Tech Report, Hugging Face (690GB, MIT license). As an intermediate step toward our next-generation architecture, V3.2-Exp builds upon…

  • Simon Willison’s Weblog: Improved Gemini 2.5 Flash and Flash-Lite

    Source URL: https://simonwillison.net/2025/Sep/25/improved-gemini-25-flash-and-flash-lite/#atom-everything Source: Simon Willison’s Weblog Title: Improved Gemini 2.5 Flash and Flash-Lite Feedly Summary: Improved Gemini 2.5 Flash and Flash-Lite Two new preview models from Google – updates to their fast and inexpensive Flash and Flash Lite families: The latest version of Gemini 2.5 Flash-Lite was trained and built based on three key…

  • Slashdot: OpenAI Says GPT-5 Stacks Up To Humans in a Wide Range of Jobs

    Source URL: https://slashdot.org/story/25/09/25/176219/openai-says-gpt-5-stacks-up-to-humans-in-a-wide-range-of-jobs?utm_source=rss1.0mainlinkanon&utm_medium=feed Source: Slashdot Title: OpenAI Says GPT-5 Stacks Up To Humans in a Wide Range of Jobs Feedly Summary: AI Summary and Description: Yes Summary: OpenAI has introduced GDPval, a new benchmark to assess the performance of its AI models against that of human professionals across various industries. The benchmark indicates that models…

  • Simon Willison’s Weblog: Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action

    Source URL: https://simonwillison.net/2025/Sep/23/qwen3-vl/ Source: Simon Willison’s Weblog Title: Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action Feedly Summary: Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action I’ve been looking forward to this. Qwen 2.5 VL is one of the best available open weight vision LLMs, so I had high hopes for Qwen 3’s vision models. Firstly, we…

  • Simon Willison’s Weblog: Magistral 1.2

    Source URL: https://simonwillison.net/2025/Sep/19/magistral/ Source: Simon Willison’s Weblog Title: Magistral 1.2 Feedly Summary: Mistral quietly released two new models yesterday: Magistral Small 1.2 (Apache 2.0, 96.1 GB on Hugging Face) and Magistral Medium 1.2 (not open weights same as Mistral’s other “medium" models.) Despite being described as "minor updates" to the Magistral 1.1 models these have…

  • AWS News Blog: DeepSeek-V3.1 model now available in Amazon Bedrock

    Source URL: https://aws.amazon.com/blogs/aws/deepseek-v3-1-now-available-in-amazon-bedrock/ Source: AWS News Blog Title: DeepSeek-V3.1 model now available in Amazon Bedrock Feedly Summary: AWS launches DeepSeek-V3.1 as a fully managed models in Amazon Bedrock. DeepSeek-V3.1 is a hybrid open weight model that switches between thinking mode for detailed step-by-step analysis and non-thinking mode for faster responses. AI Summary and Description: Yes…