Source URL: https://simonwillison.net/2025/Aug/5/claude-opus-41/
Source: Simon Willison’s Weblog
Title: Claude Opus 4.1
Feedly Summary: Claude Opus 4.1
Surprise new model from Anthropic today – Claude Opus 4.1, which they describe as “a drop-in replacement for Opus 4”.
My favorite thing about this model is the version number – treating this as a .1 version increment looks like an accurate reflection of the model’s capabilities.
Anthropic’s own benchmarks show very small incremental gains.
Comparing Opus 4 and Opus 4.1 (I got 4.1 to extract these numbers from a screenshot of Anthropic’s own benchmark scores, then asked it to look up the links, then verified the links myself and fixed a few of them; the deltas are tallied in the sketch after this list):
– Agentic coding (SWE-bench Verified): From 72.5% to 74.5%
– Agentic terminal coding (Terminal-Bench): From 39.2% to 43.3%
– Graduate-level reasoning (GPQA Diamond): From 79.6% to 80.9%
– Agentic tool use (TAU-bench):
  – Retail: From 81.4% to 82.4%
  – Airline: From 59.6% to 56.0% (decreased)
– Multilingual Q&A (MMMLU): From 88.8% to 89.5%
– Visual reasoning (MMMU validation): From 76.5% to 77.1%
– High school math competition (AIME 2025): From 75.5% to 78.0%
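Tallying those up as percentage-point deltas makes the scale of the change concrete. This is just a quick sketch over the scores quoted above, nothing more:

```python
# Percentage-point deltas between Opus 4 and Opus 4.1,
# using only the benchmark scores listed above.
scores = {
    "SWE-bench Verified": (72.5, 74.5),
    "Terminal-Bench": (39.2, 43.3),
    "GPQA Diamond": (79.6, 80.9),
    "TAU-bench Retail": (81.4, 82.4),
    "TAU-bench Airline": (59.6, 56.0),
    "MMMLU": (88.8, 89.5),
    "MMMU validation": (76.5, 77.1),
    "AIME 2025": (75.5, 78.0),
}
for name, (opus_4, opus_4_1) in scores.items():
    print(f"{name}: {opus_4_1 - opus_4:+.1f} points")
```

The largest single movement is +4.1 points on Terminal-Bench, and TAU-bench Airline (−3.6) is the only regression.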
Likewise, the model card shows only tiny changes to the various safety metrics that Anthropic track.
It’s priced the same as Opus 4 – $15 per million input tokens and $75 per million output tokens – making it one of the most expensive models on the market today.
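To put those rates in concrete terms, here’s a back-of-envelope cost sketch (list prices only; this ignores any prompt caching or batch discounts):

```python
# Opus 4.1 list prices: $15 per million input tokens,
# $75 per million output tokens (no discounts applied).
INPUT_PER_MTOK = 15.0
OUTPUT_PER_MTOK = 75.0

def opus_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at list prices."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

# A 10,000-token prompt that produces a 2,000-token reply:
print(f"${opus_cost(10_000, 2_000):.2f}")  # $0.30
```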
I had it draw me this pelican riding a bicycle:
For comparison I got a fresh new pelican out of Opus 4 which I actually like a little more:
I shipped llm-anthropic 0.18 with support for the new model.
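If you want to try it through llm, here’s a minimal sketch using the library’s Python API. The "claude-opus-4.1" model ID is my assumption for what llm-anthropic 0.18 registers – run `llm models` to confirm the exact alias:

```python
# Prerequisites (assumed): pip install llm llm-anthropic
# plus an Anthropic API key, e.g. via `llm keys set anthropic`
# or the ANTHROPIC_API_KEY environment variable.
import llm

# "claude-opus-4.1" is an assumed alias – run `llm models`
# to see what llm-anthropic 0.18 actually registers.
model = llm.get_model("claude-opus-4.1")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())
```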
Tags: ai, generative-ai, llms, llm, anthropic, claude, evals, llm-pricing, pelican-riding-a-bicycle, llm-release
AI Summary and Description: Yes
Summary: The text discusses the release of Claude Opus 4.1 from Anthropic, highlighting its minor improvements over the previous model, its unchanged pricing, and its benchmark performance. This incremental advancement is particularly relevant for AI developers, researchers, and security professionals evaluating and integrating AI models into secure applications.
Detailed Description: The announcement covers Claude Opus 4.1, a new model from Anthropic intended as a drop-in replacement for its predecessor, Opus 4. Noteworthy aspects of the release include slight performance changes across several benchmarks (mostly small gains, with one regression), which may inform its application in AI-driven solutions, especially in security contexts.
Key insights include:
– **Incremental Updates**: The model increment from Opus 4 to 4.1 signifies minor improvements rather than a complete overhaul. This can inform decision-making for organizations considering updates or new acquisitions of AI models, emphasizing the need for thorough benchmarking against security and performance requirements.
– **Benchmark Performance**:
– Agentic coding (SWE-bench Verified): Increased from 72.5% to 74.5%
– Agentic terminal coding (Terminal-Bench): Improved from 39.2% to 43.3%
– Graduate-level reasoning (GPQA Diamond): Enhanced from 79.6% to 80.9%
– Agentic tool use (TAU-bench) showcased mixed results across sectors, highlighting potential context-specific performance:
– Retail: Increased from 81.4% to 82.4%
– Airline: Decreased from 59.6% to 56.0%
– Multilingual Q&A (MMMLU): Increased from 88.8% to 89.5%
– Visual reasoning (MMMU validation): Increased from 76.5% to 77.1%
– High school math competition (AIME 2025): Increased from 75.5% to 78.0%
– **Safety Metrics**: The minor changes in safety metrics are of interest to security professionals, as they may affect how the model behaves in production environments, especially concerning compliance and governance.
– **Cost Considerations**: At $15 per million input tokens and $75 per million output tokens, the model’s pricing places it among the most expensive options available, which could influence budgetary decisions for organizations investing in AI capabilities.
– **Generative Capabilities**: The pelican-riding-a-bicycle test (asking the model to draw a pelican as SVG code) illustrates its usability in creative code-generation tasks, which may also raise security and ethical considerations around generated content.
Overall, the incremental nature of the upgrade and its implications for AI security and compliance professionals highlight the importance of continual evaluation and monitoring of AI models in operational environments.