Source URL: https://simonwillison.net/2025/Aug/5/claude-opus-41/
Source: Simon Willison’s Weblog
Title: Claude Opus 4.1
Feedly Summary: Claude Opus 4.1
Surprise new model from Anthropic today – Claude Opus 4.1, which they describe as “a drop-in replacement for Opus 4”.
My favorite thing about this model is the version number – treating this as a .1 version increment looks like an accurate reflection of the model’s capabilities.
Anthropic’s own benchmarks show very small incremental gains.
Comparing Opus 4 and Opus 4.1 (I got 4.1 to extract these numbers from a screenshot of Anthropic’s own benchmark scores, then asked it to look up the links, then verified the links myself and fixed a few of them; the deltas are tallied in the sketch after this list):
– Agentic coding (SWE-bench Verified): From 72.5% to 74.5%
– Agentic terminal coding (Terminal-Bench): From 39.2% to 43.3%
– Graduate-level reasoning (GPQA Diamond): From 79.6% to 80.9%
– Agentic tool use (TAU-bench):
  – Retail: From 81.4% to 82.4%
  – Airline: From 59.6% to 56.0% (decreased)
– Multilingual Q&A (MMMLU): From 88.8% to 89.5%
– Visual reasoning (MMMU validation): From 76.5% to 77.1%
– High school math competition (AIME 2025): From 75.5% to 78.0%
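Tallying those up as percentage-point deltas makes the scale of the change concrete. This is just a quick sketch over the scores quoted above, nothing more:

```python
# Percentage-point deltas between Opus 4 and Opus 4.1,
# using only the benchmark scores listed above.
scores = {
    "SWE-bench Verified": (72.5, 74.5),
    "Terminal-Bench": (39.2, 43.3),
    "GPQA Diamond": (79.6, 80.9),
    "TAU-bench Retail": (81.4, 82.4),
    "TAU-bench Airline": (59.6, 56.0),
    "MMMLU": (88.8, 89.5),
    "MMMU validation": (76.5, 77.1),
    "AIME 2025": (75.5, 78.0),
}
for name, (opus_4, opus_4_1) in scores.items():
    print(f"{name}: {opus_4_1 - opus_4:+.1f} points")
```

The largest single movement is +4.1 points on Terminal-Bench, and TAU-bench Airline (−3.6) is the only regression.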
Likewise, the model card shows only tiny changes to the various safety metrics that Anthropic track.
It’s priced the same as Opus 4 – $15 per million input tokens and $75 per million output tokens – making it one of the most expensive models on the market today.
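To put those rates in concrete terms, here’s a back-of-envelope cost sketch (list prices only; this ignores any prompt caching or batch discounts):

```python
# Opus 4.1 list prices: $15 per million input tokens,
# $75 per million output tokens (no discounts applied).
INPUT_PER_MTOK = 15.0
OUTPUT_PER_MTOK = 75.0

def opus_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at list prices."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

# A 10,000-token prompt that produces a 2,000-token reply:
print(f"${opus_cost(10_000, 2_000):.2f}")  # $0.30
```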
I had it draw me this pelican riding a bicycle:
For comparison I got a fresh new pelican out of Opus 4 which I actually like a little more:
I shipped llm-anthropic 0.18 with support for the new model.
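If you want to try it through llm, here’s a minimal sketch using the library’s Python API. The "claude-opus-4.1" model ID is my assumption for what llm-anthropic 0.18 registers – run `llm models` to confirm the exact alias:

```python
# Prerequisites (assumed): pip install llm llm-anthropic
# plus an Anthropic API key, e.g. via `llm keys set anthropic`
# or the ANTHROPIC_API_KEY environment variable.
import llm

# "claude-opus-4.1" is an assumed alias – run `llm models`
# to see what llm-anthropic 0.18 actually registers.
model = llm.get_model("claude-opus-4.1")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())
```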
Tags: ai, generative-ai, llms, llm, anthropic, claude, evals, llm-pricing, pelican-riding-a-bicycle, llm-release
AI Summary and Description: Yes
Summary: The text discusses the release of Claude Opus 4.1 from Anthropic, highlighting its minor improvements over the previous model, its unchanged pricing, and its benchmark performance. This incremental advancement is particularly relevant for AI developers, researchers, and security professionals evaluating and integrating AI models into secure applications.
Detailed Description: The announcement covers Claude Opus 4.1, a new model from Anthropic intended as a drop-in replacement for its predecessor, Opus 4. Noteworthy aspects of the release include slight performance changes across several benchmarks (mostly small gains, with one regression), which may inform its application in AI-driven solutions, especially in security contexts.
Key insights include:
– **Incremental Updates**: The model increment from Opus 4 to 4.1 signifies minor improvements rather than a complete overhaul. This can inform decision-making for organizations considering updates or new acquisitions of AI models, emphasizing the need for thorough benchmarking against security and performance requirements.
– **Benchmark Performance**:
– Agentic coding (SWE-bench Verified): Increased from 72.5% to 74.5%
– Agentic terminal coding (Terminal-Bench): Improved from 39.2% to 43.3%
– Graduate-level reasoning (GPQA Diamond): Enhanced from 79.6% to 80.9%
– Agentic tool use (TAU-bench) showcased mixed results across sectors, highlighting potential context-specific performance:
– Retail: Increased from 81.4% to 82.4%
– Airline: Decreased from 59.6% to 56.0%
– Multilingual Q&A (MMMLU): Increased from 88.8% to 89.5%
– Visual reasoning (MMMU validation): Increased from 76.5% to 77.1%
– High school math competition (AIME 2025): Increased from 75.5% to 78.0%
– **Safety Metrics**: The minor changes in safety metrics are of interest to security professionals, as they may affect how the model behaves in production environments, especially concerning compliance and governance.
– **Cost Considerations**: At $15 per million input tokens and $75 per million output tokens, the model’s pricing places it among the most expensive options available, which could influence budgetary decisions for organizations investing in AI capabilities.
– **Generative Capabilities**: The pelican-riding-a-bicycle test (asking the model to draw a pelican as SVG code) illustrates its usability in creative code-generation tasks, which may also raise security and ethical considerations around generated content.
Overall, the incremental nature of the upgrade and its implications for AI security and compliance professionals highlight the importance of continual evaluation and monitoring of AI models in operational environments.