Simon Willison’s Weblog: Claude Opus 4.1 and Opus 4 degraded quality

Source URL: https://simonwillison.net/2025/Aug/30/claude-degraded-quality/#atom-everything
Source: Simon Willison’s Weblog
Title: Claude Opus 4.1 and Opus 4 degraded quality

Feedly Summary: Claude Opus 4.1 and Opus 4 degraded quality
Notable because often when people complain of degraded model quality it turns out to be unfounded – Anthropic in the past have emphasized that they don’t change the model weights after releasing them without changing the version number.
In this case a botched upgrade of their inference stack caused a genuine model degradation for 56.5 hours:

From 17:30 UTC on Aug 25th to 02:00 UTC on Aug 28th, Claude Opus 4.1 experienced a degradation in quality for some requests. Users may have seen lower intelligence, malformed responses or issues with tool calling in Claude Code.
This was caused by a rollout of our inference stack, which we have since rolled back for Claude Opus 4.1. […]
We’ve also discovered that Claude Opus 4.0 has been affected by the same issue and we are in the process of rolling it back.

Tags: ai, generative-ai, llms, anthropic, claude, claude-4

AI Summary and Description: Yes

Summary: The text discusses a specific incident in which the Claude Opus 4.1 model experienced degraded quality due to a botched upgrade of Anthropic's inference stack. The issue, which lasted roughly 56.5 hours, left some users with reduced model performance and malformed outputs. The significance of this incident lies in its implications for AI oversight and quality assurance in machine learning operations (MLOps).

Detailed Description:

The incident involving Claude Opus 4.1 highlights the vital importance of maintaining model performance and reliability in AI applications. A faulty upgrade of the inference stack resulted in a genuine degradation of the model for an extended period, bringing to light several key issues around the management of AI systems.

Key Points:

– **Incident Overview**:
  – Duration: from 17:30 UTC on August 25 to 02:00 UTC on August 28 (approximately 56.5 hours).

– **Degradation Effects**: Users encountered lower-intelligence responses, malformed outputs, and issues with tool calling in Claude Code.

– **Causes of the Quality Decrease**:
  – A rollout of an update to the inference stack was identified as the root cause of the performance degradation.
  – Anthropic has previously emphasized that model weights are not changed after release without a corresponding version-number change; this incident shows that serving-layer changes can still genuinely degrade output quality even when the weights stay the same.

– **Response to Issues**:
  – The inference stack change was rolled back for Claude Opus 4.1 to restore normal behavior and limit customer impact.
  – Further investigation revealed that Claude Opus 4.0 was affected by the same issue, and a rollback for that model was in progress at the time of the post.

This situation underscores the necessity of robust AI and MLOps practices to ensure continual quality and reliability of AI systems, especially post-deployment. The incident serves as a reminder for AI developers and practitioners to be vigilant about the potential for technical disruptions and to have contingency plans in place for promptly addressing such issues. This aligns with principles found in AI Security and MLOps disciplines, emphasizing the importance of maintaining system integrity and user trust.