Simon Willison’s Weblog: Anthropic: A postmortem of three recent issues

Source URL: https://simonwillison.net/2025/Sep/17/anthropic-postmortem/
Source: Simon Willison’s Weblog
Title: Anthropic: A postmortem of three recent issues

Feedly Summary: Anthropic: A postmortem of three recent issues
Anthropic had a very bad month in terms of model reliability:

Between August and early September, three infrastructure bugs intermittently degraded Claude’s response quality. We’ve now resolved these issues and want to explain what happened. […]
To state it plainly: We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone. […]
We don’t typically share this level of technical detail about our infrastructure, but the scope and complexity of these issues justified a more comprehensive explanation.

I’m really glad Anthropic are publishing this in so much detail. Their reputation for serving their models reliably has taken a notable hit.
I hadn’t appreciated the additional complexity caused by their mixture of different serving platforms:

We deploy Claude across multiple hardware platforms, namely AWS Trainium, NVIDIA GPUs, and Google TPUs. […] Each hardware platform has different characteristics and requires specific optimizations.

It sounds like the problems came down to three separate bugs which unfortunately came along very close to each other.
Anthropic also note that their privacy practices made investigating the issues particularly difficult:

The evaluations we ran simply didn’t capture the degradation users were reporting, in part because Claude often recovers well from isolated mistakes. Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, in particular when those interactions are not reported to us as feedback. This protects user privacy but prevents engineers from examining the problematic interactions needed to identify or reproduce bugs.

The code examples they provide to illustrate a TPU-specific bug show that they use Python and JAX as part of their serving layer.
Tags: python, ai, postmortem, generative-ai, llms, anthropic, claude

AI Summary and Description: Yes

Summary: This text details Anthropic’s recent challenges with their AI model, Claude, related to infrastructure bugs that impacted model reliability. The postmortem emphasizes the complexity of their hardware environments and highlights privacy practices complicating issue investigation.

Detailed Description:
The text analyzes the operational issues Anthropic faced with its AI model Claude, focusing on the three recent infrastructure bugs that affected the model’s reliability. The main points are summarized as follows:

– **Infrastructure Bugs**: Anthropic experienced three separate bugs that intermittently degraded the quality of Claude’s responses. The company states plainly that the degradation was caused by these bugs alone, not by demand, time of day, or server load.

– **Technical Complexity**: The company operates Claude across multiple hardware platforms, including:
  – AWS Trainium
  – NVIDIA GPUs
  – Google TPUs
  Each platform has different characteristics and requires platform-specific optimizations, which adds to the complexity of maintaining consistent model quality.

– **Privacy Practices**: The investigation was hindered by internal privacy and security controls, which limit how and when engineers can access user interactions with Claude unless those interactions are reported as feedback. This emphasis on privacy, while essential for protecting user data, can complicate troubleshooting for operational teams.

– **Postmortem Insight**: Publishing this detailed report is an effort to regain user confidence following the reliability issues. By sharing the technical insights and challenges encountered during the incident, Anthropic aims to improve transparency and repair the reputational damage.

– **Technical Implementation**: The code examples Anthropic shared to illustrate a TPU-specific bug show that Python and JAX form part of their serving stack.
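The excerpt doesn’t reproduce the bug itself, but the general failure mode behind this class of TPU precision issue is easy to illustrate: when nearly-tied logits are computed or compared at reduced precision, greedy (temperature-0) token selection can flip. A minimal NumPy sketch, using float16 as a stand-in (stock NumPy has no bfloat16) and invented values that are illustrative, not Anthropic’s:

```python
import numpy as np

# Two candidate-token logits that are nearly tied.
logits_fp32 = np.array([1.0001, 1.0002], dtype=np.float32)

# At full precision, greedy (temperature-0) selection picks token 1.
assert int(np.argmax(logits_fp32)) == 1

# Cast to half precision: fp16 spacing near 1.0 is 2**-10 ≈ 0.001,
# so both values round to exactly 1.0 and the tie resolves to token 0.
logits_fp16 = logits_fp32.astype(np.float16)
assert int(np.argmax(logits_fp16)) == 0
```

The same mechanism applies whenever different hardware backends (or mixed bf16/fp32 code paths) round intermediate values differently: outputs stay plausible, so the regression is hard to catch with standard evaluations.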

This analysis is particularly relevant for professionals in AI, cloud, and infrastructure security as it addresses the delicate balance between privacy practices and operational efficiency, as well as the inherent complexities in managing diverse hardware platforms for AI deployment. The case illustrates the importance of robust monitoring and troubleshooting strategies that do not compromise user privacy.