Simon Willison’s Weblog: Anthropic: A postmortem of three recent issues

Source URL: https://simonwillison.net/2025/Sep/17/anthropic-postmortem/
Source: Simon Willison’s Weblog
Title: Anthropic: A postmortem of three recent issues

Feedly Summary: Anthropic: A postmortem of three recent issues
Anthropic had a very bad month in terms of model reliability:

Between August and early September, three infrastructure bugs intermittently degraded Claude’s response quality. We’ve now resolved these issues and want to explain what happened. […]
To state it plainly: We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone. […]
We don’t typically share this level of technical detail about our infrastructure, but the scope and complexity of these issues justified a more comprehensive explanation.

I’m really glad Anthropic are publishing this in so much detail. Their reputation for serving their models reliably has taken a notable hit.
I hadn’t appreciated the additional complexity caused by their mixture of different serving platforms:

We deploy Claude across multiple hardware platforms, namely AWS Trainium, NVIDIA GPUs, and Google TPUs. […] Each hardware platform has different characteristics and requires specific optimizations.

It sounds like the problems came down to three separate bugs which unfortunately came along very close to each other.
Anthropic also note that their privacy practices made investigating the issues particularly difficult:

The evaluations we ran simply didn’t capture the degradation users were reporting, in part because Claude often recovers well from isolated mistakes. Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, in particular when those interactions are not reported to us as feedback. This protects user privacy but prevents engineers from examining the problematic interactions needed to identify or reproduce bugs.

The code examples they provide to illustrate a TPU-specific bug show that they use Python and JAX as part of their serving layer.
Tags: python, ai, postmortem, generative-ai, llms, anthropic, claude

AI Summary and Description: Yes

Summary: This text details Anthropic’s recent challenges with their AI model, Claude, related to infrastructure bugs that impacted model reliability. The postmortem emphasizes the complexity of their hardware environments and highlights privacy practices complicating issue investigation.

Detailed Description:
The text analyzes the operational issues Anthropic faced with its AI model Claude, focusing on the three recent infrastructure bugs that affected the model’s reliability. The main points are summarized as follows:

– **Infrastructure Bugs**: Anthropic experienced three separate bugs that intermittently degraded the quality of Claude’s responses. The company states plainly that the degradation was caused by these bugs alone, not by demand, time of day, or server load.

– **Technical Complexity**: The company operates Claude across multiple hardware platforms, including:
  – AWS Trainium
  – NVIDIA GPUs
  – Google TPUs
  Each platform has different characteristics and requires platform-specific optimizations, which adds to the complexity of maintaining consistent model quality.

– **Privacy Practices**: The investigation was hindered by internal privacy and security controls, which limit how and when engineers can access user interactions with Claude unless those interactions are reported as feedback. This emphasis on privacy, while essential for protecting user data, can complicate troubleshooting for operational teams.

– **Postmortem Insight**: Publishing this detailed report is an effort to regain user confidence following the reliability issues. By sharing the technical insights and challenges encountered during the incident, Anthropic aims to improve transparency and repair the reputational damage.

– **Technical Implementation**: The code examples Anthropic shared to illustrate a TPU-specific bug show that Python and JAX form part of their serving stack.
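The excerpt doesn’t reproduce the bug itself, but the general failure mode behind this class of TPU precision issue is easy to illustrate: when nearly-tied logits are computed or compared at reduced precision, greedy (temperature-0) token selection can flip. A minimal NumPy sketch, using float16 as a stand-in (stock NumPy has no bfloat16) and invented values that are illustrative, not Anthropic’s:

```python
import numpy as np

# Two candidate-token logits that are nearly tied.
logits_fp32 = np.array([1.0001, 1.0002], dtype=np.float32)

# At full precision, greedy (temperature-0) selection picks token 1.
assert int(np.argmax(logits_fp32)) == 1

# Cast to half precision: fp16 spacing near 1.0 is 2**-10 ≈ 0.001,
# so both values round to exactly 1.0 and the tie resolves to token 0.
logits_fp16 = logits_fp32.astype(np.float16)
assert int(np.argmax(logits_fp16)) == 0
```

The same mechanism applies whenever different hardware backends (or mixed bf16/fp32 code paths) round intermediate values differently: outputs stay plausible, so the regression is hard to catch with standard evaluations.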

This analysis is particularly relevant for professionals in AI, cloud, and infrastructure security as it addresses the delicate balance between privacy practices and operational efficiency, as well as the inherent complexities in managing diverse hardware platforms for AI deployment. The case illustrates the importance of robust monitoring and troubleshooting strategies that do not compromise user privacy.