Simon Willison’s Weblog: Quoting Google Cloud outage incident report

Source URL: https://simonwillison.net/2025/Jun/14/google-cloud-outage-incident-report/#atom-everything
Source: Simon Willison’s Weblog
Title: Quoting Google Cloud outage incident report

Feedly Summary: Google Cloud, Google Workspace and Google Security Operations products experienced increased 503 errors in external API requests, impacting customers. […]
On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code. […] The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. […]
On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds. This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment.
— Google Cloud outage incident report
Tags: feature-flags, postmortem, google

AI Summary and Description: Yes

**Summary:** The text discusses an outage incident report related to Google Cloud services, which experienced increased API errors due to inadequate error handling in a new feature rollout. The incident highlights the significance of robust error management and feature flagging in cloud infrastructure security.

**Detailed Description:**

The text outlines a specific incident regarding Google Cloud’s service reliability and highlights key aspects that are crucial for security and compliance professionals, particularly in the realms of cloud computing security and infrastructure security.

– **Incident Overview:**
– Google Cloud, along with Google Workspace, encountered increased 503 errors tied to external API requests.
– The errors significantly impacted customer experiences, demonstrating the need for resilient cloud infrastructure.

– **Root Cause Analysis:**
– A new feature was introduced in Service Control for enhanced quota policy checks, which went through a regional release.
– The code that led to the failure was not executed during the rollout due to policy changes that hadn’t been applied.

– **Key Problems Identified:**
– **Lack of Error Handling:** The code update did not implement adequate error checking measures, which became critical during the operational phase.
– **Absence of Feature Flagging:** The newly introduced feature was not protected by feature flags, escalating the potential risk during its deployment and use.

– **Impact of the Incident:**
– On June 12, 2025, a policy change triggered a series of errors when it replicated across global servers, leading to null pointer exceptions.
– The incident resulted in a crash loop across all regional deployments, showcasing how quickly changes in policy metadata can lead to significant outages.

**Implications for Security Professionals:**
– **Importance of Error Handling:** This incident emphasizes the vital role of implementing robust error handling mechanisms within cloud services to prevent outages.
– **Feature Flagging as a Control Mechanism:** The lack of feature flags highlights a gap in deployment risk management strategies, underscoring the necessity for clear controls during changes in production environments.

In conclusion, this incident report serves as a cautionary tale for cloud service providers and security professionals about the inherent risks of inadequate planning and testing during code deployments, and the critical importance of both error handling and feature flagging in maintaining service availability and reliability.