Simon Willison’s Weblog: Quoting Google Cloud outage incident report

Jun 14, 2025

—

Source URL: https://simonwillison.net/2025/Jun/14/google-cloud-outage-incident-report/#atom-everything
Source: Simon Willison’s Weblog
Title: Quoting Google Cloud outage incident report

Feedly Summary: Google Cloud, Google Workspace and Google Security Operations products experienced increased 503 errors in external API requests, impacting customers. […]
On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code. […] The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. […]
On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds. This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment.
— Google Cloud outage incident report
Tags: feature-flags, postmortem, google


**Summary:** The post quotes Google Cloud's incident report on the June 12, 2025 outage, in which Google Cloud, Google Workspace and Google Security Operations products returned increased 503 errors for external API requests. A quota policy check added to Service Control on May 29, 2025 lacked appropriate error handling and was not feature flag protected, so a policy change containing blank fields triggered a null pointer crash loop in every region.

**Detailed Description:**

The quoted report describes a reliability failure with lessons for cloud infrastructure and security teams: a code path that was never exercised during rollout was later triggered globally by a data change.

– **Incident Overview:**
– Google Cloud, Google Workspace and Google Security Operations products experienced increased 503 errors in external API requests.
– The errors impacted customers across these products.

– **Root Cause Analysis:**
– On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. The code change and binary release went through a region-by-region rollout.
– The code path that later failed was never exercised during that rollout, because it only runs when a policy change triggers it and no such change occurred at the time.

– **Key Problems Identified:**
– **Lack of Error Handling:** The change did not have appropriate error handling, so blank fields in the policy data led to a null pointer crash instead of a handled error.
– **Absence of Feature Flagging:** The new feature was not feature flag protected, so the new code path was live in every region as soon as the binary was deployed.
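
The error-handling gap described above can be illustrated with a minimal Python sketch. Everything here is hypothetical: the `QuotaPolicy` record, its field names, and the choice to allow the request when a policy is malformed are assumptions made for illustration, not details from the incident report or Service Control's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    """Hypothetical policy record; field names are illustrative only."""
    name: Optional[str]
    limit: Optional[int]


class InvalidPolicyError(ValueError):
    """Raised when a policy record is missing required fields."""


def validate_policy(policy: QuotaPolicy) -> None:
    # Reject blank fields up front instead of dereferencing them later.
    if not policy.name:
        raise InvalidPolicyError("policy name is blank")
    if policy.limit is None:
        raise InvalidPolicyError(f"policy {policy.name!r} has no limit")


def check_quota(policy: QuotaPolicy, usage: int) -> bool:
    """Return True if the request is allowed under the policy."""
    try:
        validate_policy(policy)
    except InvalidPolicyError:
        # Assumption: a malformed policy is skipped and the request is
        # allowed, rather than letting the error crash the process.
        return True
    return usage < policy.limit
```

With this shape, a record such as `QuotaPolicy(name=None, limit=None)` is treated as a handled error rather than an unhandled one. Whether to allow or deny on a malformed policy is a design decision that depends on the service.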

– **Impact of the Incident:**
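– On June 12, 2025 at ~10:45am PDT, a policy change containing unintended blank fields was inserted into the regional Spanner tables that Service Control uses for policies, and the metadata replicated globally within seconds.
– When Service Control ran quota checks against that policy in each regional datastore, the blank fields exercised the unguarded code path, which hit a null pointer and sent the binaries into a crash loop in every regional deployment.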

**Implications for Security Professionals:**
– **Importance of Error Handling:** This incident emphasizes the vital role of implementing robust error handling mechanisms within cloud services to prevent outages.
– **Feature Flagging as a Control Mechanism:** The lack of feature flags highlights a gap in deployment risk management strategies, underscoring the necessity for clear controls during changes in production environments.
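
As a rough illustration of feature flagging as a deployment control, the following Python sketch gates a new check behind a per-region flag. The flag name, the region names, and the `handle_request` function are all hypothetical; the incident report does not describe how Google's flags are implemented.

```python
from typing import Callable, Dict, Set

# Hypothetical flag name, used only for this illustration.
NEW_CHECK_FLAG = "extra_quota_policy_checks"


def is_enabled(flags: Dict[str, Set[str]], flag: str, region: str) -> bool:
    """A flag is on for a region only if that region is listed under it."""
    return region in flags.get(flag, set())


def handle_request(
    region: str,
    flags: Dict[str, Set[str]],
    existing_check: Callable[[], bool],
    new_check: Callable[[], bool],
) -> bool:
    """Run the existing check always; run the new check only where flagged on."""
    allowed = existing_check()
    if is_enabled(flags, NEW_CHECK_FLAG, region):
        # The new code path executes only in regions that opted in, so a
        # defect in it can be contained by removing the region from the flag.
        allowed = allowed and new_check()
    return allowed
```

Under this sketch, a defect in `new_check` affects only the regions listed under the flag, and removing a region from the set turns the new path off there without a new binary release.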

In conclusion, this incident report serves as a cautionary tale for cloud service providers and security professionals about the inherent risks of inadequate planning and testing during code deployments, and the critical importance of both error handling and feature flagging in maintaining service availability and reliability.
