Slashdot: Google Cloud Caused Outage By Ignoring Its Usual Code Quality Protections

Source URL: https://tech.slashdot.org/story/25/06/16/2141250/google-cloud-caused-outage-by-ignoring-its-usual-code-quality-protections?utm_source=rss1.0mainlinkanon&utm_medium=feed
Source: Slashdot
Title: Google Cloud Caused Outage By Ignoring Its Usual Code Quality Protections

AI Summary and Description: Yes

Summary: The text details a major Google Cloud outage caused by a flawed update to its Service Control system, highlighting critical gaps in error handling and the absence of feature flag protection. The incident underscores the importance of robust testing and error management in cloud infrastructure, and is particularly relevant for professionals working in cloud computing and infrastructure security.

Detailed Description:
The incident described in the text revolves around a significant outage experienced by Google Cloud, which was traced back to a problematic code update in its Service Control system. This event illustrates several key aspects pertinent to cloud computing and infrastructure security:

– **Flawed Code Update**: The outage was triggered by a new feature added to the Service Control system for enforcing additional quota policy checks.
– **Missing Error Handling**: The update lacked appropriate error handling, so a null pointer dereference sent Service Control into a crash loop (a minimal defensive-handling sketch follows this list).
– **Lack of Feature Flag Protection**: The update lacked feature flag protection, which is critical for safely deploying new features. If it had been protected by feature flags, the problematic code would likely have been caught during staging.
– **Global Impact**: The failure escalated globally, affecting Google Cloud's regional deployments and illustrating how a single code update can ripple across a distributed system.
– **Response and Recovery**: Google's Site Reliability Engineering (SRE) team detected and triaged the incident quickly and identified the root cause promptly, but recovery was complicated by a "herd effect" in which simultaneously restarting tasks overloaded the underlying infrastructure (a backoff sketch also follows this list).
– **Operational Changes**: In response, Google has committed to enhancing its operational protocols, including better communication with customers and making its monitoring and communication infrastructure more resilient.
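
To make the error-handling point concrete, here is a minimal, hypothetical Go sketch of the kind of defensive check that keeps a record with missing fields from crashing a serving task. The `QuotaPolicy` type and its field names are assumptions for illustration, not Google's actual Service Control code.

```go
// Hypothetical sketch: validate a quota policy record before use, so a
// record with missing fields returns an error instead of crashing the task.
package main

import (
	"errors"
	"fmt"
)

// QuotaPolicy is a stand-in for the kind of policy record a quota-check
// path might read; the field names are illustrative, not a real schema.
type QuotaPolicy struct {
	Project *string
	Limit   *int64
}

// checkQuota fails gracefully on incomplete policies rather than
// dereferencing nil pointers.
func checkQuota(p *QuotaPolicy) error {
	if p == nil || p.Project == nil || p.Limit == nil {
		// Reject the bad record; the task keeps serving other requests
		// instead of entering a crash loop.
		return errors.New("quota policy is missing required fields")
	}
	fmt.Printf("enforcing limit %d for project %s\n", *p.Limit, *p.Project)
	return nil
}

func main() {
	// A policy with blank fields is handled as an error, not a crash.
	if err := checkQuota(&QuotaPolicy{}); err != nil {
		fmt.Println("skipping malformed policy:", err)
	}
}
```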
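The "herd effect" mentioned above is commonly mitigated with randomized exponential backoff, so that many restarting clients do not retry in lockstep against shared infrastructure. The sketch below is a generic illustration of that technique under assumed parameters, not Google's recovery tooling.

```go
// Generic sketch of randomized exponential backoff: retries are spread out
// with growing, jittered delays so recovering clients do not all hit the
// backend at the same moment. Parameters here are illustrative.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op with exponentially growing, jittered delays.
func retryWithBackoff(op func() error, maxAttempts int) error {
	base := 100 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := op(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration up to base * 2^attempt.
		sleep := time.Duration(rand.Int63n(int64(base << attempt)))
		time.Sleep(sleep)
	}
	return errors.New("operation failed after retries")
}

func main() {
	calls := 0
	err := retryWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("backend still overloaded")
		}
		return nil
	}, 5)
	fmt.Println("attempts:", calls, "error:", err)
}
```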

Moving forward, this incident serves as a crucial reminder of the following considerations for professionals in cloud computing and infrastructure security:

– Emphasize comprehensive testing and error handling mechanisms within code updates.
– Implement feature flagging so new code paths can be rolled out gradually and disabled quickly (see the flag-gating sketch after this list).
– Improve incident response protocols to expedite recovery and minimize downtime.
– Strengthen communication strategies with stakeholders during outages to ensure timely updates and maintain operational continuity.
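
As a rough illustration of the flag-gating recommendation above, the following Go sketch routes traffic through a new code path only when a flag is enabled; the flag name and lookup mechanism are hypothetical stand-ins for whatever flag service is in use.

```go
// Minimal sketch of feature-flag gating: the new quota-check path runs only
// where the flag is enabled, so a staged rollout can catch a bad code path
// before it reaches every region. All names are illustrative.
package main

import "fmt"

// flags stands in for a real flag service or rollout configuration.
var flags = map[string]bool{
	"enable_extra_quota_checks": false, // default off; enabled per environment
}

func enforceQuota(project string) {
	if flags["enable_extra_quota_checks"] {
		// New, riskier code path runs only where the flag is on.
		fmt.Println("running additional quota policy checks for", project)
		return
	}
	// Existing, well-tested behavior remains the default.
	fmt.Println("running baseline quota checks for", project)
}

func main() {
	enforceQuota("example-project") // baseline path
	flags["enable_extra_quota_checks"] = true
	enforceQuota("example-project") // new path, enabled only after staging
}
```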

In conclusion, this incident not only exposes specific weaknesses in cloud infrastructure operations but also emphasizes the need for robust engineering and security practices to prevent similar occurrences in the future.