Source URL: https://blog.cloudflare.com/cloudflare-incident-on-february-6-2025/
Source: The Cloudflare Blog
Title: Cloudflare Incident on February 6, 2025
Feedly Summary: On Thursday February 6th, we experienced an outage with our object storage service (R2) and products that rely on it. Here’s what happened and what we’re doing to fix this going forward.
AI Summary and Description: Yes
Summary: The text outlines a significant incident at Cloudflare involving downtime of the R2 object storage service due to human error. The incident emphasizes the importance of robust controls and safeguards within cloud computing environments. For security and compliance professionals, it highlights the need for improved operational validations and the implementation of defense-in-depth strategies to mitigate risks associated with human actions in cloud services.
Detailed Description:
– **Incident Overview**:
– Cloudflare’s R2 object storage service was unavailable for 59 minutes due to human error: a mistake made while remediating a phishing abuse report.
– The R2 Gateway service was inadvertently disabled, producing a 100% failure rate for all operations against R2 during the incident window (08:14 to 09:13 UTC).
– **Service Impact**:
– All dependent services, including Stream, Images, Cache Reserve, Vectorize, and Log Delivery, faced significant operational failures.
– Despite the downtime, no data loss or corruption occurred within the R2 storage subsystem.
– A full recovery was achieved after the R2 Gateway service was re-enabled and redeployed.
– **Root Cause Analysis**:
– The incident was attributed to insufficient validation safeguards and a lack of controls regarding which accounts could be affected during remediation actions.
– The failure stemmed from gaps in the internal admin tooling that allowed an operator to disable the entire R2 Gateway service rather than only the individual bucket associated with the abuse report (a minimal scoping-check sketch follows the list below).
– **Timeline of Events**:
– A detailed incident timeline shows the progression from the initial error to recovery, highlighting critical alerts, customer reports, and the escalation of incident severity.
– **Remediation Measures**:
– Immediate actions taken include deploying additional safeguards in the admin API, disabling high-risk manual actions during abuse review, and establishing more rigorous account provisioning protocols.
– Future actions focus on enhancing operational checks, requiring two-party approvals for significant changes, and transitioning internal accounts to a new organizational model to improve management and oversight (a two-party approval sketch appears at the end of this summary).
– **Key Takeaways for Security and Compliance Professionals**:
– Highlights the critical need for strong governance frameworks in cloud service management to prevent human errors from resulting in significant service disruptions.
– Emphasizes the importance of implementing a robust defense-in-depth strategy that mitigates risks from operator errors through automated safeguards and rigorous account management processes.
– Serves as a case study for refining incident response strategies and enhancing resilience in cloud infrastructure security.
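To make the scoping failure concrete, below is a minimal sketch of the kind of validation guard described in the root-cause and remediation sections: a remediation action must target exactly the bucket named in the abuse report, product-level disablement is rejected outright, and protected internal accounts can never be targeted. All type names, account identifiers, and functions here are hypothetical; this is not Cloudflare's actual admin API, only an illustration of the safeguard under those assumptions.

```typescript
// Hypothetical types and names for illustration only; not Cloudflare's admin API.
type RemediationAction =
  | { kind: "disable_bucket"; accountId: string; bucket: string }
  | { kind: "disable_product"; accountId: string; product: string };

interface AbuseReport {
  reportId: string;
  accountId: string;
  bucket: string; // the specific bucket named in the phishing report
}

// Accounts that run internal infrastructure (e.g. a gateway service) and must
// never be targeted by routine abuse remediation. Identifiers are invented.
const PROTECTED_ACCOUNTS = new Set<string>([
  "internal-r2-gateway",
  "internal-core-services",
]);

function validateRemediation(action: RemediationAction, report: AbuseReport): void {
  // Guard 1: product-level disablement is never a valid response to a single
  // abuse report; it must go through a separate, approved change workflow.
  if (action.kind === "disable_product") {
    throw new Error(
      `Product-level disable (${action.product}) is blocked in the abuse workflow; ` +
        `escalate via the change-approval process instead.`
    );
  }

  // Guard 2: the action must target exactly the account and bucket in the report.
  if (action.accountId !== report.accountId || action.bucket !== report.bucket) {
    throw new Error(
      `Action targets ${action.accountId}/${action.bucket}, but report ` +
        `${report.reportId} names ${report.accountId}/${report.bucket}.`
    );
  }

  // Guard 3: protected internal accounts are off-limits regardless of the report.
  if (PROTECTED_ACCOUNTS.has(action.accountId)) {
    throw new Error(`Account ${action.accountId} is a protected internal account.`);
  }
}
```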
This incident serves as a valuable lesson for organizations leveraging cloud services, underscoring the necessity of comprehensive control measures, proper training, and validation processes to safeguard against potential operational failures.
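Similarly, the two-party approval requirement called out under future remediation work can be illustrated with a small, hypothetical gate: one operator records a high-risk action (such as disabling a product), and it can only be executed after a different operator approves it. The class, method names, and change identifiers below are invented for illustration and do not reflect Cloudflare's internal tooling.

```typescript
// Hypothetical two-party approval gate; names and in-memory storage are illustrative.
interface ApprovalRequest {
  id: string;
  action: string; // e.g. "disable_product:r2-gateway"
  requestedBy: string;
  approvedBy?: string;
}

class TwoPartyApprovalGate {
  private pending = new Map<string, ApprovalRequest>();

  // The operator proposing a high-risk change records it but cannot execute it.
  request(id: string, action: string, requestedBy: string): void {
    this.pending.set(id, { id, action, requestedBy });
  }

  // A second, different operator must approve before execution is allowed.
  approve(id: string, approvedBy: string): void {
    const req = this.pending.get(id);
    if (!req) throw new Error(`No pending request ${id}`);
    if (approvedBy === req.requestedBy) {
      throw new Error("Approver must be different from the requester");
    }
    req.approvedBy = approvedBy;
  }

  // Execution only proceeds once both parties have signed off.
  execute(id: string, run: () => void): void {
    const req = this.pending.get(id);
    if (!req?.approvedBy) {
      throw new Error(`Request ${id} has not been approved by a second operator`);
    }
    run();
    this.pending.delete(id);
  }
}

// Usage sketch: a product-level disable requires two distinct operators.
const gate = new TwoPartyApprovalGate();
gate.request("chg-42", "disable_product:r2-gateway", "operator-a");
gate.approve("chg-42", "operator-b");
gate.execute("chg-42", () => console.log("high-risk action executed"));
```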