Source URL: https://blog.cloudflare.com/cloudflare-incident-march-21-2025/
Source: The Cloudflare Blog
Title: Cloudflare incident on March 21, 2025
Feedly Summary: On March 21, 2025, multiple Cloudflare services, including R2 object storage experienced an elevated rate of error responses. Here’s what caused the incident, the impact, and how we are making sure it
AI Summary and Description: Yes
Summary: The incident described involves a major lapse in operational security during a credential rotation for Cloudflare’s R2 object storage service. A human error caused a 100% failure rate for R2 write operations and approximately a 35% failure rate for reads, degrading multiple dependent Cloudflare services. No data was lost, but the incident highlights how limited visibility and human factors in routine credential management can cascade across cloud infrastructure.
Detailed Description:
The text provides a comprehensive account of a significant operational failure related to Cloudflare’s R2 object storage. The incident encapsulates several critical security and compliance themes, particularly surrounding credential management and operational practices in a cloud environment. Here are the major points of analysis:
– **Incident Overview**:
  – The failure occurred due to a mistake during the credential rotation process where new API credentials were deployed to the wrong environment (development instead of production).
  – This led to a total failure of write operations and significant failures in read operations for R2, impacting multiple Cloudflare services.
– **Services Affected**:
  – **R2**: 100% error rate on writes, approximately 35% error rate on reads.
  – **Billing**: Customers faced issues accessing invoices.
  – **Cache Reserve**: Increased requests to origins due to read failures.
  – **Email Security**: Metrics updates were delayed.
  – **Images and Stream**: Failed uploads and degraded delivery of stored content.
– **Incident Timeline**:
  – The post documents the sequence of events with clear timestamps for each action taken, the identification of the root cause, and the resolution steps, illustrating the troubleshooting process.
– **Root Cause Analysis**:
  – The incident was attributed to human error (omission of an environment parameter during deployment) and a lack of visibility into which credentials were active.
  – Emphasizes the importance of validating deployment environments during sensitive operations like credential rotation; a hedged illustration of such an environment check follows this list.
– **Resolution and Next Steps**:
  – Immediate recovery consisted of redeploying the updated credentials to the correct (production) environment.
  – Long-term strategies to prevent recurrence focus on enhanced logging, automated health checks for new keys, and procedural changes mandating multi-person validation for critical changes (see the credential health-check sketch after this list).
– **Implications for Security Professionals**:
  – **Credential Management**: Highlights the vulnerabilities associated with manual credential management processes and their dependence on human diligence.
  – **Operational Resilience**: Underscores the necessity of robust logging, monitoring, and multi-layered validation in change management protocols to ensure service continuity and security.
  – **Cloud Infrastructure**: The case offers important lessons for cloud security, emphasizing the need for stringent operational controls and the potential for significant disruption from seemingly small oversights.
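The root-cause item above centers on a missing environment parameter during deployment. As a purely illustrative sketch (hypothetical helper names, not Cloudflare’s actual tooling), the snippet below shows the kind of guard that forces the target environment to be named explicitly and confirmed before a rotated credential is pushed anywhere:

```python
# Hypothetical guard illustrating explicit environment selection during
# credential rotation; not Cloudflare's internal tooling.
VALID_ENVIRONMENTS = {"development", "staging", "production"}


def deploy_rotated_credential(key_id: str, environment: str) -> None:
    """Push a rotated credential only to an explicitly named environment."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {environment!r}")
    # Require a typed confirmation so a dev/prod mix-up is caught
    # before the new key goes live.
    answer = input(f"Deploy credential {key_id} to {environment}? "
                   f"Type the environment name to confirm: ")
    if answer.strip() != environment:
        raise SystemExit("Confirmation did not match; aborting deployment.")
    # ... call the real secret-store / deployment API here ...
    print(f"Credential {key_id} deployed to {environment}.")
```

The point is simply that the target environment is never implied by a default; it must be stated and confirmed.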
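The next-steps item also mentions automated health checks for new keys. Because R2 exposes an S3-compatible API, one way to sketch such a check (an assumption on my part, not the blog’s stated implementation) is to exercise the new credential end to end against the production bucket before the old credential is revoked; the endpoint, bucket, and object names below are placeholders:

```python
"""Hedged sketch of a post-rotation health check against an S3-compatible
endpoint (R2 speaks the S3 API); endpoint and bucket are placeholders."""
import uuid

import boto3


def verify_new_key(endpoint_url: str, access_key_id: str,
                   secret_access_key: str, bucket: str) -> bool:
    """Exercise write/read/delete with the NEW credential and report success."""
    client = boto3.client(
        "s3",
        endpoint_url=endpoint_url,            # e.g. the account's R2 endpoint
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    canary_key = f"credential-rotation-canary/{uuid.uuid4()}"
    try:
        client.put_object(Bucket=bucket, Key=canary_key, Body=b"rotation check")
        body = client.get_object(Bucket=bucket, Key=canary_key)["Body"].read()
        client.delete_object(Bucket=bucket, Key=canary_key)
        return body == b"rotation check"
    except Exception:
        return False


# Only decommission the previous credential once this returns True against
# the *production* endpoint, never against a development environment.
```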
Overall, this incident serves as a reminder of the complexity of operating cloud services and of the importance of rigorous operational security practices. It is a critical learning point for security and compliance professionals working within or with cloud infrastructure.