Source URL: https://blog.cloudflare.com/cloudflare-incident-on-august-21-2025/
Source: The Cloudflare Blog
Title: Cloudflare incident on August 21, 2025
Feedly Summary: On August 21, 2025, an influx of traffic directed toward clients hosted in AWS us-east-1 caused severe congestion on links between Cloudflare and us-east-1. In this post, we explain the details.
AI Summary and Description: Yes
Summary: The incident detailed in the text highlights a network congestion issue between Cloudflare and AWS us-east-1 caused by a sudden surge of requests from a single customer. This incident underscores the importance of robust network management and architecture to prevent such disturbances in cloud infrastructure.
Detailed Description:
The text describes a significant congestion event on August 21, 2025, that affected customers connected to Cloudflare via AWS us-east-1. The major points of the incident are:
– **Incident Overview**:
  – A surge in traffic from one customer caused severe congestion between Cloudflare and AWS, impacting users with high latency, packet loss, and connection failures.
  – The congestion commenced at 16:27 UTC and was alleviated by 19:38 UTC.
  – The issue was not an attack but was due to excessive legitimate traffic.
– **Causes of Congestion**:
  – Excessive requests from a single customer saturated several peering links between Cloudflare and AWS, which did not have enough capacity to absorb the surge.
  – AWS's withdrawal of certain BGP advertisements exacerbated the issue: traffic was rerouted onto alternative paths, which in turn became congested (a toy model of this reroute effect appears after this list).
– **Response Actions**:
  – Cloudflare's incident team worked closely with AWS to manage the surge and restore normal service.
  – Rate limiting was employed to reduce the congestion, along with additional engineering actions to mitigate the impact (a simplified per-client rate-limiting sketch appears after this list).
– **Timeline of Events**:
  – The incident timeline provides granular detail on the sequence of events, including when traffic surged, when AWS began withdrawing BGP prefixes, and the subsequent response actions taken by Cloudflare and AWS.
– **Remediations and Future Actions**:
  – The incident prompted Cloudflare to develop strategies for better isolation of customer traffic, so that one customer's spikes do not affect others.
  – Planned actions include expanding network capacity through Data Center Interconnect upgrades, building mechanisms to deprioritize congestion-causing traffic, and a longer-term traffic-management strategy to ensure fair resource allocation (a fair-allocation sketch appears after this list).
– **Implications for Security and Compliance Professionals**:
  – The incident illustrates the critical need for robust network management practices in cloud environments to maintain service quality.
  – It highlights the risk posed by single points of congestion in distributed cloud infrastructures, which can create compliance issues when customers experience degraded service levels.
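To make the reroute effect described under "Causes of Congestion" concrete, here is a toy capacity model in Go. All names and figures (the 400 Gbps direct interconnect, the 300 Gbps of alternate capacity, the 600 Gbps surge) are hypothetical illustrations, not values from the incident report; the point is only that withdrawing prefixes from a saturated direct path moves the overload onto alternate paths rather than removing it.

```go
package main

import "fmt"

// path models a set of links between Cloudflare and AWS us-east-1.
// All numbers are illustrative, not actual interconnect capacities.
type path struct {
	name     string
	capacity float64 // Gbps
	load     float64 // Gbps currently routed over this path
}

func report(label string, paths []path) {
	fmt.Println(label)
	for _, p := range paths {
		fmt.Printf("  %-16s %5.0f / %5.0f Gbps (%.0f%% utilized)\n",
			p.name, p.load, p.capacity, 100*p.load/p.capacity)
	}
}

func main() {
	// Before withdrawal: a traffic surge saturates the direct peering links.
	direct := path{name: "direct PNI", capacity: 400, load: 600}
	indirect := path{name: "alternate paths", capacity: 300, load: 100}
	report("Before BGP withdrawal:", []path{direct, indirect})

	// After withdrawal: the prefixes are no longer reachable via the direct
	// interconnect, so its entire load shifts onto the alternate paths,
	// which now carry more traffic than they can absorb.
	indirect.load += direct.load
	direct.load = 0
	report("After BGP withdrawal:", []path{direct, indirect})
}
```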
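The "Response Actions" item mentions rate limiting. As a minimal sketch of the general technique, and not Cloudflare's actual implementation, the Go program below keeps a token bucket per client identifier and rejects (or would deprioritize) requests once a client exceeds its allotted rate; the rate and burst values are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket is a simple token bucket: tokens refill at `rate` per second up
// to `burst`, and each admitted request consumes one token.
type bucket struct {
	tokens float64
	last   time.Time
}

// Limiter tracks one bucket per client. The per-client rate and burst
// values used below are illustrative defaults, not figures from the incident.
type Limiter struct {
	mu      sync.Mutex
	rate    float64 // tokens added per second
	burst   float64 // maximum bucket size
	clients map[string]*bucket
}

func NewLimiter(rate, burst float64) *Limiter {
	return &Limiter{rate: rate, burst: burst, clients: make(map[string]*bucket)}
}

// Allow reports whether a request from the given client should be served
// now, or shed/deprioritized because the client is over its rate.
func (l *Limiter) Allow(client string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	b, ok := l.clients[client]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.clients[client] = b
	}
	// Refill tokens for the time elapsed since the last request.
	b.tokens += now.Sub(b.last).Seconds() * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now

	if b.tokens < 1 {
		return false // over the limit: reject or queue at lower priority
	}
	b.tokens--
	return true
}

func main() {
	l := NewLimiter(5, 10) // 5 requests/second, burst of 10, per client
	for i := 0; i < 15; i++ {
		fmt.Printf("request %2d from client-a allowed: %v\n", i, l.Allow("client-a"))
	}
}
```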
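The deprioritization and fair-allocation remediation can be illustrated with a max-min fair-share computation: customers below an equal share keep their full demand, while a surging customer is capped at whatever capacity remains. The function name, customer labels, and Gbps figures below are hypothetical; this is a sketch of the fairness idea, not a description of Cloudflare's planned mechanism.

```go
package main

import (
	"fmt"
	"sort"
)

// fairShare computes a max-min fair allocation of `capacity` across the
// given per-customer demands: customers asking for less than an equal
// share keep their full demand, and leftover capacity is split among the
// heavier customers. Values are illustrative (e.g. Gbps of egress).
func fairShare(demands map[string]float64, capacity float64) map[string]float64 {
	// Sort customers by ascending demand so small demands are satisfied first.
	names := make([]string, 0, len(demands))
	for name := range demands {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool { return demands[names[i]] < demands[names[j]] })

	alloc := make(map[string]float64, len(demands))
	remaining := capacity
	for i, name := range names {
		share := remaining / float64(len(names)-i) // equal split of what is left
		if demands[name] <= share {
			alloc[name] = demands[name] // fully satisfied; unused share returns to the pool
		} else {
			alloc[name] = share // capped: this customer's surge is deprioritized
		}
		remaining -= alloc[name]
	}
	return alloc
}

func main() {
	// One customer surging far beyond its usual demand on a 100 Gbps link.
	demands := map[string]float64{"surging-customer": 250, "customer-b": 20, "customer-c": 30}
	for name, gbps := range fairShare(demands, 100) {
		fmt.Printf("%-17s allocated %5.1f Gbps\n", name, gbps)
	}
}
```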
In conclusion, this case is a valuable lesson in the importance of architectural measures such as redundancy, capacity planning, and proactive traffic management in cloud computing environments, and it should inform security and compliance strategies for handling unforeseen customer behavior while keeping cloud services stable.