Source URL: https://blog.cloudflare.com/deep-dive-into-cloudflares-sept-12-dashboard-and-api-outage/
Source: The Cloudflare Blog
Title: A deep dive into Cloudflare’s September 12, 2025 dashboard and API outage
Feedly Summary: Cloudflare’s Dashboard and a set of related APIs were unavailable or partially available for an hour starting on Sep 12, 17:57 UTC. The outage did not affect the serving of cached files via the Cloudflare CDN.
AI Summary and Description: Yes
**Summary:**
The text discusses a significant outage caused by a bug in the Cloudflare Dashboard that affected the Tenant Service API and other associated APIs. The incident illustrates the complexities involved in API management and highlights the importance of effective incident response and resource management in cloud computing environments. Professionals in AI, cloud, and infrastructure security may find the incident’s analysis and proposed improvements valuable for enhancing system resilience.
**Detailed Description:**
The document details a recent incident involving the Cloudflare Dashboard and the Tenant Service API, pinpointing a dashboard bug as the root cause of a widespread outage. Here are the key points:
– **Incident Overview:**
– A bug in the dashboard’s React code caused excessive, unnecessary calls to the Tenant Service API.
– The bug stemmed from an object being recreated on every render inside a useEffect hook’s dependency array; because the object’s identity changed on each render, the effect kept re-running and flooded the Tenant Service API with requests (see the sketch below this list).
– Because the Tenant Service participates in API request authorization, the overload caused authorization failures and widespread 5xx errors.
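The failure mode described above is a common React pitfall. Below is a minimal, hypothetical sketch (the component name and endpoint are invented for illustration, not Cloudflare’s actual dashboard code) of how an object literal in a useEffect dependency array can cause an unbounded request loop:

```tsx
import { useEffect, useState } from "react";

// Hypothetical component; illustrates the failure mode only.
function TenantInfo({ accountId }: { accountId: string }) {
  const [tenant, setTenant] = useState<unknown>(null);

  // BUG: a fresh object is created on every render, so it never compares
  // equal to the previous value in the dependency array.
  const query = { accountId };

  useEffect(() => {
    fetch(`/api/v4/tenants?account=${query.accountId}`)
      .then((res) => res.json())
      // setTenant triggers a re-render, which recreates `query`,
      // which re-runs the effect: an unbounded request loop.
      .then(setTenant);
  }, [query]); // FIX: depend on the primitive instead, e.g. [accountId]

  return <pre>{JSON.stringify(tenant)}</pre>;
}
```

Depending on the primitive `accountId` (or memoizing the object with `useMemo`) keeps the dependency stable across renders and stops the loop.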
– **Timeline of Events:**
– Highlights specific timestamps, including dashboard version releases and API service deployments, to track how the incident escalated.
– Marks when availability dropped and when it was restored (e.g., dashboard availability plummeting and then recovering after each intervention).
– **Response Mechanism:**
– The immediate focus was on restoring service by scaling up Tenant Service resources and applying rate limits to manage the load (a rate-limiting sketch follows this list).
– Some of the temporary fixes proved ineffective and had to be reverted to restore overall system health.
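As an illustration of the rate-limiting half of that response, here is a minimal sketch of shedding excess load in front of an overloaded upstream. It assumes a plain Node.js HTTP server; the window size, limit, and port are illustrative assumptions, not Cloudflare’s actual configuration:

```ts
import http from "node:http";

const WINDOW_MS = 1_000;   // fixed 1-second window (illustrative)
const MAX_REQUESTS = 100;  // shed anything beyond this per window (illustrative)

let windowStart = Date.now();
let count = 0;

const server = http.createServer((_req, res) => {
  const now = Date.now();
  if (now - windowStart >= WINDOW_MS) {
    windowStart = now; // start a new window
    count = 0;
  }
  if (++count > MAX_REQUESTS) {
    // 429 tells well-behaved clients to back off instead of piling on.
    res.writeHead(429, { "Retry-After": "1" });
    res.end("rate limited");
    return;
  }
  // ...forward the request to the protected upstream here...
  res.writeHead(200);
  res.end("ok");
});

server.listen(8080);
```

Even a crude limit like this keeps the protected service within its capacity so it can recover, at the cost of rejecting some requests with 429.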
– **Lessons Learned:**
– Emphasizes robust deployment tooling, such as Argo Rollouts, which can automatically roll back a failing release and shorten outages.
– Identifies the “thundering herd” phenomenon, in which clients retry en masse the moment service is restored, pushing the recovering service back over capacity.
– Proposed mitigations include adding random delays (jitter) to retries so that recovering services are not hit all at once, as sketched below.
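A common way to implement that mitigation is exponential backoff with random jitter on the client. The sketch below is illustrative only; the caps and attempt count are assumptions, not Cloudflare’s retry policy:

```ts
// Retry a request with exponential backoff plus 0-1s of random jitter,
// so recovering clients do not all retry at the same instant.
async function fetchWithBackoff(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url);
    if (res.status < 500) return res; // success or a non-retryable client error

    const backoffMs = Math.min(1_000 * 2 ** attempt, 8_000) + Math.random() * 1_000;
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
  throw new Error(`giving up on ${url} after ${maxAttempts} attempts`);
}
```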
– **Future Improvements:**
– Improvements focus on observability that can distinguish retries from new requests, making request storms easier to diagnose and head off (see the sketch after this list).
– Plans to adjust the system architecture so that resource allocation and proactive monitoring can absorb demand surges more effectively.
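One simple way to make retries distinguishable from new requests is to label each retry on the client so that server-side metrics can count retry traffic separately. The sketch below assumes a client-side wrapper and a hypothetical `x-retry-attempt` header; neither is an existing Cloudflare convention:

```ts
// Tag retried requests so server-side dashboards can separate retry
// volume from organic traffic. The header name is a hypothetical convention.
async function fetchLabeled(url: string, attempt: number): Promise<Response> {
  const headers: Record<string, string> =
    attempt > 0 ? { "x-retry-attempt": String(attempt) } : {};
  return fetch(url, { headers });
}
```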
Overall, the incident underscores the complexity of cloud API management and the need for rigorous testing, well-designed recovery mechanisms, and proactive monitoring in cloud and infrastructure environments. Implementing the proposed improvements should make the affected services markedly more resilient to similar failures in the future.