Source URL: https://blog.cloudflare.com/deep-dive-into-cloudflares-sept-12-dashboard-and-api-outage/
Source: The Cloudflare Blog
Title: A deep dive into Cloudflare’s September 12, 2025 dashboard and API outage
Feedly Summary: Cloudflare’s Dashboard and a set of related APIs were unavailable or partially available for an hour starting on Sep 12, 17:57 UTC. The outage did not affect the serving of cached files via the Cloudflare CDN.
AI Summary and Description: Yes
**Summary:**
The text discusses a significant outage caused by a bug in the Cloudflare Dashboard that affected the Tenant Service API and other associated APIs. The incident illustrates the complexities involved in API management and highlights the importance of effective incident response and resource management in cloud computing environments. Professionals in AI, cloud, and infrastructure security may find the incident’s analysis and proposed improvements valuable for enhancing system resilience.
**Detailed Description:**
The document details a recent incident involving the Cloudflare Dashboard and the Tenant Service API, pinpointing a dashboard bug as the root cause of a widespread outage. Here are the key points:
– **Incident Overview:**
– A bug in the dashboard’s React code caused excessive, unnecessary calls to the Tenant Service API.
– The bug stemmed from an object being recreated on every render inside a useEffect hook’s dependency array; because the object’s identity changed on each render, the effect kept re-running and flooded the Tenant Service API with requests (see the sketch below this list).
– Because the Tenant Service participates in API request authorization, the overload caused authorization failures and widespread 5xx errors.
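The failure mode described above is a common React pitfall. Below is a minimal, hypothetical sketch (the component name and endpoint are invented for illustration, not Cloudflare’s actual dashboard code) of how an object literal in a useEffect dependency array can cause an unbounded request loop:

```tsx
import { useEffect, useState } from "react";

// Hypothetical component; illustrates the failure mode only.
function TenantInfo({ accountId }: { accountId: string }) {
  const [tenant, setTenant] = useState<unknown>(null);

  // BUG: a fresh object is created on every render, so it never compares
  // equal to the previous value in the dependency array.
  const query = { accountId };

  useEffect(() => {
    fetch(`/api/v4/tenants?account=${query.accountId}`)
      .then((res) => res.json())
      // setTenant triggers a re-render, which recreates `query`,
      // which re-runs the effect: an unbounded request loop.
      .then(setTenant);
  }, [query]); // FIX: depend on the primitive instead, e.g. [accountId]

  return <pre>{JSON.stringify(tenant)}</pre>;
}
```

Depending on the primitive `accountId` (or memoizing the object with `useMemo`) keeps the dependency stable across renders and stops the loop.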
– **Timeline of Events:**
– Highlights specific timestamps, including dashboard version releases and API service deployments, to track how the incident escalated.
– Marks when availability dropped and when it was restored (e.g., dashboard availability plummeting and then recovering after each intervention).
– **Response Mechanism:**
– The immediate focus was on restoring service by scaling up Tenant Service resources and applying rate limits to manage the load (a rate-limiting sketch follows this list).
– Some of the temporary fixes proved ineffective and had to be reverted to restore overall system health.
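As an illustration of the rate-limiting half of that response, here is a minimal sketch of shedding excess load in front of an overloaded upstream. It assumes a plain Node.js HTTP server; the window size, limit, and port are illustrative assumptions, not Cloudflare’s actual configuration:

```ts
import http from "node:http";

const WINDOW_MS = 1_000;   // fixed 1-second window (illustrative)
const MAX_REQUESTS = 100;  // shed anything beyond this per window (illustrative)

let windowStart = Date.now();
let count = 0;

const server = http.createServer((_req, res) => {
  const now = Date.now();
  if (now - windowStart >= WINDOW_MS) {
    windowStart = now; // start a new window
    count = 0;
  }
  if (++count > MAX_REQUESTS) {
    // 429 tells well-behaved clients to back off instead of piling on.
    res.writeHead(429, { "Retry-After": "1" });
    res.end("rate limited");
    return;
  }
  // ...forward the request to the protected upstream here...
  res.writeHead(200);
  res.end("ok");
});

server.listen(8080);
```

Even a crude limit like this keeps the protected service within its capacity so it can recover, at the cost of rejecting some requests with 429.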
– **Lessons Learned:**
– Emphasizes robust deployment tooling, such as Argo Rollouts, which can automatically roll back a failing release and shorten outages.
– Identifies the “thundering herd” phenomenon, in which clients retry en masse the moment service is restored, pushing the recovering service back over capacity.
– Proposed mitigations include adding random delays (jitter) to retries so that recovering services are not hit all at once, as sketched below.
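A common way to implement that mitigation is exponential backoff with random jitter on the client. The sketch below is illustrative only; the caps and attempt count are assumptions, not Cloudflare’s retry policy:

```ts
// Retry a request with exponential backoff plus 0-1s of random jitter,
// so recovering clients do not all retry at the same instant.
async function fetchWithBackoff(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url);
    if (res.status < 500) return res; // success or a non-retryable client error

    const backoffMs = Math.min(1_000 * 2 ** attempt, 8_000) + Math.random() * 1_000;
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
  throw new Error(`giving up on ${url} after ${maxAttempts} attempts`);
}
```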
– **Future Improvements:**
– Improvements focus on observability that can distinguish retries from new requests, making request storms easier to diagnose and head off (see the sketch after this list).
– Plans to adjust the system architecture so that resource allocation and proactive monitoring can absorb demand surges more effectively.
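One simple way to make retries distinguishable from new requests is to label each retry on the client so that server-side metrics can count retry traffic separately. The sketch below assumes a client-side wrapper and a hypothetical `x-retry-attempt` header; neither is an existing Cloudflare convention:

```ts
// Tag retried requests so server-side dashboards can separate retry
// volume from organic traffic. The header name is a hypothetical convention.
async function fetchLabeled(url: string, attempt: number): Promise<Response> {
  const headers: Record<string, string> =
    attempt > 0 ? { "x-retry-attempt": String(attempt) } : {};
  return fetch(url, { headers });
}
```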
Overall, the incident underscores the complexity of cloud API management and the need for rigorous testing, well-designed recovery mechanisms, and proactive monitoring in cloud and infrastructure environments. Implementing the proposed improvements should make the affected services markedly more resilient to similar failures in the future.