The Cloudflare Blog: Cloudflare’s perspective of the October 30 OVHcloud outage

Source URL: https://blog.cloudflare.com/cloudflare-perspective-of-the-october-30-2024-ovhcloud-outage
Source: The Cloudflare Blog
Title: Cloudflare’s perspective of the October 30 OVHcloud outage

Feedly Summary: On October 30, 2024, cloud hosting provider OVHcloud (AS16276) suffered a brief but significant outage. Within this post, we review Cloudflare’s perspective on this outage.

AI Summary and Description: Yes

Summary: The text details a significant outage experienced by OVHcloud on October 30, 2024, which was caused by a misconfigured network route pushed by a peering partner. Key insights include the operational impact on traffic routing to Cloudflare and recommendations for mitigating similar routing incidents in the future, emphasizing the importance of BGP controls and practices for infrastructure security.

Detailed Description:

– **Incident Overview**:
  – On October 30, 2024, OVHcloud encountered a brief outage starting at 13:23 UTC and lasting 17 minutes, impacting its backbone infrastructure and affecting traffic routing to Cloudflare.
  – Traffic levels between OVHcloud and Cloudflare dropped sharply (by roughly 95%) just before the incident was logged, and recovery was observed shortly after.

– **Traffic Dynamics**:
  – The majority of traffic between OVHcloud and Cloudflare normally flows over private peering; during the outage, however, traffic shifted to transit links and was rerouted through a single Internet Exchange point in Amsterdam.
  – This shift points to a problem in BGP routing, likely caused by a manual configuration error (a monitoring sketch for detecting such path shifts follows this list).
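
To make the traffic-dynamics point concrete, the minimal sketch below shows how such a path shift could be detected from per-interface traffic counters. This is illustrative only and does not reflect Cloudflare's actual tooling; the interface categories, sample format, and alert threshold are all assumptions.

```python
from collections import defaultdict

# Hypothetical traffic samples: (interface_type, bytes) per measurement interval.
# interface_type is one of "pni" (private peering), "ixp", or "transit".

def peering_share(samples):
    """Return the fraction of total bytes carried over private peering (PNI)."""
    totals = defaultdict(int)
    for interface_type, byte_count in samples:
        totals[interface_type] += byte_count
    total = sum(totals.values())
    return totals["pni"] / total if total else 0.0

def detect_path_shift(baseline_samples, current_samples, drop_threshold=0.5):
    """Alert when the PNI share falls below `drop_threshold` of its baseline,
    i.e. when traffic that normally rides private peering suddenly moves to
    transit or a single IXP, as observed during the OVHcloud incident."""
    baseline = peering_share(baseline_samples)
    current = peering_share(current_samples)
    if baseline > 0 and current < baseline * drop_threshold:
        return f"ALERT: PNI share dropped from {baseline:.0%} to {current:.0%}"
    return "ok"

if __name__ == "__main__":
    baseline = [("pni", 900), ("ixp", 50), ("transit", 50)]    # normal day
    incident = [("pni", 100), ("ixp", 700), ("transit", 200)]  # during the shift
    print(detect_path_shift(baseline, incident))
```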

– **Cause of the Outage**:
  – OVHcloud’s postmortem identified a network configuration error by one of its peering partners as the cause of the incident.
  – The incident highlighted the risk of a BGP route leak, in which a peering partner advertises far more prefixes than expected and overwhelms network capacity.

– **BGP Monitoring and Recovery**:
  – Cloudflare closely monitors edge routing updates and enforces prefix limits on its peering sessions, which helped prevent a wider impact on its network during the incident.
  – Evidence indicated that maximum prefix-limit thresholds were breached, automatically shutting down the affected BGP sessions; once the anomalous routes were withdrawn, the sessions were quickly re-established (see the sketch after this list).
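
The sketch below models the maximum prefix-limit behavior described above in Python rather than actual router configuration; the peer ASN, limit value, and session handling are hypothetical and only illustrate the control logic. Real routers implement this natively as a per-neighbor setting.

```python
class BgpSession:
    """Toy model of a BGP peering session with a maximum prefix limit.

    Once a peer advertises more prefixes than expected (as in a route leak),
    the session is torn down rather than accepting the excess routes.
    """

    def __init__(self, peer_asn, max_prefixes):
        self.peer_asn = peer_asn
        self.max_prefixes = max_prefixes  # threshold chosen by the operator
        self.prefixes = set()
        self.established = True

    def receive_announcement(self, prefix):
        if not self.established:
            return "session down"
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Exceeding the limit shuts the session, containing the leak.
            self.established = False
            self.prefixes.clear()
            return f"max-prefix limit exceeded: session to AS{self.peer_asn} shut down"
        return "accepted"

if __name__ == "__main__":
    # Hypothetical private-use ASN and deliberately tiny limit for illustration.
    session = BgpSession(peer_asn=64500, max_prefixes=3)
    for i in range(5):
        print(session.receive_announcement(f"203.0.113.{i * 8}/29"))
```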

– **Best Practices and Recommendations**:
  – The text emphasizes measures to prevent BGP route leaks:
    – Setting maximum prefix limits on peering sessions so that BGP sessions are automatically shut down when the limit is exceeded.
    – Using Resource Public Key Infrastructure (RPKI), together with Autonomous System Provider Authorization (ASPA), to secure BGP updates (a route origin validation sketch follows this list).
    – Adopting the MANRS (Mutually Agreed Norms for Routing Security) guidelines to improve Internet resilience and safety for network operators.
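
Below is a minimal sketch of RPKI route origin validation (RFC 6811), which classifies an announcement as valid, invalid, or not-found against a set of ROAs; ASPA, which additionally validates the AS path against provider authorizations, is not shown. The ROAs, ASNs, and prefixes are documentation/example values, not data from the incident.

```python
from dataclasses import dataclass
from ipaddress import ip_network

@dataclass
class Roa:
    """A simplified Route Origin Authorization: prefix, max length, origin ASN."""
    prefix: str
    max_length: int
    asn: int

def origin_validation(announced_prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    """Classify a BGP announcement per RPKI route origin validation:
    'valid' if a covering ROA authorizes this origin ASN at this prefix length,
    'invalid' if covering ROAs exist but none matches, 'not-found' otherwise."""
    announced = ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        if announced.version != roa_net.version or not announced.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and announced.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"

if __name__ == "__main__":
    roas = [Roa(prefix="198.51.100.0/24", max_length=25, asn=64500)]
    print(origin_validation("198.51.100.128/25", 64500, roas))  # valid
    print(origin_validation("198.51.100.128/25", 64501, roas))  # invalid: wrong origin ASN
    print(origin_validation("198.51.100.0/26", 64500, roas))    # invalid: longer than maxLength
    print(origin_validation("192.0.2.0/24", 64500, roas))       # not-found: no covering ROA
```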

This incident serves as a crucial learning point for security and compliance professionals, highlighting the necessity of robust network configuration management, real-time monitoring, and proactive measures to guard against routing incidents in cloud and infrastructure environments.