Source URL: https://surfingcomplexity.blog/2024/12/21/the-canva-outage-another-tale-of-saturation-and-resilience/
Source: Hacker News
Title: The Canva outage: another tale of saturation and resilience
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: The incident at Canva, detailed by Brendan Humphries, highlights a series of interconnected failures that led to a significant service outage. A stale CDN traffic management rule delayed delivery of a JavaScript asset and synchronized hundreds of thousands of waiting clients, producing a surge of traffic that an API gateway already weakened by a performance regression could not handle. The case underscores the importance of system resilience and operator knowledge in managing complex infrastructure failures.
Detailed Description:
The incident at Canva serves as a compelling study of how a combination of technological mismanagement and unexpected user behavior can lead to significant operational failures. Below are the key points and insights from this incident:
– **Incident Trigger**: The incident began with the deployment of a new version of Canva’s editor page. The deployment itself was not faulty; it required clients to fetch a new JavaScript asset, and it was the system’s handling of those fetches that led to failure.
– **CDN Misconfiguration**:
  – Canva used Cloudflare as their CDN, where a stale traffic management rule caused traffic to be routed through the public internet instead of the private backbone, leading to increased latency and packet loss for users in Asia.
– **Request Backlog**:
  – The high latency left over 270,000 requests waiting simultaneously for the same JavaScript file, effectively “synchronizing” the behavior of those users.
  – When the JavaScript fetch finally completed, this backlog produced a “thundering herd”: requests to the API spiked to 1.5 million per second (a generic sketch of the jittered-backoff mitigation for this dynamic appears after this list).
– **API Gateway Challenges**:
  – A performance regression in the API gateway, introduced by a change to telemetry library code, caused tasks to freeze and fail to process requests.
  – The load balancer exacerbated the situation by directing traffic to already overloaded tasks, further collapsing the API service (a generic load-shedding sketch illustrating a defense against this pattern appears after this list).
– **Failure Modes in Cloud Environments**:
  – The cascading failure was amplified by the interaction of various systems, including the load balancer, autoscaler, and Linux’s Out of Memory (OOM) killer, which together drove a rapidly worsening failure cycle.
– **Adaptive Response**:
  – Canva engineers intervened manually, implementing a temporary traffic block at the Cloudflare level to allow the system to stabilize.
  – Once the system had stabilized, traffic was restored incrementally to avoid overwhelming it again (a sketch of this kind of gradual ramp appears after this list).
– **Observation of Performance Issues**:
  – The investigation reinforced that functional defects are easier to detect than performance problems, which often go unnoticed until the system is under extreme load.
– **Long-term Resilience Strategies**:
  – The engineers gathered insights for future incidents, stressing the need for better operator awareness and internal runbooks for managing traffic dynamics effectively.
  – Both robustness and resilience, grounded in operators’ understanding of system behavior, are crucial for preventing and handling such incidents.
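The “thundering herd” described under **Request Backlog** arises whenever many clients become synchronized on the same event (here, the completion of a long-delayed JavaScript fetch). A standard client-side mitigation is exponential backoff with jitter, which spreads retries out in time so that clients that failed together do not retry together. The sketch below is a minimal, hypothetical Python illustration of that technique; it is not Canva’s code, and the `fetch_asset` callable and its parameters are assumptions.

```python
import random
import time

def fetch_with_jittered_backoff(fetch_asset, max_attempts=6, base=0.5, cap=30.0):
    """Retry a fetch using "full jitter" exponential backoff.

    Sleeping a random fraction of an exponentially growing window
    de-synchronizes clients, avoiding a retry stampede.
    """
    for attempt in range(max_attempts):
        try:
            return fetch_asset()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a uniform random amount up to the
            # current backoff cap, rather than a fixed interval.
            backoff = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```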
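The collapse described under **API Gateway Challenges** is the classic saturation pattern: once tasks are past capacity, accepting more work only slows every request further. One common defensive measure, sketched below purely as a generic illustration (the report does not say Canva’s gateway works this way), is to cap the number of in-flight requests and shed excess load with a fast error instead of letting queues grow.

```python
import threading

class ConcurrencyLimiter:
    """Reject new work once too many requests are already in flight.

    Failing fast keeps a saturated service responsive for the requests
    it does accept, instead of letting backlogs grow until tasks fail
    health checks or are killed (e.g. by the OOM killer).
    """

    def __init__(self, max_in_flight: int):
        self._sem = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, process):
        # Non-blocking acquire: if the limit is reached, shed the request.
        if not self._sem.acquire(blocking=False):
            return {"status": 503, "body": "overloaded, try again later"}
        try:
            return process(request)
        finally:
            self._sem.release()
```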
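The recovery described under **Adaptive Response**, blocking traffic at the CDN edge and then letting it back in gradually, amounts to a controlled ramp of admitted traffic. The sketch below expresses that idea as a probabilistic admission filter stepped up over time; it is a hypothetical Python illustration, not Canva’s actual Cloudflare configuration, and the schedule values are made up.

```python
import random
import time

class TrafficRamp:
    """Admit a growing fraction of requests while a recovering
    service warms up, dropping the rest at the edge."""

    def __init__(self, schedule):
        # schedule: (seconds_since_start, admit_fraction) pairs,
        # e.g. [(0, 0.0), (300, 0.25), (600, 0.5), (900, 1.0)]
        self._schedule = sorted(schedule)
        self._start = time.monotonic()

    def _current_fraction(self):
        elapsed = time.monotonic() - self._start
        fraction = 0.0
        for t, f in self._schedule:
            if elapsed >= t:
                fraction = f
        return fraction

    def admit(self, request) -> bool:
        # Admit each request with the currently scheduled probability.
        return random.random() < self._current_fraction()
```

In practice the ramp would be driven by observed health metrics rather than a fixed schedule, but the shape of the idea, letting traffic back in slower than the system can absorb it, is the same.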
This incident illustrates vital lessons about the complex interplay of cloud infrastructure, operational management, and the importance of having contingency plans, particularly in highly dynamic environments like those experienced by SaaS platforms. The continued evolution of operational practices and incident management strategies is essential for maintaining efficacy and reliability in cloud-based services.