Hacker News: The Failure Rate of EBS

Mar 18, 2025

—

Source URL: https://planetscale.com/blog/the-real-fail-rate-of-ebs
Source: Hacker News
Title: The Failure Rate of EBS


**Summary:** The article examines how Amazon Elastic Block Store (EBS) volumes fail in practice: complete failures are rare, but performance degradation is frequent. This has significant implications for cloud infrastructure reliability and performance, particularly for businesses that depend on consistent database operations.

**Detailed Description:** The article by Nick Van Wiggeren highlights critical insights regarding the operational failures of Amazon Web Services’ Elastic Block Store (EBS), making it relevant for professionals in cloud computing, infrastructure security, and performance optimization:

* **Frequent Partial Failures:**
– EBS performance can degrade, producing latency spikes that the applications using the volume experience as failures, even though no data is lost.
– A minor issue in one part of the architecture can cascade and surface as a perceived failure in user-facing applications.

* **Performance Metrics and Expectations:**
– The documentation states a provisioned performance expectation: a volume should deliver at least 90% of its provisioned performance 99% of the time, which means it may operate below 90% for up to 1% of the time.
– In practice, however, users see latency spikes that can render a volume effectively unusable, with operational impact such as HTTP 500 errors on web pages.
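
To put the 1% figure above into concrete terms, the following minimal sketch converts it into wall-clock time. The function and constant names are illustrative, and the calculation assumes the 1% applies uniformly over a non-leap year.

```python
# Fraction of time a volume may run below 90% of its provisioned
# performance, taken from the 1% figure quoted above.
BELOW_TARGET_FRACTION = 0.01

MINUTES_PER_DAY = 24 * 60
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def degraded_minutes_per_day(fraction: float = BELOW_TARGET_FRACTION) -> float:
    """Minutes per day a single volume may spend below target."""
    return fraction * MINUTES_PER_DAY


def degraded_hours_per_year(fraction: float = BELOW_TARGET_FRACTION) -> float:
    """Hours per year a single volume may spend below target."""
    return fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"{degraded_minutes_per_day():.1f} minutes per day")   # 14.4
    print(f"{degraded_hours_per_year():.1f} hours per year")     # 87.6
```

Even when a volume stays within its documented expectation, that is roughly a quarter of an hour per day during which it may underperform.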

* **Exponential Impact in Large Systems:**
– In large-scale distributed databases, a seemingly small drop in one volume's performance can propagate across the system because its components depend on one another.
– For instance, for a deployment with 768 EBS volumes, the article estimates a 99.65% chance that at least one volume is experiencing an incident affecting production workloads at any given time.
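
The fleet-level effect can be illustrated with the standard "at least one" probability, 1 - (1 - p)^n. This is a sketch under two assumptions that this summary does not confirm: volumes degrade independently, and each has the same probability `p_degraded` of being degraded at a given moment. The value 0.005 used below is a placeholder, not the rate behind the article's 99.65% figure.

```python
def prob_at_least_one_degraded(num_volumes: int, p_degraded: float) -> float:
    """Probability that at least one of num_volumes is degraded right now.

    Assumes volumes are independent and share the same per-volume
    probability p_degraded (both are simplifying assumptions).
    """
    if num_volumes < 0:
        raise ValueError("num_volumes must be non-negative")
    if not 0.0 <= p_degraded <= 1.0:
        raise ValueError("p_degraded must be between 0 and 1")
    # P(no volume degraded) = (1 - p)^n, so take the complement.
    return 1.0 - (1.0 - p_degraded) ** num_volumes


if __name__ == "__main__":
    # Hypothetical per-volume rate, chosen only to show the trend.
    for n in (1, 10, 100, 768):
        print(n, round(prob_at_least_one_degraded(n, 0.005), 4))
```

The point of the sketch is the shape of the curve: a per-volume probability that looks negligible becomes near-certainty once hundreds of volumes must all be healthy at the same time.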

* **Monitoring and Mitigation Strategies:**
– PlanetScale monitors EBS metrics closely to minimize the impact of performance degradation.
– Automated systems respond quickly to degraded conditions, for example by performing zero-downtime reparenting and volume replacement, so that the user experience is not disrupted.
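
The monitor-then-mitigate loop described above could look roughly like the following. This is a hypothetical sketch, not PlanetScale's actual system: the class name, the thresholds (`slow_ms`, `max_slow_fraction`), and the action label are all assumptions made for illustration.

```python
from collections import deque


class VolumeLatencyMonitor:
    """Flags a volume as degraded when too many recent I/Os are slow."""

    def __init__(self, window_size: int = 30, slow_ms: float = 20.0,
                 max_slow_fraction: float = 0.5) -> None:
        # Keep only the most recent window_size latency samples.
        self.samples: deque = deque(maxlen=window_size)
        self.slow_ms = slow_ms                      # hypothetical threshold
        self.max_slow_fraction = max_slow_fraction  # hypothetical threshold

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def is_degraded(self) -> bool:
        # Wait for a full window so one slow I/O cannot trigger a failover.
        if len(self.samples) < self.samples.maxlen:
            return False
        slow = sum(1 for s in self.samples if s > self.slow_ms)
        return slow / len(self.samples) > self.max_slow_fraction


def choose_action(monitor: VolumeLatencyMonitor) -> str:
    """Map monitor state to a placeholder mitigation label."""
    return "reparent_and_replace_volume" if monitor.is_degraded() else "none"


if __name__ == "__main__":
    monitor = VolumeLatencyMonitor(window_size=4)
    for latency in (2.0, 3.0, 150.0, 200.0, 180.0):
        monitor.record(latency)
        print(latency, choose_action(monitor))
```

Requiring a sustained fraction of slow samples, rather than reacting to a single spike, trades a little detection latency for fewer unnecessary failovers.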

* **Transition to Alternative Architectures:**
– EBS’s limitations led PlanetScale to develop PlanetScale Metal, which uses local storage to avoid the pitfalls of network-attached storage.

Overall, Van Wiggeren's findings underline the need for robust operational strategies when running on cloud-based storage, especially around performance guarantees and the management of complex interconnected systems. This matters most to teams responsible for the reliability and performance of cloud infrastructure.
