Source URL: https://blog.cloudflare.com/rearchitecting-workers-kv-for-redundancy/
Source: The Cloudflare Blog
Title: Redesigning Workers KV for increased availability and faster performance
Feedly Summary: Workers KV is Cloudflare’s global key-value store. After the incident on June 12, we re-architected KV’s redundant storage backend, removed single points of failure, and made substantial improvements.
AI Summary and Description: Yes
Summary: The text details Cloudflare’s response to a significant service outage on June 12, 2025, caused by a failure in their Workers KV service’s storage infrastructure. The incident prompted a redesign of their system towards a hybrid architecture, eliminating reliance on third-party providers and enhancing performance and availability. This strategic shift reflects critical lessons learned in infrastructure resilience and service reliability, relevant for professionals across cloud security, infrastructure, and service reliability domains.
Detailed Description:
The document gives an overview of Cloudflare’s June 12, 2025 incident, in which a storage infrastructure failure affected the Workers KV service and, in turn, disrupted multiple dependent services. The incident serves as a case study in infrastructure security, resilience, and the complexities of multi-provider architectures. The following points summarize the key aspects and insights:
– **Incident Background:**
– Workers KV is a key-value store critical for configuration, authentication, and asset delivery.
– The failure stemmed from reliance on a third-party cloud provider which experienced a global outage, leading to substantial service disruption.
– **Architectural Redesign:**
– After the incident, Cloudflare moved to store all KV data on its own infrastructure.
– The new design eliminates single points of failure and enhances operational redundancy by integrating multiple storage systems.
– The transition from a dual-provider architecture to a single, Cloudflare-owned infrastructure is aimed at increased performance and control.
– **Hybrid Storage Solution:**
– The new architecture draws on the strengths of distributed databases to optimize storage for small objects, which better fits the demands of the KV service.
– A KV Storage Proxy (KVSP) mediates communication with the database clusters and is optimized for both small and large objects.
– **Consistency Mechanisms:**
– Advanced mechanisms are implemented to maintain data consistency across multiple storage backends, even in the face of operational failures.
– These include racing writes and reads, ballooning cache invalidation strategies, and background processes that rectify inconsistencies.
– **Performance Improvements:**
– The redesign led to significant internal latency reductions, with performance benchmarks illustrating faster response times across services—most notably in European operations.
– **Future Directions:**
– Plans are underway to eliminate all dependencies on third-party storage providers, striving for total infrastructure independence.
– Adoption of hybrid architecture strategies aims to improve resilience and performance not only for Workers KV but for other Cloudflare services as well.
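The consistency mechanisms listed above (racing writes and reads across backends, plus background repair) can be illustrated with a small sketch. This is not Cloudflare's implementation: the `Backend` class, the integer `version` used to pick the newest copy, and the function names `race_write`, `race_read`, and `reconcile` are all hypothetical, chosen only to show the general idea under the assumption of two in-memory backends.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class Backend:
    """Hypothetical in-memory stand-in for one storage backend."""

    def __init__(self, name):
        self.name = name
        self.data = {}       # key -> (version, value)
        self.healthy = True  # flip to False to simulate an outage

    def put(self, key, value, version):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        current = self.data.get(key)
        # Assumption: the higher version wins, so replays are harmless.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def get(self, key):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        return self.data.get(key)


def race_write(backends, key, value, version):
    """Send the write to every backend at once; succeed if any one accepts it."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = [pool.submit(b.put, key, value, version) for b in backends]
        acks = sum(1 for f in as_completed(futures) if f.exception() is None)
    if acks == 0:
        raise ConnectionError("no backend accepted the write")
    return acks


def race_read(backends, key):
    """Query every backend at once and return the first successful answer."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = [pool.submit(b.get, key) for b in backends]
        for f in as_completed(futures):
            if f.exception() is None:
                # Leaving the `with` block still waits for the slower call;
                # a production race would not block on it.
                return f.result()
    raise ConnectionError("no backend answered the read")


def reconcile(backends, key):
    """Background repair: copy the newest version to any healthy backend that lags."""
    healthy = [b for b in backends if b.healthy]
    copies = [c for c in (b.get(key) for b in healthy) if c is not None]
    if not copies:
        return
    version, value = max(copies, key=lambda c: c[0])
    for b in healthy:
        b.put(key, value, version)
```

In this sketch a write that reaches only one backend still succeeds, and the lagging backend is brought up to date later by `reconcile`. That mirrors the summary's point that consistency is restored even after a partial failure, though the real system's versioning and repair logic are not described here.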
This incident underscores the importance of operational resilience and the need for robust cloud infrastructure design, particularly for organizations managing critical services in highly dynamic environments. Security and compliance professionals can glean insights from the methodologies used in the redesign concerning multi-cloud strategy, data consistency challenges, and proactive incident management, which are all essential in safeguarding service reliability.