Engineering, Backend

How Uber ensures Apache Cassandra®’s tolerance for single-zone failure

June 20 / Global
Figure 1: Single Zone Failure and Availability Impact.
Figure 2: The Desired Setup: Grouping the Nodes by Zone.
Figure 3: Uber’s Old Setup: Single Unique Rack.
Figure 4: Hotspot Issue When Introducing a New Rack to a Single-Rack Setup.
Figure 5: Phase 1: provision a new set of multi-rack nodes with binary (the CQL native transport) disabled.
Figure 6: Phase 2: include the new set of nodes in replication.
Figure 7: Phase 3: sync old data.
Figure 8: Phase 4: traffic switch.
Figure 9: Phase 5: get rid of the old set of nodes.
Long Pan

Long Pan is a Senior Software Engineer on the Cassandra team at Uber. He is deeply involved in a broad range of activities to maintain and enhance Cassandra at scale as a distributed database. His work spans operational improvements and development projects. Outside of work, he enjoys cooking and exploring new places.

Gopal Mor

Gopal Mor is a Sr. Staff Software Engineer and a Tech Lead Manager on the Cassandra team at Uber. He works on distributed systems, web architectures, databases, and reliability improvements. He has been in the industry since the era of monoliths and traditional architectures, and has championed resolving tail-end issues in performance, latency, and efficiency. In his most recent role, he leads and manages the team responsible for Cassandra at Uber. In his spare time, he likes to tinker with IoT devices, Raspberry Pi, FPGAs, and ESP32.

Jaydeepkumar Chovatia

Jaydeepkumar Chovatia is a Sr. Staff Software Engineer on the Storage Platform org at Uber. He has contributed to major projects in core Cassandra and has upstreamed patches to open-source Cassandra. His primary focus is building distributed storage solutions and databases that scale with Uber's hyper-growth. Prior to Uber, he worked at Oracle. Jaydeepkumar holds a master's degree in Computer Science from the Birla Institute of Technology and Science with a specialization in distributed systems.

Shriniket Kale

Shriniket Kale is an Engineering Manager II on the Storage Platform org at Uber. He leads Storage Foundations, whose teams focus on three major charters: MySQL and MyRocks, Docstore Control Plane and Reliability, and Cassandra. These teams power the Storage Platform, on which all critical functions and lines of business at Uber rely worldwide. The platform serves tens of millions of QPS with an availability of 99.99% or more and stores tens of petabytes of operational data.

Gabriele Di Bernardo

Gabriele Di Bernardo is a former Senior Software Engineer on the Stateful Platform team at Uber. His focus was on efficient and reliable node placement, large-scale fleet automation, and automatic data center turn-up and tear-down.

Posted by Long Pan, Gopal Mor, Jaydeepkumar Chovatia, Shriniket Kale, Gabriele Di Bernardo