At Uber, our MySQL® fleet is the backbone of our data infrastructure, supporting a vast array of operations critical to our platform. The fleet consists of over 2,300 independent clusters. Building a control plane to manage a fleet at this scale, while ensuring zero downtime and no data loss, is among the most challenging problems in the industry.
In the last couple of years, we embarked on improving MySQL fleet availability from 99.9% to 99.99% through various optimizations and a re-architecture of the control plane. This is the first post in a multi-part blog series exploring MySQL deployment and operations at Uber. In this blog, we’ll talk about Uber’s MySQL fleet architecture, control plane operations, and some of the improvements made at the control plane layer over this period.
Architecture
The MySQL fleet at Uber consists of multiple clusters, each with numerous nodes. There are two major flows: the data flow, where clients/services interact with the MySQL cluster, and the control flow, which manages the life cycle of clusters.
For the data flow, stateless services hosted in Kubernetes® connect to their MySQL clusters via a standard client. Each server has a reverse proxy that stores routing mappings for MySQL nodes based on their roles (primary/replica/batch). This enables the client to discover and connect to the appropriate cluster over the JDBC™ protocol, based on the query to be executed.
The control flow manages the provisioning, maintenance, and decommissioning of clusters and nodes while ensuring security posture and integration with the Uber ecosystem.
The MySQL fleet at Uber consists of these major components, shown in Figure 1:
- Control plane
- Data plane
- Discovery plane
- Observability
- Change data capture and data warehouse ingestion
- Backup/restore
Control Plane
The MySQL control plane is a state-based system comprising multiple components/services and stores. At its core is the technology manager, which is responsible for orchestrating the other control plane components. One of its key responsibilities is publishing the goal state (the desired state) of each cluster to Odin, Uber’s in-house technology-agnostic management platform for stateful technologies, which also manages the placement of nodes. The goal state includes key configurations such as resource profiles, node counts, roles (primary/follower), the side containers that should be running on the data nodes, server settings (like binlog format and SQL mode), and more. The control plane ensures that every MySQL cluster and node always converges to this desired state.
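The exact goal-state schema is internal to Odin and the technology manager; the sketch below only illustrates the kind of desired-state record described above, with hypothetical field names.

```go
package main

import "fmt"

// GoalState is a hypothetical sketch of the desired state the technology
// manager could publish to Odin for a single MySQL cluster. Field names are
// illustrative, not Uber's actual schema.
type GoalState struct {
	ClusterID       string            // logical cluster identifier
	NodeCount       int               // number of MySQL nodes to maintain
	ResourceProfile string            // CPU/memory/disk profile for each node
	Roles           map[string]string // node ID -> role (primary/follower)
	SideContainers  []string          // helper containers expected on each data node
	ServerSettings  map[string]string // e.g. binlog_format, sql_mode
}

func main() {
	gs := GoalState{
		ClusterID:       "cluster-42",
		NodeCount:       4,
		ResourceProfile: "standard-16c-64g",
		Roles:           map[string]string{"node-1": "primary", "node-2": "follower"},
		SideContainers:  []string{"worker", "metrics", "health-prober"},
		ServerSettings:  map[string]string{"binlog_format": "ROW", "sql_mode": "STRICT_TRANS_TABLES"},
	}
	fmt.Printf("publishing goal state: %+v\n", gs)
}
```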
The other key role of the technology manager is to allow changing the state of the system via workflows. A workflow is a fault-tolerant, long-running process powered by Cadence™. Examples of workflows include adding a new node to an existing cluster, performing a primary failover on a cluster, applying MySQL variables on a node, and changing the replication master of a MySQL replica. Some other key functions of the technology manager are listed below:
- Cluster management: Handles operations like creating, updating, and deleting clusters.
- Primary failover: Changes the primary node of the cluster.
- Node life cycle management: Manages the life cycle of nodes through operations such as adding, replacing, and deleting MySQL server nodes.
- Balanced placement: Provides signals to Odin’s placement engine to ensure balanced placement of server nodes across all geographical locations where Uber infra is deployed. This ensures resiliency against hardware failures or even data center outages.
- DB operations: Manages database-specific operations such as system variable setup, replication setup, and scaling operations.
Historically, the MySQL control plane was tightly coupled with the underlying infrastructure processes. As the fleet grew, this caused issues with infrastructure placement operations, which would get blocked by MySQL failures, and the MySQL team spent significant time debugging them. This coupling impacted the operational reliability of 60+ workflows, including primary failover, node replacement, and more. Additionally, MySQL relied on a Git-based config storage system to manage cluster state, a solution not optimized for this use case. Together, these posed reliability and scalability issues that required a re-architecture of the entire control plane.
Controller
As part of the control plane redesign, we introduced a component called the controller to the MySQL control plane. The controller acts as an external observer for all MySQL clusters, collecting signals from the database and other control plane components. It includes a rule evaluator that monitors these signals and takes action whenever a defined rule is violated in any cluster. One of the controller’s key roles is to monitor the health of each cluster’s primary node and automatically trigger a primary failover if the current primary experiences issues. The controller also keeps consensus groups balanced for clusters running in a group replication setup.
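The controller’s rule engine is internal, but the general pattern it follows, evaluating collected signals against a set of rules and triggering a mitigation when one is violated, can be sketched as follows. All names and thresholds here are illustrative.

```go
package main

import "fmt"

// Signal is a hypothetical health signal collected for a cluster.
type Signal struct {
	Cluster        string
	PrimaryHealthy bool
	ReplicationLag float64 // seconds
}

// Rule pairs a violation check with the mitigation it should trigger.
// This mirrors the rule-evaluator pattern described above, not Uber's actual API.
type Rule struct {
	Name     string
	Violated func(Signal) bool
	Action   func(Signal)
}

func main() {
	rules := []Rule{
		{
			Name:     "primary-unhealthy",
			Violated: func(s Signal) bool { return !s.PrimaryHealthy },
			Action:   func(s Signal) { fmt.Println("triggering primary failover for", s.Cluster) },
		},
		{
			Name:     "replication-lag",
			Violated: func(s Signal) bool { return s.ReplicationLag > 30 },
			Action:   func(s Signal) { fmt.Println("disabling traffic on lagging replicas in", s.Cluster) },
		},
	}

	// Evaluate one observed signal against every rule.
	s := Signal{Cluster: "cluster-42", PrimaryHealthy: false, ReplicationLag: 2}
	for _, r := range rules {
		if r.Violated(s) {
			r.Action(s)
		}
	}
}
```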
Orchestration of Critical Flows
The primary mechanism for interacting with the control plane is through workflows. Workflows are asynchronous, event-driven processes, defined as a series of steps that orchestrate complex long-running tasks. The MySQL control plane uses Cadence™ to power these workflows, which provides durability, fault tolerance, and scalability. Figure 2 shows a typical workflow in the MySQL control plane.
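As an illustration of what such a workflow looks like in code, here’s a minimal sketch using the open-source Cadence Go client. The workflow and activity names are hypothetical; a real control-plane workflow would have more steps, retry policies, and failure handling.

```go
package workflows

import (
	"context"
	"time"

	"go.uber.org/cadence/workflow"
)

// Placeholder activities; in the real control plane each step would call into
// the technology manager or a data-node worker.
func ValidateRequestActivity(ctx context.Context, clusterID string) error   { return nil }
func ApplyChangeActivity(ctx context.Context, clusterID string) error       { return nil }
func PublishGoalStateActivity(ctx context.Context, clusterID string) error  { return nil }

// ClusterChangeWorkflow sketches the shape of a typical control-plane
// workflow: a sequence of retriable activities whose progress Cadence
// persists durably.
func ClusterChangeWorkflow(ctx workflow.Context, clusterID string) error {
	ao := workflow.ActivityOptions{
		ScheduleToStartTimeout: time.Minute,
		StartToCloseTimeout:    30 * time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	for _, step := range []interface{}{
		ValidateRequestActivity,
		ApplyChangeActivity,
		PublishGoalStateActivity,
	} {
		if err := workflow.ExecuteActivity(ctx, step, clusterID).Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```

Because Cadence persists each activity result, a worker crash mid-workflow resumes from the last completed step rather than starting over, which is what makes these long-running operations fault-tolerant.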
This redesign revamped all control plane operations. The next sections review the orchestration of a few critical flows.
Primary Failover
At Uber, we have a single primary-multiple replica setup. The writes are handled by the primary node, which is replicated to replicas using standard MySQL binlog replication.
A primary failover is an automated process that changes the primary node of a cluster from one host to another. Since there’s a single primary node, keeping it healthy and operational is critical to guarantee high write availability and minimize downtime. Primary failover workflows are used as a mitigation action by components, such as the controller, that continuously monitor the health of the primary node and perform a failover in case of any degradation.
Based on the health of the existing primary node, we perform two types of failovers: graceful and emergency.
Graceful promotions are required during general maintenance of the current primary node, for example, when its host needs to be repaired. They involve selecting a new primary candidate and then gracefully transferring the write load from the old primary to the new one. Graceful failovers assume that the existing primary node is available and healthy. This applies to async and semi-sync replication setups; there’s another deployment mode involving group replication, which is out of scope for this blog.
Graceful failover performs these steps (a simplified sketch follows the list):
- Puts the current primary node into read-only mode.
- Shuts down traffic on the current primary node.
- Selects a new primary node (the primary elect). By default, the workflow picks the primary elect from the same data center. The replication lag of the replica nodes is also considered, and the node with the most advanced binlog position is given precedence.
- Retrieves the binlog position from the previous primary and waits for those transactions to be applied on the primary elect.
- Enables writes on the new primary.
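The exact implementation is internal to the control plane, but a minimal sketch of the step ordering is shown below. All types and method names (Node, SetReadOnly, WaitForPosition, and so on) are hypothetical stand-ins, and the primary elect is assumed to have already been selected as the most caught-up replica in the same data center.

```go
package failover

import "fmt"

// Node is a minimal stand-in for a MySQL node handle; the methods below are
// hypothetical and only illustrate the ordering of the graceful-failover steps.
type Node struct{ ID string }

func (n *Node) SetReadOnly() error    { fmt.Println("set read_only on", n.ID); return nil }
func (n *Node) DisableTraffic() error { fmt.Println("drain traffic from", n.ID); return nil }
func (n *Node) BinlogPosition() (string, error) {
	return "binlog.000123:456", nil
}
func (n *Node) WaitForPosition(pos string) error {
	fmt.Println(n.ID, "caught up to", pos)
	return nil
}
func (n *Node) EnableWrites() error { fmt.Println("enable writes on", n.ID); return nil }

// GracefulFailover mirrors the step ordering described above: fence the old
// primary, drain its traffic, wait for the primary elect to apply the final
// transactions, then open it for writes.
func GracefulFailover(oldPrimary, primaryElect *Node) error {
	if err := oldPrimary.SetReadOnly(); err != nil {
		return err
	}
	if err := oldPrimary.DisableTraffic(); err != nil {
		return err
	}
	pos, err := oldPrimary.BinlogPosition()
	if err != nil {
		return err
	}
	if err := primaryElect.WaitForPosition(pos); err != nil {
		return err
	}
	return primaryElect.EnableWrites()
}
```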
In cases where the existing primary node isn’t available (because of a data center zone failure or network isolation), the control plane performs an emergency failover. It follows the same steps as a graceful promotion, with the only exception that it doesn’t rely on the current primary to replicate all its data to the new primary node, because the current primary is assumed to be unreachable.
We guarantee 99.99% availability to our downstream services, and primary failover is a critical process that helps us meet this SLA.
Node Replacement
Replacing a node in the control plane involves moving a MySQL node (and all its data) from one host to another without affecting the users of that MySQL database.
Uber’s hardware infrastructure is spread over multiple cloud providers and on-prem data centers and comprises hundreds of thousands of machines and other hardware and network components. Node replacement is crucial for protecting the MySQL fleet from disruptions on this vast infrastructure and keeping it agile. The node replacement workflow is a maintenance activity that shuts down a node on the affected host and creates an identical node on a different host with similar resources and geographical location, in a way that’s completely transparent to the users of the database, who aren’t even aware of the movement.
Node replacement, although seemingly a simple operation of data movement, comes with its nuances:
- Hardware profile: The new node must have the same hardware profile as the replaced one. This means it must have the same number of CPU cores, disk and memory space, ports, etc.
- Co-location: The new node must be placed on a host with the same fault tolerance level as the one it replaces to ensure identical network latencies. Customers only care that query latencies remain consistent, regardless of the node’s location.
- Dependencies: If the current node is the replication parent for other nodes in the topology, its child nodes must either point to the new replacement node or connect to another node in the cluster.
- Primary promotion: If the node being replaced is the cluster’s primary, a graceful primary promotion must be triggered to transfer write responsibilities to another node before the old primary is decommissioned or removed from traffic.
Node replacement internally consists of two independent operations: node addition and node deletion.
Node addition is the bootstrapping process, which consists of placement and data provisioning. Placement involves identifying a host that has the required resources for the new node. Data provisioning involves installing a MySQL process on the identified host and then starting a data transfer from one of the existing nodes (preferably the primary) to the new node. The node addition process is designed to support adding multiple nodes in parallel, which is especially useful during disaster recovery scenarios.
Node deletion is the process of gracefully removing the node after all its dependencies (listed above) are taken care of.
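A simplified sketch of how these two operations compose is shown below; the helper functions are hypothetical placeholders for the placement, provisioning, and decommissioning steps described above.

```go
package replacement

import "fmt"

// Hypothetical helpers standing in for placement, data provisioning,
// dependency handling, and decommissioning.
func findHost(resourceProfile, zone string) (string, error) {
	return "host-789", nil // placement: a host with matching resources and zone
}

func provisionData(sourceNode, targetHost string) (string, error) {
	fmt.Printf("streaming data from %s to %s\n", sourceNode, targetHost)
	return "node-new", nil // install MySQL and copy data from an existing node
}

func repointChildren(oldNode, newNode string) error { return nil } // replication dependencies
func decommission(node string) error                { return nil } // graceful node deletion

// ReplaceNode composes the two independent operations: add a node on a
// comparable host, then delete the old one once its dependencies are handled.
func ReplaceNode(oldNode, sourceNode, resourceProfile, zone string) error {
	host, err := findHost(resourceProfile, zone)
	if err != nil {
		return err
	}
	newNode, err := provisionData(sourceNode, host)
	if err != nil {
		return err
	}
	if err := repointChildren(oldNode, newNode); err != nil {
		return err
	}
	return decommission(oldNode)
}
```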
Schema Changes
The MySQL control plane automates schema changes through a self-serve workflow. This process uses MySQL’s instant ALTER or Percona™ pt-online-schema-change (pt-osc) to perform a safe, non-blocking schema update on the primary node. The workflow selects the schema application strategy based on the schema change type and data size. For example, it uses instant ALTER for quick and safe column additions, and non-blocking online methods like pt-osc for datatype changes.
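The actual selection logic is internal; the sketch below only illustrates the shape of that decision, with hypothetical types, and the table-size field as an assumed additional input.

```go
package schemachange

// Strategy identifies how a schema change is applied.
type Strategy string

const (
	InstantAlter Strategy = "instant-alter"            // MySQL ALGORITHM=INSTANT
	PTOSC        Strategy = "pt-online-schema-change"  // non-blocking online rebuild
)

// ChangeRequest is a hypothetical description of a requested schema change.
// TableSizeBytes is included because data size could factor into the choice.
type ChangeRequest struct {
	AddsColumnOnly  bool
	ChangesDatatype bool
	TableSizeBytes  int64
}

// SelectStrategy sketches the decision described above: metadata-only changes
// (like adding a column) can use instant alter, while rebuild-style changes
// (like datatype changes) fall back to a non-blocking online tool.
func SelectStrategy(req ChangeRequest) Strategy {
	if req.AddsColumnOnly && !req.ChangesDatatype {
		return InstantAlter
	}
	return PTOSC
}
```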
The schema change workflow also allows a dry-run capability. The dry-run enables customers to apply a schema change on an isolated replica before applying it to the primary node (and the rest of the cluster). This gives additional assurance that the schema change is backward-compatible, non-destructive, and safe.
The schema change workflow is also integrated with Uber’s CI/CD pipelines, making schema changes fully automated and subject to a review process. Developers make schema changes in a schema file that’s checked in with the rest of the source code. Once the update is approved and merged into the main branch, the CI system detects it and triggers the schema change workflow. This gives developers full control of their schema and ensures that deployed code always aligns with the database schema.
Data Plane
A running MySQL node is composed of several containers running on a single host. The database container runs the MySQL process, and several helper components perform well-defined jobs alongside it. These components are isolated Docker® containers on the same host that talk to each other via Docker networking. The anatomy of a MySQL node is shown in Figure 4.
A MySQL node consists of:
- Database container: Runs the mysqld process with the Oracle InnoDB® storage engine. It can be configured to use other storage engines like Meta’s RocksDB™.
- Worker container: This is a sidecar container, responsible for converging the actual state of the node to its goal state. This integrates the MySQL node with the Odin placement engine.
- Metrics container: Polls various database signals (like QPS, query types, lock times, and connection metrics) that the MySQL process emits and publishes them for monitoring.
- Health prober: Periodically tracks the health of the MySQL process and emits signals on primary health. The controller consumes these signals and takes action to mitigate primary node failures, enabling strict write-downtime SLAs for MySQL users across Uber (a simplified prober sketch follows this list).
- Backup: An ephemeral container that periodically spawns to take a database backup and upload it to an object store.
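As a rough illustration of the health prober’s job, the sketch below periodically pings the local mysqld and emits a health signal. The DSN, probe interval, and signal format are assumptions; a real prober would also check read-only state and replication status and publish to the controller rather than log.

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver for database/sql
)

// probeLoop periodically checks that the local mysqld accepts connections and
// emits a health signal.
func probeLoop(ctx context.Context, db *sql.DB) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			probeCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
			err := db.PingContext(probeCtx)
			cancel()
			if err != nil {
				log.Printf("health signal: primary unhealthy: %v", err)
			} else {
				log.Println("health signal: primary healthy")
			}
		}
	}
}

func main() {
	// The DSN is illustrative; a real prober would read credentials and the
	// local socket/port from the node's configuration.
	db, err := sql.Open("mysql", "prober:secret@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	probeLoop(context.Background(), db)
}
```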
Discovery Plane
The routing or discovery plane simplifies client interaction with the MySQL cluster by providing an abstraction over the ever-changing Uber hardware infrastructure. This provides a single virtual IP for services to connect to their MySQL clusters, hiding all the changes at the hardware level.
The routing and discoverability plane consists of three major components.
- Reverse proxy: Acts as a load balancer and forwards the client’s request to and from the database hosts.
- Pooling service: Responsible for updating proxy configurations during any cluster/node management operations such as primary failover or node replacements.
- Standard client: Provides simple, easy-to-use functions for creating connections to primary and replica nodes based on the type of request (read/write), along with connection pooling, timeout handling, client-side metrics, and more (see the sketch after this list).
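A minimal sketch of what such a standard client could look like is shown below, assuming two virtual endpoints (one routed to the primary, one load-balanced across replicas). The type and method names are hypothetical, and connection pooling is left to database/sql defaults.

```go
package dbclient

import (
	"context"
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

// Client hides the primary vs. replica split behind read/write helpers, while
// the reverse proxy behind the two virtual endpoints does the node-level routing.
type Client struct {
	primary *sql.DB // writes
	replica *sql.DB // load-balanced reads
}

func New(primaryDSN, replicaDSN string) (*Client, error) {
	p, err := sql.Open("mysql", primaryDSN)
	if err != nil {
		return nil, err
	}
	r, err := sql.Open("mysql", replicaDSN)
	if err != nil {
		return nil, err
	}
	return &Client{primary: p, replica: r}, nil
}

// Exec routes writes to the primary endpoint.
func (c *Client) Exec(ctx context.Context, query string, args ...interface{}) (sql.Result, error) {
	return c.primary.ExecContext(ctx, query, args...)
}

// Query routes reads to the replica endpoint.
func (c *Client) Query(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
	return c.replica.QueryContext(ctx, query, args...)
}
```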
As part of the control plane re-architecture, the routing plane was updated to use a strongly consistent etcd™ data store as its topology store. Changes to the topology, such as adding a new node or handling a primary failover, are recorded in the topology store by the manager. These updates are propagated to the pooling service via etcd watches, and the pooling service then adjusts the reverse proxy configuration to direct traffic to new nodes or drain traffic from departing nodes (during node replacement). All of this remains fully abstracted from clients, which connect to the reverse proxy using a static VIP. The proxy configuration is generated to route write queries to the primary node and load-balance read queries across all replicas, prioritizing replicas in the same geographical region.
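The pattern of reacting to topology changes via etcd watches can be sketched with the open-source etcd v3 Go client, as shown below. The key layout and endpoint are illustrative, not Uber’s actual topology schema.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// A pooling-service-style component watches a (hypothetical) topology prefix
// and reacts to every change by regenerating proxy configuration.
func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd.internal:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Watch all topology records for one cluster.
	watchCh := cli.Watch(context.Background(), "/mysql/topology/cluster-42/", clientv3.WithPrefix())
	for resp := range watchCh {
		for _, ev := range resp.Events {
			fmt.Printf("topology change: %s %s -> %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
			// A real pooling service would rebuild the reverse-proxy routing
			// config here and drain traffic from removed nodes.
		}
	}
}
```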
The discovery plane also supports disabling traffic on specific nodes. This is very useful for debugging hardware or software failures on MySQL nodes without impacting customer traffic. This capability can also be automated: the controller can disable traffic on nodes experiencing issues like replication lag.
Observability
Apart from collecting system metrics from containers and clusters, probers are used to simulate the data flow and collect metrics on the health of each cluster’s components. These metrics and logs are collected by Uber’s metrics and logging systems. Alerts are configured to detect failures such as write unavailability, replication lag, high CPU usage, and abnormal connection activity on the primary node. This observability ecosystem keeps the MySQL database-as-a-service team on top of the fleet’s health. Teams owning upstream services connected to the databases can also subscribe to these alerts.
Change Data Capture
For change data capture (CDC), the MySQL fleet uses Storagetapper, which captures changes (inserts, updates, deletes) from the binlog, streams them to Apache Kafka®, and ingests them into an Apache Hive™ data store. The system can handle upstream schema changes, transformations, and format conversions.
Backup and Restore
Backup and restore are fully automated processes for MySQL at Uber. Backups use Percona XtraBackup™. MySQL backup and restore maintain a 4-hour RPO, with an RTO ranging from a few minutes to a few hours depending on data size.
Conclusion
MySQL sits at the core of many critical services across Uber. The control plane provides a reliable, scalable, and highly performant MySQL as a platform for these services, abstracting all the operational overhead of maintaining such a fleet at Uber’s scale.
This introductory blog explored the MySQL control plane’s major components, discussing each component’s architecture and role. We also explored some important operations and automation in the control plane. These keep the fleet healthy and agile without requiring manual intervention, allowing us to serve many customers and use cases. In the next blogs in this series, we’ll delve deeper into how we operate MySQL to guarantee high availability and high throughput.
Acknowledgments
The authors would like to thank all the contributors to the MySQL platform and all the Uber platform and customer teams for their feedback and collaboration.
Cover Photo Attribution: The cover photo was generated using OpenAI ChatGPT Enterprise.
Apache®, Apache Hive™, Apache Kafka®, and Hive are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Docker® and the Docker logo are trademarks or registered trademarks of Docker, Inc. in the United States and/or other countries. Docker, Inc. and other parties may also have trademark rights in other terms used herein.
InnoDB® and MySQL® are registered trademarks of Oracle and/or its affiliates. No endorsement by Oracle is implied by the use of these marks.
JDBC™ is a trademark of ORACLE AMERICA, INC.
Kubernetes®, etcd®, and Kubernetes® logo are registered trademarks of the Linux Foundation in the United States and/or other countries. No endorsement by The Linux Foundation is implied by the use of these marks.
Percona™ and Percona XtraBackup™ are trademarks of Percona, LLC.
RocksDB® is a registered trademark of Meta Platforms, Inc.
Banty Kumar
Banty Kumar is a Senior Software Engineer on the storage platform team. He led the MySQL control plane re-design and is involved in various other MySQL initiatives. His areas of interest include databases and running distributed systems at scale.
Debadarsini Nayak
Debadarsini Nayak is a Senior Engineering Manager, providing leadership for various storage technologies.
Raja Sriram Ganesan
Raja Sriram Ganesan is a Senior Staff Software Engineer on the Core Storage team at Uber. He is the tech lead for MySQL initiatives and has led critical reliability and modernization projects for MySQL at Uber.
Amit Jain
Amit Jain is an Engineering Manager in the Storage Platform team, leading the MySQL offering at Uber. He has successfully guided the re-design of the control plane and discovery plane along with resiliency, reliability, and security initiatives. Amit also leads the Change Data Capture (CDC) initiative for MySQL and other technologies.
Posted by Banty Kumar, Debadarsini Nayak, Raja Sriram Ganesan, Amit Jain