Herb: Multi-DC Replication Engine for Uber’s Schemaless Datastore
25 July 2018 / Global
Schemaless, Uber’s fault-tolerant and scalable datastore, supports the 600-plus cities where we operate and the 15 million rides per day that occur on our platform, not to mention Uber Eats, Uber Freight, and other business lines. Since 2014, we have deployed more than 50 Schemaless instances, each with multiple datastores and many thousands of storage nodes.
As our business scaled across global markets, we needed the ability to replicate data to multiple data centers in order to support our active-active architecture. Replicated data sets let our applications read and write across data centers as seamlessly as possible and, in the case of a data center failure, enable them to continue functioning as normal.
To meet these needs, we built Herb, our replication solution. Herb was built with Go, one of Uber Engineering’s more popular languages. Whenever data is written to one data center, Herb can replicate that data to others, ensuring resilience and availability. Herb is also transport protocol-agnostic, allowing network flexibility and applicability to future architectures.
Design challenges
When we began designing and scoping out the replication problem that Herb would solve, we identified a few requirements that would force us to be creative about our solution:
- Consistency: To create seamless user experiences, we need to ensure that data is fresh and consistent between all data centers participating in replication. Inconsistent data may lead to a subpar experience on our platform.
- At-least-once delivery: Since Schemaless is append-only, datastores can safely reapply updates, so duplicate delivery is not an issue.
- In-order delivery of new updates: All updates must be ordered based on the data center of origin. For example, updates that originate in DC1 will appear in the same order in DC2, and vice versa. With this approach, applications can read from any data center and see the same ordered updates from a given origin.
- Different consumption speeds: Different data centers consume data at different speeds. As Uber scales, we must ensure that a slower data center will not block replication to a faster data center.
- Fault tolerance: We need a solution that is fault tolerant, meaning that one data center failure will not impact replication to other data centers.
System design and architecture
Herb manages a full mesh topology across data centers: each node is directly connected to every other node, which in this context means that each data center connects to all other data centers. The replication setup consists of multiple streams, one in each direction for each data center. When a write happens in a Schemaless instance in one data center, Herb is responsible for transporting that write to all other data centers. This way, if one data center goes down, its data remains accessible from the other data centers.
Herb is deployed on multiple hosts. For host discovery, it leverages the Uber Naming Scheme (UNS), Uber’s homegrown service registration solution. Each Herb process can run multiple tasks, where a task is a unit of work within one Herb process, as shown in Figure 3, above. There are no dependencies between tasks, and each task executes in an independent goroutine.
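As a rough illustration of this model (the `Task` interface and function names below are hypothetical, not Herb’s actual code), each task can be launched in its own goroutine:

```go
package herb

import (
	"context"
	"log"
	"sync"
)

// Task is a hypothetical unit of work inside a Herb process, for
// example replicating one Schemaless stream to one remote data center.
type Task interface {
	Name() string
	Run(ctx context.Context) error
}

// runTasks starts every task in its own goroutine; tasks have no
// dependencies on each other, so a slow task cannot block the rest
// of the process.
func runTasks(ctx context.Context, tasks []Task) {
	var wg sync.WaitGroup
	for _, t := range tasks {
		wg.Add(1)
		go func(t Task) {
			defer wg.Done()
			if err := t.Run(ctx); err != nil {
				log.Printf("task %s stopped: %v", t.Name(), err)
			}
		}(t)
	}
	wg.Wait()
}
```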
We designed Herb’s transport to be configurable so it is not coupled with any single protocol. For instance, we first used the Transmission Control Protocol (TCP), moved to Hypertext Transfer Protocol (HTTP), and are currently using YARPC, Uber’s open source remote procedure call (RPC) platform for Go services. We implemented our transport layer in a way that allows easy extension to any other protocol.
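A minimal sketch of what such a transport abstraction might look like is below; the `Transport` and `Update` types and their method signatures are illustrative assumptions, not Herb’s actual interfaces. Swapping TCP, HTTP, or YARPC then amounts to providing another implementation of the same interface.

```go
package herb

import "context"

// Update is a hypothetical replication payload: an ordered cell write
// plus the offset it corresponds to in the source commit log.
type Update struct {
	SourceDC string
	Offset   uint64
	Payload  []byte
}

// Transport abstracts the wire protocol used to ship updates to a
// remote data center, keeping Herb decoupled from any single protocol.
type Transport interface {
	// Send delivers a batch of updates to the remote data center and
	// returns once the remote side has acknowledged them.
	Send(ctx context.Context, remoteDC string, updates []Update) error
	Close() error
}
```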
For the database reader, we initially prototyped a polling reader. However, database queries are resource-intensive, and we found that constantly polling zero- or low-traffic datastores generated unnecessary database load. Since this approach was neither efficient nor scalable, we implemented a streaming reader instead. We chose the commit logs as the streaming source, as these logs contain all recent updates and do not require database queries to capture their data.
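A streaming reader could be modeled roughly as follows; the `StreamReader` and `LogRecord` types are assumptions for illustration. Instead of polling the database, the reader tails the commit log and emits new records on a channel.

```go
package herb

import "context"

// LogRecord is a hypothetical commit-log entry: a monotonically
// increasing sequence number plus the raw change it describes.
type LogRecord struct {
	Sequence uint64
	Data     []byte
}

// StreamReader tails the database commit log starting at a given
// offset and emits each new record on the returned channel, so no
// polling queries ever hit the database itself.
type StreamReader interface {
	Stream(ctx context.Context, fromOffset uint64) (<-chan LogRecord, error)
}
```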
Ensuring local data center ordering of updates was an important part of our design, as this constraint greatly simplifies the logic of applications that consume data. These applications need not worry about the ordering at the local origin, since reading the updates from anywhere results in the same order. When Herb receives acknowledgement of a write from one data center, it updates the offset for that data center. Tracking offsets individually lets data centers run at different speeds during scenarios such as a restart or failure, while Herb still maintains the ordering of each data center’s updates.
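One way to implement this per-data-center offset tracking is sketched below, under the assumption that acknowledgements carry the sequence number of the last applied update; the `OffsetTracker` type is hypothetical.

```go
package herb

import "sync"

// OffsetTracker records, per remote data center, the highest update
// that data center has acknowledged. Each destination advances
// independently, so a slow or restarting data center never holds
// back replication to the others.
type OffsetTracker struct {
	mu      sync.Mutex
	offsets map[string]uint64 // remote DC name -> last acknowledged sequence
}

func NewOffsetTracker() *OffsetTracker {
	return &OffsetTracker{offsets: make(map[string]uint64)}
}

// Ack records an acknowledgement from a remote data center.
func (t *OffsetTracker) Ack(dc string, seq uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if seq > t.offsets[dc] {
		t.offsets[dc] = seq
	}
}

// Next returns the sequence number from which replication to the
// given data center should resume, for example after a restart.
func (t *OffsetTracker) Next(dc string) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.offsets[dc] + 1
}
```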
To understand how Herb maintains its ordering, assume we have updates a, b, c, d, and e, in that order, at DC1 (see Figures 7 and 8). Another data center, DC2, may have applied and acknowledged update d while DC3 is only at update a. In this example, if Herb at DC1 restarts, it will resume sending update e to DC2, while DC3 will still receive updates starting at b. Because all of these updates originated at one data center, their order is preserved in both DC2 and DC3.
Schemaless’ updates to the datastore are append only. In the case of a restart, we replay the updates from the last persisted offset. Since writes are idempotent, applying the updates again is not an issue. For instance, assume DC2 has received updates a, b, and c. Meanwhile, Herb in DC1 restarted, having received an acknowledgement only for update a. In this example, DC1 may replay updates starting at b, and DC2 may receive updates b and c again without affecting the datastore.
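Building on the hypothetical types from the sketches above, a replay loop after a restart might look like this; because the datastore is append-only and updates are idempotent, re-sending records past the last acknowledged offset is harmless.

```go
// replayTo resumes replication from one source data center to one
// remote data center after a restart. Records that were already
// applied remotely may be sent again; the append-only, idempotent
// datastore absorbs the duplicates without side effects.
func replayTo(ctx context.Context, sourceDC, remoteDC string,
	tracker *OffsetTracker, reader StreamReader, transport Transport) error {

	records, err := reader.Stream(ctx, tracker.Next(remoteDC))
	if err != nil {
		return err
	}
	for rec := range records {
		u := Update{SourceDC: sourceDC, Offset: rec.Sequence, Payload: rec.Data}
		if err := transport.Send(ctx, remoteDC, []Update{u}); err != nil {
			return err
		}
		tracker.Ack(remoteDC, rec.Sequence)
	}
	return nil
}
```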
Streaming
Uber stores petabytes of data split across multiple datastores. As our data grows, it is critical that our replication solution scales accordingly. To support this growth, we built Herb to be efficient while also reducing end-to-end replication latency between data centers. During our initial testing, we found that querying the database for reads would not scale. As an alternative, we came up with a much more efficient method: reading the database logs to identify the updated data that needs to be replicated across our data centers. This log-based streaming model enabled us to speed up Herb and reduce our end-to-end latency to milliseconds.
Figure 4, above, shows graphs taken from one of the datastores in a production instance of Mezzanine, our Schemaless datastore. This datastore receives around 550 to 900 cells per second; it takes Herb around 82 to 90 milliseconds to replicate these updates and make the cells available to be read by other data centers. Herb’s actual latency is even less than shown in Figure 4, because the median lag includes network time between data centers, which adds around 40 milliseconds to the total.
Commit logs
The logs read by Herb are transactional, recording every change applied to our databases. Each record entry is assigned a unique, monotonically increasing number. Herb cannot, of course, modify existing records in the logs, but it scans and interprets them to capture changes made to the databases.
As shown in Figure 5, above, the logs consist of a data file, containing the individual records, and an index file, which maps the records in a given data file to their disk offsets so that the data file does not need to be scanned. By reading the logs this way, Herb acquires changes with low latency and without impacting the performance of the database.
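As a rough illustration of how an index file avoids a full scan, assume its entries are (sequence number, disk offset) pairs sorted by sequence number; a binary search then locates the first record to read. The layout and names below are assumptions, not the actual on-disk format.

```go
package herb

import "sort"

// IndexEntry maps a record's sequence number to its byte offset in
// the corresponding data file.
type IndexEntry struct {
	Sequence   uint64
	FileOffset int64
}

// findStart returns the file offset of the first record whose
// sequence number is >= from, using binary search over the sorted
// index so the data file never has to be scanned from the beginning.
func findStart(index []IndexEntry, from uint64) (int64, bool) {
	i := sort.Search(len(index), func(i int) bool {
		return index[i].Sequence >= from
	})
	if i == len(index) {
		return 0, false // no record at or after `from` in this data file
	}
	return index[i].FileOffset, true
}
```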
Deployment model
From a deployment model perspective, the complete set of Schemaless instances are partitioned into subsets that we call “replication cohorts.”
A replication cohort is a deployment unit in Herb that contains multiple Schemaless instances. Instances within a cohort can share replication resources with each other, and one deployment handles one cohort of Schemaless instances. Critical instances can be isolated from others by placing them in their own cohort. Within a cohort, each instance is guaranteed to have a minimum set of resources.
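For illustration only, a cohort could be described by configuration along these lines; the field names are hypothetical, not Herb’s actual schema.

```go
package herb

// Cohort groups the Schemaless instances that share replication
// resources; one Herb deployment serves one cohort.
type Cohort struct {
	Name       string   // e.g. an isolated cohort for critical instances
	Instances  []string // Schemaless instances replicated by this cohort
	MinWorkers int      // guaranteed minimum replication resources per instance
}
```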
Scheduling
As each replication cohort contains multiple instances of running Herb workers, we need to ensure cooperation and coordination through a seamless scheduling process. To accomplish this, we decided to leverage Ringpop, Uber’s open source software designed for load balancing and coordination among applications. In our implementation, each Herb worker announces itself, becomes a node in the ring, and discovers others. Task distribution occurs after successful ring formation.
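The sketch below illustrates the general idea of decentralized task distribution: every worker hashes each task onto the current ring membership and keeps only the tasks it owns. It is a simplified stand-in using modular hashing, not Ringpop’s actual consistent-hashing API.

```go
package herb

import (
	"hash/fnv"
	"sort"
)

// ownerFor picks which worker owns a task by hashing the task name
// onto the sorted list of ring members. Every worker computes the
// same assignment independently once ring membership is known.
func ownerFor(task string, members []string) string {
	if len(members) == 0 {
		return ""
	}
	sorted := append([]string(nil), members...)
	sort.Strings(sorted)

	h := fnv.New32a()
	h.Write([]byte(task))
	return sorted[int(h.Sum32())%len(sorted)]
}

// myTasks filters the full task list down to the tasks owned by this
// worker, so distribution happens without a central scheduler.
func myTasks(self string, tasks, members []string) []string {
	var owned []string
	for _, t := range tasks {
		if ownerFor(t, members) == self {
			owned = append(owned, t)
		}
	}
	return owned
}
```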
Consistency
Herb guarantees the order of the updates it receives from its local data center, and preserves the ordering while transporting these updates to other data centers. As an example, Figure 7, above, shows that data center DC1 received four updates, which we are labelling 101, 201, 301, and 401, stored in monotonically increasing order. Herb maintains this update order when replicating the data at our second data center, DC2, first writing 101, then 201, 301, and 401.
Since the system replicates asynchronously, our data centers become eventually consistent. A write is considered complete as soon as the remote data center acknowledges it.
Validation and rollout
As Microsoft Distinguished Scientist Leslie Lamport famously wrote in 1987, “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” This quote illustrates the need to validate whether a distributed system is functioning properly, ensuring that even partial failures are not corrupting data. With Herb, we built tooling to validate the order of updates, and to make sure that our data is consistent between data centers.
Shadowing traffic
Although we tested individual components of Herb before launch, we needed to see how it would behave with live production traffic, and whether the replicated data was consistent. Using offline tools, we read from the production instance and used Herb to replicate those updates to a test instance. When we were satisfied with Herb’s consistency, we deployed it into production.
Continuous checking
We also built a near real-time validation framework to audit data replicated by Herb. This framework reports any data discrepancies caused by replication, letting us monitor the system continuously in production.
Next steps
In this version of Herb, our main focus was replicating data across geographic data centers and making our Schemaless system active-active. While building Herb, we identified another area for improvement in this domain: the Schemaless data flow, i.e., piping data from our OLTP datastores to our data warehouse and streaming it onward to consumers. With this in mind, as a next step we plan to build new features that transport these changelog streams to consumers and make them source-agnostic.
If building scalable systems with global impact interests you, consider applying for a role on our team!
Himank Chaudhary
Himank is the Tech Lead of Docstore at Uber. His primary focus area is building distributed databases that scale along with Uber's hyper-growth. Prior to Uber, he worked at Yahoo in the mail backend team to build a metadata store. Himank holds a master's degree in Computer Science from the State University of New York with a specialization in distributed systems.