Skip to main content
Engineering

Uber’s Real-time Data Intelligence Platform At Scale: Improving Gairos Scalability/Reliability

19 January 2021 / Global
Featured image for Uber’s Real-time Data Intelligence Platform At Scale: Improving Gairos Scalability/Reliability
Figure 1: Simplified architecture of Gairos shows major components in the platform.
Figure 2: Dispatch query service gets and displays the data for driver state transitions in San Francisco from Gairos.
Figure 3: Dispatch service gets driver states data for a given driver and displays.
Figure 4: Driver utilizations grouped by geo locations.
Figure 5: Surge multipliers for different hexagons in Oakland when there is a game.
Figure 6: The new architecture closes the loop for Gairos with the new data flow indicated by the red arrow.
Figure 7: High-level architecture shows the data flow in the platform. Red arrows represent new data flows, and the light green components represent two new optimizations: the Giaros Query Analyzer and Optimization Engine.
Figure 8: Benchmarking service will carry out tests and save test results.
Figure 9: A query for drivers in SF needs query all shards without sharding while it only queries one shard with sharding.
Figure 11: Four cities with different size of data (i.e., a greater volume of users on our platform) and different QPSs are partitioned into 4 shards based on given constraints.
Max # of docs per shardMin # of docs per shardMax/Min
Before47 million17 million2.76x
After30 million23 million1.3x
Figure 13: Latency with sharding is much lower for demand data and the difference increases as the number of clients increases, as depicted by this example use case.
Figure 14: QPS with sharding is 4x of QPS without sharding.
Figure 15: Latency with sharding is higher and the difference increases as the number of clients increases.
Figure 16: QPS with sharding is 4x of QPS without sharding.
Figure 17: Latency with sharding is higher when the number of clients is low and it is lower when the number of clients increases to 200+.
Figure 18: QPS with sharding is 4x of QPS without sharding.
Figure 19: CPU load for surge pricing cluster shows daily pattern and increases as time goes on during a day. Peak CPU load reduces from 60 to 10 and load for each node varies a little during a day.
Figure 20: Latency with caching is much lower and the difference increases a lot as the number of clients increases.
Figure 21: QPS with caching is 10x of QPS without caching.
Figure 22: Cache hit rate for supply_status is high with hit QPS around 50 while set QPS around 10.
Figure 23: Cache hit rate for demand_jobs is around 80% and hit has some spikes.
Figure 24: Cache hit rate is around 30% for supply_geodriver.
Figure 25: There is no cache hit at all for demand.
Figure 25: There is no cache hit at all for demand.
Figure 27: Check setting for each field based on usage.
Figure 28: The number of shards drops from around 40k to 20k after cleaning up these small indices.
Wenrui Meng

Wenrui Meng

Wenrui is a Senior Software Engineer at Uber working on real-time data metrics optimization.

Qing Xu

Qing Xu

Qing was an engineering manager of Marketplace Intelligence at Uber.

Yanjun Huang

Yanjun Huang

Yanjun Huang was a senior software engineer on Uber's Core Infrastructure team and is an Elasticsearch Expert.

Posted by Susan Shimotsu, Wenrui Meng, Qing Xu, Yanjun Huang

Category: