
Consistent Data Partitioning through Global Indexing for Large Apache Hadoop Tables at Uber

April 23, 2019 / Global
Figure 1: In our ingestion process, the global index distinguishes between inserts and updates to the dataset, and looks up the existing files that must be rewritten to reflect the updates.
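To make the lookup concrete, here is a minimal sketch of an index probe, assuming an HBase index table keyed by record key with a hypothetical `index:file_id` cell naming the data file that holds the record; the table, column family, and qualifier names are illustrative, not our exact schema:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexLookup {
  private static final byte[] CF = Bytes.toBytes("index");        // hypothetical column family
  private static final byte[] FILE_ID = Bytes.toBytes("file_id"); // hypothetical qualifier

  // Returns the file that already holds this record key (an update),
  // or null if the key has never been seen before (an insert).
  static String lookup(Table indexTable, String recordKey) throws Exception {
    Result result = indexTable.get(new Get(Bytes.toBytes(recordKey)));
    return result.isEmpty() ? null : Bytes.toString(result.getValue(CF, FILE_ID));
  }

  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("dataset_index"))) { // hypothetical table
      String fileId = lookup(table, "trip_12345"); // hypothetical record key
      System.out.println(fileId == null ? "insert" : "update -> " + fileId);
    }
  }
}
```

In practice the lookup would be batched (one multi-get per Spark partition rather than one RPC per record), but the insert-versus-update decision is the same.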
Figure 2: This overview of our data architecture shows how we integrate global indexing with the ingestion platform.
Figure 3: When source data is grouped during bootstrap ingestion such that it contains no updates, the global index lookup can be skipped. Once bootstrap ingestion is complete, corresponding indexes are bulk-uploaded to HBase in order to prepare the dataset to enter the next phase, incremental ingestion.
Figure 4: The model of indexes our Big Data ecosystem stores in HBase; the entities shown in green identify which files need to be updated for a given record in an append-plus-update dataset.
Figure 5: The layout of index entries in HFiles keeps them sorted by key and column.
Figure 6: The flatMapToPair transformation in Apache Spark does not preserve the ordering of entries, so a partition-isolated sort is performed. The partitioning itself is left unchanged so that each partition still corresponds to a non-overlapping key range.
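A partition-isolated sort can be sketched with `mapPartitionsToPair`: each partition is buffered and sorted locally, and `preservesPartitioning = true` keeps records in place so the key ranges stay non-overlapping. This is an illustrative sketch rather than our exact job; `byte[]` stands in for the serialized index entry:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class PartitionIsolatedSort {
  // Sorts entries within each partition without shuffling data across
  // partitions, so each partition still covers a non-overlapping key range
  // and can be written out as one HFile.
  static JavaPairRDD<String, byte[]> sortWithinPartitions(JavaPairRDD<String, byte[]> entries) {
    return entries.mapPartitionsToPair(iter -> {
      List<Tuple2<String, byte[]>> buffer = new ArrayList<>();
      iter.forEachRemaining(buffer::add);
      // The HFile format requires entries in ascending key order.
      buffer.sort((a, b) -> a._1().compareTo(b._1()));
      return buffer.iterator();
    }, true); // preservesPartitioning = true
  }
}
```

Buffering a whole partition in memory assumes partitions are sized to fit; a spillable sort would be needed otherwise.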
Figure 7: HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process.
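The upload then hands those files to HBase through its bulk-load path, which moves complete HFiles onto the region servers instead of replaying individual writes. A minimal sketch using `LoadIncrementalHFiles`; the staging directory and table name are placeholder assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadIndex {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("dataset_index");    // hypothetical table
    Path stagingDir = new Path("/staging/dataset_index_hfiles"); // hypothetical HFile directory
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      // Each HFile is handed to the region server owning its key range;
      // no data passes through the normal write path (WAL and memstore).
      new LoadIncrementalHFiles(conf).doBulkLoad(
          stagingDir, conn.getAdmin(), conn.getTable(tableName), conn.getRegionLocator(tableName));
    }
  }
}
```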
Figure 8: Three Apache Spark jobs corresponding to three different datasets each access their respective HBase index table, creating load on the HBase region servers hosting those tables.
Figure 9: For a single dataset using the global index, adding more servers to the HBase cluster yields a linear increase in QPS, while the dataset's QPSFraction remains constant.
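QPSFraction can be read as each dataset's share of the cluster's aggregate index-serving capacity, so a job's request budget grows as region servers are added even though its fraction is fixed. A hedged sketch of that throttling idea, with the per-server capacity figure and Guava's RateLimiter as illustrative assumptions:

```java
import com.google.common.util.concurrent.RateLimiter;

public class IndexQpsBudget {
  public static void main(String[] args) {
    double qpsPerRegionServer = 10_000; // assumed per-server capacity, for illustration only
    int numRegionServers = 50;
    double qpsFraction = 0.1;           // this dataset's configured share of cluster QPS

    // Assuming QPS scales roughly linearly with server count (Figure 9),
    // the budget rises with cluster size while the fraction stays constant.
    double budget = qpsFraction * qpsPerRegionServer * numRegionServers;
    RateLimiter limiter = RateLimiter.create(budget);

    limiter.acquire(); // each index request takes a permit before hitting HBase
    // ... issue the HBase get/put here ...
  }
}
```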
Nishith Agarwal

Nishith Agarwal currently leads the Hudi project at Uber and works largely on data ingestion. His interests lie in large-scale distributed systems. One of the initial engineers on Uber's data team, Nishith helped scale Uber's data platform to over 100 petabytes while reducing data latency from hours to minutes.

Kaushik Devarajaiah

Kaushik Devarajaiah is the Tech Lead for LedgerStore at Uber. His primary focus is building distributed gateways and databases that scale with Uber's hyper-growth. Previously, he worked on scaling Uber's Data Infrastructure to handle over 100 petabytes of data. Kaushik holds a master's degree in Computer Science from SUNY Stony Brook University.

Posted by Nishith Agarwal, Kaushik Devarajaiah

Category: Data / ML