
DBEvents: A Standardized Framework for Efficiently Ingesting Data into Uber’s Apache Hadoop Data Lake

March 14, 2019 / Global
Figure 1: Our source-pluggable bootstrap library reads from the HDFS backup to prepare datasets for Marmaray, our ingestion platform.
Figure 2: StorageTapper reads the binary changelogs from MySQL, encodes the events in Apache Avro, and sends them to Apache Kafka or backs them up in HDFS. These events can then be used to reconstruct datasets in other systems such as Apache Hive.
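To make the Figure 2 flow concrete, here is a minimal sketch of the encode-and-publish step: a changelog event is serialized with Apache Avro's binary encoding and published to a Kafka topic. The schema, topic name, and field values are illustrative assumptions, not StorageTapper's actual implementation.

```java
import java.io.ByteArrayOutputStream;
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ChangelogPublisher {
    // Illustrative Avro schema for a single changelog event (not the real one).
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"ChangelogEvent\",\"fields\":["
        + "{\"name\":\"row_key\",\"type\":\"string\"},"
        + "{\"name\":\"ref_key\",\"type\":\"long\"},"
        + "{\"name\":\"payload\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        // Build an Avro record for one decoded binlog row event.
        GenericRecord event = new GenericData.Record(SCHEMA);
        event.put("row_key", "trip_id:42");
        event.put("ref_key", System.currentTimeMillis());
        event.put("payload", "{\"status\":\"completed\"}");

        // Serialize the record with Avro's binary encoding.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
        encoder.flush();

        // Publish the encoded bytes to a Kafka topic (topic name is illustrative).
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("mysql.changelog.trips", "trip_id:42", out.toByteArray()));
        }
    }
}
```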
Figure 3: The DBEvents heatpipe library encodes the data and Schema-Service acts as the gateway for all schemas. This is how schematization of all data is achieved.
Figure 4: In DBEvents, each source type emits changelog events to Kafka in a unified message format.
Hadoop metadata fields
Row Key: A unique key per source table, used to identify the source table row and to merge partial changelogs for that row.
Reference Key: The version of the received changelog; it must increase monotonically. This key determines whether the data represents a more recent update for a particular row.
Changelog Columns: An array<record{"name":string, "ref_key":long, "Hadoop_Changelog_Fields":array<string>}> listing the column names, ref_key, and all changed field names updated by the current message event.
Source: The type of the source that generated this changelog. A few examples include Apache Kafka, Schemaless, Apache Cassandra, and MySQL.
Timestamp: The creation time of the event in milliseconds. The timestamp attribute has multiple uses, but most importantly it is used to monitor latency and completeness. (We refer to the creation of an event as the time when a data schematization service such as StorageTapper actually schematizes the event before pushing it to Apache Kafka.)
isDeleted: [True/False]. A boolean value to support the deletion of a row_key in Hive tables.
Error Exception: A string capturing the exception or issue encountered while sending the current changelog message (null when there is no error). If there is a schema issue with the source data, Error Exception reflects the exception received, which can later be used to track down the source issue or to fix and re-publish the message.
Error Source Data: A string containing the actual source data that caused the error (null when there is no error). Because a problematic message cannot be ingested into the main table, we move it to the related error table and use this data to work with the producer on a fix.
ForceUpdate: [True/False]. A boolean value that ensures this changelog is applied on top of existing data. Normally, a ref_key older than the last seen ref_key is considered a duplicate and skipped; when this flag is set, however, the changelog is applied regardless of the hadoop_ref_key field.
Data Center: The data center where the event originated. This field is very helpful for tracing messages and debugging potential issues, especially in an active-active (all-active) architecture. Heatpipe automatically fills in this value based on the data center from which the message is published.
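Taken together, these fields form the unified changelog envelope shown in Figure 4. Below is a minimal sketch in Java of that envelope and of the ref_key comparison used during merging; the class and method names are our own illustrations, not DBEvents' actual types.

```java
import java.util.List;
import java.util.Map;

// Illustrative representation of the DBEvents metadata envelope described above.
public class ChangelogMetadata {
    public String rowKey;              // unique key identifying the source table row
    public long refKey;                // monotonically increasing changelog version
    public List<ChangedColumn> changelogColumns; // columns updated by this event
    public String source;              // e.g. "mysql", "cassandra", "schemaless"
    public long timestampMs;           // event creation time, in milliseconds
    public boolean isDeleted;          // marks deletion of the row_key in Hive
    public String errorException;      // exception text, or null when there is no error
    public String errorSourceData;     // offending source payload, or null
    public boolean forceUpdate;        // apply even if refKey is older than the last seen
    public String dataCenter;          // data center where the event originated

    public static class ChangedColumn {
        public String name;
        public long refKey;
        public List<String> changedFieldNames;
    }

    // A changelog is applied only when its refKey is newer than the last version
    // seen for its row, unless ForceUpdate overrides the comparison.
    public static boolean shouldApply(ChangelogMetadata event, Map<String, Long> lastSeenRefKey) {
        Long lastSeen = lastSeenRefKey.get(event.rowKey);
        if (event.forceUpdate || lastSeen == null || event.refKey > lastSeen) {
            lastSeenRefKey.put(event.rowKey, Math.max(
                lastSeen == null ? Long.MIN_VALUE : lastSeen, event.refKey));
            return true;
        }
        return false; // older than last seen: treated as a duplicate and skipped
    }
}
```

With a check like this, replayed or out-of-order changelogs stay idempotent: only strictly newer versions of a row, or explicitly forced updates, change the merged state.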
Figure 5: All schema non-conforming data is written to DBEvents’ special error tables.
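The error-table path in Figure 5 can be sketched as a simple try/catch around schematization: data that fails schema validation is written to an error table together with its Error Exception and Error Source Data rather than being dropped. SchemaValidator, MainTable, and ErrorTable below are hypothetical interfaces used purely for illustration.

```java
// Minimal sketch of routing non-conforming data to an error table, following the
// Error Exception / Error Source Data semantics described above.
public class ErrorTableRouter {
    public void ingest(String rawPayload, SchemaValidator validator,
                       MainTable mainTable, ErrorTable errorTable) {
        try {
            // Attempt to schematize the payload; a conforming record goes to the main table.
            mainTable.write(validator.validateAndDecode(rawPayload));
        } catch (Exception e) {
            // Non-conforming data is diverted to the error table with the exception
            // text and the original payload, so the producer can fix and re-publish it.
            errorTable.write(e.toString(), rawPayload);
        }
    }

    // Hypothetical stand-ins, not DBEvents' actual classes.
    interface SchemaValidator { Object validateAndDecode(String raw) throws Exception; }
    interface MainTable { void write(Object record); }
    interface ErrorTable { void write(String errorException, String errorSourceData); }
}
```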
Nishith Agarwal

Nishith Agarwal currently leads the Hudi project at Uber and works largely on data ingestion. His interests lie in large-scale distributed systems. Nishith is one of the initial engineers of Uber's data team and helped scale Uber's data platform to over 100 petabytes while reducing data latency from hours to minutes.

Ovais Tariq

Ovais is a Sr. Manager on the Core Storage team at Uber. He leads the Operational Storage Platform group, with a focus on providing a world-class platform that powers all the critical business functions and lines of business at Uber. The platform serves tens of millions of QPS with an availability of 99.99% or more and stores tens of petabytes of operational data.

Posted by Nishith Agarwal, Ovais Tariq
