Skip to main content
Data / ML

Engineering Data Analytics with Presto and Apache Parquet at Uber

July 11, 2017 / Global
Featured image for Engineering Data Analytics with Presto and Apache Parquet at Uber
Figure 1: Uber’s Presto architecture incorporates one coordinator node that analyzes and schedules tasks and several worker nodes that scan and aggregate data for use by the client.
Figure 2: Supported by our tech stack, Uber’s Hadoop infrastructure captures and stores data from a variety of sources.
Figure 3: Parquet is Uber Engineering’s storage solution for our Hadoop ecosystem, partitioning data horizontally into rows and then vertically into columns for easy compression.
Figure 4: The original open source Parquet reader does not fully incorporate columnar storage, making it inefficient to analyze large sets of Uber data.
Figure 5: Uber’s new open source reader can skip over unnecessary data via nested column pruning.
Figure 6: Our new reader enhances querying by reading columns directly as opposed to row-by-row.
Figure 7: Predicate pushdowns allow us to skip reading Parquet row groups to save disk IOs. In this example, the query is looking for city_id = 12, one row group city_id max is 10, new Parquet reader will skip this row group.
Figure 8: Like predicate pushdowns, dictionary pushdowns are executed in one step that simultaneously reads and evaluates data columns while building columnar blocks for our Presto engine. In this example, the query is looking for city_id = 12; since one row group’s city_id dictionary includes the IDs 3, 5, 9, 14, 21, the new reader will skip this group.
Figure 9: Lazy reads are performed only when they match the predicate, saving CPU and memory.
Figure 10: Our new reader demonstrated 2-10X speedup for Uber’s benchmark SQL queries.
Zhenxiao Luo

Zhenxiao Luo

Zhenxiao Luo is a senior software engineer with Uber's Interactive SQL team.

Posted by Zhenxiao Luo

Category: