Engineering, Data / ML

Pinot for Low-Latency Offline Table Analytics

August 29 / Global
Figure 1: A sample Hive-to-RTA pipeline that runs daily in uWorc.
Figure 2: Spark Partitioner for Overwrite Tables that have Column Partitioning Enabled.
Figure 3: Spark Partitioner for Append Tables. The segments-per-day constraint is met by assigning the records of each day to that day's segments in a round-robin fashion.
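The round-robin assignment described in Figure 3 can be sketched in plain Python. This is a minimal single-process illustration, not the partitioner from the post: the function name `assign_round_robin`, the `(day, payload)` record shape, and the per-day counter are all assumptions made for the example. A real Spark job would need to coordinate this across tasks.

```python
from collections import defaultdict


def assign_round_robin(records, segments_per_day):
    """Assign each (day, payload) record to a segment slot within its day.

    Hypothetical helper: slots cycle 0, 1, ..., segments_per_day - 1 per day,
    so no day ever produces more than `segments_per_day` segments.
    """
    if segments_per_day <= 0:
        raise ValueError("segments_per_day must be positive")
    next_slot = defaultdict(int)  # per-day running counter
    assigned = []
    for day, payload in records:
        slot = next_slot[day] % segments_per_day
        next_slot[day] += 1
        assigned.append((day, slot, payload))
    return assigned


if __name__ == "__main__":
    rows = [("2024-01-01", i) for i in range(5)] + [("2024-01-02", 0)]
    for row in assign_round_robin(rows, segments_per_day=3):
        print(row)
```

Because the counter is kept per day, each day's records spread evenly across its segments regardless of how records from different days are interleaved.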
Figure 4: How a record is mapped to a Spark partition. The record’s column is assigned based on the date the record represents. Each date has 4 segments, and the row can be assigned either based on the hash of a partitioning column or randomly.
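The mapping in Figure 4 can be sketched as follows, under several assumptions that are not in the post: dates are numbered from a hypothetical `start_date`, each date owns a contiguous block of 4 Spark partitions, and CRC32 stands in for whatever hash the real partitioner uses. The function name `spark_partition_for` is illustrative only.

```python
import random
import zlib
from datetime import date

SEGMENTS_PER_DAY = 4  # Figure 4 shows 4 segments per date


def spark_partition_for(record_date, start_date, partition_value=None, rng=None):
    """Map a record to a Spark partition index (hypothetical layout).

    The date selects a block of SEGMENTS_PER_DAY partitions; the slot inside
    the block comes from a hash of the partitioning column when one is given,
    otherwise it is chosen at random.
    """
    day_index = (record_date - start_date).days
    if day_index < 0:
        raise ValueError("record_date is before start_date")
    if partition_value is not None:
        # CRC32 is an assumption; it is used here because it is deterministic
        # across processes, unlike Python's built-in hash() for strings.
        slot = zlib.crc32(partition_value.encode("utf-8")) % SEGMENTS_PER_DAY
    else:
        slot = (rng or random).randrange(SEGMENTS_PER_DAY)
    return day_index * SEGMENTS_PER_DAY + slot


if __name__ == "__main__":
    start = date(2024, 1, 1)
    print(spark_partition_for(date(2024, 1, 2), start, "rider_42"))
    print(spark_partition_for(date(2024, 1, 2), start))  # random slot
```

With this layout, a record dated one day after `start_date` always lands in partitions 4 through 7, and records sharing a partitioning-column value within a day land in the same partition.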
Figure 5: Sample Segment Names.
Figure 6: Spark Partitioner for Append Tables that also have Column Partitioning enabled on a column. Before calling the partitioner, we concatenate the time column and the partitioning column values with “\0” as the delimiter.
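The key concatenation mentioned in Figure 6 can be illustrated with a short sketch. The helper names `composite_key` and `split_composite_key` are hypothetical, and treating both values as strings is an assumption; only the "\0" delimiter comes from the caption.

```python
DELIMITER = "\0"  # delimiter named in the Figure 6 caption


def composite_key(time_value: str, partition_value: str) -> str:
    """Join the time column and partitioning column values into one key."""
    if DELIMITER in time_value or DELIMITER in partition_value:
        # A NUL inside either value would make the key ambiguous to split.
        raise ValueError("values must not contain the delimiter")
    return f"{time_value}{DELIMITER}{partition_value}"


def split_composite_key(key: str) -> tuple:
    """Recover (time_value, partition_value) from a composite key."""
    time_value, partition_value = key.split(DELIMITER, 1)
    return time_value, partition_value


if __name__ == "__main__":
    key = composite_key("2024-01-01", "rider_42")
    print(repr(key))
    print(split_composite_key(key))
```

A NUL character is a reasonable delimiter choice because it is unlikely to occur in date strings or typical column values, so the partitioner can split the key back into its two parts without ambiguity.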
Figure 7: Example PySpark app for exporting Pinot data into Hive.
Figure 8: Data flow from the Pinot table to the Spark executor and the Hive sink. The result set is loaded and transferred in chunks on both the Pinot server and the Spark executor, which enables extracting datasets larger than the heap.
Ankit Sultana

Ankit Sultana is a Staff Software Engineer on the Real Time Analytics (RTA) team at Uber, and the technical lead for the RTA query stack.

Caner Balci

Caner Balci is a Software Engineer on the Real-Time Analytics (RTA) team at Uber.

Posted by Ankit Sultana, Caner Balci