Uber AI, Engineering

Introducing Petastorm: Uber ATG’s Data Access Library for Deep Learning

September 21, 2018 / Global
Figure 1: In this deep learning cluster architecture, data from a central dataset is being used by three compute nodes.
Row   camera #1     camera #2     Lidar       Labels
1     <camera1-1>   <camera2-1>   <lidar 1>   <labels 1>
2     <camera1-2>   <camera2-2>   <lidar 2>   <labels 2>
3     <camera1-3>   <camera2-3>   <lidar 3>   <labels 3>
        Row storage   Columnar storage
row 1   <camera1-1>   <camera1-1>
        <camera2-1>   <camera1-2>
        <lidar 1>     <camera1-3>
        <labels 1>    <camera2-1>
row 2   <camera1-2>   <camera2-2>
        <camera2-2>   <camera2-3>
        <lidar 2>     <lidar 1>
        <labels 2>    <lidar 2>
row 3   <camera1-3>   <lidar 3>
        <camera2-3>   <labels 1>
        <lidar 3>     <labels 2>
        <labels 3>    <labels 3>
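The two orderings listed above can be reproduced with a few lines of Python. This is only an illustrative sketch: the field names and the helper functions `row_layout` and `columnar_layout` are made up for this example and are not part of Petastorm or Parquet.

```python
# Toy dataset with the same three rows and four fields as the table above.
rows = [
    {"camera1": "camera1-1", "camera2": "camera2-1", "lidar": "lidar 1", "labels": "labels 1"},
    {"camera1": "camera1-2", "camera2": "camera2-2", "lidar": "lidar 2", "labels": "labels 2"},
    {"camera1": "camera1-3", "camera2": "camera2-3", "lidar": "lidar 3", "labels": "labels 3"},
]


def row_layout(rows):
    """Row storage: all fields of row 1, then all fields of row 2, and so on."""
    return [value for row in rows for value in row.values()]


def columnar_layout(rows):
    """Columnar storage: every value of the first field, then of the second, and so on."""
    fields = list(rows[0].keys())
    return [row[field] for field in fields for row in rows]


print(row_layout(rows))
print(columnar_layout(rows))
```

In the columnar ordering, all values of one field sit next to each other, so a reader that needs only a subset of the fields can skip the rest.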
Figure 2. A dataset is generated by combining multiple data sources into a single tabular structure. The same dataset can be used multiple times for model training and evaluation.
Figure 3. N-grams are constructed while reading the dataset. An n-gram cannot span Parquet row-group boundaries.
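The row-group constraint in this caption can be sketched as a sliding window that restarts at every row group. The function name `ngrams_within_row_groups` and the list-of-lists representation of row groups are assumptions made for illustration; this is not Petastorm's implementation.

```python
def ngrams_within_row_groups(row_groups, n):
    """Yield tuples of n consecutive rows without crossing a row-group boundary.

    `row_groups` is a list of lists, each inner list standing in for the rows
    of one Parquet row group (a hypothetical in-memory representation).
    """
    for group in row_groups:
        # A group with fewer than n rows produces an empty range, so no n-gram.
        for start in range(len(group) - n + 1):
            yield tuple(group[start:start + n])


# Rows 0-2 are in the first row group, rows 3-4 in the second.
example = list(ngrams_within_row_groups([[0, 1, 2], [3, 4]], n=2))
print(example)
```

Rows 2 and 3 are adjacent in the dataset but belong to different row groups, so the pair (2, 3) is never produced.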
Figure 4. Data shuffling is achieved by randomly selecting a row-group to load, and then by placing individual samples into an in-memory shuffling buffer.
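A minimal sketch of the two-level shuffle this caption describes, assuming row groups are available as in-memory lists. The function `shuffled_rows` and its `buffer_size` and `seed` parameters are hypothetical names for this example, not Petastorm's API.

```python
import random


def shuffled_rows(row_groups, buffer_size, seed=None):
    """Yield every row once, shuffled at two levels.

    Level 1: row groups are visited in random order.
    Level 2: rows pass through an in-memory buffer; once it holds
    `buffer_size` rows, a randomly chosen row is emitted for each row added.
    """
    rng = random.Random(seed)
    order = list(range(len(row_groups)))
    rng.shuffle(order)

    buffer = []
    for index in order:
        for row in row_groups[index]:
            buffer.append(row)
            if len(buffer) >= buffer_size:
                # Swap a random row to the end so it can be popped in O(1).
                pick = rng.randrange(len(buffer))
                buffer[pick], buffer[-1] = buffer[-1], buffer[pick]
                yield buffer.pop()

    # Drain whatever is left once all row groups have been read.
    rng.shuffle(buffer)
    yield from buffer


sample = list(shuffled_rows([[1, 2, 3], [4, 5], [6]], buffer_size=3, seed=0))
print(sample)
```

A larger buffer mixes rows from more row groups at the cost of more memory; with a buffer of size 1, rows keep their order within each row group and only the row-group order is randomized.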
Figure 5. Petastorm feeds non-overlapping subsets of a dataset to the different machines participating in distributed training.
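One simple way to obtain non-overlapping subsets is to assign row groups to machines by index modulo the number of machines. The sketch below assumes that scheme purely for illustration; the function `shard_row_groups` and its parameter names are hypothetical, and the caption does not say how Petastorm partitions the data internally.

```python
def shard_row_groups(num_row_groups, shard_count, cur_shard):
    """Return the row-group indices assigned to one machine.

    Row group i goes to shard i % shard_count, so the shards are disjoint
    and together cover every row group.
    """
    if shard_count < 1 or not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must satisfy 0 <= cur_shard < shard_count")
    return [i for i in range(num_row_groups) if i % shard_count == cur_shard]


# Ten row groups split across three machines.
shards = [shard_row_groups(10, shard_count=3, cur_shard=s) for s in range(3)]
print(shards)
```

Because the assignment depends only on the row-group index, each machine can compute its own subset without coordinating with the others.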
Figure 6. If the local cache is enabled, the data is downloaded only once per session.
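The download-once behavior can be sketched as a small disk cache that calls the fetch function only on a miss. The class `LocalDiskCache`, its `get` method, and the `fake_download` helper are hypothetical names for this example and do not describe Petastorm's own cache.

```python
import hashlib
import os
import pickle
import tempfile


class LocalDiskCache:
    """Store fetched values as files in a local directory, keyed by a hash."""

    def __init__(self, directory):
        self._directory = directory
        os.makedirs(directory, exist_ok=True)

    def get(self, key, fetch):
        """Return the cached value for `key`, calling `fetch()` only on a miss."""
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        path = os.path.join(self._directory, name)
        if os.path.exists(path):
            with open(path, "rb") as cached:
                return pickle.load(cached)
        value = fetch()
        with open(path, "wb") as cached:
            pickle.dump(value, cached)
        return value


# A temporary directory stands in for the per-session cache location.
with tempfile.TemporaryDirectory() as cache_dir:
    cache = LocalDiskCache(cache_dir)
    download_log = []

    def fake_download():
        # Stand-in for a remote read; records each time it is called.
        download_log.append("row-group-0")
        return [1, 2, 3]

    first = cache.get("row-group-0", fake_download)
    second = cache.get("row-group-0", fake_download)

print(download_log)
```

The second `get` call is served from disk, so the simulated download runs only once.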
Figure 7. Petastorm includes components that support dataset generation and reading. Unischema defines a common data schema that is used by both.
Robbie Gruener

Robbie Gruener is a software engineer on the Uber ATG Perception team.

Yevgeni Litvin

Yevgeni Litvin is a senior software engineer on the Uber ATG Perception team.

Posted by Robbie Gruener, Owen Cheng, Yevgeni Litvin