Engineering, Data / ML

Selective Column Reduction for DataLake Storage Cost Efficiency

September 20, 2023 / Global
Figure 1: A sample table in logic view
Figure 2: Row-oriented storage on disk or network
Figure 3: Column-oriented storage on disk or network
Figure 4: The process of translating a file to remove columns with the original Parquet reader and writer
Figure 5: The new process of translating a file to remove columns with the selective column pruner
Figure 6: The copy & skip process in the selective column pruner
Figure 7: The benchmarking result comparison between the selective column pruner and Spark
Figure 8: Code snippet for parallel execution in Spark
Figure 9: Using Column Pruning Tool to translate tables
Xinli Shang

Xinli Shang is a Manager on the Uber Big Data Infra team, Apache Parquet PMC Chair, Presto Committer, and Uber Open Source Committee member. He leads the Apache Parquet community and contributes to several other communities. He also leads several initiatives on data formats for storage efficiency, security, and performance, and is passionate about tuning large-scale services for performance, throughput, and reliability.

Kai Jiang

Kai Jiang is a Senior Software Engineer on Uber’s Data Platform team. He works on the Spark ecosystem and on big data file format encryption and efficiency. He is also a contributor to Apache Beam, Parquet, and Spark.

Ryan Chen

Ryan Chen is a Software Engineer on the Uber Data Infra team. He primarily works on Kafka and ZooKeeper, with additional experience in Spark and Parquet from this project.

Jing Zhao

Jing Zhao is a Principal Engineer on the Data team at Uber. He is a committer and PMC member of Apache Hadoop and Apache Ratis.

Mingmin Chen

Mingmin Chen is a Director of Engineering and head of the Data Infrastructure Engineering team at Uber. He has led the team in building and operating a Hadoop data lake that holds multiple exabytes of data, Kafka infrastructure that handles tens of trillions of messages per day, and compute infrastructure that runs hundreds of thousands of compute jobs per day. His team builds highly scalable, highly reliable, yet efficient data infrastructure with innovative ideas while leveraging many open-source technologies such as Hadoop (HDFS/YARN), Hudi, Kafka, Spark, Flink, and ZooKeeper.

Mohammad Islam

Mohammad Islam is a Distinguished Engineer at Uber. He currently works within the Engineering Security organization to enhance the company's security, privacy, and compliance measures. Before his current role, he co-founded Uber’s big data platform. Mohammad is the author of an O'Reilly book on Apache Oozie and serves as a Project Management Committee (PMC) member for Apache Oozie and Tez.

Karthik Natarajan

Karthik Natarajan is a Senior Manager, Engineering at Uber. He leads Ingestion and Datalake Storage (Hudi) within the Uber Data Platform organization.

Ajit Panda

Ajit Panda is a Senior Manager, Engineering at Uber. He leads security, privacy, and compliance engineering within the Uber Data Platform organization.

Posted by Xinli Shang, Kai Jiang, Ryan Chen, Jing Zhao, Mingmin Chen, Mohammad Islam, Karthik Natarajan, Ajit Panda