Building a Better Big Data Architecture: Meet Uber’s Presto Team
October 9, 2019 / Global
Uber’s daily operations generate data, such as the number of trip requests or food orders at any given time, that can show us how and where to improve our services. However, this information is only truly useful if we can retrieve it when we need it. Lacking access to our business data would be like having a large tank of water without any faucets. To give our internal analysts insights to improve our operations, we needed to find the right data query engine.
Many of these advanced systems are available, but our teams found that the open source Presto, a data source-agnostic SQL query engine, best fit Uber’s current needs.
Our Big Data infrastructure encompasses a variety of data sources, each leveraging the technology that works best for its requirements, from real-time streaming to large data lakes. Presto lets internal users at Uber run SQL queries on a wide variety of database technologies. Presto’s versatility gives us the ability we need to make smart, data-driven business decisions and run critical business operations.
Uber engineers have embraced Presto development, writing and contributing database connectors and other improvements back to the open source community. Uber’s Presto experts support over a thousand nodes on our Presto cluster. These nodes run about 400,000 queries per day.
Recognizing Presto’s value, Uber has joined The Presto Foundation as a founding member. Under the umbrella of the Linux Foundation, the Presto Foundation works to advance and develop increasingly powerful SQL query technology.
We sat down with Uber’s Presto developers to find out what they like about this open source technology and why it’s so valuable to our company:
Girish Baliga, Interactive Analytics Team Manager
I manage the Interactive Analytics team at Uber. My team optimizes Presto for Uber use cases, maintains our production Presto deployments, and manages our data warehouse on Vertica, a popular interactive data analytics platform.
What is your favorite engineering task?
I generally enjoy optimizing SQL queries for our operations, business, and data science users. My favorite task is helping our users solve critical problems on Presto. For example, one of our users runs weekly document audits on our driver partners in Canada. He had a Presto SQL query that was failing because it was running out of memory. It turned out that his query was moving around too much data in memory while computing a RANK() function. I refactored the query to read the document data after the rank computations, and his query now runs in under a minute. I could tell several similar stories about helping various business components at Uber every week. It’s inspiring to help colleagues get critical business work done efficiently on Presto.
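To make the refactoring Girish describes concrete, here is a minimal sketch of the general idea: rank over a narrow set of columns first, then join in the wide document data only for the rows that survive. It is not the actual audit query. The table names (`document_meta`, `document_body`), columns, and sample rows are hypothetical, and SQLite (3.25 or later, for window functions) stands in for Presto so the example can run anywhere.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical audit tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE document_meta (
            doc_id INTEGER PRIMARY KEY, driver_id INTEGER, uploaded_at TEXT);
        CREATE TABLE document_body (
            doc_id INTEGER PRIMARY KEY, payload TEXT);
    """)
    meta = [(1, 100, "2019-01-05"), (2, 100, "2019-03-01"),
            (3, 200, "2019-02-10"), (4, 200, "2019-02-01")]
    conn.executemany("INSERT INTO document_meta VALUES (?, ?, ?)", meta)
    # The payload stands in for the wide document data.
    conn.executemany("INSERT INTO document_body VALUES (?, ?)",
                     [(doc_id, f"payload-{doc_id}") for doc_id, _, _ in meta])
    return conn


# Before: the wide payload column is carried through the window function,
# so every document is held while ranks are computed.
EAGER_SQL = """
    SELECT driver_id, doc_id, payload FROM (
        SELECT m.driver_id, m.doc_id, b.payload,
               RANK() OVER (PARTITION BY m.driver_id
                            ORDER BY m.uploaded_at DESC) AS rnk
        FROM document_meta AS m
        JOIN document_body AS b ON b.doc_id = m.doc_id
    ) AS ranked
    WHERE rnk = 1
    ORDER BY driver_id
"""

# After: rank over the narrow metadata only, then read the document data
# just for the top-ranked rows.
LATE_SQL = """
    WITH ranked AS (
        SELECT driver_id, doc_id,
               RANK() OVER (PARTITION BY driver_id
                            ORDER BY uploaded_at DESC) AS rnk
        FROM document_meta
    )
    SELECT r.driver_id, r.doc_id, b.payload
    FROM ranked AS r
    JOIN document_body AS b ON b.doc_id = r.doc_id
    WHERE r.rnk = 1
    ORDER BY r.driver_id
"""


def latest_documents_eager(conn):
    return conn.execute(EAGER_SQL).fetchall()


def latest_documents_late(conn):
    return conn.execute(LATE_SQL).fetchall()
```

Both queries return the same rows. The second form simply keeps the large column out of the ranking step, which is the kind of change that can relieve memory pressure in an engine that processes data in memory.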
What are Presto’s advantages?
Presto is an innovative reimagining of a data analytics SQL engine that runs queries concurrently on a large, shared cluster. It targets the vast majority of data analytics queries whose results fit in volatile memory, trading dedicated resource management for simplicity of design and operation. Since it executes all its functions in memory, Presto is incredibly fast and interactive.
Presto evolved to handle large critical data analytics workloads at Facebook, so it’s optimized for high query throughput to handle the majority of an organization’s analytics workload. It is also very easy to set up and operate. Presto is very stable and robust, even at Uber’s immense scale.
What do you like about working with Presto?
Presto is very well designed and quite extensible. We have successfully extended it to work with the Apache Parquet file format that we use at Uber. We’ve also written connectors for storage systems like Apache Pinot, currently undergoing incubation at the Apache Software Foundation, and Elasticsearch, which store critical business data at Uber.
I also appreciate that Presto has a very active and robust community with lots of contributors from a variety of companies and institutions across the world. Since the community is so engaged, they’ve extended Presto for an array of data formats, storage systems, and use cases. Presto’s comprehensive code options and simple SQL interface make it an ideal interface to query and join data across multiple diverse storage systems at Uber.
Devesh Agrawal, Data Analytics Software Engineer
I am a software engineer on Uber’s Data Analytics team. I work on Presto and Apache Pinot primarily. I also really appreciate opportunities to mentor other engineers.
What open source projects do you contribute to and/or use?
I have contributed to Apache Superset, an enterprise-ready business intelligence web application currently undergoing incubation at the Apache Software Foundation. I’ve also been working on contributions to Apache Pinot and Presto that aren’t yet pushed upstream. On my team at Uber, we use a variety of open source software, including HDFS, Apache Hive, and, of course, Presto.
Why did your team choose to use Presto?
There are not many full SQL open source engines in Java out there. Plus, Presto has a robust plugin/connector model that allows federating atop other engines.
Compared to Apache Hive, I find the Presto codebase very developer-friendly. The development community has made it easy to integrate Presto with IDEs and run it on laptops, which makes onboarding and debugging easy.
What external organizations does Uber work with for Presto development? How does that collaboration work?
My colleagues and I mainly work with the Facebook team. Our collaboration is very friendly and impromptu. We often go to their office and discuss code on the whiteboard, engineer-to-engineer.
What are your future plans to contribute to Presto?
I am interested in achieving the holy grail of low-latency, full-featured SQL over real-time data. Current systems make sacrifices in one or more of these three dimensions. To remedy those limitations, we are working on low-latency Presto on top of other real-time engines like Apache Pinot and AresDB, Uber’s open source GPU-powered, real-time analytics engine. So far, we’ve been able to achieve low latency, with overheads measured under 50 milliseconds, and support a range of queries, including joins, filters, and aggregation.
Bhavani Sudha Saktheeswaran, Data Analytics Software Engineer
I am a software engineer on Uber’s Data Analytics team. I focus mainly on optimizing Presto’s interactions with the HDFS NameNode. I also contribute to both the Presto and Apache Hudi open source projects.
What do you like about working with Presto?
Presto is very lightweight and flexible. It is fairly easy to develop a connector to any data source and query it from Presto.
What gives Presto the edge over other querying options for your team’s use case?
Presto’s superpower is querying heterogeneous data sources in a single query. It hides any complexities neatly behind SQL abstractions. I love Presto’s ability to quickly analyze different data sources without having to hop across different query platforms and correlate the results using custom pipelines.
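As a rough illustration of what querying heterogeneous sources in a single query looks like, the sketch below joins tables that live in two separate databases with one SQL statement. It is only an analogy: two SQLite in-memory databases stand in for two different storage systems, and the names `realtime`, `cities`, and `orders` are hypothetical. In Presto, tables are addressed as catalog.schema.table, and each catalog is backed by a connector.

```python
import sqlite3


def build_federated_connection():
    """Open one connection that can see two independent databases."""
    conn = sqlite3.connect(":memory:")
    # A second, independent in-memory database standing in for another
    # storage system (for example, a real-time store).
    conn.execute("ATTACH DATABASE ':memory:' AS realtime")
    conn.execute(
        "CREATE TABLE main.cities (city_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE realtime.orders "
        "(order_id INTEGER PRIMARY KEY, city_id INTEGER)")
    conn.executemany("INSERT INTO main.cities VALUES (?, ?)",
                     [(1, "Toronto"), (2, "Vancouver")])
    conn.executemany("INSERT INTO realtime.orders VALUES (?, ?)",
                     [(10, 1), (11, 1), (12, 2)])
    return conn


def orders_per_city(conn):
    """Join across both databases in a single query."""
    return conn.execute("""
        SELECT c.name, COUNT(*) AS order_count
        FROM realtime.orders AS o
        JOIN main.cities AS c ON c.city_id = o.city_id
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()
```

The caller writes one query and never moves data between systems by hand, which is the convenience Sudha describes: no hopping across query platforms and no custom pipeline to correlate the results.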
Atul Gupte, Product Platform Product Manager
I’m a product manager on the Product Platform team at Uber. I work across our Interactive Analytics, Data Science Workbench, and Data Knowledge Platform teams. I help drive product decisions that grant Uber’s myriad teams access to our foundational infrastructure, stable compute resources, and advanced tooling. This work helps our teams ensure that Uber’s services operate efficiently and seamlessly.
What do you like best about your work at Uber?
I’m a technologist at heart; I firmly believe in the power of technology to simplify challenging tasks and help people achieve their goals. At Uber, teams work with massive volumes of data in order to make the platform experience seamless for our riders, eaters, driver-partners, and restaurant-partners. By building data products under the umbrella of the Product Platform team, I create avenues that multiply the effectiveness of Uber’s teams. I find fulfillment in helping my colleagues reach their full potential.
Why did your team choose to leverage Presto for its stack?
As its name suggests, Presto is a quick way for users at Uber to make sense of our vast data by getting near-immediate responses to their questions. As a technology, it’s been easy to set up and operate. Since it operates on a shared-resource model, it doesn’t require the complicated overhead of managing compute resources.
Our internal users span a wide spectrum of professional roles, from operations managers and analysts to data scientists and machine learning researchers. Despite the range of technical skills and experience they bring to the table, all of our Uber users have easily learned Presto SQL. Today, our Presto installation at Uber reliably supports hundreds of thousands of queries submitted by users working in Uber offices across the world.
How has Presto helped our teams at Uber?
Presto is a huge asset in Uber’s Interactive Analytics portfolio. Its design and extensibility work very well for a company of our size and scale, integrating beautifully with the various storage systems we rely on.
Given Presto’s adaptability and ease of use, teams across Uber rely on it to probe and understand our data, which helps them make business-critical decisions. The tools Presto enables run the gamut from dashboards powered by real-time systems that monitor order volume in Uber Eats, to queries written by analysts to allocate marketing spend in key cities. Being a part of the Presto team gives me an incredible overview of the exciting challenges that are being solved across the company.
Zhongting Hu, Data Analytics Software Engineer
I am a software engineer on Uber’s Data Analytics team, working mainly on Presto security and production. I find it rewarding to build and debug reliable, scalable production systems. For instance, we once had an issue where some Presto functionality was broken. Over several hours, I worked to identify and reproduce the problem, investigated logs, and performed real-time debugging on our Presto implementation and neighboring systems, such as the HDFS NameNode and the Hive Metastore (HMS), until I finally fixed it. That was a fun day.
What open source projects do you use?
I have been working with Elasticsearch, which many Uber engineers use, as well as projects within the Apache Hadoop ecosystem, such as HDFS, Hive, and Spark.
What are Presto’s advantages compared to other tools?
First of all, Presto is fast because it processes all data in memory, unlike some tools that need to write intermediate data to disk. Second, Presto leverages SQL, which has become a standard in data tools, so most engineers are familiar with it. That makes it easy to onboard new users.
What are your future plans to contribute to Presto?
I am interested in optimizing queries and the execution engine, and adding more connectors for different databases. Also, I think I could contribute to the Presto ecosystem in areas around production verification, shadow testing frameworks, and monitoring.
Venki Korukanti, Data Analytics Software Engineer
I am a software engineer on Uber’s Data Analytics team, and I primarily work on Presto.
What open source projects do you contribute to and/or use?
I have contributed to a few open source projects from the Apache Software Foundation, including Drill, Hive, Calcite, and Arrow. I recently started contributing to Presto.
What do you like about working with Presto?
I like Presto’s production stability and scalability. The open source community around Presto is developer-friendly and focuses on the importance of writing quality code.
Presto provides a very good connector framework for building a single user-facing query engine on top of multiple data sources, which is exactly what Uber requires. Facebook tests every Presto release at web scale, which makes me confident in its stability and reliability.
What are your future plans to contribute to Presto?
As part of my work at Uber, I have developed a pushdown framework for aggregations, filters, and projections into connectors. I’ve already implemented Pinot and AresDB connectors using this framework. I’m hoping to make this framework open source soon. I’m also currently working on improving Presto’s Parquet reader performance.
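The pushdown framework Venki mentions had not been released when this interview was published, so the following is only a toy sketch of the general idea, not Uber’s framework or Presto’s connector API. Every name in it (`PushdownSpec`, `ToyConnector`, `rows_shipped`) is hypothetical. It compares two ways of answering a query shaped like SELECT city, COUNT(*) ... WHERE status = 'completed' GROUP BY city: scanning every row into the engine, versus handing the filter, projection, and aggregation to the connector so that only the small result crosses the boundary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PushdownSpec:
    """Hypothetical description of work a connector could do at the source."""
    filter_column: str
    filter_value: str
    group_by: str


class ToyConnector:
    """Stand-in for a data source; counts rows shipped to the engine."""

    def __init__(self, rows: List[Dict[str, str]]):
        self._rows = rows
        self.rows_shipped = 0

    def scan_all(self) -> List[Dict[str, str]]:
        # No pushdown: every row, with every column, goes to the engine.
        self.rows_shipped += len(self._rows)
        return [dict(row) for row in self._rows]

    def scan_with_pushdown(self, spec: PushdownSpec) -> Dict[str, int]:
        # Filter, projection, and aggregation all happen at the source;
        # only one result row per group is shipped.
        counts: Dict[str, int] = {}
        for row in self._rows:
            if row[spec.filter_column] == spec.filter_value:
                key = row[spec.group_by]
                counts[key] = counts.get(key, 0) + 1
        self.rows_shipped += len(counts)
        return counts


def count_in_engine(connector: ToyConnector, spec: PushdownSpec) -> Dict[str, int]:
    """Evaluate the query in the engine after a full scan."""
    counts: Dict[str, int] = {}
    for row in connector.scan_all():
        if row[spec.filter_column] == spec.filter_value:
            key = row[spec.group_by]
            counts[key] = counts.get(key, 0) + 1
    return counts


SAMPLE_ORDERS = [
    {"city": "Toronto", "status": "completed"},
    {"city": "Toronto", "status": "completed"},
    {"city": "Toronto", "status": "cancelled"},
    {"city": "Vancouver", "status": "completed"},
    {"city": "Vancouver", "status": "cancelled"},
]
```

With the sample data, both paths produce the same counts, but the full scan ships five rows to the engine while the pushdown path ships two. For sources such as real-time stores that can filter and aggregate natively, that difference is the motivation for pushing work into the connector.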
Interested in working with our Presto team, or with other engineering teams at Uber? Consider applying for a role!
Wayne Cunningham
Wayne Cunningham, senior editor for Uber Tech Brand, has enjoyed a long career in technology journalism. Wayne has always covered cutting edge topics, from the early days of the web to the threat of spyware to self-driving cars. In his spare time he writes fiction, having published two novels, and indulges in film photography.
Posted by Wayne Cunningham