Data / ML

Solving Big Data Challenges with Data Science at Uber

20 March 2019 / Global
Figure 1: In this data infrastructure, data flows from Apache Kafka to HDFS, and is eventually replicated across multiple isolated Vertica databases. Clients connect to the databases through a middle layer that helps distribute query load across available databases.
Figure 2: A fully replicated database system replicates all data elements across all the databases. In a partially replicated database system, different databases hold different, overlapping sets of data elements.
Figure 3: In our partial replication structure, the data manager holds the configuration that specifies which cluster each table belongs to. The data manager shares this information with the proxy manager, which also tracks statistics on query load, data location, and cluster health.
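The relationship described in Figure 3 can be illustrated with a minimal sketch. This is not Uber's implementation: the class names, the `route` method, the cluster and table names, and the least-loaded selection rule are all hypothetical, chosen only to show how a proxy layer might combine table placement from a data manager with health and load statistics to pick a cluster under partial replication.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Hypothetical per-cluster statistics tracked by the proxy layer."""
    healthy: bool = True
    active_queries: int = 0


class DataManager:
    """Holds the (assumed) configuration mapping each table to its clusters."""

    def __init__(self, placement):
        self._placement = {
            table: frozenset(clusters) for table, clusters in placement.items()
        }

    def clusters_for(self, table):
        # Unknown tables map to no clusters at all.
        return self._placement.get(table, frozenset())


class ProxyManager:
    """Picks a cluster for a query from placement, health, and load."""

    def __init__(self, data_manager, stats):
        self._data_manager = data_manager
        self._stats = stats

    def route(self, tables):
        tables = list(tables)
        if not tables:
            raise ValueError("query must reference at least one table")

        # Under partial replication, only clusters holding every
        # referenced table can answer the query.
        candidates = set(self._data_manager.clusters_for(tables[0]))
        for table in tables[1:]:
            candidates &= self._data_manager.clusters_for(table)

        healthy = [
            c for c in candidates
            if c in self._stats and self._stats[c].healthy
        ]
        if not healthy:
            raise LookupError(f"no healthy cluster holds all of {tables}")

        # Assumed policy: least-loaded cluster, ties broken by name.
        return min(sorted(healthy), key=lambda c: self._stats[c].active_queries)


# Illustrative placement and statistics (made-up names and numbers).
placement = {
    "trips": ["cluster_a", "cluster_b"],
    "payments": ["cluster_b", "cluster_c"],
    "riders": ["cluster_a", "cluster_b", "cluster_c"],
}
stats = {
    "cluster_a": ClusterStats(healthy=True, active_queries=3),
    "cluster_b": ClusterStats(healthy=True, active_queries=7),
    "cluster_c": ClusterStats(healthy=False, active_queries=0),
}
proxy = ProxyManager(DataManager(placement), stats)
```

In this sketch, a query that joins `trips` and `payments` can only be served by a cluster that holds both tables, which is the main routing constraint that partial replication introduces compared with full replication.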
Atul Gupte

Atul Gupte is a former product manager on Uber's Product Platform team. At Uber, he drove product decisions to ensure that data science teams could achieve their full potential by providing access to the foundational infrastructure and advanced software that power Uber’s global business.

Ritesh Agrawal

Ritesh Agrawal is a senior data scientist on Uber's Data Science team, leading the intelligent infrastructure and developer platform teams. His work focuses on finding innovative ways to use data science and AI to make Uber’s infrastructure more adaptive and scalable and to enhance developer productivity.

Posted by Atul Gupte, Ritesh Agrawal
