Data / ML

Monitoring Data Quality at Scale with Statistical Modeling

May 7, 2020 / Global
Figure 1. Uber’s Data Quality Monitoring System connects various service and platform components. The front end lets users onboard data tables for monitoring and receive quality scores, the back end performs the data processing and statistical modeling, and the data metric generators characterize data table patterns.
Figure 2. Uber’s data sets are usually highly seasonal. Projecting our high-dimensional time series with principal component analysis (PCA) bundles correlated time series together, which simplifies the anomaly detection problem.
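To make the bundling idea in Figure 2 concrete, here is a minimal sketch of extracting the first principal component from a few metric time series. It is an illustration under stated assumptions, not the system's actual implementation: the function name `first_principal_component`, the toy series `a`, `b`, and `c`, and the use of plain power iteration are all hypothetical choices made for this example.

```python
import math

def first_principal_component(series, iterations=200):
    """Return (loadings, scores) for the first principal component.

    `series` is a list of equally long metric time series, one list per metric.
    Hypothetical helper: pure-Python power iteration on the covariance matrix.
    """
    n_metrics, n_steps = len(series), len(series[0])
    # Center each metric so the covariance reflects co-movement, not level.
    means = [sum(s) / n_steps for s in series]
    centered = [[x - m for x in s] for s, m in zip(series, means)]
    # Sample covariance matrix between metrics.
    cov = [[sum(x * y for x, y in zip(ci, cj)) / (n_steps - 1)
            for cj in centered] for ci in centered]
    # Non-uniform start vector, to reduce the chance of starting orthogonal
    # to the dominant eigenvector.
    v = [float(i + 1) for i in range(n_metrics)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n_metrics))
             for i in range(n_metrics)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Project every time step onto the component to get one bundled series.
    scores = [sum(v[i] * centered[i][t] for i in range(n_metrics))
              for t in range(n_steps)]
    return v, scores

# Toy data (assumed): two weeks of a weekly-seasonal metric, a second metric
# that moves with it, and an unrelated alternating metric.
a = [10, 12, 14, 12, 10, 8, 6] * 2
b = [2 * x + 5 for x in a]
c = [1 if t % 2 == 0 else -1 for t in range(14)]

loadings, scores = first_principal_component([a, b, c])
```

In this toy case the two seasonal metrics load on the same component while the unrelated metric contributes almost nothing, so a single projected series stands in for both correlated metrics.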
Figure 3. Examining five of the top principal components (PCs) from one of our biggest data tables over a select period, we can see abnormal state changes in the data pattern.
Figure 4. Metric-level anomalies (right) are too noisy for everyday use and induce alert fatigue. If, however, we set an appropriate threshold for table-level anomalies (red line, left), we can minimize the number of alerts generated and focus on only the most destructive issues.
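The contrast in Figure 4 can be sketched in a few lines. This is only an illustration: how the table-level score is actually computed is not described here, so the example assumes it is the mean absolute anomaly score across a table's metrics. The function `table_level_alerts`, the metric names, the scores, and the threshold value are all hypothetical.

```python
from statistics import mean

def table_level_alerts(metric_scores, threshold):
    """Return (day_index, table_score) for days whose table score exceeds threshold.

    `metric_scores` maps a metric name to its per-day anomaly scores.
    Assumption: the table-level score is the mean absolute metric score.
    """
    n_days = len(next(iter(metric_scores.values())))
    alerts = []
    for day in range(n_days):
        table_score = mean(abs(s[day]) for s in metric_scores.values())
        if table_score > threshold:
            alerts.append((day, table_score))
    return alerts

# Toy per-day anomaly scores for three metrics of one table (assumed values).
metric_scores = {
    "row_count": [0.2, 3.1, 0.1, 4.0, 0.3],
    "null_rate": [0.1, 0.2, 2.9, 3.8, 0.2],
    "distinct_count": [3.2, 0.1, 0.2, 4.2, 0.1],
}
threshold = 2.5

# Alerting on each metric separately fires on four of the five days.
metric_level_flags = sum(
    1 for s in metric_scores.values() for x in s if abs(x) > threshold
)
# Alerting on the table-level score fires only when the metrics degrade together.
alerts = table_level_alerts(metric_scores, threshold)
alert_days = [day for day, _ in alerts]
```

With these toy numbers, per-metric alerting raises six flags, while the table-level threshold raises a single alert on the one day when all three metrics are abnormal at once.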
Figure 5. One of our next steps towards making alerts more intelligent is leveraging data table lineage information. We have observed a strong correlation between data table quality and lineage: clustering table-level quality scores over time can reconstruct table ancestry. This is validated in practice, as related tables share common root causes when their data quality degrades.
Ye Henry Li

Ye Henry Li works as a data scientist on Uber's Platform Data Science team.

Ritesh Agrawal

Ritesh Agrawal is a senior data scientist on Uber's Data Science team, leading the intelligent infrastructure and developer platform teams. His work is focused on finding innovative ways to use data science and AI to make Uber’s infrastructure more adaptive and scalable and enhance developer productivity.

Santhosh Shanmugam

Santhosh Shanmugam works as a senior data scientist on Uber's Marketplace team.

Andrea Pasqua

Andrea Pasqua is a data science manager overseeing our Intelligent Decisions Systems team at Uber.

Posted by Ye Henry Li, Ritesh Agrawal, Santhosh Shanmugam, Andrea Pasqua
