Real-Time Analytics for Mobile App Crashes using Apache Pinot
2 November 2023 / Global
In today’s fast-paced world of software development, changes to both code and infrastructure are released at a breakneck pace. At Uber, we roll out ~11,000 changes every week, so it is important to be able to quickly identify and resolve issues caused by these changes. A delay in detection can affect user experience, our ability to facilitate transactions on the platform, company revenue, and the overall trust and confidence of our users. To reduce our Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR) issues, and to avoid potential outages and large-scale user impact, we built a system called “Healthline.” Because it detects issues in real time, Healthline has become the go-to tool for release managers to observe the impact of a canary release and decide whether to proceed further or to roll back.
In this article we will be sharing details on how we are leveraging Apache Pinot™ to achieve this in real time at Uber scale.
About Healthline
Healthline is a crash/exception logging, monitoring, analysis, and alerting tool built for all Uber mobile applications across multiple platforms, as well as for more than 5,000 microservices written in various programming languages and owned by a number of tech teams within Uber.
At a high level, it processes all the crash, error, and exception logs generated by internal systems and builds analytical insights on them by classifying similar exceptions/crashes into different buckets called issues. It also identifies the potential changes causing these issues and notifies the system owners whenever an anomaly is observed for a specific issue.

For this article, we will limit our scope to Mobile App Crash Analytics; however, the approach is quite similar for backend logs. We will also not cover the data collection pipelines and preprocessing steps such as deobfuscation of crash dumps, classification logic, and code change and ownership identification.
Terminology
Crash – We use this to represent an instance of any fatal or non-fatal (e.g., memory leak) report we receive from our SDK embedded in each Uber mobile app.
Issue – A cluster of crashes which are identified as having the same root cause.
App – An Uber-developed mobile application. We consider App + Platform (iOS/Android) as one unique app; hence, Rider iOS and Rider Android are counted as two apps.
Data Volume and Retention
At peak, we classify more than 1.5 million error and warning logs from backend services, and more than 1,500 mobile app crashes per second across Uber apps for both iOS and Android. The size of a crash report varies from 10 KB to 200 KB, with an average per-day data size of 36 TB; the crash QPS is ~1,500. We retain the data for 45 days to understand historical trends, and most of our use cases access the last 30 days of data.
Query Patterns
For dashboards and alerting, we perform the following types of queries, each with an expected response time (illustrative query sketches are included after each group of patterns):
Filtering Patterns on Data
- Exact match filters (e.g., city = ‘Bangalore’)
- Numerical range filters (e.g., report_time>=123 and report_time<=456)
- Partial text matching using regex
- Search use case: Healthline should allow users to search crash reports using fields like crash dump module name, exception name, exception class, issue ID, etc.
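As a rough illustration, these filter patterns map naturally onto Pinot SQL. The sketch below submits a few such queries to a Pinot broker’s standard SQL endpoint; the table name (crash_events), column names (city, report_time, exception_class, issue_id, crash_id), and broker address are hypothetical placeholders, not the actual Healthline schema.

```python
"""Sketch: filter-style queries against a Pinot broker (hypothetical schema)."""
import requests

# Assumption: a Pinot broker reachable at this address; /query/sql is Pinot's SQL endpoint.
PINOT_BROKER_SQL = "http://pinot-broker.example.com:8099/query/sql"

FILTER_QUERIES = [
    # Exact match filter
    "SELECT issue_id, crash_id FROM crash_events WHERE city = 'Bangalore' LIMIT 10",
    # Numerical range filter on the report time
    "SELECT issue_id, crash_id FROM crash_events "
    "WHERE report_time >= 123 AND report_time <= 456 LIMIT 10",
    # Partial text match using a regex (REGEXP_LIKE is a built-in Pinot filter function)
    "SELECT issue_id, crash_id FROM crash_events "
    "WHERE REGEXP_LIKE(exception_class, '.*NullPointerException.*') LIMIT 10",
]

def run_query(sql: str) -> dict:
    """POST one SQL statement to the broker and return the parsed JSON response."""
    resp = requests.post(PINOT_BROKER_SQL, json={"sql": sql}, timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for sql in FILTER_QUERIES:
        result = run_query(sql)
        # The SQL endpoint returns result rows under "resultTable" -> "rows".
        print(result.get("resultTable", {}).get("rows", []))
```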
Aggregation Patterns on Data
We have a large number of attributes on which aggregation can be performed, and they may occur together in the same query.
Following are the aggregation patterns identified (with simplified examples):
- Aggregate by issue ID
- Apply count distinct for each: crash ID, user, device
- Apply min/max on report_time
- Aggregate by an attribute to get the record distribution over attribute values
- Some fields are present as arrays; for such fields, create a record distribution over unique array elements
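A rough sketch of these aggregation patterns as Pinot SQL, using the same hypothetical crash_events table and column names as before (plus hypothetical user_id, device_id, app_version, and a multi-value experiment_names column); these strings could be submitted with the same broker call shown earlier.

```python
"""Sketch: aggregation-style queries (hypothetical schema, same broker helper as above)."""

AGGREGATION_QUERIES = [
    # Aggregate by issue ID: distinct crashes/users/devices plus min/max report time per issue.
    "SELECT issue_id, "
    "DISTINCTCOUNT(crash_id) AS crashes, "
    "DISTINCTCOUNT(user_id) AS users, "
    "DISTINCTCOUNT(device_id) AS devices, "
    "MIN(report_time) AS first_seen, "
    "MAX(report_time) AS last_seen "
    "FROM crash_events "
    "WHERE report_time >= 123 AND report_time <= 456 "
    "GROUP BY issue_id "
    "ORDER BY crashes DESC LIMIT 100",
    # Record distribution over the values of a single attribute.
    "SELECT app_version, COUNT(*) AS records "
    "FROM crash_events GROUP BY app_version ORDER BY records DESC LIMIT 50",
    # For an array (multi-value) column, grouping by the column yields the distribution
    # over the individual array elements.
    "SELECT experiment_names, COUNT(*) AS records "
    "FROM crash_events GROUP BY experiment_names ORDER BY records DESC LIMIT 50",
]
```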
Histogram Query Patterns on Data
We also display a variety of graphs/histograms on the UI. They can be generalized as “for a given time window, divide it into equal length bins, and for each bin, perform a particular aggregation.”
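One way to express such a histogram in Pinot SQL is to bucket the time column with the built-in DATETIMECONVERT function and aggregate per bucket. Again, the table and column names are hypothetical, and the 5-minute bin size is only an example.

```python
"""Sketch: a time-bucketed (histogram) query (hypothetical schema)."""

# Count crashes of one issue in 5-minute bins over a given window; DATETIMECONVERT
# truncates the epoch-millis report_time to the requested granularity.
HISTOGRAM_QUERY = (
    "SELECT DATETIMECONVERT(report_time, '1:MILLISECONDS:EPOCH', "
    "'1:MILLISECONDS:EPOCH', '5:MINUTES') AS time_bucket, "
    "COUNT(*) AS crashes "
    "FROM crash_events "
    "WHERE issue_id = 'example-issue-id' "
    "AND report_time >= 123 AND report_time <= 456 "
    "GROUP BY DATETIMECONVERT(report_time, '1:MILLISECONDS:EPOCH', "
    "'1:MILLISECONDS:EPOCH', '5:MINUTES') "
    "ORDER BY time_bucket"
)
```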
Why Apache Pinot
It is clear from the above query patterns that we have a very strong use case for aggregations and matches (including range matches). Some partial-match use cases are present, but they are very low in QPS. More importantly, most of our queries specify a time window. This makes Pinot, with its rich support for various types of indices, a tailor-made fit for our use case. We also have a strong platform team providing a managed Pinot offering, and Pinot has a very strong open source community, with some of the committers coming from Uber.
Intro to Pinot
Apache Pinot is a real-time, distributed, columnar OLAP datastore, which is used to deliver scalable real time analytics with low latency. It can ingest data from batch data sources (such as Apache Hadoop® Distributed File System, S3, Azure® Data Lake, Google Cloud Storage) as well as streaming sources (such as Apache Kafka®). Pinot is designed to scale horizontally, so that it can scale to larger data sets and higher query rates as needed.
A table is a logical abstraction that represents a collection of related data. It is composed of columns and rows (known as documents in Pinot). The columns, data types, and other metadata related to the table are defined using a schema.
Pinot breaks a table into multiple segments and stores these segments in a deep store, such as the Hadoop Distributed File System (HDFS), as well as on Pinot servers. Pinot supports the following types of tables:
| Type | Description |
| --- | --- |
| Offline | Offline tables ingest pre-built Pinot segments from external data stores and are generally used for batch ingestion. |
| Real-time | Real-time tables ingest data from streams (such as Kafka) and build segments from the consumed data. |
| Hybrid | Hybrid Pinot tables have both real-time as well as offline tables under the hood. By default, all tables in Pinot are hybrid. |
Excerpt taken from Apache Pinot Official Documentation. For more information, please visit the Apache Pinot Docs.
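To make the table/schema/segment concepts above concrete, here is a rough sketch of what a schema and a real-time table config for crash data might look like, expressed as the Python dictionaries one would POST to the Pinot controller’s /schemas and /tables endpoints. All names, field choices, and Kafka settings are illustrative assumptions rather than Healthline’s actual configuration; only the 45-day retention mirrors the figure mentioned earlier.

```python
"""Sketch: a hypothetical Pinot schema and REALTIME table config for crash events."""

# Schema: column names, data types, and the time column (POSTed to the controller's /schemas API).
CRASH_EVENTS_SCHEMA = {
    "schemaName": "crash_events",
    "dimensionFieldSpecs": [
        {"name": "crash_id", "dataType": "STRING"},
        {"name": "issue_id", "dataType": "STRING"},
        {"name": "user_id", "dataType": "STRING"},
        {"name": "device_id", "dataType": "STRING"},
        {"name": "app_version", "dataType": "STRING"},
        {"name": "city", "dataType": "STRING"},
        {"name": "exception_class", "dataType": "STRING"},
        # Multi-value (array) column: one crash can belong to several experiments.
        {"name": "experiment_names", "dataType": "STRING", "singleValueField": False},
    ],
    "dateTimeFieldSpecs": [
        {
            "name": "report_time",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }
    ],
}

# Real-time table config: consumes the enriched crash topic from Kafka and retains 45 days of
# data, matching the retention described above (POSTed to the controller's /tables API).
CRASH_EVENTS_TABLE = {
    "tableName": "crash_events",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "timeColumnName": "report_time",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "45",
        "replicasPerPartition": "3",
    },
    "tenants": {"broker": "DefaultTenant", "server": "DefaultTenant"},
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "enriched-crash-events",    # hypothetical topic name
            "stream.kafka.broker.list": "kafka.example.com:9092",  # hypothetical brokers
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
        },
    },
}
```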
Architecture
Our ingestion pipeline processes raw crash data, enriches it with classification and other details, and publishes the enriched crash data to a Kafka topic. This is the source we want to store and on which we want to perform the required aggregation and filtering operations to serve user dashboards and other integrations.
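For context, the enriched crash event on this Kafka topic can be pictured as a flat JSON record whose fields line up with the hypothetical schema sketched earlier; the example below is an illustrative guess at its shape, not the actual Healthline message format.

```python
"""Sketch: an illustrative enriched crash event as published to the Kafka topic."""
import json

# Hypothetical field names, consistent with the schema sketch above.
enriched_crash_event = {
    "crash_id": "example-crash-id",          # unique ID of this crash report
    "issue_id": "example-issue-id",          # cluster assigned by the classification step
    "user_id": "example-user-id",
    "device_id": "example-device-id",
    "app_version": "rider-ios-4.512.1",
    "city": "Bangalore",
    "exception_class": "NSInvalidArgumentException",
    "experiment_names": ["exp_a", "exp_b"],  # array field -> multi-value column in Pinot
    "report_time": 1698899200000,            # epoch millis, the Pinot time column
}

# The record would be serialized (here as JSON) before being produced to the topic.
print(json.dumps(enriched_crash_event))
```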

- Write Path
- Apache Flink