
Identifying Outages with Argos, Uber Engineering’s Real-Time Monitoring and Root-Cause Exploration Tool

November 24, 2015 / Global

Imagine being on-call and woken up at 4 am. Half asleep, you grab your phone and quickly silence the annoying ringtone. This alert had better be an emergency, you say to yourself. If not, alarm fatigue will set in and the real ones may not draw attention to themselves, a case of the indicator that cried outage. The development of Argos is the story of how Uber Engineering sought to provide highly accurate, real-time alerts on millions of system and business metrics in Uber’s fast-paced environment.

Our front view dashboard of Argos at launch, circa November 2014, analyzing an outage due to a configuration change. Positive alerts are for metric counts that are higher than expected, while negative alerts are for metric counts that are lower than expected (compared to normal business conditions).

Background

Uber monitors its system health² via tens of millions of metrics using user-defined static thresholds¹ (Nagios) in combination with a paging system that alerts on-call personnel. Many other tech companies do the same for their mobile apps and backends. However, we also have to use dynamic thresholds: given Uber's ~20% month-over-month growth, the thresholds for millions of metrics need to be updated on a regular basis. Our demand-related metrics also follow strong daily and weekly periodicities (get ready for Saturday night peaks) for which static thresholds are simply inadequate. Static thresholds tend to lead to two scenarios:

1) False negatives: Thresholds that are as loose as a goose, set wide to accommodate underlying growth, daily variations, and short-term fluctuations, result in issues going undetected for extended periods of time.

2) False positives: Tightly set thresholds lead to the alert that cried outage. Hundreds of daily (and nightly!) false alarms create alert fatigue for on-call engineers, so a real outage may go unnoticed when it arrives.

Furthermore, at Uber, any service can in principle call any other service (hi!). There are many engineering advantages to this, but keeping track of how the services work together is impossible to do in your head. Mapping their interactions for human interpretation doesn't help either, as the resulting dependency graph resembles a bowl of spaghetti and meatballs.

Understanding how various services use and rely on each other, in conjunction with their corresponding metrics (the prerequisite for root-cause analysis), requires extensive knowledge that must grow as the system being monitored does.

With all of this going on, in September 2014 we decided to build Argos, an omniscient internal monitoring system to track the growing number of interdependencies in our engineering ecosystem. The time of static-only thresholds and manually monitoring metrics across many dashboards was over. After all, even if we worked like dogs, the best engineering team is only human.

Building an Anomaly Detection Algorithm

There are several requirements for a streaming anomaly detection algorithm:

Make fast decisions. We need to decide whether a data point is indicative of an outage before the next data point arrives. This requirement greatly restricts the algorithmic complexity of the online portion of the outlier detection algorithm. To achieve low evaluation time while maintaining high actionability, we divided our system into two parts, an outlier detector and an outage detector:

We divided our system into two parts: an outlier detector and an outage detector

 

The outlier detector checks in-streaming data against pre-calculated thresholds, updated hourly, to classify each point as an inlier or an outlier. This makes the online part of the outlier detection system extremely fast and scalable. After this stage, the vast majority of metrics are labeled as inliers and their evaluation is complete. The small remainder classified as outliers are subsequently analyzed by the outage detector.

The outage detector is computationally more complex and thus cannot be applied to all metrics at high frequency. Combining the outlier and outage detectors, however, gives us both high frequency and high actionability.
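
A minimal sketch of this two-stage flow is below; the function names, the threshold interface, and the placeholder second stage are invented for illustration, not Argos's actual API.

```python
# Sketch of the two-stage flow: a cheap per-point threshold check, with only
# the rare outliers handed to a heavier outage detector.
from dataclasses import dataclass

@dataclass
class Thresholds:
    lower: float   # pre-calculated offline, refreshed hourly per metric
    upper: float

def is_outlier(value: float, t: Thresholds) -> bool:
    """Cheap online check that runs for every in-streaming data point."""
    return value < t.lower or value > t.upper

def outage_check(metric: str, value: float) -> bool:
    """Placeholder for the computationally heavier outage detector,
    invoked only for the small fraction of points flagged as outliers."""
    raise NotImplementedError

def handle_point(metric: str, value: float, thresholds: dict) -> bool:
    t = thresholds[metric]
    if not is_outlier(value, t):
        return False                      # the vast majority of points stop here
    return outage_check(metric, value)    # rare outliers go to stage two
```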

Be actionable. To minimize alert fatigue, we need to issue alerts only when there is a genuine degradation in system performance. At the same time, we can't be so sensitive that we issue alerts for promotions and extreme weather, which have large effects on Uber's metrics.

Think of this as the Baby Bear Porridge philosophy: our outage alert triggering sensitivity has to be tuned just right. Getting there is challenging (and first involves triggering too much and too little) because we don't have a complete real-time picture of external events affecting our metrics, such as a sudden localized thunderstorm:

Thunderbolts and lightning, very, very frightening: weather events, concerts, sporting events, and promotions greatly affect Uber's metrics (blue). Purple markers indicate the onset of local thunderstorms on a June 2015 evening in Chicago. For reference, we overlaid the previous week's metrics for the same time period in orange.

 

An actionable outage detection system must not alert on these high-demand events, even though they are outliers. To overcome this lack of information, we developed a multivariate, non-linear approach. It infers in real time whether an event is indicative of an outage or something else (e.g., high demand) from a combination of multiple data streams, and adjusts our algorithm accordingly. Our current algorithm achieves 90% actionability: nine out of ten pages indicate a true outage, many multiples above other contemporary real-time outlier detection systems.
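
As a loose illustration of the idea (not Uber's actual multivariate model), a detector might compare a metric's deviation against the concerted behavior of related demand streams before deciding whether to page:

```python
# Illustration only: suppress the alert when related demand streams surge
# together, since that pattern looks like weather, a concert, or a promotion
# rather than a system failure. Cutoff values are invented.
def looks_like_demand_event(peer_deviations: dict, surge_cutoff: float = 0.5) -> bool:
    """peer_deviations: fractional deviation from forecast per related metric,
    e.g. {"requests": 0.9, "app_opens": 1.1, "completed_trips": 0.8}."""
    return bool(peer_deviations) and all(
        d > surge_cutoff for d in peer_deviations.values()
    )

def outage_indicative(metric_deviation: float, peer_deviations: dict) -> bool:
    # A concerted surge across the board is treated as high demand, not an
    # outage; a metric degrading on its own remains outage-indicative.
    if looks_like_demand_event(peer_deviations):
        return False
    return metric_deviation < -0.3   # illustrative degradation cutoff
```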

Be selective. Data streams are by definition infinite in length. Given memory constraints, the minimum number of historic data points needs to be chosen such that high accuracy is maintained. We use about 730 carefully selected data points to forecast the next hour. Despite this aggressive data selection, we achieve comparable forecasting accuracy to methods with 10–100x more input data.

Never fly blind. Our algorithm has to be computationally robust, always giving us an answer. For example, subspace search algorithms that may not be convergent are undesirable. SARIMA (Seasonal autoregressive integrated moving average) requires a predefined maximum subspace size, which is often too small for our purposes.

Let the past be the past. Past outages and outliers must not affect the outlier score for in-streaming data points. For example, during a thunderstorm in Chicago in June 2015 (see the image above), ride demand increased 300%. The next week, we didn't want our outlier algorithm to have adapted to it. We utilize a median-based approach for statistical robustness.

Be dynamic. Our outlier detection algorithm has to account for daily and weekly seasonality. There’s a weekend every week, and morning rush hours and lunch time lulls contribute to the pulse of our cities around the world. We need to account for recurring phenomena while ignoring random occurrences.
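
To make the last two points concrete, here is a minimal sketch of robust, seasonality-aware thresholds, assuming a simple (weekday, hour) bucketing and median/MAD bands; the bucketing and the k multiplier are illustrative assumptions, not the production algorithm.

```python
# Robust, seasonality-aware thresholds: group history by (weekday, hour) and
# band around the median, so a one-off 300% surge last week does not inflate
# this week's thresholds.
from collections import defaultdict
from statistics import median

def seasonal_thresholds(history, k=6.0):
    """history: iterable of (timestamp, value) pairs, timestamp a datetime."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)

    thresholds = {}
    for bucket, values in buckets.items():
        med = median(values)
        mad = median(abs(v - med) for v in values)   # robust spread estimate
        thresholds[bucket] = (med - k * mad, med + k * mad)
    return thresholds
```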

Learn with glimpses of the truth. Outliers are rare, a bit like unicorn startups, and like the entrepreneurial venture capitalist we desperately want to identify them while they're still small. Few if any labeled data are available, so the outlier detection algorithm should not require labeled data to track every time series. Due to our fast growth and many new markets, sometimes we're aware of future high-demand events but don't have sufficient historic data to quantify their impact: 2015 was only our second Lunar New Year in China.

Don’t miss anything. By far the most important criterion for a system that monitors system health is never to miss an outage, even at the expense of spurious alerts. It’s FoMO to an extreme.

Now that we’ve outlined its specifications, let’s look at how we’ve implemented Argos at Uber in the context of its operating environment.

Outlier Detection Algorithm

First, let’s examine some upper and lower dynamic hourly updated thresholds (red, yellow) generated from our outlier detection algorithm, which we compare to the in-streaming data. We’ve added the actual minute-by-minute data for reference in a graph (blue), which are of course unknown ahead of time:

Upper and lower dynamic hourly updated thresholds (red, yellow) generated from our outlier detection algorithm

The thresholds follow underlying growth, demand patterns, and intrinsic fluctuations well; we can keep them tight and detect outages without causing a large number of spurious alerts. The sole user-defined inputs needed to generate these thresholds are the list of metrics to be tracked and the channel by which the owner should be notified (pager, email, automatic Phabricator ticket generation, etc.).
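
A hypothetical registration of tracked metrics might look like the following; the metric names, field names, and channel identifiers are invented for this sketch and are not Argos's real interface.

```python
# Hypothetical configuration: the only user-supplied inputs are the metrics
# to track and how to notify their owners.
TRACKED_METRICS = [
    {"metric": "trip_requests.city_42",  "notify": "pager"},
    {"metric": "payment_errors.global",  "notify": "email"},
    {"metric": "driver_app_crash_rate",  "notify": "phabricator_ticket"},
]
```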

Note that the maximum of the lower threshold exceeds the minimum of the upper threshold a few hours later, something impossible to accomplish with static thresholds:

Actual time series (blue) with predicted lower thresholds (yellow) and upper thresholds (red). The thresholds closely follow the pattern of the actual metric. Time advances to the right.

 

To maintain high computational speed, the outlier detector considers each time series individually. Next, we wanted to bring this together in one big picture.

Outage Detection, Root-Cause Exploration, and System Health

Once the aberrant outliers are identified, our system has to decide what to do with them, and this is where the dependencies of the flagged metrics' services play a large role. To understand Uber's service dependencies, we cluster time series based on their similarity. The result can be visualized as a graph:

Here, each metric is represented as a node. An edge between nodes indicates that the metrics have high similarity. (The shorter the edge, the more similar the time series.)

 

We color each node in the graph by the probability of the corresponding time series currently being in an anomalous state. This allows for rapid visual identification of whether a particular (sub)system is affected in a concerted fashion.
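
Below is a minimal sketch of how such a graph could be assembled, assuming correlation as the similarity measure and networkx for the graph structure; both are assumptions for illustration, as the post does not specify either.

```python
# Nodes are metrics, edges connect highly similar time series, and each node
# carries its current anomaly probability for coloring.
import networkx as nx
import numpy as np

def build_similarity_graph(series: dict, anomaly_prob: dict, min_similarity: float = 0.8):
    """series: metric name -> np.ndarray of aligned historical values.
    anomaly_prob: metric name -> current probability of being anomalous."""
    g = nx.Graph()
    names = list(series)
    for name in names:
        g.add_node(name, anomaly=anomaly_prob.get(name, 0.0))
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = abs(np.corrcoef(series[a], series[b])[0, 1])
            if sim >= min_similarity:
                g.add_edge(a, b, similarity=sim)   # more similar = "shorter" edge
    return g
```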

As the number of metrics in the system grows, summary statistics that identify anomalous hot spots in such networks and their origins become essential. We have developed a system health index (SHI) which quantifies and prioritizes this for us. Our SHI consists of three main components. The first term considers the additive effect of how anomalous each individual time series is. The second term takes into consideration the pairwise interaction of each metric: is a closely related metric also showing trouble? Finally, we pay extra attention to nodes in the graph that are central to connecting other clusters. Outage-indicative metrics at these nodes would be particularly troublesome, as they could mean imminent cascading failures of services.
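
As an illustration of how the three terms could be combined (the actual weights and functional form used by Argos are not given in the post), consider the following sketch, which reuses the anomaly-annotated graph from above and betweenness centrality as one possible measure of bridging nodes:

```python
# Illustrative composition of the three SHI components described above.
import networkx as nx

def system_health_index(g: nx.Graph, w1: float = 1.0, w2: float = 1.0, w3: float = 1.0) -> float:
    anomaly = nx.get_node_attributes(g, "anomaly")

    # 1) additive effect of each metric's individual anomaly score
    individual = sum(anomaly.values())

    # 2) pairwise interaction: closely related (connected) metrics that are
    #    both in trouble reinforce each other
    pairwise = sum(anomaly[a] * anomaly[b] for a, b in g.edges())

    # 3) extra weight for nodes that bridge clusters, where trouble could
    #    mean imminent cascading failures
    centrality = nx.betweenness_centrality(g)
    bridging = sum(centrality[n] * anomaly[n] for n in g.nodes())

    return w1 * individual + w2 * pairwise + w3 * bridging
```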

Knowing the (sub)system health of our services allows us to automatically gauge which person in the on-call escalation chain to alert. For example, an issue with an individual metric should be brought to the attention of an on-call engineer, but does not warrant notifying an engineering director. However, if Argos detects that an entire subsystem is affected (e.g., mobile), the issue can be escalated quickly.
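
A hypothetical escalation rule along these lines might look as follows; the cutoffs and role names are invented for illustration.

```python
# Route the alert based on how much of a subsystem is currently anomalous.
def escalation_target(affected_fraction: float) -> str:
    """affected_fraction: share of a subsystem's metrics currently anomalous."""
    if affected_fraction > 0.5:
        return "engineering_director"   # an entire subsystem is affected
    if affected_fraction > 0.1:
        return "team_lead"
    return "on_call_engineer"           # isolated metric issues stay with on-call
```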

Summary

Before grabbing your laptop in the early morning to investigate the possible outages from your pager alert, you want to know:

  • What part of the system is affected?
  • How quickly is the metric in question deteriorating?
  • Given your engineering experience, is this type of alert a legitimate cause for concern in the first place?

Argos addresses the above with anomaly detection algorithms that are fully automated, embarrassingly parallel, linearly scaling, and statistically and computationally robust. Answers appear in the alert and are easily accessible in dashboards once you wake your sleeping computer.

Like a faithful companion and furry friend, Argos is always on guard. Argos occasionally acts up, which causes us to wake up. But it learns quickly; at one year old, it's accurate enough to automatically trigger pager alerts. Thanks to Argos and its continued development by Uber Engineering, we are dogged by unidentified outages of all kinds far less frequently.

Footnotes

¹ Static thresholds entail upper and lower thresholds, typically manually set by service owners and held constant across time.

² System health means no degradation in response time, i.e. uptime both for backend systems and product features.

Follow @UberEng on Twitter for the latest from Uber Engineering.

Franziska Bell

Fran Bell is a Data Science Director at Uber, leading platform data science teams including Applied Machine Learning, Forecasting, and Natural Language Understanding.