Much of Uber’s business involves connecting people with people, making the reliability of our customer platform crucial to our success. The customer platform supports everything from ridesharing and Uber Eats to Uber Freight and Uber for Business. Our platform team owns four services running on thousands of hosts, serving peak traffic of up to 300,000 requests per second from more than 450 internal services.
Any system of this complexity is likely to experience outages, especially one that has grown so quickly. Analyzing outages that occurred over a six-month period, we found that 28 percent could have been mitigated or avoided through graceful degradation.
The three most frequent types of failures we observed were due to:
- Inbound request pattern changes, including overload and bad actors
- Resource exhaustion, such as CPU, memory, io_loop, or network resources
- Dependency failures, including infrastructure, data store, and downstream services
To proactively manage our traffic loads based on the criticality of requests, we built QoS Aware Load Management (QALM), a dynamic load shedding framework for incoming requests based on criticality. When the service degrades due to traffic overload, resource exhaustion, or dependency failure, QALM prioritizes server resources for more critical requests and sheds less critical ones. Our goal with QALM is to reduce the frequency and severity of any outages or incidents, leading to more reliable user experiences across our business.
QALM architecture
We originally wrote QALM as a Go library and integrated it into the application layer of one of our highest-volume services. The framework features intelligent overload detection, criticality registration, and endpoint isolation to fulfill quality of service (QoS) requirements. Intelligent overload detection lets QALM reserve resources for critical requests during service degradation, while endpoint-level isolation protects specific endpoints from traffic spikes.
Intelligent overload detection
A static inbound rate limiting framework, which throttles requests over a set threshold, has too many constraints for our platform. While the static threshold needs to be updated whenever our service capacity changes, the service level agreement (SLA) and capacity difference between endpoints makes configuration complicated.
A dynamic overload detector offers more flexibility and improves hardware efficiency, especially in a complex service such as ours. With QALM, we implemented an overload detector inspired by the CoDel algorithm. A lightweight request buffer (implemented with goroutines and channels) is added for each enabled endpoint to monitor the latency between when a request is received from the caller and when processing begins in the handler. Each queue tracks the minimum latency within a sliding time window, triggering an overload condition if that minimum exceeds a configured threshold.
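To make the mechanism concrete, here is a minimal sketch of a CoDel-style detector in Go. It is not QALM’s actual implementation: the type and field names are our own, and the real library wraps this core with the buffered-channel request queue, metrics, and configuration.

```go
package qalmsketch

import (
	"sync"
	"time"
)

// overloadDetector tracks how long requests wait in an endpoint's queue.
// Following the CoDel idea, it looks at the minimum wait observed in the
// current window: if even the fastest request waited longer than the target,
// the endpoint is genuinely overloaded rather than seeing a brief spike.
type overloadDetector struct {
	mu        sync.Mutex
	target    time.Duration // acceptable queueing latency, e.g. 5ms
	window    time.Duration // sliding window size, e.g. 100ms
	windowEnd time.Time
	minWait   time.Duration // minimum wait seen in the current window; -1 means none yet
	overload  bool
}

func newOverloadDetector(target, window time.Duration) *overloadDetector {
	return &overloadDetector{target: target, window: window, minWait: -1}
}

// observe records the queueing delay of one request (time between arrival and
// the handler starting work) and refreshes the overload flag when the current
// window closes.
func (d *overloadDetector) observe(wait time.Duration, now time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.minWait < 0 || wait < d.minWait {
		d.minWait = wait
	}
	if now.After(d.windowEnd) {
		// Window closed: overloaded iff even the minimum wait exceeded the target.
		d.overload = d.minWait > d.target
		d.minWait = -1
		d.windowEnd = now.Add(d.window)
	}
}

// overloaded reports whether the last completed window indicated overload.
func (d *overloadDetector) overloaded() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.overload
}
```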
Endpoint isolation
To enable endpoint isolation, QALM creates separate queues based on its configuration. Each queue has its own overload detector, so when one endpoint degrades, QALM sheds load only from that endpoint, without impacting the others.
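Continuing the sketch above, per-endpoint isolation roughly amounts to keeping one detector per procedure and consulting only that endpoint’s detector before admitting a request; again, the names and structure are illustrative rather than QALM’s real code.

```go
// endpointIsolation gives each enabled procedure its own detector (and, in
// the real library, its own request queue), so overload on one endpoint only
// sheds that endpoint's traffic.
type endpointIsolation struct {
	detectors map[string]*overloadDetector // keyed by procedure name
}

// shouldShed is consulted before a request is handed to its handler.
func (e *endpointIsolation) shouldShed(procedure string, critical bool) bool {
	det, ok := e.detectors[procedure]
	if !ok {
		return false // endpoint not enrolled in QALM: always admit
	}
	// Shed only non-critical requests, and only while this endpoint's own
	// detector reports overload; other endpoints are unaffected.
	return det.overloaded() && !critical
}
```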
Request Criticality
One of the most important features introduced by QALM is request criticality, which ensures QoS during degradation. QALM defines four levels of criticality based on Uber’s business requirements:
- Core infrastructure requests: never dropped.
- Core trip flow requests: never dropped.
- User-facing feature requests: may be dropped when the system is overloaded.
- Internal traffic, such as testing requests: most likely to be dropped when the system is overloaded.
As depicted in Figure 2, when EP1 is experiencing degradation, only non-critical requests from Consumer1 get dropped, while the critical requests from Consumer2 still succeed.
QALM supports both local configuration files and a simple online graphical user interface (GUI) integrated with Uber’s service management system. The GUI service uses Jaeger tracing data to pre-fill the callers for each endpoint along with a default criticality. With periodic configuration syncing, updates from the GUI take effect within a couple of minutes.
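As a rough illustration of how criticality and configuration fit together, the sketch below encodes the four levels and a hypothetical configuration entry mapping a caller of an endpoint to a criticality level. The names and structure are assumptions for this post, not QALM’s actual schema.

```go
// criticality levels, from most to least critical. Names are illustrative;
// QALM's internal naming may differ.
type criticality int

const (
	coreInfra    criticality = iota // core infrastructure requests: never dropped
	coreTripFlow                    // core trip flow requests: never dropped
	userFacing                      // user-facing feature requests: may be dropped under overload
	internalTest                    // internal/testing traffic: shed first under overload
)

// isCritical feeds the critical flag in the shedding decision sketched
// earlier: the top two levels are never dropped.
func isCritical(c criticality) bool { return c <= coreTripFlow }

// callerConfig is a hypothetical configuration entry, whether it comes from a
// local file or from the GUI pre-filled with Jaeger tracing data: it maps a
// caller of an endpoint to the criticality of that caller's requests.
type callerConfig struct {
	Procedure   string      // e.g. "getProvidedCards"
	Caller      string      // e.g. "document-staging"
	Criticality criticality // e.g. internalTest
}

// lookupCriticality falls back to a default level for unregistered callers.
func lookupCriticality(cfg []callerConfig, procedure, caller string, def criticality) criticality {
	for _, c := range cfg {
		if c.Procedure == procedure && c.Caller == caller {
			return c.Criticality
		}
	}
	return def
}
```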
Load testing experiments
We performed multiple load tests on one of our critical production services to quantify how QALM improved graceful degradation and criticality awareness. The integration significantly improved the latency of successful requests and correctly dropped non-critical requests during overload.
Graceful degradation
We stress tested one endpoint at 6,000 requests per second (RPS). Without QALM, latency worsened over time, climbing as high as 20 seconds. In practice, most services would time out long before reaching such latency.
Running the load test with dynamic overload detection enabled gave us impressive results. QALM improved graceful degradation significantly, keeping p99 latency under 400 milliseconds by dropping non-critical requests (20 percent of the total). We saw a 98 percent improvement in the latency of successful requests compared to load tests without QALM integration.
Criticality awareness
When the system becomes overloaded with requests, QALM sheds non-critical requests while preserving those we designate as critical. For this test, we configured two callers with different criticality levels (document: critical and document-staging: non-critical) and had them hit the same endpoint at about 2,300 RPS each. When QALM detected system overload, it dropped some of the document-staging requests so that all of the critical document requests succeeded.
From application layer to RPC layer
After we built and tested QALM, four core production services within our group integrated it and immediately reaped significant reliability improvements. Now that we knew the solution worked for our pilot services, we refocused our energies on determining how other teams could use QALM, too.
From multiple QALM service adoptions, we found that requiring both code and configuration changes slowed down integration. When considering how we could make QALM available to other groups within Uber, we focused on three principles:
- Load shedding should be a built-in feature of every service
- QALM should be easy to configure
- Service owners should not have to worry about mixing load shedding with business logic code
Fortunately, Uber is invested in many developer efficiency projects. One such project is YARPC, an open source remote procedure call (RPC) platform for Go services. YARPC features flexible, extensible middleware for handling both inbound and outbound requests.
We rewrote QALM as inbound middleware so it could easily be plugged into an existing YARPC service. In addition, we created a QALM module for the UberFx dependency injection framework. With this approach, service owners do not need to make any code changes to deploy QALM, and can enable load shedding through simple configuration. Using YARPC and UberFx, we reduced QALM’s integration time from a couple of days to under two minutes.
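The sketch below shows roughly what such a plugin looks like as YARPC unary inbound middleware, assuming the middleware.UnaryInbound and transport types from go.uber.org/yarpc (exact signatures may vary across versions). The Shedder interface and error message are our own illustration, not QALM’s real API.

```go
package qalmmiddleware

import (
	"context"

	"go.uber.org/yarpc/api/middleware"
	"go.uber.org/yarpc/api/transport"
	"go.uber.org/yarpc/yarpcerrors"
)

// Shedder decides, per request, whether to drop it; in QALM's case it would
// wrap the per-endpoint detectors and criticality configuration sketched
// earlier. The interface is hypothetical.
type Shedder interface {
	ShouldShed(procedure, caller string) bool
}

// LoadShed implements middleware.UnaryInbound, so a service picks it up
// through dispatcher configuration instead of touching handler code.
type LoadShed struct {
	Shedder Shedder
}

var _ middleware.UnaryInbound = (*LoadShed)(nil)

// Handle drops shed requests before any business logic runs.
func (m *LoadShed) Handle(
	ctx context.Context,
	req *transport.Request,
	resw transport.ResponseWriter,
	next transport.UnaryHandler,
) error {
	if m.Shedder.ShouldShed(req.Procedure, req.Caller) {
		return yarpcerrors.Newf(yarpcerrors.CodeResourceExhausted,
			"request shed: procedure=%q caller=%q", req.Procedure, req.Caller)
	}
	return next.Handle(ctx, req, resw)
}
```

A service then references this middleware in its dispatcher’s inbound middleware configuration; the UberFx module automates that wiring, which is why service owners only touch configuration.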
While moving QALM into the RPC layer made integration easier, we needed to make sure it would still work efficiently. Since QALM creates separate buffers for each endpoint, it inevitably introduces some CPU overhead. To understand QALM’s impact, we ran CPU profiling in a test with five endpoints on a single host, at 100 RPS each. Comparing this configuration to one without QALM, we found only a slight increase in CPU overhead of about three percent.
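Go’s built-in pprof tooling is the standard way to take this kind of measurement; here is a minimal, QALM-agnostic sketch.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiler next to the service under test; the load generator
	// drives the endpoints at 100 RPS each while the profile is taken.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the actual service's run loop
}
```

Collecting a 30-second profile with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` for runs with and without QALM enabled gives directly comparable CPU breakdowns.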
Using QALM in production at Uber
To make sure our new QALM YARPC plugin was ready for integration across Uber Engineering, we launched a closed beta program with another team in a different group. We took this beta program as an opportunity to improve our technical documentation and integration support.
We chose the team supporting Uber Visa for this program because their service has a very specific reliability requirement: endpoint-level isolation. One critical endpoint of the service, apply, which lets end users apply for a credit card, needs to be isolated from traffic spikes on non-critical endpoints.
Our load tests focused on endpoint-level isolation, using these parameters:
- ~10 RPS for the apply endpoint for 30 minutes
- >1,200 RPS spike for the non-critical endpoint getProvidedCards
Takeaways
We began developing QALM specifically for our team, and when we saw how it improves service reliability, we knew it could benefit teams across Uber Engineering. From our experiences developing and integrating QALM, we learned a few lessons:
- Reliability features should not be mixed with business logic. Service owners should be able to easily plug them in or disable them as needed. Building QALM into the RPC layer made adoption very easy.
- Metrics speak louder than words. Quantifying the reliability improvement makes the value of QALM easily understood by service owners. With thorough performance load tests, we can present a convincing case for reliability enhancement.
- Documentation is important, especially for self-serve onboarding. We provided a comprehensive wiki, monitoring dashboard, load test instructions, and an alert template to help service owners understand the process.
- Collaboration is very important when you are contributing to the RPC layer. We have been working with multiple teams on this project and proactively providing updates to stakeholders.
Moving forward
We are continuing to improve QALM to make it easier for teams to implement and to further increase reliability. Specifically, we intend to:
- Build an automated flow that reduces manual configuration for service owners when conducting load tests and setting up alerts, simplifying the integration process.
- Raise the alert threshold to avoid creating non-critical work for our on-call engineers. Integrating anomaly detection with QALM will also provide more accurate alert thresholds.
If building the next level of customer platform to support Uber’s sustainable growth interests you, come join us to make an impact.
Acknowledgements
QALM was built by the Customer Platform Foundation Team – Scott Yao, Ping Jin, Feng Wang, Xin Peng. We would also like to thank Kris Kowal from the RPC team for his collaboration on this project. Lastly, QALM could not have been built without the support from our engineering managers Deepti Chheda, Chintan Shah, and Yan Zhang.