As Uber’s architecture has grown to encompass thousands of interdependent microservices, we need to test our mission-critical components at maximum load in order to preserve reliability. Accurate load testing allows us to validate whether a set of services can handle peak usage efficiently while remaining reliable.
Load testing those services within a short time frame comes with its unique set of challenges. Most of these load tests historically involved writing, running, and supervising tests manually. Moreover, the degree to which tests accurately represent production traffic patterns gradually decreases over time as traffic organically evolves, imposing a long-term maintenance burden. The scope of the load testing effort continuously increases as the number of services grows, incurring a hidden cost to adding new features.
With this in mind, we developed Ballast, an adaptive load test framework that leverages traffic capture using Berkeley Packet Filter (BPF) and replays the traffic using a PID Controller mechanism to adjust the number of requests per second (RPS) to each service. Ballast removes the toil of writing, running, and supervising load tests, improves load test coverage, and performs continuous load testing, providing insight into service capacity and improving deployment safety on an always-on basis.
In the following sections, we will describe the design of Ballast and how this powerful framework for load testing has freed us from the associated toil.
High-Level Architecture Overview
At a high level, Ballast consists of six major components:
- Load Generator reads the load test fixture and forwards it to the target service to perform the load tests.
- Traffic Capture provides the framework with the ability to capture service traffic in real-time. This is used for test fixture preparation. Users can manually provide the test fixture as well.
- Golden Signals provides the framework with the ability to measure golden signals for production services, including latency, availability, throughput, and resource utilization (such as CPU), as well as the unit-economics cost of running the service.
- PID Controller is a proportional–integral–derivative controller for the load generator RPS. It takes the Golden Signals feedback to complete a control loop while running the load test.
- Scheduler provides the framework with the ability to schedule the load tests based on user needs. An always-on option for load tests is also available.
- Ballast Watchdog watches all running load tests and subscribes to critical alerts to keep each load test safe when unexpected issues arise, such as a global lockdown or a service outage.
Load Generator
Ballast’s default load generator is called Shadower.
Shadower uses a coordinator/worker architecture. The coordinator is in charge of querying all the load tests in the scheduled state and finding a suitable worker to run them.
Shadower Coordinator
The coordinator is the public-facing component. It is horizontally scaled and uses leader election. From any coordinator you can:
- Schedule a load test: Scheduling is done asynchronously, where the coordinator is continuously checking the load tests in the scheduled state and trying to find a suitable worker to run them.
- Update a load test: Any coordinator can receive this, but the call gets forwarded to the leader.
- Stop a load test: This is done asynchronously. When the worker sends a heartbeat, the coordinator checks the running load tests and notifies the worker if any of them has been stopped.
- Query a load test
- List all load tests
Shadower Worker
The worker is in charge of executing the load tests. It has the following features:
- Quota management: The worker uses heartbeats to transfer its state to the coordinator. This state transfer is asynchronous, but the call to run a load test, which the coordinator makes when it believes a worker has quota available, is synchronous. At that moment the worker can reject the load test and return its updated quota to the coordinator.
- Read payloads from Kafka.
- Read payloads from Storage.
- Request mirroring: If you provide multiple hosts, it can send the same request to every host.
- Metrics emission to our time-series database.
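The quota handshake described above can be sketched as follows. This is a minimal illustration, not Shadower's actual code: the `Worker` class, its RPS-based quota, and the method names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker that tracks quota as remaining RPS capacity."""
    capacity_rps: int
    running: dict = field(default_factory=dict)  # test_id -> reserved RPS

    def available_quota(self) -> int:
        return self.capacity_rps - sum(self.running.values())

    def heartbeat(self) -> dict:
        # Asynchronous state transfer: the coordinator's view may already be
        # stale by the time it acts on this snapshot.
        return {"available_rps": self.available_quota(),
                "running": sorted(self.running)}

    def run_load_test(self, test_id: str, rps: int) -> tuple:
        # Synchronous call: the worker has the final say and always returns
        # its up-to-date quota so the coordinator can correct its view.
        if rps > self.available_quota():
            return False, self.available_quota()
        self.running[test_id] = rps
        return True, self.available_quota()


worker = Worker(capacity_rps=1000)
stale_view = worker.heartbeat()["available_rps"]  # coordinator believes 1000
first = worker.run_load_test("test-a", 700)       # accepted
second = worker.run_load_test("test-b", 500)      # rejected: only 300 left
```

The second call is rejected even though the coordinator's last heartbeat snapshot suggested enough quota, which is the case the synchronous run call exists to handle.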
Shadower Mapper
A command-line tool that provides encoding/decoding for all the encodings we use at Uber (JSON, Thrift, etc.). It has the following features:
- Base64 output, so that a binary payload can be carried inside JSON.
- Gzip-compressed payloads.
- Method mapping: If the endpoint is named /v1/foobar in the payload, we need to map the endpoint to a Thrift/Proto message definition.
These payloads are later uploaded to a persistent store for the worker to be able to read from them.
Example: [input figure omitted] maps to → [output figure omitted]
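To make the three mapper features concrete, here is a hedged sketch that combines them using only Python's standard library. The record layout (`endpoint`, `message`, `body_b64`) and the message name `foobar.FooBarRequest` are hypothetical; the real Shadower Mapper format is not described in this article.

```python
import base64
import gzip
import json

# Hypothetical method mapping: endpoint path -> message definition name.
METHOD_MAP = {"/v1/foobar": "foobar.FooBarRequest"}


def encode_record(endpoint: str, body: bytes) -> bytes:
    """Wrap a binary request body in JSON (via Base64), then gzip it."""
    record = {
        "endpoint": endpoint,
        "message": METHOD_MAP[endpoint],  # KeyError if the endpoint is unmapped
        "body_b64": base64.b64encode(body).decode("ascii"),
    }
    return gzip.compress(json.dumps(record).encode("utf-8"))


def decode_record(blob: bytes) -> tuple:
    """Reverse encode_record: return (message name, raw body bytes)."""
    record = json.loads(gzip.decompress(blob))
    return record["message"], base64.b64decode(record["body_b64"])


blob = encode_record("/v1/foobar", b"\x00\x01binary-payload")
message, body = decode_record(blob)
```

Base64 is needed because JSON strings cannot hold arbitrary bytes, and the gzip layer keeps stored fixtures small.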
Traffic Capture
The traffic capture component reads packets off the wire and assembles them into a valid request payload. It leverages the Berkeley Packet Filter (BPF), exposed through the pcap package, to capture a specific service’s payload. BPF is a technology used in certain operating systems by programs that need to analyze network traffic.
It supports three protocols: HTTP/1.1, HTTP/2, and TChannel. TChannel is a networking framing protocol built at Uber for general RPC, supporting out-of-order responses with extremely high performance and allowing intermediaries to make forwarding decisions quickly.
This component captures packets for these protocols off the wire and assembles them into valid service requests, ready for the load generator to read.
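The capture-and-assemble step can be illustrated with a simplified sketch for HTTP/1.1 only. It assumes the TCP payloads of one connection have already been captured (for example with a pcap filter expression like the one built by `bpf_filter`); the function names are hypothetical, and the sketch ignores retransmissions, sequence-number wraparound, chunked encoding, and pipelined requests.

```python
from typing import Optional


def bpf_filter(port: int) -> str:
    # pcap filter expression selecting TCP traffic sent to the service port.
    return f"tcp dst port {port}"


def assemble_http_request(segments: list) -> Optional[dict]:
    """segments: (tcp_sequence_number, payload) pairs, possibly out of order."""
    data = b"".join(payload for _, payload in sorted(segments))
    head, sep, rest = data.partition(b"\r\n\r\n")
    if not sep:
        return None  # headers not complete yet
    request_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    method, path, _version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers.get("content-length", "0"))
    if len(rest) < length:
        return None  # body not complete yet
    return {"method": method, "path": path,
            "headers": headers, "body": rest[:length]}


# Two segments that arrived out of order.
segments = [
    (1026, b"Host: example\r\nContent-Length: 2\r\n\r\nhi"),
    (1000, b"POST /v1/foobar HTTP/1.1\r\n"),
]
request = assemble_http_request(segments)
```

Sorting by sequence number before joining is what turns a stream of packets into a replayable request; HTTP/2 and TChannel would need their own frame parsers.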
Golden Signals
This component is built on top of Uber’s metrics platform, M3, and provides the framework with the ability to retrieve the four golden signals (latency, traffic, errors, and saturation) for the load-tested service. During the load test, Ballast monitors the service’s health and availability by calling Golden Signals.
PID Controller
After Ballast starts the load test, the Ballast PID controller continuously calculates an error value as the difference between the desired service state (the goal state, r(t)) and a measurement of the service’s golden signals (y(t)). It then applies a correction to the load generator’s target throughput based on proportional (P), integral (I), and derivative (D) terms. The following is how we define the PID controller in Ballast:
Defining u(t) as the controller output, the final form of the PID algorithm is:
$$u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d\, \frac{de(t)}{dt}$$
Where:
Kp is the proportional gain, a tuning parameter,
Ki is the integral gain, a tuning parameter,
Kd is the derivative gain, a tuning parameter,
e(t) = r(t) – y(t) is the error (r(t) is the setpoint, or goal state, and y(t) is the feedback value),
t is the current (instantaneous) time,
τ is the variable of integration (it takes on values from time 0 to the present t).
Using Ballast domain language:
r(t): service golden signals SLO, including target CPU usage (e.g., 80%), request latency (e.g., 600ms), error rate (e.g., 0.1%)
y(t): measured service golden signals in real-time
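A discrete-time version of this controller can be sketched in a few lines. This is a textbook PID update, not Ballast's implementation: the class name, the fixed sampling interval `dt`, and the choice of CPU percentage as the single controlled signal are assumptions for the example. The gains are the ones quoted for the map search example in this article (P: 4, I: 0.2, D: 1).

```python
class PIDController:
    """Textbook discrete PID: the integral is a running sum, the derivative
    a finite difference between consecutive error samples."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured                 # e(t) = r(t) - y(t)
        self.integral += error * dt                 # approximates the integral of e
        if self.prev_error is None:
            derivative = 0.0                        # no previous sample yet
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)             # u(t)


pid = PIDController(kp=4, ki=0.2, kd=1)
u1 = pid.update(setpoint=80.0, measured=50.0, dt=1.0)  # CPU far below target
u2 = pid.update(setpoint=80.0, measured=70.0, dt=1.0)  # CPU closing in
```

With these inputs the first correction is large (about 126) and the second much smaller (about 28), because the shrinking error makes the derivative term negative. How u(t) is translated into an RPS change is not specified here; adding it to the current target RPS is one plausible choice.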
Ballast Data Flow
- Ballast captures the production traffic or users manually prepare the test fixture.
- Users configure a load test plan with the target service name, data center, target SLOs, test fixture location, etc.
- Ballast starts the load test and monitors the service’s golden signals to adjust the load generator’s target throughput.
- Ballast stops the load test and records the load test result when the service reaches the target SLOs (e.g., when CPU usage reaches 80%).
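The flow above can be sketched end to end against a toy service model. Everything here is illustrative: `LoadTestPlan` and its fields are a hypothetical plan shape, `simulated_cpu_pct` stands in for a real golden-signals query, and the correction is reduced to a proportional-only step to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class LoadTestPlan:
    """Hypothetical plan shape, mirroring the fields listed above."""
    service: str
    data_center: str
    target_cpu_pct: float
    fixture_location: str


def simulated_cpu_pct(rps: float) -> float:
    # Toy stand-in for golden signals: CPU grows linearly with load.
    return 0.08 * rps


def run_load_test(plan: LoadTestPlan, gain: float = 4.0,
                  tolerance: float = 1.0, max_steps: int = 100) -> dict:
    rps = 0.0
    for step in range(max_steps):
        cpu = simulated_cpu_pct(rps)           # monitor the golden signal
        error = plan.target_cpu_pct - cpu
        if error <= tolerance:
            # Target reached: stop the test and record the result.
            return {"service": plan.service, "peak_rps": rps,
                    "cpu_pct": cpu, "steps": step}
        rps += gain * error                    # adjust target throughput
    raise RuntimeError("target not reached within max_steps")


plan = LoadTestPlan(service="map-search", data_center="dc1",
                    target_cpu_pct=80.0, fixture_location="fixtures/map-search")
result = run_load_test(plan)
```

Because each step closes a fixed fraction of the remaining gap, RPS rises quickly at first and then more slowly as CPU approaches the target, stopping just below the rate at which the toy model would hit 80% CPU.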
The following example shows a Ballast load test controlled by the PID controller for our map search (P: 4, I: 0.2, D: 1):
Shadower’s RPS ramped up quickly at the beginning but slowed down as the service’s CPU usage approached the target CPU usage.
Ballast Use Cases at Uber
Holiday Peak Capacity Estimation
Ballast can run continuously without human intervention. Canary deployments are used to detect and prevent bugs before a change is rolled out globally. We enabled Ballast in the canary deployment of our services so that we always know their capacity limits. With Ballast, preparing the proper capacity for an anticipated holiday peak becomes simple math: the estimated holiday peak throughput divided by the Ballast load test peak RPS per instance gives the number of instances required.
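As a worked instance of that arithmetic, the sketch below rounds the quotient up to whole instances. The traffic numbers are made up for illustration, and the optional `headroom` safety margin is an assumption that is not part of the formula stated above.

```python
import math


def instances_needed(estimated_peak_rps: float,
                     peak_rps_per_instance: float,
                     headroom: float = 0.0) -> int:
    """Estimated peak throughput / per-instance peak RPS, rounded up.

    headroom is a hypothetical safety margin (0.2 means 20% extra).
    """
    return math.ceil(estimated_peak_rps * (1 + headroom) / peak_rps_per_instance)


# Illustrative numbers: a 120,000 RPS holiday peak and 950 RPS per instance.
base = instances_needed(120_000, 950)                      # 126.3... rounds up to 127
with_margin = instances_needed(120_000, 950, headroom=0.2)
```

Rounding up matters: 126 instances would fall just short of the estimated peak.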
Resilient Rollout
Ballast is used for improving canary deployment defect detection and reducing outages in production. We have enabled Ballast’s always-on feature to continuously run load tests in a service’s canary deployment. When a developer deploys new code in canary, the before/after metrics are immediately visible. For example, we know there’s been a performance degradation when:
- A Ballast run triggers a CPU alert on a service that on previous runs never did.
- Ballast QPS goes down considerably compared to previous runs.
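The second signal can be turned into a simple automated check. This is a sketch under stated assumptions, not Ballast's detection logic: the median baseline and the 10% drop threshold are hypothetical choices.

```python
from statistics import median


def is_regression(previous_peak_rps: list, current_peak_rps: float,
                  max_drop: float = 0.10) -> bool:
    """Flag a run whose peak throughput fell more than max_drop below
    the median of previous runs (threshold chosen for illustration)."""
    baseline = median(previous_peak_rps)
    return current_peak_rps < baseline * (1 - max_drop)


history = [1000.0, 980.0, 1010.0]          # peak RPS from earlier canary runs
degraded = is_regression(history, 850.0)   # 15% below the 1000 RPS median
healthy = is_regression(history, 950.0)    # 5% below: within tolerance
```

Using the median rather than the latest run keeps a single noisy run from shifting the baseline.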
Load Shedder Behavior Testing
Ballast is used for testing the behavior of a load shedder used by the Uber Eats backend. To ensure that only the lowest-priority requests are dropped under heavy load, Ballast can inject different combinations of requests to simulate various production traffic scenarios.
Production Debugging
Ballast is used for replaying production traffic with a high error rate in staging to narrow down the failed requests. By analyzing the failed requests, developers can isolate the triggered code paths and efficiently debug production issues.
Final Thoughts
As we continue onboarding more services and use cases with Ballast, we are observing the emergence of an ecosystem of payload encoders, PID tuners, next-generation observability-based supervisors, chaos engineering scenarios, and more. Collectively these are analogous to an operator architecture, with the Ballast controller running the reconciliation loop. Ballast lays the groundwork for the future of our resiliency testing, and we’re excited about its potential as our reliability platform. We hope this article is helpful to you!
Minglei Wang
Minglei Wang is a Staff Software Engineer on the Maps Production Engineering team at Uber. He works on reliability and efficiency initiatives across multiple organizations and platforms. He is currently leading the load testing initiative across the company.
Cristian Velazquez
Cristian Velazquez is a Staff Software Engineer on the Maps Production Engineering team at Uber. He works on efficiency initiatives across multiple organizations. He has carried out tuning work across multiple services and multiple Java versions.
Posted by Minglei Wang, Cristian Velazquez