Uber has extensively adopted Go as a primary programming language for developing microservices. Our Go monorepo consists of about 50 million lines of code and contains approximately 2,100 unique Go services. Go makes concurrency a first-class citizen: prefixing a function call with the go keyword runs the call asynchronously, and these asynchronous function calls are called goroutines. Developers hide latency (e.g., IO or RPC calls to other services) by creating goroutines within a single running Go program. Goroutines are considered “lightweight”, and the Go runtime context-switches them on operating-system (OS) threads. Go programmers often use goroutines liberally. Two or more goroutines can communicate data either via message passing (channels) or shared memory. Shared memory happens to be the most commonly used means of data communication in Go.
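As a minimal, illustrative sketch (not code from our repository), the snippet below spawns goroutines with the go keyword and shows both communication styles: a channel for message passing, and a shared variable whose accesses are ordered by a mutex and a WaitGroup.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Message passing: the goroutine sends its result over a channel,
	// and the receive in main orders the two goroutines.
	results := make(chan int)
	go func() { results <- 42 }()
	fmt.Println(<-results)

	// Shared memory: both goroutines touch the same variable; the mutex
	// and the WaitGroup provide the ordering between the accesses.
	var (
		mu      sync.Mutex
		counter int
		wg      sync.WaitGroup
	)
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock()
		counter++
		mu.Unlock()
	}()
	wg.Wait()
	fmt.Println(counter) // safe: wg.Wait() orders the write before this read
}
```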
A data race occurs in Go when two or more goroutines access the same memory location, at least one of the accesses is a write, and there is no ordering between them, as defined by the Go memory model. Outages caused by data races in Go programs are a recurring and painful problem in our microservices. These issues have brought down our critical, customer-facing services for hours in total, causing inconvenience to our customers and impacting our revenue. In this blog, we discuss deploying Go’s default dynamic race detector to continuously detect data races in our Go development environment. This deployment has enabled detection of more than 2,000 races, of which ~1,000 have been fixed by more than 200 engineers.
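For illustration only, the following hypothetical snippet satisfies that definition: main and the spawned goroutine both write counter, and no synchronization orders the two writes.

```go
package main

import "fmt"

func main() {
	counter := 0
	done := make(chan struct{})
	go func() {
		counter++ // write from the spawned goroutine
		close(done)
	}()
	counter++ // concurrent write from main: nothing orders it with the write above
	<-done
	fmt.Println(counter)
}
```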
Dynamically Detecting Data Races
Dynamic race detection involves analyzing a program execution by instrumenting the shared memory accesses and synchronization constructs. The execution of unit tests in Go that spawn multiple goroutines is a good starting point for dynamic race detection. Go has a built-in race detector that can be used to instrument the code at compile time and detect races during test execution. Internally, the Go race detector uses the ThreadSanitizer runtime library, which employs a combination of lock-set and happens-before based algorithms to report races.
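As a concrete, hypothetical illustration (the package, type, and test names below are ours, not from the Uber codebase), a unit test like the following exercises a racy method; building and running it with go test -race instruments the memory accesses so that ThreadSanitizer reports the conflicting writes, whereas a plain go test runs silently.

```go
package counter

import (
	"sync"
	"testing"
)

// Counter is an illustrative type whose field is updated without synchronization.
type Counter struct{ n int }

func (c *Counter) Inc() { c.n++ }

// TestCounterParallel spawns two goroutines that both call Inc, producing
// unsynchronized writes to c.n. Running `go test -race` on this package makes
// the instrumented build report the race; a plain `go test` stays silent.
func TestCounterParallel(t *testing.T) {
	c := &Counter{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Inc()
		}()
	}
	wg.Wait()
}
```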
Important attributes associated with dynamic race detection are as follows:
- Dynamic race detection will not report all races in the source code, as it depends on the executions analyzed
- The set of detected races depends on the thread interleavings and can vary across multiple runs, even though the input to the program remains unchanged
When to Deploy a Dynamic Data Race Detector?
We use more than 100,000 Go unit tests in our repository to exercise the code and detect data races. However, we faced a challenging question: when should the race detector be deployed?
Running a dynamic data race detector at pull request (PR) time is fraught with the following problems:
- Race detection is non-deterministic, so a race introduced by a PR may not be exposed and can go undetected. As a consequence, a later, benign PR may trigger detection of the dormant race and get incorrectly blocked, impacting developer productivity. Further, the presence of pre-existing data races in our 50M-line codebase makes this approach a non-starter.
- Dynamic data race detectors incur 2-20x execution-time and 5-10x memory overheads, which can result in either violation of our SLAs or increased hardware costs.