Engineering, Backend

Debugging with Production Neighbors – Powered by SLATE

18 June 2024 / Global

Introduction

Software development is an iterative and staged process that needs validation and testing at function, component, and service levels. In the case of microservice-based architecture, it becomes far more important to develop in conjunction with dependent services. Microservice-based architecture provides distinct advantages that allow us to scale, maintain, and abstract responsibilities. The more abstraction, the easier it is for us to develop and define business logic. 

SLATE is an E2E testing tool that bridges this gap by allowing services under test to be deployed alongside production upstream and downstream services. This allows developers to generate test requests that mirror the production call flow yet target services under test. Such functionality facilitates various use cases, including feature development within a production environment or replicating production bugs, which often entail troubleshooting both code and configuration. To simplify troubleshooting and bring it closer to the local experience, we have developed features that enable debugging of services deployed in the SLATE environment.

In this blog, we’ll explore the different debugging options developed on SLATE that emulate the behavior of services under test with production upstreams and downstreams.

Let us look at the following three high-level options in detail:

  1. Remote debugging a SLATE deployed instance
  2. Local debugging on a laptop/devpod machine
  3. Debug issues by filtered monitoring

Debugging Until Now

Debug via Logs

Debugging using logs is a fundamental practice that provides insights into a program’s execution. Logs enable developers to identify issues. However, inefficient logging can clutter the system with irrelevant information, complicating rather than aiding the debugging process.

Debug via Staging

A staging environment is a developer-controlled environment that mirrors the production setup.

While staging environments are very beneficial, they may still differ from the live environment, which can give false confidence, and they come with a longer turnaround time.

Debug Locally

Local debugging is essential for faster iteration when testing a service in isolation. However, debugging user scenarios can be challenging because it is hard to debug multiple services together at the same time.

Remote Debugging of SLATE Instance

Testing and debugging on SLATE relies on logs from the service being tested. Depending solely on logs for understanding complex processes isn’t practical. Additionally, adding new logs requires a new deployment, causing delays. Remote debugging can address these issues by letting developers step through statements and monitor variables, eliminating the need for commit and deployment iterations. Because SLATE works alongside production infrastructure, the design needs to balance security and developer experience.

This brings a need to enhance visibility into the code for runtime debugging, achieved through breakpoints, step-ins, or dynamic tracepoints. Remote debugging is limited to SLATE instances handling test requests to ensure production security.

High-level Goals

  • Deploy a debuggable binary/code on a SLATE container
  • Ability to add breakpoints and tracepoints to a service under test
  • Ability to see values of different params on control hitting a breakpoint
  • Create a seamless developer experience similar to remote debugging 
  • Design solutions to be compliant with security and privacy requirements

Design

SLATE leverages the production infrastructure to generate containers, compile code, and execute services. However, modifications were required to the build and deployment infrastructure to facilitate debugging functionalities for services deployed on SLATE. This involved three significant enhancements: first, enabling the generation of builds with integrated debugging tools and functionalities; second, configuring software execution with remote debugging options; and third, facilitating developer access to remote containers by allocating and exposing ports from those containers.

Debuggable Deployment

The current deployment pipeline is not flexible enough to generate both debuggable and production binaries. To generate and deploy a debuggable binary, multiple components of the pipeline must be aware of the binary type and configure their features accordingly. This diagram indicates the components involved in the feature development.

Fig 1: Modifications to the deployment pipeline to support debugging for SLATE.

Allocating Ports

The SLATE container gets created alongside the production host. To connect to the debugger, we have to expose a new debug port, similar to a gRPC/HTTP port. Currently, UP is responsible for allocating random ports and mapping them to host ports. The new port is exposed only for debuggable SLATE deployments, and SLATE implicitly handles only test requests by design. This new port exposure needs a security review. The diagram below indicates the high-level interactions.

Fig 2: Allocation and safely exposing debug ports.
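
As a rough illustration of the port allocation described above, the following minimal sketch asks the operating system for unused ports and adds a debug port only for debuggable deployments. It is not how UP allocates ports; the function names and the port-name keys (`grpc`, `http`, `debug`) are assumptions made for this example.

```python
import socket


def allocate_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # The port is released when the socket closes; a real allocator
        # would keep it reserved until the container binds it.
        return sock.getsockname()[1]


def build_port_map(is_debuggable: bool) -> dict:
    """Return a port-name -> host-port mapping for one container (hypothetical)."""
    ports = {"grpc": allocate_free_port(), "http": allocate_free_port()}
    if is_debuggable:
        # Only debuggable SLATE deployments get the extra debug port.
        ports["debug"] = allocate_free_port()
    return ports
```

Keeping the debug port out of the map for regular deployments means nothing needs to be closed later: a production binary simply never has the port exposed.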

Reaching Debuggable Service

To improve security and avoid malicious access, SLATE debugging needs to be access controlled. This ensures that only the service owners are able to connect the debugger. The diagram below indicates the access control that limits access to only the LDAP users of the service.

Fig 3: Password-based SSH tunneling to the remote host from the developer machines.

Debugger Execution

  • The debugger runs the application within a dedicated debugging server
  • The process blocks, awaiting attachment by the debugger client
  • The debugger process listens on a specific TCP/IP network port, referred to as a debug port

Controlling Program Execution

Debugging clients (e.g., VS Code or JetBrains IDEs such as GoLand) connect via the debug port. Clients issue commands for various debugging tasks like setting breakpoints, displaying local variables and function arguments, printing CPU register contents, etc.

Remote Debugging with Debug Port

Remote debugging enables debugging on diverse environments, configurations, or architectures. It is useful for troubleshooting specific scenarios or hardware/software-related issues that cannot be replicated locally.

Access Control

Restricting for LDAP Users

During debugging sessions, users attach the debugger to the application to intervene in program execution and gather debug information in order to identify and resolve bugs in the program. If this were allowed for every user, anyone could gain access to sensitive service information. Restricting access to the service’s LDAP users (service developers/owners) is therefore important to ensure a minimum level of security.

Secure SSH Connection

For remote debugging, a secure SSH connection is established between the local and remote systems. This allows for local port forwarding, which redirects debug requests through an SSH tunnel. The tunnel ensures encrypted communication and secure data transmission.
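
To make the local port forwarding concrete, here is a minimal sketch that builds a standard OpenSSH command line. The host name and port numbers are placeholders, `build_tunnel_command` is a hypothetical helper, and the default user is the “slatedev” account named under SSH Authorization; the actual SLATE tooling may set up the tunnel differently.

```python
import shlex


def build_tunnel_command(
    remote_host: str,
    remote_debug_port: int,
    local_port: int,
    user: str = "slatedev",
) -> list:
    """Build an ssh command that forwards local_port to the remote debug port."""
    # -L local:host:remote forwards connections on the local port through the
    # tunnel; "localhost" is resolved on the remote side.
    forward = f"{local_port}:localhost:{remote_debug_port}"
    # -N opens the tunnel without running a remote command.
    return ["ssh", "-N", "-L", forward, f"{user}@{remote_host}"]


command = build_tunnel_command("slate-host.example.com", 40000, 2345)
print(shlex.join(command))
# prints: ssh -N -L 2345:localhost:40000 slatedev@slate-host.example.com
```

With such a tunnel open, a debugging client on the developer machine would connect to `localhost:2345` and the traffic would travel encrypted to the debug port on the remote host.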

SSH Authorization

To begin an SSH connection, users need the correct password linked to the “slatedev” account. This password is a randomly generated 16-digit code stored in a file within the service container. The password is generated during the container’s startup, before the main service application runs. The password is accessible only to the container access group, which is the service owner LDAP group. LDAP users can access the password through Compute CLI, enabling them to establish SSH connections and perform debugging tasks. Compute CLI restricts access for non-LDAP users, so they cannot retrieve the password.
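
The startup step described above could look roughly like the following sketch, which generates a random 16-digit code and writes it to a file that only the owner and group can read. The function names, file location, and the `0o640` mode are assumptions for illustration on a POSIX system; the post does not describe how the file permissions are actually enforced.

```python
import os
import secrets


def generate_debug_password(length: int = 16) -> str:
    """Return a random numeric code from a cryptographically secure source."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


def write_password_file(path: str, password: str) -> None:
    """Write the password so that only the owner and group can read it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o640)
    with os.fdopen(fd, "w") as handle:
        handle.write(password)
    # Set the mode explicitly so it does not depend on the process umask.
    os.chmod(path, 0o640)
```

Using `secrets` rather than `random` matters here because the code guards access to a debugger on production infrastructure and should not be predictable.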

Limitations

  • Remote debugging on production infra does not allow dynamic modifications, so it’s limited to read-only
  • Large iteration time, as each change involves build, deploy, and test

Local Debugging using SLATE Attach

Remote debugging allows for read-only debugging on production infrastructure. Being in production infra allows for seamless connections with upstream and downstream services/tools. For a developer, however, it’s also important to have a debuggable environment with faster iteration to fix and test changes. SLATE Attach fills this gap by creating a local debugging experience connected to production upstream and downstream services, allowing for rapid development on attached local environments.

High-level Goals

The main goal is to reduce the code-deploy-test cycle (and hence, the time to validate iterative changes), by providing E2E testing with local development instances (laptop or dev machine), ensuring production isolation and safety.

Iteration Cycle

The iteration cycle in this context is the time between making a code change and validating it. The smaller the iteration cycle, the more efficient the use of developers’ time for end-to-end validation of subsequent changes.

Fig 4: Iteration steps for development using SLATE.

Need for SLATE Attach

  • Iterative development generates a build binary at a faster pace
  • Reduced code-deploy-test cycle
  • Faster identification and resolution of local, E2E failures
  • Faster setup time 
  • Avoid the need for service changes or onboarding

Design

This design aims to introduce a SLATE proxy that handles all the test requests aimed at SLATE instances for local debugging. These requests will then be redirected to the appropriate local developer machine for debugging and development. This allows users to iterate faster and improve developers’ productivity. 

This feature involves two main contexts in the SLATE environment lifecycle:

  1. SLATE control plane that maps a local laptop/devpod to a SLATE environment
  2. Test request data plane that redirects the requests to developers’ laptops

Control Plane

The main feature of the control plane is to enable services running in local laptops or devpods to attach to a SLATE environment. The local laptop/devpod that intends to run the service has to attach local environment credentials to a SLATE environment so that test requests are routed locally. The prerequisite for this attachment is to create a SLATE environment. This will allow mapping updates in routing control DB and local routing DB.

Fig 5: Request Call flow for testing code running on developer machines

Call flow

  1. User initiates SLATE attach from local laptop/devpod
  2. The SLATE CLI calls Attach() API of SLATE Backend
  3. SLATE Backend fetches the Proxy information (host:port) from SLATE Proxy
  4. SLATE Backend updates the routing override in routing control DB using the fetched proxy info
  5. User initiates the SSH Session using the Cerberus CLI
  6. Cerberus gateway adds the mapping of deputized tenancy/UUID to the laptop credentials in Flipr DB and creates an SSH session for the laptop

Routing Control DB

Routing Control DB maps test tenancy to routing overrides and user account UUIDs to test tenancy. It stores the SLATE Proxy host:port against the service under test and ensures that all requests targeting a particular SLATE environment reach SLATE Proxy. SLATE Proxy finally routes the request to the development instance running on the user’s machine.

Local Routing DB

Local routing DB contains the development instance’s credentials that have been attached to the SLATE environment. SLATE Proxy interacts with the local routing DB to fetch routing credentials and finally routes the request to the service under test running in the local environment.
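
The two stores described above can be pictured with a small in-memory sketch. The class names, field names, and the single `attach` function are hypothetical; in the call flow described earlier, the routing override is written by SLATE Backend and the laptop mapping by the Cerberus gateway, so one function is a deliberate simplification.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingControlDB:
    # (tenancy, service under test) -> SLATE Proxy "host:port"
    overrides: dict = field(default_factory=dict)
    # user account UUID -> test tenancy
    tenancy_by_uuid: dict = field(default_factory=dict)


@dataclass
class LocalRoutingDB:
    # tenancy -> development instance "host:port"
    dev_instances: dict = field(default_factory=dict)


def attach(control_db, local_db, *, tenancy, user_uuid, service, proxy_addr, dev_addr):
    """Record the mappings that send a tenancy's test requests to a dev machine."""
    control_db.tenancy_by_uuid[user_uuid] = tenancy
    # Requests for this service in this tenancy are redirected to the proxy.
    control_db.overrides[(tenancy, service)] = proxy_addr
    # The proxy later looks up where the development instance is running.
    local_db.dev_instances[tenancy] = dev_addr
```

Splitting the data this way keeps the production routing layer unaware of individual laptops: it only ever learns the proxy address, and the proxy alone resolves the developer machine.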

Data Plane

This section mainly talks about the flow of test requests from different clients (mobile, studio, web, etc.). The data plane mainly involves two entities: the routing override header and the host tenancy mapping. The diagram below indicates how different test requests reach a local laptop through the SLATE proxy. The control plane ensures that the routing override and host mapping are maintained in the respective databases.

Fig 6: Proxy setup for routing test requests to developer machines.

The figure above shows the test request flow targeted at a local laptop with production upstreams and downstreams:

  1. Test account request originates from mobile client
  2. E2E test proxy retrieves routing override and injects the routing header to the test request
  3. The test request propagates through production services via Mutley until it targets service 3
  4. The request is redirected to SLATE proxy, as the routing override has the SLATE proxy host:port against service 3
  5. SLATE proxy forwards the request to an open port on Cerberus-gateway based on a host:port config in the Cerberus-deputy Flipr namespace, set by the user when running the Cerberus CLI
  6. The Cerberus-gateway forwards the request to the user’s local development machine for the user to debug
  7. From the local laptop, the request will be finally forwarded to production downstreams through Cerberus
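
Steps 2 to 4 above boil down to a per-hop routing decision, sketched below. The header name `x-routing-override` and its `service=host:port` encoding are invented for this example and are not the actual wire format; the point is only that a request without an override for the target service keeps going to production.

```python
ROUTING_HEADER = "x-routing-override"  # hypothetical header name


def parse_overrides(value: str) -> dict:
    """Parse 'service=host:port,other=host:port' into a dict."""
    result = {}
    for item in value.split(","):
        if "=" in item:
            service, addr = item.split("=", 1)
            result[service.strip()] = addr.strip()
    return result


def next_hop(headers: dict, target_service: str, production_hosts: dict) -> str:
    """Pick the override address for the target service, else production."""
    overrides = parse_overrides(headers.get(ROUTING_HEADER, ""))
    return overrides.get(target_service, production_hosts[target_service])


production = {"service2": "service2-prod:8080", "service3": "service3-prod:8080"}
test_headers = {ROUTING_HEADER: "service3=slate-proxy:9000"}

print(next_hop(test_headers, "service2", production))  # service2-prod:8080
print(next_hop(test_headers, "service3", production))  # slate-proxy:9000
print(next_hop({}, "service3", production))            # service3-prod:8080
```

Because the override travels with the request, production services upstream of service 3 handle the test request normally and only the final hop is diverted to the proxy.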

Limitations

  • Running a service locally may not be feasible for some complex services, as they need some dependencies, like spanners, that can only exist in production infra
  • This is limited to test requests, since it enables dynamically changing requests; this restriction in turn keeps production traffic secure
  • Requests time out when a debug request waits too long in the local environment

Impact

  • Plug-and-play development environment to improve developer productivity
  • Ability to create local experiences that co-work with production for developers
  • Increase Developer Velocity: Production debugging can help developers identify and fix issues more efficiently

Fig 7: Impact figures for improving developer velocity using SLATE attach feature.

What’s Next

SLATE Sniffer to debug issues by monitoring

Remote and local debugging mainly allow debugging of test requests. There is a need for observability on production beyond the logs that show up in the uMonitor tool. We aim to provide this observability precisely and on demand using SLATE Sniffer. The main goals of SLATE Sniffer include:

  • Capture requests and responses filtered by service and UUID
  • Ability to support and filter on Production and Test requests

Conclusion

Our objective is to enhance the SLATE platform, positioning it as the primary tool for debugging production issues. The debugging features integrated into SLATE strike a balance between security and developers’ requirements. SLATE has introduced a new paradigm for developers’ code-related activities and service bootstrapping. We are looking forward to collaborating with different teams to shift the quality left and create visibility on potential issues at the early stage of development.

Vasu Kakkirala

Vasu Kakkirala is a Staff Engineer and Tech Lead on the Platform Engineering team at Uber. He plays a major role in the ongoing development of SLATE and its seamless integration of other tools within the Uber ecosystem. He is passionate about solving design problems at scale.

Lakshita Bhatia

Lakshita, a Software Engineer II at Uber for over three years, specializes in building end-to-end (E2E) backend testing tools. Since SLATE's inception, she has been pivotal to its growth. Currently, she is focused on enhancing debugging capabilities within the platform.

Sagar Talla

Sagar is a Senior Full-Stack Engineer working on the Platform Engineering team at Uber. He played a significant role in the development of different tools like Hailstorm, BITS, and SLATE at Uber. He is passionate about problem-solving and backend development.

Abhishek Sharma

Abhishek is a Software Engineer II on the Platform Engineering team at Uber. He actively contributes towards building backend tools that improve E2E testing and debugging experience at Uber across services.

Posted by Vasu Kakkirala, Lakshita Bhatia, Sagar Talla, Abhishek Sharma