
Uber’s Sustainable Engineering Journey

2 March 2023 / Global

Introduction

Uber has made a commitment to sustainability by setting several goals across various sectors. By 2030, Uber plans to become a zero-emission mobility platform in Canada, Europe, and the US – and by 2040, worldwide. Uber Green, which offers no- or low-emission rides, has become the most widely-available option of its kind globally. However, this commitment encompasses more than just rides, as it also includes Uber’s engineering infrastructure such as its data centers and hardware resources, both on-premise and in public clouds.

As engineers and technology leaders, we nurture and develop the concept of responsible ownership, which is often thought of as maintaining the high quality of our products. Responsible ownership also implies building efficient services, with energy-efficiency and sustainability metrics as an integral part of how we assess that efficiency.

In late 2021, we embarked on a journey to identify the best sustainable engineering practices, tools, and technologies, and began building them into our services, products, and training sessions. In this article, we present our vision and roadmap, walk through Uber Eng best practices for engineering sustainably towards a zero-emission world, and introduce novel, sustainability-oriented services.

Vision and Roadmap

We track progress across four main categories, starting from the most manual efforts (awareness and training), and continuously progressing towards automation and “sustainable by default” configuration.

  1. Awareness and training. Working with partners across leading tech companies, we created a program outlining best practices, guidelines, and requirements. With Engineering Sustainability in Software Development still being a new field, unfamiliar to many employees, we found leadership-backed training to be a crucial component. One such example is the required sustainability questionnaire in all documentation preceding development: product owners and developers are encouraged to answer how they measure CO2 emissions and what efforts are in place to reduce them.
  2. Assisted development. We built services that continuously analyze the environment, then use ticketing services (Jira, ServiceNow) to send service owners timely recommendations for changes that improve sustainability. We also provide metrics quantifying the impact of accepting and following a recommendation, so that service owners can prioritize impactful changes that require less effort first.
  3. Automated sustainability improvements. Arguably, the most challenging part is automating sustainability-improving changes while minimizing potential impact on other criteria, such as performance. We identify low performance-impact areas and communicate upcoming changes to the resource owners. In some cases the process requests a one-click approval, while in others changes are made automatically (always followed by a notification). One example is setting up intelligent tiering of cloud storage (see AWS example here) by default, which automatically moves infrequently accessed data to a less expensive, “cold” storage tier and back to a warmer tier if it begins to be accessed more frequently (see the sketch after this list).
  4. Sustainable by default. During the resource provisioning stage, we make the most sustainable choices the default and obvious ones. For example, by default we allow provisioning only in highly sustainable geographic regions. If a less sustainable region must be used, the resource requester must intentionally select it and may need to provide justification for deploying resources there.
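To make the intelligent-tiering example above concrete, here is a minimal sketch of what a “tier by default” setup could look like for an S3 bucket using boto3. The bucket name and transition thresholds are illustrative assumptions, not our production configuration.

```python
# Hypothetical sketch: enable Intelligent-Tiering-style storage by default.
# Bucket name and thresholds are illustrative, not Uber's actual settings.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-logs-bucket"  # hypothetical bucket

# Transition objects into the INTELLIGENT_TIERING storage class, so S3 moves
# them between frequent- and infrequent-access tiers automatically based on
# observed access patterns.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "default-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)

# Optionally, opt objects that stay untouched for months into the even
# colder archive tiers of Intelligent-Tiering.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket=BUCKET,
    Id="archive-cold-data",
    IntelligentTieringConfiguration={
        "Id": "archive-cold-data",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```

The same idea applies to other providers; the key point is that cold data drifts to cheaper, lower-energy storage automatically, without the service owner having to remember to move it.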

Sustainable Software Development

Sustainability Stack

All services run on physical hardware, whether compute-heavy or storage-heavy, and that underlying hardware produces CO2. Minimizing both time and space complexity not only affects program execution speed and storage or usage costs, but also the CO2 emissions generated by that program. Therefore, code writers and reviewers should treat a positive sustainability impact as an important additional motivation when optimizing for code efficiency (see the sketch below).
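As a toy illustration (not taken from any Uber service), the same task implemented with different time complexity consumes very different amounts of CPU time, and therefore energy:

```python
# Toy example: finding duplicate IDs in a list.
# Both functions return the same result; the second does far less work,
# which translates directly into less CPU time and less energy consumed.

def find_duplicates_quadratic(ids):
    """O(n^2): compares every pair of elements."""
    dups = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if ids[i] == ids[j] and ids[i] not in dups:
                dups.append(ids[i])
    return dups

def find_duplicates_linear(ids):
    """O(n): a single pass using a set."""
    seen, dups = set(), set()
    for x in ids:
        if x in seen:
            dups.add(x)
        seen.add(x)
    return sorted(dups)
```

On a list of a million IDs, the quadratic version does hundreds of thousands of times more work than the linear one; that difference shows up directly in CPU hours and, ultimately, in emissions.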

The second major part of the equation is optimizing the hardware we utilize. Is it energy-efficient? Is it overprovisioned or underutilized? Are there VMs or SaaS resources running needlessly that should’ve been decommissioned long ago?

The third and final part is the energy spent powering and running the hardware, including electricity, water, and leased or owned land. We should ask questions such as: What are the energy regulations in that geography? How much of the data center’s electricity is produced from “green energy”? What are the environmental effects of using the land for data centers?

In summary, to optimize for sustainability, we need to drive progress across all the categories of what we call the “Sustainability Stack”. Luckily, with so many of the tech-stack responsibilities being outsourced to Data Center and Cloud Providers and their managed offerings, actively optimizing for sustainability mostly becomes a challenge of comparing and tracking the providers’ own sustainability metrics, coupled with our own responsible management of the services they provide and of the software we write. Note that Cloud Providers do not simply “take care of everything”; there is still much to be done by the consumer, whose job becomes easier once they’re aware of the tools made available to them.

Shared Responsibility and Shared Fate

Many users of the public clouds will recognize the “Shared Responsibility” model, usually referred to when discussing security responsibilities of Cloud Service Providers (CSPs) and cloud users (see Figure 1 for AWS mapping of shared responsibility). When it comes to sustainability, we believe the term “Shared Fate” better describes the process and outcome of sustainable management. In this regard:

  • The Cloud / Data Center (DC) provider is responsible for optimizing the sustainability of the cloud – delivering efficient, shared infrastructure, water stewardship, and sourcing renewable power. Consider sustainability when choosing your provider! Examine and compare their sustainability-related disclosures and commitments, as well as services: at the very least, a provider should have readily available CO2 measurements for the resources you use.
  • The Cloud / DC customers and users are responsible for sustainability in the cloud – optimizing workloads and resource utilization, and minimizing the total resources required.

Figure 1: AWS mapping of shared responsibility

Best Practices

We follow with a collection of best practices for sustainable engineering and software development. We compiled this list after examining recommendations defined by Amazon, Google, Microsoft, and others, as well as adding items from our own experience. Consider these when planning, building, and optimizing your services.

  • Understand your impact: Invest in sustainability metrics (see this open-source example). All major Cloud Providers and DCs should be able to provide these metrics; if they can’t, consider using a different provider. Knowing these metrics will allow you to identify the main contributors to greenhouse gas emissions and prioritize work to address the “lowest hanging fruits” first.
  • Maximize utilization: Right-size workloads and implement efficient design to ensure high utilization and maximize the energy efficiency. Eliminate or minimize idle resources, processing, and storage to reduce the total energy required to power your workload.
  • Schedule compute and VMs to run only when needed: Use auto-scaling, and asynchronous (“on-demand”) invocation. Optimize CI/CD builds to reduce the number of runs.
  • Use managed services: These can minimize unused resources by automatically scaling workloads (e.g., SaaS or on-demand serverless function vs. always-on VM). Choose the best service for Google cloud here.
  • Set upper scaling limits and alerts: Beware of needless scaling-up of autoscaling groups due to configuration errors or DDoS/EDoS attacks.
  • Set cost or budget alerts (forecasted if possible): these may indicate over-provisioning due to misconfiguration or EDoS attacks.
  • “Clean up after you’re done”: Monitor for and terminate no-longer needed hardware, delete unused storage, and deprecate obsolete services.
  • Deploy in sustainable geographies: Some geographies are significantly more sustainable than others. For example, when deploying GCP resources in US-West, the Las Vegas region is >6 times more polluting than Oregon.
  • Refactor monolithic applications into microservices to gain efficiency benefits.
  • Improve code efficiency: Time and space complexity matter for sustainability! For more information, see the Software Carbon Intensity Framework (a worked example follows this list).
  • De-duplicate code and storage and optimize invocation frequency: Configure storage policies to automatically delete data after it’s no longer needed.
  • Consider energy efficiency when selecting a programming language (reference).
  • Anticipate and adopt new, more efficient hardware and software offerings: Follow announcements by your providers, and design your systems for modularity / flexibility to allow this adoption.
  • Utilize carbon-free and carbon-neutral providers. If you’re choosing a cloud/DC provider, inquire about their sustainability certifications, commitments and efforts, and availability of related metrics and tools.
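As a concrete reference for the “Improve code efficiency” item above, the Software Carbon Intensity (SCI) specification defines a score of the form SCI = ((E × I) + M) per R, where E is the energy consumed by the software, I is the carbon intensity of the electricity used, M is the embodied emissions of the underlying hardware, and R is a functional unit such as a request. A minimal sketch of that calculation, with made-up numbers:

```python
# Minimal sketch of the Software Carbon Intensity (SCI) calculation.
# All numbers below are made up for illustration only.

def sci_per_request(energy_kwh, grid_intensity_gco2e_per_kwh,
                    embodied_gco2e, requests):
    """SCI = ((E * I) + M) / R, expressed in gCO2e per request."""
    operational = energy_kwh * grid_intensity_gco2e_per_kwh  # E * I
    return (operational + embodied_gco2e) / requests

# Example: a service that used 120 kWh over a day, on a grid with an
# intensity of 400 gCO2e/kWh, with 2,000 gCO2e of amortized embodied
# emissions, while serving 10 million requests.
score = sci_per_request(120, 400, 2_000, 10_000_000)
print(f"{score:.4f} gCO2e per request")  # ~0.005 gCO2e per request
```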

Uber Engineering Sustainability Services

We follow up with a collection of services we’ve built across our DCs and Clouds to improve sustainability, while also reducing costs and the attack surface.

GCP Project Lifecycle

A GCP project can be seen as a logical grouping of cloud resources. We learned that a significant percentage of GCP assets and projects are not being actively used, for a number of reasons, such as:

  • The project owner has moved teams or organizations
  • A service was deprecated, or never reached production state
  • The assets had been used as a proof of concept and were forgotten about


Architecture

We built our service around the Google Active Assist tool, which we used to identify inactivity. Listen to this Google Podcast to learn more about our usage of the tool.

Figure 2: GCP Project Lifecycle architecture

The Cloud Scheduler periodically initiates Worker Cloud Functions that enrich the project activity data exported by Active Assist with billing information (project cost). The results are then persisted into Firestore DB, and the Cloud Function creates ServiceNow and Jira tickets through Pub/Sub notifications. The GCP project owner is assigned the ServiceNow and Jira tickets and decides whether to delete the project. The ticket would appear like so (a sketch of the worker function follows the figure):

Figure 3: Example ServiceNow/Jira ticket for an unused GCP project
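For illustration, a minimal, hypothetical sketch of such a worker Cloud Function is shown below. It assumes the Unattended Project Recommender from Active Assist, a Firestore collection named inactive_projects, and a Pub/Sub topic named project-cleanup-tickets; the billing-enrichment step is stubbed out, and none of the names reflect our actual implementation.

```python
# Hypothetical sketch of a worker Cloud Function in the GCP Project Lifecycle
# flow: read Active Assist recommendations, enrich with cost, persist to
# Firestore, and notify the ticket-creation pipeline via Pub/Sub.
# Resource names and the billing lookup are illustrative assumptions.
import json

from google.cloud import firestore, pubsub_v1, recommender_v1

HOST_PROJECT = "sustainability-tooling"   # hypothetical project
TOPIC = "project-cleanup-tickets"         # hypothetical Pub/Sub topic
RECOMMENDER_ID = "google.resourcemanager.projectUtilization.Recommender"


def lookup_monthly_cost(recommendation_name):
    """Placeholder for billing enrichment (e.g., a billing-export query)."""
    return 0.0


def process_unattended_projects(event, context):
    """Entry point, triggered periodically by Cloud Scheduler via Pub/Sub."""
    recommender = recommender_v1.RecommenderClient()
    db = firestore.Client()
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(HOST_PROJECT, TOPIC)

    # Shown for a single project here; in practice this is iterated
    # across the organization's projects.
    parent = recommender.recommender_path(HOST_PROJECT, "global", RECOMMENDER_ID)
    for rec in recommender.list_recommendations(parent=parent):
        record = {
            "recommendation": rec.name,
            "description": rec.description,
            "monthly_cost_usd": lookup_monthly_cost(rec.name),
        }
        # Persist the enriched activity data, keyed by recommendation ID.
        db.collection("inactive_projects").document(rec.name.split("/")[-1]).set(record)
        # Notify the downstream function that opens ServiceNow/Jira tickets.
        publisher.publish(topic_path, data=json.dumps(record).encode("utf-8"))
```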

Geography Recommender

Some CSPs provide geographic/regional sustainability data, which we want taken into account, alongside other considerations, when deploying resources.

We can assist in region selection before resources are deployed, as well as recommend migrations for existing resources:

  • In line with our “Sustainable by Default” approach, before resources are deployed we allow only the sustainable regions as their geographic locations, and require reasoning for selecting other geographies. We found that in many cases engineers don’t have a strong preference for resource location (especially when looking at geographically close regions, such as the Las Vegas and Oregon data centers), so they don’t mind deploying in the regions we make available by default.
  • Following our “Assisted Development” approach, we also provide recommendations for migrating existing resources to a more sustainable nearby region. For example, consider infrequently accessed storage buckets located in the Las Vegas GCP region, which has a low Carbon-Free Energy (CFE) score of 19%, versus nearby alternatives (Iowa CFE: 93%, Oregon CFE: 90%, Sao Paulo CFE: 88%). Moving buckets to a different location requires minimal effort; therefore, we may issue a recommendation through a ServiceNow/Jira ticket that appears as follows (a sketch of the underlying comparison follows the figure):

Figure 4: Example region-migration recommendation ticket
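As a rough illustration of the comparison behind such a recommendation (the CFE scores are the ones quoted above; the candidate set, function, and threshold are hypothetical):

```python
# Hypothetical sketch of the region comparison behind a migration
# recommendation. CFE scores are the percentages quoted above; the
# candidate regions and improvement threshold are illustrative.
CFE_SCORE = {
    "us-west4 (Las Vegas)": 19,
    "us-central1 (Iowa)": 93,
    "us-west1 (Oregon)": 90,
    "southamerica-east1 (Sao Paulo)": 88,
}

def recommend_region(current_region, min_improvement=30):
    """Suggest a higher-CFE region if it beats the current one by a clear margin."""
    current_score = CFE_SCORE[current_region]
    best_region, best_score = max(CFE_SCORE.items(), key=lambda kv: kv[1])
    if best_score - current_score >= min_improvement:
        return best_region
    return None

print(recommend_region("us-west4 (Las Vegas)"))  # -> us-central1 (Iowa)
```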

Optimal Utilization Recommender

Another tool in the Assisted Development category is the Optimal Utilization Recommender. Launched for AWS and GCP, it utilizes the Trusted Advisor and Active Assist cloud-managed services to locate resources (currently, Compute VM instances) that have had low utilization over a recent time period. We generally rely on the CSP’s definition of low utilization (mainly, low CPU and memory usage as a percentage of overall allocation). When such resources are found, we create ServiceNow/Jira tickets describing the recommended actions (scale down or terminate) to the resource owners, and motivate the action by specifying the cost and sustainability impact of the current vs. recommended configuration.
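On the AWS side, a minimal sketch of how the low-utilization candidates could be pulled from Trusted Advisor is shown below. It uses the AWS Support API (available only with a Business or Enterprise support plan); the ticket-creation step is simplified to a print statement and does not reflect our actual pipeline.

```python
# Hypothetical sketch: list EC2 instances flagged as low-utilization by
# AWS Trusted Advisor, so tickets can be opened for their owners.
import boto3

# The AWS Support API is only available in us-east-1.
support = boto3.client("support", region_name="us-east-1")

def low_utilization_instances():
    checks = support.describe_trusted_advisor_checks(language="en")["checks"]
    check_id = next(c["id"] for c in checks
                    if c["name"] == "Low Utilization Amazon EC2 Instances")
    result = support.describe_trusted_advisor_check_result(
        checkId=check_id, language="en")
    # Each flagged resource's metadata follows the check's declared columns
    # (region, instance ID, name, estimated monthly savings, ...).
    return result["result"]["flaggedResources"]

for resource in low_utilization_instances():
    # In the real service this is enriched with sustainability impact and
    # turned into a ServiceNow/Jira ticket for the resource owner.
    print(resource["metadata"])
```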


The ticket may look as follows:

Figure 5: Example Optimal Utilization Recommender ticket

Conclusion

As we increase awareness through training and automated recommendations, and make it easy to make the right call for a sustainable setup, we are beginning to see a significant improvement in the carbon footprint of Uber engineering resources. We continue to collect metrics and identify impactful actions and projects, and to work with our industry partners to co-develop engineering practices and tools that will benefit the world’s transition to clean energy.

Michael Sudakovitch

Michael is a Staff Security Engineer and a Tech Lead at Uber. In his day-to-day job, he develops security solutions for Uber’s AWS, GCP, Azure, and OCI clouds. In late 2021, after identifying new opportunities for improving the company’s Carbon Footprint posture, Michael formed the cross-organizational Sustainability Engineering team where engineers volunteer their time to build impactful, sustainability-promoting services.

Posted by Michael Sudakovitch