Uber AI

Under the Hood of Uber ATG’s Machine Learning Infrastructure and Versioning Control Platform for Self-Driving Vehicles

4 March 2020 / Global
Figure 1. Once we’ve selected and split the data we’ll use to train our model (Data Ingestion), we ensure the information is of appropriate quality (Data Validation), train our model with validated data (Model Training), and test our model to ensure it performs optimally (Model Evaluation). If it passes this assessment, we deploy it to our self-driving vehicles (Model Serving). If we encounter issues at any stage of this process, we can go back to the beginning.
Figure 2. Our self-driving vehicles generate various logs (for instance, camera and LiDAR information are depicted here, but radar information and ground truth labels also apply). We then extract this data from every log on the CPU cluster at once and save the extracted data to HDFS, making it easier for our pipeline to process.
Figure 3. We use extracted data (including the images pictured here, along with other sensor data) to run distributed training using Horovod on the GPU cluster and save the data to HDFS.
Figure 4. The traditional continuous delivery cycle differs from that used for ML in that, instead of just building code and testing it, ML developers must also construct data sets, train models, and compute model metrics.
Figure 5. The final result of ML workflows is just a tiny artifact compared to all of the supporting systems and code (such as configuration, data collection, feature extraction, data verification, machine resource management, analysis tools, process management tools, serving infrastructure, and monitoring). (Source: . Used with permission.)
Figure 6. The dependency graph for an object detection model, shown on the left, and two other ML models, shown on the right, depicts code and configurations that are handled by version control systems (in green) and items that are not handled by version control (in grey).
Figure 7. VerCD consists of a version and dependency metadata service, and an orchestrator service. We use stock frameworks such as Flask, SQLAlchemy, MySQL, and Jenkins but augment their functionality with ATG-specific integrations.
Figure 8. The Version and Dependency Metadata Service has individual endpoints for data set building, model training, and metrics computation. The REST API is a Flask and SQLAlchemy app, backed by MySQL to hold the dependency metadata. The yellow API handlers and data access layers were designed for ATG-specific use cases.
Figure 9. VerCD’s Orchestrator Service manages the workflow pipelines for building data sets, training models, and computing metrics. It consists of an off-the-shelf Jenkins distribution, augmented with our own ATG-specific integrations (yellow) that give the orchestrator the ability to interact with external ATG systems. ( logo used under license CC-BA 3.0.)
Figure 10. In VerCD workflows, the “Experimental” and “Validation” states are independent from one another, but both must be successful before a model can transition to the “Production” state.
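
The promotion rule described in Figure 10 can be illustrated with a minimal Python sketch. This is not VerCD's actual code or schema: the names `StageStatus`, `ModelVersion`, and `can_promote_to_production` are hypothetical and chosen only for illustration. The sketch assumes each model version tracks the outcome of its "Experimental" and "Validation" states separately and may move to "Production" only when both have succeeded.

```python
from dataclasses import dataclass
from enum import Enum


class StageStatus(Enum):
    """Outcome of one workflow state (hypothetical, for illustration only)."""
    NOT_RUN = "not_run"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class ModelVersion:
    """Hypothetical record for one model version; not VerCD's real schema."""
    name: str
    # The two states are tracked independently: neither depends on the other.
    experimental: StageStatus = StageStatus.NOT_RUN
    validation: StageStatus = StageStatus.NOT_RUN

    def can_promote_to_production(self) -> bool:
        # Both states must have succeeded before the model may transition.
        return (
            self.experimental is StageStatus.SUCCEEDED
            and self.validation is StageStatus.SUCCEEDED
        )
```

Under these assumptions, a version with only one successful state is still blocked, and the order in which the two states complete does not matter.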
Yu Guo

Yu Guo is a Senior Manager of Software Engineering at Uber ATG.

Khalid Ashmawy

Khalid Ashmawy is a Senior Tech Lead Manager at Uber ATG.

Eric Huang

Eric Huang is a Senior Software Engineer and VerCD tech lead at Uber ATG.

Wei Zeng

Wei Zeng is a Software Engineer for VerCD at Uber ATG.

Posted by Yu Guo, Khalid Ashmawy, Eric Huang, Wei Zeng
