Engineering, Data / ML, Uber AI

Uber’s Journey to Ray on Kubernetes: Ray Setup

3 April / Global
Figure 1: Federated Resource Management.
Figure 2: Flow of the job lifecycle.
Figure 3: Definition of the pod error field.
Figure 4: Job controller termination operations.
Figure 5: Discovery mechanism for clients.
Figure 6: Definition of the head node information field.
Figure 7: Head discovery by workers.
Bharat Joshi

Bharat Joshi is a Staff Engineer on the ML platform at Uber. He’s based out of Seattle, WA. His current interests are in building scalable ML platforms. He has prior experience in large-scale distributed storage systems and holds a patent in the area of data restoration.

Anant Vyas

Anant Vyas is a Senior Staff Engineer and the Tech Lead of AI Infrastructure at Uber. His focus is on maximizing the performance and reliability of Uber’s extensive computing resources for training and serving.

Ben Wang

Ben Wang is a Staff Technical Program Manager at Uber. He’s based out of Seattle, WA. He has prior experience in ML infra and is now working on Uber’s ML infrastructure.

Min Cai

Min Cai is a Distinguished Engineer at Uber working on the AI/ML platform (Michelangelo). He has also led many infrastructure projects, including cluster management (Mesos and Peloton), the microservice platform (uDeploy), and all-active datacenters. He received his Ph.D. in Computer Science from the University of Southern California. He has published over 20 journal and conference papers and holds 6 US patents.

Axansh Sheth

Axansh Sheth is an Engineering Manager at Uber, based in Bangalore, India. With prior experience as an IC in ML Infra, he manages the Batch Compute Platform team and is focused on modernizing the batch compute stack.

Abhinav Dixit

Abhinav Dixit is a Software Engineer II at Uber, based in Bangalore, India. As a key member of the Compute Batch team, he specializes in resource management and the deployment of batch jobs within the organization. With a strong background in Kubernetes and the Peloton stack, he is dedicated to optimizing performance and enhancing efficiency in Uber’s computational infrastructure.

Posted by Bharat Joshi, Anant Vyas, Ben Wang, Min Cai, Axansh Sheth, Abhinav Dixit