Engineering, Uber AI

Open Source and In-House: How Uber Optimizes LLM Training

October 17 / Global
Figure 1: Resource scheduling for LLM workflows.
Figure 2: Uber LLM training software stack.
Figure 3: Uber LLM distributed training pipeline.
Figure 4: Training loss of Llama 2 models with and without (Q)LoRA.
Figure 5: Model FLOPs Utilization of training Llama 2 70B.
Figure 6: Model FLOPs Utilization of training Llama 2 7B.
Figure 7: Distributed LLM scorer with Ray and vLLM.
Figure 8: Throughput for Mixtral-8x7b on 2 x A100/H100.
Bo Ling

Bo Ling is a Staff Software Engineer on Uber's AI Platform team. He works on NLP, large language models, and recommendation systems, and is the lead engineer for embedding models and LLMs on the team.

Jiapei Huang

Jiapei Huang is a Software Engineer working on deep learning training infrastructure on Uber's Michelangelo team. He has end-to-end experience with AI infrastructure and has unblocked multiple business-critical scenarios, such as LLMs and time-series modeling.

Baojun Liu

Baojun Liu is a Software Engineer on Uber's Michelangelo team, working on online and offline serving and software-hardware co-development for AI infrastructure. Prior to that, he was a deep learning framework architect working on DL compiler intermediate representations and software stack development for heterogeneous architectures, enabling them for both serving and training.

Chongxiao Cao

Chongxiao Cao is a Senior Software Engineer at Uber, leading the development of the deep learning training infrastructure on Michelangelo, including scaling up data throughput, accelerating training speed, increasing model size, and optimizing resource utilization. He is also a leading contributor to the Horovod distributed deep learning framework and the Petastorm data loading library.

Anant Vyas

Anant Vyas is the tech lead of AI Infrastructure at Uber, where his focus is on maximizing the performance and reliability of Uber's extensive computing resources. Prior to this role, he worked on the Compute Platform team, specializing in the development of resource scheduling systems.

Peng Zhang

Peng Zhang is an Engineering Manager on the AI Platform team at Uber. He supports the teams that develop modeling and training frameworks, manage GPU-based clusters, and enhance ML infrastructure for training classical, deep learning, and generative AI models.

Posted by Bo Ling, Jiapei Huang, Baojun Liu, Chongxiao Cao, Anant Vyas, Peng Zhang