On Demand Video

AI/ML Infra Meetup | Maximizing GPU Efficiency: Optimizing Model Training with GPUs Anywhere

In the rapidly evolving landscape of AI and machine learning, infrastructure teams face critical challenges in managing large-scale data for AI. Performance bottlenecks, cost inefficiencies, and management complexity weigh on AI platform teams supporting large-scale model training and serving.

In this talk, Bin Fan will discuss the challenges of I/O stalls that lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments, demonstrating how this approach can significantly enhance GPU efficiency.

What you will learn:

  • How to identify GPU utilization and I/O-related performance bottlenecks in model training
  • How to leverage GPUs anywhere to maximize resource utilization
  • Best practices for monitoring and optimizing GPU usage across training and serving pipelines
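The first point, spotting I/O-related stalls, can be illustrated with a minimal timing harness: time each training step's data fetch separately from its compute, and a high data-wait fraction means the GPU is idling on I/O. The `measure_io_stall` helper below is a hypothetical sketch for illustration, not code from the talk, and it uses simulated delays in place of a real data loader and GPU step.

```python
import time

def measure_io_stall(batches, train_step):
    """Return the fraction of wall time spent fetching data (I/O)
    rather than computing. A high value indicates an I/O stall."""
    fetch_time = 0.0
    compute_time = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)   # data loading (I/O-bound)
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)      # forward/backward pass (compute-bound)
        t2 = time.perf_counter()
        fetch_time += t1 - t0
        compute_time += t2 - t1
    total = fetch_time + compute_time
    return fetch_time / total if total else 0.0

# Simulated loaders: a slow one (I/O-bound) and a fast one (cached).
def slow_loader(n=5):
    for _ in range(n):
        time.sleep(0.005)      # pretend remote storage latency
        yield 0

def fast_loader(n=5):
    for _ in range(n):
        yield 0                # pretend data is already cached locally

step = lambda batch: time.sleep(0.001)  # pretend GPU compute

print(measure_io_stall(slow_loader(), step))  # high: GPU starved by I/O
print(measure_io_stall(fast_loader(), step))  # low: compute dominates
```

In a real PyTorch job the same idea applies: if per-step data-loading time rivals compute time, caching data closer to the GPUs (the role Alluxio plays in the talk's reference architecture) is one way to raise utilization.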

Speaker:

Bin Fan is VP of Technology and Founding Engineer at Alluxio. Before joining Alluxio, he worked at Google building next-generation storage infrastructure. Bin received his PhD in computer science from Carnegie Mellon University, where he worked on the design and implementation of distributed systems.