In the rapidly evolving landscape of AI and machine learning, infrastructure teams face critical challenges in managing large-scale data. Performance bottlenecks, cost inefficiencies, and management complexity all weigh on AI platform teams supporting large-scale model training and serving.
In this talk, Bin Fan will discuss the challenges of I/O stalls that lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments, demonstrating how this approach can significantly enhance GPU efficiency.
What you will learn:
- How to identify GPU utilization and I/O-related performance bottlenecks in model training
- How to leverage GPUs anywhere to maximize resource utilization
- Best practices for monitoring and optimizing GPU usage across training and serving pipelines
In the rapidly evolving landscape of AI and machine learning, platform and data infrastructure teams face critical challenges in building and managing large-scale AI platforms. Performance bottlenecks, limited platform scalability, and GPU scarcity make it hard to support large-scale model training and serving.
In this talk, we show how Alluxio helps platform and data infrastructure teams deliver faster, more scalable platforms to the ML engineering teams developing and training AI models. Alluxio’s highly distributed cache accelerates AI workloads by eliminating data loading bottlenecks and maximizing GPU utilization. Customers report up to 4x faster training performance with high-speed access to petabytes of data spread across billions of files, regardless of persistent storage type or proximity to GPU clusters. Alluxio’s architecture lowers data infrastructure costs, increases GPU utilization, and enables workload portability for navigating GPU scarcity.
TorchTitan is a proof of concept for large-scale LLM training using native PyTorch. The repo showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.