As enterprises race to roll out artificial intelligence, they often overlook the infrastructure needed to support scalable ML model development and deployment. Efforts to access and use GPUs effectively often require extensive data engineering to manage data copies or specialized storage, driving cloud and infrastructure costs out of control.
To address these challenges, enterprises need a new data access layer that connects compute engines to data stores wherever they reside in distributed environments.
Join this webinar with Kevin Petrie, Eckerson Group VP of Research, and Sridhar Venkatesh, Alluxio SVP of Product, to explore tools, techniques, and best practices to remove data access bottlenecks and accelerate AI/ML model training. You will learn:
- Modern requirements for AI/ML model training and data engineering
- The challenges of GPU utilization in machine learning and the need for specialized hardware
- How a new data access layer connects compute to data stores across environments
- Best practices for optimizing ML training and guiding principles for success
Videos
TorchTitan is a proof of concept for large-scale LLM training using native PyTorch. The repo showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware such as NVMe storage and RDMA networks (InfiniBand or specialized NICs) is becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore strategies for addressing this challenge with these hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer that uses NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.