On-Demand Videos
TorchTitan is a proof of concept for large-scale LLM training using native PyTorch. It is a repo that showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.
In this talk, Sandeep Manchem discusses big data and AI, covering typical platform architecture and data challenges, including how to ensure data safety and compliance in big data and AI applications.
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware like NVMe storage and RDMA networks (InfiniBand or specialized NICs) is becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer, utilizing NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
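To make the caching idea concrete, here is a minimal read-through sketch (illustrative only, not the system from the talk): remote objects are staged on local NVMe on first access, so repeated epochs hit local flash instead of the network. The mount path and the fetch callback are hypothetical.

```python
# Toy read-through cache: a miss pulls the object from remote storage
# (e.g. over RDMA or S3) onto local NVMe; a hit reads straight from NVMe.
import os

NVME_DIR = "/nvme/cache"  # hypothetical local NVMe mount

def cached_open(remote_path: str, fetch):
    """Return a local file handle, fetching from remote storage on a miss."""
    local = os.path.join(NVME_DIR, remote_path.lstrip("/"))
    if not os.path.exists(local):                   # cache miss
        os.makedirs(os.path.dirname(local), exist_ok=True)
        with open(local, "wb") as f:
            f.write(fetch(remote_path))             # remote read, once
    return open(local, "rb")                        # cache hit path
```

A production system layers eviction, consistency, and concurrent-access handling on top of this pattern; the Kubernetes-native layer in the talk addresses those concerns at cluster scale.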
Uber has built one of the largest data lakes in the industry, storing exabytes of data. In this talk, we will introduce the evolution of our data storage architecture and delve into multiple key initiatives from the past several years.
Specifically, we will introduce:
- Our on-prem HDFS cluster scalability challenges and how we solved them
- Our efficiency optimizations that significantly reduced the storage overhead and unit cost without compromising reliability and performance
- The challenges we are facing during the ongoing Cloud migration and our solutions
Many companies have working development architectures for their AI platforms but are concerned about efficiency at scale as data volumes increase. They use centralized cloud data lakes, like S3, to store training data for AI platforms. However, GPU shortages add more complications. Storage and compute can be separate, or even remote, making data loading slow and expensive:
- Optimizing a development setup often involves manual data copies, which are slow and error-prone
- Directly transferring data across regions or from cloud to on-premises can incur expensive egress fees
This webinar covers solutions to improve data loading for model training. You will learn:
- The data loading challenges with distributed infrastructure
- Typical solutions, including NFS/NAS on object storage, and why they are not the best options
- Common architectures that can improve data loading and cost efficiency
- Using Alluxio to accelerate model training and reduce costs
This hands-on session discusses best practices for using PyTorch and Alluxio during model training on AWS. Shawn and Lu provide a step-by-step demonstration of how to use Alluxio on EKS as a distributed cache to accelerate computer vision model training jobs that read datasets from S3. This architecture significantly improves GPU utilization from 30% to 90%+, achieves roughly 5x faster training, and lowers cloud storage costs.
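As a rough illustration of the setup Shawn and Lu demonstrate, the sketch below points a standard PyTorch vision pipeline at an Alluxio FUSE mount instead of reading S3 directly; the mount path is hypothetical and depends on how the cache is deployed on EKS.

```python
# Minimal sketch: the training job is unchanged except that the dataset
# root is a FUSE mount backed by S3, so hot data is served from the cache
# and S3 is only hit on cold reads. The path below is hypothetical.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

DATA_ROOT = "/mnt/alluxio/imagenet/train"  # hypothetical FUSE mount of s3://...

train_set = datasets.ImageFolder(
    DATA_ROOT,
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

# Parallel workers plus pinned memory keep the GPUs fed while the cache
# hides remote I/O latency.
loader = DataLoader(train_set, batch_size=256, num_workers=8,
                    pin_memory=True, shuffle=True)
```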
ChatGPT and other massive models represent an amazing step forward in AI, yet they do not solve real-world business problems on their own. In this session, Jordan Plawner, Global Director of AI Product Management and Strategy at Intel, surveys how the AI ecosystem has worked non-stop over the last year to take these all-purpose, multi-task models and optimize them so they can be used by organizations to address domain-specific problems. He explains these new AI-for-the-real-world techniques and methods, such as fine-tuning, and how they can be applied to deliver results that are highly performant with state-of-the-art accuracy while also being economical to build and deploy everywhere to enhance products and services.
In this talk, Wanchao Liang, Software Engineer on the PyTorch team at Meta, explores the technology advancements of PyTorch Distributed and dives into the details of how multi-dimensional parallelism makes it possible to train Large Language Models by composing different PyTorch-native distributed training APIs.
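As a hedged sketch of what composing native distributed APIs can look like (illustrative only, not the talk's code, and assuming PyTorch 2.2+ launched via torchrun), a 2-D device mesh splits GPUs into data-parallel and tensor-parallel groups:

```python
# Run with: torchrun --nproc_per_node=8 mesh_sketch.py
# Illustrative 2-D mesh: 2 data-parallel replicas x 4 tensor-parallel shards.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

dp_mesh = mesh["dp"]  # slice handed to FSDP for parameter sharding
tp_mesh = mesh["tp"]  # slice handed to tensor-parallel APIs (e.g. parallelize_module)
```

Each named sub-mesh is handed to a different parallelism API, which is what lets the dimensions compose rather than conflict.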
Machine learning models power Uber’s everyday business. However, developing and deploying a model is not a one-time event but a continuous process that requires careful planning, execution, and monitoring. In this session, Sally (Mihyong) Lee, Senior Staff Engineer & TLM @ Uber, highlights Uber’s practices across the machine learning lifecycle to ensure high model quality.
In this session, Adit Madan, Director of Product Management at Alluxio, presents an overview of using distributed caching to accelerate model training and serving. He explores the requirements of data access patterns in the ML pipeline and offers practical best practices for using distributed caching in the cloud. This session features insights from real-world examples, such as AliPay, Zhihu, and more.
As the AI landscape rapidly evolves, the advancements in generative AI technologies, such as ChatGPT, are driving a need for a robust AI infra stack. This opening keynote will explore the key trends of the AI infra stack in the generative AI era.
Model training requires extensive computational and GPU resources. When training models on AWS, loading data from S3 often becomes a major bottleneck, wasting valuable GPU cycles. Optimizing data loading can greatly reduce GPU idle time and increase GPU utilization.
In this webinar, Greg Palmer will discuss best practices for efficient data loading during model training on AWS. He will demonstrate how to use Alluxio on EKS as a distributed cache to accelerate PyTorch training jobs that read datasets from S3. This architecture significantly improves GPU utilization from 30% to 90%+, achieves roughly 5x faster training, and lowers cloud storage costs.
What you will learn:
- The challenges of feeding data-hungry GPUs in the cloud
- How to accelerate model training by optimizing data loading on AWS
- The reference architecture for running PyTorch jobs with Alluxio cache on EKS while reading data from S3, with benchmark results of training ResNet50 and BERT
- How to use TensorBoard to identify bottlenecks in GPU utilization
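For the TensorBoard point above, a minimal, self-contained profiling sketch (stand-in model and synthetic data, not the webinar's ResNet50/BERT benchmark) might look like this:

```python
# Minimal sketch: emit a torch.profiler trace that TensorBoard can render,
# to spot gaps where the GPU idles waiting on the DataLoader.
import torch
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 10).to(device)      # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities,
             schedule=schedule(wait=1, warmup=1, active=3),
             on_trace_ready=tensorboard_trace_handler("./tb_logs")) as prof:
    for _ in range(8):                            # stand-in training steps
        x = torch.randn(256, 1024, device=device)
        opt.zero_grad()
        model(x).sum().backward()
        opt.step()
        prof.step()                               # advance the profiler schedule
# Inspect with: tensorboard --logdir ./tb_logs
```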
As enterprises race to roll out artificial intelligence, the infrastructure needed to support scalable ML model development and deployment is often overlooked. Efforts to effectively access and utilize GPUs often require extensive data engineering to manage data copies or specialized storage, driving cloud and infrastructure costs out of control.
To address these challenges, enterprises need a new data access layer to connect compute engines to data stores wherever they reside in distributed environments.
Join this webinar with Kevin Petrie, Eckerson Group VP of Research, and Sridhar Venkatesh, Alluxio SVP of Product, to explore tools, techniques, and best practices to remove data access bottlenecks and accelerate AI/ML model training. You will learn:
- Modern requirements for AI/ML model training and data engineering
- The challenges of GPU utilization in machine learning and the need for specialized hardware
- How a new data access layer connects compute to data stores across environments
- Best practices for optimizing ML training and guiding principles for success
Organizations are retooling their enterprise data infrastructure in the race for AI/ML. However, growing datasets, extensive data engineering overhead, high GPU costs, and expensive specialized storage can make it difficult to get fast results from model development.
The data access layer is the key to accelerating your path to AI/ML. In this webinar, Roland Theron, Senior Solutions Engineer at Alluxio, discusses how the data access layer can help you:
- Build AI architecture on your existing data lake without the need for specialized hardware.
- Streamline the time-consuming process of managing data copies in data engineering.
- Speed up training workloads with high GPU utilization.
- Achieve optimal concurrency to deliver models to inference clusters for demanding applications.
Join us with David Loshin, President of Knowledge Integrity, and Sridhar Venkatesh, SVP of Product at Alluxio, to learn more about the infrastructure hurdles associated with AI/ML model training and deployment and how to overcome them. Topics include:
- The challenges of AI and model training
- GPU utilization in machine learning and the need for specialized hardware
- Managing data access and maintaining a source of truth in data lakes
- Best practices for optimizing ML training