On-Demand Videos
TorchTitan is a proof-of-concept for large-scale LLM training using native PyTorch. It is a repo that showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.
In this talk, Sandeep Manchem discusses big data and AI, covering typical platform architecture and data challenges, including how to ensure data safety and compliance in big data and AI applications.
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware such as NVMe storage and RDMA networks (InfiniBand or specialized NICs) is becoming more widespread. To fully leverage these resources, it is crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer that uses NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
Within Alluxio, the master processes keep track of global metadata for the file system. This includes file system metadata, block cache metadata, and worker metadata. When a client interacts with the file system, it must first query or update the metadata on the master processes. Given their central role in the system, master processes can be backed by a highly available, fault-tolerant replicated journal. This talk will introduce and compare the two available implementations of this journal in Alluxio: the first using Zookeeper, and the more recent version using Raft.
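The journal idea described above can be illustrated with a minimal write-ahead sketch: every metadata mutation is recorded as a journal entry before it is applied, so a standby master can rebuild identical state by replay. This is an illustrative toy, not Alluxio's actual implementation; all class and function names here are hypothetical.

```python
import json

class MetadataJournal:
    """Toy write-ahead journal: mutations are recorded as entries
    before being applied, so state can be rebuilt by replay. In a
    real deployment the entry log would be replicated (e.g. via Raft)."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(json.dumps(entry))

class MasterState:
    """In-memory file-system metadata, reconstructible from the journal."""
    def __init__(self):
        self.inodes = {}

    def apply(self, entry):
        if entry["op"] == "create":
            self.inodes[entry["path"]] = {"size": entry.get("size", 0)}
        elif entry["op"] == "delete":
            self.inodes.pop(entry["path"], None)

def create_file(journal, state, path, size):
    entry = {"op": "create", "path": path, "size": size}
    journal.append(entry)   # journal first (write-ahead)
    state.apply(entry)      # then mutate in-memory state

def recover(journal):
    # A standby master replays the journal to rebuild identical state.
    state = MasterState()
    for raw in journal.entries:
        state.apply(json.loads(raw))
    return state
```

Replaying the same entry sequence on any replica yields the same metadata, which is what makes a replicated journal (whether Zookeeper- or Raft-backed) a basis for master fault tolerance.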
Chen Liang from Uber and Beinan Wang from Alluxio will present the practical problems and interesting findings during the launch of Alluxio Local Cache. Their talk covers how Uber’s Presto team implements the cache invalidation and dashboard for Alluxio’s Local Cache. Chen Liang will also share his experience using a customized cache filter to resolve the performance degradation due to a large working set.
Data platform teams are increasingly challenged with accessing multiple data stores that are separated from compute engines, such as Spark, Presto, TensorFlow or PyTorch. Whether your data is distributed across multiple datacenters and/or clouds, a successful heterogeneous data platform requires efficient data access. Alluxio enables you to embrace the separation of storage from compute and use Alluxio data orchestration to simplify adoption of the data lake and data mesh paradigms for analytics and AI/ML workloads.
Join Alluxio’s Senior Product Manager, Adit Madan, to learn:
- Key challenges with architecting a successful heterogeneous data platform
- How data orchestration can overcome data access challenges in a distributed, heterogeneous environment
- How to identify ways to use Alluxio to meet the needs of your own data environment and workload requirements
ALLUXIO DAY IX 2022, January 21, 2022: The Evolution of an Open Data Platform with Alluxio (Alluxio, Inc.)
ALLUXIO DAY IX 2022, January 21, 2022: Vipshop Offline Data Cache Acceleration System – Alluxio Integration (Alluxio, Inc.)
ALLUXIO DAY IX 2022, January 21, 2022: Industrial Bank's Alluxio Deployment (Alluxio, Inc.)
This talk provides an overview of the read-after-write consistency mechanism in the Alluxio system. An Alluxio Core Maintainer and Presto Committer shares recent work on the Alluxio and Apache Iceberg integration, as well as recent work from the Presto community on the Iceberg connector.
This talk will introduce Apache Iceberg and its place in a modern and open data platform. It will cover the motivation for creating Iceberg at Netflix, as well as the data architecture that Iceberg makes possible.
Feifei Cai & Hao Zhu from WeRide provide an overview of an Alluxio + Spark use case, which has been deployed and is running in production to accelerate automated data tagging in autonomous driving development.
This talk describes the design of shadow cache, a lightweight component that tracks the working set size of the Alluxio cache. Shadow cache dynamically tracks the working set size over a sliding time window and is implemented as a series of bloom filters. We have deployed shadow cache in Facebook's Presto and leveraged the results to understand system bottlenecks and inform routing design decisions.
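The sliding-window bloom-filter scheme described above can be sketched as follows: one bloom filter per sub-window, with accesses recorded in the current filter, old filters aged out by rotation, and the working set size estimated from the union of all filters. This is a toy sketch under assumed parameters, not the actual shadow cache implementation; all names and sizes here are hypothetical.

```python
import hashlib
import math

class BloomFilter:
    """Simple bloom filter over a fixed-size bit array."""
    def __init__(self, m_bits=8192, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from salted hashes of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def estimate_cardinality(self):
        # Standard estimator: n ~ -(m/k) * ln(1 - X/m), X = set bits.
        x = sum(bin(b).count("1") for b in self.bits)
        if x >= self.m:
            return float("inf")
        return -(self.m / self.k) * math.log(1 - x / self.m)

class ShadowCache:
    """Approximates the working-set size over a sliding window using a
    ring of bloom filters, one per sub-window."""
    def __init__(self, num_buckets=4, m_bits=8192, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.filters = [BloomFilter(m_bits, k_hashes) for _ in range(num_buckets)]
        self.current = 0

    def record_access(self, key):
        self.filters[self.current].add(key)

    def rotate(self):
        # Advance the window: the oldest sub-window's filter is
        # cleared and reused, forgetting accesses older than the window.
        self.current = (self.current + 1) % len(self.filters)
        self.filters[self.current] = BloomFilter(self.m, self.k)

    def working_set_size(self):
        # The bitwise OR of all filters approximates the set of
        # distinct keys seen anywhere in the window.
        union = BloomFilter(self.m, self.k)
        for f in self.filters:
            for i in range(len(union.bits)):
                union.bits[i] |= f.bits[i]
        return union.estimate_cardinality()
```

Because each sub-window is an independent filter, expiring old accesses is just dropping one filter rather than deleting individual keys, which plain bloom filters cannot do.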
This talk discusses the opportunities and problems that arise when Uber meets Alluxio. Zhongting from Uber will provide an overview of Uber's traffic, cloud setup, distribution, invalidation, and consistent hashing. Beinan from Alluxio will provide a deep dive into metadata and monitoring metrics.
In this talk, we will provide a complete picture of the Hudi platform components, along with their unique design choices. We will then deep dive into two important areas of active development going forward: table metadata management and caching. Specifically, we will discuss gaps in the data lake ecosystem around these aspects and provide strawman design approaches for how Hudi aims to solve them going forward.