On-Demand Videos
TorchTitan is a proof of concept for large-scale LLM training using native PyTorch. The repo showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.
In this talk, Sandeep Manchem discussed big data and AI, covering typical platform architecture and data challenges. We had engaging discussions about ensuring data safety and compliance in Big Data and AI applications.
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware such as NVMe storage and RDMA networks (InfiniBand or specialized NICs) is becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore strategies for addressing this challenge by using these hardware components effectively. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer that uses NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
Driven by strong interest from our open-source community, the core team of Alluxio has started to redesign how users leverage data orchestration through the POSIX interface, aiming for an efficient and transparent experience. We have introduced a new JNI-based FUSE implementation to support POSIX data access, along with many improvements to related data operations, such as a more efficient distributedLoad and optimizations for listing or calculating directories that contain a massive number of files, which are common in model training.
RAPIDS is a set of open source libraries enabling GPU-aware scheduling and memory representation for analytics and AI. Spark 3.0 uses RAPIDS for GPU computing to accelerate various jobs, including SQL and DataFrame workloads. With compute accelerated by massive parallelism on GPUs, data access needs to be accelerated as well, and this is what Alluxio enables for compute in any cloud. In this talk, you will learn how to use Alluxio and Spark with the RAPIDS Accelerator on NVIDIA GPUs without any application changes.
Alluxio’s capabilities as a data orchestration framework have encouraged users to onboard more of their data-driven applications to an Alluxio-powered data access layer. Driven by strong interest from our open-source community, the core team of Alluxio has started to redesign how users leverage data orchestration through the POSIX interface, aiming for an efficient and transparent experience. This effort has made significant progress in collaboration with engineers from Microsoft, Alibaba, and Tencent. In particular, we have introduced a new JNI-based FUSE implementation to support POSIX data access and created a more efficient way to integrate Alluxio with the FUSE service. We have also made many improvements to related data operations, such as a more efficient distributedLoad and optimizations for listing or calculating directories that contain a massive number of files, which are common in model training. We will also share our engineering lessons and the roadmap for supporting machine learning applications in future releases.
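The practical appeal of POSIX access is that training code can read data with ordinary file APIs instead of a storage-specific client. The following is a minimal Python sketch of that idea, not Alluxio code: a temporary directory stands in for the FUSE mount point (an assumption; a real deployment would use its own mount path), and `list_samples` and `read_sample` are hypothetical helper names introduced only for illustration.

```python
import os
import tempfile


def list_samples(mount_point, suffix=".bin"):
    """Return sorted paths of files under mount_point that end in suffix."""
    paths = []
    for root, _dirs, files in os.walk(mount_point):
        for name in files:
            if name.endswith(suffix):
                paths.append(os.path.join(root, name))
    return sorted(paths)


def read_sample(path):
    """Read one sample with plain file I/O; no storage-specific client is used."""
    with open(path, "rb") as f:
        return f.read()


# Assumption: a temporary directory stands in for the FUSE mount point.
with tempfile.TemporaryDirectory() as mount_point:
    # Create a few small files to play the role of training samples.
    for i in range(3):
        with open(os.path.join(mount_point, f"sample_{i}.bin"), "wb") as f:
            f.write(bytes([i]) * 4)

    sample_paths = list_samples(mount_point)
    samples = [read_sample(p) for p in sample_paths]
    print(len(samples), samples[0])
```

Because the code only uses `os.walk` and `open`, pointing it at a different directory is the only change needed to read from another POSIX-visible location.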
At Aspect Analytics we intend to use Dask, a distributed computation library for Python, to deal with MSI data stored as large tensors. In this talk we explore using Alluxio and Alluxio FUSE as a data consolidation and caching layer for some of our bioinformatics workflows.
Increasingly powerful compute accelerators and large training datasets have made the storage layer a potential bottleneck in deep learning training and inference.
An offline inference job usually consumes and produces tens of terabytes of data while running for more than 10 hours.
For a large-scale job, this usually causes high IO pressure, increases the job failure rate, and brings many challenges for system stability.
We adopted Alluxio, which acts as an intermediate storage tier between the compute tier and cloud storage, to optimize the IO throughput of deep learning inference jobs.
For the production workload, performance improved by 18%, and we seldom see job failures caused by storage issues.
Data Lake Analytics (DLA) is a large-scale serverless data federation service on Alibaba Cloud. One of its serverless analytics engines is based on Presto. The DLA Presto engine supports a variety of data sources and is widely used in different application scenarios in the cloud. In this session, we will talk about the system architecture of the DLA Presto engine, as well as the challenges and solutions. In particular, we will introduce the use of the Alluxio local cache to solve performance issues on OSS data sources caused by access latency and OSS bandwidth limits. We will discuss the principle of the Alluxio local cache and some improvements we have made.
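The general idea behind a local cache in front of remote object storage can be sketched as a read-through cache: a read is served locally when possible and falls back to the remote store on a miss. The Python sketch below is a generic illustration with LRU eviction, not the Alluxio local cache implementation; `ReadThroughCache` and `fetch_from_remote` are hypothetical names, and the remote store is simulated by a plain function.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Tiny LRU read-through cache keyed by object path (illustrative only)."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch        # function that reads from the remote store
        self._capacity = capacity  # maximum number of cached objects
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = self._fetch(key)         # slow path: go to remote storage
        self._data[key] = value
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value


remote_reads = []  # records every simulated trip to remote storage


def fetch_from_remote(key):
    """Stand-in for a remote object read (assumption, no network involved)."""
    remote_reads.append(key)
    return f"data:{key}".encode()


cache = ReadThroughCache(fetch_from_remote, capacity=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.read(key)
print(cache.hits, cache.misses, remote_reads)
```

Each cache hit avoids one round trip to the remote store, which is where the reduction in access latency and bandwidth use comes from in this simplified model.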
We are thrilled to announce the release of Alluxio 2.5!
Alluxio 2.5 focuses on improving interface support to broaden the set of data-driven applications that can benefit from data orchestration. The POSIX and S3 client interfaces have greatly improved in performance and functionality as a result of widespread usage and demand from AI/ML workloads and system administration needs. Alluxio is rapidly evolving to meet the needs of enterprises that are deploying it as a key component of their AI/ML stacks.
At the same time, Alluxio continues to integrate with the latest cloud and cluster orchestration technologies. In 2.5, Alluxio has new connectors for Google Cloud Storage and Azure Data Lake Storage Gen 2, as well as improved operability in Kubernetes environments.
In this Office Hour, we will go over:
- JNI Based POSIX API
- S3 Northbound API
- ADLS Gen 2 Connector
- GCSv2 Connector
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics engines such as Hive, Spark, and Presto, as well as machine learning workloads, experience sluggish response times because data and compute sit in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we’ll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Alluxio is an open source data orchestration platform that can be deployed on multiple platforms. However, integrating Alluxio into an existing data architecture while adhering to the minimum required DevOps principles and meeting organizational standards can take considerable thought and experience.
The presentation covers best practices and techniques for setting up an open source Alluxio cluster on AWS EKS for one of our clients, made scalable, reliable, and secure by adapting it to Kubernetes RBAC.
Our speaker Vasista Polali will show you how to:
- Bootstrap EKS cluster in AWS with Terraform.
- Deploy open source Alluxio in a Namespace with persistence in AWS EFS.
- Scale the Alluxio worker nodes, deployed as DaemonSets, up and down by scaling the EKS nodes with Terraform.
- Access data with an S3 mount.
- Control access to Alluxio with Kubernetes port-forwarding, “setfacl” functionality, and Kubernetes service accounts.
- Reuse the data/metadata in the persistence layer on a new cluster.
ALLUXIO DAY 2021 March 11, 2021