Performance, cache operability, and cost efficiency are key considerations for AI platform teams supporting large-scale model training and distribution. In 2023, we launched Alluxio Enterprise AI to manage AI training and model distribution I/O across diverse environments, whether a single storage system serving diverse computing clusters or a more complex multi-cloud, multi-data center environment.
Today, we are excited to announce the release of Alluxio Enterprise AI 3.2! Alluxio 3.2 incorporates extensive user feedback from iterations with leading companies building their own AI platforms. Highlights include significant performance enhancements, new checkpoint write support, expanded cache management options, and support for the FSSpec interface for integration with the Python ecosystem. Together, these advancements enable organizations to adopt Alluxio on their existing data lake as an alternative to investing in HPC storage infrastructure.
New Features And Enhancements In Version 3.2
Leverage GPUs Anywhere with 97%+ GPU Utilization
In this version, we've significantly improved performance: a single Alluxio FUSE client can now read datasets and checkpoints at up to 10 GB/s throughput and 200K IOPS. This is more than sufficient for most AI model training scenarios, including advanced setups with 8 A100 GPUs on a single node, so I/O is no longer the bottleneck. We achieved 97%+ GPU utilization across 20 GPUs when running the MLPerf benchmark, including both the 3D-Unet and BERT benchmarks across various GPU (NVIDIA A100) configurations, showcasing scalability and efficiency in handling data-intensive workloads.
These performance advancements, combined with Alluxio’s unified namespace feature, which simplifies data access across various storage systems, allow organizations to scale their AI platforms without being constrained by data locality or GPU availability. With Alluxio, organizations can build and run AI training and serving workloads wherever GPUs are available, bridging the gap in adopting hybrid and multi-cloud infrastructures.
We’ve put together a simple tutorial to help you find out your GPU utilization rate in a few clicks: https://www.alluxio.io/gpu-test-tool/
New Checkpoint Write Support
We've implemented enhancements for handling large checkpoints. Users can now quickly write checkpoints to the local disk, and Alluxio subsequently uploads them to the cold persistent layer. Training no longer has to wait for checkpoints to be written back to a slow persistent layer, which prevents GPU idle time. This is particularly beneficial for large language models and recommendation systems.
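The write-locally-then-upload pattern described above can be illustrated with a minimal Python sketch. This is not Alluxio's implementation or API: the function name `save_checkpoint`, the directory arguments, and the use of a plain file copy as a stand-in for the upload to the persistent layer are all assumptions made for illustration.

```python
import shutil
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path

# A single background worker keeps uploads ordered and off the training thread.
_uploader = ThreadPoolExecutor(max_workers=1)


def save_checkpoint(data: bytes, name: str, local_dir: Path, persistent_dir: Path) -> Future:
    """Write a checkpoint to fast local disk, then copy it to slow storage in the background.

    Hypothetical helper: the copy below stands in for an upload to a cold persistent layer.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    local_path = local_dir / name
    local_path.write_bytes(data)  # fast, blocking write; training can resume once this returns

    def upload() -> Path:
        persistent_dir.mkdir(parents=True, exist_ok=True)
        target = persistent_dir / name
        shutil.copyfile(local_path, target)  # stand-in for the slow upload
        return target

    return _uploader.submit(upload)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    future = save_checkpoint(b"model-state", "step-100.ckpt", root / "local", root / "cold")
    # Training would continue here; we block only to confirm the background copy finished.
    print("persisted to", future.result())
```

The caller gets control back as soon as the local write completes, and the returned future can be checked later if the training loop needs to confirm that the checkpoint reached persistent storage.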
Effective Cache Administration
Cache management: Having exact control over cache utilization is critical to maximizing the efficiency of varying workloads. The following commands and configurations provide the flexibility needed to adapt to different scenarios:
- Cache preloading: Prepopulate the cache before starting a workload to avoid cold reads
- Cache eviction: Rely on passive eviction behaviors or trigger eviction manually to make space for new data to be cached
- Cache filtering: Set rules based on file paths to determine whether data should be permanently cached, never cached, or cached with an expiry time
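To make the idea of path-based cache filtering concrete, here is a small Python sketch of how such rules could be evaluated. It does not reflect Alluxio's actual configuration format: the `CacheRule` type, the mode names, the longest-prefix matching, and the cache-by-default fallback are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CacheRule:
    """Hypothetical filter rule; not Alluxio's real rule format."""
    path_prefix: str                   # rule applies to paths under this prefix (end it with "/")
    mode: str                          # "always", "never", or "ttl"
    ttl_seconds: Optional[int] = None  # only meaningful when mode == "ttl"


# Assumption for this sketch: paths that match no rule are cached.
DEFAULT_RULE = CacheRule(path_prefix="/", mode="always")


def resolve_rule(path: str, rules: Sequence[CacheRule]) -> CacheRule:
    """Return the most specific (longest-prefix) rule matching `path`."""
    matches = [rule for rule in rules if path.startswith(rule.path_prefix)]
    return max(matches, key=lambda rule: len(rule.path_prefix), default=DEFAULT_RULE)


if __name__ == "__main__":
    rules = [
        CacheRule("/datasets/", "always"),
        CacheRule("/datasets/scratch/", "never"),
        CacheRule("/checkpoints/", "ttl", ttl_seconds=3600),
    ]
    for path in ("/datasets/train/part-0", "/datasets/scratch/tmp.bin", "/checkpoints/step-100"):
        print(path, "->", resolve_rule(path, rules).mode)
```

Longest-prefix matching lets a narrow rule, such as never caching a scratch directory, override a broader rule on its parent.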
Manage cache via REST API: Our enriched management REST API now allows for easy lifecycle management of cache space. In this release, you can integrate Alluxio with your control plane via REST API to issue commands to preload data, free data from cache, or set eviction configurations.
Kubernetes management enhancement: The Alluxio cluster now supports rolling upgrades and autoscaling, minimizing downtime for workflows while the cluster is updated. In the rare event that a client cannot communicate with the cluster during the update process, it can fall back to the UFS (under file system) and retrieve data directly, preventing application failures due to I/O errors.
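The fallback behavior can be pictured with a short Python sketch. It shows only the general pattern, not Alluxio client code: `read_with_fallback` and the two reader callables are hypothetical names, and treating `ConnectionError` and `TimeoutError` as the signals that the cluster is unreachable is an assumption.

```python
from typing import Callable

Reader = Callable[[str], bytes]


def read_with_fallback(path: str, read_from_cache: Reader, read_from_ufs: Reader) -> bytes:
    """Try the cache cluster first; if it is unreachable, read from the under file system."""
    try:
        return read_from_cache(path)
    except (ConnectionError, TimeoutError):
        # The cluster may be mid-upgrade; serve the read from the source of truth instead.
        return read_from_ufs(path)


if __name__ == "__main__":
    def cache_down(path: str) -> bytes:
        raise ConnectionError("cache worker restarting")  # simulates a rolling upgrade

    def ufs_read(path: str) -> bytes:
        return b"data-from-ufs"

    print(read_with_fallback("/datasets/train/part-0", cache_down, ufs_read))
```

The application sees a slower read rather than an I/O error, which is the property the rolling-upgrade support relies on.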
Introducing Alluxio FSSpec Python Filesystem Interface
The Alluxio FSSpec Python API (alluxiofs) is an implementation of Filesystem Spec (FSSpec), which lets applications interact with various storage backends through a unified Python filesystem interface. With this new API, Python applications can adopt Alluxio Enterprise AI with minimal code changes, and popular Python-based compute frameworks such as Ray can integrate Alluxio to access both local and remote storage systems. This is particularly beneficial for data-intensive applications and AI training workloads where large datasets need quick and repeated access. The FSSpec interface extends Alluxio’s integration with the Python ecosystem.
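The value of a unified filesystem interface is that application code is written once against FSSpec and the backend is swapped underneath. The sketch below uses only the generic `fsspec` package with its built-in in-memory backend as a stand-in. It does not connect to Alluxio, and the helper `total_bytes` and the `/demo-dataset` paths are hypothetical. To target Alluxio, the alluxiofs implementation would need to be installed and registered as described in its own documentation.

```python
import fsspec
from fsspec.spec import AbstractFileSystem


def total_bytes(fs: AbstractFileSystem, directory: str) -> int:
    """Read every file directly under `directory` and return the total byte count.

    Works with any FSSpec-compliant filesystem object.
    """
    total = 0
    for path in sorted(fs.ls(directory, detail=False)):
        if fs.isfile(path):
            with fs.open(path, "rb") as f:
                total += len(f.read())
    return total


if __name__ == "__main__":
    # In-memory backend used purely as a stand-in for a real storage system.
    fs = fsspec.filesystem("memory")
    fs.pipe_file("/demo-dataset/shard-0.bin", b"abc")
    fs.pipe_file("/demo-dataset/shard-1.bin", b"defgh")
    print(total_bytes(fs, "/demo-dataset"))
```

Because `total_bytes` depends only on the FSSpec interface, the same function should work unchanged when handed a different FSSpec filesystem object.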
Leverage Existing Data Lake Over Investing In HPC Storage Infrastructure
In this release, we benchmarked Alluxio against HPC storage alternatives on the market using MLPerf, and the results show that Alluxio provides comparable end-to-end performance. With infrastructure costs in mind, platform teams can leverage Alluxio with existing data lake resources rather than investing in additional HPC storage infrastructure.
Watch this video to see what's new in Alluxio Enterprise AI with a live demo:
https://www.youtube.com/watch?v=m1pAGdZQr6E
Try Alluxio Enterprise AI Today
With the Alluxio Enterprise AI 3.2 release, we have significantly improved performance, cost-efficiency, ease of use, and cache management capabilities.
We invite you to learn more and try it today:
- Download free trial: https://www.alluxio.io/download/
- Watch the replay of the webinar: https://www.alluxio.io/resources/videos/whats-new-in-alluxio-enterprise-ai-3-2-leverage-gpu-anywhere-pythonic-filesystem-api-write-checkpointing-and-more/
- Follow this simple tutorial to find out your GPU utilization rate: https://www.alluxio.io/gpu-test-tool/
- Read the documentation: https://docs.alluxio.io/ee-ai/user/stable/en/Overview.html