With the new year come new features in Alluxio Enterprise AI! Just weeks into 2025, we are already bringing you exciting new features to better manage, scale, and secure your AI data with Alluxio.
From accelerated checkpoint file creation performance and advanced cache management to our Python SDK and S3 API enhancements, our latest release of Alluxio Enterprise AI delivers more power and performance to your AI workloads.
Without further ado, let’s dig into the details.
Performance Optimization
Improve Checkpoint File Creation Performance up to 3X with CACHE_ONLY Write Mode [ EXPERIMENTAL ]
Alluxio’s new CACHE_ONLY Write Mode improves the performance of write operations, such as creating checkpoint files during model training. When enabled, it accelerates model training workloads by reducing checkpoint creation time by up to 3X!
Because model training logic is paused during checkpoint file creation, and most training workloads write multiple checkpoint files per epoch, cutting write time by 3X can knock hours off total training time. For example, if it typically takes 1 hour to create a checkpoint file and your model training workload creates 5 checkpoint files per epoch, a 3X speedup with CACHE_ONLY Write Mode cuts checkpoint time from 5 hours to under 2 hours, saving over 3 hours per epoch!
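The arithmetic behind this example can be checked with a short sketch. The function name and parameters are illustrative only, and the 3X figure is the best case quoted above, so actual savings will depend on the workload.

```python
def checkpoint_hours_saved(hours_per_checkpoint: float,
                           checkpoints_per_epoch: int,
                           speedup: float) -> float:
    """Hours saved per epoch when every checkpoint write becomes `speedup` times faster."""
    baseline = hours_per_checkpoint * checkpoints_per_epoch  # time spent writing today
    accelerated = baseline / speedup                         # time spent after the speedup
    return baseline - accelerated


# The example from the text: 1 hour per checkpoint, 5 checkpoints per epoch, 3X faster.
saved = checkpoint_hours_saved(hours_per_checkpoint=1.0, checkpoints_per_epoch=5, speedup=3.0)
print(f"{saved:.2f} hours saved per epoch")  # prints "3.33 hours saved per epoch"
```

Because training is paused while checkpoints are written, these hours come straight off wall-clock training time for each epoch.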
When this mode is enabled, data is written only to the Alluxio cache and not to the underlying file system (UFS). Writing only to the Alluxio cache improves write performance by eliminating the bottlenecks associated with the UFS.
Note that because data is not written to the UFS, the durability of this data during a system outage is not guaranteed and therefore Alluxio CACHE_ONLY mode should not be used as persistent storage.
Caching Operations
Directory-Based Quota Management
Alluxio’s Directory-based Quota Management has been extended to allow administrators to set quotas on any directory, including subdirectories, to provide even more granular control of the Alluxio cache.
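To illustrate how quotas on nested directories interact, here is a toy model in Python. It is not Alluxio's implementation or API: the class, its method names, and the choice to simply refuse a write that would exceed a quota are all assumptions made for the sketch.

```python
from pathlib import PurePosixPath


class DirectoryQuotas:
    """Toy model: a quota on a directory limits the bytes cached anywhere in its subtree."""

    def __init__(self) -> None:
        self.quotas = {}  # directory -> quota in bytes
        self.usage = {}   # directory -> bytes currently cached under it

    def set_quota(self, directory: str, limit_bytes: int) -> None:
        self.quotas[directory] = limit_bytes

    def try_cache(self, path: str, size: int) -> bool:
        # Every directory above the file, e.g. /a/b/f -> ['/a/b', '/a', '/'].
        ancestors = [str(p) for p in PurePosixPath(path).parents]
        # Reject the file if any ancestor with a quota would go over its limit.
        for d in ancestors:
            limit = self.quotas.get(d)
            if limit is not None and self.usage.get(d, 0) + size > limit:
                return False
        # Otherwise charge the file to every ancestor directory.
        for d in ancestors:
            self.usage[d] = self.usage.get(d, 0) + size
        return True


q = DirectoryQuotas()
q.set_quota("/datasets", 100)         # quota on a parent directory
q.set_quota("/datasets/team-a", 40)   # tighter quota on a subdirectory

results = [
    q.try_cache("/datasets/team-a/x.bin", 30),  # fits both quotas
    q.try_cache("/datasets/team-a/y.bin", 20),  # would exceed team-a's 40 bytes
    q.try_cache("/datasets/team-b/z.bin", 60),  # team-b has no quota; parent has room
    q.try_cache("/datasets/team-b/w.bin", 20),  # would exceed the parent's 100 bytes
]
print(results)
```

The point of the sketch is that a subdirectory quota and its parent's quota are both enforced, which is what makes per-subdirectory limits useful for sharing one cache among teams or projects.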
TTL Cache Eviction Policies
TTL Cache Eviction Policies, introduced in Alluxio Enterprise AI 3.4, can be set by administrators to enforce time-to-live (TTL) policies on cached data. These policies, set at the directory level, optimize cache efficiency by ensuring that less frequently accessed data is automatically evicted based on the policy settings.
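A directory-level TTL policy can be pictured with the following minimal sketch. The data structures and function names are hypothetical, and two behaviors are assumed rather than taken from Alluxio: the TTL clock starts when a file enters the cache, and the most specific (longest) matching directory wins.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CachedFile:
    path: str
    cached_at: float  # seconds since some reference time when the file was cached


def ttl_for(path: str, policies: Dict[str, float]) -> Optional[float]:
    """Return the TTL of the longest directory that contains `path`, or None."""
    matches = [d for d in policies if path.startswith(d.rstrip("/") + "/")]
    if not matches:
        return None
    return policies[max(matches, key=len)]


def expired(files: List[CachedFile], policies: Dict[str, float], now: float) -> List[str]:
    """List the cached paths whose age exceeds the TTL of their directory policy."""
    out = []
    for f in files:
        ttl = ttl_for(f.path, policies)
        if ttl is not None and now - f.cached_at > ttl:
            out.append(f.path)
    return out


policies = {"/tmp": 3600.0, "/tmp/keep": 86400.0}  # directory -> TTL in seconds
files = [
    CachedFile("/tmp/a.bin", cached_at=0.0),
    CachedFile("/tmp/keep/b.bin", cached_at=0.0),
    CachedFile("/models/c.bin", cached_at=0.0),    # no policy applies
]
to_evict = expired(files, policies, now=7200.0)
print(to_evict)  # prints "['/tmp/a.bin']"
```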
Priority-Based Cache Eviction Policies
With Priority-based Cache Eviction Policies, administrators gain control over which data remains in the Alluxio cache. Policies override Alluxio’s Least Recently Used (LRU) cache eviction algorithm by defining the caching priority (High, Medium, Low) for data based on UFS path prefixes. Use Priority-based Cache Eviction Policies when you need to ensure specific data stays in cache even if the data would have otherwise been evicted based on the LRU algorithm.
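The difference from plain LRU can be seen in a small sketch. This is an illustration under stated assumptions, not Alluxio's eviction code: the class name, the `MEDIUM` default for unmatched paths, and the rule that ties within a priority level fall back to LRU order are all choices made for the example.

```python
from collections import OrderedDict

PRIORITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


class PriorityLruCache:
    """Toy cache that evicts the lowest-priority entry first, then least recently used."""

    def __init__(self, capacity, rules, default="MEDIUM"):
        self.capacity = capacity
        self.rules = rules            # UFS path prefix -> "HIGH" / "MEDIUM" / "LOW"
        self.default = default        # assumed priority for paths no rule matches
        self.entries = OrderedDict()  # ordered from least to most recently used

    def priority_of(self, path):
        matches = [p for p in self.rules if path.startswith(p)]
        return self.rules[max(matches, key=len)] if matches else self.default

    def access(self, path):
        if path in self.entries:
            self.entries.move_to_end(path)  # mark as most recently used
            return None
        evicted = None
        if len(self.entries) >= self.capacity:
            # min() keeps the first minimal element, and iteration is LRU-first,
            # so entries of equal priority are evicted in plain LRU order.
            evicted = min(self.entries, key=lambda p: PRIORITY_RANK[self.priority_of(p)])
            del self.entries[evicted]
        self.entries[path] = None
        return evicted


cache = PriorityLruCache(
    capacity=2,
    rules={"s3://bucket/models/": "HIGH", "s3://bucket/tmp/": "LOW"},
)
cache.access("s3://bucket/models/base.pt")
cache.access("s3://bucket/tmp/scratch.bin")
evicted = cache.access("s3://bucket/data/part-0")
print(evicted)  # prints "s3://bucket/tmp/scratch.bin"
```

Plain LRU would have evicted `models/base.pt` here because it was touched least recently; the priority rule keeps it in cache and evicts the low-priority scratch file instead.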
Alluxio Client & SDK
Python SDK and FSSpec [ EXPERIMENTAL ]
Alluxio’s Python SDK is now integrated with the most popular AI frameworks, including PyTorch, PyArrow, and Ray.
With Alluxio’s Python SDK, applications interact with various storage backends through a unified Python filesystem interface. This makes it easier for Python applications to adopt Alluxio Enterprise AI, simplifying integration and improving compatibility by giving them the same interface for both local and remote storage systems. This is particularly beneficial for data-intensive applications and AI training workloads where large datasets need quick and repeated access.
S3 API
This release includes several performance, scalability, and security enhancements to Alluxio’s S3 API:
- HTTP persistent connections, also called HTTP keep-alive, are now supported. By maintaining a single TCP connection that can be used for multiple requests, HTTP persistent connections reduce the overhead of opening a new connection for each request and can decrease latency by approximately 40% for 4KB S3 ReadObject requests.
- TLS encryption is now supported for communication between the Alluxio S3 API and the Alluxio worker.
- The Alluxio S3 API now supports multipart upload (MPU), which simplifies uploading large files and improves throughput by splitting a file into multiple parts and uploading each part separately.
- The S3 API now supports zero copy network transport for better performance and reduced CPU usage.
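The multipart upload pattern mentioned in the list above can be sketched as follows. The sketch uses an in-memory stand-in rather than a real endpoint: `FakeObjectStore` and its methods are hypothetical and are not Alluxio or S3 client calls. It only shows the shape of the technique: split the data, upload parts concurrently, then assemble them in part-number order.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data: bytes, part_size: int):
    """Return (part_number, chunk) pairs; part numbers start at 1, as in S3 multipart upload."""
    return [(i // part_size + 1, data[i:i + part_size])
            for i in range(0, len(data), part_size)]


class FakeObjectStore:
    """In-memory stand-in for an S3-compatible endpoint (hypothetical, for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.parts = {}    # key -> {part_number: chunk}
        self.objects = {}  # key -> assembled bytes

    def upload_part(self, key, part_number, chunk):
        with self._lock:   # parts may arrive from several threads at once
            self.parts.setdefault(key, {})[part_number] = chunk

    def complete(self, key):
        parts = self.parts.pop(key, {})
        # Parts can arrive in any order; assemble them by part number.
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))


def multipart_upload(store, key, data, part_size, workers=4):
    parts = split_into_parts(data, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(store.upload_part, key, n, chunk) for n, chunk in parts]
        for f in futures:
            f.result()     # surface any upload error before completing
    store.complete(key)
    return len(parts)


store = FakeObjectStore()
payload = bytes(range(256)) * 100  # 25,600 bytes of sample data
num_parts = multipart_upload(store, "checkpoints/step-100.bin", payload, part_size=8192)
print(num_parts, store.objects["checkpoints/step-100.bin"] == payload)  # prints "4 True"
```

Uploading parts in parallel is where the throughput gain comes from, and a failed part can be retried on its own instead of restarting the whole file.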
Data Access
3-5X Faster Directory Listing [ EXPERIMENTAL ]
The Alluxio Index Service is a new caching service that improves the performance of directory listing for directories storing hundreds of millions of files and subdirectories. The Index Service ensures scalability and delivers 3-5X faster results by serving directory listing details from cache compared to listing directories on the UFS.
UFS
UFS Rate Limiter
Administrators can now set a rate limit that caps the bandwidth an individual Alluxio worker can use when reading from the UFS. By setting the UFS Read Rate Limiter, administrators keep resource utilization under control while maintaining system stability. Alluxio supports rate limiting for various UFS types, including S3, HDFS, GCS, OSS, and COS.
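This post does not describe how the limiter works internally. One common way to cap read bandwidth is a token bucket, sketched below as an assumption rather than as Alluxio's actual mechanism; the class and parameter names are illustrative, and a fake clock is injected so the example is deterministic.

```python
import time


class TokenBucket:
    """Token bucket: tokens are bytes, refilled at `rate_bytes_per_sec` up to `capacity_bytes`."""

    def __init__(self, rate_bytes_per_sec, capacity_bytes, clock=time.monotonic):
        self.rate = rate_bytes_per_sec
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def wait_time(self, nbytes):
        """Reserve `nbytes` and return how many seconds the reader should wait first."""
        self._refill()
        self.tokens -= nbytes         # may go negative: the read is borrowed against refill
        return 0.0 if self.tokens >= 0 else -self.tokens / self.rate


# Fake clock so the example does not depend on real time.
t = [0.0]
bucket = TokenBucket(rate_bytes_per_sec=100, capacity_bytes=100, clock=lambda: t[0])

first = bucket.wait_time(100)   # bucket is full, so no wait
second = bucket.wait_time(50)   # bucket is empty, so wait for 50 bytes at 100 B/s
t[0] = 1.0                      # one second later the debt is repaid and 50 tokens remain
third = bucket.wait_time(25)
print(first, second, third)     # prints "0.0 0.5 0.0"
```

In a real worker the caller would sleep for the returned duration before issuing the UFS read, which keeps long-run throughput at or below the configured rate while still allowing short bursts.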
Cluster Administration
Heterogeneous Worker Resources Support
Alluxio now supports clusters with worker nodes that have heterogeneous resource configurations (CPU, memory, disk, and network). Supporting heterogeneous worker nodes gives administrators more flexibility when configuring their clusters and the opportunity to better optimize resource allocation.
Security
Common Vulnerabilities and Exposures (CVEs) Addressed in this Release
Alluxio has removed or upgraded packages corresponding to several critical CVEs, including:
- Log4j: Explicitly excluded any Log4j 1.x versions that were transitively picked up through various dependencies
- ZooKeeper: Removed and explicitly excluded from all dependencies, Hadoop-related ones in particular
- Jackson-databind: Upgraded to 2.24.1
Want to learn more about Alluxio Enterprise AI? Schedule a demo today!