New Features in Alluxio Enterprise AI 3.5

February 4, 2025

Bill Hodak

With the new year comes new features in Alluxio Enterprise AI! Just weeks into 2025 and we are already bringing you exciting new features to better manage, scale, and secure your AI data with Alluxio.

From accelerated checkpoint file creation performance and advanced cache management to our Python SDK and S3 API enhancements, our latest release of Alluxio Enterprise AI delivers more power and performance to your AI workloads.

Without further ado, let’s dig into the details.

Performance Optimization

Improve Checkpoint File Creation Performance up to 3X with CACHE_ONLY Write Mode [EXPERIMENTAL]

Alluxio’s new CACHE_ONLY Write Mode improves the performance of write operations, such as creating checkpoint files during model training. When enabled, it accelerates model training workloads by cutting checkpoint creation time by up to 3X!

Because model training is paused while a checkpoint file is written, and most training workloads write multiple checkpoint files per epoch, a 3X reduction in write time can knock hours off total training time. For example, if it typically takes 1 hour to create a checkpoint file and your model training workload creates 5 checkpoint files per epoch, CACHE_ONLY Write Mode cuts checkpoint time from 5 hours to under 2 hours, saving over 3 hours per epoch!
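The arithmetic behind that example can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and parameters are illustrative, and it assumes every checkpoint write gets the full 3X speedup.

```python
def checkpoint_hours_saved(hours_per_checkpoint, checkpoints_per_epoch, speedup):
    """Hours saved per epoch when each checkpoint write is `speedup` times faster."""
    baseline = hours_per_checkpoint * checkpoints_per_epoch  # time spent today
    accelerated = baseline / speedup                         # time with faster writes
    return baseline - accelerated


# Figures from the example above: 1 hour per checkpoint, 5 per epoch, 3X faster.
saved = checkpoint_hours_saved(hours_per_checkpoint=1.0, checkpoints_per_epoch=5, speedup=3.0)
print(f"{saved:.2f} hours saved per epoch")  # prints "3.33 hours saved per epoch"
```

Real savings depend on how much of the checkpoint time is actually bound by UFS writes.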

When this mode is enabled, data is written only to the Alluxio cache and not to the underlying file system (UFS). Writing only to the cache improves write performance by eliminating the bottlenecks associated with the UFS.

Note that because data is not written to the UFS, the durability of this data during a system outage is not guaranteed and therefore Alluxio CACHE_ONLY mode should not be used as persistent storage.

Read the Documentation

Caching Operations

Directory-Based Quota Management

Alluxio’s Directory-based Quota Management has been extended to allow administrators to set quotas on any directory, including subdirectories, to provide even more granular control of the Alluxio cache.

TTL Cache Eviction Policies

TTL Cache Eviction Policies, introduced in Alluxio Enterprise AI 3.4, can be set by administrators to enforce time-to-live (TTL) policies on cached data. These policies, set at the directory level, optimize cache efficiency by ensuring that less frequently accessed data is automatically evicted based on the policy settings.
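To make the idea concrete, here is a toy sketch of directory-level TTL eviction. It is not Alluxio's implementation or configuration syntax: the class, its methods, and the choice to measure age from the time an entry was cached are all assumptions made for illustration.

```python
import time


class TtlDirectoryCache:
    """Toy cache whose entries expire based on the TTL of their enclosing directory."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable clock so the behaviour can be tested
        self._ttl_by_dir = {}    # directory prefix (with trailing "/") -> TTL in seconds
        self._entries = {}       # path -> (cached_at, data)

    def set_ttl(self, directory, ttl_seconds):
        self._ttl_by_dir[directory.rstrip("/") + "/"] = ttl_seconds

    def put(self, path, data):
        self._entries[path] = (self._clock(), data)

    def _ttl_for(self, path):
        # The most specific (longest) matching directory wins.
        matches = [d for d in self._ttl_by_dir if path.startswith(d)]
        if not matches:
            return None          # no policy: the entry never expires by TTL
        return self._ttl_by_dir[max(matches, key=len)]

    def evict_expired(self):
        """Remove every entry older than its directory's TTL; return the evicted paths."""
        now = self._clock()
        expired = []
        for path, (cached_at, _) in self._entries.items():
            ttl = self._ttl_for(path)
            if ttl is not None and now - cached_at >= ttl:
                expired.append(path)
        for path in expired:
            del self._entries[path]
        return sorted(expired)

    def __contains__(self, path):
        return path in self._entries
```

In this sketch, a file under a directory with a 60-second TTL is evicted by the first `evict_expired()` call made at least 60 seconds after it was cached, while files outside any TTL directory are left alone.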

Priority-Based Cache Eviction Policies

With Priority-based Cache Eviction Policies, administrators gain control over which data remains in the Alluxio cache. Policies override Alluxio’s Least Recently Used (LRU) cache eviction algorithm by defining the caching priority (High, Medium, Low) for data based on UFS path prefixes. Use Priority-based Cache Eviction Policies when you need to ensure specific data stays in cache even if the data would have otherwise been evicted based on the LRU algorithm.
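The interaction between priorities and LRU can be sketched as follows. This is a simplified model, not Alluxio's eviction code: the class name, the default of Medium for unmatched paths, and the tie-breaking rule (lowest priority first, then least recently used) are assumptions for illustration.

```python
from collections import OrderedDict

PRIORITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


class PriorityLruCache:
    """Toy cache that evicts the lowest-priority entry first, using LRU to break ties."""

    def __init__(self, capacity, prefix_priorities, default="MEDIUM"):
        self.capacity = capacity                    # maximum number of cached entries
        self.prefix_priorities = prefix_priorities  # UFS path prefix -> priority name
        self.default = default
        self._entries = OrderedDict()               # path -> data, least recently used first

    def priority_of(self, path):
        matches = [p for p in self.prefix_priorities if path.startswith(p)]
        if not matches:
            return self.default
        return self.prefix_priorities[max(matches, key=len)]  # longest prefix wins

    def get(self, path):
        data = self._entries[path]                  # raises KeyError on a cache miss
        self._entries.move_to_end(path)             # mark as most recently used
        return data

    def put(self, path, data):
        """Cache `data` and return the list of paths evicted to make room."""
        self._entries[path] = data
        self._entries.move_to_end(path)
        evicted = []
        while len(self._entries) > self.capacity:
            # min() returns the first minimal item, and iteration is oldest-first,
            # so among equal priorities the least recently used entry is chosen.
            victim = min(self._entries, key=lambda p: PRIORITY_RANK[self.priority_of(p)])
            del self._entries[victim]
            evicted.append(victim)
        return evicted

    def __contains__(self, path):
        return path in self._entries
```

With a capacity of two and `s3://bucket/models/` marked High, caching a model file, then a Low-priority temp file, then a third file evicts the temp file, whereas plain LRU would have evicted the model file because it is the oldest entry.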

Read the Documentation

Alluxio Client & SDK

Python SDK and FSSpec [EXPERIMENTAL]

Alluxio’s Python SDK is now integrated with the most popular AI frameworks, including PyTorch, PyArrow, and Ray.

With Alluxio’s Python SDK, applications interact with various storage backends through a unified Python filesystem interface. Python applications can easily adopt Alluxio Enterprise AI, and the same interface gives access to both local and remote storage systems. This is particularly beneficial for data-intensive applications and AI training workloads where large datasets need quick and repeated access.
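As a rough illustration of what a unified filesystem interface looks like, the sketch below uses the open-source `fsspec` package with its built-in in-memory backend as a stand-in. It does not use the Alluxio SDK itself: the protocol name and options for Alluxio are not covered in this post, so check the documentation for those. The paths and the helper function are made up for the example, and `fsspec` must be installed.

```python
import fsspec


def total_file_bytes(fs, directory):
    """Sum the sizes of the files directly under `directory` on any fsspec filesystem."""
    listing = fs.ls(directory, detail=True)
    return sum(info["size"] for info in listing if info["type"] == "file")


# "memory" is fsspec's in-memory backend, used here only as a placeholder.
# Code written against `fs` does not change when a different backend is chosen.
fs = fsspec.filesystem("memory")
fs.pipe("/demo/shard-0.bin", b"\x00" * 1024)   # write 1 KiB
fs.pipe("/demo/shard-1.bin", b"\x01" * 2048)   # write 2 KiB

with fs.open("/demo/shard-0.bin", "rb") as f:
    header = f.read(4)                          # file-like reads work as usual

print(total_file_bytes(fs, "/demo"))
```

Because `total_file_bytes` only relies on the generic `ls` call, the same helper works unchanged against any backend that implements the fsspec interface.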

Read the Documentation

S3 API

This release includes several performance, scalability, and security enhancements to Alluxio’s S3 API:

HTTP persistent connections, also called HTTP keep-alive, are now supported. By maintaining a single TCP connection that can be used for multiple requests, HTTP persistent connections reduce the overhead of opening a new connection for each request and can decrease latency by approximately 40% for 4KB S3 ReadObject requests.
TLS encryption is now supported for communication between the Alluxio S3 API and the Alluxio worker.
The Alluxio S3 API now supports multipart upload (MPU), which improves throughput by splitting a file into multiple parts and uploading each part separately.
The S3 API now supports zero copy network transport for better performance and reduced CPU usage.
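To see what HTTP persistent connections change, the sketch below counts how many TCP connections a local test server accepts when a client reuses one connection versus reconnecting for every request. It uses only Python's standard library and a throwaway local server that returns 4KB bodies. It does not talk to the Alluxio S3 API, and the handler and function names are invented for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # HTTP/1.1 enables keep-alive on the server side
    connections_opened = 0

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        CountingHandler.connections_opened += 1
        super().setup()

    def do_GET(self):
        body = b"x" * 4096          # 4KB payload
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass                        # keep the demo output quiet


def count_connections(num_requests, reuse):
    """Issue `num_requests` GETs and return how many TCP connections the server saw."""
    CountingHandler.connections_opened = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port)
        for _ in range(num_requests):
            conn.request("GET", "/object")
            conn.getresponse().read()
            if not reuse:
                conn.close()        # the next request() reconnects automatically
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections_opened


if __name__ == "__main__":
    print("with keep-alive:", count_connections(5, reuse=True))
    print("without keep-alive:", count_connections(5, reuse=False))
```

Five requests over a persistent connection should register as a single connection on the server, while the non-persistent run opens five. Each avoided connection saves a TCP handshake, which matters most for small objects.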

Read the Documentation

Data Access

3-5X Faster Directory Listing [EXPERIMENTAL]

The Alluxio Index Service is a new caching service that improves the performance of directory listing for directories storing hundreds of millions of files and subdirectories. By serving directory listing details from cache, the Index Service ensures scalability and delivers results 3-5X faster than listing directories on the UFS.

Read the Documentation

UFS

UFS Rate Limiter

Administrators can now set a rate limit to control the maximum bandwidth at which an individual Alluxio worker can read from the UFS. By setting the UFS Read Rate Limiter, administrators keep resource utilization in check while maintaining system stability. Alluxio supports rate limiting for various UFS types, including S3, HDFS, GCS, OSS, and COS.
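A common way to cap read bandwidth is a token bucket, sketched below. This is a generic illustration of rate limiting, not a description of how Alluxio implements the UFS Read Rate Limiter. The class, its parameters, and the burst allowance are assumptions made for the example.

```python
import time


class TokenBucket:
    """Toy bandwidth limiter: tokens are bytes, refilled at a fixed rate up to a burst cap."""

    def __init__(self, rate_bytes_per_sec, burst_bytes, clock=time.monotonic):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self._clock = clock             # injectable clock so the behaviour can be tested
        self._tokens = float(burst_bytes)
        self._last = clock()

    def _refill(self):
        now = self._clock()
        earned = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def try_acquire(self, nbytes):
        """Consume tokens and return True if `nbytes` may be read now, else return False."""
        self._refill()
        if nbytes <= self._tokens:
            self._tokens -= nbytes
            return True
        return False

    def seconds_until(self, nbytes):
        """How long a caller would have to wait before `nbytes` could be read."""
        self._refill()
        return max(0.0, (nbytes - self._tokens) / self.rate)
```

A reader would call `try_acquire(len(chunk))` before each UFS read and sleep for `seconds_until(...)` when it returns False, which holds sustained throughput at or below the configured rate while still allowing short bursts.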

Read the Documentation

Cluster Administration

Heterogeneous Worker Resources Support

Alluxio now supports clusters with worker nodes that have heterogeneous resource configurations (CPU, memory, disk, and network). Supporting heterogeneous worker nodes gives administrators more flexibility when configuring their clusters and the opportunity to better optimize resource allocation.

Read the Documentation

Security

Common Vulnerabilities and Exposures (CVEs) Addressed in this Release

Alluxio has removed or upgraded packages corresponding to several critical CVEs, including:

Log4j: Explicitly excluded all log4j 1.x versions that were transitively picked up through various dependencies
ZooKeeper: Removed and explicitly excluded from all dependencies, Hadoop-related ones in particular
Jackson-databind: Upgraded to 2.24.1

Read the Documentation

Want to learn more about Alluxio Enterprise AI? Schedule a demo today!
