What's New in Alluxio Enterprise 2.10: Radically Resource-efficient for Improved Speed at Lower Cost

June 20, 2023

Adit Madan

We are pleased to unveil the latest version of Alluxio. This release is a significant milestone in improving system reliability under resource constraints and stress scenarios, with a particular focus on getting the most out of limited hardware so that deployments can scale at manageable cost.

Enhanced Functionality:

Dramatic Improvements in High Availability (HA): Mission-critical applications at extreme scale will not be interrupted by Alluxio Master failovers. SLA improvements include a 20x improvement (95% reduction) in the maximum time it takes for Alluxio to start serving requests again after a restart, while the most common scenarios will see a 2x improvement.
Significant Reduction in Resource Usage: Improvements allow much cheaper hardware to be provisioned for Alluxio servers. Memory usage of the Alluxio Master process is reduced by as much as 10x in scenarios where Alluxio must rapidly keep up with out-of-band updates to underlying storage sources. Similarly, the resources required across the cluster for Alluxio's pre-loading capabilities have been reduced by 10x.

Improved High Availability at Scale

For deployments using the embedded journal, the 20x improvement (95% reduction) in worst-case Alluxio Master failover time can be attributed to optimizations in how snapshots are created and restored. Snapshotting bounds the total number of journal entries the system needs to replay during a failover event, and the same mechanism allows administrators to improve the common-case failover time by 2x.
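To illustrate why snapshotting bounds replay work, here is a minimal toy sketch in Python. It is not Alluxio's implementation: the `ToyJournal` class, the `snapshot_every` knob, and the `simulate` helper are all hypothetical names invented for this example. It only shows the general idea that a recovering node loads the latest snapshot and replays just the journal entries written after it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJournal:
    """Toy append-only journal with periodic snapshots (illustrative only)."""
    snapshot_every: int                      # hypothetical knob: entries between snapshots
    entries: list = field(default_factory=list)
    snapshot_state: dict = field(default_factory=dict)
    snapshot_index: int = 0                  # number of entries covered by the snapshot

    def append(self, path: str, size: int) -> None:
        self.entries.append((path, size))
        # Take a snapshot once enough entries have piled up since the last one.
        if len(self.entries) - self.snapshot_index >= self.snapshot_every:
            self.take_snapshot()

    def take_snapshot(self) -> None:
        # Fold every entry not yet covered into the snapshot state.
        for path, size in self.entries[self.snapshot_index:]:
            self.snapshot_state[path] = size
        self.snapshot_index = len(self.entries)

    def recover(self):
        """Rebuild state the way a new leader would: snapshot first, then the tail."""
        state = dict(self.snapshot_state)
        tail = self.entries[self.snapshot_index:]
        for path, size in tail:
            state[path] = size
        return state, len(tail)


def simulate(num_entries: int, snapshot_every: int):
    """Write num_entries updates, then return (recovered_state, entries_replayed)."""
    journal = ToyJournal(snapshot_every=snapshot_every)
    for i in range(num_entries):
        journal.append(f"/data/file-{i}", i)
    return journal.recover()


if __name__ == "__main__":
    _, replayed = simulate(1050, 100)
    print("entries replayed with snapshots:", replayed)
    _, replayed = simulate(1050, 10**9)
    print("entries replayed without snapshots:", replayed)
```

In this toy model the number of replayed entries never reaches `snapshot_every`, no matter how long the journal grows, which is the sense in which a snapshot bounds failover work. The cost of creating and restoring the snapshot itself is not modeled here.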

These improvements bring failover time down from minutes to tens of seconds when an Alluxio cluster actively manages over a hundred million files. They remove the need for planned downtime, and the impact has already been verified in production scenarios with a large number of small files.

Resource Efficiency

In the past, some users have observed surging memory consumption on the Alluxio Master caused by metadata synchronization, an internal mechanism that keeps files and directories in Alluxio’s namespace consistent with underlying data sources. Often, this process is triggered unintentionally while listing or pre-loading large directories.

This high resource consumption has led to overprovisioning. With the 2.10 release, the memory requirements for the Alluxio Master can be reduced by as much as 10x for rapid synchronization intervals, while end-to-end performance also improves by 2x.
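As a rough illustration of why short synchronization intervals are demanding, the toy model below counts how often a directory listing would trigger a metadata sync for a given interval. It is a sketch under stated assumptions, not Alluxio's code: the `ToyNamespace` class, its `list_status` method, and the `count_syncs` helper are hypothetical, and the model assumes a sync runs whenever the last sync of a path is at least one interval old.

```python
class ToyNamespace:
    """Toy model of interval-based metadata sync (illustrative only)."""

    def __init__(self, sync_interval_s: float):
        self.sync_interval_s = sync_interval_s   # hypothetical setting, in seconds
        self.last_sync = {}                      # path -> time of the last sync
        self.sync_count = 0

    def list_status(self, path: str, now: float) -> bool:
        """Simulate a listing at time `now`; return True if it triggered a sync."""
        last = self.last_sync.get(path)
        if last is None or now - last >= self.sync_interval_s:
            # The cached metadata is considered stale: re-check the under storage.
            self.sync_count += 1
            self.last_sync[path] = now
            return True
        return False


def count_syncs(sync_interval_s: float, listings: int, period_s: float = 1.0) -> int:
    """List the same directory `listings` times, once every `period_s` seconds."""
    ns = ToyNamespace(sync_interval_s)
    for i in range(listings):
        ns.list_status("/warehouse/events", now=i * period_s)
    return ns.sync_count


if __name__ == "__main__":
    for interval in (0, 10, 30):
        print(f"interval={interval}s ->", count_syncs(interval, listings=60), "syncs")
```

With one listing per second for a minute, an interval of 0 syncs on every listing, while a 30-second interval syncs only twice. Each sync of a large directory means master-side work, which is where the memory pressure described above comes from in this simplified picture.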

Pre-loading capabilities, which use the load operation, are often used both to improve SLAs when remote data is predictably accessed at a scheduled time of day for analytical workloads and to accelerate model training and deployment. Compared to 2.9, these capabilities now require 10x fewer resources across the cluster to achieve the same or higher throughput.

Upgrade Alluxio Today

This new release is a testament to our unwavering commitment to deliver an extremely stable product with low maintenance overhead at scale. Upgrade today and get more out of your Alluxio deployments with ease. To learn more about the features and benefits of Alluxio 2.10, view the release notes and get in touch with our dedicated support team to explore the upgrade options.
