Data Consistency Model in Alluxio

October 30, 2020

Baolong Mao

Jasmine Wang

Bin Fan

Unlike HDFS which provides one-copy update semantics or AWS S3 which provides eventual consistency, data consistency in Alluxio is a bit more complicated and depends on the configuration. In short, when clients are only reading and writing through Alluxio, the Alluxio file system provides strong consistency. However, when clients are writing data across both Alluxio and under storage, the consistency may depend on the write type and under storage type.

First of all, files in the Alluxio file system are write-once-read-many and identified by unique file IDs. So given this file ID all file readers access the file without worrying about data being stale.

As a result, Alluxio provides strong consistency for Alluxio applications (like MapReduce, Spark jobs) only accessing data through the Alluxio file system. Alluxio reader clients are always able to see the new files and their content written by Alluxio writer clients, as soon as the writes finish. This is because all the file system modification operations will go through Alluxio master service first and modify Alluxio file system state before returning to the client/applications successfully. Given this process, different Alluxio clients will always get the latest update as long as their corresponding write operations complete successfully.

Writing to Alluxio and Reading from Under Storage

When taking the state of under storage into account, it is possible that data written in Alluxio is inconsistent with under storage, depending on write types:

MUST_CACHE

MUST_CACHE does not write to under storage, so Alluxio space is inconsistent with under storage.

CACHE_THROUGH

CACHE_THROUGH writes data synchronously to Alluxio and under storage before returning success to applications:

In the case with HDFS: writing to under storage is also strongly consistent, and Alluxio space will be always consistent with under storage when there are no other out-of-band updates in under storage;

In the case with S3: writing to under storage is eventually consistent, it is possible that the file is written successfully to Alluxio but does not immediately show up in under storage. Therefore creates a short window of inconsistency before the data is finally propagated to the under storage. Even though in this case, Alluxio clients will still see a consistent file system as they will always consult Alluxio master that is strongly consistent.

ASYNC_THROUGH

ASYNC_THROUGH writes data to Alluxio and returns to applications, leaving Alluxio to propagate the data to UFS asynchronously. From the user’s perspective, the file can be written successfully to Alluxio, but gets persisted to the UFS after a lag.

THROUGH

THROUGH writes data to UFS directly without caching the data in Alluxio, however, Alluxio knows the files and its status. Thus the metadata is still consistent.

Writing to Under Storage and Reading from Alluxio

Meanwhile, if data in under storage is out of sync with Alluxio, e.g., due to out of band modification in under storage without going through Alluxio, the metadata sync feature can repair the inconsistency. Users can choose to automate the synchronization process periodically or by active sync. Please refer to documentation.

Conclusion

Due to its unique position in the ecosystem, Alluxio provides flexible data consistency models to balance performance, cost, and usability. Different metadata synchronization techniques are available to reconcile Alluxio and under storages.

Share this post

Blog

Uptycs Chooses Alluxio to Power GenAI Natural Language Analytics at Terabyte Scale

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience of evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their post on Medium titled Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) scaling their analytics solution that collects and reports on terabytes of telemetry data per day as part of Uptycs Cloud-Native Application Protection Platform (CNAPP) solutions.

AI/ML Infra Meetup at Uber Seattle: Tackling Scalability Challenges of AI Platforms

Insights from from Uber, Snap, and Alluxio on LLM training, fine-tuning, deployment, designing scalable architectures, GPU optimization, and building recommendations systems.

New Features in Alluxio Enterprise AI 3.5

With the new year comes new features in Alluxio Enterprise AI! Just weeks into 2025 and we are already bringing you exciting new features to better manage, scale, and secure your AI data with Alluxio. From advanced cache management and improved write performance to our Python SDK and S3 API enhancements, our latest release of Alluxio Enterprise AI delivers more power and performance to your AI workloads. Without further ado, let’s dig into the details.

‍

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer

Request a demo