Unlike HDFS which provides one-copy update semantics or AWS S3 which provides eventual consistency, data consistency in Alluxio is a bit more complicated and depends on the configuration. In short, when clients are only reading and writing through Alluxio, the Alluxio file system provides strong consistency. However, when clients are writing data across both Alluxio and under storage, the consistency may depend on the write type and under storage type.
First of all, files in the Alluxio file system are write-once-read-many and identified by unique file IDs. So given this file ID all file readers access the file without worrying about data being stale.
As a result, Alluxio provides strong consistency for Alluxio applications (like MapReduce, Spark jobs) only accessing data through the Alluxio file system. Alluxio reader clients are always able to see the new files and their content written by Alluxio writer clients, as soon as the writes finish. This is because all the file system modification operations will go through Alluxio master service first and modify Alluxio file system state before returning to the client/applications successfully. Given this process, different Alluxio clients will always get the latest update as long as their corresponding write operations complete successfully.
Writing to Alluxio and Reading from Under Storage
When taking the state of under storage into account, it is possible that data written in Alluxio is inconsistent with under storage, depending on write types:
MUST_CACHE
MUST_CACHE
does not write to under storage, so Alluxio space is inconsistent with under storage.
CACHE_THROUGH
CACHE_THROUGH
writes data synchronously to Alluxio and under storage before returning success to applications:
- In the case with HDFS: writing to under storage is also strongly consistent, and Alluxio space will be always consistent with under storage when there are no other out-of-band updates in under storage;
- In the case with S3: writing to under storage is eventually consistent, it is possible that the file is written successfully to Alluxio but does not immediately show up in under storage. Therefore creates a short window of inconsistency before the data is finally propagated to the under storage. Even though in this case, Alluxio clients will still see a consistent file system as they will always consult Alluxio master that is strongly consistent.
ASYNC_THROUGH
ASYNC_THROUGH
writes data to Alluxio and returns to applications, leaving Alluxio to propagate the data to UFS asynchronously. From the user’s perspective, the file can be written successfully to Alluxio, but gets persisted to the UFS after a lag.
THROUGH
THROUGH
writes data to UFS directly without caching the data in Alluxio, however, Alluxio knows the files and its status. Thus the metadata is still consistent.
Writing to Under Storage and Reading from Alluxio
Meanwhile, if data in under storage is out of sync with Alluxio, e.g., due to out of band modification in under storage without going through Alluxio, the metadata sync feature can repair the inconsistency. Users can choose to automate the synchronization process periodically or by active sync. Please refer to documentation.
Conclusion
Due to its unique position in the ecosystem, Alluxio provides flexible data consistency models to balance performance, cost, and usability. Different metadata synchronization techniques are available to reconcile Alluxio and under storages.
Blog
We are thrilled to announce the general availability of Alluxio Enterprise for Data Analytics 3.2! With data volumes continuing to grow at exponential rates, data platform teams face challenges in maintaining query performance, managing infrastructure costs, and ensuring scalability. This latest version of Alluxio addresses these challenges head-on with groundbreaking improvements in scalability, performance, and cost-efficiency.
We’re excited to introduce Rapid Alluxio Deployer (RAD) on AWS, which allows you to experience the performance benefits of Alluxio in less than 30 minutes. RAD is designed with a split-plane architecture, which ensures that your data remains secure within your AWS environment, giving you peace of mind while leveraging Alluxio’s capabilities.
PyTorch is one of the most popular deep learning frameworks in production today. As models become increasingly complex and dataset sizes grow, optimizing model training performance becomes crucial to reduce training times and improve productivity.