Alluxio provides a distributed data access layer that lets applications such as Spark or Presto access different underlying file systems (UFS) through a single API in a unified file system namespace. If users interact with files in the UFS only through Alluxio, Alluxio knows about every change clients make to the UFS and keeps the Alluxio namespace in sync with the UFS namespace (see the left figure below).
However, when a file in the UFS is changed without going through Alluxio, the UFS namespace and the Alluxio namespace can get out of sync. When this happens, a UFS metadata sync operation is required to synchronize the two namespaces (illustrated in the right figure).
Sync On-demand
Alluxio automatically caches metadata from the UFS so that subsequent metadata operations such as listStatus (or ls) do not need to access the UFS. This reduces the latency of these metadata operations. However, the metadata in the UFS can sometimes change without Alluxio being notified. When that happens, this cache needs to be invalidated.
Since version 1.7.0, Alluxio has provided the option alluxio.user.file.metadata.sync.interval, which allows users to control how often this metadata cache is refreshed. Whenever a client issues a metadata operation such as listStatus, it can set the interval to -1, 0, or a time value. When the interval is -1, Alluxio never fetches metadata from the UFS. When it is 0, Alluxio always fetches metadata from the UFS. When it is a time value, Alluxio fetches metadata from the UFS only if it has not done so within that period.
Here is an example.
$ alluxio fs ls -R -Dalluxio.user.file.metadata.sync.interval=0 /dir
This tells Alluxio to always fetch the metadata information from the UFS.
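The three interval settings described above can be summarized as a small decision function. The following is only an illustrative sketch, not Alluxio's actual implementation: the function name `needs_metadata_sync`, its parameters, the treatment of a path that has never been synced, and the exact boundary comparison are all assumptions made for this example.

```python
def needs_metadata_sync(interval_ms, last_sync_ms, now_ms):
    """Decide whether a metadata operation should refetch from the UFS.

    interval_ms:  -1 (never sync), 0 (always sync), or a positive period.
    last_sync_ms: time of the last UFS fetch for the path, or None if the
                  path has never been synced (hypothetical bookkeeping).
    now_ms:       current time in milliseconds.
    """
    if interval_ms == -1:
        # Never consult the UFS; rely entirely on cached metadata.
        return False
    if interval_ms == 0:
        # Always consult the UFS before answering.
        return True
    if last_sync_ms is None:
        # Assumption: a path with no recorded sync is treated as stale.
        return True
    # Sync only if the last fetch is at least one interval old.
    return now_ms - last_sync_ms >= interval_ms


if __name__ == "__main__":
    now = 100_000
    print(needs_metadata_sync(-1, 0, now))           # interval -1
    print(needs_metadata_sync(0, now, now))          # interval 0
    print(needs_metadata_sync(60_000, 50_000, now))  # synced 50 s ago
    print(needs_metadata_sync(60_000, 10_000, now))  # synced 90 s ago
```

With a 60-second interval, a path synced 50 seconds ago is served from the cache, while one synced 90 seconds ago triggers a UFS fetch.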
Note that with on-demand sync, Alluxio never synchronizes with the UFS unless a client request triggers it. This can cause problems: the first time a client accesses a path, the extra cost of accessing the UFS can slow down that request. This calls for a mechanism that synchronizes the Alluxio namespace and the UFS namespace in the background, namely Active UFS Sync.
Sync Proactively
The Alluxio 2.0 preview release supports a new feature, “Active UFS Sync”. It allows users to specify a directory to be synchronized between the Alluxio namespace and the UFS namespace at a regular interval, with a number of parameters to fine-tune the syncing behavior. Currently, Active UFS Sync is only supported between Alluxio and HDFS 2.7 or later. To use this feature, the user running Alluxio must be an HDFS admin user in order to listen to the event stream that HDFS provides.
To enable active sync on a directory, issue the following Alluxio command on a directory that is backed by HDFS.
$ alluxio fs startSync /syncedDir
You can also stop active sync on a directory by using the following command.
$ alluxio fs stopSync /syncedDir
Note that the list of directories under active sync is remembered between master restarts. You can check which directories are under active sync by using the getSyncPathList command.
$ alluxio fs getSyncPathList
Optimizations
There are a few parameters to optimize the active UFS sync behavior.
Sync interval: Users can control the active sync interval by changing the alluxio.master.activesync.interval option. The default is 30 seconds.
Quiet period: To avoid syncing while the directory is under heavy modification, which would add more RPC load to the UFS, Active UFS Sync tries to sync only when the UFS is considered to be in a quiet period.
This quiet period is controlled by alluxio.master.activesync.maxactivity. Activity is a heuristic based on the exponential moving average of the number of events in a directory. For example, if a directory had 100, 10, and 1 events in the past three intervals, its activity would be 100/(10*10) + 10/10 + 1 = 3. The property alluxio.master.activesync.maxactivity is the maximum activity at which the UFS directory is still considered “quiet”. However, if we only sync during quiet periods, we may have to wait a long time, and metadata can become stale in the Alluxio namespace. The property alluxio.master.activesync.maxage is the maximum number of intervals we will wait before synchronizing the UFS and the Alluxio namespace. The system guarantees that it will start syncing a directory if the directory is "quiet" or has not been synced for a long period (a period longer than the max age).
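The activity heuristic and the resulting sync decision can be sketched as follows. This is a minimal illustration under stated assumptions, not Alluxio's implementation: the decay factor of 10 is inferred from the worked example above, and the function names, parameter names, and threshold values are hypothetical.

```python
def activity(event_counts, decay=10):
    """Exponentially decayed sum of per-interval event counts.

    event_counts is ordered oldest to newest. The newest interval has
    weight 1, the one before it 1/decay, then 1/decay**2, and so on.
    The decay factor of 10 is an assumption taken from the example.
    """
    total = 0.0
    for age, count in enumerate(reversed(event_counts)):
        total += count / (decay ** age)
    return total


def should_active_sync(current_activity, intervals_since_sync,
                       max_activity, max_age):
    """Sync when the directory is quiet or the last sync is too old."""
    is_quiet = current_activity <= max_activity
    is_too_old = intervals_since_sync > max_age
    return is_quiet or is_too_old


if __name__ == "__main__":
    # 100, 10, and 1 events in the past three intervals, oldest first.
    a = activity([100, 10, 1])
    print(a)  # 3.0
    # Threshold values below are hypothetical, chosen for illustration.
    print(should_active_sync(a, intervals_since_sync=2,
                             max_activity=10, max_age=10))
```

The second condition is what bounds staleness: even a directory that never becomes quiet is synced once the number of elapsed intervals exceeds the max age.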
Conclusion
When using Alluxio, it is important to keep the Alluxio namespace and the UFS namespace consistent. This article describes two ways to perform this synchronization. The synchronization can happen with a client call to Alluxio (On-demand) or in the background (Active UFS Sync), each with its own advantages. On-demand UFS metadata sync happens only when a client calls Alluxio, so it allows administrators to precisely control when sync happens. Active UFS Sync happens in the background, so it requires minimal configuration and management. Administrators can choose the right strategy based on the specific use case.