Xi Chen, one of the Top 100 Alluxio open source project contributors, explains the block allocation policy of Alluxio at the code level.
Overview
Alluxio workers are responsible for managing local resources, and they store data as blocks. Users can allocate different storage tiers as the resources for Alluxio workers, including MEM/SSD/HDD, which are further composed of directories.
How does an Alluxio worker decide which directory to put a block in when a user reads or writes data through Alluxio? In this article, we analyze the policies of block allocation from the source code.
Introduction to Alluxio’s Block Allocation PolicY
Alluxio uses block allocation policies to define how to allocate new blocks across multiple storage directories (in the same tier or across different tiers). The allocation policy defines which storage directory to allocate the new block in.
Alluxio supports having multiple storage tiers(typically each tier stands for a storage medium like MEM or SSD), and in each tier, you can define multiple directories (for example, multiple SSD volumes as the SSD tier).
Currently, there are three types of block allocation policies in Alluxio.
alluxio.worker.block.allocator.GreedyAllocator
: It loops from the top tier to the lowest tier, trying to put the new block into the first directory that can contain the block. This policy is only for demonstration purposes. In production, you should choose a more complicated policy for better allocation decisions.alluxio.worker.block.allocator.MaxFreeAllocator
: Start trying from tier 0 to the lowest tier. In each tier, try to allocate the block to the storage directory that currently has the most availability. If no directory in the current tier can serve the new block, go to a lower tier.alluxio.worker.block.allocator.RoundRobinAllocator
: Start trying from tier 0 to the lowest tier. On each tier, try to allocate the new block into a directory following the Round Robin order. If no directory in the current tier can serve the new block, go to a lower tier.
The default policy is MaxFreeAllocator
, which can be configured by worker property alluxio.worker.allocator.class
.
Source Code Explanation
allocateSpace
allocateSpace
is responsible for allocating a block to a storage directory.
When reading or writing data, the Alluxio worker will request a storage tier for the block to store the data according to the block allocation policies. allocateSpace
will first try to find the location (starting from tier 0) in the location specified in the incoming options. If there is not enough space found in the specified location, it will try to find it in lower storage tiers.
private StorageDirView allocateSpace(long sessionId, AllocateOptions options) throws WorkerOutOfSpaceException, IOException { while (true) { if (options.isForceLocation()) { //… } else { //… dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(), options.getLocation(), allocatorView, false); if (dirView != null) { return dirView; } dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(), BlockStoreLocation.anyTier(), allocatorView, false); if (dirView != null) { return dirView; } } //… }
As you can see from the above source code, the storage directories selected by allocateSpace
are dynamically calculated. However, at some point, data needs to be available to write to the specified storage directory or tier, so allocateSpace
also supports evict blocks to free up enough space in a storage directory to accommodate new data. We’ll talk more about it in the later section.
private StorageDirView allocateSpace(long sessionId, AllocateOptions options) { StorageDirView dirView; //… while (true) { if (options.isForceLocation()) { dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(), options.getLocation(), allocatorView, true); if (dirView != null) { return dirView; } freeSpace(sessionId, options.getSize(), options.getSize(), options.getLocation()); dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(), options.getLocation(), allocatorView.refreshView(), true); //… } } //… } else { //… } return dirView; } }
Allocator
is the interface for the allocation policy of Alluxio managed data.
GreedyAllocator
, MaxFreeAllocator
, and RoundRobinAllocator
all implement the Allocator
interface to define the policy.
Allocator
public interface Allocator { //… StorageDirView allocateBlockWithView(long sessionId, long blockSize, BlockStoreLocation location, BlockMetadataView view, boolean skipReview); }
allocateBlockWithView
allocates a block from the given block store location under a given view. There are three main parameters.
blockSize
: the size of block in bytes.location
: the desired location in block store, tier 0, or any tier.skipReview
: for an allocation decision, there’s a review step to reject bad decisions. This flag specifies whether the review should be skipped, which means the decision is final.
Let’s take the default alluxio.worker.block.allocator.MaxFreeAllocator
as an example. It finds storage directories in the current tier, then it checks all the storage directories one by one and returns the largest one.
private StorageDirView allocateBlock(long sessionId, long blockSize, BlockStoreLocation location, boolean skipReview) { if (location.equals(BlockStoreLocation.anyTier())) { for (StorageTierView tierView : mMetadataView.getTierViews()) { candidateDirView = getCandidateDirInTier(tierView, blockSize, BlockStoreLocation.ANY_MEDIUM); if (candidateDirView != null) { // Review if (skipReview ,, mReviewer.acceptAllocation(candidateDirView)) { break; } } } } //… return candidateDirView; } private StorageDirView getCandidateDirInTier(StorageTierView tierView, long blockSize, String mediumType) { StorageDirView candidateDirView = null; long maxFreeBytes = blockSize - 1; for (StorageDirView dirView : tierView.getDirViews()) { if ((mediumType.equals(BlockStoreLocation.ANY_MEDIUM) ,, dirView.getMediumType().equals(mediumType)) && dirView.getAvailableBytes() > maxFreeBytes) { maxFreeBytes = dirView.getAvailableBytes(); candidateDirView = dirView; } } return candidateDirView; }
As you can see from the source code, after finding the candidateDirView
, it needs to go through a review process to finally decide which storage directory to return. Let’s take a closer look at the Review
policies next.
Block Allocation Review Policies
Alluxio uses block allocation review policies to complement allocation policies. Review
brings additional restrictions (e.g., SoftLimit
/HardLimit
) and randomness to the block allocation policy. The Reviewer
works together with the Allocator
.
Currently, Alluxio implements two Review
policies.
alluxio.worker.block.reviewer.AcceptingReviewer
: This reviewer accepts every block allocation, which is equivalent to no review.alluxio.worker.block.reviewer.ProbabilisticBufferReviewer
: Based on the available space in each storage directory, rejects the attempts to put new blocks into this directory probabilistically. This is the default behavior.
The client writes a block with a stream. So, instead of requesting the whole block size from the worker storage, the client requests one buffer size at a time until the stream is completed. This is because, in many cases, the real block size is smaller than a block size limit (64MB by default). Therefore, instead of allocating the whole 64MB, we start small and let the blocks expand on the way.
One problem is that when the storage dir has less than 64MB left, a new block may be allocated to this directory because the initial request is smaller than the vacancy. As the block expands, the vacancy will be filled up, and one other block has to be evicted from the directory so that this block can finish writing. This brings in additional evictions, and the new block should have been allocated to some other directories.
This is why ProbabilisticBufferReviewer
was introduced. As the vacancy shrinks in a directory, the directory will have a higher chance of rejecting new blocks, so that existing blocks may expand without evicting others.
Let's take a look at the source code of the default ProbabilisticBufferReviewer
.
public class ProbabilisticBufferReviewer implements Reviewer { //… double getProbability(StorageDirView dirView) { //… if (availableBytes > mSoftLimitBytes) { return 1.0; } if (availableBytes <= mHardLimitBytes) { return 0.0; } double x = capacityBytes - availableBytes; double k = 1.0 / (mHardLimitBytes - mSoftLimitBytes); // If HardLimit = SoftLimit, then we would have returned in the previous if-else double b = (capacityBytes - mHardLimitBytes + 0.0) / (mSoftLimitBytes - mHardLimitBytes); double y = k * x + b; return y; } }
ProbabilisticBufferReviewer
- If the remaining space in the current storage directory is less than
mHardLimitBytes
, then it will return 0, indicating that it did not pass the review. - If the remaining space in the current storage directory is larger than
mSoftLimitBytes
, then it will return 1, which means it passed the review. - If the remaining space size is between
mHardLimitBytes
andmSoftLimitBytes
, then a value between (0, 1) is returned.
freeSpace
As mentioned above, allocateSpace
supports the eviction of blocks to free up space in a storage directory to accommodate new data. How does Alluxio decide which block to evict?
Block eviction is done by freeSpace
. If there is not enough space, the block will be evicted one by one until space is freed up.
public synchronized void freeSpace(long sessionId, long minContiguousBytes, long minAvailableBytes, BlockStoreLocation location) { Iterator<Long> evictionCandidates = mBlockIterator.getIterator(location, BlockOrder.NATURAL); while (true) { //… if (contiguousSpaceFound && availableBytesFound) { break; } if (!evictionCandidates.hasNext()) { break; } long blockToDelete = evictionCandidates.next(); if (evictorView.isBlockEvictable(blockToDelete)) { // some blocks will not be evicted try { BlockMeta blockMeta = mMetaManager.getBlockMeta(blockToDelete); removeBlockFileAndMeta(blockMeta); //… } //… }
There are some blocks that will not be evicted.
public boolean isBlockEvictable(long blockId) { boolean pinned = isBlockPinned(blockId); boolean locked = isBlockLocked(blockId); boolean marked = isBlockMarked(blockId); boolean isEvictable = !pinned && !locked && !marked; if (!isEvictable) { LOG.debug("Block not evictable: {}. Pinned: {}, Locked: {}, Marked: {}", blockId, pinned, locked, marked); } return isEvictable; }
There are currently two policies for eviction.
alluxio.worker.block.annotator.LRUAnnotator
: Annotates the blocks based on least-recently-used (LRU) order. This is Alluxio’s default annotator.- a
lluxio.worker.block.annotator.LRFUAnnotator
: Annotates the blocks based on least-recently-used (LRU) and least-frequently-used (LFU) orders with a configurable weight. The weight controls if the behavior is closer to LRU or LFU.
The default policy is LRUAnnotator
, which can be configured with the property alluxio.worker.block.annotator.class
.
Let’s take a look at the source code of LRUAnnotator
.
public class LRUAnnotator implements BlockAnnotator<LRUAnnotator.LRUSortedField> { private static final Logger LOG = LoggerFactory.getLogger(LRUAnnotator.class); private AtomicLong mLRUClock; @Override public BlockSortedField updateSortedField(long blockId, LRUSortedField oldValue) { long clockValue = mLRUClock.incrementAndGet(); return new LRUSortedField(clockValue); } /** * Sorted-field for LRU. */ protected class LRUSortedField implements BlockSortedField { private Long mClockValue; private LRUSortedField(long clockValue) { mClockValue = clockValue; } @Override public int compareTo(BlockSortedField o) { Preconditions.checkState(o instanceof LRUSortedField); return mClockValue.compareTo(((LRUSortedField) o).mClockValue); } //… }
We can see that LRUAnnotator
internally identifies the access order of each block by a timestamp represented by a monotonically increasing AtomicLong
. The larger the AtomicLong
is, the more recent it is accessed.
About the Author
Xi Chen is one of the Top 100 contributors to the Alluxio open source project.
Blog
We are thrilled to announce the general availability of Alluxio Enterprise for Data Analytics 3.2! With data volumes continuing to grow at exponential rates, data platform teams face challenges in maintaining query performance, managing infrastructure costs, and ensuring scalability. This latest version of Alluxio addresses these challenges head-on with groundbreaking improvements in scalability, performance, and cost-efficiency.
We’re excited to introduce Rapid Alluxio Deployer (RAD) on AWS, which allows you to experience the performance benefits of Alluxio in less than 30 minutes. RAD is designed with a split-plane architecture, which ensures that your data remains secure within your AWS environment, giving you peace of mind while leveraging Alluxio’s capabilities.
PyTorch is one of the most popular deep learning frameworks in production today. As models become increasingly complex and dataset sizes grow, optimizing model training performance becomes crucial to reduce training times and improve productivity.