Running Presto with Alluxio is gaining popularity in the community. It avoids the long latency of reading data from remote storage by using SSD or memory to cache hot datasets close to the Presto workers. Presto supports hash-based soft affinity scheduling to enforce that only one or two copies of the same data are cached in the entire cluster, which improves cache efficiency by allowing more hot data to be cached locally. The current hashing algorithm, however, does not work well when the cluster size changes. This article introduces consistent hashing, a new hashing algorithm for soft affinity scheduling, to address this problem.
Soft Affinity Scheduling
Presto uses a scheduling strategy called soft affinity scheduling to schedule a split (the smallest unit of data processing) to the same Presto worker (its preferred node). The mapping between a split and a Presto worker is computed by a hash function on the split, ensuring the same split is always hashed to the same worker. The first time a split is processed, its data is cached on the preferred worker node. When subsequent queries process the same split, they are scheduled to the same worker node again. Since the data is already cached locally, no remote read is necessary.
To improve load balancing and handle flaky workers, two preferred nodes are chosen. If the first choice is busy or unresponsive, the second one is used. As a result, data might physically be cached on two worker nodes.
For more details on soft affinity scheduling, please read "Improving Presto Latencies with Alluxio Data Caching".
Hashing Algorithm
Soft affinity scheduling relies on a hashing algorithm to compute the mapping between splits and worker nodes. Previously, a modulo function was used:
WorkerID1 = Hash(splitID) % workerCount
WorkerID2 = Hash(splitID) % workerCount + 1
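To make the mechanics concrete, here is a minimal sketch of modulo-based selection of the two preferred nodes. It is an illustration only, assuming a simple string split ID and a fixed worker list, and is not the actual Presto implementation:

```java
import java.util.List;

// Minimal sketch of modulo-based preferred node selection for soft affinity
// scheduling (illustration only, not the actual Presto implementation).
class ModuloNodeSelector {
    private final List<String> workers; // worker node identifiers, in a fixed order

    ModuloNodeSelector(List<String> workers) {
        this.workers = workers;
    }

    // Returns the two preferred workers for a split: the hashed worker and the next one.
    String[] preferredNodes(String splitId) {
        int count = workers.size();
        int hash = splitId.hashCode() & 0x7fffffff; // keep the hash non-negative
        String first = workers.get(hash % count);
        String second = workers.get((hash % count + 1) % count);
        return new String[] {first, second};
    }
}
```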
This hashing strategy is simple and works well when the cluster is stable and worker nodes do not change. However, if a worker is temporarily unavailable or down, the worker count changes and the split-to-worker mapping is completely reshuffled, causing the cache hit rate to drop significantly. If the problematic worker comes back online later, the reshuffle happens again.
To mitigate this issue, Presto uses the total worker count instead of the active worker count when computing the modulo for worker assignment. However, this only mitigates the rehashing caused by worker nodes going offline temporarily. There are also situations where it makes sense to add or remove workers due to workload fluctuations. In these scenarios, is it possible to keep a reasonable cache hit rate without introducing massive rehashing? The solution is consistent hashing.
Consistent Hashing
The concept of consistent hashing was introduced in 1997 by David Karger as a way of distributing requests among a changing population of web servers. The technique is widely used in load balancing, distributed hash tables, and more.
How does consistent hashing work?
Imagine that the hash output range [0, MAX_VALUE] is mapped onto a circle (connecting MAX_VALUE back to 0). To demonstrate how consistent hashing works, assume a Presto cluster of 3 worker nodes and 10 splits that are queried repeatedly.
First, the worker nodes are hashed onto the hashing ring. Each split is then assigned to the worker whose position is next after the split's hash value on the ring.
In the scenario above, the splits are assigned as follows:
Worker1: Split1, Split4, Split6, Split8
Worker2: Split0, Split5, Split7
Worker3: Split2, Split3, Split9
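The ring lookup itself can be sketched with a sorted map: a split is assigned to the first worker whose position on the ring is at or after the split's hash, wrapping around to the first worker when needed. The class and hash function below are assumptions for illustration, not Presto's actual code:

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal consistent hashing ring (illustration only).
class HashRing {
    // Maps a position on the ring to the worker placed there.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addWorker(String worker) {
        ring.put(hash(worker), worker);
    }

    void removeWorker(String worker) {
        ring.remove(hash(worker));
    }

    // A split is assigned to the next worker clockwise on the ring.
    String workerFor(String splitId) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(splitId));
        // Wrap around to the first worker if the split hashes past the last one.
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private int hash(String key) {
        return key.hashCode() & 0x7fffffff; // placeholder; a real ring would use a stronger hash
    }
}
```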
Removing a worker
Now if worker2 goes offline for some reason, then according to the algorithm, splits 0, 5, and 7 will be scheduled to the worker at the next hash value, which is worker1:
Worker1: Split0, Split1, Split4, Split5, Split6, Split7, Split8
Worker3: Split2, Split3, Split9
Only the splits that were hashed to the offline worker (worker2 in our example) need to be rehashed; other data is not affected. If worker2 comes back online later, splits 0, 5, and 7 will again be hashed to worker2, leaving the hit rate on the other workers unaffected.
Adding a worker
Now suppose the workload increases and another worker node, worker4, needs to be added to the cluster. Worker4's hash value lands on the hashing ring as follows:
In this case split8 falls into worker4's range, while all other splits keep their assignments, so the cache hit rate on those splits is not affected. The new assignment is:
Worker1: Split1, Split4, Split6
Worker2: Split0, Split5, Split7
Worker3: Split2, Split3, Split9
Worker4: Split8
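Continuing the hypothetical HashRing sketch above, removing or adding a worker only changes the assignment of splits whose ring segment is affected; all other lookups return the same worker as before:

```java
public class RingDemo {
    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addWorker("worker1");
        ring.addWorker("worker2");
        ring.addWorker("worker3");
        System.out.println(ring.workerFor("split8")); // initial assignment

        // Taking worker2 offline only remaps the splits that were assigned to worker2;
        // splits owned by worker1 and worker3 keep their mapping (and their cached data).
        ring.removeWorker("worker2");

        // Adding worker4 only remaps the splits that now fall into worker4's ring segment.
        ring.addWorker("worker4");
        System.out.println(ring.workerFor("split8")); // unchanged unless split8 falls in worker4's segment
    }
}
```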
Virtual nodes
As shown above, consistent hashing guarantees that when nodes change, on average only N_splits / N_nodes splits need to be rehashed. However, due to the lack of randomness in the worker distribution, the splits might not be uniformly distributed across all worker nodes. The concept of "virtual nodes" is introduced to mitigate this issue. Virtual nodes also help redistribute a node's load across multiple nodes when it is disconnected, which reduces load fluctuation due to cluster instability.
Each physical worker node is mapped to multiple virtual nodes, and the virtual nodes are placed on the hashing ring. A split is assigned to the next virtual node on the ring and is therefore routed to the physical node that virtual node maps to. The following example shows a possible scenario where each physical worker node has 3 virtual nodes:
Worker1
- Worker1_v1: Split6
- Worker1_v2: Split0
- Worker1_v3: Split1
Worker2
- Worker2_v1: Split5, Split7
- Worker2_v2: Split4
- Worker2_v3: Split9
Worker3
- Worker3_v1: Split2, Split3
- Worker3_v2: (no splits)
- Worker3_v3: Split8
As the number of nodes on the hashing ring increases, the hash space is more likely to be evenly partitioned.
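Virtual nodes can be layered on the same ring idea by placing several hash points per physical worker. The sketch below is one possible way to do it (again an assumption for illustration, not Presto's implementation):

```java
import java.util.Map;
import java.util.TreeMap;

// Consistent hashing ring with virtual nodes (illustration only).
class VirtualNodeHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>(); // ring position -> physical worker
    private final int virtualNodesPerWorker;

    VirtualNodeHashRing(int virtualNodesPerWorker) {
        this.virtualNodesPerWorker = virtualNodesPerWorker;
    }

    void addWorker(String worker) {
        // Place several points on the ring for one physical worker, e.g. worker1_v1, worker1_v2, ...
        for (int i = 1; i <= virtualNodesPerWorker; i++) {
            ring.put(hash(worker + "_v" + i), worker);
        }
    }

    void removeWorker(String worker) {
        for (int i = 1; i <= virtualNodesPerWorker; i++) {
            ring.remove(hash(worker + "_v" + i));
        }
    }

    // A split goes to the next virtual node on the ring, which routes to its physical worker.
    String workerFor(String splitId) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(splitId));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private int hash(String key) {
        return key.hashCode() & 0x7fffffff; // placeholder hash function
    }
}
```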
When a physical node goes down, all virtual nodes corresponding to that physical node are removed from the ring, and the splits mapped to them are rehashed. Instead of all of those splits being rehashed to the same node, they are distributed across multiple virtual nodes, and thus across multiple physical nodes, providing better load balancing.
The following shows that when worker3 is removed, Split2 and Split3 are rehashed to worker2, while Split8 is rehashed to worker1.
Worker1
- Worker1_v1: Split6
- Worker1_v2: Split4, Split8
- Worker1_v3: Split1
Worker2
- Worker2_v1: Split5, Split7
- Worker2_v2: Split0, Split2, Split3
- Worker2_v3: Split9
How to use consistent hashing in Presto?
This is currently an experimental feature that we recently contributed to Presto. Feel free to contact us if you are interested in testing it or collaborating.
To use this feature, first enable caching with Presto by following this instruction or this tutorial.
Make sure you choose SOFT_AFFINITY as the scheduling policy: in /catalog/hive.properties, add hive.node-selection-strategy=SOFT_AFFINITY.
Enable consistent hashing: in config.properties, add node-scheduler.node-selection-hash-strategy=CONSISTENT_HASHING.
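Put together, the two settings look like the following (exact file locations depend on your deployment):

```
# <Presto configuration directory>/catalog/hive.properties
hive.node-selection-strategy=SOFT_AFFINITY

# <Presto configuration directory>/config.properties
node-scheduler.node-selection-hash-strategy=CONSISTENT_HASHING
```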
Conclusion
As illustrated above, consistent hashing minimizes the reshuffling of workload assignments when nodes are added or removed, which in turn minimizes the impact on the cache hit rate of existing nodes when a cluster's worker nodes change. This makes consistent hashing a better strategy when the Presto cluster is scaled up and down according to workload needs, or when the deployment does not have full control of the hardware and workers may be relocated from time to time.
In the Alluxio community, we have been constantly improving the integration between Alluxio and data applications (e.g., Presto in this article) in both functionality and usability. With the introduction of consistent hashing in Presto scheduling, Alluxio can better leverage the potential of soft affinity in Presto with higher data locality and cache efficiency, which translates to better performance and cost efficiency. We will continue bringing further improvements and optimizations to the data ecosystem.