Alluxio is a data orchestration platform that unifies data silos across heterogeneous environments. This blog discusses the architecture that combines Presto with Alluxio.
Presto (PrestoDB and Trino) is a popular choice for organizations to run interactive analytic queries on multiple data sources at a large scale. Positioned as SQL-on-Everything, Presto is designed to be a storage-independent engine that queries against any disparate data sources anywhere.
Many organizations are constantly modernizing their data platforms and developing scalable solutions to fit their existing and future needs. Although Presto has proven that it can process data at massive scale, it is not optimized for data access across the pipeline. As a result, data platform engineers still have to find extra solutions to eliminate data redundancy, error-prone processes, slow and inconsistent performance, and high costs.
To solve these challenges, we recommend an architecture that combines Presto with Alluxio. Alluxio is a data orchestration platform that bridges computation frameworks and underlying storage systems. Presto and Alluxio working together enable a unified, robust, high-performance, low-latency, and cost-effective analytics architecture. This architecture benefits not only analytics but all stages of the data pipeline, spanning ingestion, analytics, and modeling. It allows fast SQL queries across multiple storage systems on-prem and in public cloud, hybrid cloud, and multi-cloud environments.
Many companies have leveraged Alluxio to level up their current Presto platform, including Facebook, TikTok, Electronic Arts, Walmart, Tencent, Comcast, and more. They have gained significant benefits with Alluxio integrated into their Presto stack.
Why Pair Presto with Alluxio
Alluxio Helps Modernize the Data Platform as a Hybrid Cloud Data Lake
More and more organizations are considering moving their data platforms to the cloud for elasticity and analytics at scale. But cloud migration is not an easy journey. As an initial step, organizations may migrate analytics workloads, for example, Presto, to the cloud to leverage on-demand compute resources. In this case, the data platform is hybrid with Presto in the cloud and data stored on-prem.
Traditionally, hybrid cloud is not considered a suitable architecture for latency-sensitive interactive queries because fetching remote data over the network suffers from slow, inconsistent performance and from query failures when requests time out. Multi-cloud deployments have similar issues because the compute and storage clusters reside in geographically disparate locations. These are huge roadblocks when trying to productionize the cloud data platform.
To begin the cloud journey, many companies choose the solution of Lift and Shift, which means copying their data into a cloud environment and maintaining that duplicate data. However, manually managing data copies is slow and complex. Furthermore, some organizations may face regulations that prohibit persistent copies in the cloud, preventing them from using the Lift and Shift approach.
Alluxio enables hybrid and multi-cloud interactive queries using a different solution. Alluxio abstracts the storage layer and provides Presto with fast access to data on-prem, in the cloud, and multi-data centers, along with policies to manage data movement across environments:
- “Zero-copy” burst to tackle the challenges of cloud migration. Alluxio’s “zero-copy” burst enables Presto in the cloud to access data on-prem without first copying the data to the cloud, which helps seamlessly migrate workloads without changing the end-user experience.
- Intelligent multi-tiered cache to overcome network latency. Alluxio serves as an intelligent multi-tiered layer for Presto caching and brings data locally to Presto. Alluxio automatically utilizes near-compute storage media for optimal data access to minimize network traffic and achieve high and consistent performance in hybrid or multi-cloud environments.
- Data control for regulation and compliance. Alluxio provides tight control over the whole lifecycle of the data in the public cloud. Presto only interacts with Alluxio, which only stores data temporarily. Even if the data is moved to a public cloud, you can use policies or configurations in Alluxio to control how long the data lives in the public cloud before being wiped out.
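The data-control point above can be pictured with a small time-to-live (TTL) cache. This is only a toy sketch under stated assumptions, not Alluxio's implementation or API: the `TtlCache` class, its method names, and the injectable `clock` parameter are all hypothetical, chosen to show how a temporary copy can be wiped once it has lived for a configured period.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy cache whose entries expire after a fixed time-to-live (hypothetical)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without waiting
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, path: str, data: bytes) -> None:
        # Store the data together with the time at which it must be wiped.
        self._entries[path] = (self._clock() + self._ttl, data)

    def get(self, path: str) -> Optional[bytes]:
        entry = self._entries.get(path)
        if entry is None:
            return None
        expires_at, data = entry
        if self._clock() >= expires_at:
            # The temporary copy has outlived its TTL: remove it from the cache.
            del self._entries[path]
            return None
        return data
```

In this sketch, a copy that has passed its TTL is dropped on the next access, so the next read would have to go back to the on-prem source. A real policy engine would presumably also evict proactively rather than only on access.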
Alluxio Enables a Unified Data Access Across Multiple Data Sources
Because Presto is storage independent, the flexibility to connect to multiple data sources is an essential requirement for the data platform supporting Presto queries. Otherwise, you will end up creating and managing copies of all the data sources, which is complex, time-consuming, and error-prone.
Alluxio can help simplify the architecture by unifying data access to multiple and varied data sources. When paired with Alluxio, Presto gets federated access directly from Alluxio with a single view of all the data. This brings several advantages:
- Simplify data management with a global namespace. Alluxio provides a unified view of the data using a global namespace, which greatly simplifies data management and eliminates data duplication. Instead of individually connecting to each data source, you only need to connect to Alluxio, and Alluxio will give Presto only the relevant data required.
- Completely decouple compute and storage systems. Ideally, Presto would access data independently of how the data was originally stored or managed. This is why Alluxio is a good match. Alluxio can bring the data locally to Presto on-demand, making data access on the decoupled architecture no longer a concern.
- No impact to the application side. Because Alluxio abstracts the storage layer, the applications running on the existing data platform, like Presto, can access data without the need to understand where data lives and how to access it. Furthermore, when making changes to the storage side, Alluxio provides seamless access without changing the application so that it gets easier and quicker to adopt innovations.
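The global namespace idea in the list above can be illustrated with a toy mount table that maps one unified path space onto several storage systems. This is a minimal sketch, not Alluxio's code: `MountTable`, its methods, and the example host and bucket names are hypothetical.

```python
from typing import Dict, Optional


class MountTable:
    """Toy global namespace: maps unified paths to underlying storage URIs."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, mount_point: str, ufs_uri: str) -> None:
        # Normalize trailing slashes so prefix matching stays simple.
        self._mounts[mount_point.rstrip("/")] = ufs_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        # Choose the longest mount point that is a prefix of the path.
        best: Optional[str] = None
        for mount_point in self._mounts:
            if path == mount_point or path.startswith(mount_point + "/"):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        return self._mounts[best] + path[len(best):]


# Hypothetical storage systems, used only for illustration.
table = MountTable()
table.mount("/warehouse", "hdfs://onprem-namenode:8020/warehouse")
table.mount("/logs", "s3://example-bucket/logs")
print(table.resolve("/logs/2021/01/events.parquet"))
```

The application only ever sees paths such as `/logs/...`. Which storage system sits behind each path is a detail of the mount table, which is why the storage side can change without touching the query engine.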
Alluxio Accelerates the Whole Data Pipeline with One-Caching-For-All Solution
Data analytics is never fast enough, especially for dashboarding and interactive queries or any user-facing data-driven application. Such workloads are typically read-heavy, low-latency, and high-throughput, and their performances are bottlenecked by slow access of the underlying storage.
Caching is a typical solution to improve query performance through data locality and data sharing. Presto provides an embedded cache (called native cache or local cache) that benefits reads. However, it does not cache writes and cannot be shared among different Presto clusters. Relying on Presto's embedded caching leads to storage duplication to accommodate multiple caching models, increased storage cost, complex maintenance, and resource contention.
Alluxio provides a differentiated, one-caching-for-all solution that brings significant benefits beyond any embedded caching Presto provides:
- Enable data sharing across the entire data pipeline. Data processing pipelines include a series of processing stages, for example performing ETL with Spark, then executing a SQL query using Presto, then running a machine learning algorithm on the query's result. Alluxio's data orchestration platform spans the data pipeline from data ingestion to ETL to analytics and ML. Alluxio enables data sharing and saves intermediate results so that one compute engine can consume the output of another. The ability to share data between applications leads to further performance gains and less data movement.
- Enhance performance across all compute engines. Because Alluxio spans through the entire data pipeline, it allows any computation framework to take advantage of previously cached data, further enhancing the speed of workload reads and writes. The cached data can stay in Alluxio for the next stage of the pipeline, resulting in significant performance gains.
- Unified, cost-effective, easy-to-manage caching platform. Using embedded caching provided by different compute engines means managing multiple caching models, which is complex. Alluxio, as a unified caching platform for all kinds of compute engines, manages all the caching storage for the entire enterprise, thus reducing the compute and storage needed. Also, Alluxio cache can be utilized across all engines without any changes to the engines themselves, bringing ease of deployment and management.
- Independently deployed Alluxio cluster allows Presto to scale down without losing cached data. When using Presto embedded caching, you lose the cached data if you scale the Presto cluster down, and losing cached data usually causes severe performance problems. Instead of co-locating Presto and Alluxio, you have the flexibility of running a separate Alluxio cluster that operates independently of Presto's compute capacity, allowing Presto to scale down while keeping the cached data.
With the Alluxio caching solution, Presto can achieve faster queries, lower latency, and consistent SLAs. Beyond accelerating Presto, the whole data pipeline gets better performance end-to-end. Better performance leads to higher data throughput and better resource utilization without purchasing unnecessary resources. Data platforms can scale as needed, reducing infrastructure scaling costs with a lower TCO.
Common Use Cases by Deployment Models and Case Studies
Here are three common use cases for Presto with Alluxio, organized by deployment model and each followed by a real-world user story. You will be able to map these to your own architecture and start adding Alluxio to your data platform.
Single Cloud Analytics: Fast SQL on Any Cloud Storage / Electronic Arts
The first use case is a single cloud deployment model. Both Presto and Alluxio are deployed in a cloud environment such as AWS or Google Cloud. In this scenario, Alluxio sits between Presto and cloud storage as a caching layer to speed up query performance. Presto + Alluxio in the cloud delivers better SLAs than accessing data directly from cloud storage. You also save on cloud storage access costs because less data is read from it. Alluxio not only handles the caching but also manages the corresponding metadata, avoiding inefficient metadata operations and expensive metadata service costs.
Electronic Arts, a leading company in the gaming industry, adopted the innovative platform with Presto as the computing engine and Alluxio as a data orchestration layer between Presto and S3 storage to support real-time data visualization, dashboarding, and conversational analytics. With Alluxio, they achieved up to 5.9x performance gain with metadata caching when handling large numbers of small files and reduced infrastructure and S3 costs. Check out the case study here.
Hybrid Cloud Analytics: Simplify the Journey to the Cloud / Walmart
The second use case is to leverage the flexible compute resources on the public cloud so that you can burst Presto into the cloud on-demand and leave your data on-prem. With "zero-copy" burst, Alluxio caches the data on-demand based on data access patterns, rather than manually moving data to the cloud and removing data later. By integrating on-prem data stores with Alluxio + Presto, data is intelligently brought close to Presto on-demand. You will get high performance in the hybrid cloud environment with I/O offloaded, which is much better than directly accessing the remote data.
Walmart used Presto in the cloud to handle more capacity for Presto and offload previously on-prem Hive workloads. Alluxio is co-located with Presto for data locality. Alluxio's zero-copy burst ensures seamless migration and improved Presto query performance and concurrency. With Presto + Alluxio, they can achieve 2x compute capacity for the same environment. Check out Walmart's presentation here.
Cross Datacenter Analytics: High-Performance Analytics Anywhere / A Leading SaaS Company
The third use case spans multiple data centers, where Presto is in one data center and the data lake is in another. Many companies choose to deploy Presto in a separate cluster for performance and resource isolation but still use HDFS in the main cluster. This deployment often results in unstable performance because Presto always needs to fetch data remotely. Having Alluxio co-deployed with Presto in the satellite cluster helps with performance. Alluxio can accelerate remote reads from the main data cluster without adding extra ETL steps. Intelligent data caching eliminates the performance bottlenecks and decreases the overall load on the main data cluster.
At a leading SaaS company, as the main data center got overloaded with growing workloads and users, they wanted to leverage compute resources outside of the primary on-prem data center for Presto workloads. By using Alluxio, they can reduce redundant storage needs in multiple data centers and enhance the performance of Presto by 3x to meet the query latency expectations of end-users.
Benchmarking Performance in a Hybrid Cloud Architecture
For benchmarking, we ran SQL queries on data in a geographically separated Hive and HDFS cluster.
We used data and queries from the industry-standard TPC-DS benchmark for decision support systems that examines large amounts of data and answers business questions. The queries can be categorized into the following classes (according to visualizations in this repository): Reporting, Interactive, and Deep Analytics.
TPC-DS Data Specifications
| Scale Factor | Format | Compression | Data Size | Number of Files |
| --- | --- | --- | --- | --- |
| 1000 | Parquet | Snappy | 463.5 GB | 234.2 K |
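As a quick sanity check on the specification above, the average file size follows directly from the two figures given. The short calculation below is only illustrative arithmetic. It assumes "GB" means 10^9 bytes and "K" means thousands, which the source does not state.

```python
# Figures taken from the TPC-DS data specification above.
data_size_gb = 463.5   # total data size, in GB
num_files = 234.2e3    # "234.2 K" files

# Assumption: GB is read as 10**9 bytes; reading it as GiB gives a result about 7% larger.
avg_file_size_mb = data_size_gb * 1e9 / num_files / 1e6
print(f"average file size: about {avg_file_size_mb:.2f} MB")  # about 1.98 MB
```

At roughly 2 MB per file on average, the dataset consists of many relatively small files, so per-file metadata and request overhead are likely to matter alongside raw throughput.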
EMR Instance Specifications
| Instance Type | Master Instance Count | Worker Instance Count | Alluxio Storage Volume (us-west-1) | HDFS Storage Volume (ap-southeast-1) |
| --- | --- | --- | --- | --- |
| r5.4xlarge | 1 each | 10 each | NVMe SSD | EBS |
We compared the performance of Presto with Alluxio (Cold and Warm) against Presto directly on HDFS (Local and Remote). Benchmarking shows an average of 3x improvement with Alluxio when the cache is warm over accessing HDFS data remotely. Check out this benchmarking blog for more details.
Presto + Alluxio Architecture Overview
The figure below depicts a high-level architecture of a typical Presto + Alluxio deployment.
A Typical Architecture
Presto has a Coordinator-Worker architecture. The Presto Coordinator is responsible for handling query requests and managing workers, and Presto Workers are in charge of query processing. Alluxio also employs a Master-Worker architecture - the Alluxio Master manages metadata and monitors and manages Workers; Alluxio Workers manage local storage resources (MEM/SSD/HDD), fetch data from storage, and store cached data.
In a typical Presto setup, metadata is served by the Hive metastore. The Presto Coordinator first gets metadata from the Hive metastore, then schedules Presto Workers to query the data directly from the under file system (UFS). The Presto Workers then return the query result to the Presto Coordinator.
Now we add Alluxio into the architecture. It is recommended to deploy Alluxio Workers co-located with Presto Workers for data locality. As a data abstraction layer, Alluxio bridges computation frameworks and underlying storage systems so that Presto only needs to talk to Alluxio. The underlying storage systems are now mounted to Alluxio with the table location on Alluxio Master and cached data stored on Alluxio Workers.
How Presto and Alluxio Work Together
When the Presto Coordinator asks the Hive Metastore for the metadata, it actually gets the metadata from the Alluxio Master, which knows where the data is and finds the closest way to get it from Alluxio Workers. On a cache hit, cached data from either a local or a remote Alluxio Worker is returned to the Presto Worker. Otherwise, an Alluxio Worker retrieves the data from the remote data source and caches it on the Alluxio Worker node for follow-up queries. Presto Workers then process (join, aggregate, and so on) the data stored in Alluxio Workers and provide the result to the user through the Presto Coordinator.
The benefit of adding Alluxio to the Presto architecture is that it abstracts out data access regardless of where the data resides. Alluxio decides where to store the data and moves the data based on data access (on-demand) or policies (pre-defined) to make sure that Presto and other applications achieve the desired performance. In addition, because Alluxio is a one-common-cache-for-all solution, the intermediate cached data is shared among all compute frameworks across the data pipeline, including Spark, Presto, TensorFlow, and so on. With Alluxio, data platform engineers no longer have to worry about where and how to store the data to gain high-performance access. Furthermore, Alluxio Workers can also be deployed on dedicated nodes so that the cache is independent of Presto nodes. As a result, Presto can elastically scale up and down without losing cached data, and Alluxio does not take resources from Presto clusters that are already overloaded.
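The read path just described (local cache hit, remote cache hit, then a fetch from the data source on a miss) can be sketched as a toy model. This is a simplified illustration, not Alluxio's client or worker code: `Worker`, `read_block`, the `fetch_from_ufs` callback, and the source labels are all hypothetical names.

```python
from typing import Callable, Dict, List, Tuple


class Worker:
    """Toy stand-in for a cache worker that holds blocks in local storage."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}


def read_block(
    block_id: str,
    local: Worker,
    remotes: List[Worker],
    fetch_from_ufs: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (data, source), trying the closest copy first."""
    # 1. Cache hit on the worker co-located with the compute node.
    if block_id in local.blocks:
        return local.blocks[block_id], "local-cache"
    # 2. Cache hit on another worker in the cluster.
    for worker in remotes:
        if block_id in worker.blocks:
            return worker.blocks[block_id], f"remote-cache:{worker.name}"
    # 3. Cache miss: read from the underlying storage and keep a copy
    #    locally so that follow-up queries are served from the cache.
    data = fetch_from_ufs(block_id)
    local.blocks[block_id] = data
    return data, "ufs"
```

In this sketch, calling `read_block` twice for the same block reaches the underlying storage only once. The second call is answered from the local worker, which is the effect the warm-cache benchmark numbers reflect.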
Ease of Deployment
It is very easy to add Alluxio to the existing Presto environment without the burdens of table redefinitions, manual data copies, or application rewrites. Alluxio’s Transparent URI makes sure that URIs will be routed from the external storage system to Alluxio’s namespace. This allows users to access the storage systems through Alluxio without any change from the Presto side.
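To make the routing idea concrete, the sketch below rewrites a storage URI into an equivalent path in an Alluxio-style namespace when the storage location is mounted. It illustrates the concept only and is not how Alluxio's Transparent URI feature is implemented or configured: the function `to_alluxio_uri`, the mount mapping, the bucket name, and the master address are hypothetical.

```python
from typing import Dict


def to_alluxio_uri(storage_uri: str, mounts: Dict[str, str], master: str) -> str:
    """Rewrite a storage URI to an alluxio:// URI if its prefix is mounted.

    `mounts` maps an underlying storage prefix to a mount point in the
    unified namespace (a hypothetical structure used for illustration).
    """
    # Try longer storage prefixes first so nested mounts win over their parents.
    for ufs_prefix, mount_point in sorted(mounts.items(), key=lambda kv: -len(kv[0])):
        prefix = ufs_prefix.rstrip("/")
        if storage_uri == prefix or storage_uri.startswith(prefix + "/"):
            suffix = storage_uri[len(prefix):]
            return f"alluxio://{master}{mount_point.rstrip('/')}{suffix}"
    # Not mounted: leave the URI untouched so direct access still works.
    return storage_uri


# Hypothetical mount and master address.
mounts = {"s3://example-bucket/warehouse": "/warehouse"}
print(to_alluxio_uri("s3://example-bucket/warehouse/sales/part-0.parquet",
                     mounts, "alluxio-master:19998"))
```

Because the translation happens below the query engine, table definitions that still reference the original storage URIs would not need to change, which is the property this section relies on.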
Ready to Get Started?
Presto + Alluxio is a proven combination that many organizations are leveraging in their data platforms. Together, Presto and Alluxio provide hybrid cloud capabilities, unified data access, and exceptional performance for interactive analytic workloads. Alluxio enables the true separation of storage and compute without sacrificing data locality, increases query performance by 3x via intelligent tiered cache, and makes scaling Presto more elastic and cost-effective.
To get started with Presto + Alluxio, visit this 10-min tutorial and documentation for more details. You can also join our community slack channel to engage with our users or developers. And, of course, you can download a copy of the free Alluxio Community Edition or a trial of the full Alluxio Enterprise Edition to discover the benefits of Presto + Alluxio for your own use cases.