Bursting Spark or Presto Jobs to AWS using Alluxio
June 23, 2020
By Bin Fan and Lu Qiu

The hybrid cloud model, in which cloud resources run Spark or Presto jobs against data stored on-premises, is an appealing way to reduce resource contention in on-premises environments while also lowering overall costs. One key drawback of the hybrid model is the overhead of transferring data between the two environments. To keep analytics jobs performing as well as they would if the entire workload ran on-premises, the compute application needs data and metadata locality.

In this office hour, we demonstrate how a “zero-copy burst” solution speeds up Spark and Presto queries in the public cloud while eliminating the need to manually copy and synchronize data from the on-premises data lake to cloud storage. This approach allows compute frameworks to decouple from on-premises data sources and scale efficiently by leveraging Alluxio and public cloud resources such as AWS.

We will cover:

  • Typical challenges of moving data to the cloud and expanding compute capacity
  • Details of the “zero-copy” hybrid cloud solution for burst computing
  • A demo of running Presto analytic queries on remote on-premises HDFS data with Alluxio deployed in AWS EMR
