Bursting Apache Spark Workloads to the Cloud on Remote Data
March 10, 2020
By Bin Fan

ALLUXIO COMMUNITY OFFICE HOUR

Accessing data to run analytic workloads in Spark across data centers and clouds can be challenging, and network I/O can bottleneck Spark jobs that read large amounts of data. A common workaround is to deploy an HDFS cluster closer to Spark as a caching layer, manually copy the input data into HDFS, and purge it afterward. This ETL process is both time-consuming and error-prone.

A simpler and more efficient solution is to run Spark on Alluxio, a distributed cache deployed on top of the remote data source. Alluxio transparently caches data based on access patterns and keeps the working set close to compute, giving Spark jobs much higher I/O throughput through improved data locality. Alluxio also provides data accessibility and a unified abstraction for deployments in hybrid and multi-cloud environments.
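
To make this concrete, here is a minimal sketch of the basic pattern: the Spark job reads through an alluxio:// URI instead of going directly to the remote store. The master hostname and dataset paths are hypothetical placeholders (19998 is Alluxio's default master port), and the sketch assumes the Alluxio client jar is already on the Spark classpath.

  // Minimal sketch: reading remote data through Alluxio from Spark.
  // Assumes an Alluxio master at alluxio-master:19998 and a remote store
  // (e.g. S3 or a remote HDFS) mounted into the Alluxio namespace under
  // /data. The hostname and paths below are examples, not real endpoints.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("spark-on-alluxio")
    .getOrCreate()

  // The first read fetches from the remote under store and caches the
  // blocks in Alluxio; subsequent reads are served from the cache near
  // the Spark executors.
  val events = spark.read.parquet("alluxio://alluxio-master:19998/data/events")
  println(events.count())

Because only the URI scheme and path change, an existing job that reads s3:// or hdfs:// paths can be redirected through the cache without other code changes.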

In this Office Hour, we will go over how to:

  • Burst on-prem Spark workloads to the cloud with Alluxio so Spark can seamlessly read from and write to remote data storage
  • Use Alluxio as the input/output for Spark applications
  • Save and load Spark RDDs and DataFrames with Alluxio (see the sketch after this list)
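
As an illustration of the last two points, here is a hedged sketch of using Alluxio paths as Spark input and output, for both DataFrames and RDDs. The master address and paths are hypothetical placeholders.

  // Minimal sketch: Alluxio as Spark input/output for DataFrames and RDDs.
  // The master address and all paths below are placeholders.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("alluxio-io-example")
    .getOrCreate()
  val alluxio = "alluxio://alluxio-master:19998"

  // DataFrame: write to Alluxio, then load it back through the same path.
  val df = spark.range(0, 1000).toDF("id")
  df.write.mode("overwrite").parquet(s"$alluxio/tmp/ids.parquet")
  val reloaded = spark.read.parquet(s"$alluxio/tmp/ids.parquet")

  // RDD: save as text and read it back through the alluxio:// scheme.
  val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))
  rdd.saveAsTextFile(s"$alluxio/tmp/letters")
  val back = spark.sparkContext.textFile(s"$alluxio/tmp/letters")
  println(back.count())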

Video:

Slides:

Bursting Apache Spark Workloads to the Cloud on Remote Data from Alluxio, Inc.
