Tech Talk: Accelerate Spark Workloads on S3
June 28, 2019
By Dipti Borkar

While running analytics workloads with EMR Spark on S3 is a common deployment today, many organizations run into performance and consistency issues. EMR can be bottlenecked when reading large amounts of data from S3, and sharing data across multiple stages of a pipeline can be difficult, since S3 is only eventually consistent for read-after-write scenarios.

A simple solution is to run Spark on Alluxio as a distributed cache for S3. Alluxio stores data in memory close to Spark, delivering high performance while also providing data accessibility and abstraction for deployments in both public and hybrid clouds.
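
As a sketch of what this looks like from a Spark job, the snippet below reads a dataset through Alluxio instead of directly from S3. The master hostname, mount point, and file path are placeholder assumptions for illustration (19998 is Alluxio's default master RPC port), and it assumes the Alluxio client jar is on Spark's classpath and the S3 bucket has already been mounted into the Alluxio namespace, e.g. with the Alluxio CLI's `alluxio fs mount` command.

```scala
import org.apache.spark.sql.SparkSession

// Prerequisite, run once outside Spark (bucket name and mount point are placeholders):
//   bin/alluxio fs mount /s3 s3://my-bucket/data

object AlluxioReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-on-alluxio")
      .getOrCreate()

    // Read through Alluxio rather than directly from S3. Hot data is served
    // from Alluxio's memory tier, so repeated reads avoid the S3 round trip.
    val df = spark.read.parquet("alluxio://alluxio-master:19998/s3/events.parquet")
    df.show()

    spark.stop()
  }
}
```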

In this webinar you’ll learn how to:

  • Increase performance by setting up Alluxio so Spark can seamlessly read from and write to S3
  • Use Alluxio as the input/output for Spark applications
  • Save and load Spark RDDs and DataFrames with Alluxio (see the sketch after this list)
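
A minimal sketch of the last item, again with placeholder hostnames and paths: an RDD and a DataFrame are written to Alluxio and read back, so later pipeline stages see a consistent, memory-speed copy instead of re-reading from S3.

```scala
import org.apache.spark.sql.SparkSession

object AlluxioSaveLoadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("alluxio-save-load")
      .getOrCreate()
    val sc = spark.sparkContext

    // Save an RDD to Alluxio as text and load it back in a later stage.
    val rdd = sc.parallelize(1 to 100)
    rdd.saveAsTextFile("alluxio://alluxio-master:19998/tmp/numbers")
    val loaded = sc.textFile("alluxio://alluxio-master:19998/tmp/numbers")
    println(s"RDD count = ${loaded.count()}")

    // Save a DataFrame as Parquet on Alluxio and reload it.
    val df = spark.range(0, 1000).toDF("id")
    df.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/tmp/ids.parquet")
    val reloaded = spark.read.parquet("alluxio://alluxio-master:19998/tmp/ids.parquet")
    println(s"DataFrame rows = ${reloaded.count()}")

    spark.stop()
  }
}
```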
