Bursting Apache Spark Workloads to the Cloud on Remote Data
March 10, 2020
By Bin Fan, VP of Technology, Alluxio

Accessing data to run analytic workloads in Spark across data centers and clouds can be challenging. In addition, network I/O can become the bottleneck for Spark jobs that need to read a large amount of data. A common solution is to deploy an HDFS cluster closer to Spark as a caching layer and manually copy the input data into HDFS, purging it afterward. This ETL process, however, is both time-consuming and error-prone.

A simpler and more efficient solution is to run Spark on Alluxio as a distributed cache on top of the remote data source. By transparently caching data based on access patterns and keeping the working set close to compute, Alluxio gives Spark jobs much higher I/O throughput and better data locality. In addition, Alluxio provides data accessibility and abstraction for deployments in hybrid and multi-cloud environments.
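
To make this concrete, here is a minimal sketch of a Spark job that reads its input through Alluxio rather than from the remote store directly. The master hostname, port, and paths are illustrative placeholders, and the sketch assumes the remote bucket has already been mounted into the Alluxio namespace under /data and that the Alluxio client jar is on Spark's classpath.

import org.apache.spark.sql.SparkSession

// Sketch only: "alluxio-master", port 19998, and the /data/events/ path are
// placeholders. The remote store (e.g. an S3 bucket) is assumed to be mounted
// into the Alluxio namespace under /data.
object RemoteReadThroughAlluxio {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-on-alluxio")
      .getOrCreate()

    // Read through Alluxio instead of the remote store directly. The first
    // access fetches the data from the remote store and caches it in Alluxio,
    // so subsequent reads are served from the cache with data locality.
    val events = spark.read.parquet("alluxio://alluxio-master:19998/data/events/")

    events.groupBy("event_type").count().show()

    spark.stop()
  }
}

Switching an existing job over is mostly a matter of replacing the s3:// (or hdfs://) input URI with the corresponding alluxio:// URI; the application logic stays the same.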

In this Office Hour, we will go over how to:

  • Burst on-prem Spark workloads to the cloud with Alluxio so Spark can seamlessly read from and write to remote data storage
  • Use Alluxio as the input/output for Spark applications
  • Save and load Spark RDDs and DataFrames with Alluxio (a short sketch follows this list)
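
As a concrete illustration of the last point, the sketch below writes a DataFrame and an RDD into the Alluxio namespace and loads them back. The master address and paths are hypothetical placeholders, not values from the talk.

import org.apache.spark.sql.SparkSession

// Sketch only: the Alluxio master address and the /spark base path are placeholders.
object SaveAndLoadWithAlluxio {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("save-load-alluxio")
      .getOrCreate()
    val sc = spark.sparkContext

    val base = "alluxio://alluxio-master:19998/spark"

    // DataFrame: write Parquet into the Alluxio namespace, then load it back.
    val df = spark.range(0, 1000).toDF("id")
    df.write.mode("overwrite").parquet(s"$base/ids.parquet")
    val reloaded = spark.read.parquet(s"$base/ids.parquet")
    println(reloaded.count())

    // RDD: save as text files in Alluxio and read them back.
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    rdd.saveAsTextFile(s"$base/letters")
    println(sc.textFile(s"$base/letters").count())

    spark.stop()
  }
}

Because the data lands in Alluxio, it can be shared across Spark jobs and persisted to the mounted remote store without each job re-reading it over the network.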
ALLUXIO COMMUNITY OFFICE HOUR

Video:

Slides:

Bursting Apache Spark Workloads to the Cloud on Remote Data from Alluxio, Inc.
