Community Office Hour: Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

December 19, 2019

Bin Fan

VP of Technology

Alluxio

While adoption of the cloud and Kubernetes has made it exceptionally easy to scale compute, the increasing spread of data across different systems and clouds has created new challenges for data engineers. Accessing data in AWS S3 or on-premises HDFS efficiently becomes harder, and data locality is lost. How do you move data to compute workers efficiently? How do you unify data across multiple or remote clouds? The open source project Alluxio approaches these problems in a new way. It helps elastic compute workloads, such as Apache Spark, realize the true benefits of the cloud while bringing data locality and data accessibility to workloads orchestrated by Kubernetes.

One important performance optimization in Apache Spark is to schedule tasks on the nodes whose HDFS DataNodes serve the task input data locally. However, more users are running Apache Spark natively on Kubernetes, where HDFS is not an option. This office hour describes the concepts and dataflow of running the Spark/Alluxio stack in Kubernetes with enhanced data locality, even when the storage service is external or remote.

In this Office Hour we’ll go over:

Why Spark is able to make locality-aware scheduling decisions when working with Alluxio in a K8s environment using the host network
Why a pod running Alluxio can share data efficiently with a pod running Spark on the same host using a domain socket and a host path volume
The roadmap to improve the Spark / Alluxio stack in the context of K8s
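
To make the first agenda item concrete, here is a minimal Python sketch of hostname-based locality matching. It is an illustration only, not Spark's or Alluxio's actual scheduler code: the `Executor` class, the `pick_executor` function, and the addresses used with them are hypothetical. The assumption it encodes is that a task counts as node-local only when the address reported for a cached block equals the address reported by an executor, which is why pods that share the node's host network can be matched while pods with separate pod IPs cannot.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Executor:
    """Hypothetical stand-in for a Spark executor registered with the driver."""
    executor_id: str
    host: str  # address the executor reports to the driver


def pick_executor(
    block_hosts: List[str], executors: List[Executor]
) -> Tuple[Optional[Executor], str]:
    """Prefer an executor whose host matches a host caching the input block.

    Returns the chosen executor and a locality label. The labels borrow
    Spark's NODE_LOCAL / ANY naming purely for readability.
    """
    cached_on = set(block_hosts)
    for executor in executors:
        if executor.host in cached_on:
            return executor, "NODE_LOCAL"
    if executors:
        # No address matched: fall back to any executor and read remotely.
        return executors[0], "ANY"
    return None, "NO_EXECUTOR"
```

If both the Alluxio worker and the Spark executor report the node address (say, `10.0.0.5`), the match succeeds and the task is placed next to its data. If the executor instead reports a pod-network address such as `172.17.0.9`, no string matches and the sketch falls back to `ANY`, even when both pods run on the same physical host.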

ALLUXIO COMMUNITY OFFICE HOUR

Slides:

Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio from Alluxio, Inc.

