AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training with NVMe GDS and RDMA

November 7, 2024

Bin Fan

VP of Technology

Alluxio

As large-scale machine learning becomes increasingly GPU-centric, high-performance hardware such as NVMe storage and RDMA networks (InfiniBand or specialized NICs) is becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we explore strategies for addressing this challenge by making effective use of these hardware components. Specifically, we present experimental results from building a Kubernetes-native distributed caching layer that uses NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.

Videos

AI/ML Infra Meetup | Building AI Applications on Zoom

In this talk, Ojus Save walks through a demo of how to build AI applications on Zoom. The demo shows an AI agent that receives transcript data from RTMS and then decides whether to create action items based on the transcripts it receives.

August 14, 2025

AI/ML Infra Meetup | Accelerating the Data Path to the GPU for AI and Beyond

In this talk, Sandeep Joshi, Senior Manager at NVIDIA, shares how to accelerate data access between the GPU and storage for AI. Sandeep dives into two options: CPU-initiated GPUDirect Storage and GPU-initiated SCADA.

August 14, 2025

AI/ML Infra Meetup | Beyond S3's Basics: Architecting for AI-Native Data Access

Bin Fan, VP of Technology at Alluxio, explains how Alluxio, a software layer that sits transparently between applications and S3 (or other object stores), provides sub-millisecond time to first byte (TTFB), with up to 45x lower latency.

August 14, 2025
