Speed up large-scale ML/DL offline inference job with Alluxio

April 27, 2021

Binyang Li

Software Engineer

Bing

Qianxi Zhang

Research Software Engineer

MSRA

Increasingly powerful compute accelerators and large training datasets have made the storage layer a potential bottleneck in deep learning training and inference.

An offline inference job usually consumes and produces tens of terabytes of data and runs for more than 10 hours.

At large scale, such a job puts heavy I/O pressure on storage, raises the job failure rate, and poses many challenges for system stability.

We adopt Alluxio as an intermediate storage tier between the compute tier and cloud storage to optimize the I/O throughput of deep learning inference jobs.

For the production workload, performance improves by 18%, and we seldom see job failures caused by storage issues.
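The idea of an intermediate tier can be illustrated with a toy read-through cache. This is only a sketch of the general caching pattern and not Alluxio's API or the speakers' implementation; the class `ReadThroughCache`, the function `fake_cloud_read`, and the path `input/shard-0000` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy stand-in for a caching tier between compute and remote storage."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        # fetch_remote is a hypothetical slow read from cloud storage.
        self._fetch_remote = fetch_remote
        self._cache: Dict[str, bytes] = {}
        self.remote_reads = 0  # counts how often remote storage is hit

    def read(self, path: str) -> bytes:
        # Only the first read of a path goes to remote storage;
        # later reads are served from the local in-memory copy.
        if path not in self._cache:
            self.remote_reads += 1
            self._cache[path] = self._fetch_remote(path)
        return self._cache[path]


def fake_cloud_read(path: str) -> bytes:
    # Hypothetical placeholder for a real cloud-storage client call.
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_cloud_read)
for _ in range(3):
    data = cache.read("input/shard-0000")
print(cache.remote_reads)
```

Under these assumptions, repeated reads of the same input reach remote storage once, which is the kind of I/O pressure reduction the abstract describes at a much larger scale.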

ALLUXIO DAY III 2021

