Speed up large-scale ML/DL offline inference job with Alluxio
April 27, 2021
By Binyang Li and Qianxi Zhang

Increasingly powerful compute accelerators and large training datasets have made the storage layer a potential bottleneck in deep learning training and inference.

An offline inference job usually consumes and produces tens of terabytes of data while running for more than 10 hours.

A job at this scale usually puts heavy pressure on IO, increases the job failure rate, and brings many challenges for system stability.

We adopt Alluxio, which acts as an intermediate storage tier between the compute tier and cloud storage, to optimize the IO throughput of deep learning inference jobs.
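As a rough illustration of this architecture, the sketch below reads inference inputs and writes results through an Alluxio FUSE mount, so hot data is served from the Alluxio cache instead of going to cloud storage on every access. The mount point `/mnt/alluxio`, the shard layout, and the `.pt` file format are hypothetical placeholders, not details from the talk.

```python
import glob
import torch
import torch.utils.data

# Hypothetical Alluxio FUSE mount point; files under this path are cached
# by Alluxio workers and backed by the cloud under-store (e.g. an S3 bucket).
ALLUXIO_MOUNT = "/mnt/alluxio/inference-inputs"
OUTPUT_DIR = "/mnt/alluxio/inference-outputs"


class ShardDataset(torch.utils.data.Dataset):
    """Reads pre-sharded input tensors through the Alluxio mount."""

    def __init__(self, pattern: str):
        self.files = sorted(glob.glob(pattern))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # A plain POSIX read: Alluxio serves it from cache when possible
        # and falls back to the cloud under-store on a miss.
        return torch.load(self.files[idx])


def run_inference(model: torch.nn.Module):
    dataset = ShardDataset(f"{ALLUXIO_MOUNT}/*.pt")
    loader = torch.utils.data.DataLoader(dataset, batch_size=1, num_workers=8)
    model.eval()
    with torch.no_grad():
        for i, batch in enumerate(loader):
            out = model(batch)
            # Results are written back through the same mount, so the upload
            # to cloud storage is handled by Alluxio rather than the job.
            torch.save(out, f"{OUTPUT_DIR}/part-{i:05d}.pt")
```

Because the mount exposes a POSIX interface, the inference code itself needs no Alluxio-specific API calls; only the data paths change.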

For our production workload, performance improves by 18%, and we seldom see job failures caused by storage issues.

ALLUXIO DAY III 2021


Video:

Presentation Slides:

Speed up large-scale ML/DL offline inference job with Alluxio from Alluxio, Inc.
