Products
Efficient Data Loading for Model Training on AWS
October 3, 2023
As enterprises race to roll out artificial intelligence, often overlookModel training requires extensive computational and GPU resources. When training models on AWS, loading data from S3 often becomes a major bottleneck, wasting valuable GPU cycles. Optimizing data loading can greatly reduce GPU idle time and increase GPU utilization.
In this webinar, Greg Palmer will discuss best practices for efficient data loading during model training on AWS. He will demonstrate how to use Alluxio on EKS as a distributed cache to accelerate PyTorch training jobs that read datasets from S3. This architecture significantly improves the utilization of GPUs from 30% to 90%+, archives ~5x faster training, and lower cloud storage costs.
What you will learn:
- The challenges of feeding data-hungry GPUs in the cloud
- How to accelerate model training by optimizing data loading on AWS
- The reference architecture for running PyTorch jobs with Alluxio cache on EKS while reading data from S3, with benchmark results of training ResNet50 and BERT
- How to use TensorBoard to identify bottlenecks in GPU utilization
As enterprises race to roll out artificial intelligence, often overlookModel training requires extensive computational and GPU resources. When training models on AWS, loading data from S3 often becomes a major bottleneck, wasting valuable GPU cycles. Optimizing data loading can greatly reduce GPU idle time and increase GPU utilization.
In this webinar, Greg Palmer will discuss best practices for efficient data loading during model training on AWS. He will demonstrate how to use Alluxio on EKS as a distributed cache to accelerate PyTorch training jobs that read datasets from S3. This architecture significantly improves the utilization of GPUs from 30% to 90%+, archives ~5x faster training, and lower cloud storage costs.
What you will learn:
- The challenges of feeding data-hungry GPUs in the cloud
- How to accelerate model training by optimizing data loading on AWS
- The reference architecture for running PyTorch jobs with Alluxio cache on EKS while reading data from S3, with benchmark results of training ResNet50 and BERT
- How to use TensorBoard to identify bottlenecks in GPU utilization
Video:
Presentation slides:
Videos:
Presentation Slides:
Complete the form below to access the full overview:
.png)
Videos
AI/ML Infra Meetup | AI at scale Architecting Scalable, Deployable and Resilient Infrastructure

Pratik Mishra delivered insights on architecting scalable, deployable, and resilient AI infrastructure at scale. His discussion on fault tolerance, checkpoint optimization, and the democratization of AI compute through AMD's open ecosystem resonated strongly with the challenges teams face in production ML deployments.
September 30, 2025
AI/ML Infra Meetup | Alluxio + S3 A Tiered Architecture for Latency-Critical, Semantically-Rich Workloads

In this talk, Bin Fan, VP of Technology at Alluxio, presents on building tiered architectures that bring sub-millisecond latency to S3-based workloads. The comparison showing Alluxio's 45x performance improvement over S3 Standard and 5x over S3 Express One Zone demonstrated the critical role the performance & caching layer plays in modern AI infrastructure.
September 30, 2025
AI/ML Infra Meetup | Achieving Double-Digit Millisecond Offline Feature Stores with Alluxio

In this talk, Greg Lindstrom shared how Blackout Power Trading achieved double-digit millisecond offline feature store performance using Alluxio, a game-changer for real-time power trading where every millisecond counts. The 60x latency reduction for inference queries was particularly impressive.
September 30, 2025