Inside DeepSeek 3FS: A Deep Dive into AI-Optimized Distributed Storage

April 1, 2025

Tuesday, April 1, 11am PT

DeepSeek’s recent announcement of the Fire-Flyer File System (3FS) has sparked excitement across the AI infrastructure community, promising a breakthrough in how machine learning models access and process data.

In this webinar, an expert in distributed systems and AI infrastructure will take you inside DeepSeek 3FS, a file system purpose-built for large files and high-bandwidth workloads. We’ll break down how 3FS optimizes data access and speeds up AI workloads, and examine the design tradeoffs made to maximize throughput.

In this webinar, you’ll learn how 3FS works under the hood, including:

✅ The system architecture

✅ Core software components

✅ Read/write flows

✅ Data distribution/placement algorithms

✅ Cluster/node management and disaster recovery

Whether you’re an AI researcher, ML engineer, or infrastructure architect, this deep dive will give you the technical insights you need to determine if 3FS is the right solution for you.

Speaker Bio

Stephen Pu, Staff Software Engineer at Alluxio, has over 15 years of experience in software R&D for data centers and distributed storage systems. He has been involved in the core product development and design of large-scale distributed data platforms at IBM, HPE, and Fortinet. Stephen has deep expertise in the performance, scalability, and reliability of distributed data systems, with a strong understanding of architectural design in these areas.

Events

Tech Talk: How Coupang Leverages Distributed Cache to Accelerate Search & Recommendation Model Training

Tuesday, April 22, 11am PT

Coupang is a leading e-commerce company in South Korea, with over 50,000 employees and $20+ billion in annual revenue. Coupang's AI platform team builds and manages a large-scale AI platform in AWS for machine learning engineers to train models that enhance and customize product search results and product recommendations for its 100+ million customers.

As the search and recommendation models evolve, optimizing the underlying infrastructure for AI/ML workloads is essential for the e-commerce business. Coupang's platform team actively sought to improve their model training pipeline to boost machine learning engineers' productivity, publish models to production faster, and reduce operational costs.

Coupang focused on addressing several key areas:

Shortening data preparation and model training time
Improving GPU utilization in training clusters in different regions
Reducing S3 API and egress costs incurred from copying large training datasets across regions
Simplifying the operational complexity of storage system management

In this tech talk, Hyun Jung Baek, Staff Backend Engineer at Coupang, will share best practices for using a distributed cache to power search and recommendation model training infrastructure.

Hyun will discuss:

How Coupang builds a world-class, large-scale AI platform for machine learning engineers to deliver better search and recommendation models
How adding distributed caching to their multi-region AI infrastructure improves GPU utilization, shortens end-to-end training time, and significantly reduces cross-region data transfer costs
How to simplify platform operations and easily deploy the same architecture to new GPU clusters

About the Speaker

Hyun Jung Baek is a Staff Backend Engineer at Coupang.

What’s New in Alluxio AI: 3X Faster Checkpoint File Creation, New Cache Eviction Policies, Python SDK Enhancements, and More

Join us to learn about the latest release of Alluxio Enterprise AI. In this webinar, we’ll provide an overview of the new features and capabilities of Alluxio Enterprise AI, built to accelerate AI workloads and maximize GPU utilization.

Key highlights include:

New caching mode accelerates AI checkpoints
Advanced cache eviction policies provide fine-grained control
Python SDK integrations enhance AI framework compatibility
A demo of Alluxio accelerating AI training workloads in AWS

Accelerating AI: Alluxio 101

In the rapidly evolving landscape of AI and machine learning, Platform and Data Infrastructure teams face critical challenges in building and managing large-scale AI platforms. Performance bottlenecks, limited platform scalability, and GPU scarcity all make it harder to support large-scale model training and serving.

In this talk, we will introduce how Alluxio helps Platform and Data Infrastructure teams deliver faster, more scalable platforms to ML Engineering teams developing and training AI models. Alluxio’s highly distributed cache accelerates AI workloads by eliminating data loading bottlenecks and maximizing GPU utilization. Customers report up to 4x faster training performance with high-speed access to petabytes of data spread across billions of files, regardless of persistent storage type or proximity to GPU clusters. Alluxio’s architecture lowers data infrastructure costs, increases GPU utilization, and enables workload portability for navigating GPU scarcity challenges.
