On-Demand Videos
In this talk, Sandeep Joshi, , Senior Manager at NVIDIA, shares how to accelerate the data access between GPU and storage for AI. Sandeep will dive into two options: CPU- initiated GPUDirect Storage and GPU-initiated SCADA.
Bin Fan, VP of Technology at Alluxio, introduces how Alluxio, a software layer transparently sits between application and S3 (or other object stores), provides sub-ms time to first byte (TTFB) solution, with up to 45x lower latency.
In this talk, Pritish Udgata from Adobe provides a comprehensive overview of implementation challenges and solutions for LLM agents.
Topic include:
- CoT vs RAG vs Agentic AI
- Anatomy of an agent
- Single Agent with MCP
- Multi Agents with A2A
- Implementation Challenges and Solutions
.png)
In this talk, Sandeep Joshi, , Senior Manager at NVIDIA, shares how to accelerate the data access between GPU and storage for AI. Sandeep will dive into two options: CPU- initiated GPUDirect Storage and GPU-initiated SCADA.
Bin Fan, VP of Technology at Alluxio, introduces how Alluxio, a software layer transparently sits between application and S3 (or other object stores), provides sub-ms time to first byte (TTFB) solution, with up to 45x lower latency.
In this talk, Pritish Udgata from Adobe provides a comprehensive overview of implementation challenges and solutions for LLM agents.
Topic include:
- CoT vs RAG vs Agentic AI
- Anatomy of an agent
- Single Agent with MCP
- Multi Agents with A2A
- Implementation Challenges and Solutions

Watch this on-demand video to learn about the latest release of Alluxio Enterprise AI. In this webinar, discover how Alluxio AI 3.7 eliminates cloud storage latency bottlenecks with breakthrough sub-millisecond performance, delivering up to 45× faster data access than S3 Standard without changing your code. Alluxio AI 3.7 is also packed with new features designed to supercharge your AI infrastructure while keeping your data secure.Key highlights include:
- Alluxio Ultra Low Latency Caching for Cloud Storage
- Role-Based Access Control (RBAC) for S3 Access
- 5X Faster Cache Preloading with Alluxio Distributed Cache Preloader
- FUSE Non-Disruptive Upgrade
- Other New Features for Alluxio Admins

Real-time OLAP databases are optimized for speed and often rely on tightly coupled storage-compute architectures using disks or SSDs. Decoupled architectures, which use cloud object storage, introduce an unavoidable tradeoff: cost efficiency at the expense of performance. This makes them unsuitable for databases that need to provide low-latency, real-time analytics, especially the new wave of LLM-powered dashboards, retrieval-augmented generation (RAG), and vector-embedding searches that thrive only when fresh data is milliseconds away. Can we achieve both cost efficiency and performance?
In this talk, we’ll explore the engineering challenges of extending Apache Pinot—a real-time OLAP system—onto cloud object storage while still maintaining sub-second P99 latencies.
We’ll dive into how we built an abstraction in Apache Pinot to make it agnostic to the location of data. We’ll explain how we can query data directly from the cloud (without needing to download the entire dataset, as with lazy-loading) while achieving sub-second latencies. We’ll cover the data fetch and optimization strategies we implemented, such as pipelining fetch and compute, prefetching, selective block fetches, index pinning, and more. We'll also share our latest work about integration with open table formats like iceberg, and how we will continue to achieve fast analytics directly on parquet files by implementing all the same techniques that apply to tiered storage.

The data lake is a fantastic, low-cost place to put data at rest for offline analytics, but we've built it under the terms of a terrible bargain: all that cheap storage at scale was a great thing, but we gave up schema management and transactions along the way. Apache Iceberg has emerged as king of the Open Table Formats to fix this very problem.
Built on the foundation of Parquet files, Iceberg adds a simple yet flexible metadata layer and integration with standard data catalogs to provide robust schema support and ACID transactions to the once ungoverned data lake. In this talk, we'll build Iceberg up from the basics, see how the read and write path work, and explore how it supports streaming data sources like Apache Kafka™. Then we'll see how Confluent's Tableflow brings Kafka together with open table formats like Iceberg and Delta Lake to make operational data in Kafka topics instantly visible to the data lake without the usual ETL—unifying the operational/analytical divide that has been with us for decades.

Storing data as Parquet files on S3 is increasingly used not just as a data lake but also as a lightweight feature store for ML training/inference or a document store for RAG. However, querying petabyte- to exabyte-scale data lakes directly from cloud object storage remains notoriously slow (e.g., latencies ranging from hundreds of milliseconds to several seconds on AWS S3).
In this talk, we show how architecture co-design, system-level optimizations, and workload-aware engineering can deliver over 1000× performance improvements for these workloads—without changing file formats, rewriting data paths, or provisioning expensive hardware.
We introduce a high-performance, low-latency S3 proxy layer powered by Alluxio, deployed atop hyperscale data lakes. This proxy delivers sub-millisecond Time-to-First-Byte (TTFB)—on par with Amazon S3 Express—while preserving compatibility with standard S3 APIs. In real-world benchmarks, a 50-node Alluxio cluster sustains over 1 million S3 queries per second, offering 50× the throughput of S3 Express for a single account, with no compromise in latency.
Beyond accelerating access to Parquet files byte-to-byte, we also offload partial Parquet processing from query engines via a pluggable interface into Alluxio. This eliminates the need for costly index scans and file parsing, enabling point queries with 0.3 microseconds latency and up to 3,000 QPS per instance (measured using a single-thread)—a 100× improvement over traditional query paths.
Nilesh Agarwal, Co-founder & CTO at Inferless, shares insights on accelerating LLM inference in the cloud using Alluxio, tackling key bottlenecks like slow model weight loading from S3 and lengthy container startup time. Inferless uses Alluxio as a three-tier cache system that dramatically cuts model load time by 10x.

In this talk, Jingwen Ouyang, Senior Product Manager at Alluxio, will share how Alluxio make it easy to share and manage data from any storage to any compute engine in any environment with high performance and low cost for your model training, model inference, and model distribution workload.

Storing data as Parquet files on cloud object storage, such as AWS S3, has become prevalent not only for large-scale data lakes but also as lightweight feature stores for training and inference, or as document stores for Retrieval-Augmented Generation (RAG). However, querying petabyte-to-exabyte-scale data lakes directly from S3 remains notoriously slow, with latencies typically ranging from hundreds of milliseconds to several seconds.
In this webinar, David Zhu, Software Engineering Manager at Alluxio, will present the results of a joint collaboration between Alluxio and a leading SaaS and data infrastructure enterprise that explored leveraging Alluxio as a high-performance caching and acceleration layer atop AWS S3 for ultra-fast querying of Parquet files at PB scale.
David will share:
- How Alluxio delivers sub-millisecond Time-to-First-Byte (TTFB) for Parquet queries, comparable to S3 Express One Zone, without requiring specialized hardware, data format changes, or data migration from your existing data lake.
- The architecture that enables Alluxio’s throughput to scale linearly with cluster size, achieving one million queries per second on a modest 50-node deployment, surpassing S3 Express single-account throughput by 50x without latency degradation.
- Specifics on how Alluxio offloads partial Parquet read operations and reduces overhead, enabling direct, ultra-low-latency point queries in hundreds of microseconds and achieving a 1,000x performance gain over traditional S3 querying methods.
Speaker: David Zhu
David Zhu is a Software Engineer Manager at Alluxio. At Alluxio, David focuses on metadata management and end-to-end performance benchmarking and optimizations. Prior to that, David completed his Ph.D. from UC Berkeley, with a focus on distributed data management systems and operating systems for the data center. David also holds a Bachelor of Software Engineering from the University of Waterloo.

Coupang is a leading e-commerce company in South Korea, with over 50,000 employees and $20+ billion in annual revenue. Coupang's AI platform team builds and manages a large-scale AI platform in AWS for machine learning engineers to train models that enhance and customize product search results and product recommendations for its 100+ million customers.
As the search and recommendation models evolve, optimizing the underlying infrastructure for AI/ML workloads is essential for the e-commerce business. Coupang's platform team actively sought to improve their model training pipeline to boost machine learning engineers' productivity, publish models to production faster, and reduce operational costs.
Coupang focused on addressing several key areas:
- Shortening data preparation and model training time
- Improving GPU utilization in training clusters in different regions
- Reducing S3 API and egress costs incurred from copying large training datasets across regions
- Simplifying the operational complexity of storage system management
In this tech talk, Hyun Jung Baek, Staff Backend Engineer at Coupang, will share best practices for leveraging distributed caching to power search and recommendation model training infrastructure.
Hyun will discuss:
- How Coupang builds a world-class large-scale AI platform for machine learning engineers to deliver better search and recommendation models
- How adding distributed caching to their multi-region AI infrastructure improves GPU utilization, accelerates end-to-end training time, and significantly reduces cross-region data transfer costs.
- How to simplify platform operations and to easily deploy the same architecture to new GPU clusters.
About the Speaker
Hyun Jung Baek is a Staff Backend Engineer at Coupang.
This video is originally published on TechArena.
At NVIDIA GTC 2025, Bin Fan from Alluxio and Scott Shadley from Solidigm tackled the growing need for decoupled storage and compute in AI infrastructure. They explained how Alluxio's caching layer enables fast, scalable and reliable infrastructure, accelerating AI training and inferencing seamlessly across regions and platforms.