Co-hosted by Alluxio and the Uber AI team on March 6, 2025, at Uber's Seattle office and via Zoom, the AI/ML Infra Meetup is a community event for developers building AI, ML, and data infrastructure at scale. Speakers from Uber, Snap, and Alluxio delivered talks sharing insights and real-world examples on LLM training, fine-tuning, and deployment, scalable architecture design, GPU optimization, and building recommendation systems.

Here are the key highlights from each presentation.
1. Deployment, Serving, & Discovery of LLMs at Uber Scale
Presented by: Sean Po, Staff SWE & Tse-Chi Wang, Senior SWE @ Uber
This talk provided a deep dive into how Uber manages its Generative AI Gateway, which powers all generative AI applications across the company.
Key takeaways:
- Uber processes 5M+ AI requests daily through 40+ services built on their Generative AI Gateway
- Their gateway provides a unified interface, enabling access to models from various vendors (OpenAI, Google's Gemini, open-source models like Llama, and in-house fine-tuned models); a minimal sketch of what such a unified interface looks like follows this list
- The centralized approach provides authentication, authorization, logging, metrics, cost tracking, and guardrails
- Uber has developed a sophisticated auto-discovery system in Michelangelo's Control Plane that dynamically detects deployments created outside the gateway
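The talk focused on architecture rather than client code, but the value of a single gateway interface is easy to illustrate. Here is a minimal sketch assuming a hypothetical OpenAI-compatible endpoint; the URL, token, and model names are illustrative stand-ins, not Uber's actual API:

```python
# Minimal sketch of calling a centralized GenAI gateway through one interface.
# The endpoint, token, and model names are hypothetical illustrations.
from openai import OpenAI

client = OpenAI(
    base_url="https://genai-gateway.internal/v1",  # hypothetical gateway endpoint
    api_key="service-token",                       # authn/authz handled centrally by the gateway
)

# The same call shape works whether the gateway routes to a vendor model,
# an open-source model, or an in-house fine-tuned model.
for model in ["gpt-4o", "gemini-1.5-pro", "llama-3-70b-finetuned"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize today's trip anomalies."}],
    )
    print(model, resp.choices[0].message.content)
```

Because every request flows through one client-facing surface, the logging, metrics, cost tracking, and guardrails described above can be applied uniformly regardless of which backend serves the model.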

Particularly interesting was Uber's Assistant Builder, a no-code solution for creating RAG agents. With 540+ weekly active users internally, it enables teams to quickly build AI assistants that can access Uber's internal knowledge bases and data sources. The assistants use a ReAct agent architecture and can be connected to Slack for easier team access.
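For readers unfamiliar with the ReAct pattern, the core idea is a loop in which the model alternates between reasoning, calling a tool, and reading the result. A minimal sketch follows; the `llm` callable, tool names, and text format are hypothetical and are not the Assistant Builder API:

```python
# Minimal sketch of a ReAct-style agent loop: the model emits a Thought plus
# either an Action (tool call) or a final Answer; observations from tools are
# appended to the transcript and the loop repeats. All names are illustrative.
def parse_action(step):
    # Expects a line like "Action: search_kb[onboarding docs]"
    action = step.split("Action:", 1)[1].strip()
    name, arg = action.split("[", 1)
    return name.strip(), arg.rstrip("]")

def react_agent(question, llm, tools, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                   # model output for this step
        transcript += step + "\n"
        if "Answer:" in step:
            return step.split("Answer:", 1)[1].strip()
        if "Action:" in step:
            tool_name, arg = parse_action(step)  # e.g. a knowledge-base search tool
            observation = tools[tool_name](arg)
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."
```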
Refer to the blogs for details: Navigating the LLM Landscape: Uber’s Innovation with GenAI Gateway and Scaling AI/ML Infrastructure at Uber.
2. Optimizing ML Data Access with Alluxio: Preprocessing, Pretraining, & Inference at Scale
Presented by: Bin Fan, VP of Technology @ Alluxio
Bin Fan delivered an insightful talk on data access challenges in ML applications, with particular emphasis on how Alluxio's distributed caching solution helps bridge the gap between storage and compute.
Key takeaways:
- Open-source models like DeepSeek are disrupting the landscape by providing high-quality, free alternatives to proprietary models
- Efficient resource utilization is crucial for model training, as demonstrated by DeepSeek's ability to train high-quality models with relatively modest resources
- KV cache management is becoming a critical optimization for transformer models, especially as context windows expand (a minimal sketch follows this list)
- Alluxio positions itself as a distributed caching layer between storage and compute, particularly valuable for AI workloads with large datasets
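To make the KV-cache point concrete, here is a minimal sketch of autoregressive decoding with Hugging Face Transformers, showing how cached key/value tensors from earlier tokens are reused at each step instead of being recomputed. The model name is just an example, not the specific system discussed in the talk:

```python
# Minimal sketch of KV caching during decoding: keys/values for earlier tokens
# are computed once and reused, so each step only processes the newest token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # example model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
past_key_values = None
generated = []

with torch.no_grad():
    for _ in range(10):
        out = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values           # cache grows with context length
        input_ids = out.logits[:, -1:].argmax(dim=-1)   # greedy next token, fed in alone next step
        generated.append(input_ids)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```

The cache trades memory for compute: as context windows grow, storing and serving these key/value tensors efficiently (across requests and across machines) becomes its own infrastructure problem, which is exactly where caching layers come into play.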

Bin provided an interesting comparison between DeepSeek's 3FS (a parallel file system purpose-built for ML workloads) and Alluxio, explaining how they serve different but complementary roles in the ML infrastructure ecosystem.
If you're interested in learning more about 3FS, join the virtual tech talk, "Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distributed Storage," on Tuesday, April 1, at 11 am PT.
3. How Uber Optimizes LLM Training and Fine-tuning
Presented by: Chongxiao Cao, Senior SWE @ Uber
Chongxiao Cao from Uber's Michelangelo platform team shared valuable insights into Uber's approach to optimizing LLM training and fine-tuning workflows.
Key takeaways:
- Uber has developed a robust MLOps stack integrating popular open-source tools like Hugging Face Transformers, Ray, DeepSpeed, and Flash Attention (see the sketch after this list for how the Transformers and DeepSpeed pieces compose)
- Their architecture uses a Ray-based distributed system with DeepSpeed for model parallelism to efficiently train and fine-tune large models
- The system supports multiple storage backends, including GCS, HDFS, and Uber's proprietary blob storage
- Uber is increasingly incorporating LLMs into their recommendation systems by adding semantic understanding to traditional recommendation architectures
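The integration shown on stage is Uber-internal, but the open-source pieces compose roughly as follows. This is a minimal sketch, with the base model, data file, and DeepSpeed config used as placeholders rather than Uber's Michelangelo configuration:

```python
# Minimal sketch of fine-tuning with Hugging Face Transformers + DeepSpeed.
# Model name, dataset path, and ZeRO config are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"                  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("json", data_files="train.jsonl")["train"]   # placeholder data
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                      remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="ckpts",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed="ds_zero3.json",   # ZeRO sharding config; launch with `deepspeed train.py`
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # builds LM labels
).train()
```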

Particularly interesting was their agile development environment using Jupyter notebooks on Ray clusters, allowing engineers to modify code on the fly without rebuilding Docker images - a significant productivity boost for ML engineers.
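Shipping local code to a cluster at connect time, rather than baking it into an image, is supported out of the box by Ray's runtime environments. A minimal sketch, with the cluster address, extra dependencies, and module name as placeholders:

```python
# Minimal sketch of iterating against a remote Ray cluster without rebuilding a
# Docker image: the local working directory is shipped to workers as a runtime
# environment each time the driver (for example, a notebook) connects.
import ray

ray.init(
    address="ray://head-node:10001",            # placeholder cluster address
    runtime_env={
        "working_dir": ".",                     # upload the current project directory
        "pip": ["transformers", "deepspeed"],   # extra deps layered on top of the base image
    },
)

@ray.remote
def tokenize_shard(path):
    from my_project.preprocess import tokenize_file   # hypothetical local module shipped above
    return tokenize_file(path)

results = ray.get([tokenize_shard.remote(p) for p in ["s3://bucket/shard-000.jsonl"]])
```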
Reference: Open Source and In-House: How Uber Optimizes LLM Training.
4. Building Production Platform for Large-Scale Recommendation Applications
Presented by: Xu Ning, Director of Engineering, AI Platform @ Snap
Xu Ning delivered the final talk, providing a comprehensive overview of the unique challenges in building and scaling recommendation systems compared to LLM applications.
Key takeaways:
- Recommendation models often exceed LLMs in both data consumption (petabytes vs. terabytes) and model size (up to 100TB), presenting unique scaling challenges
- Unlike LLMs, recommendation systems require extremely frequent updates to incorporate new content and user behaviors
- The architecture typically involves multiple filtering stages, from retrieval to heavy ranking, to efficiently handle billions of potential items (a minimal sketch follows this list)
- Online training is crucial for recommendation systems to quickly learn from new user behaviors and content
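The talk did not include code, but the funnel structure can be summarized in a few lines. Here is a minimal sketch of the retrieval-then-ranking cascade; all component names and candidate counts are hypothetical stand-ins, not Snap's production services:

```python
# Minimal sketch of a multi-stage recommendation funnel: cheap retrieval narrows
# billions of candidates to thousands, a light ranker narrows further, and the
# expensive heavy ranker scores only the survivors.
def recommend(user, retriever, light_ranker, heavy_ranker, k=50):
    # Stage 1: approximate retrieval (e.g. ANN search over embeddings) from the full corpus.
    candidates = retriever.retrieve(user, top_k=10_000)

    # Stage 2: light ranking with a small model and cheap features.
    candidates = sorted(candidates, key=lambda item: light_ranker.score(user, item),
                        reverse=True)[:1_000]

    # Stage 3: heavy ranking with the full model (large embedding tables, rich features).
    return sorted(candidates, key=lambda item: heavy_ranker.score(user, item),
                  reverse=True)[:k]
```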

Xu walked through Snap's approach to handling the massive embedding tables that dominate recommendation model size, using parameter server architectures to distribute the workload. He also highlighted the high-fanout nature of recommendation inference, where thousands of items must be scored for each request.
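To illustrate why embedding tables dominate model size and how a parameter server shards them, here is a minimal sketch; the hashing scheme and class names are illustrative, not Snap's implementation:

```python
# Minimal sketch of sharding a huge embedding table across parameter servers.
# Each sparse ID is hashed to a shard, and workers fetch only the rows they need
# per request, so no single host has to hold the full multi-terabyte table.
import numpy as np

class EmbeddingShard:
    """One parameter server's slice of the embedding table (illustrative)."""
    def __init__(self, dim=64):
        self.dim = dim
        self.rows = {}                      # sparse id -> embedding row

    def lookup(self, ids):
        return np.stack([self.rows.setdefault(i, np.zeros(self.dim)) for i in ids])

class ShardedEmbedding:
    def __init__(self, num_shards=8, dim=64):
        self.shards = [EmbeddingShard(dim) for _ in range(num_shards)]

    def lookup(self, ids):
        # Route each id to its shard by hash; real systems batch lookups per
        # shard to reduce RPC fanout.
        rows = [self.shards[hash(i) % len(self.shards)].lookup([i])[0] for i in ids]
        return np.stack(rows)

table = ShardedEmbedding()
features = table.lookup([101, 202, 303])    # per-request fetch of only the needed rows
print(features.shape)                       # (3, 64)
```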
Read more on how Snap powers its recommendation applications: Introducing Bento, Snap's ML Platform.
Final Thoughts
This meetup provided a fascinating glimpse into how leading tech companies are tackling AI infrastructure challenges at scale. A few common themes emerged across all presentations:
- The scale is staggering - Whether it's Uber's 5 million daily AI requests or Snap's petabyte-scale training data, these systems operate at truly massive scale
- Open-source LLMs are disrupting - DeepSeek emerged as a major force in the GenAI landscape, making high-quality models accessible to much broader audiences
- Infrastructure is evolving rapidly - The systems described are constantly being refined to handle new models, larger data volumes, and more complex workloads
As AI continues to transform industries, the infrastructure supporting these systems will only grow more sophisticated. This meetup provided valuable insights into the current state of the art and the direction it's heading.
Stay tuned for future events where we'll continue to explore the cutting-edge AI and machine learning infrastructure!