Orchestrating Data for the Cloud World with Alluxio 2.0

July 11, 2019

Haoyuan Li

Today, I’m thrilled to announce the GA of Alluxio 2.0, Alluxio’s biggest release to date, with over 900 PRs (see our Release Notes & Release Blog). Thank you to the 1000+ open source developers, our amazing team, and the users, customers, and partners who together made this possible! 2.0 is a major step towards realizing our vision of building an open source implementation of a Data Orchestration system for analytics and machine learning in the cloud.

In this blog, I will share the motivation behind starting this project back at UC Berkeley AMPLab, and how the system has evolved as the broader ecosystem transitioned into the cloud world.

Back in 2013, I was a Ph.D. student at UC Berkeley AMPLab advised by Professors Ion Stoica and Scott Shenker. By then, Hadoop was dominating the big data ecosystem and becoming the de facto industry solution. Meanwhile, researchers in AMPLab had started to build the Berkeley Data Analytics Stack (or BDAS), which successfully spawned widely popular open-source projects like Apache Spark and Apache Mesos. With an intrinsic interest in data and distributed systems, I co-created the Alluxio open source project (formerly named Tachyon) as the data layer in BDAS.

Initially, Tachyon was commonly used together with Apache Spark to save and share off-heap RDDs (Resilient Distributed Datasets) in memory. Soon it became clear that this research project had great potential as a standardized data access layer across multiple storage systems. Whether with Hadoop or BDAS, running analytics in a data warehouse was the assumption and focus at that time; the concept of cloud computing was more a buzzword than a reality for most enterprises.

The data ecosystem has evolved dramatically since I founded the company in 2015. Today, most organizations are either already in the cloud (whether single, hybrid, or multi-cloud) or in the process of adopting a cloud strategy. Unlike data served to colocated analytics jobs in traditional data warehouses, data in the cloud is more distant (e.g., transferred from S3), more siloed (e.g., spread across multiple regions or storage services), and often subject to large variance in performance.

Designed to provide a data abstraction that decouples compute and storage, Alluxio is ideally positioned in the cloud world as an orchestration platform for data (much as Kubernetes is the orchestration platform for containers). It enables data engineers to run analytics and AI/ML workloads on clouds of their choice with orders of magnitude higher performance, and to interact with data on demand without having to worry about where the data resides or the performance implications.

Today, Alluxio is deployed and trusted by industry-leading companies such as China Unicom, Development Bank of Singapore, Tencent, and many more. Some of the largest deployments have more than 1,000 nodes in a single Alluxio cluster, powering critical infrastructure globally. At the same time, our community has grown to 1000+ contributors, and our software can handle billions of files and manage petabyte-scale data.

I am more excited today than ever. Alluxio 2.0 is a big step towards realizing the vision of being the data orchestration layer that enables new technology stacks and helps organizations unlock the power of data for all. You are welcome to download the software and try it out!

Enjoy hacking, creating, and cheers to the future!
