Bay Area Meetup: Alluxio 2.0 Deep Dive and Near Real-time Analytics with Spark
July 23, 2019
By 
Calvin Jia

We are excited to present Alluxio 2.0 to our community. The goal of Alluxio 2.0 was to significantly enhance data accessibility with improved APIs, expand use cases supported to include active workloads as well as better metadata management and availability to support hyperscale deployments. Alluxio 2.0 Preview Release is the first major milestone on this path to Alluxio 2.0 and includes many new features.

In this talk, I will give an overview of the motivations and design decisions behind the major changes in the Alluxio 2.0 release. We will touch on the key features:

– New off-Heap metadata storage leveraging embedded RocksDB to scale up Alluxio to handle a billion files;
– Improved Alluxio POSIX API to support legacy and machine-learning workloads;
– A fully contained, distributed embedded journal system based on RAFT consensus algorithm in high availability mode;
– A lightweight distributed compute framework called “Alluxio Job Service” to support Alluxio operations such as active replication, async-persist, cross mount move/copy and distributed loading;
– Support for mounting and connecting to any number of HDFS clusters of different versions at the same time;
Active file system sync between Alluxio and HDFS as under storage.

Alluxio 2.0 Preview Release Deep Dive

We are excited to present Alluxio 2.0 to our community. The goal of Alluxio 2.0 was to significantly enhance data accessibility with improved APIs, expand use cases supported to include active workloads as well as better metadata management and availability to support hyperscale deployments. Alluxio 2.0 Preview Release is the first major milestone on this path to Alluxio 2.0 and includes many new features.

In this talk, I will give an overview of the motivations and design decisions behind the major changes in the Alluxio 2.0 release. We will touch on the key features:

– New off-Heap metadata storage leveraging embedded RocksDB to scale up Alluxio to handle a billion files;
– Improved Alluxio POSIX API to support legacy and machine-learning workloads;
– A fully contained, distributed embedded journal system based on RAFT consensus algorithm in high availability mode;
– A lightweight distributed compute framework called “Alluxio Job Service” to support Alluxio operations such as active replication, async-persist, cross mount move/copy and distributed loading;
– Support for mounting and connecting to any number of HDFS clusters of different versions at the same time;
Active file system sync between Alluxio and HDFS as under storage.

Video:

Presentation slides:

Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio from Alluxio, Inc.

Real-time Data Processing for Sales Attribution Analysis with Alluxio, Spark and Hive at VIPShop

Vipshop is a leading eCommerce company in China with over 15 million active daily users. Our ETL jobs primarily run against data on HDFS, which can no longer meet the increasing swiftness and stability demand for certain real-time jobs. In this talk, I will explain how we’ve replaced HDFS with Memory+ HDD managed by Alluxio to speed up data accesses for all our Sales Attribution applications running on Spark and Hive, this system has been in production for more than 2 years. As more old fashion ETL SQLs are being converted into real-time jobs, leveraging Alluxio for caching has become one of the widely considered performance tuning solution. I will share our criteria when selecting use cases that can effectively get a boost by switching to Alluxio.

Our future work includes using Alluxio as an abstraction layer for the \tmp\ directory in our main Hadoop clusters, and we are also considering Alluxio to cache the hot data in our 600+ node Presto clusters.

Bio:
Wanchun Wang is the Chief Architect and has been with VIPShop for over 5 years and his interests focus on processing large amounts of data such as building streaming pipelines, optimizing ETL applications, and designing in-house ML & DL platforms. He is currently managing big data teams that are responsible for batch, real-time, and data warehouse systems.

Video:

Acknowledgment:
Our event partner AICamp (http://www.xnextcon.com) is a global online platform for engineers, data scientists to learn and practice AI, ML, DL, Data Science, with 80000+ developers, and 40+ cities local study groups around the world.

Complete the form below to access the full overview:

Videos

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer