Resources

Blog

Everything you want to know about how to decouple SQL engines from Hive Data Warehouse

Are you using SQL engines, such as Presto, to query an existing Hive data warehouse and running into challenges such as an overloaded Hive Metastore with slow and unpredictable access, unoptimized data formats and layouts (for example, too many small files), or a lack of influence over the existing Hive system and other Hive applications?

On Demand Videos

Optimizing Query Performance by Decoupling Presto and Hive Data Warehouse

ALLUXIO COMMUNITY OFFICE HOUR

Blog

Serving Structured Data in Alluxio Concept

This article introduces Structured Data Management available in the latest Alluxio 2.2.0 release, a new effort to provide further benefits to SQL and structured data workloads using Alluxio.

Blog

Serving Structured Data in Alluxio Example

This article walks through a simple example to illustrate how Structured Data Management, available in the latest Alluxio 2.2.0 release, helps SQL and structured data workloads.

Blog

What's new in Alluxio 2.2

With this release comes the General Availability (GA) of Alluxio Structured Data Services (SDS), the subsystem of Alluxio responsible for managing and transforming structured data, such as databases, tables, and partitions.

On Demand Videos

Bursting Apache Spark Workloads to the Cloud on Remote Data

ALLUXIO COMMUNITY OFFICE HOUR

On Demand Videos

Testing Distributed System at Scale for the Cost of a Large Pizza on AWS

ALLUXIO COMMUNITY OFFICE HOUR

On Demand Videos

Running Presto with Alluxio on Amazon EMR

ALLUXIO COMMUNITY OFFICE HOUR

White Paper

Accelerating analytics & AI in Kubernetes with Alluxio Open Source Data Orchestration

Presentation

CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes Using Alluxio

In the on-prem days, one key performance optimization for Apache Hadoop or Apache Spark workloads was to run tasks on nodes with local HDFS data. However, while adoption of the cloud and Kubernetes makes scaling compute workloads exceptionally easy, HDFS is often not an option. Efficiently accessing data from cloud-native storage services like AWS S3, or even from on-premises HDFS, becomes harder as data locality is lost.

Originating from the UC Berkeley AMPLab, the open source project Alluxio approaches this problem in a new way: it helps move data closer to compute workloads efficiently and on demand, and unifies data across multiple or remote clouds. This webinar will describe the concept and internal mechanism of using the Spark + Alluxio stack in Kubernetes to enhance data locality even when the storage service is external or remote.

Particularly, we will go over:

Why Spark is able to make a locality-aware schedule when working with Alluxio in a K8s environment using the host network
Why a pod running Alluxio can share data efficiently with a pod running Spark on the same host using a domain socket and a host path volume
The roadmap of Alluxio to further improve running analytics jobs like Spark and Presto, including the ongoing closer integration with Presto

On Demand Videos

Tech Talk: Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack

On Demand Videos

Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, NFS, DC OS, & Alluxio

ALLUXIO ONLINE MEETUP

On Demand Videos

Community Office Hour: Hands-on with Alluxio Structured Data Management

ALLUXIO COMMUNITY OFFICE HOUR

On Demand Videos

Community Office Hour: Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

ALLUXIO COMMUNITY OFFICE HOUR

Presentation

Enabling Ultra-fast Presto in the Cloud with Alluxio

PRESTO SUMMIT NYC

This talk describes a stack of open-source projects that serves high-concurrency, low-latency SQL queries on big data in the cloud using Presto with Alluxio. By deploying Alluxio as a data orchestration layer to access cloud object storage (e.g., AWS S3), this architecture greatly enhances the data locality of Presto with distributed, cross-query caching, and thus avoids reading the same data repeatedly from cloud storage.

In addition, since the v2.1 release, Alluxio provides structured data management, which delivers additional performance by going beyond caching the raw bytes of input files or objects to also manage and transform structured data. For example, Alluxio can convert data in raw formats (such as CSV) into a more compact and performant file format (such as Parquet) to accelerate Presto queries by 10x for certain workloads while using much less CPU.

This talk will cover an overview of Alluxio’s core concepts, architecture, and data flow, as well as use cases from internet companies like Walmart, JD.com, and Ryte that run this stack of Presto and Alluxio at scale in production.

On Demand Videos

Tech Talk: Integrating Google Cloud Dataproc with Alluxio for faster performance in the cloud

On Demand Videos

Tech Talk: The Path to Migrating off MapR

Presentation

Accelerating workloads and bursting data with Google Dataproc & Alluxio

BIG DATA APPLICATION MEETUP @ GOOGLE

Google Cloud Dataproc is a popular managed on-demand service to run Spark, Presto, and many other compute workloads. Alluxio, an open source data orchestration technology, helps speed up Dataproc workloads by providing a distributed caching layer within the Dataproc cluster. In addition, Alluxio enables “zero-copy” bursting, allowing users to run compute workloads even on remote data that resides on-prem or in another cloud. In this session, Dipti from Alluxio and Roderick from Google Cloud will share an overview of Alluxio and Google Dataproc and the benefits the two bring together. It will include a demo of initializing a Dataproc cluster with Alluxio to run workloads on remote data.

On Demand Videos

Community Office Hour: Improving Memory Utilization of Spark Jobs Using Alluxio

ALLUXIO COMMUNITY OFFICE HOUR

Presentation

Ultra-fast SQL Analytics using PAS (Presto on Alluxio Stack)

Presto Meetup Hosted @ UBER

This talk describes a stack of open-source projects that serves high-concurrency, low-latency SQL queries on big data in the cloud using Presto with Alluxio. By deploying Alluxio as a data orchestration layer to access cloud object storage (e.g., AWS S3), this architecture greatly enhances the data locality of Presto with distributed, cross-query caching, and thus avoids reading the same data repeatedly from cloud storage.

In addition, in the latest v2.1 release, Alluxio provides structured data management, which delivers additional performance by going beyond caching the raw bytes of input files or objects to also manage and transform structured data. For example, Alluxio can convert data in raw formats (such as CSV) into a more compact and performant file format (such as Parquet) to accelerate Presto queries by 10x for certain workloads while using much less CPU.

This talk will cover an overview of Alluxio’s core concepts, architecture, and data flow, as well as use cases from internet companies like Walmart and JD.com that run this stack of Presto and Alluxio at scale in production.

Blog

Kubernetes Alluxio and the Disaggregated Analytics Stack

TL;DR: First, the news: Alluxio support for K8s Helm charts is now available! K8s is a certified environment for Alluxio. Now the takeaway: Alluxio brings back data locality for the disaggregated analytics stack in K8s. How? Read on.

Presentation

The Practice of Presto & Alluxio in E-Commerce Big Data Platform

JD.com is China’s largest online retailer. It uses Alluxio to support ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. One of its computing frameworks, JDPresto, has gained a 10x performance improvement on average by deploying Alluxio.

On Demand Videos

Tech Talk: How the Development Bank of Singapore solves on-prem compute capacity challenges with cloud bursting

Blog

Data Orchestration Summit Recap and Highlights

We are delighted by the success of the inaugural Data Orchestration Summit on Nov. 7, 2019! Organized by Alluxio, this one-day event sold out with nearly 400 attendees. Data engineers, cloud engineers, and data scientists joined talks by 24 industry leaders from all over the globe, who shared their experiences building cloud-native data and AI platforms. All session recordings and slides are now available.
