Resource Hub



DATA ORCHESTRATION SUMMIT 2019
This hands-on training, run by the creators of Presto and Alluxio, covers how to get started with Presto and Alluxio. Attendees will get hands-on experience launching an EC2 instance, exploring the Alluxio filesystem and cluster status, and running Presto queries on Alluxio to see the performance benefits of using Alluxio in an analytics stack.
Presto is a widely popular SQL query engine that is well suited to interactive SQL analytics. However, when the data is remote or in object stores, performance becomes a challenge. Alluxio can improve Presto’s query performance by serving as a distributed cache layer co-located with Presto. Presto with Alluxio brings together two open source technologies to give you better performance and multi-cloud capabilities for interactive analytic workloads. Coupling Presto’s open source distributed SQL query engine with Alluxio enables true separation of storage and compute while preserving data locality, provides memory-speed response times, and lets queries aggregate data from any file or object store.



ODSC WEST 2019
Cloud storage brings great flexibility in management and cost-efficiency to data scientists, but it also introduces new challenges related to data accessibility and data locality for machine learning applications. For instance, when the input data is stored in remote cloud storage such as AWS S3 or Azure Blob Storage, direct data access is often slow and expensive, while manually moving data to the training clusters can be time-consuming and complicated, and often requires data engineering or ETL pipelines.
This session is designed for data scientists or data engineers who work with remote and possibly multiple data sources in hybrid or multi-cloud environments. We will guide the audience to use Alluxio to greatly simplify the data preparation in these environments, covering the following topics:
- How to set up and create a POSIX endpoint for the Alluxio service to unify file system data access to S3, HDFS and Azure Blob Storage
- How to run Apache Spark to read input from and write output to remote storage with Alluxio as the distributed data caching layer
- How to run TensorFlow to train models on remote input data, accessed as if it were on the local file system
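
The idea behind the POSIX endpoint in the list above can be sketched in Python: once remote storage is exposed as a mounted directory, a training script can read input with ordinary file APIs. This is a minimal sketch, not the session's material. It assumes an Alluxio POSIX (FUSE) mount already exists; the mount path `/mnt/alluxio-fuse` and the function names `list_training_files` and `read_records` are hypothetical, chosen only for illustration.

```python
from pathlib import Path
from typing import Iterator, List

# Hypothetical mount point; assumes the Alluxio POSIX endpoint was mounted here.
ASSUMED_MOUNT_ROOT = "/mnt/alluxio-fuse"


def list_training_files(mount_root: str, suffix: str = ".csv") -> List[Path]:
    """Return all files with the given suffix under the mount, in sorted order."""
    root = Path(mount_root)
    return sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())


def read_records(files: List[Path]) -> Iterator[str]:
    """Yield non-empty lines from each file using plain local-file reads."""
    for path in files:
        with path.open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line
```

Because the code only touches the standard file system interface, the same functions work unchanged against a local directory, which is convenient for testing before pointing them at the mounted path.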



The big data stack has evolved over the past few years with an explosion of data frameworks, starting with MapReduce and expanding to Apache Spark and Presto. The approach to managing and storing data has evolved as well, moving from primarily the Hadoop Distributed File System (HDFS) to newer, cheaper, and easier technologies like object stores. But the design of most object stores inhibits real-time big data and AI workloads running directly on them.
Vitaliy Baklikov and Dipti Borkar explore a different architecture for analytic workloads, particularly those deployed in cloud environments. Alluxio, an open-source virtual distributed file system, provides a unified data access layer for hybrid and multicloud deployments. Alluxio enables distributed compute engines like Spark and Presto, or machine learning frameworks like TensorFlow, to transparently access different persistent storage systems (including HDFS, S3, Azure, etc.) while actively leveraging in-memory cache to accelerate data access.
Vitaliy and Dipti dive into how DBS Bank built a modern big data analytics stack, leveraging an object store as persistent storage even for data-intensive workloads, and how it uses Alluxio to orchestrate data locality and data access for Spark workloads. In addition, deploying Alluxio to access data solves many challenges that cloud deployments bring with separated compute and storage.



At Ryte, we analyze unstructured, semi-structured and structured data for more than one million users worldwide. The whole Ryte platform is built on a scalable architecture to support our heavy load and to let our customers drill down from a high-level overview into the last byte of their websites.
In this presentation, I will show why and how we solve some challenging technical issues, improve the speed, and reduce the costs of our AWS EMR Hadoop and Presto backend with Alluxio to an awesome level!
Topics:
- What is Ryte: a platform to optimize your online marketing
- Requirements for the Ryte-Platform
- Why we use Presto on AWS EMR with S3
- When problems pop up
- How we solve them with Alluxio in a perfect way

For today’s blog post I interviewed Bin Fan, Founding Engineer and VP of Open Source at Alluxio. Bin is the PMC maintainer of the Alluxio open source project. Prior to Alluxio, he worked for Google on the next-generation storage infrastructure.



Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Data storage is migrating from the colocated model (e.g., HDFS) to a more cost-effective and scalable, but often fully disaggregated and remote, data lake model (e.g., S3). This has created a strong need for data orchestration in the cloud, similar to what K8s does for container-based workloads, so that data can be presented in the right layout at the right location for data applications in the cloud. Originally developed from the UC Berkeley AMPLab project “Tachyon”, Alluxio (www.alluxio.io) implements the world’s first open-source data orchestration system in the cloud: a unified access layer for data-driven applications in big data and ML, enabling Spark, Presto or TensorFlow to transparently access different external storage systems while actively leveraging in-memory cache to accelerate data access. In this talk, we will present trends and challenges in the data ecosystem in the cloud era; data engineering in the cloud with data orchestration; and use cases of tech stacks (Presto or TensorFlow) with Alluxio on S3.