This is a guest blog by Ashwin Sinha, republished from the original blog post.
This blog introduces Wormhole, an open-source Dockerized solution for deploying Presto and Alluxio clusters for blazing-fast analytics on a file system (we use S3, GCS, and OSS). When it comes to analytics, people are generally hands-on with SQL and like to analyse data that resides in a warehouse (for example, a MySQL database). But as data grows, these stores start to struggle, and the need arises to get results in the same or a shorter time frame. Distributed computing solves this, and Presto is designed exactly for that. Paired with Alluxio, it gets even faster. That's what Wormhole is all about.
Here is the high-level architecture diagram of the solution:
Let us explain each component in the order in which they should be set up in Wormhole:
- Consul — Consul is a service networking solution to connect and secure services across any runtime platform and public or private cloud. In our setup it lets a container be identified by its container ID so that other containers can reach it; a small service-registration sketch follows this list. Setup instructions here.
- Docker — Docker is a set of platform-as-a-service products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries, and configuration files; they can communicate with each other through well-defined channels. In our setup, every service runs as a Docker container; a container-launch sketch also follows the list. Setup instructions here.
- Alluxio Master — Alluxio masters should be deployed in High Availability (HA). They manage the Alluxio file system metadata (the namespace and the locations of data blocks on the workers) and coordinate the workers. Setup instructions here.
- Alluxio Worker — Alluxio workers are the actual cache storage of Alluxio. When data is queried through the Alluxio file system (FS), a worker fetches it from the underlying FS (S3, GCS, or any configured store) and keeps it in RAM, evicting in LRU fashion. The next time, it serves the data blocks from its own cache without going back to the underlying FS. Setup instructions here.
- Hive metastore — The Hive metastore holds the metadata of all Hive tables. In our setup, the metastore is backed by MySQL and stores the schemas of tables whose data lives in Alluxio. Setup instructions here.
- Presto Coordinator — Presto coordinators should be deployed in High Availability (HA). They are responsible for building the query execution plan, distributing the query to the workers, then joining the individual results and sending them back to the requester. Setup instructions here.
- Presto Worker — Presto workers are responsible for crunching the data read from the underlying Alluxio. They pull the data, perform operations on it, and send their individual results to the coordinator. Setup instructions here.
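To make the service-discovery piece concrete, here is a minimal sketch of registering a service with Consul using the python-consul package. The agent hostname, service name, address, and port below are illustrative assumptions, not the exact values Wormhole uses.

```python
import consul

# Connect to a Consul agent (hostname and port are assumptions).
c = consul.Consul(host="consul-agent", port=8500)

# Register a Presto worker so other containers can discover it by service name.
c.agent.service.register(
    "presto-worker",
    service_id="presto-worker-1",  # hypothetical ID; Wormhole derives this from the container ID
    address="10.0.0.12",
    port=8080,
)

# Look the service up again through the catalog.
index, services = c.catalog.service("presto-worker")
print(services)
```

And since every service runs as a Docker container, containers can also be launched programmatically. Below is a hedged sketch using the Docker SDK for Python; the network name, image tag, and environment variable are assumptions for illustration, not Wormhole's actual configuration.

```python
import docker
from docker.errors import NotFound

client = docker.from_env()

# Create (or reuse) a user-defined bridge network shared by the cluster.
try:
    network = client.networks.get("wormhole-net")
except NotFound:
    network = client.networks.create("wormhole-net", driver="bridge")

# Launch an Alluxio worker container attached to that network; other containers
# on "wormhole-net" can then reach it by the name "alluxio-worker-1".
container = client.containers.run(
    "alluxio/alluxio:latest",                                   # assumed image tag
    name="alluxio-worker-1",
    network="wormhole-net",
    environment={"ALLUXIO_MASTER_HOSTNAME": "alluxio-master"},  # assumed variable
    detach=True,
)
print(container.name, container.status)
```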
Apart from the above components, we require a Zookeeper quorum, which makes the Alluxio master and Presto coordinator highly available (HA). For complete documentation on setup, please refer here.
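A quick connectivity check against the ZooKeeper quorum can save a lot of debugging before bringing up the HA masters and coordinators. Here is a minimal sketch using the kazoo client; the host list is an assumption, and the znode paths that Alluxio and Presto create depend on their leader-election configuration.

```python
from kazoo.client import KazooClient

# Connect to the quorum (hostnames are placeholders for your ZooKeeper nodes).
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# List the top-level znodes; Alluxio masters and Presto coordinators create
# their own paths here once leader election is enabled.
print(zk.get_children("/"))

zk.stop()
```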
Time for some action
Now that we have set up Presto on top of Alluxio, how do we make it available for everyone to use? The answer is tools like Metabase, which provide connectivity to Presto. We just needed to add the appropriate configuration and it works like a charm for all sorts of analysis.
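If you prefer code over a BI tool, the same endpoint can be queried programmatically. Below is a hedged sketch using the presto-python-client (prestodb) package: it registers a table whose files sit in Alluxio and then queries it. The coordinator hostname, Alluxio path, table name, and columns are illustrative assumptions.

```python
import prestodb

# Connect to the Presto coordinator through the hive catalog.
conn = prestodb.dbapi.connect(
    host="presto-coordinator",   # assumed coordinator hostname
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# Register an external table backed by files in Alluxio (path and schema are made up).
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id bigint,
        event_time timestamp
    )
    WITH (
        external_location = 'alluxio://alluxio-master:19998/data/events',
        format = 'PARQUET'
    )
""")
cur.fetchall()  # the client executes lazily, so fetch to run the DDL

# Query it; Presto reads through Alluxio, which caches the blocks it pulls from the under store.
cur.execute("SELECT count(*) FROM events")
print(cur.fetchall())
```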
Presto and Alluxio also provide UIs to track the current state of things, and that helps a lot.
Next Plans
The next focus will mostly be on making the solution self-serve through a user interface and making it self-scalable (possibly by deploying on Kubernetes).
ABOUT THE AUTHOR
An engineer at heart and a Data Engineer by profession, who loves to learn and play with distributed systems. An open-source enthusiast who loves to contribute back to the community; you can find his work on GitHub and Medium. Also a traveller who loves to explore distant places and enjoys long bike rides.