This blog was authored by Madan Kumar and Alex Ma and was originally posted on Medium.
As the data ecosystem becomes increasingly complex and disaggregated, data analysts and end users struggle to adapt to and work with hybrid environments. The proliferation of compute applications and storage mediums leads to a hybrid model that we are simply not accustomed to.
In this disaggregated system, data engineers now face a multitude of problems that they must overcome in order to get meaningful insights:
- Enabling connections between the various compute engines and storage systems becomes increasingly complex.
- Performance often suffers from a lack of data locality for compute, a challenge that did not exist in collocated environments (storage and compute together).
- Costs run high, mainly because data is copied multiple times whenever it is needed closer to compute, which leaves storage unoptimized and increasingly saturated.
In this new 2.0 ecosystem, data engineers need a way to leverage and work with hybrid environments while keeping code changes to their applications minimal and using every available storage system to its fullest.
Today, data engineers attempting to work in these hybrid environments have no easy, transparent way to deal with these issues. We often make multiple copies of data across environments in the hope of achieving locality, while being unable to adopt more efficient compute engines due to API incompatibility. We end up overloading storage and failing to fully leverage cheaper alternatives.
Handling these modern workloads requires a solution that solves several problems at once, but above all one that can serve as a virtualization layer between compute and storage. Just as we have orchestration frameworks for technologies like containers, there needs to be an orchestration framework for data. One such open source system is Alluxio (formerly Tachyon), which provides the capabilities of a modern data orchestration solution.
Alluxio provides several features that a data orchestration framework needs in order to succeed in hybrid environments.
Such a framework gives engineers unified access to data regardless of the storage system it resides on. This becomes increasingly necessary when newer compute engines do not natively integrate with a particular storage system; a common interface removes that concern. Alluxio’s API translation lets users keep bringing new technologies into their ecosystem while ensuring a durable, consistent way to connect them. Alluxio’s tiering capability also helps solve the slow-data-access problem while letting you leverage lower-cost storage.
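As a rough sketch of what this unified access looks like in practice, Alluxio’s CLI can mount multiple storage systems under a single namespace. The bucket name, HDFS hostname, and paths below are hypothetical placeholders, not values from this post:

```shell
# Mount an S3 bucket and an HDFS directory under one Alluxio namespace.
# (Illustrative placeholders: example-bucket, namenode, and the paths.)
./bin/alluxio fs mount /mnt/s3-data  s3://example-bucket/datasets
./bin/alluxio fs mount /mnt/hdfs-data hdfs://namenode:8020/user/data

# Compute engines now see both stores through one consistent path scheme:
./bin/alluxio fs ls /mnt/s3-data
./bin/alluxio fs ls /mnt/hdfs-data
```

A Spark or Presto job can then read an `alluxio://` path such as `alluxio://<master-host>:19998/mnt/s3-data/...` without S3- or HDFS-specific client code, and Alluxio’s tiered cache serves frequently accessed data from faster, closer storage.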
While working in hybrid environments can be challenging, it is something we must come to grips with in today’s rapidly evolving data ecosystem. Modern data orchestration frameworks, while not solving the entire problem, have come a long way in making the transition to hybrid that much easier!