We are in the early stages of the data revolution. Organizations are racing to build data-driven cultures and innovate on data-driven applications. These applications impact many facets of our lives, from the way we get to work to how we are medically diagnosed. However, the value of this data is far from fully realized, and the speed of innovation can be dramatically improved. We believe the critical missing piece is a data orchestration layer.
The current pace of innovation is hindered by the need to reinvent the wheel every time an application must access data efficiently. When an engineer or scientist wants to write an application to solve a problem, they must spend significant effort getting the application to access data efficiently and effectively, rather than focusing on the algorithms and the application's logic.
This manifests in many scenarios: for example, when a developer wants to move an application from an on-premise environment to a cloud environment, or when a data scientist who wrote Apache Spark applications would like to write a TensorFlow application. In fact, whenever the application framework, storage system, or deployment environment (cloud vs. on-premise) changes, the developer needs to reinvent the wheel again for data access. The trends of scaling compute and storage independently, the rise of object stores, and the growing popularity of hybrid and multi-cloud environments all further exacerbate the challenges associated with data access.
Many are attempting to resolve the challenges associated with data access by creating a new storage system, a new computation framework, or a new stack. However, history has shown that every five to ten years there is another wave of new storage systems and computation frameworks, and none of them fundamentally resolves the data access challenges. Take storage as an example: each new storage system becomes yet another data silo in the data environment. The same goes for creating a new application or a new stack.
At Alluxio, we believe that in order to fundamentally solve the data access challenges, the world needs a new layer - a data orchestration platform - between computation frameworks and storage systems. A data orchestration platform abstracts data access across storage systems, virtualizes all the data, and presents the data to data-driven applications via standardized APIs with a global namespace. It should also provide caching to enable fast access to warm data. In summary, a data orchestration platform provides data-driven applications with Data Accessibility, Data Locality, and Data Elasticity.
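To make the idea concrete, here is a minimal sketch of the two mechanisms described above - a global namespace that maps logical paths onto heterogeneous storage backends, and a read-through cache for warm data. This is an illustrative toy, not Alluxio's actual API; the `GlobalNamespace` class, `mount()`, and `read()` names are hypothetical, and plain dicts stand in for real storage systems.

```python
# Hypothetical sketch of a data orchestration layer's core ideas.
# NOT Alluxio's real API: GlobalNamespace, mount(), and read() are
# illustrative names, and dicts stand in for S3/HDFS/etc. backends.

class GlobalNamespace:
    def __init__(self):
        self._mounts = {}  # logical prefix -> storage backend (a dict here)
        self._cache = {}   # logical path -> bytes (stand-in for a local cache)

    def mount(self, prefix, backend):
        """Attach a storage backend (e.g. an S3 bucket or HDFS directory)
        at a logical prefix in the global namespace."""
        self._mounts[prefix] = backend

    def read(self, path):
        """Read through the cache; on a miss, resolve the mount point
        and fetch from the underlying backend."""
        if path in self._cache:
            return self._cache[path]          # warm data: served locally
        for prefix, backend in self._mounts.items():
            if path.startswith(prefix):
                data = backend[path[len(prefix):]]
                self._cache[path] = data      # warm the cache for next time
                return data
        raise FileNotFoundError(path)

# Applications see one namespace regardless of where data physically lives.
ns = GlobalNamespace()
ns.mount("/sales/", {"2023.csv": b"region,revenue\n"})  # e.g. backed by S3
ns.mount("/logs/", {"app.log": b"INFO start\n"})        # e.g. backed by HDFS

first = ns.read("/sales/2023.csv")   # first read goes to the backend
second = ns.read("/sales/2023.csv")  # second read is served from the cache
print(first == second)
```

The point of the sketch is the application's perspective: it reads `/sales/2023.csv` the same way whether the bytes live in S3, HDFS, or anywhere else, and repeated reads of warm data never touch the remote store.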
To draw an analogy, data orchestration is to data what container orchestration is to containers. Container orchestration is a category of technologies that enable containers to run in any environment, agnostic to the underlying hardware, and ensure that applications run as intended. Similarly, data orchestration is a category of technologies that enable applications to be compute-agnostic, storage-agnostic, and cloud-agnostic.
Now, with a data orchestration platform in place, an application developer can work under the assumption that the data will be readily accessible regardless of where it resides or the characteristics of the storage, and can focus on writing the application.
Besides empowering application developers, a data orchestration platform also brings tremendous value to infrastructure engineers. It reduces the risk of vendor lock-in by providing organizations with flexibility at the infrastructure level. Transitioning to different storage systems (including cloud storage), adopting another application framework, or even running a hybrid or multi-cloud environment are all possible without incurring a large development cost. I will expand on the need for and impact of data orchestration from these perspectives in future blogs.
In summary, data orchestration is the missing piece in the data world. Alluxio is an implementation of a data orchestration platform, and we invite everyone to join us and innovate for the future!