We are extremely excited to announce the release of Alluxio 2.3.0!
Alluxio 2.3.0 focuses on streamlining the user experience in hybrid cloud deployments, where Alluxio is deployed with compute in the cloud to access data on premises. New features such as environment validation tools and concurrent metadata synchronization make these deployments easier to troubleshoot and keep in sync with on-premises data. Integrations with AWS EMR, Google Dataproc, Kubernetes, and AWS Glue make Alluxio easy to use in a variety of cloud environments. In this article, we share some of the highlights of the release. For more, please visit our release notes page.
Downloads can be found here. Join thousands of members in our Slack channel to ask any questions and provide your feedback! Thank you to everyone who contributed to this release!
Significant Adoption of Hybrid Cloud
The move to the cloud is undeniably shaping the industry. Data analytics and machine learning workloads are no exception, but we have seen many Alluxio users prefer a hybrid cloud approach over a lift-and-shift migration. Alluxio’s ability to burst compute to the cloud with zero copy has proved invaluable in helping organizations begin to leverage the cloud.
Alluxio 2.3 addresses several key usability challenges and further improves the system’s effectiveness in hybrid deployments.
One Command Deployment on AWS EMR and Google Dataproc
Try out our example hybrid cloud deployment on AWS EMR or Google Dataproc.
Deploying Alluxio for the first time should be easy, and being able to repeatably create custom deployments with Alluxio in the stack is key in the cloud. Cloud resources are often elastic or ephemeral, in contrast to the long-term maintenance model common in on-premises deployments.
Alluxio artifacts have been published for integration with Terraform scripts. Experienced users can use the provided assets (see either of the tutorials above for details) as a basis for building their own Terraform deployments. Note that this is currently available only in the Enterprise Edition.
Environment Validation Tools
After deployment, connecting Alluxio in the cloud to remote data is the biggest hurdle for new Alluxio users. We’ve created a guided experience to help users through this first step.
The Alluxio Enterprise Edition has a remote connectivity page in the UI that troubleshoots and validates the entire mounting process.
Both the Community and Enterprise Editions have three new validation tools to help users troubleshoot issues in their deployments. All three are subcommands of the bin/alluxio command line:
runHdfsMountTests: checks the configuration for mounting the target HDFS path to Alluxio.
runUfsIOTest: measures the read/write I/O throughput from the Alluxio cluster to the target HDFS.
runHmsTests: validates that the given configuration is sufficient to run Hive Metastore operations.
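For scripted environments, the three validation tools named above can be driven from a small wrapper. The Python sketch below only assembles the command lines. The helper name `build_validation_command` and its parameters are hypothetical, and tool-specific arguments are passed through untouched because this post does not list them; consult the Alluxio documentation for the options each tool accepts.

```python
from typing import List, Sequence

# Tool names as given in this post; everything else here is illustrative.
VALIDATION_TOOLS = ("runHdfsMountTests", "runUfsIOTest", "runHmsTests")


def build_validation_command(
    tool: str,
    extra_args: Sequence[str] = (),
    launcher: str = "bin/alluxio",
) -> List[str]:
    """Return the argv list for one validation tool (hypothetical helper)."""
    if tool not in VALIDATION_TOOLS:
        raise ValueError(f"unknown validation tool: {tool}")
    # Arguments are passed through unchanged; this sketch does not know
    # which options each tool accepts.
    return [launcher, tool, *extra_args]


# Print the bare commands a script would start from.
for name in VALIDATION_TOOLS:
    print(" ".join(build_validation_command(name)))
```

To actually run a command, the returned list can be handed to `subprocess.run` from the Alluxio installation directory.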
Concurrent Metadata Synchronization
For long-running and production hybrid cloud deployments, users found it critical for the files and directories virtualized in Alluxio to be synchronized with the on-premises data in near real time. Previously, this was not feasible for namespaces with a large number of files.
In Alluxio 2.3, the new concurrent metadata synchronization algorithm improves performance by an order of magnitude or more, especially for large namespaces with concurrent operations.
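To give a feel for why concurrency helps here, the following Python sketch reconciles a cached namespace against a remote one by listing all directories at the same depth in parallel. It is a toy illustration, not Alluxio's actual algorithm: the in-memory dictionaries stand in for the on-premises store and the cached metadata, and every name in it (`sync_one`, `concurrent_sync`, `drop_subtree`) is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set, Tuple

Tree = Dict[str, Set[str]]  # directory path -> full paths of its children


def drop_subtree(cache: Tree, path: str) -> None:
    """Forget cached listings for `path` and everything beneath it."""
    stack = [path]
    while stack:
        children = cache.pop(stack.pop(), None)
        if children:
            stack.extend(children)


def sync_one(path: str, ufs: Tree, cache: Tree,
             lock: threading.Lock) -> Tuple[Set[str], Set[str], List[str]]:
    """Reconcile one directory; return (added, removed, subdirectories)."""
    remote = ufs.get(path, set())  # stands in for a slow remote listing call
    with lock:  # the shared cache is only mutated under the lock
        local = cache.get(path, set())
        added, removed = remote - local, local - remote
        cache[path] = set(remote)
        for gone in removed:
            drop_subtree(cache, gone)
    return added, removed, sorted(c for c in remote if c in ufs)


def concurrent_sync(root: str, ufs: Tree, cache: Tree,
                    workers: int = 8) -> Tuple[Set[str], Set[str]]:
    """Sync `cache` to match `ufs`, one tree level at a time, in parallel."""
    lock = threading.Lock()
    added: Set[str] = set()
    removed: Set[str] = set()
    frontier = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Every directory in the current level is listed concurrently.
            results = list(pool.map(
                lambda p: sync_one(p, ufs, cache, lock), frontier))
            frontier = []
            for new, gone, subdirs in results:
                added |= new
                removed |= gone
                frontier.extend(subdirs)
    return added, removed
```

In a real deployment the remote listing call dominates the cost, so issuing many listings at once is where a parallel design would be expected to gain the most; the lock here only protects the shared in-memory cache.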
Alluxio Structured Data Services
Alluxio is most commonly used in OLAP big data workloads with frameworks like Presto and SparkSQL. Alluxio Structured Data Services (SDS) is the subsystem in Alluxio that enables integration with those frameworks at the structured data level, as opposed to raw files and directories. Read more about SDS here.
Alluxio 2.3 further broadens SDS compatibility, especially in cloud environments.
Glue UDB Support
The Alluxio Catalog Service now supports connecting to AWS Glue as the metadata service. This enables Alluxio Structured Data Services for table metadata stored in AWS Glue, in addition to the existing support for the Hive Metastore.
ORC File Support
ORC is now a supported input type (in addition to CSV and Parquet) for transformations with the Alluxio Catalog Service.
More Info
You can find more information in the 2.3.0 official release notes.
Want to hear from the core developers? Join us for a webinar on the 2.3 release!
Have questions? Come join the Community Slack Channel.
Zac, Calvin, Bin, Adit, and Alluxio Product Team