Alluxio 1.4.0 has been released with a large number of new features and improvements. This blog highlights some standout aspects of the Alluxio 1.4.0 open source release.
- Improved Alluxio Under Storage API
- Native File System REST Interface
- Packet Streaming
Improved Alluxio Under Storage API
Alluxio is a system which bridges the gap between computation and data storage. The initial version of the Under Storage API mirrored the Alluxio File System API and was tailored to storage systems providing access using an HDFS-like API. Object stores, both public and private, have increasingly become the storage backend of choice for various use cases, and as a result, the Under Storage API needed to evolve in order to best serve both object and file system data storages.
Object stores have a flat namespace with only a top-level directory-like entity (a bucket). However, it is possible to create pseudo-directories, which give the illusion of directories in the under store. Since the object store API does not distinguish between file objects and directory objects, a file system API is extremely inefficient for an object store.
For example, a UFS API delete on a path does not know if a file or directory is to be deleted and must issue a remote query to fetch that metadata. Each metadata operation on the object store is expensive because of the latency involved in communicating with a remote storage system.
In Alluxio 1.4.0, the UFS API has been updated to handle this scenario better. It makes use of the fact that, in most cases, Alluxio already knows the additional metadata required to make the call efficient.
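The effect of already knowing the entry type can be sketched with a toy model. The classes and function names below are hypothetical and are not Alluxio's UFS API; the in-memory store merely counts the calls that would be remote round trips against a real object store.

```python
class CountingObjectStore:
    """In-memory stand-in for a remote object store that counts round trips."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.round_trips = 0

    def exists(self, key):
        self.round_trips += 1
        return key in self.keys

    def list_prefix(self, prefix):
        self.round_trips += 1
        return sorted(k for k in self.keys if k.startswith(prefix))

    def delete_object(self, key):
        self.round_trips += 1
        self.keys.discard(key)


def delete_unknown_type(store, path):
    """File-system-style delete: probe the store to learn what `path` is."""
    if store.exists(path):  # extra round trip just to fetch metadata
        store.delete_object(path)
        return
    # Not a file object, so treat it as a pseudo-directory prefix.
    for key in store.list_prefix(path.rstrip("/") + "/"):
        store.delete_object(key)


def delete_file(store, path):
    """Object-store-friendly delete: the caller already knows it is a file."""
    store.delete_object(path)


probing = CountingObjectStore(["data/a.txt"])
delete_unknown_type(probing, "data/a.txt")

informed = CountingObjectStore(["data/a.txt"])
delete_file(informed, "data/a.txt")

print(probing.round_trips, informed.round_trips)
```

In this model the probing delete costs two round trips for a single file while the informed delete costs one. The real savings depend on the connector and the operation.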
Changing the Under Storage API to be more object storage friendly has two main benefits:
- Optimized object store connectors: The improved UFS API, which suits object stores as well as file systems, yields significant performance benefits.
- Streamlined object store integrations: A new abstraction for object stores makes it even easier to integrate an object store with Alluxio. Patterns common to all object stores are handled once by the abstraction, so implementing a thin wrapper over the Java client for the REST interface of that particular object store is now sufficient.
Optimized object store connectors
Alluxio 1.4.0 has seen major improvements in small file and small read performance with object stores. Evolving the UFS API has enabled improved metadata performance with object stores. Here are a few experiments which demonstrate the benefits.
Create and Delete Performance for Empty Files
| Under store | Create | Delete |
| --- | --- | --- |
| S3A | 5x improvement compared to v1.3 | 3x improvement compared to v1.3 |
| Ceph | 10x improvement compared to v1.3 | 15x improvement compared to v1.3 |
Creation of zero-byte files in under storage is one of the operations most affected by the changes in 1.4.0. Create performance is improved by 5x for S3A and 10x for Ceph. The difference is greater for Ceph because, with the new abstraction called ObjectUnderFileSystem, optimizations that were specific to S3A in 1.3.0 now apply to all other object stores. The other operation that benefited substantially is delete, with 3x and 15x improvements for S3A and Ceph respectively.
Back-Of-The-Envelope Calculations
A typical data center has a 10Gbps (~ 1GBps for the sake of calculation) network link between Alluxio and the remote storage cluster with a 1 ms RTT. The I/O time to transfer a 1MB chunk on this link would be 1MB / 1GBps = 1 ms. Reducing the number of metadata round-trips for a create operation from 10 to 1 (a 10x reduction, as for Ceph) reduces our total execution time from 11 ms (10 ms + 1 ms) to 2 ms (1 ms + 1 ms), which is more than a 5x performance improvement for writing a 1MB file.
By the same logic, the total time for a 10MB chunk drops from 20 ms (10 ms + 10 ms) to 11 ms (1 ms + 10 ms), an improvement of approximately 2x. This illustrates how optimizing metadata performance significantly improves small file and small read performance. The benefits become even more pronounced in situations where the remote storage cluster is farther away.
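The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the same simplified model (total time = metadata round trips x RTT + payload / bandwidth); the function names are made up for illustration, and 1GBps is treated as 1 MB per millisecond.

```python
def transfer_time_ms(size_mb, metadata_round_trips, rtt_ms=1.0, bandwidth_mb_per_ms=1.0):
    """Total time = metadata round trips * RTT + payload size / bandwidth."""
    return metadata_round_trips * rtt_ms + size_mb / bandwidth_mb_per_ms


def speedup(size_mb, trips_before=10, trips_after=1, rtt_ms=1.0):
    """Ratio of total time before and after reducing the metadata round trips."""
    before = transfer_time_ms(size_mb, trips_before, rtt_ms)
    after = transfer_time_ms(size_mb, trips_after, rtt_ms)
    return before / after


print(speedup(1))               # 1MB file, 1 ms RTT: 11 ms vs 2 ms
print(speedup(10))              # 10MB file, 1 ms RTT: 20 ms vs 11 ms
print(speedup(1, rtt_ms=10.0))  # 1MB file, storage cluster farther away
```

Raising `rtt_ms` shows why the gain grows with distance: with a 10 ms RTT the same 1MB write goes from 101 ms to 11 ms in this model.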
Streamlined object store integrations
Adding a new object store integration now involves implementing a much smaller set of methods, and those methods resemble functionality natively supported by a REST interface client. Only half of the methods that previously had to be implemented are still required. In terms of source lines of code, a new object store can now be integrated in less than 400 LOC, which is more than a 2x reduction.
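The idea of a thin connector over a shared abstraction can be illustrated as follows. This is a hypothetical sketch and not the actual ObjectUnderFileSystem interface: the method names are invented, and it assumes the common convention that a pseudo-directory is an empty object whose key ends with "/".

```python
from abc import ABC, abstractmethod
from typing import List


class ObjectStoreBase(ABC):
    """Hypothetical base class holding logic shared by all connectors."""

    # Primitives each connector must supply.
    @abstractmethod
    def put_object(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def delete_object(self, key: str) -> None: ...

    @abstractmethod
    def list_keys(self, prefix: str) -> List[str]: ...

    # Directory emulation, written once for every connector.
    def mkdir(self, path: str) -> None:
        self.put_object(path.rstrip("/") + "/", b"")

    def is_directory(self, path: str) -> bool:
        return bool(self.list_keys(path.rstrip("/") + "/"))

    def list_children(self, path: str) -> List[str]:
        prefix = path.rstrip("/") + "/"
        children = set()
        for key in self.list_keys(prefix):
            rest = key[len(prefix):]
            if rest:  # skip the directory marker itself
                children.add(rest.split("/", 1)[0])
        return sorted(children)


class InMemoryStore(ObjectStoreBase):
    """Thin connector: only the three primitives, backed by a dict."""

    def __init__(self):
        self._objects = {}

    def put_object(self, key, data):
        self._objects[key] = data

    def delete_object(self, key):
        self._objects.pop(key, None)

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


store = InMemoryStore()
store.mkdir("logs")
store.put_object("logs/2017/a.txt", b"x")
store.put_object("logs/b.txt", b"y")
print(store.list_children("logs"))
```

A connector for a real object store would replace the dict operations with calls to that store's client library and inherit the directory emulation unchanged.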
Native File System REST Interface
The newly introduced REST interface provides parity with Alluxio’s native Java API and its purpose is to facilitate interactions with Alluxio from non-Java environments.
The REST API is available through a new Alluxio process called Alluxio proxy, which proxies the communication between the REST API and Alluxio servers using an internal Alluxio Java client.
The Alluxio proxy can be started:
- locally through the ./bin/alluxio-start.sh local command, which starts a local Alluxio cluster
- co-located with every Alluxio master and Alluxio worker process started through the ./bin/alluxio-start.sh all command
- explicitly through the ./bin/alluxio-start.sh proxy command, which starts the proxy process locally
API
The REST API consists of two types of endpoints:
- path: http://<host>:<port>/api/v1/paths/<path>/<operation>/
- stream: http://<host>:<port>/api/v1/streams/<id>/<operation>/
The host parameter can be any machine which is running an Alluxio Proxy.
The path endpoints perform the given operation over a path (e.g. list-status, create-file, or delete). Any additional arguments are passed to the endpoint as a JSON object.
Some of the path endpoints, create-file and open-file in particular, create a stream and return its integer id as a handle. This handle can be used to invoke the stream endpoints to perform the given operation (e.g. read, write, or close).
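The two endpoint shapes can be assembled with small helper functions. The sketch below is illustrative only: the helper names are invented, percent-encoding the path with `urllib.parse.quote` is an assumption about how paths containing special characters should be sent, and the URLs omit the trailing slash, as the curl examples in this post do.

```python
from urllib.parse import quote


def path_url(host, port, path, operation):
    """Build a path endpoint URL; `path` is an absolute Alluxio path such as /hello/world."""
    # quote() leaves "/" untouched and percent-encodes characters such as spaces.
    return f"http://{host}:{port}/api/v1/paths/{quote(path)}/{operation}"


def stream_url(host, port, stream_id, operation):
    """Build a stream endpoint URL for a handle returned by create-file or open-file."""
    return f"http://{host}:{port}/api/v1/streams/{stream_id}/{operation}"


print(path_url("localhost", 39999, "/hello/world", "create-directory"))
print(stream_url("localhost", 39999, 1, "write"))
```

Because the Alluxio path itself starts with a slash, the resulting path URL contains a double slash after `paths`.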
Examples
This section illustrates some of the REST API functionality through the use of curl commands that communicate with a local Alluxio proxy.
Create a directory
The following command creates the /hello/world directory; the recursive=true parameter is used to create missing parents recursively:
curl -v -H "Content-Type: application/json" -X POST -d '{"recursive":"true"}' http://localhost:39999/api/v1/paths//hello/world/create-directory
List a directory
The following command lists the contents of the /hello directory:
curl -v -X POST http://localhost:39999/api/v1/paths//hello/list-status
Delete a directory
The following command deletes the /hello directory; the recursive=true parameter is used to delete the directory and its contents recursively:
curl -v -H "Content-Type: application/json" -X POST -d '{"recursive":"true"}' http://localhost:39999/api/v1/paths//hello/delete
Upload a file
The following commands create the /hello-world.txt file, write its contents, and close it:
curl -v -X POST http://localhost:39999/api/v1/paths//hello-world.txt/create-file
1 // Proxy creates an upload "stream" and returns its ID
curl -v -H "Content-Type: application/octet-stream" -X POST -d 'Hello World!' http://localhost:39999/api/v1/streams/1/write
// Writes 'Hello World!' to the file
curl -v -X POST http://localhost:39999/api/v1/streams/1/close
// Closes the stream
Download a file
The following commands open the /hello-world.txt file, read its contents, and close it:
curl -v -X POST http://localhost:39999/api/v1/paths//hello-world.txt/open-file
2 // Proxy creates a download "stream" and returns its ID
curl -v -X POST http://localhost:39999/api/v1/streams/2/read
Hello World!
curl -v -X POST http://localhost:39999/api/v1/streams/2/close
// Closes the stream
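The same upload and download flow can be driven from a non-Java program. The following is a minimal Python sketch using only the standard library and the endpoints shown in the curl commands above. It assumes a proxy at localhost:39999, that the create-file and open-file responses are a bare integer stream id, and that a single read call returns the file contents, as in the example output. The function names are invented for illustration.

```python
import urllib.request

BASE = "http://localhost:39999/api/v1"  # assumed local proxy address


def http_post(url, body=None, content_type=None):
    """POST `body` to `url` and return the raw response bytes."""
    headers = {"Content-Type": content_type} if content_type else {}
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read()


def upload(path, data, post=http_post):
    """Create `path`, write `data` to the returned stream, and close it."""
    stream_id = int(post(f"{BASE}/paths/{path}/create-file"))
    try:
        post(f"{BASE}/streams/{stream_id}/write", data, "application/octet-stream")
    finally:
        post(f"{BASE}/streams/{stream_id}/close")  # always release the stream


def download(path, post=http_post):
    """Open `path`, read from the returned stream, and close it."""
    stream_id = int(post(f"{BASE}/paths/{path}/open-file"))
    try:
        return post(f"{BASE}/streams/{stream_id}/read")
    finally:
        post(f"{BASE}/streams/{stream_id}/close")
```

With a proxy running, `upload("/hello-world.txt", b"Hello World!")` followed by `download("/hello-world.txt")` mirrors the curl session above. The `post` parameter exists so the flow can be exercised without a live proxy.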
Performance
For optimal performance, we recommend collocating an Alluxio proxy with Alluxio server processes. This will enable non-Java applications to access data stored in Alluxio at memory-speed, while minimizing the overhead of the extra hop between Alluxio proxy and Alluxio servers.
Packet Streaming
Alluxio 1.4.0 introduces a new network transfer protocol designed to fully utilize the available network bandwidth between Alluxio components. We achieve this by reducing the amount of buffering used during network transfers and relying on a continuous streaming protocol as opposed to a request-response protocol for data transfer.
Benefits
- Up to 2x I/O performance improvement within a standard network, with better results in environments with a high latency-throughput product (long round trips combined with high bandwidth)
- Handles small reads and random reads optimally without configuration tuning
Protocol Details
Because data is streamed continuously, the network pipe stays saturated: the reader does not need to send periodic requests for additional data. With a request-response protocol, the pipe would sit empty for one round trip each time the reader requested more data. Streaming therefore leads to a significant I/O performance improvement, especially in cases where the round trip time is fairly long and the available throughput is large.
In addition, the unit of data transfer has been reduced to a packet (64KB by default). With the streaming protocol, a smaller packet does not hurt workloads with large sequential I/O because the number of setup/teardown messages stays constant regardless of how many packets are sent. At the same time, the small packet size is favorable for small reads, since the total amount of data read is much closer to what the reader is requesting. Therefore, packet streaming can satisfy clients of both workload types without requiring different configurations.
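A rough model makes the two effects concrete. The sketch below is a simplification and not a description of Alluxio's implementation: it charges one round trip per chunk for request-response and a single round trip for streaming. The 1MB request-response chunk size, the 1 ms RTT, and the 1024 KB/ms bandwidth (roughly 1GBps) are illustrative assumptions, as are the function names.

```python
import math


def request_response_ms(total_kb, chunk_kb, rtt_ms, bandwidth_kb_per_ms):
    """Each chunk is requested separately, so every chunk pays one round trip."""
    chunks = math.ceil(total_kb / chunk_kb)
    return chunks * rtt_ms + total_kb / bandwidth_kb_per_ms


def streaming_ms(total_kb, rtt_ms, bandwidth_kb_per_ms):
    """One request starts the stream; packets then flow back to back."""
    return rtt_ms + total_kb / bandwidth_kb_per_ms


def fetched_kb(requested_kb, unit_kb):
    """Data actually transferred when reads are rounded up to whole units."""
    return math.ceil(requested_kb / unit_kb) * unit_kb


# Large sequential read: 64MB over a link with 1 ms RTT.
print(request_response_ms(65536, 1024, 1.0, 1024.0))  # 64 round trips + 64 ms transfer
print(streaming_ms(65536, 1.0, 1024.0))               # 1 round trip + 64 ms transfer

# Small read: 4KB requested with a 64KB packet versus a hypothetical 1MB unit.
print(fetched_kb(4, 64), fetched_kb(4, 1024))
```

Under these assumptions the large read takes 128 ms with request-response and 65 ms with streaming, while the small read transfers 64KB instead of 1MB.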
Packet streaming is currently still in an experimental stage, and we will be actively improving this feature in coming releases to further improve Alluxio’s performance and ease of use.
And Many More!
This blog only highlighted a few of the new features and improvements in Alluxio 1.4.0. For a more comprehensive list, check out the release notes.
You can easily get started with Alluxio open source or community edition today by following the quick start guide.