Accelerating Write-intensive Data Workloads on AWS S3
August 7, 2019

Introduction

When running write-intensive ETL jobs in the cloud using Spark, MapReduce, or Hive, writing the output directly to AWS S3 often yields slow, unstable performance or complicated error handling, due to throughput throttling, object storage semantics, and other reasons. Some users instead write the output to an HDFS cluster first and then upload the staged data to AWS S3 using tools like s3-dist-cp. While this second approach improves performance and avoids corner cases of object storage, it adds the cost and complexity of maintaining a staging HDFS cluster and delays the availability of the data for downstream applications.

Alluxio is an open-source data orchestration system widely used to speed up data-intensive workloads in the cloud. Alluxio v2.0 introduced Replicated Async Write, which lets applications complete writes to the Alluxio file system and return quickly, while Alluxio persists the data to the chosen under storage, such as S3, in the background.

What is Replicated Async Write

Assume a Spark job is writing a large data set to AWS S3. To ensure that the output files are written quickly and stay highly available before they are persisted to S3, pass the write type ASYNC_THROUGH and a target replication level to Spark (see the description in the docs),

alluxio.user.file.writetype.default=ASYNC_THROUGH
alluxio.user.file.replication.durable=<N>

and point the output to the Alluxio file system (URI scheme “alluxio://”):

scala> myData.saveAsTextFile("alluxio://<MasterHost>:19998/Output")

This Spark job will first synchronously write the new file to the Alluxio file system with N copies before returning. In the background, Alluxio will then persist a copy of the new data to the configured under storage, such as S3.
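Here is a minimal spark-shell sketch of this flow (using the hypothetical master host and output path from above, with a small generated data set standing in for real job output): the write returns once the in-Alluxio copies exist, and the output is immediately readable through Alluxio while persistence to S3 continues in the background.

scala> val myData = sc.parallelize(1 to 1000000).map(i => "record-" + i)
scala> myData.saveAsTextFile("alluxio://<MasterHost>:19998/Output")  // returns once N in-Alluxio copies are written
scala> sc.textFile("alluxio://<MasterHost>:19998/Output").count()    // readable before the data reaches S3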

By default, the property alluxio.user.file.replication.durable is set to 1, and writing a file to Alluxio using ASYNC_THROUGH completes at memory speed when there is a colocated Alluxio worker. However, if the worker holding the only copy crashes or restarts before the data is persisted, the data is lost. Increasing alluxio.user.file.replication.durable to a value N greater than 1 replicates the data across N different workers in Alluxio, so the file can survive up to N-1 worker failures before being persisted, at the cost of temporarily storing N times more data in Alluxio. For example, with N=3, a 100 GB output temporarily occupies 300 GB of Alluxio capacity until it is persisted.

Note that before the file is persisted, if any of these N workers fails, Alluxio handles the failure by re-replicating the under-replicated blocks so that sufficient copies of the file remain; the free command will also not take effect until the file is persisted. Once the file is persisted to the under storage, the in-Alluxio copies become evictable, subject to the relevant eviction configuration.
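One way to observe this from the command line (a sketch; the exact output fields vary by Alluxio version) is to check the file's persistence state with the stat command and free the in-Alluxio copies only after it reports the file as persisted:

$ ./bin/alluxio fs stat /Output   # shows metadata including the persistence state
$ ./bin/alluxio fs free /Output   # takes effect only once the file is persisted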

Enable Replicated Async Write in Alluxio

On all client applications, set alluxio.user.file.writetype.default to ASYNC_THROUGH and alluxio.user.file.replication.durable to a number that provides the desired fault tolerance. Note that these properties are client-side only, so no Alluxio processes need to be restarted to use them!

For example, put the following lines into conf/alluxio-site.properties to write with three initial copies before the data is persisted:

alluxio.user.file.writetype.default=ASYNC_THROUGH
alluxio.user.file.replication.durable=3
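To scope the same settings to a single Spark application rather than the site-wide properties file, one option (a sketch, assuming the Alluxio client on the Spark driver and executors picks up JVM system properties, as described in Alluxio's configuration docs) is to pass them as -D options:

$ spark-submit \
    --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=ASYNC_THROUGH -Dalluxio.user.file.replication.durable=3' \
    --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=ASYNC_THROUGH -Dalluxio.user.file.replication.durable=3' \
    <your-spark-job>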

The Alluxio command line also supports replicated async write. For example, the copyFromLocal command can copy data from the local file system to Alluxio using replicated async write:

$ ./bin/alluxio fs -Dalluxio.user.file.writetype.default=ASYNC_THROUGH \
    -Dalluxio.user.file.replication.durable=3 \
    copyFromLocal <localPath> <alluxioPath>

Comparison to Other Alluxio Write Types

Replicated async write is one write type that balances write speed against data persistence, at the transient cost of extra storage capacity. Alluxio also provides several other write types that realize different trade-offs in this design space.

For an application that writes its data through Alluxio, the data does not reach the under storage during the write unless the write type is THROUGH or CACHE_THROUGH; the drawback is that those write types are slower than their MUST_CACHE and ASYNC_THROUGH counterparts. Conversely, when a file is written with MUST_CACHE or ASYNC_THROUGH and its blocks exist on only a single worker, the written data is unrecoverable if that worker fails or crashes. By creating synchronous replicas at write time, replicated async write combines the performance of MUST_CACHE with fault tolerance until the data is persisted.
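For comparison, a job whose output must be durable in the under storage before the write returns can use the synchronous CACHE_THROUGH write type via the same client property, for example with the copyFromLocal command shown above:

$ ./bin/alluxio fs -Dalluxio.user.file.writetype.default=CACHE_THROUGH \
    copyFromLocal <localPath> <alluxioPath>

In short, replicated async write offers the write latency of MUST_CACHE with configurable fault tolerance until the data lands in S3.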
