In the previous blog, we introduced Uber’s Presto use cases and how we collaborated to implement Alluxio local cache to overcome different challenges in accelerating Presto queries. This second part discusses our improvements to the local cache metadata.
File Level Metadata for Local Cache
Motivation
First, we want to prevent stale caching. The underlying data files might be changed by third-party frameworks. Note that this situation may be rare for Hive tables but is very common for Hudi tables.
Second, the volume of distinct data read daily from HDFS can be large, and we do not have enough cache space to hold all of it. Therefore, we introduce scoped quota management by setting a quota for each table.
Third, metadata should be recoverable after a server restart. Currently, the local cache keeps its metadata in memory rather than on disk, which makes it impossible to recover the metadata when the server goes down and restarts.
High-Level Approach
Therefore, we propose file-level metadata, which keeps the last modified time and the scope of each data file we cache. The file-level metadata store must be persisted on disk so that the data does not disappear after a restart.
With the introduction of file-level metadata, there will be multiple versions of the data. When the data is updated, a new timestamp is generated, corresponding to a new version, and a new folder storing the new pages is created under that timestamp. At the same time, we try to remove the old timestamp.
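To make this concrete, below is a minimal sketch in Java of what such a per-file metadata record could look like. The class and field names are hypothetical and chosen for illustration only; they are not Alluxio's actual implementation.

```java
import java.nio.file.Path;

// Hypothetical sketch of the per-file metadata kept by the local cache.
// Names are illustrative; the actual Alluxio implementation differs.
public final class FileCacheMetadata {
  private final long lastModificationTimeMs; // used to detect stale cache entries
  private final String scope;                // e.g. "database.schema.table.partition", used for quota

  public FileCacheMetadata(long lastModificationTimeMs, String scope) {
    this.lastModificationTimeMs = lastModificationTimeMs;
    this.scope = scope;
  }

  // Each version of a file is cached under a folder named by its timestamp,
  // e.g. <cacheRoot>/<fileId>/<timestamp>/..., so an update creates a new folder.
  public Path versionDir(Path cacheRoot, String fileId) {
    return cacheRoot.resolve(fileId).resolve(Long.toString(lastModificationTimeMs));
  }

  // A cached entry is stale once the source file has a newer modification time.
  public boolean isStale(long currentModificationTimeMs) {
    return currentModificationTimeMs > lastModificationTimeMs;
  }
}
```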
Cache Data and Metadata Structure
As shown above, we have two folders corresponding to two timestamps: timestamp1 and timestamp2. Usually, when the system is running, there will not be two timestamps simultaneously, because we delete the old timestamp1 and keep only timestamp2. However, on a busy server or under high concurrency, we may not be able to remove the old timestamp in time, in which case two timestamps can exist at the same time. In addition, we maintain a metadata file that holds the file information in protobuf format along with the latest timestamp. This ensures that Alluxio's local cache only reads data from the latest timestamp. When the server restarts, the timestamp information is read from the metadata file so that quota and last modified time can be managed correctly.
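As an illustration of the restart path, the sketch below (Java, with hypothetical names, not Alluxio's actual code) reads the latest timestamp back from a persisted metadata file and prunes any leftover older version folders. In the real implementation, the metadata is serialized with protobuf rather than the plain-text format assumed here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Illustrative recovery logic: on restart, recover the latest timestamp from
// the persisted metadata file and drop older version folders.
public final class CacheRecovery {
  // Assumption for this sketch: the first line of the metadata file holds the
  // latest timestamp. The real metadata file is protobuf-encoded.
  static long readLatestTimestamp(Path metadataFile) throws IOException {
    return Long.parseLong(Files.readAllLines(metadataFile).get(0).trim());
  }

  // Keep only the folder matching the latest timestamp; older timestamp
  // folders may still exist if eviction could not run in time.
  static void pruneOldVersions(Path fileCacheDir, long latestTimestamp) throws IOException {
    try (Stream<Path> versions = Files.list(fileCacheDir)) {
      versions
          .filter(dir -> !dir.getFileName().toString().equals(Long.toString(latestTimestamp)))
          .forEach(CacheRecovery::deleteQuietly);
    }
  }

  private static void deleteQuietly(Path dir) {
    try (Stream<Path> walk = Files.walk(dir)) {
      // Delete deepest entries first so directories are empty when removed.
      walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
    } catch (IOException ignored) {
      // Best-effort cleanup; a later pass can retry.
    }
  }
}
```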
Metadata Awareness
Cache Context
Since Alluxio is a generic caching solution, it needs the compute engine, such as Presto, to pass metadata to it. Therefore, on the Presto side, we use the HiveFileContext. For each data file from a Hive or Hudi table, Presto creates a HiveFileContext, and Alluxio makes use of this information when Presto opens the file.
When calling openFile, Alluxio creates a new instance of PrestoCacheContext, which holds the HiveFileContext along with the scope (four levels: database, schema, table, partition), the quota, the cache identifier (i.e., the MD5 hash of the file path), and other information. We pass this cache context to the local file system so that Alluxio can manage metadata and collect metrics.
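A rough sketch of how such a cache context might be assembled is shown below. PrestoCacheContext and HiveFileContext are the real class names mentioned above, but the constructor, fields, and accessors here are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Simplified sketch of a cache context; the real PrestoCacheContext wraps
// Presto's HiveFileContext, and these field/method names are assumptions.
public final class PrestoCacheContextSketch {
  public enum ScopeLevel { DATABASE, SCHEMA, TABLE, PARTITION }

  private final ScopeLevel scopeLevel;  // one of the four scope levels
  private final String scope;           // e.g. "db.schema.table.partition"
  private final long quotaBytes;        // per-scope cache quota
  private final String cacheIdentifier; // MD5 hash of the file path

  public PrestoCacheContextSketch(ScopeLevel scopeLevel, String scope,
                                  long quotaBytes, String filePath) {
    this.scopeLevel = scopeLevel;
    this.scope = scope;
    this.quotaBytes = quotaBytes;
    this.cacheIdentifier = md5Hex(filePath);
  }

  // Cache identifier: the MD5 hash of the file path, as described above.
  static String md5Hex(String filePath) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(filePath.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }
}
```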
Per Query Metrics Aggregation on Presto Side
In addition to passing data from Presto to Alluxio, we can also call back to Presto. When executing a query, we can collect internal metrics, such as how many bytes of the data read hit the cache and how many bytes were read from external HDFS storage.
As shown below, we pass the HiveFileContext containing the PrestoCacheContext to the local cache file system (LocalCacheFileSystem), after which the local cache file system calls back (incrementCounter) to the CacheContext. This callback chain then continues to the HiveFileContext and on to RuntimeStats.
In Presto, RuntimeStats is used to collect metrics while executing queries so that we can perform aggregation operations. Afterwards, we can see the local cache file system's statistics in Presto's UI or in the JSON output. With the above process, Alluxio and Presto work closely together: on the Presto side, we get better statistics; on the Alluxio side, we get a clearer picture of the metadata.
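The callback chain might look roughly like the sketch below. Only the incrementCounter hook mirrors the text above; the class shapes and method signatures are simplified assumptions, and Presto's actual RuntimeStats API differs in detail.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative callback chain: LocalCacheFileSystem -> CacheContext ->
// HiveFileContext -> RuntimeStats. Shapes are simplified assumptions.
final class RuntimeStatsSketch {
  private final Map<String, LongAdder> metrics = new ConcurrentHashMap<>();

  // Aggregates metric values per query under a metric name.
  void addMetricValue(String name, long value) {
    metrics.computeIfAbsent(name, k -> new LongAdder()).add(value);
  }

  long get(String name) {
    LongAdder adder = metrics.get(name);
    return adder == null ? 0 : adder.sum();
  }
}

final class HiveFileContextSketch {
  private final RuntimeStatsSketch stats;

  HiveFileContextSketch(RuntimeStatsSketch stats) {
    this.stats = stats;
  }

  void incrementCounter(String name, long value) {
    stats.addMetricValue(name, value); // forwarded to per-query RuntimeStats
  }
}

final class CacheContextSketch {
  private final HiveFileContextSketch hiveFileContext;

  CacheContextSketch(HiveFileContextSketch hiveFileContext) {
    this.hiveFileContext = hiveFileContext;
  }

  // Called by the local cache file system on every cache hit or external read.
  void incrementCounter(String name, long value) {
    hiveFileContext.incrementCounter(name, value);
  }
}
```

In this sketch, a call such as cacheContext.incrementCounter("cache_hit_bytes", n) from the local cache file system would propagate all the way up to the per-query RuntimeStats aggregation.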
Future Work
Performance Tuning
Because the callback process described above considerably extends the lifecycle of the CacheContext, we have encountered rising GC latency, which we are working to address.
Adopt Semantic Cache (SC)
We will implement Semantic Cache (SC) on top of the file-level metadata we propose. For example, we can cache the internal data structures of Parquet and ORC files, such as footers and indexes.
More Efficient Deserialization
To achieve more efficient deserialization, we will use FlatBuffers instead of protobuf. Although protobuf is used in the ORC factory to store metadata, we found that ORC's metadata accounts for more than 20-30% of the total CPU usage in Alluxio's collaboration with Facebook. Therefore, we plan to replace the existing protobuf with FlatBuffers for storing cache metadata, which is expected to significantly improve deserialization performance.
To summarize, together with the previous blog, this two-part series shares how we created a new caching layer for the hot data needed by our Presto fleet, based on the recent open-source collaboration between the Presto and Alluxio communities at Uber. This architecturally simple and clean approach can significantly reduce HDFS latency with managed SSDs and consistent-hashing-based soft affinity scheduling. Join 9,000+ members in our community Slack channel to learn more.
About the Authors
Chen Liang is a senior software engineer on Uber's interactive analytics team, focusing on Presto. Before joining Uber, Chen was a staff software engineer on LinkedIn's Big Data platform. Chen is also a committer and PMC member of Apache Hadoop. He holds two master's degrees, from Duke University and Brown University.
Dr. Beinan Wang is a software engineer at Alluxio and a committer of PrestoDB. Before Alluxio, he was the tech lead of the Presto team at Twitter, where he built large-scale distributed SQL systems for Twitter's data platform. He has twelve years of experience working on performance optimization, distributed caching, and high-volume data processing. He received his Ph.D. in computer engineering from Syracuse University, where he worked on symbolic model checking and runtime verification of distributed systems.