Originally published on trino.io: https://trino.io/blog/2023/07/21/trino-fest-2023-alluxio-recap.html
By 2025, there will be 100 zettabytes stored in the cloud. That’s 100,000,000,000,000,000,000,000 bytes - a huge, eye-popping number. But only about 10% of that data is actually used on a regular basis. At Uber, for example, just 1% of their disk space holds 50% of the data they access on any given day. With so much data stored but such a small percentage in active use, it raises the question: how can we identify frequently-used data and make it more accessible, efficient, and lower-cost to access?
Once we have identified that “hot data,” the answer is data caching. By caching that data on fast local storage, you can reap a ton of benefits: performance gains, lower costs, less network congestion, and reduced throttling on the storage layer. Data caching sounds great, but why are we talking about it at a Trino event? Because data caching with Alluxio is coming to Trino!
https://www.youtube.com/watch?v=oK1A5U1WzFc
Recap
So what are the key features of data caching? First and foremost, frequently-accessed data gets stored on local SSDs. In the case of Trino, this means the Trino worker nodes keep a local copy of the data to reduce latency and cut down on reads from object storage. Because the cache lives on local disk, the data survives even a worker restart. Caching works with all the data lake connectors, so whether you’re using Iceberg, Hive, Hudi, or Delta Lake, it’ll speed your queries up. The best part is that once it’s in Trino, all you need to do is enable it, set three configuration properties, and let the performance improvement speak for itself. There’s no other change to how queries run or execute, so there’s no headache or migration needed.
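As a rough illustration of how lightweight the setup is meant to be, enabling the cache for a catalog might look something like the fragment below. The property names here are an assumption based on the talk, not final syntax - check the Trino release notes and documentation once the feature ships:

```properties
# Hypothetical catalog properties for Alluxio-backed file system caching.
# Names are illustrative; consult the Trino docs for the released syntax.
fs.cache.enabled=true
fs.cache.directories=/mnt/cache
fs.cache.max-sizes=100GB
```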
Hope then gives deeper technical detail on exactly how data caching works. She highlights a few existing examples of how large-scale companies, Uber and Shopee, have utilized data caching to reap massive performance gains. Then the talk is passed off to Beinan, who gives further technical detail, exploring cache invalidation, how to maximize cache hit rate, cluster elasticity, cache storage efficiency, and data consistency. He also explores ongoing work on semantic caching, native/off-heap caching, and distributed caching, all of which have interesting upsides and benefits.
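One common technique behind keeping the cache hit rate high under cluster elasticity is assigning each file (or split) to a preferred worker via consistent hashing, so that adding or removing a node only remaps a small fraction of cached data. The following is a minimal Python sketch of that idea - an illustration of the general technique, not the actual Trino or Alluxio implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps cache keys (e.g. file paths) to workers so that adding or
    removing a worker only remaps roughly 1/N of the keys."""

    def __init__(self, workers, replicas=100):
        # Each worker gets `replicas` virtual points on the ring for balance.
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, worker)
        for worker in workers:
            self.add(worker)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, worker):
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{worker}#{i}"), worker))
        self.ring.sort()

    def worker_for(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
keys = [f"s3://bucket/part-{i}.parquet" for i in range(1000)]
before = {k: ring.worker_for(k) for k in keys}

# Scale out: only a fraction of keys move to the new worker,
# so most workers keep their warm cache.
ring.add("worker-4")
moved = sum(1 for k in keys if ring.worker_for(k) != before[k])
```

The same file consistently lands on the same worker, which is what lets a scheduler route splits to the node that already holds the data on local SSD.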
Give the full talk a listen if you’re interested, as both Hope and Beinan go into a lot of great, technical detail that you won’t want to miss out on. And don’t forget to keep an eye on Trino release notes to see when it’s live!
Share this session
If you thought this talk was interesting, consider sharing this on Twitter, Reddit, LinkedIn, HackerNews or anywhere on the web.
To learn more about Alluxio, join 11k+ members in the Alluxio community Slack channel to ask any questions and provide your feedback: https://alluxio.io/slack.
To learn more about the Trino community, go to https://trino.io/community.html.