On-Demand Videos
Scaling experimentation in digital marketplaces is crucial for driving growth and enhancing user experiences. However, varied methodologies and a lack of experiment governance can hinder the impact of experimentation leading to inconsistent decision-making, inefficiencies, and missed opportunities for innovation.
At Poshmark, we developed a homegrown experimentation platform, Lightspeed, that allowed us to make reliable and confident reads on product changes, which led to a 10x growth in experiment velocity and positive business outcomes along the way.
This session will provide a deep dive into the best practices and lessons learned from successful implementations of large-scale experiments. We will explore the importance of experimentation, overcome scalability challenges, and gain insights into the frameworks and technologies that enable effective testing.
In the rapidly evolving world of e-commerce, visual search has become a game-changing technology. Poshmark, a leading fashion resale marketplace, has developed Posh Lens – an advanced visual search engine that revolutionizes how shoppers discover and purchase items.
Under the hood of Posh Lens lies Milvus, a vector database enabling efficient product search and recommendation across our vast catalog of over 150 million items. However, with such an extensive and growing dataset, maintaining high-performance search capabilities while scaling AI infrastructure presents significant challenges.
In this talk, Mahesh Pasupuleti shares:
- The architecture and strategies to scale Milvus effectively within the Posh Lens infrastructure
- Key considerations include optimizing vector indexing, managing data partitioning, and ensuring query efficiency amidst large-scale data growth
- Distributed computing principles and advanced indexing techniques to handle the complexity of Poshmark’s diverse product catalog
As machine learning and deep learning models grow in complexity, AI platform engineers and ML engineers face significant challenges with slow data loading and GPU utilization, often leading to costly investments in high-performance computing (HPC) storage. However, this approach can result in overspending without addressing the core issues of data bottlenecks and infrastructure complexity.
A better approach is adding a data caching layer between compute and storage, like Alluxio, which offers a cost-effective alternative through its innovative data caching strategy. In this webinar, Jingwen will explore how Alluxio's caching solutions optimize AI workloads for performance, user experience and cost-effectiveness.
What you will learn:
- The I/O bottlenecks that slow down data loading in model training
- How Alluxio's data caching strategy optimizes I/O performance for training and GPU utilization, and significantly reduces cloud API costs
- The architecture and key capabilities of Alluxio
- Using Rapid Alluxio Deployer to install Alluxio and run benchmarks in AWS in just 30 minutes
Video: Presentation Slides: How to teach your data scientist to leverage an analytics cluster with Presto, Spark, and Alluxio from Alluxio, Inc.
Video: Presentation Slides: Building a scalable analytics environment to support diverse workloads from Alluxio, Inc.
Datasapiens is an international data-analytics startup based in Prague. We help our clients to uncover the value of their data and open up new revenue streams for them. We provide an end-to-end service that manages the data pipeline and automates the process of generating data insights.
In this talk, we will describe how we have solved an issue with large S3 API costs incurred by Presto under several usage concurrency levels by implementing Alluxio as a data orchestration layer between S3 and Presto. Also, we will show the results of an experiment with estimating the per-query S3 API costs using the TPC-DS dataset.
This talk will focus on:
- The Hadoop ecosystem at Datasapiens
- Drastic increase of S3 API costs during performance tests with Presto
- S3 API costs tests with TPC-DS
- Implications to the cloud data lake architecture
Electronic Arts (EA) is a leading company in the gaming industry, providing over a thousand games to serve billions of users worldwide. The EA Data & AI Department builds hundreds of platforms to manage petabytes of data generated by games and users every day. These platforms consist of a wide range of data analytics, from real-time data ingestion to ETL pipelines. Formatted data produced by our department is widely adopted by executives, producers, product managers, game engineers, and designers for marketing and monetization, game design, customer engagement, player retention, and end-user experience.
Near real-time information for EA’s online services is critical for making business decisions, such as campaigns and troubleshooting. These services include, but are not limited to, real-time data visualization, dashboarding, and conversational analytics. Highly time-sensitive applications such as BI software, dashboards and AI tools heavily rely on these services. To support these use cases, we studied an innovative platform with Presto as the computing engine and Alluxio as a data orchestration layer between Presto and S3 storage. We evaluated this platform with real industrial examples of data visualization, dashboarding, and a conversational chatbot. Our preliminary results show that Presto with Alluxio outperforms S3 significantly in all cases, with a 6x performance gain when handling a large number of small files.
We are extremely excited to announce the release of Alluxio 2.4.0!
Alluxio 2.4.0 focuses on features critical to large scale, production deployments in Cloud and Hybrid Cloud environments. Features such as highly scalable metadata journaling, aggregate cluster metrics monitoring, and automated detection of JVM pauses further improve Alluxio’s suitability for demanding workloads. Devops tools are also key for triaging issues when they occur. In Alluxio 2.4 we further improve the cluster wide log collection framework. Finally, Alluxio is continually expanding its state of the art integrations with frameworks and storage systems. Alluxio 2.4 introduces and improves integrations with Kubernetes, Azure Data Lake Storage, and Apache Ozone. Alluxio 2.4 is also the first Alluxio release that has support for Java 11.
In this Office Hour, we will go over:
- Expanded metadata service
- Cloud native deployment
- Simplified DevOps and system monitoring
- Support for Java 11
In this talk, we describe the architecture to migrate analytics workloads incrementally to any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) directly on on-prem data without copying the data to cloud storage.
In this Office Hour:
- We will go over an architecture for running elastic compute clusters in the cloud using on-prem HDFS.
- Have a casual online video chat with Alluxio Open Source core maintainers to address any Alluxio related questions from our community members
Over the last few years, organizations have worked towards the separation of storage and compute for a number of benefits in the areas of cost, data duplication and data latency. Cloud resolves most of these issues but comes to the expense of needing a way to query data on remote storages. Alluxio and Presto are a powerful combination to address the compute problem, which is part of the strategy used by Simbiose Ventures to create a product called StorageQuery – A platform to query files in cloud storages with SQL.
This talk will focus on:
- How Alluxio fits StorageQuery’s tech stack;
- Advantages of using Alluxio as a cache layer and its unified filesystem
- Development of new under file system for Backblaze B2 and fine-grained code documentation;
- ShannonDB remote storage mode.
Alluxio 2.3 was just released at the end of June 2020. Calvin and Bin will go over the new features and integrations available and share learnings from the community. Any questions about the release and on-going community feature development are welcome.
In this Office Hour, we will go over:
- Glue Under Database integration
- Under Filesystem mount wizard
- Tiered Storage Enhancements
- Concurrent Metadata Sync
- Delegated Journal Backups
The hybrid cloud model, where cloud resources run Spark or Presto jobs against data stored on-premises, is an appealing solution to reduce resource contention in on-premise environments while also saving in overall costs. One key flaw in a hybrid model is the overhead associated with transferring data between the two environments. Data and metadata locality within the compute application must be achieved in order to maintain the similar performance of analytics jobs as if the entire workload was run on-premises.
In this office hour, we demonstrate how a “zero-copy burst” solution helps to speed up Spark and Presto queries in the public cloud while eliminating the process of manually copying and synchronizing data from the on-premise data lake to cloud storage. This approach allows compute frameworks to decouple from on-premise data sources and scale efficiently by leveraging Alluxio and public cloud resources such as AWS.
We will cover:
- Typical challenges of moving data to the cloud and expanding compute capacity.
- Details about “zero-copy” hybrid cloud solution for burst computing
- A demo of running Presto analytic queries using remote on-prem HDFS data with Alluxio deployed in AWS EMR
As the amount of data analyzed and stored continues to grow exponentially, fixed on-premises infrastructure like Apache Hadoop data lakes becomes costly. Add to that the need to support newer and popular frameworks on an already busy data lake, it is not uncommon to see Hadoop-based data lakes running at beyond 100% utilization and hybrid processing split between physical and cloud infrastructure. As a result, companies are looking to leverage the flexibility and cost savings of the cloud.
Join us for this tech talk where we will show you how Alluxio can help burst your private computing environment to Google Cloud, minimizing costs and I/O overhead. Alluxio coupled with Google’s open source data and analytics processing engine, Dataproc, enables zero-copy burst for faster query performance in the cloud so you can take advantage of resources that are not local to your data, without the need for managing the copying or syncing of that data.
We’ll also show a demo on how to get up and running with Alluxio and Dataproc, including how to:
- Setup your hybrid environment between your private datacenter and Google Cloud Platform
- Burst a Spark based machine learning algorithm to Dataproc while accessing on-prem data
- Scale analytic workloads directly on data on-prem without copying and synchronizing the data into the cloud
Today’s conventional wisdom states that network latency across the two ends of a hybrid cloud prevents you from running analytic workloads in the cloud with the data on-prem. As a result, most companies copy their data into a cloud environment and maintain that duplicate data. All of this means that it is challenging to make both on-prem HDFS data accessible with the desired application performance.
In this talk, we will show you how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud.
In this Office Hour, we will go over:
- A strategy to embrace the hybrid cloud, including an architecture for running ephemeral compute clusters using on-prem HDFS.
- An example of running on-demand Presto, Spark, and Hive with Alluxio in the public cloud.
- An analysis of experiments with TPC-DS to demonstrate the benefits of the given architecture.
Alluxio (alluxio.io) is an open-source data orchestration system that provides a single namespace federating multiple external distributed storage systems. It is critical for Alluxio to be able to store and serve the metadata of all files and directories from all mounted external storage both at scale and at speed.
This talk shares our design, implementation, and optimization of Alluxio metadata service (master node) to address the scalability challenges. Particularly, we will focus on how to apply and combine techniques including tiered metadata storage (based on off-heap KV store RocksDB), fine-grained file system inode tree locking scheme, embedded state-replicate machine (based on RAFT), exploration and performance tuning in the correct RPC frameworks (thrift vs gRPC) and etc. As a result of the combined above techniques, Alluxio 2.0 is able to store at least 1 billion files with a significantly reduced memory requirement, serving 3000 workers and 30000 clients concurrently.
In this Office Hour, we will go over how to:
- Metadata storage challenges
- How to combine different open source technologies as building blocks
- The design, implementation, and optimization of Alluxio metadata service