Reducing large S3 API costs using Alluxio at Datasapiens

December 13, 2020

Juraj Pohanka

Koen Michiels

Datasapiens is an international data-analytics startup based in Prague. We help our clients to uncover the value of their data and open up new revenue streams for them. We provide an end-to-end service that manages the data pipeline and automates the process of generating data insights.

In this talk, we describe how we solved the problem of high S3 API costs incurred by Presto at several levels of query concurrency by deploying Alluxio as a data orchestration layer between S3 and Presto. We also present the results of an experiment that estimates per-query S3 API costs using the TPC-DS dataset.

This talk will focus on:

The Hadoop ecosystem at Datasapiens
Drastic increase in S3 API costs during performance tests with Presto
S3 API cost tests with TPC-DS
Implications for the cloud data lake architecture
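
As a rough illustration of the per-query cost estimation mentioned above, the following minimal sketch converts S3 request counts into a dollar figure and shows the effect of serving a fraction of read requests from a cache. It is not the method used in the talk. The prices, the request counts in the usage note, and the uniform cache hit ratio are all assumptions chosen for illustration; actual prices depend on region and storage class.

```python
# Illustrative prices in USD per 1,000 requests.
# ASSUMPTION: placeholder values; check current AWS pricing for your region.
PRICE_PER_1000 = {"GET": 0.0004, "LIST": 0.005, "PUT": 0.005}

# Request types that a read cache could plausibly absorb (assumption).
CACHEABLE = ("GET", "LIST")


def s3_api_cost(request_counts):
    """Return the estimated USD cost for a dict of request type -> count."""
    return sum(
        PRICE_PER_1000[kind] * count / 1000
        for kind, count in request_counts.items()
    )


def cost_with_cache(request_counts, hit_ratio):
    """Estimate the cost when a fraction of read requests never reaches S3.

    ASSUMPTION: a single hit ratio applies uniformly to all cacheable
    request types, which simplifies how a real caching layer behaves.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    remaining = {
        kind: count * (1.0 - hit_ratio) if kind in CACHEABLE else count
        for kind, count in request_counts.items()
    }
    return s3_api_cost(remaining)
```

For example, a hypothetical query that issues 2,000,000 GET and 10,000 LIST requests can be compared with and without caching by calling s3_api_cost({"GET": 2_000_000, "LIST": 10_000}) and cost_with_cache({"GET": 2_000_000, "LIST": 10_000}, 0.9).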

Video:

Presentation Slides:

Reducing large S3 API costs using Alluxio at Datasapiens from Alluxio, Inc.
