Streaming systems form the backbone of the modern data pipeline: stream processing provides insights on events as they arrive. But what if we want to go further and execute analytical queries on this real-time data? That’s where Apache Pinot comes in.
OLAP databases used for analytical workloads traditionally executed queries on yesterday’s data, with query latencies in the tens of seconds. The emergence of real-time analytics has changed all this: the expectation is now that we can run thousands of queries per second on fresh data, with the query latencies typically seen on OLTP databases.
Apache Pinot is a real-time distributed OLAP datastore built to deliver scalable real-time analytics with low latency. It can ingest data from streaming sources like Kafka as well as from batch data sources (S3, HDFS, Azure Data Lake, Google Cloud Storage), and it provides a layer of indexing techniques that can be used to maximize query performance.
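To make that concrete, here is a minimal sketch of querying Pinot from Python with the community pinotdb client; the broker address, the events table, and its columns are illustrative assumptions, not part of the talk.

```python
# Minimal sketch: run SQL against a Pinot broker with the pinotdb client
# (pip install pinotdb). Host/port, table, and column names are assumptions.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# Aggregate fresh events from the last minute; ago() is a Pinot scalar
# function that converts an ISO-8601 duration into epoch milliseconds.
curs.execute("""
    SELECT eventType, COUNT(*) AS cnt
    FROM events
    WHERE tsMillis > ago('PT1M')
    GROUP BY eventType
    ORDER BY cnt DESC
    LIMIT 10
""")

for row in curs:
    print(row)
```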
Come to this talk to learn how you can add real-time analytics capability to your data pipeline.
ALLUXIO DAY XV 2022
September 15, 2022
Videos
TorchTitan is a proof of concept for large-scale LLM training using native PyTorch. It is a repository that showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.
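As a taste of the kind of native PyTorch feature TorchTitan composes, here is a hedged sketch of sharded data-parallel training with FSDP; this is an illustrative stand-in, not TorchTitan's actual code, and the model and loss below are dummies.

```python
# Illustrative FSDP sketch (not TorchTitan code). Launch with e.g.:
#   torchrun --nproc_per_node=8 train.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# A small stand-in model; TorchTitan itself targets Llama-style transformers.
model = torch.nn.Transformer(d_model=512, num_encoder_layers=6).cuda()
model = FSDP(model)  # shard parameters and gradients across ranks

optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

src = torch.randn(10, 32, 512, device="cuda")
tgt = torch.randn(20, 32, 512, device="cuda")

out = model(src, tgt)
loss = out.float().pow(2).mean()  # dummy loss, for illustration only
loss.backward()                   # gradients are reduce-scattered across ranks
optim.step()

dist.destroy_process_group()
```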
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware like NVMe storage and RDMA networks (InfiniBand or specialized NICs) is becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer that uses NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
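For context, here is a hedged sketch of the consumer side of such a setup: a PyTorch DataLoader reading training samples through a distributed cache exposed as a POSIX mount. The mount path, file layout, and tuning values are illustrative assumptions, not the speakers' actual configuration.

```python
# Hedged sketch: feed PyTorch from a cache layer exposed as a POSIX mount.
# The /mnt/cache path and .pt file layout are assumptions for illustration.
import os
import torch
from torch.utils.data import Dataset, DataLoader

class CachedFileDataset(Dataset):
    """Loads pre-serialized tensors from a cache-backed directory."""

    def __init__(self, root):
        self.paths = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Hot reads are served from local NVMe via the cache; misses fall
        # through to the remote store (e.g. S3/HDFS) over the network.
        return torch.load(self.paths[idx])

loader = DataLoader(
    CachedFileDataset("/mnt/cache/train"),
    batch_size=32,
    num_workers=8,    # parallel readers to keep GPUs from starving
    pin_memory=True,  # page-locked buffers for faster host-to-GPU copies
)

for batch in loader:
    pass  # training step would go here
```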