Overview
Monitoring metrics is essential to operating distributed systems in production. Alluxio uses the Codahale Metrics Library to collect metrics on I/O throughput, RPC throughput, and resource usage. These metrics are shown in the Alluxio web UI, and they are also available through a REST endpoint or can be exported as time series to several third-party sinks (see docs).
Grafana, a metrics visualization tool, ties into this process by pulling the metrics that systems like Alluxio export through a sink and visualizing them in a more helpful way. This guide covers how to set up Grafana with Graphite, a supported Alluxio sink that stores metrics in a time-series database, and explores some of the possibilities the combination offers.
Installation
There are many ways to install Graphite and Grafana: from source, locally, with Docker, and several others. Alluxio, Graphite, and Grafana each document their installation options. I opted to install using Docker in this guide to save time on managing dependencies; documentation is available on how to install Alluxio, Graphite, and Grafana with Docker. All containers, excluding Grafana, were made to run on the same network by adding `--net=alluxio_nw` to the `docker run` command. This lets the containers easily communicate with each other. All three services expose a web UI, which is also a convenient way to check that they are running correctly.
Configuration
Grafana does not collect any metrics directly but instead uses Graphite as a middle-man. Alluxio pushes its metrics to Graphite, which puts them in a time-series database, and Grafana then pulls those metrics from it.
Since Alluxio doesn’t communicate with Grafana directly, the only configuration change required is between Alluxio and Graphite. Alluxio already supports Graphite as a third-party sink, so the following needs to be added to Alluxio’s `metrics.properties` file, located in `/alluxio/conf`:
alluxio.metrics.sink.graphite.class=alluxio.metrics.sink.GraphiteSink
alluxio.metrics.sink.graphite.host=graphite
alluxio.metrics.sink.graphite.port=2003
alluxio.metrics.sink.graphite.period=10
Whether or not you use Docker, the lines above are needed in `metrics.properties`. Along with enabling the sink, they tell Alluxio where and how often to push its metrics. The edited `metrics.properties` file also has to be added to all Alluxio workers. To make sure that Graphite is correctly configured with Alluxio, go to Graphite’s web UI and check that Alluxio’s metrics are present.
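If no Alluxio metrics show up, it can help to first confirm that the Graphite listener is reachable at all. Port 2003 is Graphite's default plaintext listener, which accepts lines of the form `<path> <value> <timestamp>`. The sketch below is only a connectivity check under that assumption; the helper names and the metric path `test.alluxio_guide.ping` are made up for illustration and are not part of Alluxio.

```python
import socket
import time


def format_graphite_line(path, value, timestamp):
    """Format one sample for Graphite's plaintext protocol (newline-terminated)."""
    return f"{path} {value} {int(timestamp)}\n"


def send_test_metric(host="graphite", port=2003,
                     path="test.alluxio_guide.ping", value=1):
    """Push a single sample to the plaintext listener (hypothetical helper).

    The host and port defaults mirror the metrics.properties values above.
    """
    line = format_graphite_line(path, value, time.time())
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))
    return line
```

Calling `send_test_metric()` from a container on the same Docker network should make the test path appear in Graphite's metric tree shortly afterwards; if the connection is refused, the problem is with networking rather than with Alluxio's sink configuration.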
It can take around five to ten minutes for all of Alluxio’s metrics to appear in Graphite. Once Graphite is receiving metrics from Alluxio, it must be added as a data source in Grafana. Like Alluxio, Grafana has built-in support for Graphite, so configuration is easy: simply follow the instructions for adding Graphite as a data source. Enter the Graphite web UI URL under `HTTP URL`, set `HTTP Access` to `Browser`, then set the Graphite version. Afterwards Grafana will start pulling metrics from Graphite, and you can start displaying Alluxio’s metrics in Grafana!
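The same data source settings can also be written down as data rather than clicked through. The sketch below builds a JSON body of the kind Grafana's HTTP API accepts at `/api/datasources`. It assumes that `Browser` access corresponds to the API value `direct` and that the Graphite version is passed as `jsonData.graphiteVersion`; the URL, the version string, and the function name are placeholders, and field support can differ between Grafana versions, so check the documentation for the release you run.

```python
import json


def graphite_datasource_payload(graphite_url, version="1.1", name="Graphite"):
    """Build a data source definition for Graphite (hypothetical helper)."""
    return {
        "name": name,
        "type": "graphite",
        "url": graphite_url,        # the value entered under `HTTP URL`
        "access": "direct",         # assumed API equivalent of `Browser` access
        "jsonData": {"graphiteVersion": version},
    }


if __name__ == "__main__":
    # Placeholder URL: use the address at which your browser reaches Graphite.
    payload = graphite_datasource_payload("http://localhost:8080")
    print(json.dumps(payload, indent=2))
```

Keeping the definition in a script or file makes the setup repeatable when the Grafana container is recreated.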
Querying Metrics
To display Alluxio’s various metrics in Grafana, queries need to be created in dashboard panels. Panels are where metrics are displayed and where the queries that retrieve them are entered. Querying Graphite is very similar to navigating a file system: metric names are dot-separated paths, and each segment narrows the selection. Users also have a wide array of functions available for combining, filtering, and manipulating metrics in various ways. This gives users control over what they want displayed.
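To see what such a path-style query looks like outside Grafana, the sketch below builds a URL for Graphite's render endpoint, which returns the series matching a `target` expression. The base URL and the metric prefix `alluxio.master` are assumptions for illustration; browse Graphite's metric tree to find the names your deployment actually reports.

```python
from urllib.parse import urlencode


def render_url(base_url, target, since="-1h"):
    """Build a Graphite render URL that returns the target series as JSON."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


if __name__ == "__main__":
    # Hypothetical metric path; '*' selects every entry under alluxio.master.
    print(render_url("http://localhost:8080", "alluxio.master.*"))
```

Opening the printed URL in a browser is a quick way to check a query before turning it into a panel.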
Multiple queries can be added to a single panel, and functions can then be used to combine them by aggregating, taking the difference, or dividing one query by another. A query’s visibility can also be toggled, so a query can feed into a function without being displayed itself, leaving only the combined metric shown in the panel. The `*` wildcard is also helpful when creating queries, as it matches all available metrics at that level of the path, and is very useful when aggregating worker metrics. Grafana provides documentation for using Graphite, which is slightly out of date but still a useful starting point.
There are also options for how queries are displayed, including graphs, tables, gauges, and single stats. In many of these visualizations you can add ranges, units, and thresholds, and edit the legend, axis scale, and many other options, some of which are unique to each visualization. By combining the different types of queries with these visualizations, each graph or single stat can be unique.
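As a rough model of what a wildcard plus an aggregate function does, the sketch below matches a Graphite-style pattern against metric names and sums the latest values. In Graphite, `*` matches within a single dot-separated segment, which is what the regex translation mimics; only `*` is handled here, not the other pattern syntax Graphite supports. The metric names and values are invented for illustration, and in a real panel the equivalent would be a function such as `sumSeries(...)` applied to the wildcard path.

```python
import re


def graphite_glob_to_regex(pattern):
    """Translate a path pattern using '*' into a regex; '*' never crosses a dot."""
    parts = [re.escape(part).replace(r"\*", "[^.]*") for part in pattern.split(".")]
    return re.compile(r"\.".join(parts) + r"\Z")


def sum_matching(pattern, latest_values):
    """Sum the latest value of every metric whose name matches the pattern."""
    regex = graphite_glob_to_regex(pattern)
    return sum(value for name, value in latest_values.items() if regex.match(name))


if __name__ == "__main__":
    # Hypothetical per-worker metrics, as they might appear in Graphite.
    latest = {
        "alluxio.worker.host1.BytesReadLocal": 10,
        "alluxio.worker.host2.BytesReadLocal": 32,
        "alluxio.master.FilesCreated": 7,
    }
    print(sum_matching("alluxio.worker.*.BytesReadLocal", latest))
```

Because the wildcard stands in for the worker name, the same query keeps working as workers join or leave the cluster.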
Exploring Grafana
When creating a dashboard, a panel’s size can be changed to fit more metrics on the screen or to make the dashboard more visually appealing. A dashboard can also have rows, which allow metrics to be organized by type. These rows are collapsible and help keep the screen less crowded. By combining the queries with all the ways panels can be edited, metrics can be visualized in many different ways.
There were some challenges when I first started using Grafana. With all of the available options, getting started seemed difficult, and with so many functions available it was hard to find the right one for a given purpose. On top of that, the thresholds for graphs and singlestats can be awkward to use, and it should be noted that thresholds only support constant values, which limits what you can do with them.
Many of the editing options also provide little description of what they do. Nonetheless, after a short period of time I became comfortable with the software and was able to create a dashboard containing many of Alluxio’s metrics. The dashboard uses a small portion of the available features but can act as a starting point for your own dashboard if need be. This template containing Alluxio’s metrics can be found in /alluxio/integration/grafana/alluxio-grafana-template.json.
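Before importing a template, it can be handy to see which panels it defines. The sketch below walks a dashboard JSON document and collects panel titles. The layout it assumes (a top-level `panels` list, optionally nested under `rows` or inside row panels) is an assumption about Grafana's dashboard format in general, not something verified against the Alluxio template, so adjust it if the file is structured differently.

```python
import json


def panel_titles(dashboard):
    """Collect panel titles from a dashboard dict, including panels nested in rows."""
    titles = []
    # Assumed layouts: panels listed at the top level, or grouped under "rows";
    # a row panel may itself carry a nested "panels" list.
    containers = [dashboard] + list(dashboard.get("rows", []))
    for container in containers:
        for panel in container.get("panels", []):
            titles.append(panel.get("title", ""))
            titles.extend(p.get("title", "") for p in panel.get("panels", []))
    return titles


def load_dashboard(path):
    """Read a dashboard JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `panel_titles(load_dashboard("/alluxio/integration/grafana/alluxio-grafana-template.json"))` would list the panels in the template mentioned above, provided the file follows one of the assumed layouts.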
Hopefully, you can now confidently create your own Grafana dashboards to display any metrics you need!