A Docker Captain's Blog

Docker | Kubernetes | Cloud

Docker, Prometheus & Pushgateway for NVIDIA GPU Metrics & Monitoring

In my last blog post, I talked about how to get started with NVIDIA docker & interaction with NVIDIA GPU system. I demonstrated NVIDIA Deep Learning GPU Training System, a.k.a DIGITS by running it inside Docker container. ICYMI – DIGITS is essentially a webapp for training deep learning models and is used to rapidly train the highly accurate deep neural network (DNNs) for image classification, segmentation and object detection tasks.The currently supported frameworks are: Caffe, Torch, and Tensorflow. It simplifies common deep learning tasks such as managing data, designing and training neural networks on multi-GPU systems, monitoring performance in real time with advanced visualizations, and selecting the best performing model from the results browser for deployment.

In a typical HPC environment where you run 100 and 100s of NVIDIA GPU equipped cluster of nodes, it becomes important to monitor those systems to gain insight of the performance metrics, memory usage, temperature and utilization. . Tools like Ganglia & Nagios etc. are very popular due to their scalable & distributed monitoring architecture for high-performance computing systems such as clusters and Grids. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. But with the advent of container technology, there is a need of modern monitoring tools and solutions which works well with Docker & Microservices.

It’s all modern world of Prometheus Stack…

Prometheus is 100% open-source service monitoring system and time series database written in Go.It is a full monitoring and trending system that includes built-in and active scraping, storing, querying, graphing, and alerting based on time series data. It has knowledge about what the world should look like (which endpoints should exist, what time series patterns mean trouble, etc.), and actively tries to find faults.

How is it different from Nagios?

Though both serves a purpose of monitoring, Prometheus wins this debate with the below major points –

Nagios is host-based. Each host can have one or more services, which has one check.There is no notion of labels or a query language. But Prometheus comes with its robust query language called “PromQL”. Prometheus provides a functional expression language that lets the user select and aggregate time series data in real time. The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus’s expression browser, or consumed by external systems via the HTTP API.

Nagios is suitable for basic monitoring of small and/or static systems where blackbox probing is sufficient. But if you want to do whitebox monitoring, or have a dynamic or cloud based environment then Prometheus is a good choice.

Nagios is primarily just about alerting based on the exit codes of scripts. These are called “checks”. There is silencing of individual alerts, however no grouping, routing or deduplication.

Let’s talk about Prometheus Pushgateway..

Occasionally you will need to monitor components which cannot be scraped. They might live behind a firewall, or they might be too short-lived to expose data reliably via the pull model. The Prometheus Pushgateway allows you to push time series from these components to an intermediary job which Prometheus can scrape. Combined with Prometheus’s simple text-based exposition format, this makes it easy to instrument even shell scripts without a client library.

The Prometheus Pushgateway allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus. It is important to understand that the Pushgateway is explicitly not an aggregator or distributed counter but rather a metrics cache. It does not have statsd-like semantics. The metrics pushed are exactly the same as you would present for scraping in a permanently running program.For machine-level metrics, the textfile collector of the Node exporter is usually more appropriate. The Pushgateway is intended for service-level metrics. It is not an event store.

Under this blog post, I will showcase how NVIDIA Docker, Prometheus & Pushgateway come together to push NVIDIA GPU metrics to Prometheus Stack.

Infrastructure Setup:

Docker Version: 17.06

OS: Ubuntu 16.04 LTS

Environment : Managed Server Instance with GPU

GPU: GeForce GTX 1080 Graphics card

Cloning the GITHUB Repository

Run the below command to clone the below repository to your Ubuntu 16.04 system equipped with GPU card:

NVIDIA provides a python module for monitoring NVIDIA GPUs using the newly released Python bindings for NVML (NVIDIA Management Library). These bindings are under BSD license and allow simplified access to GPU metrics like temperature, memory usage, and utilization.

Next, under the same directory, you will find a python script called “test.py”.