
Monitoring Docker environments is challenging. Why? Because each container typically runs a single process, has its own environment, uses virtual networks, and may manage storage in various ways. Traditional monitoring solutions take metrics from each server and the applications it runs. These servers, and the applications running on them, are typically very static with very long uptimes.

Docker deployments are different: A set of containers may run many applications, all sharing the resources of one or more underlying hosts. It's not uncommon for Docker servers to run thousands of short-term containers (e.g., for batch jobs) while a set of permanent services runs in parallel. Traditional monitoring tools were not designed for such dynamic environments and are poorly suited to these deployments. On the other hand, some modern monitoring solutions (e.g., SPM from Sematext) were built with such dynamic systems in mind and even have out-of-the-box reporting for Docker monitoring. Moreover, container resource sharing calls for stricter enforcement of resource usage limits, an additional issue you must watch carefully. To make appropriate adjustments to resource quotas, you need good visibility into any limits containers have reached or errors they have caused. We recommend using alerts according to defined limits; this way you can adjust limits or resource usage even before errors start happening.

Watch Resources of your Docker Hosts

Host CPU

Understanding the CPU utilization of hosts and containers helps one optimize the resource usage of Docker hosts. The container CPU usage can be throttled in order to avoid a single busy container slowing down other containers by taking away all available CPU resources. Throttling the CPU time is a good way to ensure a minimum of processing power for essential services — it's like the good old nice levels in Unix/Linux.

When resource usage is optimized, high CPU utilization might actually be expected and even desired, and alerts might make sense only when CPU utilization drops (service outages) or stays above some maximum limit (e.g., 85%) for a longer period.
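The alerting idea above can be sketched in a few lines of Python. This is a minimal illustration, not any monitoring product's actual rule engine: the function name `cpu_alert`, the 5% lower bound, and the five-sample window are assumptions chosen for the example; only the 85% upper limit comes from the text.

```python
from typing import Optional, Sequence


def cpu_alert(samples: Sequence[float],
              high: float = 85.0,
              low: float = 5.0,
              window: int = 5) -> Optional[str]:
    """Classify the most recent `window` CPU samples (in percent).

    Returns "high" if every recent sample is above `high`,
    "low" if every recent sample is below `low`, otherwise None.
    """
    if len(samples) < window:
        return None  # not enough data to judge a sustained condition
    recent = samples[-window:]
    if all(s > high for s in recent):
        return "high"
    if all(s < low for s in recent):
        return "low"
    return None


# A short burst above 85% does not alert; a sustained one does.
print(cpu_alert([50, 90, 20, 90, 90, 90]))
print(cpu_alert([50, 90, 91, 92, 93, 95]))
```

Requiring every sample in the window to cross the limit is one simple way to ignore short bursts, which are often harmless on a well-utilized host.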

An overutilized Docker host is a sign of trouble. An underutilized host is a sign of wasting money.

Host Memory

The total memory used in each Docker host is important to know for the current operations and for capacity planning. Dynamic cluster managers like Docker Swarm use the total memory available on the host and the requested memory for containers to decide on which host a new container should ideally be launched. Deployments might fail if a cluster manager is unable to find a host with sufficient resources for the container. That's why it is important to know the host memory usage and the memory limits of containers. Adjusting the capacity of new cluster nodes according to the footprint of Docker applications could help optimize the resource usage.
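To make the placement decision described above concrete, here is a deliberately simplified sketch. It is not Docker Swarm's actual scheduling algorithm; the function `pick_host` and the "most free memory wins" rule are assumptions made for illustration.

```python
from typing import Dict, Optional


def pick_host(free_memory_mb: Dict[str, int], requested_mb: int) -> Optional[str]:
    """Pick the host with the most free memory that can fit the request.

    Returns None when no host has enough free memory, which is the
    situation in which a deployment would fail.
    """
    candidates = {host: free for host, free in free_memory_mb.items()
                  if free >= requested_mb}
    if not candidates:
        return None
    return max(candidates, key=lambda host: candidates[host])


hosts = {"node-1": 512, "node-2": 2048, "node-3": 128}
print(pick_host(hosts, 300))    # a host with enough room
print(pick_host(hosts, 4096))   # None: no host can take the container
```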

Host Disk Space

Docker images and containers consume additional disk space. For example, an application image might include a Linux operating system and might have a size of 150-700 MB, depending on the size of the base image and the tools installed in the container. Persistent Docker volumes consume disk space on the host as well. In our experience, watching the disk space and using cleanup tools is essential for continuous operation of Docker hosts.

Good kids clean up their rooms. Good Docker ops clean up their disks by removing unused containers & images.

Disk space usage on Docker hosts

Because disk space is critical, it makes sense to define alerts for disk space utilization to serve as early warnings and provide enough time to clean up disks or add additional volumes. For example, SPM automatically sets alert rules for disk space usage for you, so you don't have to remember to do it.
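If your monitoring tool does not set such a rule for you, an early-warning check is easy to sketch with the Python standard library. The 80% warning level and the function names are assumptions for this example; on a Linux Docker host you would typically point the check at `/var/lib/docker`, Docker's default data directory.

```python
import shutil


def disk_used_percent(path: str = ".") -> float:
    """Return the used share (0-100) of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_alert(used_percent: float, warn_at: float = 80.0) -> bool:
    """True when usage has reached the early-warning level."""
    return used_percent >= warn_at


used = disk_used_percent(".")
print(f"disk used: {used:.1f}%  alert: {disk_alert(used)}")
```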

A good practice is to run tasks to clean up the disk by removing unused containers and images frequently.

Total Number of Running Containers

The current and historical number of containers is an interesting metric for many reasons. For example, it is very handy during deployments and updates to check that everything is running like before.

When cluster managers like Docker Swarm, Mesos, Kubernetes, and CoreOS/Fleet automatically schedule containers to run on different hosts using different scheduling policies, the number of containers running on each host can help one verify the activated scheduling policies. A stacked bar chart displaying the number of containers on each host and the total number of containers provides a quick visualization of how the cluster manager distributed the containers across the available hosts.

This metric can have different "patterns" depending on the use case. For example, batch jobs running in containers vs. long-running services commonly result in different container count patterns. A batch job typically starts a container on demand, or starts it periodically, and the container with that job terminates after a relatively short time. In such a scenario one might see a big variation in the number of containers running resulting in a "spiky" container count metric. On the other hand, long-running services, such as web servers or databases, typically run until they get re-deployed during software updates. Although scaling mechanisms might increase or decrease the number of containers depending on load, traffic, and other factors, the container count metric will typically be relatively steady because, in such cases, containers are often added and removed more gradually. Because of that, there is no general pattern we could use for a default Docker alert rule on the number of running containers.

Nevertheless, alerts based on anomaly detection, which detect sudden changes in the number of containers (in total or on specific hosts) within a short time period, can be very handy for most use cases. Simple threshold-based alerts make sense only when the maximum or minimum number of running containers is known, and in dynamic environments that scale up and down based on external factors, this is often not the case.
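As a rough illustration of the anomaly-detection approach, the sketch below flags a container count that deviates strongly from recent history using a z-score. This is one simple technique among many and not necessarily what any particular monitoring product implements; the function name and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_anomalous(history: Sequence[int], current: int,
                 z_threshold: float = 3.0) -> bool:
    """Flag `current` if it lies more than `z_threshold` standard
    deviations away from the mean of the recent `history`."""
    if len(history) < 2:
        return False  # too little history to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > z_threshold


recent_counts = [10, 11, 10, 9, 10, 11]
print(is_anomalous(recent_counts, 10))  # steady service
print(is_anomalous(recent_counts, 30))  # sudden jump
```

For "spiky" batch workloads the history itself has a large spread, so the same rule tolerates much bigger swings there, which is the property a fixed threshold lacks.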

Container Metrics

Container metrics are basically the same metrics available for every Linux process, but they include limits set via cgroups by Docker, such as limits for CPU or memory usage. Please note that sophisticated monitoring solutions like SPM for Docker are able to aggregate container metrics on different levels, such as Docker host/cluster node, image name or ID, and container name or ID. Having the ability to do that makes it easy to track resource usage by hosts, application types (image names), or specific containers. In the following examples, we might use aggregations on various levels.
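The aggregation levels mentioned above amount to grouping the same per-container samples by different keys. A minimal sketch, assuming each sample is a plain dictionary with hypothetical field names (`host`, `image`, `container`, `cpu_percent`):

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def aggregate(samples: Iterable[Mapping[str, Any]],
              group_by: str, metric: str) -> Dict[str, float]:
    """Sum `metric` over all samples that share the same `group_by` value."""
    totals: Dict[str, float] = defaultdict(float)
    for sample in samples:
        totals[str(sample[group_by])] += float(sample[metric])
    return dict(totals)


# Hypothetical per-container samples.
samples = [
    {"host": "node-1", "image": "nginx", "container": "web-1", "cpu_percent": 10.0},
    {"host": "node-1", "image": "redis", "container": "cache-1", "cpu_percent": 5.0},
    {"host": "node-2", "image": "nginx", "container": "web-2", "cpu_percent": 20.0},
]

print(aggregate(samples, "host", "cpu_percent"))
print(aggregate(samples, "image", "cpu_percent"))
```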

Use modern Docker monitoring solutions to slice and dice by host, node, image, or container. You'll need that.

Container CPU – Throttled CPU Time

One of the most basic pieces of information is how much CPU is being consumed by all containers, by images, or by specific containers. A great advantage of using Docker is the capability to limit CPU utilization by containers. Of course, you can't tune and optimize something if you don't measure it, so monitoring such limits is the prerequisite. Observing the total time that a container's CPU usage was throttled provides the information one needs to adjust the setting for CPU shares in Docker. Please note that with CPU shares, CPU time is restricted only when the host CPU usage is maxed out. As long as the host has spare CPU cycles available for Docker, it will not throttle containers' CPU usage. Therefore, the throttled CPU is typically zero, and a spike in this metric is typically a good indication of one or more containers needing more CPU power than the host can provide. A hard CPU quota, by contrast, is enforced regardless of host load.

Container CPU usage and throttled CPU time

The following screenshot shows containers with a 5% CPU quota, started with the command "docker run --cpu-quota=5000 nginx". We see clearly how the throttled CPU grows until it reaches around 5%, enforced by the Docker engine.

Container CPU usage and throttled CPU time with CPU quota of 5%
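To see why a quota of 5000 corresponds to 5%, recall that the quota is expressed in microseconds per scheduling period, and Docker's default period is 100000 microseconds. The sketch below shows that arithmetic and how the throttling counters could be read from the text of a cgroup `cpu.stat` file. The helper names and the sample numbers are made up for illustration; the `nr_periods` and `nr_throttled` keys are the ones the Linux CFS bandwidth controller reports.

```python
from typing import Dict


def quota_percent(cpu_quota_us: int, cpu_period_us: int = 100_000) -> float:
    """CPU share of one core granted by a quota, in percent."""
    return 100.0 * cpu_quota_us / cpu_period_us


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines as found in a cgroup cpu.stat file."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats: Dict[str, int]) -> float:
    """Fraction of scheduling periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Made-up sample content, not output from a real container.
sample = "nr_periods 200\nnr_throttled 50\nthrottled_time 123456789\n"

print(quota_percent(5000))                     # 5.0
print(throttled_ratio(parse_cpu_stat(sample))) # 0.25
```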

Container Memory — Fail Counters

It is a good practice to set memory limits for containers. Doing that helps avoid a memory-hungry container taking all available memory and starving all other containers on the same server. Runtime constraints on resources can be defined in the Docker run command. For example, "-m 300M" sets the memory limit for the container to 300 MB. Docker exposes a metric called container memory fail counters. This counter is increased each time memory allocation fails — that is, each time the pre-set memory limit is hit. Thus, spikes in this metric indicate one or more containers needing more memory than was allocated. If the process in the container terminates because of this error, we might also see out of memory events from Docker.

A spike in memory fail counters is a critical event, and putting alerts on the memory fail counter is very helpful to detect poor settings for the memory limits or to discover containers that try to consume more memory than expected.
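Because the fail counter only ever increases, an alert is best placed on its change between two samples rather than on its absolute value. A minimal sketch, with hypothetical function names and sample values:

```python
from typing import List, Sequence


def failcnt_increases(samples: Sequence[int]) -> List[int]:
    """Per-interval increase of a cumulative fail counter.

    A drop in the raw value (e.g. after a container restart resets the
    counter) is reported as 0 rather than as a negative increase.
    """
    return [max(0, curr - prev) for prev, curr in zip(samples, samples[1:])]


def should_alert(samples: Sequence[int]) -> bool:
    """Alert as soon as any interval saw the memory limit being hit."""
    return any(delta > 0 for delta in failcnt_increases(samples))


readings = [0, 0, 3, 3, 10]
print(failcnt_increases(readings))  # [0, 3, 0, 7]
print(should_alert(readings))       # True
```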

Container Memory Usage

Different applications have different memory footprints. Knowing the memory footprint of the application containers is important for having a stable environment. Container memory limits ensure that applications perform well, without using too much memory, which could affect other containers on the same host. The best practice is to tune memory settings in a few iterations:

Monitor memory usage of the application container.

Set memory limits according to the observations.

Continue monitoring memory, memory fail counters, and Out-Of-Memory events. If OOM events happen, the container memory limits may need to be increased, or debugging is required to find the reason for the high memory consumption.

Container memory usage
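The step from observation to limit in the iteration above can be sketched as "observed peak plus some headroom". The 25% headroom and the function name are assumptions for this example, not a recommendation from Docker; the right margin depends on the application.

```python
import math
from typing import Sequence


def suggest_limit_mb(observed_mb: Sequence[float], headroom: float = 0.25) -> int:
    """Suggest a memory limit: observed peak usage plus a safety margin."""
    if not observed_mb:
        raise ValueError("need at least one memory usage observation")
    return math.ceil(max(observed_mb) * (1 + headroom))


# Usage samples in MB collected while monitoring the container.
print(suggest_limit_mb([200, 240, 220]))  # 300, i.e. "-m 300M"
```

If memory fail counters or OOM events still show up after applying the suggested value, the next iteration starts again from the newly observed usage.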

Container Swap

Like the memory of any other process, a container's memory could be swapped to disk. For applications like Elasticsearch or Solr, one often finds instructions to deactivate swap on the Linux host, but if you run such applications on Docker it might be sufficient to set "--memory-swap" to the same value as the memory limit in the Docker run command (for example, "-m 300M --memory-swap=300M"). Note that "--memory-swap=-1" does the opposite: it allows the container to use unlimited swap.

Don't like to see your container swapping? Set --memory-swap equal to the memory limit in the Docker run command and be done with it!

Container swap, memory pages, and swap rate

Container Disk I/O

In Docker, multiple applications use the same resources concurrently. Thus, watching the disk I/O helps one define limits for specific applications and give higher throughput to critical applications like data stores or web servers, while throttling disk I/O for batch operations. For example, the command docker run -it --device-write-bps /dev/sda:1mb mybatchjob would limit the container disk writes to a maximum of 1 MB/s.

Container I/O throughput

To keep a Docker container from eating all your disk I/O, use e.g. --device-write-bps /dev/sda:1mb.

Container Network Metrics

Networking for containers can be very challenging. By default, all containers share a network, or containers might be linked together to share a separate network on the same host. However, when it comes to networking between containers running on different hosts, an overlay network is required, or containers could share the host network. Having many options for network configurations means there are many possible causes of network errors.

Moreover, errors or dropped packets aren't the only important things to watch out for. Today, most applications are deeply dependent on network communication. Throughput of virtual networks could be a bottleneck, especially for containers like load balancers. In addition, the network traffic might be a good indicator of how much applications are used by clients, and sometimes you might see high spikes, which could indicate denial-of-service attacks, load tests, or a failure in client apps. So watch the network traffic; it is a useful metric in many cases.

Network traffic and transmission rates
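Network interfaces report cumulative byte counters, so the transmission rates discussed above are derived from the difference between two readings. A minimal sketch, with a hypothetical function name and made-up readings:

```python
def rate_per_second(prev_bytes: int, curr_bytes: int, interval_s: float) -> float:
    """Average bytes per second between two cumulative counter readings."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # Counter went backwards, e.g. the container was restarted:
        # assume it restarted from zero and count only the new bytes.
        delta = curr_bytes
    return delta / interval_s


print(rate_per_second(1_000, 6_000, 10))  # 500.0 bytes/s
print(rate_per_second(5_000, 200, 10))    # 20.0 bytes/s after a reset
```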

Summary

There you have it — the top Docker metrics to watch. Staying focused on these top metrics and corresponding analysis will help you stay on the road while driving towards successful Docker deployments on many platforms such as Docker Swarm, Docker Cloud, Docker Datacenter, or any other platform supporting Docker containers.
