More Fun with NGINX Plus Health Checks and Docker Containers

At nginx.conf 2017, I gave a presentation on how to use NGINX Plus health checks with Docker containers. You can access the presentation as a YouTube video or a blog post, which includes the PowerPoint slides and a transcription of my talk. In this post, I’ll describe an improved version of the basic approach, then present working configuration code you can use to implement it yourself.

Introduction

When running containers in a microservices environment, your service instances are susceptible to becoming overloaded due to resource limitations, such as memory or CPU. There are a number of strategies for addressing this issue; this blog post discusses one that relies on NGINX Plus active health checks.

We’ll focus on methods for three use cases:

Request‑count–based. Use this method when requests to a service are so heavyweight that a service instance can only handle one request at a time.

CPU‑based. Use this method when CPU utilization is the main limiting factor and you want to set a CPU‑usage threshold above which the service doesn’t accept new requests.

Memory‑usage–based. Use this method when memory utilization is the main limiting factor and you want to set a memory‑usage threshold above which the service doesn’t accept new requests.

All three methods work in the same fundamental way. NGINX Plus calls a program that implements an active health check based on one of the methods above and returns a server status of either unhealthy or healthy. NGINX Plus removes an unhealthy server from the load‑balancing rotation, and keeps a healthy server in the rotation (or adds it back if it was previously unhealthy).

Health Check Approaches

Let’s get into the details of each method. Code for the examples is available in the NGINX repo on GitHub.

For all of the examples, we’re using NGINX Plus as the load balancer and NGINX Unit as the application server, with two examples written in PHP and one written in Python. Everything runs in Docker containers.

Request-Count–Based

For this method, the application creates a semaphore file, /tmp/busy, when it receives a request, then removes the file when it’s finished processing the request. The health check determines whether the file exists on a given service instance. If it does, the instance is considered unhealthy and NGINX Plus stops sending requests to it. If the file doesn’t exist, the instance is considered healthy and NGINX Plus sends requests to it.

The example uses a single Python program, testcnt.py, to implement both the application and the health check; which function to execute is determined by the request URI.

The shortest interval between health checks is one second, so it can take that long for NGINX Plus to see that a service instance is unhealthy (busy). During that time, NGINX Plus might send another request to the service instance. To handle this case, the application returns status code 503 if it is already processing a request when another request arrives. If this happens, NGINX Plus tries another instance.
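The semaphore-file logic can be sketched as a small WSGI application. This is a minimal illustration, not the testcnt.py program from the repo: the /healthcheck URI, the response bodies, and the work callback are hypothetical names chosen for the sketch, and the busy file is placed in the system temporary directory (normally /tmp on Linux).

```python
import os
import tempfile

# Semaphore file; resolves to /tmp/busy on a typical Linux container.
BUSY_FILE = os.path.join(tempfile.gettempdir(), "busy")


def _respond(start_response, status, body):
    start_response(status, [("Content-Type", "text/plain")])
    return [body.encode("utf-8")]


def application(environ, start_response, work=lambda: None):
    """WSGI entry point: the request URI selects health check or application.

    `work` is a hypothetical stand-in for the heavyweight request handler.
    """
    if environ.get("PATH_INFO") == "/healthcheck":  # hypothetical URI
        if os.path.exists(BUSY_FILE):
            return _respond(start_response, "503 Service Unavailable", "busy\n")
        return _respond(start_response, "200 OK", "idle\n")

    try:
        # O_EXCL makes creation atomic: it fails if the file already exists,
        # which means another request is still being processed.
        fd = os.open(BUSY_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return _respond(start_response, "503 Service Unavailable", "busy\n")
    os.close(fd)
    try:
        work()
        return _respond(start_response, "200 OK", "done\n")
    finally:
        # Always clear the semaphore, even if the handler raised.
        os.remove(BUSY_FILE)
```

Creating the file with O_EXCL rather than checking for it first avoids a race between two requests that arrive at nearly the same time; only one of them can create the file, and the other gets the 503 that prompts NGINX Plus to try another instance.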

CPU-Based

You can use the Docker API to get CPU‑usage metrics for a container, but they are relative to the Docker host. In other words, if the Docker API reports that the CPU usage for a container is 25%, that means 25% of the Docker host’s CPU.

For this example, we set a threshold of 70% for the application, and assign each container in the application an equal share of the threshold percentage. For example, if there is one container, it can use 70% of the Docker host’s CPU. If there are two containers, each can use 35% of the Docker host’s CPU. We use the NGINX Plus API to get the number of containers for the application.
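The threshold arithmetic is simple enough to show in a few lines of Python. This is a sketch only: the function names are hypothetical, and it assumes the upstream object returned by the NGINX Plus API has already been fetched and parsed from JSON, with one entry per container in its peers list.

```python
def count_peers(upstream):
    """Count the servers in a parsed NGINX Plus API upstream object.

    Assumes the object carries a "peers" list with one entry per container.
    """
    return len(upstream.get("peers", []))


def per_container_threshold(total_percent, containers):
    """Split the application-wide CPU threshold evenly across containers."""
    if containers < 1:
        raise ValueError("need at least one container")
    return total_percent / containers


# Two containers sharing a 70% application threshold get 35% each.
upstream = {"peers": [{"id": 0}, {"id": 1}]}
limit = per_container_threshold(70, count_peers(upstream))
```

Because the share is recomputed from the current peer count, each container's limit shrinks automatically as the application scales out.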

There are two PHP programs: testcpu.php generates CPU load and hcheck.php does the health check.

To get statistics for a container, the health‑check program makes the following call to the Docker API on the Docker host:

Calculating CPU usage requires two calls to the API, one second apart in the example. CPU usage is calculated by comparing the cpu_stats.cpu_usage.total_usage fields in the two calls.
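As a rough sketch of that calculation, the following Python function takes two parsed statistics samples and returns the container's CPU usage as a percentage of the host. The text above names only cpu_stats.cpu_usage.total_usage; dividing by the change in cpu_stats.system_cpu_usage is an assumption made here to express the result relative to the host, and the health-check program in the repo may normalize differently. The function name is hypothetical.

```python
def host_cpu_percent(first, second):
    """Return container CPU usage between two samples as a % of host CPU.

    `first` and `second` are parsed Docker statistics objects taken some
    time apart (one second in the article's example).
    """
    cpu_delta = (second["cpu_stats"]["cpu_usage"]["total_usage"]
                 - first["cpu_stats"]["cpu_usage"]["total_usage"])
    # Assumption: use the host-wide counter as the denominator.
    system_delta = (second["cpu_stats"]["system_cpu_usage"]
                    - first["cpu_stats"]["system_cpu_usage"])
    if system_delta <= 0:
        return 0.0  # no measurable interval between the samples
    return 100.0 * cpu_delta / system_delta
```

The health check would then compare this value with the container's share of the application threshold and report unhealthy when it is exceeded.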

Memory-Usage–Based

As in the CPU‑based example, this example uses the Docker API to retrieve memory‑usage metrics. Each container is limited to 128 megabytes of memory and the memory‑usage metrics are relative to this limit.

There are two PHP programs: testmem.php uses memory and hcheck.php does the health check. If memory usage is above 70%, the health check returns a status of unhealthy.

The health check makes the same Docker API call as for the CPU‑usage method, but uses different fields: the percentage of memory used is memory_stats.usage divided by memory_stats.stats.hierarchical_memory_limit.
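In Python, the memory calculation might look like the sketch below. The field names come from the description above; the function names and the default threshold argument are hypothetical, and the statistics object is assumed to have been fetched and parsed from JSON already.

```python
def memory_percent(stats):
    """Return memory used as a percentage of the container's limit."""
    mem = stats["memory_stats"]
    return 100.0 * mem["usage"] / mem["stats"]["hierarchical_memory_limit"]


def is_healthy(stats, threshold_percent=70.0):
    """Healthy unless memory usage is above the threshold."""
    return memory_percent(stats) <= threshold_percent


# A container using 96 MB of its 128 MB limit is at 75% and is unhealthy.
MB = 1024 * 1024
sample = {"memory_stats": {"usage": 96 * MB,
                           "stats": {"hierarchical_memory_limit": 128 * MB}}}
healthy = is_healthy(sample)
```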

Configuring NGINX

No changes to the main NGINX configuration file (/etc/nginx/nginx.conf) are required. However, if you want to see detailed messages about health checks in the error log, set the severity level to info, as in this example:

error_log /var/log/nginx/error.log info;

The NGINX Plus configuration for the sample applications follows. As you read it, and especially if you use or adapt it, keep these points in mind:

Consul is used for DNS service discovery. Both Consul and NGINX Plus support DNS SRV records, which means that NGINX Plus can get the port numbers as well as the IP addresses of the containers. This is necessary because Docker port mapping is used.

The first server block, listening on port 80, enables sending requests directly to the program that does health checks. This is required so we can see what an unhealthy health check looks like. We can’t see that type of response if we send a request to a health‑check program via a virtual server that has health checks configured, because NGINX Plus doesn’t forward requests to unhealthy servers.

For the sake of easy understanding, the configuration is minimal – it doesn’t include all the directives a best‑practices configuration has.

The health‑check intervals are all short, so the system responds quickly during a demo. In a production environment, the one‑second interval for the count‑based health check is probably still suitable, since you want NGINX Plus to stop sending requests as soon as possible after the service becomes busy. For the CPU and memory health checks, you might set a longer interval in production.

Both this configuration and the CPU health‑check program use version 2 of the NGINX Plus API, introduced in NGINX Plus R14. The configuration also enables the built‑in live activity monitoring dashboard, which uses the same API.

The configuration and programs are examples of possible ways to use active health checks on applications in Docker containers. They have not been tested in production or at scale.
