Effective ELB Health Checks Part 2: How it Works

Previously, we discussed the five things to consider when setting up health checks for the Classic AWS Elastic Load Balancer (ELB). But every application is a unique snowflake, and the same holds true for its health checks. To make yours effective, you’ll need to understand a little more about how health checks actually work (feel free to tcpdump this to see it in action).

When a new instance is added to your ELB, the ELB immediately sends it a health check. Your instance could be marked healthy or unhealthy at this point, based on the result of that initial check. From there, your health check interval and health thresholds take over. Your application engineering teams will need to pick thresholds and intervals that give your application time to become healthy again, while also eliminating false positives and tolerating the latency increases caused by routine operations (DB backups, etc.).
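To make the trade-off between intervals and thresholds concrete, here is a rough back-of-the-envelope sketch. It assumes that an instance changes state after roughly interval × threshold seconds of consecutive results, and it ignores the per-check timeout and where in the interval a failure begins. The `HealthCheckConfig` struct and the numbers are illustrative assumptions, not recommended settings.

```ruby
# Rough timing model for Classic ELB health check settings.
# The field names mirror the ELB settings; the values are illustrative only.
HealthCheckConfig = Struct.new(:interval, :unhealthy_threshold,
                               :healthy_threshold, keyword_init: true) do
  # Approximate seconds of sustained failure before the instance is pulled.
  def seconds_to_unhealthy
    interval * unhealthy_threshold
  end

  # Approximate seconds of sustained success before the instance returns.
  def seconds_to_healthy
    interval * healthy_threshold
  end

  # Would a slowdown of this length (say, a DB backup) go unpunished?
  def rides_out?(blip_seconds)
    blip_seconds < seconds_to_unhealthy
  end
end

configs = {
  "sensitive" => HealthCheckConfig.new(interval: 5, unhealthy_threshold: 2,
                                       healthy_threshold: 2),
  "tolerant"  => HealthCheckConfig.new(interval: 30, unhealthy_threshold: 5,
                                       healthy_threshold: 3)
}

configs.each do |name, cfg|
  puts "#{name}: out after ~#{cfg.seconds_to_unhealthy}s, " \
       "back after ~#{cfg.seconds_to_healthy}s, " \
       "rides out a 60s slowdown: #{cfg.rides_out?(60)}"
end
```

The sensitive settings react quickly but would pull an instance during a one-minute slowdown; the tolerant settings ride it out at the cost of slower detection of real failures.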

After you’ve settled on a strategy, you’ll need to build your health check page. This process is more complicated than often assumed. Many teams simply make an empty page that returns an HTTP 200, or rely on the instance accepting a TCP connection. Sure, that’s a simple health check, and it’s fine if all you want to know is that your web server is running or that a connection can be established, but it won’t tell you whether your application is functioning properly.

There are three effective strategies to test your application:

Test your application’s status externally to the process and test your dependencies in a “canary” style test suite. Only upon success would you return a success.

Build your test page into your application directly.

Flex your application functionality from outside of the application’s primary service process with a separate health monitoring microservice.
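As one possible shape for the third strategy, the sketch below runs as its own small process, exercises the application over HTTP, and answers the ELB on a separate port. Everything named here is an assumption for illustration: `app_responds?`, `health_response`, `serve_health`, and the host, port, and path are hypothetical, and the server loop is deliberately minimal (single-threaded, no logging).

```ruby
require "net/http"
require "socket"

# Exercise the application the way a client would. Host, port, and path are
# hypothetical; point this at a request that touches real functionality.
def app_responds?(host: "127.0.0.1", port: 3000, path: "/")
  response = Net::HTTP.start(host, port, open_timeout: 2, read_timeout: 2) do |http|
    http.get(path)
  end
  response.is_a?(Net::HTTPSuccess)
rescue StandardError
  false # refused, timed out, or otherwise broken: treat as unhealthy
end

# Build a raw HTTP response: 200 when healthy, 503 otherwise.
def health_response(healthy)
  status = healthy ? "200 OK" : "503 Service Unavailable"
  body = healthy ? "OK\n" : "UNHEALTHY\n"
  "HTTP/1.1 #{status}\r\nContent-Type: text/plain\r\n" \
    "Content-Length: #{body.bytesize}\r\nConnection: close\r\n\r\n#{body}"
end

# Answer health checks on a port separate from the application itself.
def serve_health(listen_port, check: method(:app_responds?))
  server = TCPServer.new(listen_port)
  loop do
    client = server.accept
    begin
      # Drain the request line and headers; only the status code matters.
      loop do
        line = client.gets
        break if line.nil? || line == "\r\n"
      end
      client.write(health_response(check.call))
    ensure
      client.close
    end
  end
end
```

Calling `serve_health(8081)` (the port is an arbitrary choice) would block and serve checks until the process is stopped. The ELB health check would then target that port instead of the application’s own.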

How the health check page itself is constructed is something your development and operations teams should evaluate carefully. Aim for a practical, sufficiently simple set of items to test, along with a strategy for testing each one.

Example application:

So you could write something incredibly simple like this (pardon the pseudo code mix):

{% highlight ruby %}
class HealthController < ApplicationController
  def index
    response_code = 200
    @broken = ""
    unless run_tests
      # Classic ELB treats any response other than 200 as a failed check.
      response_code = 503
      @broken = "One or more health tests failed."
    end
    render status: response_code, locals: { broken: @broken }
  end

  private

  def run_tests
    # Call your test methods here; return false if any of them fail.
    true
  end
end
{% endhighlight %}

There are ways to surface what failed and how it failed, but this illustrates the point about checking critical dependencies functionally wherever possible. With a proper health page in place, each time the ELB hits the health check page, you get a quick, lightweight functional test of your dependencies and a basic functional test of the application itself. If something goes wrong, you’ll know it, and your ELB will shift traffic away from the failing instance. If all nodes have issues, or an upstream dependency does, ELB with Route53 will automatically fail open, meaning traffic keeps flowing to your instances rather than being dropped entirely.
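One way to surface what failed, sketched in plain Ruby under stated assumptions: the `CHECKS` table and `run_checks` helper are hypothetical names, and the lambdas are placeholders where real dependency probes (a trivial database query, a cache round trip) would go. Each check gets its own time limit so a hung dependency cannot stall the health page.

```ruby
require "timeout"

# Hypothetical dependency checks. Each returns a truthy value on success,
# and may raise on failure. Replace the bodies with real probes.
CHECKS = {
  "database" => -> { true }, # e.g. run a trivial query
  "cache"    => -> { true }  # e.g. write and read back a key
}.freeze

# Run every check with a time limit and return a hash of failures,
# keyed by check name. An empty hash means everything passed.
def run_checks(checks, per_check_timeout: 2)
  checks.each_with_object({}) do |(name, check), failures|
    begin
      result = Timeout.timeout(per_check_timeout) { check.call }
      failures[name] = "returned #{result.inspect}" unless result
    rescue StandardError => e
      # Timeout::Error lands here too, since it is a StandardError.
      failures[name] = "#{e.class}: #{e.message}"
    end
  end
end

failures = run_checks(CHECKS)
status = failures.empty? ? 200 : 503
puts "status=#{status} failures=#{failures.inspect}"
```

A health controller could render `failures` in the response body for humans and logs, while the ELB only looks at the status code.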

For more details on health checks, a critical part of any production system running under the “Classic” version of ELB, check out our previous blog post in this series as well as the original article that inspired this discussion.

About Scott Vidmar

As a Principal Engineer at Datapipe, Scott is a savvy technologist who takes a creative approach to developing inventive, proactive solutions to operational challenges. He uses real-world experience to ensure that applications at massive scale can operate with agility and in a fault-tolerant manner. Scott writes from an engineering and customer service perspective to provide insight into common and not-so-common IT challenges.