Encapsulated Services with Consul and Confd

In a landscape where software infrastructure is becoming more and more dynamic, it is increasingly important that services can communicate dynamically with other services as well as with their environments.

Using REST to communicate between services is basically a no-brainer, but locating a service or replacing config files with a REST API isn’t as obvious a decision. Of course, wiring a domain name to an IP with the GoDaddy GUI or maintaining config files on dozens of servers doesn’t seem right either.

Emerging solutions such as Consul and Confd help solve this problem. Consul provides the flexibility of a REST API together with the portability and ubiquity of DNS (it ships its own DNS server). Confd leverages the centralized key/value store maintained by Consul to expose a static config file that updates dynamically.

The quick start guides for both are pretty solid, so I’m not going to do a setup tutorial. Instead I’ll give you a tour of a working cluster and walk you through how to do some common things. To follow along, here is all you need to get the six-node example environment running in Vagrant. It will even be using the same IPs, so you can click through on the links.

Getting Started

Look through the Vagrantfile to get an idea of what we’re setting up, but essentially we are building:

Some things to try out

It’s pretty straightforward, so just poke around. You can view all registered services and the nodes that instances of each service are running on, or go to the “Nodes” tab and see a list of nodes with the services running on each. There are also health check status and key/value tabs that we’ll talk about later.

CLI Tool

To try out the CLI tool, you need to get onto one of the cluster nodes. From there you can find out more about what’s going on.

DNS

One of the most powerful features of Consul is its custom DNS server. If you want to keep your application decoupled from Consul, then you probably don’t want to leverage the REST API. That’s what DNS is for.

Just remember that it’s running on a custom port and isn’t wired into your resolv.conf by default, so you will need to point your lookups at it explicitly. The demo app/service referenced in this example contains a simple load balancer prototype for working with Consul’s SRV records.

Note that we need the SRV records and not just the A records, because the SRV entries also carry the port our service is running on.
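To make the SRV point concrete, here is a minimal Python sketch of what a client-side load balancer could do with those records. It is not the demo app's actual prototype, just an illustration. It assumes you already have the answer lines from a DNS lookup in the usual `name ttl class type priority weight port target` layout; the sample text, the `dc1` datacenter name, and the helper names `parse_srv_answers` and `round_robin` are all assumptions made for this example.

```python
import itertools
from typing import Iterator, List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_srv_answers(answer_text: str) -> List[SrvRecord]:
    """Parse SRV answer lines of the form:
    name ttl class SRV priority weight port target
    Lines that do not match that shape are skipped.
    """
    records = []
    for line in answer_text.splitlines():
        fields = line.split()
        if len(fields) != 8 or fields[3] != "SRV":
            continue
        priority, weight, port = (int(f) for f in fields[4:7])
        # Drop the trailing dot of the fully qualified target name.
        records.append(SrvRecord(priority, weight, port, fields[7].rstrip(".")))
    return records


def round_robin(records: List[SrvRecord]) -> Iterator[str]:
    """Cycle endlessly over host:port pairs built from the SRV records."""
    return itertools.cycle(f"{r.target}:{r.port}" for r in records)


# Hypothetical answer section for "my-svc"; "dc1" is an assumed datacenter name.
SAMPLE = """\
my-svc.service.consul. 0 IN SRV 1 1 8045 svc1.node.dc1.consul.
my-svc.service.consul. 0 IN SRV 1 1 8045 svc2.node.dc1.consul.
"""

backends = round_robin(parse_srv_answers(SAMPLE))
```

Each call to `next(backends)` yields the next `host:port` pair. An A record would only give you the host half of that pair, which is why the port carried by the SRV record matters.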

Give Confd something to work with

You may have noticed that our service endpoint (http://172.20.20.13:8045/foo) is reporting that Foo = "<no value>". This is because although we wired up confd to sync a key from consul to our local config (in the Vagrantfile), we never actually added a key to consul.

Confd listens for changes to designated keys in consul and runs them through a template to produce your config file.
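One way to add a key is through Consul's HTTP API, which exposes the key/value store under `/v1/kv/<key>` and accepts a `PUT` with the value as the request body. The Python sketch below only builds and (optionally) sends such a request. The base URL `http://localhost:8500` (Consul's default HTTP port), the key `foo`, and the value `bar` are assumptions for illustration; the key confd actually watches is whatever the Vagrantfile configures.

```python
import urllib.request


def build_kv_put(base_url: str, key: str, value: str) -> urllib.request.Request:
    """Build a PUT request that writes `value` to `key` in Consul's KV store."""
    url = f"{base_url.rstrip('/')}/v1/kv/{key.lstrip('/')}"
    return urllib.request.Request(url, data=value.encode("utf-8"), method="PUT")


def put_key(base_url: str, key: str, value: str) -> str:
    """Send the request and return the raw response body.

    Needs a reachable Consul agent, so nothing calls it at import time.
    """
    with urllib.request.urlopen(build_kv_put(base_url, key, value)) as resp:
        return resp.read().decode("utf-8")


# Hypothetical key and value; adjust to the key your confd template reads.
request = build_kv_put("http://localhost:8500", "foo", "bar")
```

Calling `put_key("http://localhost:8500", "foo", "bar")` from a node running a Consul agent would perform the write, after which confd should pick up the change and regenerate the config.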

Our Demo Webapp

Included in our cluster example is a demo webapp and service named “demo.” Rather than have a separate service app and UI app, I just baked all functions into one application for simplicity. Here are the available URLs:

http://172.20.20.13:8045/foo A REST resource that exposes information about the “my-svc” service (its host name and a value from a config maintained by confd).

http://172.20.20.15/demo The web UI: this will use consul to discover “my-svc”, make a request to my-svc/foo, and then emit what it has found to the screen.

If you want to dig around in the demo app, or add to it, here’s the src

Fail a Health Check

Let’s see how consul protects us when one of our services starts misbehaving. The demo app exposes a status endpoint that will emit “OK” when the service is healthy, and something else when it is not. Rather than actually testing health, this endpoint is configured to look for the existence of the file /tmp/fail-healthcheck and, if it exists, start emitting “FAIL” instead of “OK”.
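The logic behind such an endpoint is tiny. Here is a minimal Python sketch of it, not the demo app's actual source: the function name `status_body` is made up for this example, and only the flag path and the “OK”/“FAIL” strings come from the description above.

```python
import os

# Path the status endpoint looks for, as described above.
FAIL_FLAG = "/tmp/fail-healthcheck"


def status_body(flag_path: str = FAIL_FLAG) -> str:
    """Return the text the status endpoint would emit.

    "FAIL" while the flag file exists, "OK" otherwise.
    """
    return "FAIL" if os.path.exists(flag_path) else "OK"
```

With logic like this, creating the flag file flips the service to failing and removing it restores the healthy response, with no restart needed.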

Before we induce the failure, run through this punch list to see what a healthy cluster looks like. Since we will be failing the instance of “my-svc” running on “svc1”, keep an eye on that.

Run back through the punch list to see how everything changed. We’ve got “critical” states in our status UI, “FAIL” in our endpoint, the demo app is only using “svc2” now, and we can see why by inspecting DNS.

You can go ahead and rm /tmp/fail-healthcheck and everything will be right in the world again.

Why’d we exit 2?

If you paid attention to the consul config for “svc1,” you might have noticed that our health check is exiting with a 2 instead of the more traditional 1. This is another pretty cool feature of consul.

curl localhost:8045/status | grep OK || exit 2

Consul uses Nagios-style checks for its health checks. This means that exiting with a 0 means everything is passing. Exiting with a 1 is just a warning: things will turn red, but the service will not be taken out of the load-balancing rotation. Exiting with anything greater than 1 results in a “critical” state, and the service will not be returned in a DNS lookup, etc.
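The exit code convention can be summarized in a few lines of Python. This is only a sketch of the mapping described above, not Consul's implementation, and the function names are made up for this example. The `run_check` helper assumes a POSIX `sh` is available to run the check command.

```python
import subprocess


def check_state(exit_code: int) -> str:
    """Map a Nagios-style exit code to a health state."""
    if exit_code == 0:
        return "passing"
    if exit_code == 1:
        return "warning"
    return "critical"


def returned_by_dns(exit_code: int) -> bool:
    """A service stays in DNS results unless its check is critical."""
    return check_state(exit_code) != "critical"


def run_check(command: str) -> str:
    """Run a shell check command and translate its exit code to a state."""
    result = subprocess.run(["sh", "-c", command], capture_output=True)
    return check_state(result.returncode)
```

Under this mapping, the `|| exit 2` in the check above is what pushes a failing `grep` (which on its own exits with 1, a mere warning) into the critical state that removes the service from DNS.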

Why is this cool? Lots of monitoring solutions are using this style of check these days, so you can use the same checks for all of them.

Wrapping up

Thanks for walking through my demo! If you just read through it but didn’t run your own cluster, you really should!

One last thing to do with your cluster

Technically speaking, although we needed the -bootstrap flag for the consul1 node when we started the cluster, we don’t need it anymore (and if we ever needed to restart it, we would need to get rid of it). So to be a little more production-like, you can do the following to kill that consul agent and then restart it, re-joining via consul2: