CoreOS Blog


Zero Downtime Frontend Deploys with Vulcand on CoreOS

Update: Vulcand has been updated since this post was written. Jump to https://github.com/vulcand/vulcand for more info. You can follow the concepts of this post, but the commands are no longer up to date.

Running a distributed system across many machines requires a sophisticated load balancing mechanism that can reconfigure itself quickly and reliably.

The team over at Mailgun has built vulcand, an etcd-backed load balancer, to serve traffic to different parts of their systems. vulcand has many awesome features, such as an HTTP API, a command-line utility and support for complex routing rules. We're only going to look at a few simple examples in this post, but you can check out the readme for the complete details.

Today we're going to deploy vulcand on a CoreOS cluster and use it to facilitate two strategies for zero-downtime deployments. These examples were intentionally kept simple for easy comprehension, and it's up to you to decide what strategies you want to use for high availability, port management, etc.

The Set Up

This post is going to cover two common front-end deployment strategies: a rolling upgrade and a rapid switch to a new software version. We're going to assume you're familiar with CoreOS, docker and fleet, and also have a three-node CoreOS cluster to run the examples on. This post was tested on CoreOS 317.0.0 with etcd 0.2 and fleet 0.3.2.

Our systemd units are going to use containers that are located on the public docker index as coreos/example and have been tagged as 1.0.0 and 2.0.0. Mailgun has set up a trusted build for vulcand as mailgun/vulcand.

Each unit contains a web server that serves a very simple webpage. To easily tell the difference between versions, 1.0.0 has a red background and 2.0.0 has a blue background.

Let's get started.

Clone the Example Units Repository

The unit-examples repository contains all of the example units used in our blog posts. Clone it to save yourself from having to copy/paste everything:
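Assuming the repository lives under the coreos organization on GitHub (verify the URL before use), cloning it looks like:

```shell
git clone https://github.com/coreos/unit-examples.git
cd unit-examples
```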

If you're planning on submitting units to the cluster remotely, your local copy of fleetctl must match the version of fleet running on the CoreOS machines. You can find the versions with fleetctl -version and fleet -version. Browse the tagged releases on GitHub for both Linux and Mac versions of fleetctl.

Start Vulcan

First, start up an instance of vulcand and configure a DNS record to point to it. If you're using Vagrant, you may need to forward a port from your laptop to the Vagrant VM.
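On Vagrant, one way to do that is a forwarded port in your Vagrantfile (a sketch; the port numbers are assumptions, adjust to your setup):

```ruby
# Vagrantfile snippet: expose the VM's port 80 on the host's port 8080,
# so http://localhost:8080 on your laptop reaches vulcand in the VM.
config.vm.network "forwarded_port", guest: 80, host: 8080
```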

You'll need to modify the unit files in this example with the domain/subdomain you're using in order for vulcan's routing to work properly. The easiest way to do this is with sed after you've cloned the units repository:
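For example, assuming your units use example.com as the placeholder domain, something like this rewrites them in place (my-app.example.net is a stand-in for your real domain):

```shell
NEW_DOMAIN="my-app.example.net"   # stand-in; use your real domain
for unit in ./*.service; do
  [ -e "$unit" ] || continue      # nothing to do if no unit files are present
  # Replace every occurrence of the placeholder domain in the unit file.
  sed -i "s/example\.com/${NEW_DOMAIN}/g" "$unit"
done
```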

vulcan.service

We're going to use the "trusted build" of vulcand on the public index. The trusted build is a container that is built directly from the vulcand repository every time code is committed to master. This ensures that the code you're running comes directly from Mailgun and is always up to date.

You can also clone the vulcand repository on your CoreOS machine and build the Dockerfile manually if you'd prefer. Here's the unit we're going to use:
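Here's a sketch of what vulcan.service looked like in this era. The image name comes from the post; the binary path, port and flag values are assumptions based on the surrounding description and the docker0/etcd defaults of the time, and will not match current vulcand:

```ini
[Unit]
Description=vulcand load balancer
After=docker.service
Requires=docker.service

[Service]
# Disable systemd's start timeout while the (large) image downloads.
TimeoutStartSec=0
ExecStartPre=/usr/bin/docker pull mailgun/vulcand
# 172.17.42.1:4001 was the default docker0 bridge address and etcd client
# port of this era; on Vagrant this IP can differ.
ExecStart=/usr/bin/docker run --name vulcand -p 80:80 mailgun/vulcand /opt/vulcan/vulcand -etcd=http://172.17.42.1:4001 -port=80
ExecStop=/usr/bin/docker rm -f vulcand
```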

The -etcd flag tells vulcand that our etcd cluster can be reached over the docker0 bridge since we're running vulcand in a container. On Vagrant machines, this IP can be different and you may need to edit the unit. Let's start it:
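Submitting and starting the unit with fleet (a cluster command, shown for illustration):

```shell
fleetctl start vulcan.service
```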

To test it out, load up the location in a browser. You should see the error "Bad Gateway", since we haven't set up anything to receive traffic. If you don't see anything, check to see if the units have started successfully:
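For example:

```shell
fleetctl list-units
```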

If your terminal supports split panes, it's useful to run watch -n 10 fleetctl list-units in a new pane so you can keep an eye on what's happening. Feel free to read on while the container downloads.

How Vulcand Works

vulcand has two main concepts, locations and upstreams, that are connected together to serve traffic. A location is a combination of a hostname and a path that can be matched with a regular expression. The location is matched up with an upstream, which is a group of endpoints that are qualified to serve a subset of traffic. All of the examples for manipulating vulcand in this post will be shown using etcdctl, but an HTTP API and command line tool are also provided.

Vulcand request flow

Set Up the Location

Before we can start routing traffic, we need to set up the location. In the provided units, this is done in the registration sidekick in order to create the location if no units are already running. Here's a breakdown of the commands contained in our registration unit. You don't have to run any of these; they're just for illustration:

This will create the hostname example.com with a location named home, whose path is matched by the regular expression /.*, which should match all paths:

etcdctl set "/vulcand/hosts/example.com/locations/home/path" '/.*'

Next, we need to register our container as a member of the upstream example. We're using the unit name as the unique identifier:
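The binding and registration commands looked roughly like this. The key layout follows the vulcand etcd schema of this era; the endpoint port and TTL are assumptions:

```shell
# Point the "home" location at the "example" upstream.
etcdctl set /vulcand/hosts/example.com/locations/home/upstream example
# Register this unit's container as an endpoint of that upstream, keyed by
# the unit name; the TTL lets dead endpoints expire if not refreshed.
etcdctl set /vulcand/upstreams/example/endpoints/example-v1.0.0-A "http://${COREOS_PRIVATE_IPV4}:8080" --ttl 60
```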

Scenario 1: Rolling Frontend Update

Our first deployment scenario is very common: a rolling upgrade. This strategy works well if the changes you're making are hidden behind feature flags or won't harm a user's experience if they're routed to mixed versions during a session.

Single vulcand upstream

Start Version 1.0.0

Before we start our containers, let's take a look and see what they're doing. We're running our simple web server in example-v1.0.0-*.service and a sidekick registration service in mixed-register-v1.0.0-*.service:

example-v1.0.0-A.service
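Based on the description that follows, the unit looked something like this. The image tag comes from the post; the container port and the [X-Fleet] stanza are assumptions based on fleet 0.3-era conventions:

```ini
[Unit]
Description=Example web server 1.0.0 (unit A)
After=docker.service
Requires=docker.service

[Service]
# Disable the start timeout; the image is large and takes minutes to pull.
TimeoutStartSec=0
# Pull first so the unit isn't "active" until the download finishes.
ExecStartPre=/usr/bin/docker pull coreos/example:1.0.0
ExecStart=/usr/bin/docker run --name example-v1.0.0-A -p 8080:80 coreos/example:1.0.0
ExecStop=/usr/bin/docker rm -f example-v1.0.0-A

[X-Fleet]
# Spread the 1.0.0 units across different machines in the cluster.
X-Conflicts=example-v1.0.0-*.service
```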

In the ExecStartPre we're doing a docker pull in order to prevent the unit from reporting active until the container download is complete. If we didn't do this, the registration unit would start before the download finished. TimeoutStartSec=0 disables systemd's built-in timeout when starting a unit, since our docker container is quite large and takes a few minutes to download.

Our unit conflicts with the other 1.0.0 example units in order for them to be spread across machines in the cluster.

mixed-register-v1.0.0-A.service
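A sketch of the sidekick, reconstructed from the description that follows. The etcd keys match the earlier breakdown; the endpoint port and the [X-Fleet] stanza are assumptions:

```ini
[Unit]
Description=Register example-v1.0.0-A with vulcand
BindsTo=example-v1.0.0-A.service

[Service]
# Provides $COREOS_PUBLIC_IPV4 and $COREOS_PRIVATE_IPV4.
EnvironmentFile=/etc/environment
# Stay "active" after ExecStart finishes, so ExecStop doesn't run right away.
RemainAfterExit=yes
ExecStart=/bin/sh -c "etcdctl set /vulcand/hosts/example.com/locations/home/path '/.*' ; etcdctl set /vulcand/hosts/example.com/locations/home/upstream example ; etcdctl set /vulcand/upstreams/example/endpoints/example-v1.0.0-A http://${COREOS_PRIVATE_IPV4}:8080"
ExecStop=/usr/bin/etcdctl rm /vulcand/upstreams/example/endpoints/example-v1.0.0-A

[X-Fleet]
# Schedule the sidekick on the same machine as the unit it registers.
X-ConditionMachineOf=example-v1.0.0-A.service
```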

The registration unit executes the etcdctl commands covered earlier. Including EnvironmentFile=/etc/environment allows us to reference $COREOS_PUBLIC_IPV4 and $COREOS_PRIVATE_IPV4 in our unit. This file is populated on platforms in which CoreOS can determine the network environment (cloud providers, vagrant, etc). RemainAfterExit=yes (docs) will allow this unit to remain active even after its ExecStart commands have run. If it didn't, our ExecStop commands would run immediately afterward.

The registration unit will create the location and the upstream, and vulcand will automatically start sending traffic to it.

When you load the vulcand container in a browser, you should see a red background and the text 1.0.0. As before, you might need to wait for the container to download.

Version 1 is running

Start Version 2.0.0

Suppose that we've just implemented a great new feature (a blue background!) and we're ready to deploy it. First, build and push the updated image to the docker index. Since we're using the public index this has already been done for you, but here are the commands used:
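The build and push look like this, and the rolling upgrade then swaps units one pair at a time. The image name comes from the post; the exact unit names follow the repository's naming pattern and should be checked against fleetctl list-units:

```shell
# Build and push the new image (already done for the public coreos/example repo).
docker build -t coreos/example:2.0.0 .
docker push coreos/example:2.0.0

# Roll through the cluster one pair of units at a time: start a 2.0.0 unit
# and its sidekick, then destroy the matching 1.0.0 pair.
fleetctl start example-v2.0.0-A.service mixed-register-v2.0.0-A.service
fleetctl destroy example-v1.0.0-A.service mixed-register-v1.0.0-A.service
# ...repeat for B and C until only 2.0.0 units remain
```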

Load up the browser a few more times and you should only see the blue background. Our deployment from 1.0.0 → 2.0.0 is now complete. To clean up before our next example, destroy the remaining units and remove all of the state from etcd:
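The cleanup looked something like this (the recursive delete flag matches etcdctl of that era and may differ today):

```shell
# Destroy the 2.0.0 units and their sidekicks...
fleetctl destroy example-v2.0.0-A.service mixed-register-v2.0.0-A.service
# ...repeat for B and C, then drop all vulcand state from etcd:
etcdctl rm --recursive /vulcand
```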

Scenario 2: Rapid Switch to New Version

Another common deployment scenario is the need to rapidly switch from one version of your frontend to another. For this scenario we're going to use the same primary units, but different registration sidekicks for 2.0.0. Instead of registering to the same upstream, we're going to create one with a different name, example2. Since we need to switch upstreams only after our new containers have been deployed, that line has been omitted from our sidekicks.
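With the example2 upstream populated, the cut-over is a single write to the location's upstream key (path as in the earlier registration commands); vulcand picks the change up from etcd and shifts all new traffic at once:

```shell
etcdctl set /vulcand/hosts/example.com/locations/home/upstream example2
```

Rolling back is the same command with the old upstream name.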

Next Steps

If you're looking to put a system like this into production, it should be possible to achieve high availability with multiple vulcand containers behind a cloud load balancer, round-robin DNS or another common practice. More complex port management is another issue to tackle if you're running many containers on your cluster.

More Information

To read the complete vulcand documentation, head over to the GitHub page. Be aware that the current status is "Moving fast, breaking things. Will be usable soon, though.". More information about advanced unit files and example fleet deployments can be found in the CoreOS docs.

If you have questions, concerns or improvements to this vulcand workflow, let us know on the CoreOS user mailing list.