Rolling updates with Docker Swarm

Say you have a set of services up and running in your Swarm cluster. Sooner rather than later, the time will come when you want to upgrade the version of your services. This most likely means deploying a new set of containers running the upgraded version of your software.

A common approach in the industry is to put your website into maintenance mode. This implies several things:

You configure your load balancer so that it stops sending requests to your backend services

You then serve a static Site under maintenance page from the load balancer itself

Later on you deploy the new version of your software to your servers and ensure that it’s properly working by means of smoke tests or something similar

Then you commission the servers back into the load balancer, remove the “Site under maintenance” page and have your service back up

If you do this manually, you must follow these steps to the letter; otherwise your scheduled downtime will be much longer.

If you automate it by means of some orchestration tool, or even with bash scripts, you reduce the downtime dramatically, yet you still have downtime.

In the past, this strategy made sense because creating new servers with the new version of your software was rather a slow process. Not today.

A better strategy is to spin up new servers, provision them with the new version of your software, and ensure that the application is running by means of smoke tests. Once the tests pass, you commission the new servers into your load balancers, decommission the old servers (and probably also destroy them), thereby avoiding downtime altogether.

Today, we can achieve zero-downtime deployments in just a matter of seconds thanks to the fact that containers are cheap to both produce and move around. Let’s see how we can do this using Docker Swarm.

Getting Started

For the sake of simplicity we’re going to use the containersol/hello-world container image, which serves a web application that renders the hostname of the machine where it’s running. Since it runs in a container, it shows the host name of the container itself.
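If you want to follow along, a minimal setup could look like this. This is a sketch: we assume a single-replica service named backend, matching the examples later in this post.

```shell
# create the service with one replica
$ docker service create -d --name backend --replicas 1 containersol/hello-world

# trigger a rolling update and check the replica count while it runs
$ docker service update -d --image containersol/hello-world:latest backend
$ docker service ls
```
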

Notice how the REPLICAS field says 0/1. This is because at that point Swarm was still busy getting a hold of the desired container image in order to replace the old one with the new one.

Let’s take a deeper look to see what is going on behind the scenes. Roughly, this is what Swarm will do for you:

Pull the image specified in the docker service update command

Remove the current (now old) container from its internal load balancer

Send a SIGTERM signal to the container and give it a grace period of 10 seconds to exit gracefully. If after 10 seconds the process hasn’t exited, Docker will terminate it with a SIGKILL signal.
Both the signal and the grace period can be tweaked with the --stop-signal and the --stop-grace-period flags when creating/updating the service

Start the new container

Add the new container to its internal load balancer
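The signal and the grace period from the steps above can be tweaked like so. The values here are purely illustrative, not recommendations:

```shell
# give the old task 30 seconds to shut down after receiving SIGQUIT
$ docker service update -d \
    --stop-signal SIGQUIT \
    --stop-grace-period 30s \
    backend
```
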

As you can see, this default behaviour features downtime, since between steps 2 and 4 there can be a significant delay, especially since production-ready applications nowadays tend to take more than just a few milliseconds to come up.

If we inspect the service, we can see that the Update order field is set to stop-first:
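One way to check this from the command line is with a Go template on docker service inspect. The template path below is our reading of the service spec in recent Docker API versions:

```shell
# print only the update order configured for the service
$ docker service inspect --format '{{ .Spec.UpdateConfig.Order }}' backend
```
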

Update Order

It’s a pity that, at the time of writing this post, Swarm’s default behaviour is to kill your container and only then bring a new one up (stop-first). By now one would think that starting the new container first and ensuring its health and readiness would be the way to go…
The good news is that this behaviour can be tweaked in Swarm so that it does exactly that (start-first). To do so, use the --update-order flag either when creating your service or when updating it.

Let’s update our service, this time making sure that the update order is set to start-first:

```shell
$ docker service update -d --update-order start-first backend
backend
```

The next time that the container image is updated, Swarm will first bring the new container up and only commission it once it’s ready.

To verify this behaviour, open a new terminal window and watch for the containers being run:

```shell
$ watch -n 0.5 docker ps

# or the poor man's watch if you can't afford it:
$ while true; do
    docker ps
    sleep 0.5
    clear
  done
```

and on the other window go ahead and release a new version of the container image:

```shell
$ docker service update -d --image containersol/hello-world:latest backend
backend
```

You can see how the new container is created first, and only then is the old one thrown away.

Update Parallelism

Other important flags for rolling updates in Swarm are --update-parallelism and --rollback-parallelism. They tell Swarm how many tasks to update in parallel. Tweak them according to your own needs, but you almost certainly want this number to be lower than the total number of tasks/replicas running for a given service.

1 is the default

0 will update/roll back all tasks at once
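For instance, you could update two tasks at a time and pause between batches. The --update-delay flag and the values below are illustrative, not values from this post:

```shell
# update two tasks at a time, waiting 10 seconds between batches
$ docker service update -d \
    --update-parallelism 2 \
    --update-delay 10s \
    --image containersol/hello-world:latest \
    backend
```
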

A word about rollbacks

It is quite possible that the new container doesn’t come up during the upgrade, either because of system failures or because of a faulty image. In that case, Swarm will make a best effort to roll back to the previous version you were running. For these scenarios you also need to decide whether you want a start-first or a stop-first strategy. There are also some networking issues in Swarm in this area that are still being ironed out.

Compose/Stack file gotcha

We have done this exercise from the command line for the sake of simplicity. However, more sophisticated, production-ready setups define services in Docker Compose or stack files. We were sad to find out that the update-order feature, even though merged upstream, hasn’t yet made it into the latest Docker release. This means that, for now, you are stuck with the stop-first strategy until (hopefully) the next release of Docker.
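For reference, once the flag does land in the Compose file format, the idea is that you would declare it under deploy.update_config, roughly like this. The file-format version and the surrounding values are our assumption based on the upstream merge, not something shipped at the time of writing:

```yaml
version: "3.4"
services:
  backend:
    image: containersol/hello-world:latest
    deploy:
      replicas: 2
      update_config:
        order: start-first
        parallelism: 1
```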

Passing the image hash by means of an environment variable is a common practice.

Regarding rollbacks, make sure you have properly defined health checks for your application. As Swarm rolls out the new version of your application, it checks the health status and automatically rolls back for you in case things start to go wrong.
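Health checks can also be attached to a service straight from the CLI. The endpoint, interval and retry count below are placeholders, not values from this post:

```shell
# mark the task unhealthy if the endpoint fails three probes in a row
$ docker service update -d \
    --health-cmd 'curl --fail -s http://localhost:8000/ || exit 1' \
    --health-interval 5s \
    --health-retries 3 \
    backend
```
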

The best strategy for rolling back migrations is to roll forward: create a new version of your software that migrates back to the previous state you had.

We have a service in global mode, with a service constraint pinning it to a particular node, and a healthcheck instruction (curl --fail -s http://127.0.0.1:8000/ || exit 1) in the stack file. If we set the order to start-first, then after a service update the new container will spin up and try to use port 8000 in order to become healthy. Is there a port conflict between the new and the old containers, given that the new container is not ready (and keeps retrying) until it passes the health check?
Can both containers access the same port on the same host at the same time? We are trying to achieve zero downtime during service updates.