Redis Server With Swarm Rescheduling On-Node-Failure

This portion of the post will focus on deploying a Redis container and testing out the current state of the experimental rescheduling feature.

Note: rescheduling is very experimental and has bugs. We will walk through an example and point out the known bugs as we go.

If you want a container to be rescheduled when a Swarm host fails, you need to deploy it with a reschedule policy. One way to do this is with the flags --restart=always -e reschedule:on-node-failure; another is with a label such as -l 'com.docker.swarm.reschedule-policies=["on-node-failure"]'. The example below uses the environment variable method.

First, let’s deploy a Redis container with the rescheduling flags and a volume managed by Flocker.
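A minimal sketch of such a deployment follows. The container name redis-server and the volume name redis-data are hypothetical, and the sketch assumes that the Flocker volume plugin is installed on every Swarm node and that your Docker client is pointed at the Swarm manager. The command is wrapped in a function so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: start Redis with a reschedule policy and a Flocker-managed volume.
# "redis-server" and "redis-data" are hypothetical names; substitute your own.
deploy_redis() {
    container_name=${1:-redis-server}
    volume_name=${2:-redis-data}

    # --restart=always and -e reschedule:on-node-failure are the flags from the text.
    # --volume-driver flocker asks Flocker to manage the named volume.
    # "--appendonly yes" makes Redis write appendonly.aof under /data.
    docker run -d \
        --name "$container_name" \
        --restart=always \
        -e reschedule:on-node-failure \
        --volume-driver flocker \
        -v "$volume_name":/data \
        redis redis-server --appendonly yes
}
```

Calling deploy_redis with no arguments uses the default names; deploy_redis my-redis my-volume overrides both.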

Next, SSH into the Docker host where the Redis container is running and take a look at the contents of the appendonly.aof file we instructed Redis to use for persistence. The file should be located on the Flocker volume mount-point for the container and contain no data.

$ cat /flocker/9a0d5942-882c-4545-8314-4693a93fde19/appendonly.aof
# there should be NO data here yet :)
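If it is not obvious which host to SSH into or which /flocker path to look at, docker inspect can help. This is a sketch that assumes the container is named redis-server (a hypothetical name) and that the client is talking to the Swarm manager, which adds a Node section to the inspect output; a standalone Docker Engine does not.

```shell
#!/bin/sh
# Sketch: find the host and the mount point of the Redis container.
# "redis-server" is a hypothetical container name.
locate_redis() {
    container_name=${1:-redis-server}
    # Node name and IP of the host running the container (Swarm-only fields).
    docker inspect --format '{{ .Node.Name }} {{ .Node.IP }}' "$container_name"
    # Host path and container path of each mount, one per line.
    docker inspect --format '{{ range .Mounts }}{{ println .Source .Destination }}{{ end }}' "$container_name"
}
```

The host path printed for the /data mount is the directory that should contain appendonly.aof.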

Next, let’s connect to the Redis server and add some keys and values. Afterwards, look at the contents of the appendonly.aof file again to confirm that Redis is persisting the data correctly.
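One way to add a few keys without publishing a port is to run redis-cli, which ships in the official redis image, inside the container through docker exec. This is a sketch only: the container name redis-server and the key names mykey1 and mykey2 are hypothetical.

```shell
#!/bin/sh
# Sketch: write sample data into Redis via docker exec.
# "redis-server", "mykey1" and "mykey2" are hypothetical names.
seed_redis() {
    container_name=${1:-redis-server}
    # Write two sample keys; the key names and values are arbitrary.
    docker exec "$container_name" redis-cli set mykey1 "hello"
    docker exec "$container_name" redis-cli set mykey2 "world"
    # Read one back to confirm the server accepted the writes.
    docker exec "$container_name" redis-cli get mykey1
}
```

After running seed_redis, the same cat command on the host should show the SET commands recorded in appendonly.aof.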

Testing Failover

Now we want to test the failover scenario and make sure that Flocker moves the volume holding the Redis data to the new Docker host where Swarm reschedules the container.

To do this, let’s point Docker at our Swarm manager and monitor events with the docker events command.
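A small sketch of this step is shown here. The manager address is something you must supply yourself; the function name is hypothetical, and -H is simply the Docker client's standard flag for choosing which daemon (here, the Swarm manager) to talk to.

```shell
#!/bin/sh
# Sketch: stream events from the Swarm manager.
watch_swarm_events() {
    # The manager address has no sensible default, so require it.
    manager_addr=${1:?"usage: watch_swarm_events tcp://<manager-ip>:<port>"}
    # -H targets the Swarm manager for this command only;
    # docker events keeps streaming until interrupted with Ctrl-C.
    docker -H "$manager_addr" events
}
```

Exporting DOCKER_HOST with the same address and running plain docker events should be equivalent; the -H form just avoids changing the environment of your shell.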

To initiate the test, run shutdown -h now on your Docker host that is running the Redis container to simulate a node failure. You should see events (below) that correlate to the node and container dying.

What the events tell us is that the container and its resources (network, volume) need to be removed, disconnected or unmounted because the host is failing. The events you see below are:

Then, some time after the Docker host dies, you should eventually see an event for the same container being rescheduled (created again). This is where there is still some work to be done: as of 1.1.3, our testing showed that Swarm has an issue running Start on the container after it has been Created on the new Docker host.

You should see the Create event logged while watching docker events, and this does initiate the re-creation of the container and the movement of the Flocker volume it was using.

We found that you may need to manually Start the container on the new host after it was rescheduled.
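A sketch of that manual step, assuming the container is named redis-server (a hypothetical name) and that the client is pointed at the Swarm manager:

```shell
#!/bin/sh
# Sketch: start a container that Swarm re-created but did not start.
# "redis-server" is a hypothetical container name.
start_rescheduled() {
    container_name=${1:-redis-server}
    # List the container even if it is only Created, to see where it landed.
    docker ps -a --filter "name=$container_name"
    # Start it by hand, since Swarm may not have done so.
    docker start "$container_name"
}
```

If the container was already started by Swarm, docker start should leave it running, so the function is safe to call either way.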

This is the event we see when the container is rescheduled and created on a new Docker host automatically. Notice that the IP address differs from the one in the last message; this is because the container was rescheduled onto a new Docker host.

The data is still there! Even so, given the current state of rescheduling, we don’t recommend relying on it yet.
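One way to confirm that the data survived the move is to query Redis on the new host. This sketch assumes the container is named redis-server (hypothetical) and takes the key to look up as an argument, since the key names depend on what you wrote before the failure.

```shell
#!/bin/sh
# Sketch: check that a key written before the failure is still present.
# "redis-server" is a hypothetical container name.
check_redis_key() {
    key=${1:?"usage: check_redis_key <key> [container]"}
    container_name=${2:-redis-server}
    # Number of keys in the current database.
    docker exec "$container_name" redis-cli dbsize
    # Value of the key written before the failure.
    docker exec "$container_name" redis-cli get "$key"
}
```

For example, check_redis_key mykey1 looks up a key named mykey1 in the default container.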

During our tests, we came across users who said the container did start. We also came across users who said rescheduling didn’t work at all, or who wound up with two identical containers when the failed Docker host came back.

Either way, there are certainly kinks to work out, and it’s part of the community’s job to help test, report and fix these issues so that rescheduling can work reliably. We will update this post along the way to show you how rescheduling works in the future!