Developer Blog

Rolling deployments with Ansible and Cloud Load Balancers

Recently a fellow Racker wrote a great post about zero downtime deployments.
I strongly believe in the principles he described. In his example, he used Node.js
to build a deployment script to accomplish the goal, and I want to explore how
this could be done with Ansible and the Rackspace Cloud.

Theory review

Let's review the theory. Assume you have four app servers behind a cloud load balancer. A zero-downtime rolling deploy means taking half of the app servers out of the load balancer rotation, updating and validating the app code on them, and returning them to the rotation before repeating the steps with the second half. For the first two servers:

Mark two nodes as draining

When your app is done with current requests, mark those two as disabled

Deploy your app to those two

Verify your app as appropriate

Mark those two as enabled

Then, simply repeat this process for the remaining two.

Ansible example

Let's write an Ansible playbook to accomplish the steps we outlined above. This example assumes a Cloud Load Balancer named app-lb1 configured with four active web servers.

First we're going to use the rax_clb module to discover data about the load balancer. We do this by making a call to ensure that state=present. This will create the load balancer if it doesn't exist, but thanks to idempotence it will return data about the load balancer if it already exists.
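A minimal sketch of such a discovery play might look like the following. The region, port, and protocol values, the registered variable name clb, and the node-count check are assumptions for illustration; the load balancer settings need to match how app-lb1 was actually created, otherwise the module may try to change it.

```yaml
- name: Gather load balancer data
  hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    - name: Ensure app-lb1 is present and capture its details
      rax_clb:
        credentials: /path/to/credentials
        region: DFW        # assumption: use your own region
        name: app-lb1
        port: 80           # assumption: must match the existing load balancer
        protocol: HTTP     # assumption: must match the existing load balancer
        state: present
        wait: yes
      register: clb        # assumed variable name, reused by later tasks

    # Hypothetical pre-flight check: stop early unless the load balancer
    # reports the four nodes this example expects.
    - name: Check that four nodes are attached
      assert:
        that:
          - clb.balancer.nodes | length == 4
```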

With those pre-flight checks out of the way, we can now create a group of hosts from the nodes. This uses the add_host module. We will set the name to the node ID, then use the ansible_ssh_host variable to store the IP address used to reach each node. We'll also set ansible_ssh_user to root, and set lb_id to the ID of the load balancer, which we will use in a later play.
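As a sketch, the add_host task could look something like this, placed in the task list of the play that looked up the load balancer. It assumes the rax_clb result was registered as clb (a hypothetical name) and that each entry under clb.balancer.nodes exposes id and address fields.

```yaml
- name: Add each load balancer node to the web group
  add_host:
    name: "{{ item.id }}"                    # node ID becomes inventory_hostname
    groups: web
    ansible_ssh_host: "{{ item.address }}"   # IP address used for SSH
    ansible_ssh_user: root
    lb_id: "{{ clb.balancer.id }}"           # needed later by rax_clb_nodes
  with_items: "{{ clb.balancer.nodes }}"
```

Using the node ID as the host name is what lets the later tasks pass inventory_hostname straight to rax_clb_nodes as node_id.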

Now that we've captured the load balancer data, done our pre-flight checks, and added our nodes, we can get on with our deploy. This will be a new play using the group we just created, web:

- name: Deploy the code
  hosts: web
  gather_facts: no
  connection: ssh
  any_errors_fatal: yes
  serial: 2

A couple of things to call out in this play header. any_errors_fatal is a setting that causes the entire playbook to abort with a failure if any task fails on any host in this play. We want this instead of the default, where the run only fails outright once every host in the play has failed. serial: 2 is a setting that alters how Ansible runs hosts through the play. In this case, Ansible will take two hosts through all of the play's tasks to completion, then move on to the remaining two. This is how we accomplish the rolling part of the update: roll two, then roll the remaining two.

Speaking of tasks, let's add them in! Remember we need to set the nodes to draining, wait for each node to finish its current requests, set them to disabled, update the code, and finally re-enable them. We will use the rax_clb_nodes module to do the status manipulation.

  tasks:
    - name: drain the node
      rax_clb_nodes: load_balancer_id={{ lb_id }}
                     node_id={{ inventory_hostname }}
                     condition=draining wait=true
                     credentials=/path/to/credentials
      delegate_to: localhost

    # place holder task to wait for connections to clear
    - name: wait for app connections to close
      command: echo "true"
      register: connections
      until: connections.stdout == "true"
      retries: 20
      delay: 10

    - name: disable the node
      rax_clb_nodes: load_balancer_id={{ lb_id }}
                     node_id={{ inventory_hostname }}
                     condition=disabled wait=true
                     credentials=/path/to/credentials
      delegate_to: localhost

    # place holder task to deploy the code, could be many tasks here
    # including tasks to validate the deploy
    - name: deploy the code
      command: echo "successful deploy"

    - name: re-enable the node
      rax_clb_nodes: load_balancer_id={{ lb_id }}
                     node_id={{ inventory_hostname }}
                     condition=enabled wait=true
                     credentials=/path/to/credentials
      delegate_to: localhost

The use of delegate_to in the rax_clb_nodes tasks directs Ansible to run each of those tasks on localhost rather than on the node. This module is designed to interact with the Cloud Load Balancers API, which we want to call locally, but with the variable context of the node we are dealing with.

There are a couple of "magic tasks" here. wait for app connections to close is doing a simple echo, but here you'll want to make use of a task that is appropriate to your application to ensure you're ready to move it into the disabled state. I've made use of an until loop here to demonstrate one way that Ansible can wait for a condition to be met before continuing on. The second magic task, deploy the code, is also a placeholder, an exercise left up to the application deployer.
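As one hedged illustration of what could stand in for those placeholders, the sketch below uses the wait_for module's drained state to wait until no connections remain on the app port, and the uri module to check the app after a deploy. The port number and the /health path are assumptions made up for this example, not part of the original playbook; substitute whatever fits your application.

```yaml
# Possible replacement for the "wait for app connections to close" placeholder.
- name: wait for app connections to close
  wait_for:
    port: 80            # assumption: the port your app listens on
    state: drained      # succeeds once no active connections remain on the port
    timeout: 200

# Possible validation step to run after the deploy tasks and before
# re-enabling the node. The task fails on any other status code.
- name: verify the app responds
  uri:
    url: http://localhost:80/health   # hypothetical health-check URL
    status_code: 200
```

Because the play sets any_errors_fatal, a failed check here would abort the run before the node is put back into rotation and before the second pair of nodes is touched.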

Now that we have our playbook we can run it!

$ ansible-playbook app-deploy.yaml

Final thoughts

I'll repeat the same concerns that Ken did regarding your applications:

Does your app have an integrated versioning/cache-busting approach?

Are your app-servers stateless or do they require sticky sessions?

How do you deploy your app to your app servers?

Do you have the ability to roll-back effortlessly?

Can your app have different versions served to clients concurrently?

I hope this example of how to accomplish a basic rolling update with Ansible
can start your move toward a zero-downtime deployment capability. This capability
is a great enabler for your team: ops, devs, product managers, and everyone else
will benefit from the ability to deploy code at any time without interruption.