Developer Blog

Zero downtime deployments with pkgcloud and Cloud Load Balancers

I've been spending a lot of time working on more practical examples with pkgcloud, and one of the ones that I think will appeal broadly is the ability to deploy your code as part of a zero-downtime deployment strategy.

What are Zero-Downtime deployments?

Ask a dozen dev-ops people, and you'll probably get 10 different answers. The one attribute most of those answers share is that you can deploy with negligible impact to your traffic: no dropped connections, no errors during the deployment, and no 502/500/etc. errors from your load balancer.

My opinion is that the specific mechanism you use to achieve this is less important than your motivation for wanting it. I assert that having the ability to ship any change, at any time, with the confidence that it won't impact your users, will empower your developers to do what they do best: write awesome code.

In this example, zero-downtime deployment means having a set of application servers behind a load balancer and rotating them out and back in during your deployment, allowing you to deploy your code transparently to your users.

Buyer Beware

There are a number of concerns to be aware of before adopting any zero-downtime rolling deployment approach:

Does your app have an integrated versioning/cache-busting approach?

Are your app-servers stateless or do they require sticky sessions?

How do you deploy your app to your app servers?

Do you have the ability to roll back effortlessly?

Can your app have different versions served to clients concurrently?

Theory

Let's assume you have 4 app servers behind a load balancer, named web-01, web-02, web-03, and web-04. We want to deploy to half of them, and then deploy to the other half, using the load balancer to manage the process.

When your app is done with current requests, mark web-01 and web-02 as disabled

Deploy your app to web-01 and web-02

Verify your app as appropriate

Mark web-01 and web-02 as enabled

Then, simply repeat this process for web-03 and web-04.
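The rotation order above can be sketched without touching any cloud API. This is a minimal, in-memory illustration only: splitIntoPools and planRollingDeploy are hypothetical names, not part of pkgcloud, and the "steps" are plain objects rather than real load balancer calls.

```javascript
// Hypothetical helper: split a list of servers into two halves (pools).
function splitIntoPools(servers) {
  var middle = Math.ceil(servers.length / 2);
  return [servers.slice(0, middle), servers.slice(middle)];
}

// Hypothetical helper: list the steps of a rolling deployment, pool by pool.
// The second pool is only touched after the first one is enabled again.
function planRollingDeploy(servers) {
  var steps = [];
  splitIntoPools(servers).forEach(function (pool) {
    steps.push({ action: 'disable', servers: pool });
    steps.push({ action: 'deploy', servers: pool });
    steps.push({ action: 'verify', servers: pool });
    steps.push({ action: 'enable', servers: pool });
  });
  return steps;
}

var plan = planRollingDeploy(['web-01', 'web-02', 'web-03', 'web-04']);
plan.forEach(function (step) {
  console.log(step.action + ': ' + step.servers.join(', '));
});
```

Because each pool is re-enabled before the next one is disabled, at least half of the servers are always available to take traffic.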

Example

This example uses Rackspace's Cloud Load Balancers and pkgcloud to manipulate your nodes. We'll start at the point where you've already pre-validated the code you want to deploy using whatever unit/integration or build tests your app may have.

We're going to use three npm packages for this example, pkgcloud, async, and underscore. Async is being used for control-flow. Underscore will purely be for convenience methods.
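The functions that follow refer to a client, async, and _ without showing where they come from. A minimal sketch of that setup, under some assumptions: the environment variable names and the default region are hypothetical, so substitute your own credentials and region. The pkgcloud call itself is left in a comment so the snippet runs without the packages installed.

```javascript
// Build the options for pkgcloud's Rackspace load balancer client.
// The environment variable names and the 'ORD' default are assumptions
// made for this sketch; use whatever fits your own setup.
function buildClientOptions(env) {
  return {
    provider: 'rackspace',
    username: env.RACKSPACE_USERNAME,
    apiKey: env.RACKSPACE_API_KEY,
    region: env.RACKSPACE_REGION || 'ORD'
  };
}

// With pkgcloud, async, and underscore installed, the setup would look like:
//
//   var pkgcloud = require('pkgcloud'),
//       async = require('async'),
//       _ = require('underscore');
//
//   var client = pkgcloud.loadbalancer.createClient(buildClientOptions(process.env));

console.log(buildClientOptions({ RACKSPACE_USERNAME: 'demo', RACKSPACE_API_KEY: 'secret' }).region);
```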

Once we have our client, the first thing we'll need is a health check. This will allow us to always start from a known condition, which is a critical step in having confidence in your deployment process. A go/no-go if you will.

    function ensureStatus(loadBalancerId, callback) {
      client.getLoadBalancer(loadBalancerId, function (err, lb) {
        if (err) {
          callback(err);
          return;
        }

        // We don't want to do anything if we're not in a known state to begin with
        if (lb.status !== 'ACTIVE') {
          callback(new Error('Load Balancer status not active'));
          return;
        }

        // Check the condition of each node. We want all of our nodes enabled
        // before we begin rotating nodes in and out
        if (_.any(lb.nodes, function (node) {
          return node.condition !== 'ENABLED';
        })) {
          callback(new Error('All nodes must be condition:ENABLED to deploy'));
          return;
        }

        // If you want to do any app-specific validation, you could call out to that here

        // If we meet a minimum validation, let's call back with no errors,
        // handing the load balancer to the caller so it can be reused
        callback(null, lb);
      });
    }

Next, we'll need a function that allows us to update a node to a specific condition. We'll use this function for both rotating in, as well as rotating out.

    function updateNodeCondition(lb, node, newCondition) {
      // we return a function here so we can use this inside of async.series
      return function (next) {
        node.condition = newCondition;

        console.log('Updating Node [' + node.address + '] to condition [' + node.condition + ']');

        // first, let's update the node condition
        client.updateNode(lb, node, function (err) {
          if (err) {
            next(err);
            return;
          }

          // second, let's wait for the load balancer to tell us the change
          // has been completed, and we're back at active status
          lb.setWait({ status: 'ACTIVE' }, 2500, next);
        });
      };
    }

Given this helper function, rotating a node out is straightforward.

    function rotateNodeOut(nodeAddress, lb, callback) {
      // first, let's find the address in our list of nodes
      var node = lb.nodes.filter(function (node) {
        return node.address === nodeAddress;
      })[0];

      if (!node) {
        callback(new Error('Unable to find requested node'));
        return;
      }

      if (node.condition !== 'ENABLED') {
        callback(new Error('Node must be condition:ENABLED before rotating'));
        return;
      }

      async.series([
        updateNodeCondition(lb, node, 'DRAINING'), // this stops new incoming connections
        waitForAppConnectionsToClose,              // this would be your function to identify if connections are complete
        updateNodeCondition(lb, node, 'DISABLED')  // move to disabled condition
      ], callback);
    }

There is one magic function in this section. It's the waitForAppConnectionsToClose. Basically, you'll want to do whatever is appropriate for your application to ensure you're ready for moving to disabled state. If you don't possess (or need) anything like that here, you could simply move straight to disabled status and remove this step.
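What waitForAppConnectionsToClose does depends entirely on your application, so the following is only one possible shape: a sketch that polls a connection counter until it reaches zero or a retry limit is hit. makeWaitForConnectionsToClose, getActiveConnections, and the option names are all hypothetical; how you actually count open connections is up to you.

```javascript
// Hypothetical factory: builds a task that polls getActiveConnections until it
// reports zero, or gives up after maxAttempts polls.
function makeWaitForConnectionsToClose(getActiveConnections, options) {
  options = options || {};
  var intervalMs = options.intervalMs || 1000;
  var maxAttempts = options.maxAttempts || 30;
  // The scheduler is injectable so the sketch can be exercised without timers
  var schedule = options.schedule || setTimeout;

  // The returned function takes a single callback, so it fits into async.series
  return function waitForAppConnectionsToClose(next) {
    var attempts = 0;

    function poll() {
      attempts++;
      getActiveConnections(function (err, count) {
        if (err) { next(err); return; }
        if (count === 0) { next(); return; }
        if (attempts >= maxAttempts) {
          next(new Error('Timed out with ' + count + ' open connections'));
          return;
        }
        schedule(poll, intervalMs);
      });
    }

    poll();
  };
}

// Usage with a fake counter that drains from 2 connections down to 0
var remaining = 2;
var waitForAppConnectionsToClose = makeWaitForConnectionsToClose(function (cb) {
  cb(null, remaining--);
}, { intervalMs: 10 });

waitForAppConnectionsToClose(function (err) {
  console.log(err ? err.message : 'connections drained');
});
```

The retry limit matters: without it, a single stuck connection would stall the whole deployment indefinitely.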

Now, let's put it all together and rotate out the first half of our app servers. We're going to use 10.0.10.1 through 10.0.10.4 for our four app servers.

    function rotateOutAndDeploy(ips, lb, callback) {
      async.series([
        function (next) {
          // rotate out each ip in our array
          async.forEach(ips, function (address, cb) {
            rotateNodeOut(address, lb, cb);
          }, next);
        },
        function (cb) {
          // we're going to assume you have a callback based function to
          // deploy your code here. This could be an ssh command, or anything
          // where you actually push the code and restart your services
          deploy(ips, cb);
        }
      ], callback);
    }

As before, we're going to use another magic function to handle the actual deployment; in this case it's deploy. There are so many ways you could do this, we're not going to address that in this example. The point is, you'll need some way of actually deploying your code to your nodes, whether it's scp, rsync, git, etc.

Finally, we need to be able to rotate back in.

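    function rotateIn(ips, lb, callback) {
      async.forEach(ips, function (address, next) {
        // look up the node for this address, as we did when rotating out
        var node = _.find(lb.nodes, function (node) {
          return node.address === address;
        });

        if (!node) {
          next(new Error('Unable to find requested node'));
          return;
        }

        // we don't need a multi-step process here, so we can
        // just invoke the returned function with next directly
        updateNodeCondition(lb, node, 'ENABLED')(next);
      }, callback);
    }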
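Both rotating out and rotating in lean on updateNodeCondition returning a function instead of doing its work immediately. A small sketch of that pattern may help if it looks unfamiliar. The series function here is a hand-rolled stand-in for async.series, written only for illustration, and setCondition is a hypothetical, in-memory cousin of updateNodeCondition that makes no load balancer calls.

```javascript
// Stand-in for async.series, for illustration only: runs tasks one at a time
// and stops at the first error.
function series(tasks, done) {
  var index = 0;

  function runNext(err) {
    if (err || index === tasks.length) {
      done(err || null);
      return;
    }
    tasks[index++](runNext);
  }

  runNext();
}

// Hypothetical factory in the same shape as updateNodeCondition: it captures
// its arguments now and returns a task that runs later.
function setCondition(node, newCondition, log) {
  return function (next) {
    node.condition = newCondition;
    log.push(node.address + ' -> ' + newCondition);
    next();
  };
}

var node = { address: '10.0.10.1', condition: 'ENABLED' };
var log = [];

// Multi-step use: hand the returned tasks to the series runner
series([
  setCondition(node, 'DRAINING', log),
  setCondition(node, 'DISABLED', log)
], function (err) {
  console.log(err ? 'failed' : log.join(', '));
});

// Single-step use: invoke the returned task directly with a callback
setCondition(node, 'ENABLED', log)(function () {
  console.log('node is now ' + node.condition);
});
```

The same factory serves both cases, which is why the rotate-in step can call the returned function directly rather than wrapping it in a one-item series.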

Now, let's run a full deployment!

Voila

    // Note: this is deliberately not named deploy, since deploy is the helper
    // that pushes your code to the nodes inside rotateOutAndDeploy
    function rollingDeploy(loadBalancerId, poolA, poolB, callback) {
      var lb;

      async.series([
        function (next) {
          ensureStatus(loadBalancerId, function (err, loadBalancer) {
            if (err) {
              next(err);
              return;
            }

            lb = loadBalancer;
            next();
          });
        },
        function (next) { rotateOutAndDeploy(poolA, lb, next); },
        function (next) { rotateIn(poolA, lb, next); },
        function (next) { rotateOutAndDeploy(poolB, lb, next); },
        function (next) { rotateIn(poolB, lb, next); },
        // let's make sure we return the load balancer to a known state
        function (next) { ensureStatus(loadBalancerId, next); },
        // this would be something you do post deployment
        function (next) { verifyDeployment(next); }
      ], function (err) {
        if (err) {
          callback(err);
          return;
        }

        console.log('W00t! Successful Deployment');
        callback();
      });
    }

That's it! While it may seem like a number of steps, each of them is rather focused. It's the combination of these steps that will make your deployments a breeze.

Final Thoughts

I hope this example sets you on the path towards Zero-downtime deployments. When you have confidence that you can ship code at any time of day without a significant impact to your existing users, it can be a great enabler.