How to Avoid Auto Deployments Behaving Badly

Environment management in general comes with its own set of complications, but what I am interested in knowing is how other Systems/Operations departments are handling auto-deployments. Several of the points in the articles above are moot for us and have been for some time. Access to production systems is heavily limited to our Systems Administrators, and many of our products are deployed with scripts triggered by TeamCity builds. Code and configuration files are not deployed manually to our end-to-end test environment; when a change is needed there, it is generally the auto-deployment script that gets touched to alter the outcome, rather than a human touching a file. Don't get me wrong, there are still blunders that occur from one environment to another ("oh, I forgot that we needed to add that file, so I didn't add it to the production release... oops"). However, my main concern is what to do when an auto-deployment script behaves badly.

First, let us discuss what auto-deployment means to our teams. When code is ready to be released to a test or production environment, an auto-deployment script is executed to place the code on the correct server(s). When code is going to production, a Systems Administrator is assigned a ticket to take a few preparation steps and then run the script to deploy the code. Depending on the environment and product, one or all of the following things could happen during an auto-deployment run:

Disable/enable monitors

Disable/enable nodes in a cluster

Run database scripts

SVN checkout or switch of a directory or file

Copy files from one location to another

Start/stop services

Start/stop web containers
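Strung together, those stages might look something like the minimal sketch below. Every function, monitor, node, and path name here is a hypothetical stand-in, not one of our actual scripts:

```shell
#!/usr/bin/env bash
# Sketch of a full auto-deployment run; every name below is a hypothetical
# stand-in (monitor, node, and path names vary per product).
set -euo pipefail

disable_monitor()   { echo "monitor '$1' disabled"; }
enable_monitor()    { echo "monitor '$1' enabled"; }
drain_node()        { echo "node '$1' removed from cluster"; }
restore_node()      { echo "node '$1' returned to cluster"; }
run_db_scripts()    { echo "database scripts applied from '$1'"; }
switch_code()       { echo "svn switch to '$1'"; }
restart_container() { echo "web container on '$1' restarted"; }

disable_monitor "app-health"
drain_node "web01"
run_db_scripts "release/sql"
switch_code "tags/release-42"
restart_container "web01"
restore_node "web01"
enable_monitor "app-health"
```

The point of the sketch is the ordering: monitors come out first and go back last, so a restart mid-release does not page anyone.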

The first time I ran an auto-deployment release, I felt naked. I was so used to knowing every piece of what was actually required for a product to go live that pressing a button felt completely foreign. I worried that something would be missed and I just wouldn't know. So what should you do in a case like this? How do you mitigate the risk of the system being compromised? You're running a script that you know little about, other than that you just typed several passwords into a config file on a secure server and that it executes with your permissions. However, because you have done this manually before, you know what the behavior and outcome of the script should be, so verify it.

The first few times, we were vigilant: check to make sure the node is out, review all the database scripts prior to release, check log files during and after the release, and so on. Over time, though, we became more comfortable with the release runs, and this is when the problems started to arise.
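Some of that early vigilance can be scripted rather than dropped. As a minimal sketch, assuming the deployment writes a log and marks problems with "ERROR" (both assumptions about your setup), the log check can run every time instead of only while people are still nervous:

```shell
#!/usr/bin/env bash
# Sketch of an automated post-release log check; the log path and the
# "ERROR" marker are assumptions about what your deployment writes.
set -euo pipefail

verify_release_log() {
  local logfile=$1
  # Fail the release check when the deployment log recorded any errors.
  if grep -q "ERROR" "$logfile"; then
    echo "release check FAILED: errors in $logfile" >&2
    return 1
  fi
  echo "release check passed: no errors in $logfile"
}

# Demo against a throwaway log file.
printf 'deploy started\ndeploy finished\n' > /tmp/deploy.log
verify_release_log /tmp/deploy.log
```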

Auto-deployment scripts in recent months have been known to produce symptoms such as the following:

Didn’t run the database scripts, or ran the incorrect database scripts

All of these problems are compounded by the fact that it is the responsibility of both the development and systems teams to recognize and mitigate them. The real root of the solution is to treat the auto-deployment script as exactly what it is: code. What do you do with code? You test it, you unit test it, and you make sure that it throws exceptions and fails loudly when it doesn't do what it is supposed to do. Most of the above problems could easily be avoided if the script were intelligent enough to know when it has made a mistake.
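As a sketch of that fail-loudly idea (the release layout under /tmp is a hypothetical example): make the script refuse to continue when a step's precondition is missing, instead of silently skipping the step:

```shell
#!/usr/bin/env bash
# Sketch of a script that fails loudly instead of skipping a step;
# the release layout under /tmp is a hypothetical example.
set -euo pipefail   # abort on any unchecked failure or unset variable

run_sql() {
  local script=$1
  # A missing file is exactly the "didn't run the database scripts"
  # symptom above: stop the release rather than continue quietly.
  if [ ! -f "$script" ]; then
    echo "FATAL: expected database script '$script' not found" >&2
    return 1
  fi
  echo "ran $script"
}

mkdir -p /tmp/release/sql
touch /tmp/release/sql/001-schema.sql
run_sql /tmp/release/sql/001-schema.sql
```

`set -euo pipefail` alone catches a surprising share of these failures; the explicit precondition check turns the rest into a clear FAILURE state instead of a quiet skip.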

So, this is my list of remedies for saving an operations team from feeling naked when learning to run auto-deployment scripts.

Get involved: have the script demoed to you and understand what it does before you run it. Ensure that every time it changes, you receive a new demo.

Don't be naive: yes, maybe one day this script will run without someone watching over it, but before that happens, sysadmins and developers should work out the kinks.

Review database scripts with a developer before the script is executed, then check afterward that the intended operations actually occurred.

Any time a problem is encountered post-release that is found to be the fault of the script, have the script altered to create a FAILURE state if it ever happens again. Additionally, review the script for other possible failures and have those adjusted too.

Ask developers to check log files too; you will not always catch an error.

Consider having the script pause at critical stages (e.g., between nodes) so that things can be inspected before they go too far.

Improve your monitoring systems: add monitoring for the kinds of errors that releases have caused in the past.
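The "pause at critical stages" remedy can be as simple as a confirmation gate between nodes. A minimal sketch, assuming a hypothetical AUTO_CONFIRM variable as the escape hatch for the day the script has earned unattended runs:

```shell
#!/usr/bin/env bash
# Sketch of a pause-and-confirm gate between critical stages; the
# AUTO_CONFIRM variable and the stage names are hypothetical.
set -euo pipefail

confirm_stage() {
  local stage=$1
  if [ "${AUTO_CONFIRM:-no}" = "yes" ]; then
    echo "stage '$stage' auto-confirmed"
    return 0
  fi
  # Interactive path: give the operator a chance to inspect before going on.
  read -r -p "Finished '$stage'. Inspect, then press Enter to continue... "
  echo "stage '$stage' confirmed"
}

AUTO_CONFIRM=yes confirm_stage "deploy to node web01"
```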

One day, after you have executed a script ten times and it has not made a mistake, you will trust it. The important thing to remember is that the scripts will continue to evolve, so stay involved and help test every change. After all, as Systems Administrators or Operations staff, it is your job to ensure that the system is operational, and to take responsibility or alert the appropriate team when it is not.


Thanks for the mention. One technique that I want to expand on in a future blog or talk is treating deployment scripts exactly like code and versioning them in the same repo as the code that they deploy. When I did that on a big project a couple of years ago (and made each CI build deploy to VMs), the failure rate dropped like a stone. I also made sure that if the (bash) scripts weren't valid, the CI build failed in about a second 🙂

Thanks Julian, I don't actually know how our dev team manages the scripts, but given how most of their other repositories are designed, they likely already do this. I think one of the main contributors is the sheer number of scripts and products we have, which adds to the complexity and disparity of each script.

I would add one point. (Good) Developers spend a lot of time refactoring their code and taking great care to keep things simple. I’ve seen plenty of deployment scripts where this was neglected.

I think that having a simple, straightforward deployment script is potentially more important than having simple, straightforward code. You *really* want to be able to read and understand a deployment script when something goes wrong during a 3 AM deployment.

Kevin, I agree; some of the scripts we have seen are pretty simple to read (probably the ones you worked on while you were here). My main concern, and an area I want to focus on, is ensuring our monitoring systems alert us when something goes wrong, hopefully early.