Eliminating Failed Deployments – Part 2 – Automate Your Obsession

A few years ago I was working in The City of London. The company I worked for had a very good development process – continuous integration, unit testing and several test environments before production (the sort of thing described in Eliminating Failed Deployments – Part 1 – Replication! Automation! Complication?). Environment-specific values were automatically inserted into configuration files and deployments were made by staff who weren’t developers.

With all that, you’d expect that deployments went perfectly, but they didn’t. We still had problems that weren’t always enough to warrant rolling back the deployment, but WERE enough to cause delays, occasional frantic phone calls and debugging sessions.

One particular deployment faltered because it didn’t update some permissions to match its other changes. After you experience problems like this a few times, it’s easy to see how obsession can build up.

Luckily one of my colleagues was Julian Simpson, our Build and Release Manager. He’d encountered deployment problems more often than I had, as all of the deployments went past his team. He suggested building a set of automated tests to catch the most frequent problems we experienced, and expanding them as we found new problems.

Julian’s test system was written in Ruby, and ran when it was triggered by a change to the code in the version control system. Initially it reduced the number of failed deployments we had by about 75%. Over the following weeks we added more tests to cover the problems we encountered during other deployments; eventually we managed around a 99% success rate.

That’s 99% of deployments going straight into production. No changing configurations, no permissions problems, no database problems, nothing. Just deploying the changes, running through the sanity tests and that’s it. Admittedly we also ran the deployments through several environments (see Eliminating Failed Deployments – Part 1 – Replication! Automation! Complication?), testing in each, but combining the two to achieve a 99% success rate seems pretty good.

Since then I’ve implemented similar test frameworks everywhere I’ve worked and achieved similar results. Like many things, the language you write the tests in isn’t important – Ruby and Python scripts, and C# in NUnit, have all proved effective. The testing is what matters.

Ideally the tests should be run automatically when you create or update your deployment “packages” in your continuous integration system, usually from your build machine. You may want to run them on demand either from the command line or a UI, and that’s also useful if you haven’t automated them.

If your deployment packages contain multiple files, you need to decide on a structure for them; it’s easier if you separate out database scripts from binaries, and have separate folders for each database and installer under there. If you have separate documentation, put that in a separate folder. A directory structure like the following would do the job:

\package
    \binaries
        \server_installers
        \client_installers
    \scripts
        \vm_configurations
            \vm1_configuration
                \deployment
                \rollback
        \sql
            \db1
                \deployment
                \rollback
            \db2
                \deployment
                \rollback
    \documentation

Organized like this, the directory structure is easier to understand and simpler for different types of test to scan through. As always this is just a guide; a different layout may work better for you.

Your tests need to scan through all of the relevant files, and carry out the basic checks you would expect testers to carry out. Some I’ve used in previous jobs include:

Binaries:

Ensure all of the binary files have an appropriate version number; “1.0.0.0” is NOT an appropriate version number.

Check the configuration files (“.ini” and/or “.config”) to ensure:

There are no references to non-production servers, IP addresses or URLs

There are no user names or passwords anywhere in them; you should be using integrated security for databases, and a safer authorization system for web services
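As a rough illustration of the credentials check, here is a minimal Python sketch. The pattern and the find_credentials name are my own assumptions, not part of any framework, and the pattern only covers connection-string style "key=value" settings. You would need to extend it for the keys your own configuration files use.

```python
import re

# Hypothetical pattern: matches connection-string style settings such as
# "User Id=app" or "Password=secret". Adjust it for your own config keys.
CREDENTIAL_PATTERN = re.compile(
    r'\b(password|pwd|user\s*id|uid|username)\s*=\s*[^;"\'\s]+',
    re.IGNORECASE,
)


def find_credentials(lines):
    """Return (line_number, text) pairs for lines that appear to hold credentials."""
    return [
        (number, line.strip())
        for number, line in enumerate(lines, start=1)
        if CREDENTIAL_PATTERN.search(line)
    ]


# Two sample lines: the first uses integrated security, the second does not.
sample = [
    '<add name="Main" connectionString="Server=PROD-SQL01;Database=Sales;Integrated Security=SSPI" />',
    '<add name="Legacy" connectionString="Server=PROD-SQL01;Database=Sales;User Id=app;Password=secret" />',
]

for number, text in find_credentials(sample):
    print('line {}: {}'.format(number, text))
```

Only the second sample line is reported, because the first relies on integrated security.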

Installers:

Check the installers to ensure:

There are no references to the following resources in other environments:

databases

file locations

file servers

web service URLs

These references could be in ini or config files, in MSI, web deploy or any other installation packages

MSIs:

The installer’s version number is correct – usually it should match the version of the binary files inside it

All of the files present in the installer conform to your requirements in the Binaries section. This is quite advanced because you’ll need to write code to run through the database in the MSI file

WebDeploy Packages:

If you’re using MS WebDeploy packages, remember that all of this information is in the ZIP file, so just open the archive file, find the files within the directory structure, and carry out the same checks.
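Python’s standard zipfile module is enough to sketch this. The find_in_package function, the "-dev" marker and the file names below are assumptions for illustration; the demonstration builds a small archive in memory, whereas a real test would pass the path to the package file.

```python
import io
import zipfile


def find_in_package(package, needle='-dev', suffix='.config'):
    """Scan matching files inside a ZIP archive and list the lines containing needle."""
    failures = []
    with zipfile.ZipFile(package) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(suffix):
                continue
            # Assumes the config files are UTF-8 text; undecodable bytes are replaced.
            text = archive.read(member).decode('utf-8', errors='replace')
            for line_number, line in enumerate(text.splitlines(), start=1):
                if needle in line.lower():
                    failures.append('{}, line {}'.format(member, line_number))
    return failures


# Build a tiny in-memory archive to demonstrate the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('Content/site/Web.config',
                     '<a/>\n<add value="http://api-dev.example.com" />\n')
    archive.writestr('Content/site/readme.txt', 'built on build-dev01\n')
buffer.seek(0)

print(find_in_package(buffer))
```

The readme is skipped because it doesn’t match the suffix, so only the Web.config entry is reported.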

Databases:

Check the SQL scripts to ensure:

The SQL syntax is correct (for SQL Server, consider the Microsoft.SqlServer.Management.SqlParser.Parser namespace. You can use it to load the scripts, parse them, and check them for errors)

There are no references to non-production databases or servers

Every new stored procedure, function or table has appropriate permissions applied

Any new user logins are assigned to ROLEs

No user logins appear in permissions GRANT statements – only ROLEs should be used

Check that rollback scripts exist, and that:

Every CREATE for a new DB object in the deployment scripts has a matching DELETE or DROP

Every ALTER for an existing DB object in the deployment scripts has a matching ALTER to revert it.
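The CREATE and DROP matching can be approximated with regular expressions, as in the sketch below. This is a simplification under stated assumptions: it is not a SQL parser, it only knows about a handful of object types, it ignores forms like CREATE OR ALTER, and it will be fooled by statements inside comments or strings. The function names are my own.

```python
import re

# An optionally schema-qualified, optionally bracketed name such as [dbo].[GetOrders].
OBJECT_NAME = r'((?:\[?\w+\]?\.)?\[?\w+\]?)'
OBJECT_TYPES = r'(PROCEDURE|PROC|FUNCTION|TABLE|VIEW)'

CREATE_PATTERN = re.compile(
    r'\bCREATE\s+' + OBJECT_TYPES + r'\s+' + OBJECT_NAME, re.IGNORECASE)
DROP_PATTERN = re.compile(
    r'\bDROP\s+' + OBJECT_TYPES + r'\s+(?:IF\s+EXISTS\s+)?' + OBJECT_NAME, re.IGNORECASE)


def normalise(object_type, name):
    """Treat PROC and PROCEDURE as the same type, and ignore brackets and case."""
    kind = 'PROCEDURE' if object_type.upper() == 'PROC' else object_type.upper()
    return kind, name.replace('[', '').replace(']', '').lower()


def missing_drops(deployment_sql, rollback_sql):
    """Return the (type, name) pairs created on deployment but never dropped on rollback."""
    created = {normalise(*m.groups()) for m in CREATE_PATTERN.finditer(deployment_sql)}
    dropped = {normalise(*m.groups()) for m in DROP_PATTERN.finditer(rollback_sql)}
    return sorted(created - dropped)


deployment = """
CREATE TABLE dbo.Audit (Id INT NOT NULL);
GO
CREATE PROCEDURE [dbo].[GetOrders] AS SELECT 1;
"""
rollback = "DROP PROC dbo.GetOrders;"

print(missing_drops(deployment, rollback))
```

Here the rollback script drops the procedure but forgets the table, so the table is the only item reported.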

Documentation:

Yes, we all run agile / scrum / lean / whatever development processes. No, we never need documentation. In reality though, there’s always someone who feels that no deployment is complete without a document of some sort. Add a check to make sure it’s present so that your deployment doesn’t fail because of an argument over process.
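A presence check like this is only a few lines. The sketch below assumes the documentation lives in a folder called "documentation" under the package root, as in the directory layout suggested earlier; the function name is hypothetical.

```python
import tempfile
from pathlib import Path


def documentation_present(package_root, folder='documentation'):
    """Return True if the documentation folder exists and contains at least one file."""
    docs = Path(package_root) / folder
    return docs.is_dir() and any(p.is_file() for p in docs.rglob('*'))


# Demonstrate against a temporary directory standing in for a package.
with tempfile.TemporaryDirectory() as root:
    print(documentation_present(root))   # before the documentation folder exists
    docs = Path(root) / 'documentation'
    docs.mkdir()
    (docs / 'release_notes.txt').write_text('Version 2.3.1')
    print(documentation_present(root))   # after a document has been added
```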

An Example:

The simplest implementation of a test follows this procedure:

    from os import listdir
    from os.path import isfile, join

    # This fragment runs inside a test method; self.directory_path is the
    # directory being checked.

    # Retrieve all of the config files in the specified directory
    file_names = [join(self.directory_path, f)
                  for f in listdir(self.directory_path)
                  if isfile(join(self.directory_path, f)) and f.endswith('.config')]

    failure_details = []

    # Get the contents of each of the files, and check whether they contain references
    # to development ("-DEV") environments. Line numbers are reported starting from 1.
    for file_name in file_names:
        with open(file_name, mode='r') as file_object:
            for line_number, line_contents in enumerate(file_object, start=1):
                if '-dev' in line_contents.lower():
                    failure_details.append('File: {}, line {}'.format(file_name, line_number))

This example uses the Python unittest module, which forms a complete Python unit testing framework. This example is provided as an illustration, which is just a roundabout way of saying I’m not a Python expert, so there’s almost certainly some code that could be improved in it.

The output in this case includes a list of the files containing the incorrect suffix, along with the lines the text is on.

Something Else To Worry About (Before You Automate It Away):

When I first began implementing tests like this the performance was awful because there’s so much disk activity to read in the files to be tested. As mentioned above, in the simplest implementation each test is responsible for finding, reading and testing the files it needs.

Once you have more than a few tests it’s much faster to structure them to load the files once and cache them in memory for testing. Alternatively you might decide to load each file and run all of the appropriate tests against it, before moving to the next file; it’s up to you, it’s your system.
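One way to load the files once is the setUpClass hook in Python’s unittest module, which runs a single time per test class. The sketch below is an illustration under assumptions rather than the framework described in this article: the class and helper names are mine, and PACKAGE_DIR is a hypothetical environment variable standing in for however your build system passes the package location.

```python
import os
import unittest


def load_config_files(directory_path):
    """Read every .config file under directory_path once; return {path: [lines]}."""
    cache = {}
    for folder, _, file_names in os.walk(directory_path):
        for file_name in file_names:
            if file_name.endswith('.config'):
                path = os.path.join(folder, file_name)
                with open(path, mode='r') as file_object:
                    cache[path] = file_object.read().splitlines()
    return cache


class ConfigFileTests(unittest.TestCase):
    # Hypothetical setting; defaults to the current directory.
    directory_path = os.environ.get('PACKAGE_DIR', '.')

    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class, so the files are only read from disk once.
        cls.files = load_config_files(cls.directory_path)

    def find_lines(self, predicate):
        """List 'File: path, line N' for every cached line the predicate matches."""
        return ['File: {}, line {}'.format(path, number)
                for path, lines in self.files.items()
                for number, line in enumerate(lines, start=1)
                if predicate(line)]

    def test_no_development_references(self):
        self.assertEqual([], self.find_lines(lambda line: '-dev' in line.lower()))

    def test_no_passwords(self):
        self.assertEqual([], self.find_lines(lambda line: 'password=' in line.lower()))
```

Saved as a test module, this can be run with "python -m unittest". Each extra test then costs only a pass over the lines already in memory.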

Summary:

This list should form a reasonable basis for your pre-deployment tests. Obviously it’s not exhaustive but once you’ve implemented the basic system it should be easy to add extra tests that apply to your own systems. If you experience other deployment problems, add tests to cover them so that they don’t happen again.

Once you’ve got enough tests, your deployments will go much more smoothly and successfully. They won't be the things that wake you up during the night any more. That noise you just heard might, but not your deployments.

I'm Jason Ross, a software architect and full-stack developer based in Calgary, Alberta, Canada. This is my site for my professional activities and articles.
