…and technology is not always the answer.

(n) Steps to Configuration Management Paradise…

Over the past five years I’ve come to experience the delights of Puppet, CFEngine and Chef across a wide range of deployments, from a couple of web servers hosting two or three hundred sites, to thousands of servers underpinning an OpenStack-based cloud solution.

I’d like to share a couple of thoughts on what I’ve learned, how to avoid making the same mistakes that I’ve made, and how to ensure that the next time you reach for your modules or cookbooks, you do it in a structured and sensible manner.

1) Starting from scratch is *always* an option

There are many of us who have inherited various test and build systems with massive variation between the documentation and reality. We have also probably created a few along the way without even realising it!

These systems saddle us with a huge amount of Technical Debt. They often ensure that, whilst there is no single root cause for an outage, there is a myriad of small issues: patches that were then patched to solve something else, a third layer of patches applied on top, and an error that only became apparent at the tenth level of patching, by which point it was too late for a quick fix.

If you find yourself in this situation then once the panic has died down and the incident is over, I thoroughly recommend taking a step back and making sure that your design for your configuration management (and you did design it, didn’t you?) meets the requirements that you are now dealing with.

Many people will throw their hands up in horror at the idea of throwing away the broken things and rewriting them from scratch. However, in my experience – in the same way that it should be quicker to rebuild a broken system than to repair it – it is often quicker to rewrite something than to refactor it.

Rewriting instead of refactoring also provides an opportunity to reassess the original design, which almost always results in a better design, better fault detection and better code.

2) Watch your dependencies…

If you’re anything like me, you have a “baseline” set of modules or cookbooks. These are inevitably wrapped into a Chef role or a Puppet class, and this is the first thing that gets applied to your servers once they are brought online.
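For illustration, such a “baseline” Chef role might look something like the following – the cookbook names here are examples, not my actual run list:

```json
{
  "name": "baseline",
  "description": "First role applied to every newly built server",
  "json_class": "Chef::Role",
  "chef_type": "role",
  "run_list": [
    "recipe[ntp]",
    "recipe[ssh]",
    "recipe[resolvconf]",
    "recipe[nrpe]"
  ]
}
```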

I recently had an issue where I wanted to rebuild my Chef Server. This in itself wasn’t an issue, however I then needed to upload the various cookbooks and data bags that made up my “baseline” Role, and that’s where the fun started.

For various reasons, I try to keep my cookbooks in “function-specific” git repositories which are then included as submodules in a “master” repo. For example, I have a git repo that holds all the cookbooks I need for monitoring/metrics such as Icinga and Graphite, another for all the “basics” such as NTP, SSH and ResolvConf, and another for “webservers” (Nginx, Apache, PHP, Rails etc.). This is fine, however I found that when I went to upload the cookbooks to the chef-server, I hit the following issue:
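As a sketch of that layout (repository names here are illustrative, and local bare repos stand in for the real remotes so the commands actually run):

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

workdir=$(mktemp -d)
cd "$workdir"

# Local bare repos stand in for the remote function-specific repos.
for repo in basics monitoring webservers security; do
  git init --quiet --bare "remotes/$repo.git"
  # Seed each bare repo with an initial commit so it can be cloned.
  git clone --quiet "remotes/$repo.git" "seed-$repo" 2>/dev/null
  ( cd "seed-$repo" &&
    git commit --quiet --allow-empty -m "initial" &&
    git push --quiet origin HEAD )
done

# The "master" repo pulls each function-specific repo in as a submodule.
git init --quiet master
cd master
git commit --quiet --allow-empty -m "initial"
for repo in basics monitoring webservers security; do
  git -c protocol.file.allow=always submodule add --quiet \
    "$workdir/remotes/$repo.git" "cookbooks/$repo"
done
git commit --quiet -m "track cookbook repos as submodules"
cat .gitmodules
```

Each submodule pins the master repo to a specific commit of the function-specific repo, which is what makes the layout tidy – and, as we’re about to see, what makes the upload order matter.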

1. Upload all the “basic” cookbooks.

2. Run the “baseline” role on a client.

3. It fails, because “baseline” requires NRPE, which is in the monitoring git repo.

4. Upload the NRPE cookbook from the monitoring git repo.

5. It fails, because NRPE requires the SSL cookbook, which is in the “security” repo.

6. Upload the SSL cookbook from security on its own.

7. Upload NRPE.

8. Run chef on the client and watch it pass.
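The sequence above can be replayed as a toy simulation – these are not real knife commands, just a shell function mimicking the way the Chef server refuses a cookbook whose declared dependencies haven’t been uploaded yet:

```shell
uploaded=""

# upload <cookbook> [deps...] - refuse the upload (as the Chef server
# would) if a declared dependency is not on the server yet.
upload() {
  cb=$1; shift
  for dep in "$@"; do
    case " $uploaded " in
      *" $dep "*) ;;   # dependency already uploaded - fine
      *) echo "upload $cb: FAILED (missing dependency: $dep)"; return 1 ;;
    esac
  done
  uploaded="$uploaded $cb"
  echo "upload $cb: ok"
}

# Replay the sequence above:
upload ntp                  # the "basic" cookbooks go up fine
upload nrpe ssl || true     # fails: ssl lives in the "security" repo
upload ssl                  # upload the SSL cookbook on its own
upload nrpe ssl             # now NRPE uploads cleanly
```

The annoyance is not any single failure but the back-and-forth: every missing dependency means another trip to a different git repo before the “baseline” role will converge.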

This occurred because my original design differed from how I ended up deploying servers. I took a step back, looked at the options that I had, and redesigned my cookbook layouts and git repositories so that any “clients” (NRPE, NSCA, Munin-Node etc.) live in their own cookbooks as part of the “basic” repo, along with their dependencies (SSL support on a system is pretty basic – why had I put this somewhere else to begin with?!). Now I can upload and build a server to the “baseline” standard without worrying about dependencies.
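One way to keep things that way is a quick sanity check before uploading: walk each cookbook’s metadata.rb and confirm every `depends` line points at a cookbook in the same repo. This is a hypothetical helper, not part of knife, using stand-in cookbook names:

```shell
set -e
repo=$(mktemp -d)

# Minimal stand-ins for the restructured "basics" repo: client
# cookbooks (nrpe) live alongside their dependencies (ssl).
for cb in ntp ssh resolvconf ssl nrpe; do
  mkdir -p "$repo/$cb"
  : > "$repo/$cb/metadata.rb"
done
echo "depends 'ssl'" >> "$repo/nrpe/metadata.rb"

missing=0
for meta in "$repo"/*/metadata.rb; do
  # Pull the cookbook names out of `depends '<name>'` lines.
  for dep in $(sed -n "s/^depends '\([^']*\)'.*/\1/p" "$meta"); do
    [ -d "$repo/$dep" ] || { echo "missing from repo: $dep"; missing=1; }
  done
done
[ "$missing" -eq 0 ] && echo "all dependencies are in-repo"
```

Run against the repo before `knife cookbook upload`, a check like this fails fast instead of half-way through building a server.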