Or how to avoid frustration configuring, debugging, and rescuing servers and third-party services

We all know we ought to document our software. But in order to act on this advice, we first need to figure out where to draw the boundary that delineates our software from not our software. As it turns out, it’s rather tricky to draw a clean line between the two. Indeed, as I’ll show below, most of us radically underestimate how far the tentacles of our software reach. And by extension, our software documentation falls far, far short.

Let’s continue thinking about this conundrum in the context of a simple Ruby web app running on a VPS. Recalling the various projects I’ve worked on or had GitHub access to, I’d say most programmers document:

How to set up a development machine.

The commands to run/test/deploy the software.

The names of the Ruby gems (library dependencies) used within the software, along with these gems’ version numbers. This last addition is given as a free side effect of the Gemfile, the file which pulls library dependencies into the application in the first place.

Most programmers stop here—what’s the problem then? The problem is that they are defining software as “code we write and save in text editors”. But this definition is short-sighted, a form of professional narcissism even. A better, more modern definition of software is “any application of computers to solve a problem”. And, as you can imagine, solving problems with computers doesn’t necessarily require the typing of a single line of code. Nor does it require code editors. Heck, it doesn’t even require code. Software in this sense can be built by cleverly connecting components in mega GUIs (Ableton Live anyone?) or by plugging together strings of web application APIs. The best hacker I know can’t even program.

Viewed under this new definition, documenting software means recording the full set of steps necessary to exactly replicate the current problem-solving apparatus. If we’ve done this correctly, someone following our documentation would be able to rebuild our production server, along with all its collaborating third-party services and tests, should a disaster strike and knock out the current setup.

Returning to our web application example, this means we should be documenting the following:

How we provisioned our server with the needed programming languages (installers, environments, etc.).

For each third-party software partner (like Amazon S3/Google Analytics), we need to note every important configuration option we filled out through their web GUIs. We want to keep track of how we organise constellations of user accounts for security. And we also want to keep track of service-specific settings (e.g. bucket-level security rules, logging info, etc.).

The full and current configuration files for our HTTP-to-outside-world server (e.g. Nginx) and our Ruby-to-Nginx server (e.g. Puma or Unicorn). The documentation should reveal the motivation behind why certain settings were chosen, particularly if those choices were surprising.

Command-by-command steps to install Postgres or Nginx on our server, at least to the extent that their installation processes are fraught with unexpected difficulties (and they nearly always are for those of us who aren’t sys-ops pros).

Exact details on how to run, restore from, or reinstate our backup solution.

The keys and values of all the environment variables expected to exist on our server, taking care, obviously, to protect the more sensitive items.

A full description of our DNS settings, including A records and MX records.

What type of HTTPS certificate we need, where the current one was purchased, the details required to generate it, and instructions on how and where to install it.

The source code of every shell script running on the server, along with the exact settings of the cronjob/equivalent running these scripts.

The startup scripts that ensure all processes are initiated on a server reboot.

Notes about the special request emails we had to send to our third-party partners (e.g. I had to email my search-as-a-service provider to switch on its stop words feature for my account, since there was no trace of this feature in the company’s web application’s GUI).
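As one illustration of how cheaply an item on this list can be captured, here is a minimal sketch that snapshots a domain's A and MX records into plain text using Ruby's standard `resolv` library. The method names (`describe_record`, `dns_snapshot`) and the domain `example.com` are assumptions made for the example, not part of any existing setup.

```ruby
require "resolv"

# Record types worth snapshotting; the list above names A and MX records.
RECORD_TYPES = [
  Resolv::DNS::Resource::IN::A,
  Resolv::DNS::Resource::IN::MX
].freeze

# Turn one DNS resource into a line of plain text for the docs.
def describe_record(domain, record)
  case record
  when Resolv::DNS::Resource::IN::A
    "#{domain} A #{record.address}"
  when Resolv::DNS::Resource::IN::MX
    "#{domain} MX #{record.preference} #{record.exchange}"
  else
    "#{domain} #{record.class} (not formatted by this sketch)"
  end
end

# Query each record type and return sorted, human-readable lines.
# `resolver` is injectable so the method can be exercised without a network.
def dns_snapshot(domain, resolver: Resolv::DNS.new)
  RECORD_TYPES
    .flat_map { |type| resolver.getresources(domain, type) }
    .map { |record| describe_record(domain, record) }
    .sort
end
```

On a machine with network access, `puts dns_snapshot("example.com")` would print the live records, which can then be committed alongside the rest of the documentation. Other record types (CNAME, TXT and so on) would need their own branches.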

Gathering this information isn’t particularly difficult—how long does it take to copy and paste the results of a DNS lookup? How hard is it to prune the contents of the `$ history` command in a shell after you’ve finished setting up a server? For practically no extra cost in time, you get a step-by-step playback of the steps you followed.
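Pruning shell history can itself be scripted. The following is a rough sketch, assuming the input is the numbered output of `history` saved to a file; the `NOISE` pattern and the method name `prune_history` are illustrative choices, and what counts as noise will differ from one setup session to the next.

```ruby
# Commands that rarely belong in a setup log (an illustrative list; adjust to taste).
# `cd` is deliberately kept, since later relative paths depend on it.
NOISE = /\A(ls|pwd|clear|history|exit|man)\b/

# Strip the numbering that `history` prints, drop noise and blank lines,
# and collapse commands repeated back to back.
def prune_history(raw)
  raw.each_line
     .map { |line| line.sub(/\A\s*\d+\s+/, "").strip }
     .reject { |command| command.empty? || command.match?(NOISE) }
     .chunk_while { |previous, current| previous == current }
     .map(&:first)
end
```

Only consecutive duplicates are collapsed, so a command that genuinely had to be run twice at different points (restarting a service, say) stays in the playback.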

Do not fall into the trap of leaving undocumented any configurations that happen only “once” in your product’s lifecycle. Insane is the system administrator who views themself as a commando in enemy territory: They SSH-teleport into the VPS, conjure up their sidekick Vim to manually configure each relevant file (cronjobs, “nginx.conf”, “unicorn.config”, etc.), tinker for hours until the server finally runs, then speedboat off into the sunset, e-cigar in hand. This kind of machismo is stupid. This system administrator won’t remember how they configured the server six months later. They have no sharable documents to ease collaboration and encourage instructive comment from others. Unlike programmers proper, they have no source control and therefore no possibility of rolling back to happier days. And they have failed to build their own toolbox through the creation of reusable guides or code samples for speeding up related projects, meaning that their time to completion on future projects will be inflated compared to their alter ego who regularly squirrels away reusable morsels. And, worst of all, when the server eventually goes down, they’ll need buckets more time to resuscitate it, potentially costing the business dearly. Case in point to show that these things actually happen: My previous VPS provider was hacked by a disgruntled ex-employee, resulting in my web application server being completely deleted! AKA now you know a guy who had that happen to him once.

I’ve argued that, when considering documentation, a broader definition of software is needed than initially assumed. Must this documentation consist of laborious READMEs? Not at all: Documentation is not tied to any particular form. Indeed, the most convenient documentation to keep up to date is that which occurs as a side effect of functionality in your software.

Most trivially, code comments reveal a great deal, as do automated tests.

As mentioned already, highly specific version number information for dependencies can be automatically generated and checked into version control (e.g. Ruby’s Gemfile.lock from the Bundler world). Such lockfiles leave a convenient, executable record.

If you have accurate, up-to-date scripts to provision servers (Chef, etc.), these will contain the same information that laborious command-by-command descriptions of the equivalent would. Relatedly, server configuration files (e.g. cronjobs, nginx.conf, unicorn.config) that are checked into your code repository and automatically reuploaded on deploy guarantee that your records in source control always match the state of your server’s config files.
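Until config files are reuploaded automatically, the match between repository and server can at least be checked. The sketch below compares checksums of each checked-in file against its deployed counterpart; the method name `config_drift` and the example paths are assumptions, and the script is assumed to run somewhere that can read both copies.

```ruby
require "digest"

# Given a hash of { repo_path => server_path }, return the pairs whose
# deployed copy is missing or differs from the version in source control.
def config_drift(pairs)
  pairs.reject do |repo_path, server_path|
    File.exist?(server_path) &&
      Digest::SHA256.file(repo_path).hexdigest ==
        Digest::SHA256.file(server_path).hexdigest
  end
end

# Example call (hypothetical paths):
#   config_drift("config/nginx.conf" => "/etc/nginx/nginx.conf")
```

An empty result means the records in source control still describe the server. Anything else points at a file that was edited by hand and never committed.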

If you enclose your software in a project-level OS environment and explicitly load all your environment variables from an accompanying file, this leaves you with a foolproof description of what variables ought to be set for the software to run.
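A minimal sketch of that idea follows, assuming a plain `KEY=value` file; the method names and the variable names used in the usage note are hypothetical. It deliberately ignores quoting and multi-line values, which a dedicated library would handle.

```ruby
# Parse KEY=value lines, skipping blank lines and # comments.
def parse_env_file(text)
  text.each_line.with_object({}) do |line, vars|
    line = line.strip
    next if line.empty? || line.start_with?("#")

    key, value = line.split("=", 2)
    vars[key.strip] = value.to_s.strip
  end
end

# Names that the given environment does not define.
def missing_variables(required_names, env = ENV)
  required_names.reject { |name| env.key?(name) }
end

# A shareable template: every key, no values, safe to commit.
def redacted_template(vars)
  vars.keys.sort.map { |key| "#{key}=\n" }.join
end
```

At boot, something like `ENV.update(parse_env_file(File.read(".env")))` would load the file, and `missing_variables(%w[DATABASE_URL SECRET_KEY_BASE])` would flag anything still unset. The redacted template documents which variables must exist without exposing the sensitive values.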

With all this configuration documented, just imagine how many days you’ll save when lightning strikes and your server crashes. Because sooner or later the storm will come, and the test of your talent as an engineer is whether or not you have mitigated these risks.