How do you document your ops infrastructure?

Editor’s note: For a detailed look at how we systematically unearth productivity black-holes in our Ops team, join the webinar at the end of this page. Note, this is a new version of an article originally published on 03/15/2014.

As your team infrastructure grows, one of the most important things is how it’s documented. Anyone new joining the team, existing members working on new areas, or even the on-call team needs to know how things work.

The first line of documentation is essentially config management, and for this we use Puppet. This defines things like packages, config files, server roles, etc. However, it only defines the “state.” In addition to this, documentation needs to cover things like emergency response, how to deal with alerts, failover procedures, processes, checklists and vendor information.

What do we want from our ops documentation?

I recently started a project at Server Density to revamp all our docs. We’ve had some problems which could have been avoided or resolved faster if our docs were better. As our infrastructure continues to grow, this is important to address properly, and then keep well maintained.

Historically, we used Confluence as a wiki but we gradually transitioned to using GitHub with markdown formatted files alongside code. However there are some problems:

Search. Searching in GitHub is more designed to search code, and requires some filters for the organisation and repository. We’d need to split the docs to a separate repo to avoid the code alongside them also being searched. In Confluence search was never accurate and also quite slow.

Editing. The biggest challenge for any documentation is keeping it up to date. Being able to quickly edit the docs is important and there’s some overhead with a wiki format or having to commit code—it’s minor, but is an extra step. Formatting is also inflexible.

Collaboration. Being able to work on a a doc simultaneously or discuss changes / comment on areas of a doc is much better in GitHub than on Confluence but is still focused around individual commits, or pull requests combining specific changes. This works well for a specific body of work but not for ongoing discussions.

Speed. GitHub has a good performance but Confluence is really slow at everything. We used their hosted version rather than the on-premise install.

In summary, we want a system that has minimal barriers to creating / editing docs, can be searched quickly and accurately, is easy to collaborate on and ideally it should also be available offline and/or downloadable.

How do other people document their infrastructure?

I asked on Twitter to see what other people were doing, having looked online and not found much about what other companies are doing (other than a brief mention of Confluence by Etsy).

What do people use for documenting ops e.g. failover procedures, checklists, known issues, etc? GitHub? Confluence?

You can click through to see the range o replies—they included things like Mediawiki, GitHub Wiki, OneNote, HackPad (was since acquired by DropBox), Confluence, and some more complex tools with offline sync. Also noted was how GitHub do this, using Markdown files which are synced offline too.

What did we pick for Ops documentation?

Having already tried confluence and Markdown files in GitHub, I decided to try Google Docs. The whole team already had access to it through the web, offline and via mobile; documents can be created and edited very quickly, in-line and collaborated on by multiple team members; it has a built in drawing tool so we can create system diagrams; it’s very fast to load; and crucially, search is incredibly fast and accurate. Indeed, it is Google search after all! You can also download documents in multiple formats to store offline if you prefer.

Are you doing something different or have a good way to address the documentation problem—please do comment!

Also, make sure you join the Running Better Ops Teams webinar (see below). It reveals the ins and outs of how to systematically unearth engineer-time black holes, eliminate knowledge silos, and save time for things that matter: Improving your product and growing your business.

Free eBook: 4 Steps to Successful DevOps

This eBook will show you how we i) hacked our on-call rotation to increase code resilience, ii) broke our infrastructure, on purpose, to debug quicker and increase uptime, and iii) borrowed practices from the healthcare and aviation industry, to reduce complexity, stress and fatigue. And speaking of stress and fatigue, we’ve devoted an entire chapter on how we placed humans at the centre of Ops, in order to increase their productivity and boost the uptime of the systems they manage. What are you waiting for, download your free copy now.