Checklist Driven Development

Programming – it’s probably the most intellectually demanding pursuit on Earth. Whilst fixing a bug, you have to hold a mental image of the entire codebase’s architecture, keep track of the syntactic idiosyncrasies of your programming language of choice and make sure that your new code is clear enough for other engineers to understand without needing to come and bother you about it months later. It’s no wonder that programmers have such high IQs!

But despite being super smart, we still make a lot of mistakes. On average, a software developer will write around 121 released lines of code per day. Unfortunately, every 1,000 lines of released code contain 15-50 bugs. Taken together, those figures mean each of us generates somewhere between two and six new production defects every day.
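As a back-of-envelope check, the two figures above can simply be multiplied out. The snippet below is only illustrative arithmetic; the function name is made up for this example and the inputs are the numbers quoted in the paragraph above.

```python
def defects_per_day(lines_per_day: float, bugs_per_kloc: float) -> float:
    """Expected new defects per developer per day."""
    return lines_per_day * bugs_per_kloc / 1000


low = defects_per_day(121, 15)   # optimistic end of the 15-50 range
high = defects_per_day(121, 50)  # pessimistic end of the 15-50 range
print(f"{low:.1f} to {high:.1f} new defects per developer per day")
```

Even at the optimistic end of the range, that is more than one new defect per developer per day.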

Sometimes these bugs can be rather embarrassing. Take the time I was working on a WordPress site for a not-for-profit. They were short on cash, so I found them a really cheap webhost: $2/month. Amazing bargain! Or so it seemed until the hosting company went broke a few months later. Doh! Normally it wouldn’t be a huge deal because you take backups of course. Except I didn’t. I was so busy chasing down a great bargain for my client that I forgot the basic step of setting up a backup plugin. It wasn’t that I didn’t know how to set up backups; it just wasn’t a habitual step.

It seems like I’m not the only one who makes embarrassing mistakes like this. A colleague was doing some online banking with a major Australian bank and was alarmed to discover that there was a debugger statement left in the code. Let’s just say he wasn’t greatly impressed by their security measures!

Even NASA engineers make mistakes. In 1999, the Mars Climate Orbiter was lost whilst attempting to enter orbit around Mars. Investigators discovered the cause: one piece of ground software reported thruster impulse in pound-force seconds when the navigation software expected the SI unit of newton-seconds. As a result, the effect of each thruster firing was underestimated by a factor of about 4.45, and the orbiter approached the red planet far too low and is believed to have broken up in its atmosphere. The cost to taxpayers? $327 million.
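To make the unit mix-up concrete, here is a minimal sketch of one way to guard against it: store impulse in a single unit and force callers to say which unit they are handing over. This is not how the NASA software worked; the `Impulse` type and its method names are hypothetical, invented for this example. The only hard fact is the conversion factor (one pound-force is 4.4482216152605 newtons).

```python
from dataclasses import dataclass

# Exact by definition: one pound-force is 4.4482216152605 newtons.
NEWTON_SECONDS_PER_LBF_SECOND = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse stored internally in newton-seconds (hypothetical type)."""

    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens exactly once, at the boundary.
        return cls(value * NEWTON_SECONDS_PER_LBF_SECOND)


# A value reported in pound-force seconds is converted on the way in,
# so downstream code only ever sees newton-seconds.
reported = Impulse.from_pound_force_seconds(10.0)
print(reported.newton_seconds)
```

Because the unit is part of the constructor’s name, passing a raw number without stating its unit becomes much harder to do by accident.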

Sometimes bugs have a human cost. The Therac-25 radiation therapy machine was designed to help extend the lives of cancer patients. Instead, between 1985 and 1987 it was involved in at least six accidents in which patients received massive radiation overdoses, several of them fatal. One of the causes was a race condition that let the machine deliver roughly a hundred times the intended dose.

As more and more aspects of everyday life are touched by code, we programmers have a tremendous responsibility to ensure that we don’t leave bugs in our code.

How can we do it, though? We’re human – we have off days, we have things happening in our personal lives, we sometimes get asked to work crazy hours. How can we get things right consistently?

Dr. Atul Gawande, the author of the book “The Checklist Manifesto”, has an answer: checklists. He presents evidence that, regardless of their raw intelligence and experience, all professionals are liable to make “errors of ineptitude”: mistakes made not for lack of knowledge but through failing to apply what they already know. From pilots to construction engineers to venture capitalists to doctors, everyone screws up regularly. On the bright side, every industry that Gawande examined benefited from checklists.

To give a historical picture, checklists first found use in the aviation industry. In 1935, the US military was faced with a dilemma. It needed advanced aircraft, the kind that would later help win World War 2, but these aircraft were so advanced that they were almost impossible to fly. The Boeing B-17 Flying Fortress had so many knobs and dials that one of the US Army Air Corps’ most experienced test pilots crashed the prototype on the day it was being demoed to military commanders. If he couldn’t fly this thing, who could? The aircraft was almost mothballed after the disastrous demo. Fortunately for the Allies, a group of test pilots came up with a way to make flying the B-17 safer and it was…you guessed it: checklists!

Pilots now have hundreds of checklists at their disposal, ranging from the pre-flight checklist to contingency checklists that come into play when unusual circumstances occur. Remember the Hudson River hero? The pilot who skillfully glided his aircraft down to a safe landing on the Hudson after losing both of his engines to a bird strike? He and his co-pilot were using a checklist to help restart the engines and manage the glide.

Checklists in medicine

Other professions have started to cotton on. Gawande and his colleagues at the World Health Organisation (WHO) devised a checklist to help reduce the death toll from surgery. Every year 237 million people go under the knife; of those, 1 million die and 7 million are left disabled. That staggering toll is comparable to that of diseases like malaria and tuberculosis. Doctors are obviously not foolish or poorly trained. Practising as a doctor in most countries requires seven or more years of careful training and supervision. Yet they still make mistakes, sometimes silly ones.

Gawande and his colleagues came up with a 19-point checklist to try to stem the tide. They trialled the checklist on roughly 8,000 operations across 8 hospitals. The results were staggering: a 36% reduction in surgical complications. When you consider that most healthcare interventions are lucky to result in a 5% improvement, 36% is an incredible result.

It’s worth noting that despite this result and many other similar studies, medicos are often still resistant to using checklists. As Gawande notes: “It somehow feels beneath us to use a checklist, an embarrassment. It runs counter to deeply held beliefs about how the truly great among us – those we aspire to be – handle situations of high stakes and complexity.”

Checklists in programming

Gawande ends his book with a call to arms: any profession could benefit from checklists. What about software engineering? Are we unique? We’re smart and well trained but we still make silly mistakes and experience the embarrassment and hurt that these mistakes cause. Seems like we could give checklists a crack.

Learnosity and Safety Culture, two Australian SaaS firms, both employ checklists in their development process.

Case study: Learnosity

I joined Learnosity almost a year ago and was thrilled to hear “The Checklist Manifesto” mentioned in my first week and to see several checklists already in operation. We have quite a few checklists in use:

the new starter checklist to make sure newbies get the right schwag and software on their first day

the release checklist to make sure our deployments are as de-risked as possible

our pre-code review checklist that developers tick off before moving a ticket to Code Review

our functional review checklist that QA uses to make sure a new feature or bug fix is rock solid

The last two checklists were brought in ten months ago and have already been quite helpful. The rigour and discipline from these checklists have helped catch bugs before they go out to production and serve as a reminder of our engineering best practices (e.g. “Have you increased unit test coverage?”).

Case study: Safety Culture

Safety Culture is another Australian company that lives and breathes checklists. The company’s product is an app to manage checklists, primarily for the construction industry where there are strict compliance standards. New starters are given a copy of “The Checklist Manifesto” to inculcate the checklist ethos in them from day dot. Being so checklist centric, it would be hypocritical of them to not use checklists as part of their development process and indeed they do.

They use a pull request checklist with 11 steps. This checklist is designed to help code reviewers pick up issues whilst reviewing another engineer’s code. The checklist includes steps to pick up silly mistakes (e.g. “Are any new features hidden behind a feature flag or other restriction?”) as well as steps to help enhance collaboration and solve tough problems (e.g. “Has it had a secondary review from someone familiar with the service or technology if needed?”).

Principles for effective checklists

By this point, I’m hoping you’re convinced that checklists can be effective and are motivated to give them a shot in your engineering team. Before you do, I’ll offer a few suggestions for how to make your checklists even more effective.

Principle #1: expect push back

At Learnosity, Safety Culture and almost every organisation Gawande surveyed in his book, checklists are far from being universally welcomed. The only situations where checklist usage approaches 100% are where it is mandated by legislation. As such, I’d encourage you to exercise patience and use all of the change management tools you have up your sleeve. Consult your team, get their buy-in and hopefully checklists will start to become a thing at your organisation.

Principle #2: avoid checklist fatigue

As a checklist fanatic, I’ve had to be reined in from creating crazily complicated checklists. Gawande reports a rule of thumb from Boeing’s checklist designers: keep a checklist to between five and nine items.

To reach this mark, tap into your lazy self and automate as many checks as you can. Right now our CI system only runs tests when we deploy code to staging, but we’re reworking it so that any time new code is checked into a branch, unit, integration and UI tests will run automatically and the results will be reported via Slack. Once that’s done, we’ll be able to remove that item from our checklist.
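Tests aren’t the only item that can be handed to a machine. As a rough sketch of the idea, here is how a “no leftover debugger statements” check could be automated. It is not what Learnosity’s CI actually runs; the function name and the regular expression are assumptions made for this example, and the pattern only catches a bare JavaScript-style `debugger` statement on its own line.

```python
import re

# Matches a JavaScript-style `debugger` statement on its own line.
DEBUGGER_STATEMENT = re.compile(r"^\s*debugger\s*;?\s*$")


def find_debugger_lines(source: str) -> list[int]:
    """Return the 1-based line numbers that hold a bare `debugger` statement."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if DEBUGGER_STATEMENT.match(line)
    ]


sample = "function pay() {\n  debugger;\n  submit();\n}\n"
print(find_debugger_lines(sample))  # [2]
```

A check like this could run on every commit and fail the build when the list is non-empty, which frees up a slot on the human checklist for something a machine can’t verify.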

Principle #3: raise the bar

Although we want to use checklists to avoid silly mistakes, if that’s all they cover, people will start to ignore them. “When’s the last time I left a debugger in my code? I don’t need to use this stupid checklist!”

Checklists work best when they enshrine best practices. Learnosity’s check, “Have you added tests?”, makes it clear that we expect developers to write automated tests to prevent regressions. This is far from a trivial step. Some items – e.g. “Have you removed debuggers?” – are 5-second checks, but writing tests takes time. Having this as a checklist item that pops up every time you move a ticket to code review makes the expectation tangible. Rather than being a plea to write tests buried in a document that no-one reads, a checklist serves as a constant reminder.
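To show what “pops up every time you move a ticket” might look like in code, here is a minimal sketch of a checklist that knows which items are still outstanding. It is a hypothetical model, not Learnosity’s actual tooling; the `Checklist` class and its methods are invented for this example, and only the two questions come from the checklist described above.

```python
from dataclasses import dataclass, field


@dataclass
class Checklist:
    """A named list of questions, each ticked off by a human (hypothetical model)."""

    name: str
    items: list[str]
    ticked: set[str] = field(default_factory=set)

    def tick(self, item: str) -> None:
        if item not in self.items:
            raise ValueError(f"Unknown checklist item: {item!r}")
        self.ticked.add(item)

    def outstanding(self) -> list[str]:
        # Preserve the original order so the reminder reads like the checklist.
        return [item for item in self.items if item not in self.ticked]

    def is_complete(self) -> bool:
        return not self.outstanding()


pre_review = Checklist(
    name="Pre-code review",
    items=["Have you added tests?", "Have you removed debuggers?"],
)
pre_review.tick("Have you removed debuggers?")
print(pre_review.outstanding())  # ['Have you added tests?']
```

A ticket workflow could refuse the move to Code Review until `is_complete()` returns true, so the slow, valuable item can’t be quietly skipped.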

Principle #4: foster collaboration

Although many serious mistakes are silly ones (e.g. forgetting to install a backup plugin or failing to convert from pound-force seconds to newton-seconds), some are exceedingly complex. Take concurrency-related bugs like the Therac-25 race condition. Concurrent code is notoriously prone to bugs and also rather difficult to debug.

In situations like this, teamwork makes the dream work. Safety Culture’s got something clever going with their pull request checklist: “Has it had a secondary review from someone familiar with the service or technology if needed?” makes sure that complicated logic gets extra attention. Gawande spends a lot of time discussing how checklists don’t just stop errors of ineptitude but can also solve complex problems by fostering collaboration.

Your turn now

If you’ve read this far, you’re probably enthused about the idea of using checklists and have got a few ideas on how to start using them. So why not have a go? It doesn’t take long to come up with a checklist and share it with your team. You can use a tool like www.process.st or SafetyCulture.io, create a custom JIRA like we did at Learnosity or even go low tech and just print off a checklist for your engineers to put next to their machines.