…and technology is not always the answer.

When GitHub goes down…

A huge number of organisations rely on GitHub every day to safeguard their code and keep it available to all of their CI tools 24×7.

As I write this, GitHub is offline, and I’m sure there are many systems administrators and operations engineers working hard to fix it.

This isn’t a post about how redundant GitHub should be – those guys know what they are doing and I trust them to fix it. It’s about what you do when your company relies heavily on GitHub to run, and how you can start to track the external services that matter via standard monitoring tools.

The first thing to do during a GitHub outage (or any major outage of a third party) is to resist the urge to take to social media and start slating them about how you could do better. I’ve failed to resist this in the past and it gets you into all sorts of trouble. It’s hard, but it is possible.

The second thing to do is work out what it means for your organisation. A few questions that are worth asking at this point are:

Can you still trade?

Can you still test your code?

Can you deploy if there is a critical bug found in your application?

Will this affect your customers’ ability to trade/test/deploy?

How many pizzas will be needed if this starts to continue into the early hours of the morning?

If you can still trade, then you’re in a good place.

If you can test using a local copy of your repo, then that’s also a good thing.

If you can deploy your code to production without needing to commit, test and merge to master, then you’ve quite possibly got other issues with your “DevOps Strategy”. However, in this case you have inadvertently created a workaround for GitHub outages, so I guess congratulations are in order… [0]

The third thing to do is analyse why you didn’t spot this earlier.

Maybe you did; maybe your monitoring solution kicked in and alerted you to the fact that you wouldn’t be able to push to live. Maybe you were woken up at the same time as the GitHub engineers so you could fix things at your end of the pipeline. However, I’m going to guess again and say that this probably wasn’t the case.

One of the best ways to ensure that your business isn’t affected by an outage at a third-party supplier such as GitHub is to hook into their status page and link that up to your monitoring solution.

Thankfully for us, GitHub have a really easy-to-use status API that returns the current status as a JSON response. This allows us to easily interpret the output of the API and act in our monitoring solution accordingly.
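The endpoint for GitHub’s status API has moved around over the years, so treat the URL and the response shape below (a small JSON document along the lines of {"status": "good"}) as assumptions and check the current status page before relying on them. A minimal sketch of fetching and interpreting the response:

```python
import json
import urllib.request

# Assumed endpoint - GitHub's status API has changed location over time,
# so verify the live URL against their status page before using this.
STATUS_URL = "https://status.github.com/api/status.json"

def fetch_status(url=STATUS_URL):
    """Fetch the status API and return the decoded JSON payload."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def describe(payload):
    """Turn an assumed payload like {"status": "good"} into a readable line."""
    status = payload.get("status", "unknown")
    descriptions = {
        "good": "all systems operational",
        "minor": "minor service disruption",
        "major": "major service outage",
    }
    return f"GitHub reports: {descriptions.get(status, 'status unknown')}"
```

Because the interpretation is separated from the fetch, the same `describe` function can be reused whether the payload comes from a live request or a cached copy.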

I’ve used Nagios and Icinga for years; these days I’m slowly migrating to Dataloop, who offer a great service [1]. One of the things I love about Dataloop is that they support the standard Icinga/Nagios check syntax based on exit codes.
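That convention is simple: a check prints a one-line status message and signals its verdict through its exit code – 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. A sketch of wiring the (assumed) GitHub status strings onto those codes:

```python
import sys

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed mapping from GitHub's status strings to plugin verdicts.
STATUS_MAP = {"good": OK, "minor": WARNING, "major": CRITICAL}

def status_to_exit(status):
    """Translate a status string into a plugin exit code (UNKNOWN otherwise)."""
    return STATUS_MAP.get(status, UNKNOWN)

def report(status):
    """Print a one-line summary and exit with the matching code."""
    labels = {OK: "OK", WARNING: "WARNING",
              CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}
    code = status_to_exit(status)
    print(f"GITHUB {labels[code]} - reported status is '{status}'")
    sys.exit(code)
```

Any monitoring system that speaks the Nagios plugin protocol only needs that exit code and the first line of output, which is why a check like this drops straight into Icinga, Nagios or Dataloop unchanged.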

This means that within minutes of spotting the issue with GitHub this morning, I was able to write the following and upload it to Dataloop: