I'm a Fellow at the Adam Smith Institute in London, a writer here and there on this and that and strangely, one of the global experts on the metal scandium, one of the rare earths. An odd thing to be but someone does have to be such and in this flavour of our universe I am. I have written for The Times, Daily Telegraph, Express, Independent, City AM, Wall Street Journal, Philadelphia Inquirer and online for the ASI, IEA, Social Affairs Unit, Spectator, The Guardian, The Register and Techcentralstation. I've also ghosted pieces for several UK politicians in many of the UK papers, including the Daily Sport.

Analyzing Friday's Google Outage

As you may or may not know, nearly all Google services went down on Friday afternoon. No one outside the company is quite sure what happened as yet: amusing speculations have been that the Googleplex finally gained consciousness or that someone made the mistake of typing Google into Google. One thing we did find out is that, according to one real-time analysis company, internet traffic dropped 40% during those few minutes. Meaning that we could therefore, realistically, say that Google is 40% of internet traffic.

As Google themselves have said (via email) what happened was:

To clarify, that 11 minutes came from the posting times of each of the updates on the Dashboard, which is different than the actual incident time. The dashboard clearly states “Between 15:51 and 15:52 PDT, 50% to 70% of requests to Google received errors; service was mostly restored one minute later, and entirely restored after 4 minutes.”

But this doesn’t tell us all that much. We know it went down, and we know it came back up again, impressively quickly too. That’s certainly one lesson from this: Google’s recovery programs are excellent. It makes an interesting comparison with the several days’ outage for certain of Microsoft’s Outlook services. Or the famed week over here in the UK when a major bank’s systems went kablooie as a result of a software upgrade.

But one question I find really interesting is, well, should we actually expect Google’s systems to run perfectly all the time anyway? How much downtime should we accept as just being part and parcel of the way the world works?

I’m always hesitant to use anonymous sources but you’ll just have to accept that this one does really know the subject under discussion. Well, you don’t have to but perhaps you will accept my assurance:

“Five Nines” (99.999%) is the Holy Grail of marketeers, but in practice it seems to be unachievable for a complex system. You have only 5 minutes per year of downtime allowed, which normally equates to one incident every 3-4 years at max. Either your system is extremely simple, or it’s massively expensive to run. Normally the cost of that extra 45 minutes of uptime a year is prohibitive – easily double that of four nines in many cases, sometimes much more – and most reasonable people settle for four nines or, in practice, less than that.
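To put numbers on those nines, here is a minimal Python sketch of the arithmetic. It assumes a 365-day year and treats "N nines" as an availability of 1 - 10^-N; the function name `downtime_budget_minutes` is purely illustrative and not taken from any particular tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        availability = 100 * (1 - 10 ** (-n))
        budget = downtime_budget_minutes(n)
        print(f"{n} nines ({availability:.3f}%): {budget:.1f} minutes per year")

    # The uptime gained by moving from four nines to five nines.
    extra = downtime_budget_minutes(4) - downtime_budget_minutes(5)
    print(f"Extra uptime from four to five nines: {extra:.1f} minutes per year")
```

On these assumptions four nines allows roughly 52.6 minutes of downtime a year and five nines roughly 5.3, so the gap is a little over 47 minutes, in line with the "extra 45 minutes" quoted above. By the same sum, the four or so minutes of Friday's outage would use up most of a five nines budget for the year.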

As you attempt to get failures entirely ironed out of the system the costs start to rise near exponentially. Thus there’s always going to be a trade off between system reliability and cost.

There’s a similarity here with what happens in my day job of weird metals. As soon as you start to demand metals at higher purities then the costs soar. For example, standard aluminium is 99.7% pure and costs around $3 a kg (say, $1.40 a lb) at present. 99.9% Al costs $5 a lb. 99.99% perhaps $15 and so on and so on: I’ve had to talk people back from demanding very high purity metals and oxides for certain processes before now. Explaining that the vast costs of moving from a 99.995% to a 99.999% purity simply aren’t going to be worth it for anything short of a space mission style environment.
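The same sort of sum works for the metal prices. The sketch below uses only the rough figures quoted above (they are ballpark numbers, not market data) and shows how the price multiple climbs as the permitted impurities shrink. The helper names are hypothetical.

```python
LB_PER_KG = 2.20462  # pounds in one kilogram

# (purity in percent, approximate price in USD per lb), as quoted in the text
QUOTED_PRICES = [(99.7, 1.40), (99.9, 5.00), (99.99, 15.00)]


def usd_per_kg_to_usd_per_lb(usd_per_kg: float) -> float:
    """Convert a price per kilogram into a price per pound."""
    return usd_per_kg / LB_PER_KG


def impurity_ppm(purity_percent: float) -> float:
    """Impurities in parts per million for a given purity in percent."""
    return (100.0 - purity_percent) * 10_000


if __name__ == "__main__":
    print(f"$3.00 per kg is about ${usd_per_kg_to_usd_per_lb(3.0):.2f} per lb")

    base_price = QUOTED_PRICES[0][1]
    for purity, usd_per_lb in QUOTED_PRICES:
        multiple = usd_per_lb / base_price
        print(
            f"{purity}% Al: about {impurity_ppm(purity):.0f} ppm impurities, "
            f"${usd_per_lb:.2f} per lb, {multiple:.1f}x the standard price"
        )
```

Going from 99.7% to 99.99% cuts the impurities thirtyfold, from about 3,000 ppm to about 100 ppm, while the quoted price rises more than tenfold.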

It doesn’t surprise me that engineering of complex systems works in a similar fashion. Good enough at a reasonable price almost always beats the attempt to be as perfect as possible at any price.

And I will admit that I found this thought amusing:

I’m not surprised Google drops off the planet for 5 minutes – I’m surprised it doesn’t happen more often, and I’m astonished they get it back online in 5 minutes. I also feel sorry for people setting up their Internet connection at home in that outage window, when they tried connecting to www.google.com to verify their connection and it failed. “I can’t reach Google – my Internet must be bust, it certainly can’t be Google that’s unavailable…”
