The mighty chain of IT: Where five 9 uptime falls apart

It used to be said that if you want something done right, you have to do it yourself. True words, but unfortunately that only works if you are ready to build, deploy, monitor and support everything all by yourself.

Earlier this week, GoDaddy.com suffered from an outage which highlighted some significant worries for many people. Whether you were one of the millions of sites hosted by GoDaddy, or one of the millions of customers who use GoDaddy DNS services, you were the unintended victim of a brutal situation.

Regardless of the fact that it was an unexpected attack by a member of the famed Anonymous hacker group, the end result was the same for all of those customers (me included): we realized that the five 9s uptime promise is ultimately delivered on a best-effort basis.

Today, prominent programming author and blogger @JeffHicks was recovering from a hacked Delicio.us account, which resulted in a Twitter blast of spam posts under his profile. While this doesn’t affect the uptime of any of Jeff’s services and sites, it speaks to the importance of the chain of IT.

Weakest Link

“A chain is only as strong as its weakest link”

We’ve all heard the phrase, quoted the phrase and seen the true result of it as well. After years of BCP/DR design and implementation I’ve had more than enough exposure to the SPOF (Single Point of Failure) concept, and the idea that interdependent systems reveal vulnerabilities that have to be understood.

If you were running your application infrastructure and counted on the GoDaddy 99.999% uptime “guarantee”, you have now become the SPOF to your customer. It wasn’t your fault, nor could you reasonably have thought you needed to plan around it. How much more than a five 9 uptime guarantee could you ask for?
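To put the five 9s number in perspective, here is a minimal sketch of the arithmetic behind it. The function name and the 365-day year are my own assumptions for illustration; the point is simply how little downtime each extra 9 leaves you.

```python
# Downtime budget implied by an availability percentage.
# Assumes a 365-day year (leap years ignored) for simplicity.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return how many minutes of downtime fit inside the period."""
    return period_minutes * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

Five 9s works out to a little over five minutes per year, so a single outage measured in hours uses up many years of that budget at once.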

LCD – Lowest Common Denominator

I wrote a series about BCP/DR geared towards the “101” crowd who may not have had exposure to a fully featured BCP program. In one of those posts I talked about how the Lowest Common Denominator is what you use to define the recoverability and reliability of your service.

As we evaluate our business and application infrastructure we have to understand every component that is involved to fully realize where we have exposure to failure or vulnerability.
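One way to see why every component matters is to multiply the availability of each link in the chain. The sketch below assumes the components fail independently and that all of them must be up for the service to work; the component names and percentages are hypothetical, not measured values.

```python
from math import prod


def chain_availability(components):
    """Availability of a serial chain where every component must be up.

    Assumes failures are independent, which real systems rarely guarantee.
    """
    return prod(components.values())


# Hypothetical figures for illustration only.
chain = {
    "dns": 0.99999,
    "web_hosting": 0.9995,
    "database": 0.999,
    "payment_api": 0.995,
}

combined = chain_availability(chain)
weakest = min(chain, key=chain.get)

print(f"weakest link: {weakest} at {chain[weakest]:.3%}")
print(f"end-to-end:   {combined:.3%}")
```

The end-to-end figure can never be better than the weakest component, and it is usually worse, because every additional link takes a little more off the total.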

Known knowns

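Donald Rumsfeld had a great statement about what we know. This is what he said:

“There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know.”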

It’s a powerful statement and I’ve used it many times in presentations. I’ve been asked by management teams over and over again (usually immediately after a system failure): “How do we plan for unplanned outages?”

It is ironic that we are trying to plan for something unplanned. The simplicity of the statement almost gives it an innocence. But there is truth to what it asks.

Test Driven Infrastructure

In a previous post about Test Driven Infrastructure I promoted the use of TDD (Test Driven Development) methodologies for building infrastructure and application systems. It’s an important part of how we get to the four 9 or five 9 design. We cannot just throw down a “guarantee” or a “promise” of uptime if we do not fully understand what it means.

The ideal case for any system is that we design, build and test for failure, and then, and only then, do we really see the potential uptime. If you’ve been involved in BCP, you also understand that there are levels of failure that we plan for. Some things are beyond our ability to plan for, or are so cost prohibitive that we can’t implement the “perfect design”.
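Testing for failure can start very small. The sketch below is a hypothetical example, not taken from any real tooling: it fakes two DNS providers, takes the primary down on purpose, and checks that a lookup still succeeds. Every name in it (`make_provider`, `resolve`, `ProviderDown`) is invented for illustration.

```python
class ProviderDown(Exception):
    """Raised by a fake provider that cannot answer."""


def make_provider(name, records, healthy=True):
    """Build a fake DNS provider; healthy=False simulates an outage."""
    def lookup(hostname):
        if not healthy:
            raise ProviderDown(name)
        return records[hostname]
    return lookup


def resolve(hostname, providers):
    """Try each provider in order and return the first answer."""
    failures = []
    for provider in providers:
        try:
            return provider(hostname)
        except ProviderDown as exc:
            failures.append(str(exc))
    raise ProviderDown("all providers failed: " + ", ".join(failures))


def test_survives_primary_outage():
    records = {"www.example.com": "192.0.2.10"}
    primary = make_provider("primary", records, healthy=False)
    secondary = make_provider("secondary", records)
    assert resolve("www.example.com", [primary, secondary]) == "192.0.2.10"


def test_single_provider_is_a_spof():
    records = {"www.example.com": "192.0.2.10"}
    primary = make_provider("primary", records, healthy=False)
    try:
        resolve("www.example.com", [primary])
    except ProviderDown:
        pass  # expected: with one provider there is nothing to fall back on
    else:
        raise AssertionError("expected the lookup to fail")


test_survives_primary_outage()
test_single_provider_is_a_spof()
```

The second test is the more interesting one. It does not fix anything, but it writes the single point of failure down as a known behavior instead of leaving it to be discovered during an outage.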

So what do I tell my customer?

We can only speak to historical uptime. Have you heard the statement “past performance is no guarantee of future returns”? We also generally don’t expose our entire end-to-end system design to every customer in every case because it would be challenging, and nearly impossible as systems change over time.

As a provider of services (whatever those may be) you will be committed to some SLA (Service Level Agreement), and as a part of that agreement you will have metrics defined to say where you pass or fail. Another key part of that will be a definition of what you do when you miss an SLA. Do you define your SLA over a week, a month or a year? It’s a great and important question.
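The measurement window matters more than it first appears. As a rough sketch, assume a hypothetical 99.9% target and a single 60-minute outage, and then measure the same outage over different windows. The numbers and names here are my own, chosen only to show the effect.

```python
def availability_percent(downtime_minutes, window_minutes):
    """Measured availability over a window, as a percentage."""
    return 100 * (1 - downtime_minutes / window_minutes)


# Window lengths in minutes (30-day month and 365-day year assumed).
WINDOWS = {
    "week": 7 * 24 * 60,
    "month": 30 * 24 * 60,
    "year": 365 * 24 * 60,
}

TARGET = 99.9        # hypothetical SLA target, in percent
OUTAGE_MINUTES = 60  # hypothetical single outage

results = {}
for name, minutes in WINDOWS.items():
    measured = availability_percent(OUTAGE_MINUTES, minutes)
    results[name] = measured >= TARGET
    verdict = "pass" if results[name] else "miss"
    print(f"{name:>5}: {measured:.3f}% -> {verdict}")
```

Under these assumptions the same outage misses the target when measured over a week or a month, yet passes when measured over a year, which is why the window belongs in the agreement alongside the percentage.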

What now?

I don’t want to sound like a negative Nelly, but I do want to raise awareness among designers, programmers, admins, architects and management everywhere that we need to do our best to be aware of vulnerabilities and exposures. This may move into privacy and security, which are ultimately part of the overall picture.

Not too long ago, Dropbox suffered an exposure because of the simplest possible thing: an employee password hack. Regardless of their globally distributed and highly available systems, a single password opened the door to a potentially fatal breach.

So, to refer back to Donald Rumsfeld, we have “unknown unknowns” that cannot be accounted for, but we also have many “known unknowns” that we can get closer to understanding and preparing for.

Break out the Visio diagrams and take a deeper look into where you may have some exposure. And as you do that, you may realize that it is just part of the design and is unavoidable, but it is better to know than to find out the hard way.