What does 100% uptime mean, and how does it pertain to SLAs?

I was on Spiceworks today and ran into this conversation about 100% uptime. I had a few thoughts but am interested in what others have to say, as well. Share them below, or on Spiceworks!

Most SLAs will claim 100% uptime (which most of you know is unattainable) with the provisions that “an outage doesn’t count if it is under 10 minutes” or caused by certain factors, or a host of other excuses.
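To see how much an exclusion like that can hide, here is a minimal Python sketch. The outage durations and the 30-day period are made-up illustration values, and the `availability` helper is hypothetical, not taken from any real SLA tooling. It computes availability twice: once counting every outage, and once ignoring outages shorter than 10 minutes.

```python
from datetime import timedelta

def availability(outages, period, min_counted=timedelta(0)):
    """Percent of `period` not covered by outages at least `min_counted` long."""
    # Outages shorter than min_counted are ignored, as the SLA clause allows.
    counted = sum((o for o in outages if o >= min_counted), timedelta(0))
    return 100.0 * (1 - counted / period)

# Hypothetical month: three short outages, each just under the 10-minute line.
period = timedelta(days=30)
outages = [
    timedelta(minutes=9),
    timedelta(minutes=8),
    timedelta(minutes=9, seconds=30),
]

actual = availability(outages, period)
per_sla = availability(outages, period, min_counted=timedelta(minutes=10))

print(f"actual availability:  {actual:.4f}%")
print(f"availability per SLA: {per_sla:.4f}%")
```

In this made-up month the users saw 26.5 minutes of downtime, yet the provider can still report 100% under the clause.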

Uptime, in this context, has two components: availability and reliability. Availability refers to the amount of time the server is working, and reliability refers to the number of times the server fails.

To put it in a simpler context, imagine we are in a boat. Availability refers to the percentage of time in a given time period that we are out of the water, and reliability refers to the number of times we get wet in that same time period.
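The two measures are independent, which a short Python sketch can make concrete. The outage logs below are invented for illustration, and the function names are hypothetical. One server has a single hour-long outage in a 30-day month; the other has sixty one-minute outages.

```python
def availability_pct(outage_minutes, period_minutes):
    """Availability: share of the period the server was working."""
    return 100.0 * (period_minutes - sum(outage_minutes)) / period_minutes

def failure_count(outage_minutes):
    """Reliability, in the sense used here: how many times the server failed."""
    return len(outage_minutes)

period_minutes = 30 * 24 * 60   # a 30-day month

one_long_dunk = [60]            # fell in the water once, for an hour
many_short_dunks = [1] * 60     # got splashed sixty times, a minute each

for name, log in [("one long", one_long_dunk), ("many short", many_short_dunks)]:
    print(name, f"{availability_pct(log, period_minutes):.3f}%", failure_count(log))
```

Both logs give the same availability figure, but the second server failed sixty times as often, so an availability number on its own says nothing about how often users were interrupted.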

There are three typical solutions for business-critical applications: clusters, fault-tolerant servers, or the cloud.

Microsoft clusters, which only work for cluster-aware applications, work as a team of servers. When one server fails, the next server takes over the application; however, whatever transaction was happening at the time of the fault is lost.

Fault-tolerant servers work in tandem: two servers are doing all the work all of the time, at the same time. If one fails, the application is still running and the users never know a fault has occurred. (Incidentally, with our Stratus servers, when a fault occurs or is about to occur, the server will call home to our service center for proactive maintenance.)

This can be hard to imagine, so here is an analogy. For clusters, imagine a dance team enters a competition. They start the music and a dancer starts her number, but falls and breaks her ankle. A new dancer takes her place, the music is restarted, and the dancing continues.

For fault-tolerant servers, imagine the Rockettes. If one Rockette falls offstage, the kicking and dancing still go on.
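For readers who prefer code to dancers, the same contrast can be sketched as a toy Python simulation. This is only a model of the behavior described above, not how any real cluster or fault-tolerant product is implemented; the function names and the transaction list are hypothetical.

```python
def run_cluster(transactions, fail_at):
    """Failover cluster: the transaction in flight when the fault hits is lost."""
    completed = []
    for i, tx in enumerate(transactions):
        if i == fail_at:
            # The active server dies mid-transaction; the standby
            # takes over and starts with the next transaction.
            continue
        completed.append(tx)
    return completed

def run_fault_tolerant(transactions, fail_at):
    """Lockstep pair: both servers process every transaction together."""
    completed = []
    servers_alive = 2
    for i, tx in enumerate(transactions):
        if i == fail_at:
            servers_alive -= 1  # one server drops out mid-transaction
        if servers_alive > 0:
            completed.append(tx)  # the surviving server finishes the work
    return completed

orders = ["order-1", "order-2", "order-3", "order-4"]

print(run_cluster(orders, fail_at=2))
print(run_fault_tolerant(orders, fail_at=2))
```

With a fault during the third transaction, the cluster model completes three of the four orders, while the lockstep model completes all four.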

On to the “cloud” option. Clouds, like Rackspace, Amazon Cloud, or even many parts of the Google brand, sound like a great plan. But clouds, despite their name, do not run on rainbows and unicorn dust. Their data and applications live on physical servers, which are vulnerable to faults.

Just as an aside, a private cloud is another great option: hosting your own cloud on a high-availability solution like a fault-tolerant server or a cluster.

Reading the fine print in SLAs is crucial. SLAs should be meaningful, and should cost the provider real damages if they are broken. To give some perspective, if our ftServer customers incur ANY downtime at all for any reason, no matter how small, we pay $50,000. Again I say, responsible, customer-oriented companies have wiggle-proof SLAs.