Friday, August 12, 2011

The analysts at Saugatuck Technology recently wrote a note on "Cloud IT Failures Emphasize Need for Expectation Management". One comment caught my attention:

"Recall that the availability of a group of components is the product of all of the individual component availabilities. For example, the overall availability of 5 components, each with 99 percent availability, is: 0.99 X 0.99 X 0.99 X 0.99 X 0.99 = 95 percent."
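For the record, their arithmetic checks out. A quick sketch of the series math:

```python
# Series availability: the system is up only if every component is up,
# so the individual availabilities multiply.
availabilities = [0.99] * 5

overall = 1.0
for a in availabilities:
    overall *= a

print(round(overall, 3))  # 0.951, i.e. roughly 95 percent
```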

I understand their math - but it strikes me as odd that they would use this thinking when discussing cloud computing. In cloud environments, components are often deployed as virtualized, highly available n+1 pairs: if one is down, the other takes over. In a non-cloud world, this architecture is typically reserved for only the most critical components (e.g., load balancers or other single points of failure). It's also common to create a complete replica of the environment in a disaster recovery area (e.g., a separate AWS availability zone). In theory, this leads to very high up-time.

Let me put this another way... I currently have two cars in my driveway. Let's say each of them has 99% up-time. If one car doesn't start, I'll try the other car. If neither car starts, I'll most likely walk over to my neighbor's house and ask to borrow one of their two cars (my DR plan). You can picture the math... in the 1% of cases where car A fails, there's a 99% chance that car B will start, and so on. However, experience with both cars and computing tells us that this math doesn't work either. For instance, if car A didn't start because it was 20 degrees below zero outside, there's a good chance that car B won't start - and for that matter, neither will my neighbor's cars. Structural or natural problems tend to infect the mass - the failures aren't independent.
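To make the car example concrete - this is a sketch with invented numbers, not a real availability model - compare the independent-failure math with what happens once a common-mode event (the 20-below morning) is in play:

```python
a = 0.99  # availability of each car on its own

# Textbook redundancy math assumes failures are independent:
# the pair is down only when both cars fail at the same time.
independent_pair = 1 - (1 - a) ** 2
print(independent_pair)  # ~0.9999 ("four nines")

# Now add a common-mode event that stops every car at once.
# p_common = 0.005 is an invented figure for illustration only.
p_common = 0.005
pair_with_common_mode = (1 - p_common) * independent_pair
print(pair_with_common_mode)  # ~0.9949 - capped near 1 - p_common
```

Notice that once the common-mode term is in the model, adding a second (or tenth) car barely moves the result: the shared failure dominates.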

I wish I could show you the new math for calculating availability in cloud systems - but it's beyond my pay grade. What I know is that the old math isn't accurate. Anyone have suggestions on a more modern approach?
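One candidate approach - offered purely as a sketch, with every number invented - is to stop multiplying probabilities and simulate instead, modeling the shared failure mode explicitly:

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Monte Carlo sketch of availability with a shared failure mode.
TRIALS = 100_000
A = 0.99        # per-car availability on a normal day (made up)
P_COLD = 0.005  # chance of a "20 below" day that stops every car (made up)

available = 0
for _ in range(TRIALS):
    if random.random() < P_COLD:
        continue  # common-mode failure: nothing starts today
    # four independent chances: my two cars plus the neighbor's two
    if any(random.random() < A for _ in range(4)):
        available += 1

print(available / TRIALS)  # hovers near 1 - P_COLD, nowhere near "eight nines"
```

The point isn't the specific estimate - it's that once a shared failure mode exists, piling on redundant components stops helping; the answer converges on 1 - P_COLD no matter how many cars are in the driveway.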

As I dug through the descriptions, I found myself with more questions than answers. When you say Membase or MongoDB are available as part of the PaaS, what does this really mean? For example:

- Are they pre-installed in a clustered or replicated manner?
- Are they monitored out of the box?
- Will they auto-scale based on the monitoring data and predefined thresholds (both up and down)?
- Do they have a data backup / restore facility as part of the as-a-service offering?
- Is the backup / restore itself offered as-a-service?
- Does backup / restore use a job scheduling system that's available as-a-service?
- Does backup / restore use an object storage system with cross-data-center replication?

Ok, you get the idea. Let me be clear - I'm not suggesting that OpenShift does or doesn't do these things. Arguments can be made that, in some cases, it doesn't need to. My point is that several new "PaaS offerings" are coming to market and they smell like the same-ole-sh!t. If nothing else, the product marketing teams will need to do a better job of explaining what they currently have. Old architects need details.

It's no secret that I'm a fan of Amazon's approach of releasing their full APIs (AWS Query, WSDL, Java & Ruby APIs, etc.) along with some great documentation. They've built a layered architecture whereby the upper layers (PaaS) leverage the lower layers (Automation & IaaS) to do things like monitoring, deployment & configuration of both the platforms and the infrastructure elements (block storage, virtual compute, etc.). The bar has been set for what makes something a PaaS - and going forward, products will be measured against it. It's ok if your offering doesn't do all the sophisticated things you find in AWS - but it's better to be up front about it. Old architects will understand.