Why transparency matters in the cloud

A number of people have asked if the advice that Gartner is giving to clients about the cloud, or about Amazon, has changed as a result of Amazon’s outage. The answer is no, it hasn’t.

In a nutshell:

1. Every cloud IaaS provider should be evaluated individually. They’re all different, even if they seem to be superficially based on the same technology. The best provider for you will depend on your use case and requirements. You absolutely can run mission-critical applications in the cloud; you just need to choose the right provider and the right solution, and architect your application accordingly.

2. Just like infrastructure in your own data center, cloud IaaS requires management, governance, and a business continuity / disaster recovery plan. Know your risks, and figure out what you’re going to do to mitigate them.

3. If you’re using a SaaS vendor, you need to vet their underlying infrastructure (regardless of whether it’s their own data center, colo, hosting, or cloud).

The irony of the cloud is that you’re theoretically just buying something as a service without worrying about the underlying implementation details, but most savvy cloud computing buyers actually peer at the underlying implementation in grotesquely more detail than, say, most managed hosting customers ever look at how their environment is implemented by the provider. The reason for this is that buyers lack adequate trust that the providers will actually offer the availability, performance, and security that they claim they will.

Without transparency, buyers cannot adequately assess their risks. Amazon provides some metrics about what certain services are engineered to (S3 durability, for instance), but there are no details for most of them, and where there are metrics, they are usually for narrow aspects of the service. Moreover, very few of their services actually carry SLAs, and those SLAs are narrow and specific (as everyone discovered in this last outage: EBS and RDS were down and neither has an SLA, while EC2 was technically unaffected, so nobody’s going to be able to claim SLA credits).

Without objectively understanding their risks, buyers cannot determine what the most cost-effective path is. Your typical risk calculation multiplies the probability of downtime by the cost of downtime. If the cost to mitigate the risk is lower than this figure, then you’re probably well-advised to go do that thing; if not, then, at least in terms of cold hard numbers, it’s not worth doing (or you’re better off thinking about a different approach that alters the probability of downtime, the cost of downtime, or the mitigation strategy).
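
As a minimal sketch of that arithmetic, assuming you can put a single probability and a single dollar figure on an outage over some period (the function names and all numbers below are hypothetical, chosen only for illustration):

```python
def expected_downtime_cost(probability: float, cost: float) -> float:
    """Probability of downtime over a period multiplied by the cost of that downtime."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if cost < 0:
        raise ValueError("cost must be non-negative")
    return probability * cost


def worth_mitigating(probability: float, cost: float, mitigation_cost: float) -> bool:
    """True when mitigating costs less than the expected downtime cost.

    Simplifying assumption: the mitigation removes the risk entirely.
    """
    return mitigation_cost < expected_downtime_cost(probability, cost)


if __name__ == "__main__":
    # Hypothetical inputs: 5% chance of a major outage this year,
    # $200,000 in losses if it happens.
    exposure = expected_downtime_cost(0.05, 200_000)
    print(f"Expected downtime cost: ${exposure:,.0f}")
    print("Spend $6,000 on mitigation?", worth_mitigating(0.05, 200_000, 6_000))
    print("Spend $25,000 on mitigation?", worth_mitigating(0.05, 200_000, 25_000))
```

The hard part is not the multiplication but the inputs: without provider data, the probability is a guess.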

Note that this kind of risk calculation can go out the window if the real risk is not well understood. Complex systems (and all global-class computing infrastructures are enormously complex under the covers) have nondeterministic failure modes. This is a fancy way of saying, basically, that these systems can fail in ways that are entirely unpredictable. They are engineered to be resilient to ordinary failure, and that’s the engineering risk that a provider can theoretically tell you about. It’s the weird one-offs that nobody can predict, and those are the things most likely to result in lengthy outages of unknown, unknowable length.

It’s clear from reading Amazon customer reactions, as well as talking to clients (Amazon customers and otherwise) over the last few days, that customers came to Amazon with very different sets of expectations. Some were deep in rose-colored-glasses land, believing that Amazon was sufficiently resilient that they didn’t have to really invest in resiliency themselves (and for some of them, a risk calculation may have made it perfectly sane for them to run just as they were). Others didn’t trust the resiliency, and used Amazon for non-mission-critical workloads, or, if they viewed continuous availability as critical, ran multi-region infrastructures. But what all of these customers have in common is the simple fact that they don’t really know how much resiliency they should be investing in, because Amazon doesn’t reveal enough details about its infrastructure for them to be able to accurately judge their risk.

Transparency does not necessarily mean having to reveal every detail of underlying implementation (although plenty of buyers might like that). It may merely mean releasing enough details that people can make calculations. I don’t have to know the details of the parts in a disk drive to be able to accept a mean time between failure (MTBF) or annualized failure rate (AFR) from the manufacturer, for instance. Transparency does not necessarily require the revelation of trade secrets, although without trust, transparency probably includes the involvement of external auditors.
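
To illustrate the kind of calculation a single published figure enables, here is a rough sketch that converts an MTBF into an AFR and then estimates the chance that at least one of several drives fails within a year. It assumes a constant failure rate (an exponential model) and independent failures, which real hardware only approximates. The function names and the 1,000,000-hour MTBF are hypothetical, not taken from any vendor.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def p_any_failure(afr: float, units: int) -> float:
    """Probability that at least one of `units` devices fails in a year.

    Assumes failures are independent of one another.
    """
    if not 0.0 <= afr <= 1.0:
        raise ValueError("afr must be between 0 and 1")
    if units < 0:
        raise ValueError("units must be non-negative")
    return 1.0 - (1.0 - afr) ** units


if __name__ == "__main__":
    afr = afr_from_mtbf(1_000_000)  # hypothetical drive MTBF in hours
    print(f"AFR per drive: {afr:.2%}")
    print(f"Chance of at least one failure among 100 drives: {p_any_failure(afr, 100):.1%}")
```

The buyer never sees inside the drive, yet one number from the manufacturer is enough to size spares and redundancy.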

Much as I appreciate your diss in the second paragraph of your reply, the arithmetic for calculating the cost is quite simple, but depends on having knowledge that cloud customers are selected for not having.

Most anyone who has been writing PHP for more than two years can get up and running on a VPS in only a few minutes longer than it would take with GAE, but also be able to migrate to many other VPS providers without substantially more pain. The point of cloud computing isn’t having your trivial site up in 2-3 minutes, is it? I thought it was all about that scalability.

On FB, my point is exactly that they have the same infrastructure internally, because there’s nothing magical about the cloud, it’s just infrastructure. But they’re not renting space on Google or Amazon, because it’s not cost effective at the high end. Which leads back to one of my core points, which is that the cloud is neither technologically interesting nor cost effective over VPSes or having physical machines.

I would say you’ve hit the nail on the head about the last 30 years of technology. I think if there were zero innovation apart from hardware continuing to become cheap and widely used, I would have trouble saying life is worse than it is now. Definitely the last 20 years. We could be using Standard ML, the only language with fully specified semantics, but instead we switched from C to C++ and Java, bringing with them different kinds of unbearably inhuman complexity. We could be using Plan 9, the successor to Unix with truly integrated networking and distributed computation (a true platform to build cloud-like services on), but instead we’re using the Unix clone written expressly to the lowest common denominator. We could be using message passing instead of threading, we could be using a truer relational calculus rather than SQL, and so forth. It’s as though we chose wrong every time. We live in a world without a reliable networked filesystem, and instead we have so many logging frameworks in Java we have to have a meta logging framework to abstract over them. Now that I think of it, I would miss ZFS, HTTP, and Haskell, but I can’t think of too many other successes we’ve had in the last 30 years.

Like most pessimists, I would say I’m merely a realist. The cloud never was your savior, so there’s no sense pining for the days when you thought it was. It never was anything more than marketing. If you want to do something good for technology, find a way to delete two lines of code for every line you write from now on. Avoid busy work but accept that some code just has to be written by a human and not hidden behind an abstraction layer.

Thanks for listening to my crazy rant. I’ll return to irrelevance now.