Passionate about filling the gap between technology and business. Explaining complexity in plain English is my thing. Thriving on #cloud, #bigdata and #microservices applications. Based in Europe. Photographer wannabe. Heavy metal guitarist.

Meta

Tag Archives: HA

During my recent experience at Structure:Europe I have engaged in a discussion regarding whether the “workload” should “care” about the cloud or not. It is a great topic to debate on and I decided to write a few more thoughts here below. But let me recall the conversation first.

The trigger was a sentence by @ditlev, CEO of OnApp, that we also heard during his session with @tonylucas during the second day of Structure, and that I commented on Twitter with:

.@richtwarner@Ditlev workloads care about the #cloud because they need to control and orchestrate the infrastructure in order to scale

Among others, @khushil, engineer at Mail Online, did not agree with me as we read his tweet replying: “wrong way around, the cloud needs to understand the workload to scale to support. other way misses the point“.

Ditlev obviously picked this up and explained further that “workloads should be agnostic, your infrastructure (incl cloud)/platform should adapt” and that he would “like to see abstraction layers between workloads and infrastructure“.

What’s the source of the workload?

One may be confused and agree in principle with everyone, as every tweet seems to be reasonable, but I think first we should make some assumptions upfront, starting from the definition of workload:

The amount of work performed by an entity in a given period of time, or the average amount of work handled by an entity at a particular instant of time. The amount of work handled by an entity gives an estimate of the efficiency and performance of that entity. In computer science, this term refers to computer systems’ ability to handle and process work.

In computer science, we indeed have several abstraction layers and that happens also with cloud computing (more insights in one of my previous post here). In such scenario, however, there are also many “entities” performing some work at those different layers. So which one is the entity whose workload we were talking about? Given what our companies do, I bet we were referring to cloud infrastructures handling and processing work that is generated by applications running on top of them.

Clarified the context, I shall explain why applications should instead care about the cloud without expecting any magic to happen down there.

The new era of IT infrastructures

We are witnessing a tremendous change in the core functioning of IT infrastructures. Up to the advent of cloud computing, the general approach taken by IT professionals was to manually provision a specific footprint made of servers, CPUs, memory, storage and network devices. That footprint was probably over-sized in order to accommodate predictable workload growth over time. Applications were designed to abstract from infrastructures, they were simply demanding more CPU cycles or IOPS whenever they wanted to, regardless of the actual availability. The result of this approach has been a tremendous increase in the total cost of ownership of IT departments. Hardware was required to be reliable, fast and able to accommodate peaks in workload without any performance loss. This all came at a price.

Two main drivers came to disrupt and trigger a drastic change. The first one is mobile computing, i.e. the demand of Internet services that suddenly became ubiquitous, leading the generation of unpredictable workload demand from anywhere in the world, at any time of the day. The second driver is the growing availability of a large quantity of data, user and machine generated Big Data, that require to be stored and analyzed.

To accommodate the above scenarios, IT infrastructures had to become completely software-driven, highly elastic and extremely scalable. With cloud infrastructures, today it is in fact possible to provision an infrastructure footprint using a few mouse clicks or a couple of functions within a few lines of code. The size of the infrastructure can be adapted to the required workload in a specific moment, no more need to over-provision, as resources can grow and shrink using few simple automatic operations.

And with the availability of software-consumable infrastructures, also applications are changing their approach, becoming much more infrastructure-aware. In fact, in case of resource shortage, applications can request for more by using API calls, growing the infrastructure footprint as required. At the same way, they’re now able to handle faults, making expensive highly available infrastructures completely worthless! (I blogged about this before here).

Think application!

All of this to explain that no, the cloud (infrastructure) does not have to understand the workload and does not have to automatically adapt to it. Even if that could be theoretically possible, the infrastructure lacks of the right metrics to recognise a real need for more resources. Instead, it is the application itself that actively adapts its own infrastructure, because only the application understands how the user experience is going, which the only metric that should be taken into consideration when measuring application performance.

Are you currently auto-scaling your infrastructure based on CPU utilisation? When it happens, are you sure that corresponds to a real improvement of the user experience? Or simply to a higher bill of your cloud provider?

And what if your VM goes down? Will you blame your cloud provider and then blog about its ridiculous SLA penalty fees, that never corresponds to your real loss of business? Isn’t it more effective to make sure that you understand the VM is down and that you (your application, I mean) take the necessary step to failover elsewhere?

With a clever application, infrastructure can be seen just as a toolbox. And you need to know how to use those tools in order to build highly available, auto-scaling applications. Don’t expect your screwdriver to build on your behalf the room for your newborn baby.

Or more simply, as @AmberCoster of AppDynamics said to me in another Twitter conversation: “Think application, not just infrastructure!”.

Titles are usually provocative and I won’t judge its veracity, however there is no such thing as the “Cloud Uptime” because, despite the cloud is considered as a whole, you can imagine that it is made of thousands of components and not all of them go down at once. Therefore, the outage within a cloud service tends to be bigger the more these components are interdependent. I’m going to explain this more in details.

Cloud Outages

The article says “Cloud Outages” are eventually inevitable because doing better than 99.99% availability would cost too much and companies like Netflix (which suffered its cloud provider outage right on Christmas Eve) would still continue using the cloud just because eventually “it does a great job of providing ready-to-use features”. In other words, it says that using the cloud requires a compromise that companies with multi-million businesses are ready to take: losing money from time to time in exchange of the flexibility of the cloud. My dear, I refuse to believe that.

First off, cloud providers do things differently and we can’t generalize. Let’s narrow down to AWS as this is the cloud provider the article mainly refers to. AWS is primarily an IaaS provider with some service components operating at the PaaS layer, such as the ELB (Elastic Load Balancer). In this context, there is no such thing as a “Cloud Outage” but there is the outage of a component of the cloud that your application relies on and that your application has not been instructed to handle in case of failure.

When working at the PaaS layer your freedom is limited. On one hand, you don’t have to worry about how things work underneath because the provider does everything for you but, on the other hand, you also have to rely on it when it comes to availability and SLA. Netflix relied on ELB and their application had no other way to handle its failure than waiting for AWS to fix the problem.

So how should Netflix prevent such things from now on? As others have also said, they should just build their own load balancing service by operating at the IaaS layer. In this case, they would have the freedom and the responsibility to set up multiple LBs in different availability zones or even different data centers, making their application more resilient in case of any infrastructure outage.

The responsibility of a PaaS provider

Later, the article goes through a list of PaaS provider duties in case of an outage. When I read it the second time I figured out that the term PaaS was misused as the author was instead referring to a generic provider offering any kind of services through the cloud.

However, this gives me the chance to say that a real PaaS provider should never ever suffer from any underlying infrastructure outage. The PaaS software should be the very best example of highly available resilient application, architected to exploit most of the isolation/redundancy mechanisms made available by the underlying IaaS. In the end, a PaaS provider employs mostly DevOps who master cloud automation tools and best-practices and who do know how to make an application resilient.

Moreover, a PaaS cloud is not about elasticity or scalability, as the article says, but those two come from the underlying IaaS: it’s the infrastructure that scales, it’s the infrastructure that grows and shrinks fast. Whereas PaaS is all about about automation: automated deployment, auto-scaling, automated failover and recovery on infrastructure failures.

What cloud uptime is about

In conclusion, more than 99.99% is actually possible and there are examples of that. Joyent is one that managed to deliver 99.9999% of uptime in the last 2 years. So how to build more reliable clouds? Simply by architecting an infrastructure with the least possible number of interdependent components. A cloud infrastructure made of distributed and replicated micro-components is capable of delivering scalability and reliability while limiting the impact of an outage, preserving the overall SLA.

Two things to keep in mind for the best uptime of your application in the cloud:

Choose an IaaS provider with an architecture designed to limit the impact of outages. If this sounds too theoretical, then think about EBS (AWS Elastic Block Store) which is a centralized macro-component highly dependent on the network.

Choose to have the freedom to build your own resilient app at the IaaS layer and, if you decide to go PaaS, pick a provider with an refund policy in case of outage that is significative enough for your business.

And in the end, Netflix will keep using the cloud because they learnt from this experience and they know that mastering cloud best-practices can save them from the next (indeed inevitable) infrastructure outage.