
Enterprise Cloud Adoption – two areas for consideration

There are many postings on the risks and challenges for Cloud migration

One recent post, by Glen Robinson (@GlenPRobinson), gave an excellent summary of the key risks and challenges of a broader migration to the Cloud. In his paper he cites Security, Resilience, Reputation and Regulatory as key, followed by Financial, Licensing and Talent from a Commercial perspective.

I have drawn out two of these in more detail.

Firstly, Security: clearly a huge topic in its own right, and often the first and most significant reason an organisation does not deploy into the Public Cloud. To help resolve this, the concept of a Shared Responsibility Model was created. Whilst AWS uses this term directly, it is not in any way exclusive to AWS. In a bimodal context (see previous blogs for a broader view of this…), there are two distinct areas of responsibility:

The Public Cloud Provider (e.g. AWS) is responsible for the security of the cloud

The customer is responsible for the security and compliance in the cloud

Or in other words, there should be no grey areas between the two

Using the AWS Shared Responsibility Model as an excellent reference point, “of the cloud” means that the Cloud provider secures the following areas, which are of course linked directly to the Services it provides. For AWS, this can be broken into two core elements:

Compute, Storage, Network, Database

Regions, Availability Zones, Edge locations

What this means is that a range of Services is offered for each of these components, each with a very full suite of security measures and controls. The other key point is that these are Provider services, so you only get what is provided.

The second dimension, “in the Cloud”, means securing the areas which by their very nature and context are owned by the customer, and which it is therefore wholly the responsibility of the customer to secure.

The most important take-away here is that, whichever Cloud Provider an Organisation contracts with, it is critical to understand the Shared Responsibility Model and thereafter ensure that each element within the system has had the right treatment applied to it.

The second area I will reference is “Resilience”. This has been highlighted further by some recent, very high profile outages from both AWS and Azure. It could be summed up as “removing single points of failure”.

Firstly, and logically, when designing an application it is imperative that the requirements for uptime are understood and endorsed from a business perspective. Defining a requirement for “high availability” means the application can withstand the failure of individual or multiple components. The two most common terms and measures used in the Industry are Recovery Time Objective (RTO) and Recovery Point Objective (RPO): the first is the maximum acceptable time to restore a process, and the second is the maximum acceptable period of data loss. Getting an agreed, business-focused target for these is imperative to allow the solution to be designed correctly. It is equally important to ensure the Business doesn’t simply say “it needs to be 100% fault tolerant” without the right context, which is the business process coupled with the financial impact.
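To make the uptime conversation more concrete, here is a minimal Python sketch that turns an availability percentage into a yearly downtime allowance and checks a recovery design against RTO and RPO targets. The names (downtime_budget_minutes, RecoveryTargets, meets_targets) and any figures fed into them are assumptions for illustration only, and the check assumes worst-case data loss equals the interval between backups.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


@dataclass
class RecoveryTargets:
    rto_minutes: float  # longest acceptable time to restore the process
    rpo_minutes: float  # longest acceptable window of lost data


def meets_targets(targets: RecoveryTargets,
                  restore_minutes: float,
                  backup_interval_minutes: float) -> bool:
    """Check a design against the agreed targets.

    Assumption: the worst-case data loss is the full gap between two
    backups (or replication points), so that gap is compared to the RPO.
    """
    return (restore_minutes <= targets.rto_minutes
            and backup_interval_minutes <= targets.rpo_minutes)
```

As an example of why “100% fault tolerant” needs context: a 99.9% target still allows roughly 526 minutes (under nine hours) of downtime a year, and each extra nine cuts that allowance by a factor of ten, usually at a noticeably higher cost.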

Using AWS as an example, each of the key building blocks has functionality that provides levels of resiliency and redundancy. This aligns to the Shared Responsibility Model above, and therefore understanding, for example, the deployment model for an Amazon Virtual Private Cloud (VPC) and Elastic Load Balancing is key. Of course, comparing a quoted level of availability with reality can be interesting… (note that the 99.999999999% figure quoted for S3 is its designed durability, i.e. the likelihood of not losing data, rather than its uptime)…
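One reason quoted and experienced availability can differ is that an application depends on several building blocks at once. The Python sketch below shows the standard arithmetic under the simplifying assumption that components fail independently (which real outages often violate). The function names are hypothetical and any figures passed in are illustrative, not published AWS numbers.

```python
from math import prod
from typing import Iterable


def series_availability(parts: Iterable[float]) -> float:
    """Availability of a chain where every component is required.

    Each value is a fraction between 0 and 1. With independent failures
    the availabilities multiply, so the chain is weaker than any one link.
    """
    return prod(parts)


def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` identical instances suffices.

    The service is down only if every copy is down at the same time,
    again assuming the copies fail independently of each other.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1 - (1 - single) ** copies
```

Chaining three components that are each 99.9% available gives roughly 99.7% overall, whereas running two independent copies of a 99% component lifts that tier to 99.99%. This is the arithmetic behind removing single points of failure.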

Therefore, there are a number of mechanisms for removing points of failure. Note these are not unique to a public cloud provider and have been the bread and butter of on-premises architectures for a long time (forever in fact…)

The following are the key areas to focus on:

Introduce redundancy – there are two major types, standby and active. Standby is where a process is performed on failover, whereas active automatically distributes the workload. Typically standby is significantly easier to design and cheaper to deploy, so there is always a cost/benefit trade-off to be made

Failure detection – Automation is very much the hot topic here, as it allows not only the detection but also the reaction activities to take place. This is a recognition that failures will happen, and therefore the more you know about them and can trend (even predict) them, the better prepared you will be. An interesting, more extreme perspective on this is the Netflix model, which is not only to detect failures but to create them as well, to ensure applications are deployed with the right level of resiliency (Chaos Monkey). I love the Netflix model as it introduces a culture that accepts infrastructure failure rather than one that is surprised by it

Data Storage – at the core of every application will be the data, and therefore techniques such as data replication, which automatically introduce redundant copies of the data, create fewer points of failure. Typically there are two types of replication: synchronous and asynchronous. The key difference between the two is whether the application has to wait for the data to be written to all places (synchronous) or continues without waiting (asynchronous). Clearly there are significant factors in this choice because of potential latency issues. As always, when there is seemingly only a choice of two, a third option has been developed: “quorum based” replication, a hybrid of the first two, which is especially useful for large scale distributed databases. Of course this is no substitute for actual data backup, which should be part of the overall disaster recovery plan

Multi-Data Centre resilience – traditionally the hardest decision is when to perform a failover, especially when there is a short disruption and the length of the disruption is not known. In AWS, because there are separate Regions, each with separate Availability Zones, data can be replicated across data centres synchronously, so failover can be automated and transparent to end users. There is clearly quite a significant cost to this, so it comes down to the very first point of understanding the business imperative for resilience

Hybrid – a much discussed subject, but a clear option for removing single points of failure. Hybrid in this context means that applications are not necessarily deployed in both the public and private infrastructures, but there is an option to deploy across them in the event of failure. Clearly this would bring in a lot of additional parameters, some very significant, such as ensuring that the workloads can actually be operated in both environments and that the environments are kept in sync continuously. This would appear on the face of it to be a very costly option, as all the economies to be gained from moving to a public cloud provider could be eradicated. But it is definitely an option to be considered.
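Returning to the Data Storage point above, the quorum-based approach can be illustrated with a small in-memory Python sketch. It assumes N replicas, a write that must be acknowledged by at least W of them, and a read that consults R of them and keeps the newest version. When R + W > N, every read set overlaps every write set, so a read always sees the latest successful write. The class QuorumStore and its methods are hypothetical and do not represent the API of any particular database.

```python
from typing import Any, List, Optional, Tuple


class QuorumStore:
    """Toy single-key store replicated across n in-memory replicas."""

    def __init__(self, n: int, w: int, r: int) -> None:
        if not (1 <= w <= n and 1 <= r <= n):
            raise ValueError("w and r must each be between 1 and n")
        self.n, self.w, self.r = n, w, r
        # Each replica holds a (version, value) pair; version 0 means empty.
        self.replicas: List[Tuple[int, Optional[Any]]] = [(0, None)] * n
        self.version = 0

    def is_strongly_consistent(self) -> bool:
        """True when read and write quorums are guaranteed to overlap."""
        return self.r + self.w > self.n

    def write(self, value: Any, reachable: List[int]) -> bool:
        """Write to the reachable replicas; succeed only with >= w of them."""
        if len(reachable) < self.w:
            return False  # not enough acknowledgements, nothing is written
        self.version += 1
        for index in reachable:
            self.replicas[index] = (self.version, value)
        return True

    def read(self, reachable: List[int]) -> Optional[Any]:
        """Read from r reachable replicas and return the newest value seen."""
        if len(reachable) < self.r:
            raise RuntimeError("read quorum not available")
        consulted = [self.replicas[i] for i in reachable[: self.r]]
        return max(consulted, key=lambda pair: pair[0])[1]
```

With three replicas and W = R = 2, a write acknowledged by replicas 0 and 1 is still visible to a read served by replicas 1 and 2, because the two sets share replica 1. With W = R = 1 the application waits for less, but a read that lands on a replica the write never reached returns stale data, which is the latency versus consistency trade-off described above.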