Designing for Resilience…It’s Not About the Platform

Forget, for a moment, about losing access to critical business applications like Salesforce, Office 365, GoToMeeting, ADP, or GitHub. Can you imagine a world (or even 30 minutes) without access to your favorite social media sites like Twitter, Facebook, and Snapchat? Some of us could use the break, but that's a subject for another day. The recent outage of a portion of the Amazon AWS hyper-scale cloud environment has once again thrust application availability to the forefront of public discussion.

What IT professionals need to remember is that it doesn't matter what platform your application runs on: you, as the steward of the application, are responsible for designing an architecture that ensures the viability and resilience of your applications and data.

1. Provider Redundancy

While all of the major public, hyper-scale cloud platforms have done a great job ensuring redundancy in their cloud platforms, there is always the chance that a software bug, human error or cyber-attack could dramatically affect the performance and/or availability of a large portion of the hyper-scale cloud provider’s environment.

Developing a redundant, provider-agnostic architecture also provides an organization with the flexibility to move any given application to another cloud provider to meet future business and/or IT objectives.

A number of providers have sprung up over the last 12 to 18 months to address the need for platform flexibility. Some of these solutions, such as the ZERODOWN software developed by ZeroNines, allow application workloads to be shifted quickly from one hyper-scale cloud provider to another without loss of functionality or data, making an outage or severe performance issue at any one provider a non-event.

2. Geographic Redundancy

Whether or not an organization decides to develop a multi-provider architecture, it is important to have geographic redundancy in where the data and applications are hosted. Geographic redundancy guards against regional risks such as:

Political turmoil (Ukraine, Middle East, North Korea, etc.)

Large natural disasters (think Hurricane Katrina in 2005)

Large power outages (think the large power outage that affected the entire Northeast part of the United States in 2003 or the rolling blackouts that have occurred in Southern California)

Large-scale telecommunication outages

To adequately guard against geographic uncertainties, a geographically redundant architectural design should provide a minimum of 250 miles of separation between data centers. This amount of separation will typically guard against natural disasters, power-grid failures, and network failures.
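The 250-mile rule is easy to verify with a great-circle distance calculation. Below is a minimal sketch using the haversine formula; the function name and the (approximate) sample coordinates for two hypothetical data center sites are my own illustration, not part of any provider's tooling.

```python
import math

def separation_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two sites, in statute miles (haversine)."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

MIN_SEPARATION = 250  # miles, per the design guideline above

# Approximate coordinates: Ashburn, VA vs. Columbus, OH
d = separation_miles(39.04, -77.49, 39.96, -83.00)
print(f"{d:.0f} miles apart - {'OK' if d >= MIN_SEPARATION else 'too close'}")
```

A simple check like this can be wired into an architecture review checklist so that site pairs falling inside the minimum separation are flagged automatically.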

Of course, distance alone will not guard against political risks. Depending on the stability of the country hosting the applications and data, even greater separation may be required to protect against political unrest.

Along with native solutions provided by the hyper-scale cloud providers, organizations can look at solutions such as Zerto’s Virtual Replication in order to replicate applications and data from one geographic region to another.

3. Data Center Redundancy

Data center redundancy takes two forms…

The first is hosting your applications and data within multiple data centers, even if they are both within the same general geographic area (less than 250 miles apart). This physical data center redundancy guards against localized weather events, terrorist attacks, local power grid and network failures, building failures (fire, flood, etc.), and the proverbial "backhoe through the cables."

In addition to physical data center redundancy, organizations must ensure that there is redundancy built into the individual data centers themselves. This is important because the most successful disaster plan is the one that never needs to be executed in a real-life situation. By making your data center highly redundant within its own walls, you dramatically decrease the chances that you will actually have to declare a disaster and move workload requests to another location. Intra-data center redundancy means eliminating single points of failure within the data center for items such as power delivery, distribution, and backup power; internal and external network connectivity; fire suppression systems; and cooling systems.

Along with native solutions provided by the hyper-scale cloud providers, organizations can look at solutions like Zerto's Virtual Replication that easily replicate applications and data within a data center, as well as between data centers. If a virtualization platform such as VMware's ESXi or Microsoft's Hyper-V is used, organizations can take advantage of platform-specific toolsets to seamlessly migrate workloads within or between data centers.

4. Application Redundancy

To further protect applications from a failure, it is highly recommended that an organization look at ways to make the applications themselves highly redundant. An application can fail for a number of reasons, including hardware failure, operating system failure, application bugs, cyber-attacks, and unplanned spikes in utilization.

An application architecture that leverages microservices and horizontal scaling, along with a well-defined software development and deployment methodology, will help guard against failures at the application level.

Tools such as application and web load balancers allow you to route work requests to the available resources. In the case of an individual application server failing (whether due to a hardware, software, or network failure), the workloads can be seamlessly routed to the remaining application servers. The load balancer architecture will also protect the performance and availability of the application in the case of unplanned spikes in utilization.
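The core of that routing behavior is simple: periodically health-check each backend and keep only the healthy ones in rotation. Here is a minimal sketch of that idea; the class, method names, and server identifiers are hypothetical, not a real load balancer's API.

```python
import random

class LoadBalancer:
    """Minimal sketch: route requests only to backends that pass a health check."""

    def __init__(self, servers):
        self.servers = list(servers)   # all known backend identifiers
        self.healthy = set(servers)    # updated by periodic health checks

    def health_check(self, probe):
        """Re-probe every backend; failures are dropped from the rotation."""
        self.healthy = {s for s in self.servers if probe(s)}

    def route(self):
        """Pick a healthy backend for the next request."""
        if not self.healthy:
            raise RuntimeError("no healthy backends remain")
        return random.choice(sorted(self.healthy))

lb = LoadBalancer(["app-01", "app-02", "app-03"])
lb.health_check(lambda s: s != "app-02")  # simulate app-02 failing its probe
print(lb.route())  # requests now go only to app-01 or app-03
```

Real load balancers layer on weighting, session affinity, and connection draining, but the failover principle is the same: a failed health check removes a server from rotation before users notice.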

5. Data Redundancy

Without its data, an application is of no use. For this reason, organizations should look for ways to replicate their data. This replication protects not only against hardware and software failures but also against human error.

To protect against hardware and software errors, organizations should look to data clustering solutions such as the SIOS DataKeeper technology to replicate their data from one storage device to another.
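Conceptually, this kind of replication writes every update to both a primary and a secondary copy so the secondary can take over if the primary storage fails. Below is a simplified, in-memory sketch of that idea; the class and keys are my own illustration and do not reflect any vendor's implementation.

```python
class MirroredStore:
    """Sketch of synchronous replication: every write lands on both volumes,
    so the secondary can serve reads if the primary storage device is lost."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}

    def write(self, key, value):
        # Commit to the replica first; the write is complete only
        # once both copies exist.
        self.secondary[key] = value
        self.primary[key] = value

    def read(self, key):
        try:
            return self.primary[key]
        except KeyError:
            return self.secondary.get(key)  # fail over to the replica

store = MirroredStore()
store.write("orders/1001", {"status": "shipped"})
store.primary.clear()                # simulate losing the primary volume
print(store.read("orders/1001"))     # the data survives on the secondary
```

Note that because replication is continuous, a corrupt or mistaken write is faithfully copied to the replica as well, which is exactly why backups (below) are still needed.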

To protect against intentional and unintentional human error, organizations should make backups of their data on a regular basis and keep old copies around for an agreed-upon and documented period of time. Unlike clustering, where the data is copied continuously from the primary location to the secondary location, backups are taken at scheduled times, making it easy to "roll back" an inadvertent update to the data.
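The "agreed-upon and documented period of time" is usually expressed as a retention window. A minimal sketch of that policy might look like the following; the 30-day window, function name, and dates are illustrative assumptions, not a recommendation.

```python
from datetime import date, timedelta

def backups_to_keep(backup_dates, today, retention_days=30):
    """Sketch of a documented retention policy: keep any backup taken within
    the agreed retention window; anything older is eligible for deletion."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in backup_dates if d >= cutoff)

today = date(2017, 3, 15)
taken = [today - timedelta(days=n) for n in (1, 7, 30, 45, 90)]
keep = backups_to_keep(taken, today)
print(keep)  # the 1-, 7-, and 30-day-old copies stay; older ones may be purged
```

Real-world schedules are often tiered (daily backups kept for a month, monthly backups kept for a year), but the principle is the same: the retention period must be documented and agreed upon before data is ever deleted.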

Conclusion

Whether an organization hosts its applications and data on a public, hyper-scale cloud platform (Amazon AWS, Google Cloud Platform, Microsoft Azure, etc.) or in a private environment (physical servers, Microsoft Hyper-V, VMware ESXi, etc.), it is the responsibility of the IT department to ensure the data center, infrastructure, applications, and data have all been designed to meet the business's specific resiliency requirements for each application and its data. Leveraging a cross-platform consulting and managed services company, such as HOSTING, to design, build, monitor, and manage these environments can dramatically increase the resiliency of an organization's applications and data.

About the Author

Michael McCracken is currently the Vice President of Advanced Solutions at HOSTING. With over 25 years of IT industry experience, Michael has extensive knowledge of infrastructure and application transformation, including solutions in the areas of security & privacy, data center design, business continuity & resiliency, information lifecycle management, storage & server consolidation/virtualization, infrastructure high availability, LAN/WAN/wireless networks, and mobile computing.