If you start mapping out the major components of an application stack, you’ll probably arrive at this list (bottom-to-top):

- Network links and devices;
- Network services;
- Servers and storage;
- Virtualization platforms and operating systems;
- Databases, message queues…;
- Applications.

Each of these components can fail due to hardware failure, software error, or operator error.
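A stack in which every layer must work to deliver the application is a serial system, so its end-to-end availability is the product of the per-component availabilities. Here's a minimal sketch of that math; the availability figures are made-up illustrative numbers, not measurements:

```python
# Illustrative only: the per-component availability figures below are
# invented for the example, not real measurements.
stack = {
    "network links and devices": 0.999,
    "network services": 0.9995,
    "servers and storage": 0.9999,
    "virtualization/OS": 0.9995,
    "database/message queue": 0.999,
    "application": 0.995,  # often the weakest link
}

# Serial system: every component must be up, so availabilities multiply.
availability = 1.0
for component, a in stack.items():
    availability *= a

print(f"end-to-end availability: {availability:.4f}")
```

Note that the result is always worse than the weakest single component, which is why it pays to find that component first.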

Next, identify the likelihood of individual failures. Hardware failures (apart from link failures) are less common than software failures or operator errors these days, and in most cases infrastructure failures tend to be less common than application problems.

Which parts of the whole stack are currently resilient to failures and which ones represent a single point of failure?

Which parts could be made more resilient?

How will your organization handle the remaining SPOFs?

What is the downtime caused by a failure of a non-redundant component?

How often can you expect to see those failures?
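The last two questions combine into a simple back-of-the-envelope figure: expected yearly downtime is failure frequency times repair time. A quick sketch with hypothetical numbers (echoing the monthly-restart-versus-ten-year-disaster comparison below) shows why the fragile application can matter more than the rare disaster:

```python
# Hypothetical numbers for illustration only.
def yearly_downtime_hours(failures_per_year: float, mttr_hours: float) -> float:
    """Expected downtime per year = failure frequency * mean time to repair."""
    return failures_per_year * mttr_hours

# Non-redundant application that needs a restart every month, ~15 minutes each:
app = yearly_downtime_hours(failures_per_year=12, mttr_hours=0.25)

# Full data-center disaster once every 10 years, 24 hours to recover:
dc = yearly_downtime_hours(failures_per_year=0.1, mttr_hours=24)

print(f"fragile application: {app:.1f} h/year")  # 3.0 h/year
print(f"DC disaster:         {dc:.1f} h/year")   # 2.4 h/year
```

With these (invented) numbers the monthly restarts cost more downtime per year than the once-a-decade disaster the L2 DCI is supposed to protect against.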

Getting answers to those questions (good luck ;) might make it easier to persuade the CIO that your company doesn't need an L2 DCI for disaster recovery (which might happen every 10 years) when the non-redundant applications need a restart every month or remain unpatched for years because nobody wants to touch them… and if everything else fails, you can still quote Gartner.

By Ivan Pepelnjak

2 comments:

Identifying the weakest link is like playing a game of whack-a-mole: just when you think you have all your ducks in a row, another appliance, piece of software, or other dependency is unknowingly introduced, creating a possible SPOF. That's why I find it important to regularly review these questions with the right group of people (management, network, server, and storage teams, etc.). I also agree that there are far cheaper options than jumping on the L2 DCI bandwagon. Many applications these days have their own replication/syncing technology built in: for example, in the Microsoft world, SQL Server mirroring has been around since the 2005 release, and Exchange 2007 introduced its own mailbox replication; all of these have only improved. If possible, I would try to virtualize as many workloads as possible, so that even legacy workloads can be replicated at the VM level using software like Veeam or another form of replication. I understand there will always be some super-legacy equipment that requires a forklift upgrade; I can think of a legacy IVR system using Dialogic cards (can't virtualize) tied to a bunch of analog telco lines…

I think another measure should be added: the effect of the removal of the weakest link on the overall network. Or, at a further (more complicated) level, the effect of its removal on other services (i.e., the dependencies on that system or service).

The author

Ivan Pepelnjak (CCIE#1354 Emeritus), Independent Network Architect at ipSpace.net, has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced internetworking technologies since 1990.