Choosing Resilience Over Reliability

In our Southwest’s Summer Screwup article, we wrote about how a router failure at Southwest Airlines resulted in at least $10 million in lost revenue and millions more in associated costs throughout the travel industry. While doing research for that article, we came across a presentation by Richard Cook, MD, who studied complex systems in hospital operating rooms, a place where an equipment failure could be fatal.

In his presentation, Dr. Cook raised this essential question: Are service delivery interruptions abnormal events that we should strive to completely eliminate, or should we “expect the unexpected” and continually adjust and adapt? To help answer this, he pointed out that in complex systems, which most certainly include IT networks, there is a significant discrepancy between “What is Imagined” and “What is Found”.

Imagined Systems (Ideal) vs. Found Systems (Reality)

Imagined Systems are what is drawn up in network diagrams — a vision of an ideal working system, codified in neatly connected lines between network equipment. The assumption is that if things work the way they are supposed to, there will be only minor, if any, incidents. Incidents indicate an imperfect design or flawed implementation – correcting the design or implementation error is the way to prevent future incidents.

Found Systems are dynamic, adaptive, and change constantly. There is a strong reliance on perpetual monitoring and proactive adaptation designed to address issues both anticipated and not. Performance and analytical reviews are done during operations with no lag time — maintenance is a constantly running process, not a scheduled event. The goal here is not to perfect the system (since that could never be achieved), but instead to be continuously ready for, and adapting to, unexpected events.

Stuff Happens: Build for Resilience More Than Reliability

According to Dr. Cook, the essential quandary we face is that while we design systems for high reliability based on our best knowledge at that time, we cannot know what will happen in the “real world”, where things change constantly and unexpected events occur. We cannot know what errors the combination of components that make up the system will produce, what operator mistakes will occur, or how a change in one area will affect another part.

We have a choice: Do we strive to “perfect” the system and rely on backups and failovers to protect against the unknown? Or do we accept that the system is under constant pressure and therefore constantly monitor and adapt to changes in real-time?

Imagined Systems are designed for reliability, with backups and failover systems in place; the problem is that redundant systems are usually inadequately tested and not fully maintained — when a problem happens, they may not withstand the stress. Southwest Airlines’ reliance on a redundant infrastructure is a perfect example.

Found Systems are designed for resilience: the ability to recover quickly from failures, to recognize and respond to abnormal situations, and to adapt quickly to change. This is made possible by aligning both design and operations with a modernized ITSM approach: being proactive and crafting best practices that are centered on constant improvement, adaptation, and maintenance. In other words, don’t wait for an incident to happen. Instead, assume incidents will happen at any time and focus on proactively adapting processes to handle change. Focus on efficient incident management powered by Continual Service Improvement.
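To make "recognize and respond to abnormal situations" a little more concrete, here is a minimal sketch of a monitor that compares each new reading against a rolling baseline instead of a fixed design-time limit. The class name `BaselineMonitor`, its parameters, and the sample values are all hypothetical and chosen for illustration only; this is not a description of any particular monitoring product.

```python
from collections import deque
from statistics import mean, stdev


class BaselineMonitor:
    """Hypothetical helper: flags readings that stray far from recent history."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # rolling window of recent readings
        self.threshold = threshold           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # readings needed before judging anything

    def observe(self, value):
        """Return True if value looks abnormal relative to the rolling baseline."""
        abnormal = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            # Guard against a perfectly flat baseline (sigma == 0).
            abnormal = abs(value - mu) > self.threshold * max(sigma, 1e-9)
        # Every reading is folded into the baseline, so the baseline keeps adapting.
        self.samples.append(value)
        return abnormal


monitor = BaselineMonitor(window=10)
readings = [100, 101, 99, 100, 102, 98, 500]  # made-up response times in ms
flags = [monitor.observe(r) for r in readings]
```

Because the baseline is recomputed from recent readings, the notion of "normal" shifts as the system changes, which is the point of the Found Systems view: the check adapts along with the system rather than encoding what was imagined at design time.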

The first approach waits for incidents to occur, then applies reactive, fire-fighting techniques to resolve the issue. This is followed by a root cause analysis that seeks to determine what caused the incidents so that a fix can be applied.

The second approach functions in a state of constant monitoring and proactive adjustments to ever-changing situations. It looks at root cause analysis as only mildly useful because the combination of circumstances that produced the incidents was likely transient; fixing it addresses something that occurred in the past but does not prevent incidents caused by a different combination. And since complex systems are always interacting in multiple ways, there is no way to prepare for every possible arrangement of that system during its operation.

What is Crow Canyon’s Approach?

As this discussion makes clear, we at Crow Canyon favor the resilient approach. As appealing as the ideal, imagined system concept is, IT service delivery takes place in the real world, where unexpected and unanticipated incidents occur all too often.

As solution developers, we at Crow Canyon seek to provide software that enables a proactive, resilient approach. Our SharePoint-integrated IT Help Desk System is closely aligned with ITIL Incident Management best practices. It is designed to help your teams identify issues and restore normal operations as quickly as possible. It provides your teams with the communications tools they need to effectively engage with your employees or customers for precise and timely recovery from incidents.

All of our solutions include robust reporting & analytics capabilities, a key best practice requirement in ITIL’s Problem Management processes. Our reporting & analytics functionality tracks and records all data points as they pass through configurable workflows. Comprehensive filters enable you to view data that is relevant to your specific business needs, with dashboard results conveyed visually via charts and graphs. This approach provides invaluable feedback when implementing ITIL-based Continual Service Improvement initiatives.

In addition, our Asset Management System provides a comprehensive view of exactly what is going on with your assets, a feature that is critical when applying ITIL Change Management best practices to your organization. If an incident affects your organization’s service capabilities, then attention may shift to making relevant changes to your infrastructure. In order to do this effectively, you need immediate access to specific real-time asset information (e.g., who is currently using an asset, what is its usage history, when was it last updated or serviced, and what is its license or serial number).
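As a rough illustration of the kind of asset information described above, the sketch below models a single asset record that can answer those questions. The `Asset` class and every field and method name in it are hypothetical and are not taken from Crow Canyon's Asset Management System; it only shows one simple way such data could be organized.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Asset:
    """Hypothetical asset record; field names are illustrative assumptions."""
    serial_number: str
    current_user: Optional[str] = None
    last_serviced: Optional[date] = None
    usage_history: List[str] = field(default_factory=list)  # previous users, oldest first

    def assign(self, user: str) -> None:
        """Hand the asset to a new user, keeping the previous user in the history."""
        if self.current_user is not None:
            self.usage_history.append(self.current_user)
        self.current_user = user

    def days_since_service(self, today: date) -> Optional[int]:
        """Days elapsed since the last service, or None if never serviced."""
        if self.last_serviced is None:
            return None
        return (today - self.last_serviced).days


laptop = Asset(serial_number="SN-001", last_serviced=date(2024, 1, 1))
laptop.assign("alice")
laptop.assign("bob")
age = laptop.days_since_service(date(2024, 1, 31))
```

With a record like this, the questions raised during a change (who has the asset now, who had it before, how long since it was serviced) become direct lookups rather than investigations.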

Incident Management, Problem Management, and Change Management are all key components of our ITSM design philosophy and our approach, which is grounded in ITIL best practices. If you are a small to medium-sized business and think ITSM/ITIL is overkill, see our ITIL-Lite WhitePaper, which addresses this concern.

Want to learn more about how our solutions can transform SharePoint and Office 365 into real-world business application platforms? Give us a call at 1-925-478-3110 or contact us by e-mail at sales@crowcanyon.com.