Have We Learned Anything from Famous Downtime Fiascos?

And how technology is aiming to prevent the next one.

Every company knows that planning for a disaster is an integral part of a business continuity strategy.

After all, any business that relies heavily on IT systems appreciates that even minor downtime can quickly add up to millions of dollars in lost production.

How common is downtime? According to a survey by Zetta, 54% of IT professionals have experienced an outage lasting eight or more hours. The most common reasons for downtime among those surveyed were power outages and hardware failure, circumstances seemingly out of the company's control.

But are they?

It's worth looking at the effects of a few recent downtime fiascos:

Silicon Valley Sent Back to the Dark Ages

On April 21, 2017, a massive power outage paralyzed San Francisco for much of the day. The outage, not the first this year, was particularly exasperating to a city in the heart of Silicon Valley, arguably one of the country's most important technology hubs.

Just like that, the city was sent back in time: elevators stopped moving, traffic lights went out, and businesses were forced to process credit card payments with old-fashioned paper imprints.

The outage was initially attributed to a fire at a substation. The real culprit, as an investigation revealed, was not the fire itself.

PG&E spokesperson Barry Anderson said the outage started at their Larkin Substation. "We had an equipment failure, a catastrophic failure of one of our circuit breakers. When the circuit breaker failed, it created a fire around the breaker." Within a few hours, that little circuit breaker was responsible for hundreds of millions in lost revenue for the city's businesses.

This wasn't the first time the city had suffered a major power outage as a result of old equipment, either. An aging infrastructure in the city points to more power outages in the future.

Network Outage Grounds Delta Flights

In August 2016, Delta Air Lines suffered a major outage that led to the cancellation of more than 1,800 flights and huge losses in revenue.

The problems started early on a Monday morning when a power control module at the company's command center malfunctioned, leading to a power surge to the transformer and a complete loss of power.

While crews worked to restore power and stabilize the system quickly, the situation was made worse by the fact that several systems and pieces of equipment did not switch over to backup power during the outage.

With those systems down, Delta was unable to process check-ins, conduct normal boarding operations, or dispatch aircraft. The delays and cancellations, combined with regulations governing pilot and crew rest periods, set off a domino effect: more than 1,800 flights were cancelled over more than three days.

What Have We Learned?

Despite the headlines these incidents (and many others) generated, it's clear a lot of companies haven't been paying attention.

According to Gartner, the average large corporation still experiences 87 hours of downtime per year, at a cost of more than $300,000 per hour. That's a shame, because event forensics tend to reveal that the faults that lead to downtime are, in fact, both predictable and preventable.
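To put those two figures together, here is a minimal sketch of the arithmetic. It assumes the hourly cost applies evenly to every hour of downtime, which the article does not state, and the function name annual_downtime_cost is purely illustrative.

```python
# Figures quoted in the article and attributed to Gartner.
DOWNTIME_HOURS_PER_YEAR = 87
COST_PER_HOUR_USD = 300_000  # the article says "more than", so treat this as a floor


def annual_downtime_cost(hours: float, cost_per_hour: float) -> float:
    """Return the yearly cost of downtime as hours multiplied by hourly cost.

    Assumption: every hour of downtime costs the same amount.
    """
    return hours * cost_per_hour


if __name__ == "__main__":
    total = annual_downtime_cost(DOWNTIME_HOURS_PER_YEAR, COST_PER_HOUR_USD)
    print(f"Estimated annual downtime cost: at least ${total:,.0f}")
```

Under that assumption, 87 hours at $300,000 per hour comes to at least $26.1 million a year for a single large company.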

The Delta Air Lines fiasco could have been prevented with proper planning and predictive analysis. If the part that malfunctioned and caused the surge had been flagged as likely to fail and replaced, the outage could have been avoided entirely.

Technology and Potential Treatments

To be fair, that sort of prognostic ability wasn't available just a few years ago; a best guess usually had to suffice.

A number of companies are already working on this kind of predictive capability. New York-based Aquant.io is using AI and machine learning to forecast service needs and improve service, ensuring that maintenance staff have the parts and tools they require before the parts fail. The system is also able to identify potential failures.

Exacter Technology, which focuses on electric utilities, employs a process that recognizes electrical signatures emitted from overhead lines that suggest cracked and contaminated insulators, the root cause of most electrical pole fires. Immediate repairs can be made, and the data can be merged with existing Geographic Information Systems (GIS) and outage data to help predict recurring failures caused by degraded equipment.
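As a rough illustration of what merging detection data with GIS and outage records could look like, here is a small sketch. It is not Exacter's method; the asset IDs, the data shapes, and the function rank_at_risk_assets are all hypothetical, and it simply ranks flagged equipment by how many past outages it has been involved in.

```python
from collections import Counter
from typing import Dict, List, Tuple

Location = Tuple[float, float]  # (latitude, longitude), as a GIS layer might store it


def rank_at_risk_assets(
    flagged_assets: List[str],
    outage_log: List[str],
    locations: Dict[str, Location],
) -> List[Tuple[str, int, Location]]:
    """Return flagged assets that also appear in past outages, worst first.

    flagged_assets: hypothetical IDs of poles with a suspicious signature.
    outage_log: one asset ID per recorded outage (IDs repeat for repeat failures).
    locations: hypothetical GIS lookup from asset ID to coordinates.
    """
    outage_counts = Counter(outage_log)
    ranked = []
    for asset_id in sorted(set(flagged_assets)):
        past_outages = outage_counts[asset_id]
        # Keep only assets that have failed before and can be located on the map.
        if past_outages > 0 and asset_id in locations:
            ranked.append((asset_id, past_outages, locations[asset_id]))
    # Most past outages first; ties stay in asset ID order.
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


if __name__ == "__main__":
    flagged = ["pole-7", "pole-2", "pole-9"]
    outages = ["pole-2", "pole-7", "pole-2", "pole-4"]
    gis = {
        "pole-2": (37.78, -122.42),
        "pole-7": (37.79, -122.41),
        "pole-9": (37.77, -122.43),
    }
    for asset_id, count, location in rank_at_risk_assets(flagged, outages, gis):
        print(asset_id, count, location)
```

A real system would weigh far more than outage counts, but the idea is the same: equipment that is both flagged today and has a history of failures goes to the top of the repair list.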

If we're to prevent these sorts of costly downtime incidents in the future, being proactive and replacing parts before they fail seems like the best strategy. As all industries become increasingly IT-based and dependent on IoT technology, that sort of prescience doesn't just seem like a good idea: it will be vital.

If we've learned anything from this recent spate of downtime fiascos, it's that when an industry shuts down, the ripples of consequence extend very far.