What Went Wrong At Delta?

In case you don’t remember, the U.S. carrier Delta Air Lines suffered a major meltdown when its headquarters experienced a power outage. The outage crippled the airline’s IT systems, and as a result it could not fly its planes. Planes were stuck on the ground, flights were cancelled, and Delta was even unable to get the word out to flyers that their flights had been cancelled. The failure originated in Delta’s IT department, and given the importance of information technology, at the end of the day that made it the responsibility of the person with the CIO job.

What Went Wrong?

Things started to go wrong for Delta at about 2:30 a.m., when there was an electrical problem at Delta’s Atlanta headquarters. A critical power control module at the airline’s technology command center malfunctioned, causing a surge to the transformer and a loss of power. The good news is that power was stabilized and restored quickly. However, after the malfunction some critical systems and network equipment didn’t switch over to backups, even though other systems did.

The systems that failed to switch over suffered from “instability” that affected the performance of a customer service system used to process check-ins, conduct boarding, and dispatch aircraft. You wouldn’t think that a brief power outage would have a big impact, but in this case you’d be wrong. Because airlines are so tightly scheduled, the resulting delays and cancellations had a major effect.

The reason is that once flights are delayed or cancelled, flight crews and planes are not where they should be. Additionally, those flight crews can only be on duty for a limited time before rest periods are required by law, with crews working in three- or four-day rotations. Multiplied across tens of thousands of pilots and flight attendants and thousands of scheduled flights, rebuilding rotations is a time-consuming process.
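To make the knock-on effect concrete, here is a minimal sketch of the kind of check a scheduler has to repeat for every crew after a delay. It is an illustration only: the 13-hour duty limit, the function names, and the data layout are all assumptions made up for this example, not Delta’s rules or the actual legal limits.

```python
from datetime import datetime, timedelta

# Hypothetical limit for illustration only; real limits are set by regulation
# and vary with the type of operation and the time of day.
MAX_DUTY = timedelta(hours=13)


def can_still_fly(duty_start, new_arrival, max_duty=MAX_DUTY):
    """Return True if the crew would finish the delayed flight within the duty limit."""
    return new_arrival - duty_start <= max_duty


def crews_needing_replacement(assignments, delay):
    """Return the IDs of crews that a delay pushes past the duty limit.

    assignments: list of (crew_id, duty_start, scheduled_arrival) tuples.
    delay: timedelta added to every scheduled arrival.
    """
    return [
        crew_id
        for crew_id, duty_start, scheduled_arrival in assignments
        if not can_still_fly(duty_start, scheduled_arrival + delay)
    ]
```

Every crew that ends up on that list needs a rested replacement who is in the right city, which is why a few hours of outage can take days to unwind.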

What Should Delta Do Now?

So what’s the big deal, you say? An outage can happen to anyone, and it appears as though this time Delta’s number was just up. That’s not good enough in this case. Delta cancelled around 1,000 flights on the day of the outage and about 775 flights the next day as the airline worked to re-establish normal operations. The company said it expected to cancel about 90 flights on the day after that, and to return to normal operations later. Clearly this outage had a big impact on Delta’s ability to deliver its product to its customers.

In Delta’s defense, the company has said that over the past three years it had invested hundreds of millions of dollars in technology infrastructure upgrades and systems, including backup systems meant to prevent exactly what happened. However, clearly it didn’t make all of the right investments. What Delta has discovered is that it did have a redundant backup power source in place. Unfortunately, some of its core systems did not switch over to the backup power source when they lost power. As a consequence, the entire system effectively crashed, and Delta had to reboot and start the operation up from scratch.

What this tells us is that the person in the Delta CIO position has decided to keep the company’s IT operations in-house and has not yet moved them to the cloud. Delta’s data center in Atlanta took a hit, and because of its importance to the company and the fact that there was no mirror site running somewhere else, the airline was effectively out of business. If the company had moved to the cloud and virtualized all of its applications, then an event at the Atlanta facility would have been a non-issue.

What All Of This Means For You

Delta Air Lines runs a complex operation. It schedules thousands of flights every day with a large collection of planes and crews. In order to keep all of these different parts working correctly, the company relies on an extensive set of applications. However, when a power outage hit the company’s Atlanta headquarters, all of these applications were taken offline and the company effectively had to shut down.

This of course brings up two questions: what happened, and why couldn’t the CIO prevent it from happening? The first question is easier to answer. The company did have systems in place that were supposed to switch key systems over to other sources of power automatically when primary power was lost. However, when the power failure occurred, only some systems switched over and others did not. This resulted in widespread outages. What the CIO should have done was move Delta’s key systems into the cloud, where they could be duplicated and fully backed up independent of a single physical location.

What Delta clearly had not been doing was running simulations of what would happen if it had a power outage. These simulations are challenging to perform, and they do carry some risk that you’ll cause production delays if things go wrong, but they are critical. Delta could have discovered its power switching problems earlier if it had been doing this. Hopefully going forward the airline will do a better job of testing its systems.
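As a rough illustration of what such a drill checks, the sketch below models an inventory of systems, cuts “primary power” in software, and reports which systems never made it onto backup power. Everything in it is hypothetical: the System class, its fields, and the sample inventory are assumptions invented for this example and say nothing about how Delta’s systems are actually built.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    has_backup_feed: bool       # is it wired to the backup power source?
    auto_transfer_works: bool   # does the automatic switch actually fire?
    on_primary: bool = True
    powered: bool = True


def cut_primary_power(systems):
    """Simulate losing primary power; return the names of systems that went dark."""
    dark = []
    for s in systems:
        if s.has_backup_feed and s.auto_transfer_works:
            s.on_primary = False    # switched over and still running
        else:
            s.powered = False       # never made it onto backup power
            dark.append(s.name)
    return dark


def run_drill(systems):
    """Run one simulated outage and summarize the result."""
    dark = cut_primary_power(systems)
    return {"passed": not dark, "failed_to_switch": dark}


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = [
        System("check-in", has_backup_feed=True, auto_transfer_works=True),
        System("boarding", has_backup_feed=True, auto_transfer_works=False),
        System("dispatch", has_backup_feed=False, auto_transfer_works=False),
    ]
    print(run_drill(inventory))
```

A real drill pulls actual power on actual equipment, but the goal is the same: get a list of what failed to switch before a real outage produces that list for you.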


What We’ll Be Talking About Next Time

Who works in your IT department? Is it a bunch of fired-up people who arrive at work each and every day ready to take on the world? Do they believe in the importance of information technology, and are they straining to find new and better ways to do just about everything? Or when you walk out of the room, does everyone kick back, fire up their laptops, and sign into Facebook to chat with their friends? As the person with the CIO job, you want everyone in the IT department to be spending their time working to move the company forward. You can’t afford to have any dead wood in the IT department.