Testing Disaster Recovery

How good are your Disaster Recovery plans (DR plans)? Of 10 companies that don’t have DR plans, only 1 will be in business for more than 10 years. Is it because the other 9 companies experience a major event for which they are unprepared? No, most likely it is because those 9 companies have very immature processes and controls, of which the lack of disaster preparedness is just a symptom.

That being said, having a plan and executing the plan are two different things. I worked at a Fortune 500 pharmaceutical company that tested its mainframe DR plan annually by sending all of the support engineers to the hot site (located 1,000 miles away) to bring up an identical mainframe model at an IBM facility. They then loaded all of our applications and tools to see if they could replicate our production environment within a three-day window. Every year they found deficiencies in the plan. Every year they discovered changes implemented over the previous year that affected the DR plan’s execution. One year, the operators in the main production site saw the mainframe’s CPU, memory utilization, and disk access all drop to zero. The DR testers didn’t just simulate bringing up the backup mainframe; they accidentally switched all the communication lines to the IBM facility. It was the ultimate proof that the DR plan was effective (even if it did scare the crap out of the production mainframe staff).

I used to tell a story when I was teaching ITIL that I am pretty sure is completely apocryphal, but it was such a good story that I told it in almost every session.

The story goes that a major car rental agency would secretly schedule a DR test on a random day of the year. On that day, all the arriving employees would be met at the door by the DR testers and be given a randomly selected colored ball. The color of the ball determined what department you worked in for that day. If it was a blue ball, you worked in accounting; if it was a red ball, you went to sales; if it was a green ball, you worked in IT; if it was a black ball, you went back home for the day (because you died in this test).
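For anyone who wants to try a similar drill, the ball draw can be sketched in a few lines of Python. This is only an illustration of the story as told above: the function name `draw_assignments`, the `seed` parameter, and the employee names are made up for the example, and the draw assumes each color is equally likely, which the story does not specify.

```python
import random

# Ball colors and what they mean, as told in the story.
ASSIGNMENTS = {
    "blue": "accounting",
    "red": "sales",
    "green": "IT",
    "black": "sent home (casualty in this test)",
}


def draw_assignments(employees, seed=None):
    """Hand each arriving employee a randomly selected ball color.

    Assumption: every color is equally likely. Pass a seed only if you
    want a repeatable draw, for example when rehearsing the drill.
    """
    rng = random.Random(seed)
    colors = list(ASSIGNMENTS)
    return {name: rng.choice(colors) for name in employees}


if __name__ == "__main__":
    staff = ["Avery", "Blake", "Casey", "Devon"]  # hypothetical names
    for name, color in draw_assignments(staff).items():
        print(f"{name}: {color} ball -> {ASSIGNMENTS[color]}")
```

A purely random draw can leave a department empty on a small roster, so a real drill would probably want a check that every critical function ends up with at least one person.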

The test was to determine if the DR plan written by each business unit was of sufficient quality that someone with little-to-no knowledge of the department’s operations could maintain the business at a minimum level to sustain the company for one day (I swear that was always the day I called to make my car reservation). For this test to work, every employee had to know how to find any department’s DR plan, and then had to be able to access the necessary information and tools for the other business unit’s functions. I might have an issue with the folks in the mailroom getting access to my sensitive HR data. Maybe the super-sensitive data is considered non-critical when the company issues the test disaster alert, and that data is simply unavailable. Some companies even have a “break glass” switch that, when activated, pulls all security down and allows anyone unhindered access to all data. Obviously, this is a last resort, used only when ensuring the company’s future viability trumps keeping data private.

ITIL says that IT Service Continuity (DR for IT) is an ongoing, iterative process that is never complete. Even the best DR plan will have deficiencies. The biggest deficiency is the fact that in a real emergency, the human mind shuts down and reverts to a more primitive state. You must write the plans on the assumption that the people executing them will not be in a top mental state, and might not be trained in the area they are forced to support. If you have the opportunity to test your plans, try to shake things up a little: have your CIO work the Service Desk, have a Service Desk person work in the NOC, have a Network Technician bring up all the servers, etc. And to really make it a good simulation, make them all wear headphones that are continuously playing the sound of crying babies at 80 decibels.

Does your organization have Disaster Recovery plans?

If so, have they ever been tested?

If you have tested the plans, how realistic were the tests?

Could your company operate at a minimum level if all your employees stayed home and you went to Best Buy and recruited random individuals to fill their places?