Wednesday, January 10, 2007

How to Create a Disaster Recovery Plan

Learn the basics of creating a plan that will have you prepared to recover your data and keep the business running after an IT-disabling disaster.

by Glen Kunene, Senior Editor

What would you do if a storm flooded your data center? Or how would you respond if a power outage blacked out your servers? How would you recover your data and keep the business running after an unforeseen disaster? When disasters strike unprepared companies, the consequences range from prolonged system downtime and the resulting revenue loss to going out of business completely. Yet many IT shops are not prepared to deal with such scenarios.

The key to surviving such an event is a business continuity strategy, a set of policies and procedures for reacting to and recovering from an IT-disabling disaster, and the main component of a business continuity strategy is a disaster recovery plan (DRP). In this article, DevX and Cole Emerson, President of Cole Emerson & Associates, Inc., a business-continuity consulting firm, and chairman of the board of DRI International, administrators of a global certification program for business continuity/disaster recovery planners, walk through the basics of creating an effective DRP.

Step 1: Risk Analysis

The first step in drafting a disaster recovery plan is conducting a thorough risk analysis of your computer systems. List all the possible risks that threaten system uptime and evaluate how imminent they are in your particular IT shop. Anything that can cause a system outage is a threat, from relatively common manmade threats like virus attacks and accidental data deletions to more rare natural threats like floods and fires. Determine which of your threats are the most likely to occur and prioritize them using a simple system: rank each threat in two important categories, probability and impact. In each category, rate the risks as low, medium, or high.

For example, a small Internet company (fewer than 50 employees) located in California could rate an earthquake threat as medium probability and high impact, while the threat of utility failure due to a power outage could rate high probability and high impact. Both threats have the same impact, but the power outage is more probable. So in this company's risk analysis, a power outage would be a higher risk than an earthquake and would therefore be a higher priority in the disaster recovery plan.
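
The ranking above can be sketched in a few lines of Python. This is only an illustration: the article names the low, medium, and high ratings but no formula, so mapping them to 1 through 3 and multiplying probability by impact is an assumption, as are the names LEVELS, risk_score, and prioritize.

```python
# Assumption: low/medium/high map to 1/2/3 and a threat's score is
# probability times impact. The article prescribes only the ratings.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def risk_score(probability, impact):
    """Combine the two ratings into one comparable number."""
    return LEVELS[probability] * LEVELS[impact]


def prioritize(threats):
    """Return threat names ordered from highest to lowest score.

    threats maps a threat name to a (probability, impact) pair.
    """
    return sorted(threats, key=lambda name: risk_score(*threats[name]), reverse=True)


# Ratings taken from the California company example.
threats = {
    "earthquake": ("medium", "high"),
    "power outage": ("high", "high"),
}
ranked = prioritize(threats)
print(ranked)
```

With these ratings the power outage scores 9 and the earthquake 6, which matches the priority order in the example.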

Step 2: Establish the Budget

Once you've figured out your risks, ask, "What can we do to suppress them, and how much will it cost?" Can we detect a threat before it hits? How do we reduce the potential of it occurring? How do we minimize its impact on the business? For example, our small California Internet company could employ an emergency power supply to mitigate its power outage threat and have all its data backed up daily to tapes that are stored at a remote site in case of an earthquake. The more preventative measures you establish upfront the better. Emerson says, "dollars spent in prevention are worth more than dollars spent in recovery."

The result of the risk analysis and this review of countermeasures should be a comprehensive list of possible threats, each with its corresponding solution and cost. It is imperative that IT present all of these threats to the business operations units, so they can make an informed decision regarding the size of the disaster recovery budget (i.e., which risks the company can afford to tolerate and which it must pay to mitigate). Emerson believes IT "falls down" in its failure to communicate the real risks for system downtime to the business operations units. He says, "It's okay for operations to say no; it's not okay for IT not to let them know the risks."

A good place to begin is by presenting the cost of downtime to the business. How long can your business afford to be without its computer systems should one of your threats occur?
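
A rough way to put a number on that question is sketched below. The function name, its parameters, and every dollar figure are invented for illustration; a real estimate would use figures supplied by the business units and would likely include costs this sketch ignores.

```python
def downtime_cost(hours_down, revenue_per_hour, idle_staff_cost_per_hour=0.0):
    """Estimate the direct cost of an outage lasting hours_down hours.

    Assumption: cost grows linearly with outage length, which ignores
    effects such as lost customers or contractual penalties.
    """
    if hours_down < 0:
        raise ValueError("hours_down must be non-negative")
    return hours_down * (revenue_per_hour + idle_staff_cost_per_hour)


# Invented figures, for illustration only.
for hours in (4, 24, 48):
    cost = downtime_cost(hours, revenue_per_hour=1000, idle_staff_cost_per_hour=200)
    print(f"{hours} hours down: {cost}")
```

Presenting a few outage lengths side by side gives the business units something concrete to weigh against the cost of each countermeasure.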

Ultimately, the business operations unit decides which threats the business can tolerate. According to Emerson, when developing a DRP, IT departments are "shooting in the dark without those business indications." Both IT and the business units must agree on which data and applications are most critical to the business and need to be recovered most quickly in a disaster. The management of our small Internet company, for example, may decide they can supply the budget only for the emergency generators and the company will have to assume the risk of an earthquake.

Disaster recovery budgets vary from company to company, but they typically run between 2 and 8 percent of the overall IT budget. Companies for which system availability is crucial usually are on the higher end of the scale, while companies that can function without it are on the lower end. However, these percentages may be too small. For a large IT shop, 15 percent is a best-practice rule of thumb, according to Emerson.
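
To see what those percentages mean in dollars, here is a small sketch. The percentages come from the paragraph above; the function name and the $1,000,000 IT budget are made up for the example.

```python
def dr_budget(it_budget, percent):
    """Return the disaster recovery budget for a given share of the IT budget."""
    return it_budget * percent / 100


it_budget = 1_000_000  # hypothetical overall IT budget

typical_low = dr_budget(it_budget, 2)    # low end of the typical range
typical_high = dr_budget(it_budget, 8)   # high end of the typical range
large_shop = dr_budget(it_budget, 15)    # Emerson's rule of thumb for a large shop

print(typical_low, typical_high, large_shop)
```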

Step 3: Develop the Plan

The feedback from the business units will begin to shape your DRP procedures. If, for example, they determine that the company must be up within 48 hours of an incident to stay viable, then you can calculate the amount of time it would take to execute the recovery plan and have the business back up within that timeframe. Emerson suggests that you have the recovery systems tested, configured, and retested 24 hours prior to launching them. He says the setup takes anywhere from 40 hours to days to complete.
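
The calculation described above might look like the following sketch. It assumes the recovery steps run one after another, and only the 48-hour window and the 40-hour setup figure come from the article; the step names, the other durations, and the function name are hypothetical.

```python
def fits_recovery_window(step_hours, window_hours):
    """Return (fits, total_hours), assuming steps run one after another."""
    total = sum(step_hours.values())
    return total <= window_hours, total


steps = {
    "set up recovery systems": 40,     # lower bound Emerson gives for setup
    "restore data": 6,                 # hypothetical duration
    "run verification checklist": 4,   # hypothetical duration
}

fits, total = fits_recovery_window(steps, window_hours=48)
print(f"Plan needs {total} hours against a 48-hour window; fits: {fits}")
```

In this made-up case the plan overruns the window, which would be a signal either to shorten a step (for example, by keeping recovery systems preconfigured) or to revisit the window with the business units.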

The recovery procedure should be written in a detailed plan or "script." Establish a Recovery Team from among the IT staff and assign specific recovery duties to each member. The manner in which your team conducts its recovery probably will be no different than its regular production procedures: the chain of command likely won't change and neither will the aspects of the network for which each member is responsible.

Define how to deal with the loss of various aspects of the network (databases, servers, bridges/routers, communications links, etc.) and specify who arranges for repairs or reconstruction and how the data recovery process occurs. The script will also outline priorities for the recovery: What needs to be recovered first? What is the communication procedure for the initial responders? To complement the script, create a checklist or test procedure to verify that everything is back to normal once repairs and data recovery have taken place.
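
One way to keep the script's priorities and the verification checklist in a single structure is sketched below. The RecoveryTask class, its fields, the role names, and the priority order shown are all hypothetical; the article does not prescribe a format or say which component should come first.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTask:
    priority: int          # 1 means recover first
    component: str         # part of the network being recovered
    owner: str             # Recovery Team member responsible (hypothetical roles)
    verified: bool = False  # set to True once the checklist item passes


def recovery_order(tasks):
    """Return tasks sorted so the highest-priority component comes first."""
    return sorted(tasks, key=lambda task: task.priority)


def outstanding(tasks):
    """Return components, in recovery order, that are not yet verified."""
    return [task.component for task in recovery_order(tasks) if not task.verified]


tasks = [
    RecoveryTask(2, "servers", "system administrator"),
    RecoveryTask(1, "communications links", "network engineer"),
    RecoveryTask(3, "databases", "database administrator"),
]

tasks[1].verified = True  # communications links checked and back to normal
print(outstanding(tasks))
```

The recovery is complete only when outstanding returns an empty list, which mirrors the checklist's purpose of confirming that everything is back to normal.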

Step 4: Test, Test, Test

Once your DRP is set, test it frequently. Eventually you'll need to perform a component-level restoration of your largest databases to get a realistic assessment of your recovery procedure, but a periodic walk-through of the procedure with the Recovery Team will ensure that everyone knows their roles. Test the systems you're going to use in recovery regularly to validate that all the pieces work. Always record your test results and update the DRP to address any shortcomings.

As your business environment changes, so should your DRP. Reexamine the plan every year on a high level: Do you still need every part of the plan? Do you need to add to it? Will the budget need to be adjusted to accommodate changes to the plan? As applications, hardware, and software are added to your network, they must be brought into the plan. New employees must be trained on recovery procedures. New threats to business seem to pop up every week and a sound DRP takes all of them into account.