Preparing for the Worst

When a data center is shiny and new, it can be easy to forget that things can go wrong. Servers crash, people make mistakes and utility power fails. Most companies – 95 percent – experienced an unplanned data center outage during the previous two years, according to a 2010 Ponemon Institute study titled “National Survey on Data Center Outages.”

To avoid these issues, every new and existing data center must have a formalized, tested disaster recovery plan, said Rachel Dines, senior analyst at Forrester Research. But even organizations that plan for the worst aren’t necessarily prepared, she added. “While the vast majority of organizations have some sort of continuity plan, what varies is how robust it is, how often they test and how closely it matches business or mission requirements,” she said. “That’s where the story gets not so good.”

Many disaster recovery plans fall apart because information technology workers overlook or skip specific steps in the process, she said, adding that the first step has little to do with technology choice or implementation.
“The very first thing you need to understand is the business requirements,” Dines said. “What does the department or agency need to continue? What systems are critical to the mission?”

IT organizations need to assess which missions are essential functions and correlate those needs with the IT systems that support them. Called a business impact assessment, this process requires meeting with end users and is usually handled by an outside vendor since it can be time-consuming and difficult.
“Most private organizations don’t even have the resources to dedicate to do this properly,” Dines said. “That’s why this step is often glossed over or skipped entirely.”

When choosing a vendor for an assessment, agency officials should look for companies that have a proven methodology and have documented experience working in a similar industry. The process will be quicker and go more smoothly if the firm you hire doesn’t have to start from scratch.

Once you’ve uncovered what applications and data are mission critical, it’s time to look at that pool from a recovery perspective. Ultimately, Dines said, you’ll need to define recovery time and recovery point objectives – how long you can go without those applications and data and how much data you can lose without impacting your agency’s mission. In this step, end users should have limited input; few will admit there is data they can live without. In the government sector, this step may include looking into nontechnical replacements for data or applications.

“Since much of government just recently came off of paper, there should be options available to them,” Dines said.
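The recovery objectives Dines describes can be captured concretely. A minimal sketch in Python, using hypothetical systems and thresholds purely for illustration – the real values come out of the business impact assessment:

```python
from dataclasses import dataclass

@dataclass
class RecoveryObjective:
    system: str
    rto_hours: float  # recovery time objective: longest tolerable downtime
    rpo_hours: float  # recovery point objective: most data loss tolerable, in hours

# Hypothetical systems and thresholds, for illustration only
objectives = [
    RecoveryObjective("benefits-portal", rto_hours=4, rpo_hours=1),
    RecoveryObjective("email", rto_hours=24, rpo_hours=4),
    RecoveryObjective("public-website", rto_hours=8, rpo_hours=24),
]

# Restore the systems with the tightest recovery time objectives first
recovery_order = sorted(objectives, key=lambda o: o.rto_hours)
for o in recovery_order:
    print(f"{o.system}: restore within {o.rto_hours}h, "
          f"lose at most {o.rpo_hours}h of data")
```

Writing the objectives down this way makes it obvious which systems the recovery architecture must bring back first, rather than leaving the priorities implicit.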

A key part of this process is figuring out the cost of downtime, said Robert Bready, research director, IT infrastructure at Aberdeen.

“Until you know how much your downtime costs you, you can’t figure out the technology you’ll need to implement for recovery,” he said. “If your downtime costs $100 an hour, backup to tape or off site might be sufficient. But if you’re an organization like [the Federal Emergency Management Agency] an hour of downtime might mean you’re not just losing money, you’re losing lives, so you’ll have to formulate a way to get back online more quickly.”
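Bready’s rule of thumb is simple arithmetic: multiply the hourly cost by the expected outage length, then match that figure to a recovery technology. A hypothetical sketch – the dollar thresholds and strategy names here are illustrative assumptions, not an industry standard:

```python
def downtime_cost(cost_per_hour: float, hours_down: float) -> float:
    """Direct cost of an outage at a flat hourly rate."""
    return cost_per_hour * hours_down

def recovery_tier(cost_per_hour: float) -> str:
    """Map hourly downtime cost to a backup strategy.

    Thresholds are invented for illustration; a real mapping would come
    from the business impact assessment.
    """
    if cost_per_hour < 500:
        return "tape or offsite backup"
    if cost_per_hour < 10_000:
        return "disk-based backup with frequent replication"
    return "hot standby site with continuous replication"

print(downtime_cost(100, 8))   # an eight-hour outage at $100/hour -> 800
print(recovery_tier(100))      # cheap downtime justifies only tape/offsite
```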

One of the final steps is architecting and implementing the technology and services to support the organization’s backup, archival and recovery plans. To this end, experts suggest paying close attention to "grooming" your data, or making sure you keep just enough copies of the data you need for protection.
“There is simply way too much data being generated to do things the old way any longer, taking an incremental backup during the week and full copies every weekend,” said Steve Duplessie, senior analyst at research firm Enterprise Strategy Group. “It’s crazy. Who needs 187 copies of the same, non-changing data?”
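The grooming Duplessie describes – refusing to store 187 identical copies – is essentially content deduplication. A minimal sketch, assuming a simple hash-based approach in Python (real backup products deduplicate at the block level, not whole files):

```python
import hashlib

def groom(backups: dict) -> dict:
    """Keep one stored copy per unique payload; identical data shares storage.

    backups maps backup-file names to their raw bytes.
    Returns a map of content digest -> one stored copy.
    """
    unique = {}
    for name, data in backups.items():
        digest = hashlib.sha256(data).hexdigest()
        unique.setdefault(digest, data)  # store only the first copy seen
    return unique

# 187 nightly copies of the same unchanging file...
backups = {f"backup-{i}.dat": b"static reference data" for i in range(187)}
groomed = groom(backups)
print(len(backups), "copies ->", len(groomed), "unique payload stored")
```

The same idea underlies incremental-forever and deduplicating backup schemes: unchanged data is referenced, not recopied.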

Going forward, every agency will have to conduct regular testing, Dines said. Currently only 47 percent of enterprises test their disaster recovery plans at least once a year, and nearly one in five (18 percent) never test at all, according to “Disaster Recovery Exercises Fall Short Of The Finish Line,” a recent Forrester Research study.

“For government agencies, having constituents trust them is a main goal. Confidence is a big driver of planning and preparation,” Dines said. “Organizations are striving for significantly higher levels of availability [than] in the past – [for] both planned and unplanned downtime – because customers and constituents are demanding it.”