IT disaster recovery: Sysadmins vs. natural disasters

Businesses need to keep going even when faced with torrential flooding or earthquakes. Sysadmins who lived through Katrina, Sandy, and other disasters share real-world advice for anyone responsible for IT during an emergency.

In terms of natural disasters, 2017 has been one heck of a year. Hurricanes Harvey, Irma, and Maria brought destruction to Houston, Puerto Rico, Florida, and the Caribbean. On top of that, wildfires burned out homes and businesses in the West.

It’d be easy to respond with yet another finger-wagging article about preparing for disasters—and surely it’s all good advice—but that doesn’t help a network administrator cope with the soggy mess. Most of those well-meant suggestions also assume that the powers that be are cheerfully willing to invest money in implementing them.

We’re a little more interested in the real world, so let's put that bad news to some good use.

Case in point: One result of a natural disaster is that the boss may suddenly be willing to find budget for disaster recovery planning. As a New York area sysadmin puts it, "The greatest benefit I found from Hurricane Sandy is our client's interest in investing back into IT, so hopefully you will welcome bigger budgets as well."

Don’t expect that willingness to last long, though. Any sysadmin who’d like to suggest infrastructure improvements is urged to make hay while the sun shines. As another Sandy-survivor IT specialist ruefully remarks, "Initial interest in IT spending lasted the calendar year for us. By the following year, any plans that hadn't already been put in the works got put on the back burner due to 'budgetary constraints,' and then completely forgotten about by around 6 months later."

"Adequate” is a key word. As a sysadmin on Reddit wrote in 2016, "Our disaster plan is a disaster. All our data is backed up to a storage area network [SAN] about 30 miles from here. We have no hardware to get it back online or have even our core servers up and running within a few days. We're a $4 billion a year company that won't spend a few $100K for proper equipment. Or even some servers at a data center. Our executive team said, 'Meh what are the odds of anything happening' when the hardware proposal was brought up."

Another on the same thread put it more succinctly: "Currently my DR plan is to cry in a dark damp corner and hope nobody cared about anything that was lost."

If you’re crying, let’s hope you aren’t crying alone. Any disaster plan, even one devised by the IT department, has to ensure that you can communicate with humans, as sysadmin Jim Thompson learned during Katrina: "Make sure you have a plan to communicate with people. During a serious regional disaster, you will not be able to call anyone with a phone in the affected area code."

Make a wish list

The first step is recognizing the problem. "Many companies are not actually interested in disaster recovery, or they address it reluctantly," says Joshua Brusse, a chief architect at Micro Focus. "Viewing disaster recovery as an aspect of business continuity is a different perspective. All companies deal with business continuity, so disaster recovery should be considered as part of that."

Ensuring that there’s an adequate disaster recovery and business continuity plan in place requires the IT department to document its needs. That’s true even if, or particularly when, you don’t get your way. As one sysadmin remarks, "I like to have a 'thought dump' location where any and all plans/ideas/improvements can be just dumped in with no limitations or restrictions. [This] is especially helpful for when you propose a change, it gets shot down, and six months later that situation you warned about came up." Now you have everything prepared and can start the discussion: "As we discussed back in April…"

So, what can you do when your executive team responds to the business continuity plan with "Meh what are the odds of anything happening?" Shockingly poor judgement as that is, one sysadmin suggests it's also completely normal behavior for the executive layer. In situations this dire, experienced sysadmins say document the events. Be clear that you told the executives what needed to be done and that they refused to do so. "The general idea is to have a paper trail long enough for them to hang themselves," the sysadmin adds.

If that doesn’t work, the experience of bringing back a flooded data center will serve you well in a new job search.

Protect the physical infrastructure

"Our office is an old decrepit building,” reported one sysadmin after Harvey hammered Houston. "We went into the building blind and the infrastructure in place was terrible. We literally just finished the last of the drops we needed in that building and now it's all under water."

Nonetheless, if you want the data center to keep running—or to get back up and working after a storm—you need to ensure the facility can stand up to not only the kind of disasters expected in your area but the unexpected ones as well. One reason Sandy was devastating is that the New York area wasn’t prepped for that sort of weather system. A sysadmin in San Francisco knows why it’s important to ensure the company’s servers are in a building that can withstand a magnitude 7 earthquake. A business in St. Louis knows how to respond to tornadoes. But you should prepare for every eventuality: a tornado in California, an earthquake in Missouri, or a zombie apocalypse (which also gives you justification for a chainsaw in the IT budget).

In Houston's case, most data centers stayed up and running because they were built to withstand storms and floods. Data Foundry’s chief technology officer, Edward Henigin, says of one of its data centers, "Houston 2 is a purpose-built facility designed to withstand Category 5 hurricane wind speeds. This site has not lost utility power, and we have not had to transition to our backup generators."

That's the good news. The bad news is, as superstorm Sandy showed in 2012, if your data center isn't ready to handle flooding, you're in for a world of trouble. Customers of one failed data center, Datagram, included high-profile sites Gawker, Gizmodo, and Buzzfeed.

Of course, sometimes there's nothing you can do. As one San Juan, Puerto Rico, sysadmin sadly wrote when Irma came through, "Generator took a dump. Server room running on batteries but no [air conditioning]. Bye bye servers." The sysadmin couldn’t fail over to disaster recovery because the MPLS (Multiprotocol Label Switching) line was also down: "Fun day."

To sum up, IT professionals need to know their area, know their risks, and place their servers in data centers that can handle the local conditions.

An argument for the cloud

The best way to survive a data center failure when a storm rolls through is to make sure the backup data center is somewhere else. That requires sensible decisions about where to put each site. Your backup data center should not be in a region that can be affected by the same natural disaster; place your resources in more than one availability zone. The cases to avoid are a backup and a primary that sit along the same fault line, or that are both vulnerable to flooding from linked water sources.
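As a rough first check on site separation, you can compute the straight-line distance between the primary and backup locations. The sketch below uses the standard haversine formula; the function names, the coordinates (approximate city centers for Houston and Austin), and the minimum-distance threshold are all illustrative assumptions, not recommendations from anyone quoted here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def too_close(primary, backup, min_km):
    """True if the two (lat, lon) sites are nearer than min_km.

    min_km is a value you pick for your own risk profile (assumption).
    """
    return distance_km(*primary, *backup) < min_km


# Approximate city-center coordinates, for illustration only.
houston = (29.76, -95.37)
austin = (30.27, -97.74)

print(round(distance_km(*houston, *austin)), "km apart")
print("too close:", too_close(houston, austin, min_km=150.0))
```

Distance is only a proxy. Two sites hundreds of kilometers apart can still share a fault line, a river system, or a hurricane track, so treat a passing result as a starting point for the risk review rather than the end of it.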

Some sysadmins use the cloud for redundancy. For example, Microsoft Azure storage is always replicated to ensure durability and high availability. Depending on the options you choose, Azure replication copies your data, either within the same data center or to a second data center. Most public clouds offer similar automatic backup services to help ensure data stays safe no matter what happens to your local data center—unless your cloud provider is in the same storm path.

Expensive? Yes. As expensive as being down for a day or two? No.

Don't trust the public cloud? Consider a colocation (colo) service. With colo, you still own your hardware and run your own applications, but the hardware can be miles away from trouble. For instance, during Harvey, one company "virtually" moved all its resources from Houston to its colo in Austin, Texas. But those local data centers and colocation sites need to be ready to handle disasters; it’s one of the criteria you should use in choosing them. For example, a Seattle sysadmin looking for colocation space said, "It was all about their earthquake and drought protection (overbuilt foundations and water trucks to feed the chillers)."

A more serious plan driven by IT staff in the wake of 2016's Delta and Southwest outages was for a managed service provider to deploy uninterruptible power supplies to its clients: "On the critical pieces, we use a combination of SNMP signalling and PowerChute Network Shutdown (PCNS) clients to shut things down in the event of a power failure. Bringing things back up, well... that depends on the client. Some are automatic, and some require manual intervention."
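The ordering logic behind a staged shutdown like that can be sketched in a few lines. This is not how PowerChute Network Shutdown or any SNMP agent works internally; it is a hypothetical planner that assumes hosts shut down one after another, that you know roughly how long each takes, and that the UPS reports its remaining runtime in seconds. The `Host` class, the priorities, and the safety margin are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    priority: int          # higher means more critical, so it shuts down later
    shutdown_seconds: int  # estimated time for a clean shutdown


def shutdown_plan(hosts, runtime_seconds, safety_margin=120):
    """Return (name, start_offset_seconds) pairs for a sequential shutdown.

    Least critical hosts go first. If there is enough battery, the sequence
    is delayed so the last host finishes safety_margin seconds before the
    battery is expected to run out; otherwise it starts immediately.
    """
    ordered = sorted(hosts, key=lambda h: h.priority)
    total = sum(h.shutdown_seconds for h in ordered)
    budget = runtime_seconds - safety_margin
    start = max(0, budget - total)
    plan = []
    for host in ordered:
        plan.append((host.name, start))
        start += host.shutdown_seconds
    return plan


hosts = [Host("db", 3, 300), Host("web", 1, 60), Host("app", 2, 120)]
for name, offset in shutdown_plan(hosts, runtime_seconds=900):
    print(f"t+{offset}s: shut down {name}")
```

Delaying the sequence keeps services up as long as the battery allows, which matters when utility power comes back after a few minutes. Whether that trade-off is worth it depends on how much you trust the runtime estimate.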

Another approach is to support the data center with utility power from two substations. For example, the Seattle Westin Building data center has multiple 13.4-kilovolt utility feeds, diverse power substations, and multiple 480-volt three-phase transformer vaults.

Serious power failure prevention systems are not "one size fits all" units. Sysadmins should requisition a custom-designed diesel generator for the data center. Besides being tuned for your specific needs, generators must be capable of jumping to full speed in moments and accepting full-power loads without impacting the load performance.

As most data center professionals know, if you have time (say, a hurricane is a day away), make sure your generator is working, fully fueled, and ready to kick on when the power lines get cut. Of course, you should have been testing your generator every month anyway. You have been doing that? Right? Right!

Testing your confidence in backups

Ordinary users almost never make backups, and fewer still check to make sure their backups are actually any good. Sysadmins know better.
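One simple way to check that a backup is any good is to restore it somewhere and compare the result against the source. The sketch below does that with SHA-256 checksums from Python's standard library. The function names and directory layout are assumptions, and it presumes the source files have not changed since the backup was taken; real backup products have their own verification features that are worth using as well.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_dir, restored_dir):
    """Return (relative_path, problem) pairs for files that are missing
    from the restored copy or whose contents differ from the source."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append((str(rel), "missing"))
        elif sha256_of(src) != sha256_of(restored):
            problems.append((str(rel), "differs"))
    return problems
```

An empty list from `compare_trees("/data", "/mnt/restore-test")` (both paths hypothetical) means every source file came back intact. Anything else is a problem you would rather find now than during a flood.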

These days, tape can handle 10 terabytes per tape; there are experiments underway that take tape up to 200 TB. Technologies such as the Linear Tape File System enable you to read tape data as if it were just another network drive.

Yet for many, tape is the option of absolute last resort. That’s fine, because backup should have plenty of options. In this case, says one sysadmin, "we would have to fail with: [Windows] server level VSS [Volume Shadow Copy Service] snapshots, SAN level volume snapshots, and SAN level offsite archived snapshot copies. But if, hypothetically, something happened that nuked our VM, the SAN, and the backup SAN, we could still get the tapes back and recover the data."

When trouble is coming your way, use replication tools such as Veeam, which create a virtual machine replica of your servers. If there's a failure, the replicas are automatically spun up. No fuss, no muss, as one sysadmin says in the popular sysadmin post, "I love you Veeam."

Network? What network?

Of course, no cloud, no colo, and no remote data center helps you if staff can't reach their services. You don’t need a natural disaster to justify redundant Internet connections. All it takes is a backhoe cable cut or severed fiber lines to give you a bad day at work.

Smart sysadmins know their corporate Internet connections must be business-class connections with a service-level agreement (SLA) that includes a "time to repair" clause. Better still is to get a dedicated Internet access (DIA) circuit. Technically, it's no different from any other Internet connection. The difference is that a DIA circuit is not a "best effort" connection. Instead, you get a specified amount of bandwidth that is dedicated to your use and comes with an SLA. DIA circuits are not cheap, but as the saying goes, "Fast. Reliable. Cheap. Pick any two." When it's your business on the line and a storm is coming your way, "reliable" has to be one of your two picks.
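When you compare SLAs, it helps to turn the availability percentage into hours of permitted downtime. The arithmetic is just the period length times the unavailable fraction. The percentages below are common examples, not figures from any provider mentioned here, and the function name is made up for this sketch.

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime an SLA permits over the period at the given availability."""
    return period_hours * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% availability allows about "
          f"{allowed_downtime_hours(pct):.2f} hours of downtime per year")
```

A 99.9 percent SLA, for example, still allows nearly nine hours of outage a year, and nothing in the percentage says those hours can't all land on the same day. That is why the "time to repair" clause matters as much as the headline number.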

When the storm skies clear

You can't prepare for all disasters, but you can plan for many of them. With a well-thought-out and tested disaster recovery and business continuity plan that is followed to the letter, your company can stay afloat while your rivals are drowning.

Sysadmins vs. disasters: Lessons for leaders

How many times must your IT staff say this: Don't just make backups. Test backups.

No power? No company. Make certain your servers’ emergency power is sufficient for your needs, and that it actually works.

If your company survives a natural disaster—or dodges one—wise sysadmins know that this is the time to ask management for the disaster recovery budget they’ve been postponing. Because next time, you might not be so lucky.

Steven J. Vaughan-Nichols, a.k.a. sjvn, has been writing about technology and the business of technology since CP/M-80 was the cutting-edge PC operating system, 300bps was a fast Internet connection, WordStar was the state-of-the-art word processor, and we liked it. His work has been published in everything from highly technical publications (IEEE Computer, ACM NetWorker, Byte) and business publications (eWeek, InformationWeek, ZDNet) to popular technology magazines (Computer Shopper, PC Magazine, PC World) and the mainstream press (Washington Post, San Francisco Chronicle, Businessweek).
