Our Cloud Disaster Recovery Story

I put my money where my mouth is with cloud DR, and it not only benefited my organization but also earned us a prestigious award.

As a student and practitioner of enterprise cloud computing, I've written a lot over the past five years about what I’ve learned. So I'm very pleased that my organization has won a prestigious award for one of our cloud computing efforts, competing against organizations four times our size. Here’s our story and a bit of analysis behind this "five-year overnight success."

What exactly did we win? The Amazon "City In A Cloud" competition for midsized cities, which came with a pretty hunk of Lucite and a service credit of $50,000. Sweet.

Mind you, there aren't a huge number of cities even experimenting with cloud computing. Nonetheless, our city, Asheville, N.C., with a daytime population of 120,000, was up against a field of innovators that included Tel Aviv, Israel (population 414,000), Almere, Netherlands (196,000), and Santa Clarita, Calif. (209,000).

Lest you say this award was all about vendor shenanigans, third-party judges included: Scott Case of Startup America; St. Paul, Minn., Mayor Christopher Coleman, president of the National League of Cities; Bob Sofman, co-executive director of Code for America; and luminaries from The Aspen Institute, the White House Office of Social Innovation and Civic Participation, and other organizations.

What did we win for? We used automation software to do real-time syncing of production systems to cloud storage, which meant paying for the software and the storage, but no compute -- until we needed it. It meant being able to fail over when needed with a high level of confidence, knowing that the disaster recovery system is exactly the same as the production system, or at least within a few hours of that state.
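To make "within a few hours of that state" something you can test rather than hope for, here is a minimal Python sketch of a staleness check. The function name, the four-hour target, and the timestamps are all hypothetical, chosen for illustration; this is not how our vendor's software reports sync status.

```python
from datetime import datetime, timedelta, timezone


def is_within_rpo(last_sync, now, rpo):
    """Return True if the last successful sync is no older than the recovery point target."""
    return (now - last_sync) <= rpo


# Hypothetical values for illustration only.
rpo = timedelta(hours=4)
now = datetime(2014, 9, 1, 12, 0, tzinfo=timezone.utc)
last_sync = datetime(2014, 9, 1, 9, 30, tzinfo=timezone.utc)

# The replica is 2.5 hours behind, which is inside a 4-hour target.
print(is_within_rpo(last_sync, now, rpo))  # True
```

A check like this belongs in routine monitoring, so a stalled sync is noticed long before anyone needs to fail over.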

Why is that important? Well, let’s start with a stat from the InformationWeek survey underpinning my recent Cloud Disaster Recovery Tech Digest. Among the 430 business technology pros who responded to the survey, all of whom are involved with their organizations’ backup systems, just 23% said they're extremely confident they could get the business up and running again in a reasonable time frame after a disaster that takes out the main datacenter.

If you’re part of the 77% who aren’t extremely confident, you're not alone. We weren't confident.

A happy problem

After 20-plus years supporting public safety, I'm prone to thinking about catastrophes. So when I arrived at my present organization, I asked: How do we handle disaster recovery? The answer: We have a DR center. Hooray! Then I discovered that it was two blocks away from the main site. Oy, vey!

I've always thought that DR is best handled by external providers, but in this case, with facilities available to us and providers unable to deliver the level of service we expected at a price we could afford, we decided to go with an internal solution.

So we planned to build a regional DR center as a capital project, an add-on to a planned fire station. We had to be patient: Construction wouldn’t happen for a couple of years. Then the project was canceled in 2011. Enormous problem, or fortuitous opportunity?

Having dabbled in cloud computing since 2009, we started thinking: Maybe we don't need another data center. I had joined the organization around the time of Hurricane Katrina and vividly remembered the problems with regional datacenters during that disaster. So even moving our DR center 12 miles away, as we had planned, might not be good enough. Moving the "virtual DR center" several states away, to a cloud datacenter nowhere near us, made a lot of sense.

At the time, it was daunting to move virtual machines into any public cloud, especially with the level of automation we were looking for. So we kept experimenting and investigating.

Startup risks and rewards

Then I got pitched by a startup vendor, CloudVelox, about automated cloud disaster recovery. (I’m startup-friendly, though I do delete at least 49 of every 50 vendor pitches I receive due to lack of relevance or understanding of our business goals.) I read CloudVelox's pitch. I was interested, but we were talking about production systems important enough to merit the type of investment that DR demands, so I wasn't about to approach this willy-nilly.

The question was: How to approach the risk, and how to get permission from those who own the systems?

In terms of approach, we took it slow. We took the "small jump, medium jump, high jump" approach. In this case, we deployed one low-risk server using the startup vendor's methodology. Then we moved to one mid-risk server. Then a mid-risk n-tier application. Armageddon didn't ensue.

In terms of permission, our IT organization has earned credibility with other business units in our city. We offer a high level of uptime. If we screw up, we admit it and communicate about it. Although we must enforce policy, we aren't the No Police. And we recognize that we aren’t the owners of systems; we're the custodians.

All of that cred added up in this case to approval to gradually move production systems into a new type of disaster recovery: automated synchronization and deployment into a public cloud provider, in this case Amazon Web Services.

Our application staff was all in favor of a DR system that would be automated and available on an ad hoc, easy-to-test basis. Our infrastructure staff was understandably a little freaked out: Production systems in the cloud? Security nightmare!

We put the app staff in charge of the project -- they had skin in the game because the old failover methods were labor-intensive and hard to test. And because our infrastructure folks had legit, specific concerns beyond their initial emotional reactions, we designed the deployment to address those concerns. We also hired an auditor to put the system through its paces before going into production.

When we went into production, we were amazed that we could fail over in less than an hour (compare that result to those weekend-long DR exercises where IT runs around like Keystone Cops trying to figure out patch levels and whether apps are working OK). Not only that, but the performance of the systems was pretty awesome, even though we were running in the West availability zone. (We're in North Carolina.)

Lessons learned

It wasn't, and still isn't, rainbows and sunshine. We learned several things worth sharing.

DNS. Your enterprise DNS is probably more messed up than you think. It boils down to the enterprise propensity to define an "internal" versus an "external" DNS zone. No surprise: When you expect your app to be available via AWS, and you want to plan for your headquarters being gone due to an earthquake or other disaster, you might want to use globally distributed DNS, even for internal apps. You don't want to worry about manually configuring clients to use global, rather than internal, DNS entries when you're worried about your headquarters falling into a gigantic crack in the Earth's crust.
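One way to find out how messed up your split DNS is, before the disaster: compare what the internal zone says with what the rest of the world sees. The Python sketch below is a minimal illustration that assumes you have already exported both zones into simple name-to-address mappings. The function name and every record are hypothetical, using documentation-reserved names and addresses.

```python
def split_horizon_gaps(internal_zone, public_zone):
    """Compare two {hostname: address} mappings.

    Returns (internal_only, mismatched): names that exist only in the
    internal zone, and names that resolve to different addresses in each.
    """
    internal_only = sorted(
        name for name in internal_zone if name not in public_zone
    )
    mismatched = sorted(
        name
        for name in internal_zone
        if name in public_zone and internal_zone[name] != public_zone[name]
    )
    return internal_only, mismatched


# Hypothetical zone exports for illustration only.
internal = {
    "payroll.example.org": "10.0.0.12",
    "gis.example.org": "10.0.0.40",
    "www.example.org": "203.0.113.5",
}
public = {
    "gis.example.org": "203.0.113.40",
    "www.example.org": "203.0.113.5",
}

internal_only, mismatched = split_horizon_gaps(internal, public)
print(internal_only)  # ['payroll.example.org']
print(mismatched)     # ['gis.example.org']
```

Anything in the first list is an app nobody can find once the internal DNS servers are gone; anything in the second deserves a look to confirm the difference is intentional.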

Licensing. It can be a bear when you move systems. To wit: Many proprietary systems rely on a license key based on a host ID, and when the host ID changes, the system won't work, or it will work in substandard "evaluation mode." A quick call to our vendor revealed exactly what the procedure is for emergency/disaster recovery. It's a good thing to know before the disaster.

Bandwidth. Synchronizing your databases to the cloud periodically isn't for the faint of heart. Our DR vendor synchronizes at the block level in Windows Server, which appears to create a lot of traffic. We're fortunate that we have more broadband providers in town than the usual duopoly, so bandwidth is both plentiful and relatively inexpensive. In fact, we just about doubled our average utilization. So the question becomes: Is it more cost-effective to double your bandwidth, or is it more cost-effective to farm out to a DR provider or build a private DR center? In our case, upping our bandwidth is the far better alternative, but I know that’s not true everywhere.
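If you want a rough feel for what block-level sync might do to your own links, the arithmetic is simple enough to sketch in Python. The function name and the 50 GB daily change figure below are assumptions for illustration, not our measurements, and the result is an average that ignores protocol overhead, compression, and bursts.

```python
def required_mbps(changed_gb_per_day, sync_window_hours):
    """Average throughput in megabits per second needed to ship one day's
    changed blocks within the given window (decimal units)."""
    megabits = changed_gb_per_day * 8 * 1000  # GB -> gigabits -> megabits
    seconds = sync_window_hours * 3600
    return megabits / seconds


# Hypothetical: 50 GB of changed blocks per day, replicated around the clock.
print(round(required_mbps(50, 24), 2))  # 4.63

# The same volume squeezed into an 8-hour overnight window.
print(round(required_mbps(50, 8), 2))  # 13.89
```

Compare the result with your current average utilization and your provider's pricing for the next tier up, and you have the start of the bandwidth-versus-DR-provider comparison.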

Ultimately, we probably will still move our current alternative datacenter to another location to back up things like VoIP and public safety radio. But I can tell you this: That new datacenter will cost far less, and it will be far smaller, than the one we initially planned to build. And we won't waste money buying duplicate gear either. Another important outcome is that, because of the cost reduction (about a tenth of the cost for capital, according to our infrastructure manager), we have moved to also protect systems that are "important, but not urgent," systems that were too expensive to protect in the past.

Jonathan Feldman is Chief Information Officer for the City of Asheville, North Carolina, where his business background and work as an InformationWeek columnist have helped him to innovate in government through better practices in business technology, process, and human ...

Most vendors rely on really hefty security, which can itself cause malfunctions, especially when securing various labels in a big data set. Whatever security innovation is undertaken has to be passed to the Cloud Disaster Management department for evaluation.

Congratulations, Jonathan! And thanks for sharing your story and insight. It's important to demonstrate that moving to cloud isn't all "rainbows and sunshine," as you say, but the benefits should outweigh the challenges of the transition.

Congratulations, Jonathan. In addition to the technical insight, it's helpful to get a look into how you put the team together, addressing the concerns and interests of app and infrastructure groups. These are personal and political decisions, not just technical ones.

We've had this discussion at our organization before. It's an interesting model that warrants serious consideration. The conversation always leads to... If it works so well for DR, why do you need a primary data center at all? In my locale, affordable bandwidth is our primary impediment, but it sounds like this isn't an issue for you. What hurdles prevent you from considering this approach for your primary data center? Thanks.

Excellent testimony from Jonathan on DR. The real gain in the long run may be in the willingness to back up many additional systems, thanks to the ease of DR operations via the cloud and overall reduced cost.

I've edited our Backup Technologies surveys for years, and that confidence number is stubbornly low. I used to have a lot of sympathy, before virtualization became SOP. I had some sympathy before cloud became SOP. Now? Not so much.
