Category Archives: Disaster Recovery

When I picked up Johnson’s book, I knew nothing about cholera. Sure, I’d heard the name, but I didn’t know what a plague it was during the 19th century.

Johnson’s book is at once a thriller, tracking the deadly progress of the disease. But within that story we learn of the squalor the inhabitants of Victorian England endured before public works & sanitation. We learn of architecture & city planning, of statistics & how epidemiology was born. The evidence is woven together with mapmaking, the study of pandemics, information design, environmentalism & modern crisis management.

“It is a great testimony to the connectedness of life on earth that the fates of the largest and the tiniest life should be so closely dependent on each other. In a city like Victorian London, unchallenged by military threats and bursting with new forms of capital and energy, microbes were the primary force reigning in the city’s otherwise runaway growth, precisely because London had offered Vibrio cholerae (not to mention countless other species of bacterium) precisely what it had offered stock-brokers and coffee-house proprietors and sewer-hunters: a whole new way of making a living.”

1. Scientific pollination

John Snow was the investigator who solved the riddle. He didn’t believe that putrid smells carried disease, the prevailing miasma theory of the day.

“Part of what made Snow’s map groundbreaking was the fact that it wedded state-of-the-art information design to a scientifically valid theory of cholera’s transmission.”

2. A Generalist saves the day

The interesting thing about John Snow was how much of a generalist he really was. Because of this, he was able to see things across disciplines that others of the time could not.

“Snow himself was a kind of one-man coffeehouse: one of the primary reasons he was able to cut through the fog of miasma was his multi-disciplinary approach, as a practicing physician, mapmaker, inventor, chemist, demographer & medical detective”

3. Enabling the growth of modern cities

The discovery of the cause of cholera prompted the city of London to undertake a huge public works project: building sewers that would flush wastewater out to sea. Truly, it was this discovery and its solution that enabled much of the population density we see in the 21st century. Without modern sanitation, cities of 20 million plus would certainly not be possible.

4. Bird Flu & modern crisis management

In recent years we have seen public health officials around the world push for Thai poultry workers to get their flu shots. Why? Because although avian flu only spreads from animals to humans now, were it to encounter a person who had our run-of-the-mill flu, it could quickly learn to move from human to human.

All these scientific disciplines, epidemiology among them, direct decisions we make today. And much of that work is informed by what Snow pioneered with his study of cholera.

I recently went on a flight with a pilot friend & two others. With only three passengers, you’re front and center to all the action. Besides getting to chat directly with the pilot, you also see the many checks & balances in place. They call it the Warrior aviation checklist. It inspired this post on disaster recovery in the datacenter.

1. Have a real plan B

Looking over the document you can see there are procedures for engine failure in flight! That’s right, those small planes can lose an engine and still glide. So for starters, the technology is designed for failure. But the technology’s design is not the end of the story.

When you have procedures and processes in place, and training to deal with failure, you’re expecting and planning for it.

[quote]
Expect & plan for failure. Use technologies designed to fail gracefully, and build your software to do the same. Then put procedures in place to detect failures & alert you, so you can put those processes into action quickly.
[/quote]
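The detect & alert idea above can be sketched in a few lines. This is a minimal illustration, not a monitoring product: `probe` and `alert` are hypothetical callables standing in for whatever health checks and paging system you actually run.

```python
import time

def check_service(probe, alert, retries=3, delay=1.0):
    """Run a health probe, retrying before raising the alarm.

    probe: callable returning True when the service is healthy (placeholder).
    alert: callable invoked with a message when all retries fail (placeholder).
    """
    for _ in range(retries):
        try:
            if probe():
                return True  # service healthy, nothing to do
        except Exception:
            pass  # treat a probe crash the same as a failed check
        time.sleep(delay)
    alert("service failed %d consecutive health checks" % retries)
    return False
```

The retry loop matters: you want the alert to fire on sustained failure, not on a single blip, so your team reaches for the recovery checklist only when it’s really needed.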

2. Use checklists

Checklists are an important part of good process. We use them for code deploys. We can use them for disaster recovery too. But how do you create them?

Firedrills! That’s right, run through the entire process of disaster recovery with your team & document it as you go. You’ll be creating the master checklist for real disaster recovery. Keep it in a safe place, or better yet several places, so you’ll have it when you really need it.
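Once the drill has produced a checklist, it helps to make it executable so the next run is repeatable. Here is a minimal sketch; the step names are illustrative examples, not a prescribed recovery procedure.

```python
def run_checklist(steps, log):
    """Walk a recovery checklist in order, recording each step's outcome.

    steps: list of (name, action) pairs, where action() returns True on success.
    log:   list the results are appended to, forming the drill record.
    """
    for name, action in steps:
        ok = action()
        log.append((name, "OK" if ok else "FAILED"))
        if not ok:
            return False  # stop the drill; fix this step before continuing
    return True
```

Because every run appends to the log, each firedrill leaves behind exactly the document the post recommends: a timestamped record of what worked and where the process broke down.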

The news this past week has brought endless images of devastation. Across the metropolitan region, the damage is apparent.

More than once in conversation I’ve commented, “That’s similar to what I do.” The response is often one of confusion, so I go on to clarify: web operations is every bit about disaster recovery and crisis management, just in the datacenter. If you saw Con Edison down in the trenches, you might not know how that power gets to your building, or what all those pipes down there do, but you know when it’s out! You know when something is out of order.

That’s why datacenter operations can learn so much about crisis management from the handling of Hurricane Sandy.

1. Run Fire Drills

Nothing can substitute for real world testing. Run your application through its paces: pull the plugs, pull the power. You need to know what’s going to go wrong before it happens. Put your application on life support and see how it handles. Fail over to backup servers, and restore the entire application stack and its components from backups.
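The pull-the-plug drill can be modeled before you ever touch real hardware. Below is a toy primary/standby pair, an assumed simplification rather than any real clustering software, that shows what you are verifying when you run the drill: that service survives losing the active node, and that a total outage is detected rather than silently ignored.

```python
class Cluster:
    """Toy model of a primary/standby pair for failover drills."""

    def __init__(self, primary_up=True, standby_up=True):
        self.nodes = {"primary": primary_up, "standby": standby_up}
        self.active = "primary"

    def pull_the_plug(self):
        # Simulate losing the currently active node, as in a firedrill.
        self.nodes[self.active] = False

    def serve(self):
        # Fail over to any surviving node before serving traffic.
        if not self.nodes[self.active]:
            survivors = [name for name, up in self.nodes.items() if up]
            if not survivors:
                raise RuntimeError("total outage: restore from backups")
            self.active = survivors[0]
        return self.active
```

A drill against the real stack answers the same two questions this model encodes: does traffic actually move to the standby, and does anyone notice when both nodes are gone?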

2. Let the Pros Handle Cleanup

This week Fred Wilson blogged about a small data room his family maintained for their personal photos, videos, music and so forth. He ruminated on what would have happened to that home datacenter had he been living there when Sandy struck.

It’s a story many of us can relate to, and it points to the obvious advantages of moving to the cloud. Handing things over to the pros means basic best practices will be followed. EBS storage, for example, is redundant, so a single hard drive failure won’t take you out. What’s more, S3 offers geographically distributed redundant copies of your data.

[quote]
Web Operations teams do what Con Edison does, but for the interwebs. We drill down into the bowels of our digital city, find the wires that are crossed, and repair them. Crisis management rules the day. I can admire how quickly they’ve brought NYC back up and running after the wrath of storm Sandy.
[/quote]

3. Have a few different backup plans

Watching New Yorkers find alternate means of transportation into the city has been nothing short of inspirational. Trains not running? A bus service takes their place. L trains not crossing the river? A huge stream of bikes takes to the Williamsburg Bridge to get workers where they need to go.

Deploying on Amazon can be a great cloud option, but consider using multiple cloud providers to give you even more redundancy. Don’t put all your eggs in one basket.

4. Keep Open Lines of Communication

While recovery continued apace, city dwellers below 34th Street turned to text messages and old-school radios for news and updates. When would power be restored? Does my building use gas or steam for heat? Why are certain streets coming back online while others remain dark?

During an emergency like this one, it becomes obvious how important lines of communication are. So too in datacenter crisis management: key people from business units, operations teams, and dev all must coordinate. Orchestrating that is an art all by itself. A great CTO knows how to do this.

Having just spent the last 24 hours in Lower Manhattan while Hurricane Sandy rolled through, I’ve picked up some first-hand lessons on disaster recovery. Watching city and state officials, Con Edison, first responders and hospitals deal with the disaster brings some salient insights.

1. What are your essentials?

Planning for disaster isn’t easy. Thinking about essentials is a good first question. For a real-life disaster scenario it might mean food, water, heat and power. What about backup power? Are your foods non-perishable? Do you have a hands-free flashlight or lamp? Have you thought about communication & coordination with your loved ones? Do you have an alternate cellular provider if your main one goes out?

With business continuity, coordinating between business units, operations teams, and datacenter admins is crucial. Running through essential services, understanding how they interoperate, and knowing who needs to be involved in what decisions is key.

2. What can you turn off?

While power is being restored, or some redundant services are offline, how can you still provide limited or degraded service? In the case of Sandy, can we move people to unaffected areas? Can we reroute power to population centers? Can we provide cellular service even while regular power is out?

[quote]Hurricane Sandy has brought devastation to the East Coast. But strong coordinated efforts between NYC, State & Federal agencies have reduced the impact dramatically. We can learn a lot about disaster recovery in web operations from their model.
[/quote]

Also very important: architect your application to have a browse-only mode. This allows you to serve customers off of multiple webservers in various zones or regions, using lots of read replicas or read-only MySQL slave databases. It’s easy to build lots of read-only copies of your data while there are no changes or transactions taking place.
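At the application layer, browse-only mode often comes down to a small routing shim in front of the database connections. Here is one possible sketch; `primary` and the replica names are placeholder connection handles, not a real driver API.

```python
import itertools

class BrowseOnlyRouter:
    """Send reads round-robin across replicas; refuse writes in browse-only mode.

    primary and replicas are opaque connection handles supplied by the
    caller (placeholders here, not a specific database driver).
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)  # round-robin over read copies
        self.browse_only = False  # flip True during an outage or maintenance

    def read(self, query):
        # Reads always go to a replica, so they keep working in browse-only mode.
        return (next(self.replicas), query)

    def write(self, query):
        if self.browse_only:
            raise RuntimeError("site is in browse-only mode; writes are disabled")
        return (self.primary, query)
```

The point of the flag is that flipping it is a one-line operational action: customers keep browsing off the replicas while the primary is being repaired or restored.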

3. Did we test the plan?

A disaster is never predictable, but watching the city’s emergency services was illustrative of a very good response. They outlined mandatory evacuation zones where flooding was expected to be worst.

In a datacenter, fire drills can make a big difference. Running through them gives you a sense of the time it takes to restore service, what type of hurdles you’ll face, and a checklist to summarize things. In real life, expect things to take longer than you planned.

Probably the hardest part of testing is devising scenarios. What happens if this server dies? What happens if this service fails? Be conservative with your estimates and build in extra time, as things tend to unravel in an actual disaster.

4. Redundancy

In a disaster, redundancy is everything. Since you don’t know what the future will hold, it’s better to be prepared. Have more water than you think you’ll need. Have additional power sources, bathrooms, and a plan B for shelter in case you’re flooded out.

With Amazon’s recent outage, quite a number of internet firms failed. In our view AirBNB, FourSquare and Reddit Didn’t Have to Fail. Spreading your virtual components and services across zones and regions helps, but spreading further across multiple cloud providers, not just Amazon Web Services but Joyent, Rackspace or other third parties, gives you insurance against a failure at any single provider.

Redundancy also means providing multiple paths through the system. From load balancers to webservers, database servers, object caches and search servers, do you have any single points of failure? A single network path? A single place where some piece of data resides?
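Those questions can be turned into a quick audit of your inventory. The sketch below flags any tier with fewer than two instances behind it; the tier names in the example are illustrative, not a required inventory format.

```python
def single_points_of_failure(topology):
    """Return the tiers that would go dark if one instance died.

    topology: dict mapping a tier name to its list of instances
    (a simplified stand-in for whatever inventory system you use).
    """
    return sorted(tier for tier, instances in topology.items()
                  if len(instances) < 2)
```

Running this against a service map makes the audit concrete: a tier like `search` or `cache` with a single box is exactly the kind of quiet SPOF that only shows up during a disaster.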

5. Remember the big picture

While chaos is swirling, and everyone is screaming, it’s important that everyone keep sight of the big picture. Having a central authority projecting a sense of calm and togetherness doesn’t hurt. It’s also important that multiple departments, agencies, or parts of the organization continue to coordinate towards a common goal. This coordinated effort could be seen clearly during Sandy, while Federal, State and City authorities worked together.

In the datacenter, it’s easy to obsess over details and lose sight of the big picture. Technical solutions and decisions need to be aligned with ultimate business needs. This also goes for business units. If a decision is unilaterally made that publishing cannot be offline for even five minutes, such a tight constraint might cause errors and lead to larger outages.

Coordinate together, and keep everyone’s sight on the big picture: keeping the business running.

We’ve all seen cloud computing discussed ad nauseam on blogs, on Twitter, Quora, Stack Exchange, your mom’s Facebook page… you get the idea. The tech bloggers and performance experts often pipe in with their graphs and statistics showing clearly that dollar-for-dollar, cloud hosted virtual servers can’t compete with physical servers in performance, so why is everyone pushing them? It’s just foolhardy, they say.

On the other end, management and their bean counters would simply roll their eyes saying this is why the tech guys aren’t running the business.

There are a lot of components that make up modern internet websites, and a lot of places to get stuck in the mud. Website performance starts with the browser: what caching it is doing, its bandwidth to your server, what the webserver is doing (caching or not, and how), whether the webserver has sufficient memory, then what the application code is doing, and lastly how it interacts with the backend database.

Some of the high-profile companies affected by Amazon’s April 2011 outage could have recovered had they kept a backup of their entire site outside of the cloud. With any hosting provider, managed traditional datacenter or cloud provider, alternate backups are always a good idea. A MySQL logical backup and/or incremental backup can be copied regularly offsite or to an alternate cloud provider. That’s real insurance!
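The logical-backup half of that advice is a one-liner around `mysqldump`. Here is a sketch that just builds the command string; the database and user names are examples, and the offsite copy step (scp, S3, another cloud) is left to your environment.

```python
import shlex

def dump_command(database, user, out_file):
    """Build a mysqldump invocation for a logical backup.

    Credentials handling and the offsite transfer are deliberately
    out of scope; this only assembles a safely quoted command line.
    """
    cmd = [
        "mysqldump",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        "--user=%s" % user,
        database,
    ]
    return " ".join(shlex.quote(part) for part in cmd) + " > " + shlex.quote(out_file)
```

Schedule something like this from cron, then copy the resulting file offsite or to a second provider, and you have the insurance the paragraph above describes: a restore path that does not depend on any one vendor staying up.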