San Francisco is DOWN: The Fragility of Web 2.0 Ecosystem – Common Sense Must Not Have Made the Feature List

I was just leaving the office for a client dinner last night when I noticed I
couldn’t get to my TypePad blog, but I chalked it up to a
"normal" Internet experience.

When I fired up Firefox this morning (too much wine last night to care) I was surprised to say the least.

I am just awestruck by the fact that yesterday’s PG&E power outage in San Francisco took down some of the most popular social networking and blogging sites on the planet. TypePad (and associated services), Craigslist, Technorati, Netflix, etc. … all DOWN. (See the bottom of this post for a most interesting potential cause.)

I’m sure there were some very puzzled, distraught and disconnected people yesterday. No blogging, no Second Life, no on-line video rentals. Oh, the humanity!

I am, however, very happy for all of the people who were able to commiserate with one another, as they apparently share the same gene that renders them ill-prepared for one of the most common causes of downtime on the planet: power outages.

Here’s what the TypePad status update said this morning:

Update: commenting is again available on TypePad blogs; thank you for your patience. We are continuing to monitor the service closely.

TypePad blogs experienced some downtime this afternoon due to a
power outage in San Francisco, and we wanted to provide you with the
basic information we have so far:

The outage began around 1:50 pm Pacific Daylight Time

TypePad blogs and the TypePad application were affected, as well as LiveJournal, Vox and other Six Apart-hosted services

No data has been lost from blogs. We have restored access to blogs as well as access to the TypePad application. There
may be some remaining issues for readers leaving comments on blogs; we
are aware of this and are working as quickly as possible to resolve the
issue. (See update above.)

TypePad members with appropriate opt-in settings should have
received an email from us this afternoon about the outage. We will
send another email to members when the service has been fully restored.

We will also be posting more details about today’s outage to Everything TypePad.

We are truly sorry for the frustration and inconvenience that
you’ve experienced, and will provide as much additional information as
possible as soon as we have it. We also appreciate the commiseration
from the teams at many of the other sites that were affected, such as
Craigslist, Technorati, Yelp, hi5 and several others.

I don’t understand how the folks responsible for service delivery at these sites, given the availability and affordability of on-demand technology and hosting, don’t have BCP/DR sites or load-balanced, distributed data centers to absorb a hit like this. The management team at Six Apart has experience at companies that understand that the network and connectivity represent the lifeblood of their existence; what the hell happened here that there’s no contingency for power outages?

Surely I’m missing something here.

Craigslist and Technorati are services I don’t pay for, so one might suggest taking the service disruption with a grain of SLA salt (or not, because it still doesn’t excuse not preparing for issues like this with contingencies) but TypePad is something I *pay* for. Even my little hosting company that houses my personal email and website has a clue. I’m glad I’m not a Netflix customer, either. At least I can walk down to Blockbuster…

Yes, I’m being harsh, but there’s no excuse for this sort of thing in today’s Internet-based economy. It affects too many people and services, and it really does show the absolute fragility of our Internet-tethered society.

Common sense obviously didn’t make the feature list on the latest production roll. Somebody other than me ought to be pissed off about this. Maybe when Data Center 3.0 is ready to roll, we won’t have to worry about this any longer 😉

/Hoff

Interestingly, one of the other stories of affected sites relayed the woes of 365 Main, a colocation company whose generators failed to start when the outage occurred. I met the CEO of 365 Main when he presented at the Interop data center summit on the topic of flywheel UPS systems, which are designed to absorb the gap between failure detection and GenStart. This didn’t seem to work as planned, either.

You can read all about this interesting story here. This was problematic because the company had just issued a press release about a customer’s 2-year uninterrupted service the same day 😉

Valleywag reported that the cause of the failure @ 365 Main was a drunk employee who went berserk! This seemed a little odd when I read it, but check out how the reporter from Valleywag is now eating some very nasty crow … his source was completely bogus!


I am not a customer of any of the sites affected but I do share the level of frustration about the complete lack of fail over and disaster recovery processes.
However – I do take issue with the Netflix comment on a number of different levels. You may be able to walk down to your Blockbuster, but will it have what you want in stock? And Netflix isn't a communication service – it mails DVDs around, which already have a built-in delay, so some downtime is not really an issue. My DVDs arrived today with or without the main website up. And lastly, the Netflix downtime wasn't caused by the issue in San Francisco. They were updating their pricing system (incidentally, to lower prices) and "something went wrong". That in and of itself is a problem, but it's not related to your thesis.

Sorry, but Netflix uses the Internet as the main portal for order processing. If you tried to connect to them yesterday, you couldn't.
As to the issue of what's in stock @ BB … I maintain that when you combine BB's brick-and-mortar stores with the Internet-based service, you have a backup plan. What happens when Netflix goes bye-bye?
You watch re-runs.
Netflix *is* a communication service…just like FedEx's business competency isn't about package delivery…it's about saving time.
If, indeed, the Netflix outage was due to their software update and was coincidentally at the same time of the failure, I will retract the statement. At this time, all of the wire services and on-line blogs still include Netflix in the outage group caused by the power loss.
Thanks for the ping/heads-up. I'll monitor the wire to see if the story changes.
/Hoff

A couple buddies of mine work for a large colo provider out of 365 Main and said it was an all-hands-on-deck cluster fsck. Linked off of your articles is an updated press release from 365main (with the RedEnvelope two-year uptime press release removed) that talks about how the generators didn't kick in: http://www.365main.com/status_update.html
I too had the same thoughts as you did, especially as I was using Craigslist at the time to search for a new apartment here in the Boston area. I completely agree; for such a large "part of the Internet" to be hosted without some sort of DR/replication site elsewhere seems almost nonsensical. Hopefully there will be a minimum of pink slips for those involved, but I am sure DR plans are being devised as we speak.

You're serious? Really, or is it just a joke?
DR/BCP costs a lot of money. Having realtime database replication, hot standby servers, or warm servers ready to be booted up at a moment's notice, etc. … they all cost a lot of money.
Most large organizations that actually do SaaS for large fees don't even have hot-spare datacenters and realtime database replication. It's pretty darned expensive.
You'll be hard pressed to find anyone other than stock exchanges and banks that have realtime failover and/or an RTO of less than 4 hours. Most companies don't have anything even close to that in their DR/BCP plans. It just isn't worth spending the money on, given the frequency of the event and the loss associated with it.
Do you know something about the costs of these things I don't?

You're catching me at a weak moment, Andy…which is good for you. I'd ordinarily take this comment and call it out as a separate post, but I just had lunch and I'm in a good mood.
My only response to your question regarding whether I was serious or not is as follows:
1) "DR/BCP costs a lot of money."
Sure, it can. So does being completely off the grid for 5 hours. Again, my focus was on the service companies who make a living by feeding off the web, like Six Apart. If they have no BCP/DR plan to deal with something as common as a power failure, that's idiotic.
2) "Most large organizations that actually do SaaS for large fees don't even have hot-spare datacenters and realtime database replication. It's pretty darned expensive."
Please send me proof of this statement. Having worked for, catered to and consulted with many of the leading SaaS companies, I can tell you that your statement is false. Cost is relative. If you assess risk appropriately, then you can justify the cost appropriately, too.
…combine that with a ubiquity of GSLB solutions, cheap hardware, open source software, etc., and your statement just doesn't hold water.
3) "You'll be hard pressed to find anyone other than stock exchanges and banks that have realtime failover and/or RTO of less than 4 hours"
Well, I guess I'm good at being hard pressed then, because I easily run out of fingers and toes without breathing hard while counting the organizations I am privy to that are not banks/exchanges and that do. GSLB and active/active load balancing to distributed sites, used in conjunction with services like Akamai, mean you can do this … and reasonably so from a cost perspective.
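To make concrete what I mean by GSLB steering traffic only to live sites, here's a rough Python sketch of the health-check logic; the site names, addresses, and TCP-connect probe are all hypothetical, not anyone's actual configuration:

```python
# Hypothetical sketch of health-check-driven GSLB: a monitor probes each
# datacenter, and the resolver hands out addresses only for sites that pass.
# Site names and IPs (TEST-NET range) are made up for illustration.
import socket

SITES = {
    "sf-datacenter": "203.0.113.10",
    "va-datacenter": "203.0.113.20",
}

def site_is_healthy(ip, port=80, timeout=2.0):
    """Probe a site with a simple TCP connect; real GSLB uses richer checks."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def resolve(healthy_by_site):
    """Return addresses of healthy sites only, so clients never see a dark DC."""
    return [ip for site, ip in SITES.items() if healthy_by_site.get(site)]
```

Real GSLB appliances do this at the DNS layer with far richer health checks and weighting, but the core decision logic really is about this simple.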
4) "Most companies don't have anything even close to that for their DR/BCP plans. It just isn't worth spending the money on given the frequency on the event and the loss associated."
I have no response to that statement…really. I'm not talking about mom and pop shops here…generalizing is diluting your argument.
5) "Do you know something about the costs of these things I don't?"
On some fronts, it seems so.
/Hoff

Not trying to start a flamefest 🙂
A few points.
1. If these folks didn't have a DR plan, that is a bad thing.
2. We don't know how many people were in the process of activating a backup facility.
3. Having realtime database replication is expensive for any decently sized database. There are the pipe costs, the replication overhead on the database, etc.
4. A lot of people use places like Sungard for their DR with capacity on demand rather than a hotsite for failover. Especially people without guaranteed uptimes, contractually guaranteed SLAs, etc.
So, I'm not really surprised that Netflix and Craigslist didn't have this handled. I'm a little surprised that a few other people weren't more resilient. But again, it's a question of how much revenue they lost by being down vs. the cost of the immediate failover. At least the Craigslist guys put up pages to indicate they were down, and they had the capability to do that.

@Andy:
It seems that you're stuck on hotsite backup (active/standby) versus the ability to do GSLB across an active/active site configuration. Even one scaled down with degraded service.
Not arguing that it doesn't cost money to do any of this, but the reality is that for companies that live and die by availability, it's hard for me to swallow that companies like Netflix and Six Apart can afford to be down … and not just from the perspective of the revenue hit; how about the impact on reputation?
There are so many architectural and technology options these days for replication (and I've done a TON of them, including database replication) and load balancing that making generalizations about how or why folks don't "do" BCP/DR strikes me as very short-sighted.
You work in financial services. You know how regulatory requirements look at risk-based criticality to help determine what needs to be backed up, how long critical assets can (or can't) be out of service, and what the impact is.
Nobody said that sites like Six Apart or Netflix need to be 100% up in terms of functionality, but even near-time redirection to a snapshotted service would be better than being down for hours.
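For what it's worth, the snapshotted-fallback idea is conceptually trivial. This is a hypothetical Python sketch of "degraded but up" (none of these names correspond to anything Six Apart or Netflix actually ran):

```python
# Hypothetical "degraded but up" sketch: serve a cached snapshot of a page
# when the primary backend is unreachable. All names here are illustrative.
snapshot_cache = {}

def fetch_from_primary(path):
    """Placeholder for a real backend call; here it simulates a dark datacenter."""
    raise ConnectionError("primary datacenter unreachable")

def serve(path):
    try:
        page = fetch_from_primary(path)
        snapshot_cache[path] = page          # refresh the snapshot on success
        return page, "live"
    except ConnectionError:
        if path in snapshot_cache:
            return snapshot_cache[path], "stale-snapshot"
        return "Service temporarily unavailable", "down"
```

Stale content with a banner beats a dead socket; the hard part is keeping the snapshots fresh and redirecting traffic to them, which is exactly what GSLB is for.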
Let's agree to disagree and move on.
/Hoff