Saturday, February 04, 2006

Lowered uptime expectations?

Recently, a lot of popular websites seem to be having long planned and unplanned downtimes.

Google's Blogger has been taking many planned and unplanned outages over the last few weeks. Yahoo My Web just announced they're going down "for a few hours." Salesforce.com just had multiple outages, including one that lasted almost a day. Gap decided to close Gap.com, OldNavy.com, and BananaRepublic.com for over two weeks. Bloglines took an outage instead of trying to switch to a new data center without any interruption of service. Technorati did something similar a year ago, but took a longer weekend outage for the move. And there are many, many other examples.

Back in the late 1990s at Amazon, I remember we used to think of any downtime as unacceptable. Code was always written to do a smooth migration, old and new boxes interoperating to keep operations seamless. If downtime was taken, it was very early in the morning and short, minimizing the impact to a smattering of insomniacs and international users.

Lately, this view seems downright quaint. Sites are taken down casually for long periods of time. The Gap.com example seems particularly egregious.

Perhaps this is part of a general decline in quality of service. When you outsource customer service to someone who cares even less than you do, when you treat customers with neglect that borders on hostility, perhaps taking downtime is just part of the package.

It is true that customers seem to have grown to accept these outages as the norm. But maybe we should demand more from web companies.

10 comments:

I disagree. I don't think it's about outsourcing customer service, or not caring about customer experience.

I think it's about cost and competitiveness. Companies want to be nimble and frugal.

For planned downtime, this means supporting architectural changes as often as developers think necessary; and the less downtime you plan on, the more money it will cost you.

For unplanned downtime, this essentially works out to getting by on the bare minimum necessary to support your customers...so when something fails, it takes you out.

At Webshots, we planned a 24-hour downtime in mid-January and notified our users well ahead of time. We were only down for 12, but I still wish I could have fought for less downtime, and for having it done in the middle of the night. That would have cost more money, though. (Our last planned downtime was 2 years ago, when we moved data centers. Also a 12-hour downtime, but for most of that we were still up in a read-only mode. We also spent more money.)

I also think it's the cool thing to do. Every competitor you have that takes their site down for some period of time makes it that much more tolerable--if not expected--that you do the same. After all, this is "Web 2.0." Everybody's doing it, man. Get with the game, man. You a hipster or a loser?

How long do you think it takes before one planned downtime event becomes, internally, justification for another? Answer: not long.

Downtimes are really not acceptable because, except in very rare circumstances, they have absolutely no good reason behind them.

It's not like brick and mortar, when you close up shop to remodel. Yes, it may be more money sometimes, but with products like Xen where the entire server can be segmented into individual components, and each component can be transferred seamlessly in milliseconds from one physical server to another on the other side of the world, downtime is really the sign of a sloppy IT plan.

Imagine if a hospital planned a two-week computer downtime. Unacceptable. If you are moving to a new datacenter, the only switchover downtime should be the database, and that can be done in real time by a competent IT team (not small sites, but corporate sites) that even has experience in the rather "mundane" practice of load balancing, and then stepping over a Gb line with Xen... it's really not brain surgery. All it requires is bandwidth, not weeks of uptime at the old datacenter. Google, arguably the largest private operator of data centers, doesn't have downtime. Why? They actually continually operate like this. Two things would survive a nuclear war... cockroaches and Google datacenters. (I just wish Blogger shared their parent's technological advantages.)

Then again, I think the world was better off without Gap and Oldnavy for a few weeks. Haha!

The Web site of the company that holds my 401(k) (John Hancock) is mostly accessible 24 hours a day -- except you can't calculate your rate of return during nights or weekends. This seemingly intentional behavior has long mystified me. It's not as if the code that calculates this information needs a vacation!

It just boils down to sloppy IT. IT folks, hold your horses, don't flame me just yet. Hear me through:

IT sloppiness depends on many factors, such as:

1. Lack of competent IT staff
2. Lack of commitment in IT

In a sense 1 is also dependent on 2, but let's say a company commits to IT, but the staff hired are just not quite there. Then you have the scenarios mentioned. More and more planned and unplanned downtime.

But this can also be due to a general drop in Best Practices in the IT industry. I wish I could quote figures like 8 out of 10 etc., but I can't and won't make up figures. But many companies do not have a proper SDLC in place. Apart from their notebooks, their Live server is their test server! For that matter, some may wonder out loud, "Why do we call it 'Live' server or going 'Live'?" Their IT plans have all but left out the staging server.

But why is there a lack of competent IT folks for hire, or a general drop in Best Practices? This leads us to point 2: lack of commitment in IT.

Most companies think of IT as an afterthought. And they should, shouldn't they? What with commoditization of hardware, software ... and even developers (think outsourcing). The upside with commoditization is that standardization must be in place beforehand, and hence it means fewer unknowns. The fewer unknowns part is the upside. This means lower costs. Which is another upside. But over time, this can lead to using IT as the budget belt; whenever there is a budget cut, IT gets the squeeze. End result? When IT plans for long-term strategy get cut, out go the staging servers. When Marketing and Sales folks determine the SDLC, you can only imagine what will transpire. You end up not being able to plan for much. Firefighting sets in and downtime becomes part and parcel, planned or unplanned.

I'm probably over-simplifying things, since office politics, social factors, drastic changes in technology over the last 5 to 10 years, etc. have changed many things. But I think that about sums it up.

In the end, does it mean IT folks are doomed to deteriorate to plumber or janitor level? Not necessarily, as can be seen in the proficiency of companies like Google and the like. :D