Monday, December 1, 2008

All This Hardware And No Uptime

SharePoint is pretty heavy. I often think of it as an 800-pound gorilla that stopped exercising and let itself go. To handle all the services that run within a farm and still provide decent response times to users, a fair amount of hardware usually gets provisioned to pick up the slack.

I'm talking about real iron here. Large farms featuring clustered SQL Servers, redundant application servers, and a series of web front ends balanced with either a network load balancer or a Microsoft NLB cluster.

One might look at all this gear and think that, as a result, the farm is almost guaranteed to enjoy some pretty high availability, right? Well, I guess that depends on what you call high availability.

A Desire for High Availability

                 Total downtime (HH:MM:SS)
Availability     per day       per month     per year
99.999%          00:00:00.4    00:00:26      00:05:15
99.99%           00:00:08      00:04:22      00:52:35
99.9%            00:01:26      00:43:49      08:45:56
99%              00:14:23      07:18:17      87:39:29

For clients running SharePoint internally, uptime is probably important but not a huge priority. Clients that use SharePoint for internet-facing applications are another matter. These users are far more likely to use SharePoint to buttress e-commerce offerings or brand efforts, and these businesses usually want high availability, sometimes asking for four to five nines of uptime (99.99%-99.999%). Five nines (sometimes called the holy grail of uptime) equates to being down no more than 5 minutes and 15 seconds a year. It's a bold proposition and requires a great deal of planning and forethought. Still, if it's doable, all this hardware should set you in the right direction, right?
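Those downtime figures fall straight out of the arithmetic: the permitted downtime is just the period length times the fraction of time you're allowed to be unavailable. A quick sketch (the helper name below is just for illustration):

```python
def downtime_allowed(availability_pct, period_seconds):
    """Seconds of downtime permitted at a given availability over a period."""
    return period_seconds * (1 - availability_pct / 100)

YEAR = 365 * 24 * 3600  # 31,536,000 seconds

# Five nines over a year: ~315 seconds, i.e. roughly 5 minutes 15 seconds.
print(round(downtime_allowed(99.999, YEAR)))  # -> 315

# Three nines over a year: about 8.76 hours, matching the table above.
print(round(downtime_allowed(99.9, YEAR) / 3600, 2))  # -> 8.76
```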

Heavy Patches, And Lots of Them

The biggest hurdle I've had in providing high availability with SharePoint has come from the patching procedures issued by Microsoft. Normally when updating applications/machines, it's possible to update one machine at a time, using your load balancer to shelter that machine from production. Once the machine has been updated, you can bring it back into production and start updating one of its siblings. With SharePoint this process gets a little more complicated. Here are a couple of the reasons:

There's no uninstall/rollback for most SharePoint updates (your best bet for uninstall is a machine level backup).

The recommended install procedure dictates that you stop all IIS instances on the web front ends. This makes it difficult to continue providing service, or at the very least to put up a stall/maintenance page.

There's usually at least one machine in the farm that rejects the upgrade and needs to be troubleshot individually. For me this has often meant removing the machine from the farm, upgrading it, and then adding it back to the farm. This usually adds to server downtime, especially if the server was serving a key role (e.g. the SSP host or the machine that hosts the Central Administration web site).
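For contrast, the normal one-machine-at-a-time process described above can be sketched roughly like this. Everything here is a hypothetical stand-in (the callback names are placeholders for whatever your load balancer and patch tooling actually expose); SharePoint's stop-all-IIS requirement and lack of rollback are exactly what break this model:

```python
def rolling_patch(front_ends, lb_disable, lb_enable, apply_patch, health_check):
    """Patch one node at a time behind a load balancer.

    lb_disable/lb_enable: drain a host from / return it to rotation.
    apply_patch: run the update on a host (may raise on failure).
    health_check: True if the host looks healthy after patching.
    """
    patched, failed = [], []
    for host in front_ends:
        lb_disable(host)              # shelter this machine from production
        try:
            apply_patch(host)
            if health_check(host):
                lb_enable(host)       # back into rotation, move to a sibling
                patched.append(host)
            else:
                failed.append(host)   # leave drained for manual troubleshooting
        except Exception:
            failed.append(host)
    return patched, failed
```

With SharePoint, step one of the loop already fails: the recommended procedure takes every front end's IIS down at once, so there is no "sheltered" node still serving traffic.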

Assuming you manage to make it through all of the above without a lot of downtime, how many times a year do you think you could do it and still maintain a reasonable downtime SLA? Before you answer, consider all the updates that have come down the pipe for WSS since its RTM (it's SharePoint 2007, remember). And that's just the list of updates for WSS; there's a whole other table for MOSS (although most of the dates and versions coincide).
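One way to frame that question: divide the annual downtime budget across the number of patch events. The helper below is just an illustration, and it optimistically assumes patching is your only source of downtime (hardware failures, SQL maintenance, and everything else come free):

```python
def window_per_patch(availability_pct, patches_per_year):
    """Hours of downtime budget per patch event, assuming patching is the
    ONLY source of downtime (optimistic -- nothing else ever fails)."""
    year_hours = 365 * 24
    budget_hours = year_hours * (1 - availability_pct / 100)
    return budget_hours / patches_per_year

# At 99.9% with, say, four update cycles a year, each window gets ~2.19 hours.
print(round(window_per_patch(99.9, 4), 2))      # -> 2.19

# At 99.99%, the same four cycles leave about 13 minutes per patch event.
print(round(window_per_patch(99.99, 4) * 60))   # -> 13
```

Thirteen minutes is not much room for stopping IIS farm-wide, running the upgrade, and troubleshooting the one machine that inevitably rejects it.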

Don't get me wrong, updates are good. In fact, I like it when Microsoft fixes things, especially when the clients who have purchased MOSS have already paid potentially millions in licensing fees. I just wish these updates, which arrive many times a year AND provide critical fixes to expected functionality, came with better upgrade strategies.

Do SharePoint updates and the way in which SharePoint farms are upgraded make high availability a pipe dream? Does all that hardware do nothing except help the farm scale out?

A Little Transparency

In fact, all I'm really looking for from these updates is a little transparency. I'd be thrilled to get a little more detail about what's going on under the hood and what to do when the error log starts filling up.

I've yet to see a really good troubleshooting strategy or even deployment strategy that gives you good odds of limiting downtime when it comes time to roll out these upgrades.

We have a ticket open with MS support to take up this issue. The wait for SharePoint-related issues is still pretty long, but rest assured, should I come up with a strategy or find a good resource for these kinds of rollouts, you'll find it here.

2 comments:

Denny Lane
said...

Tyler -

Very interesting observations.

I am with Stratus Technologies, and we focus on providing continuous availability solutions (beyond HA). Our fleet of x86 systems running Windows, RHEL, and VMware has an average 99.9999% uptime across both the hardware and the off-the-shelf OS.

Traditionally these systems have been deployed for your most mission critical apps, but with prices starting under $15K more people are using them to replace the complexity (and lower app availability) of clusters. SharePoint is one of those apps people are doing this with.

We also have addressed part of the patching dilemma you discussed with a feature called "Active Upgrade". For planned outages you can remove the system from "lock step", patch one side while the other side continues to transact business, test to determine if you like the results of the patch, then "glue the system back together" or roll it back. Much faster and safer than doing it via cluster failover/failback.

Denny, I'm curious how this feature "Active Upgrade" would take machines out of lock step. If the upgrade is changing the schema of SharePoint content databases and config databases how do you go about continuing to offer services?

Wouldn't the new schema called with old web front ends yield errors?

Even if you temporarily swapped out the database server, we'd still have a data sync issue when we swapped back. And even worse, it would be across different schemas (the newly upgraded one [with old data] and the old one that we kept to provide availability [with new data]).

About Me

Tyler Holmes is a Solutions Architect working in Portland, Oregon. He lives mostly in the MS tech stack and is currently treading the waters of Communication/Collaboration and Business Intelligence with off the shelf/open source technologies.