It seems that today is going to be one of those days where I get lost in forums and blogging – I can live with that :-)

One of the questions that came up on a forum today was about choosing an HA solution – based solely on the hardware that was running the database! Given that single piece of info, it’s impossible to come up with any kind of sensible answer. The other thing I see a lot is someone saying ‘just use a cluster’ – well, if you’re trying to protect against damage to the data, just using a cluster won’t do it because of the single-point-of-failure in a failover cluster – the shared disks.

So where do you start? The key to choosing an HA solution is to work out your requirements first and then choose a technology that allows you to meet as many of them as you can, within your available budget. Here are some of the questions I like to ask (not an exhaustive list):

What is the maximum application downtime SLA (service-level agreement)? In other words, if a disaster happens, how long can the application be off-line while failover occurs or the disaster is fixed?

What is the maximum acceptable data-loss SLA? If a disaster happens, how much can you afford to lose in terms of data or work? You might require up-to-the-minute recovery, for instance, or you might be able to cope with losing the last day’s worth of transactions.

What are you trying to protect? Site, server, instance, database, filegroup, partition, table, group of tables?

What is the transaction log generation rate of your workload? If it’s very high, that means you’re going to have problems with backing up the log and with getting the transaction log over to your redundant server/site.

What recovery model are you running your database(s) in? If you’re in SIMPLE, then you can’t get point-in-time recovery and so you’re looking at losing all the work since your last full (or differential) backup, and it also means you can’t use any of the HA technologies that rely on the transaction log, such as log shipping or database mirroring.
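If you’re not sure what recovery model your databases are in, a quick way to check is the sys.databases catalog view (available from SQL Server 2005 onward):

```sql
-- Show each database and its current recovery model.
-- Log shipping needs FULL or BULK_LOGGED; database mirroring
-- needs FULL -- SIMPLE rules both out.
SELECT name, recovery_model_desc
FROM sys.databases
ORDER BY name;
```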

What’s your current backup strategy? If the answer is ‘what backup strategy?’ then you’ve got bigger problems than just getting an HA solution in place…

Are you trying to achieve site-level redundancy? If so, do you have a second site? Where is it? Does it have the same protection as the main site (in terms of security, HVAC, power, etc.)?

What’s the network bandwidth and latency to the second site? If your transaction log generation rate is MBs/second, but your second site is 2000 miles away through a 720KB/second link, you’re not going to be doing any kind of HA solution involving the second site that comes close to your downtime and data-loss requirements…
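To make that concrete, here’s the back-of-the-envelope arithmetic as a T-SQL sketch (the log generation rate is an illustrative assumption, not a measurement):

```sql
-- Illustrative only: compare log generation rate to link throughput.
DECLARE @log_gen_kb_per_sec int;
DECLARE @link_kb_per_sec    int;
SET @log_gen_kb_per_sec = 2048;  -- assumed: generating 2MB of log per second
SET @link_kb_per_sec    = 720;   -- the 720KB/second link from above

SELECT
    -- Every second, this much log is generated that can't be shipped
    (@log_gen_kb_per_sec - @link_kb_per_sec) AS backlog_kb_per_sec,
    -- After an hour the redundant site is this far behind (in MB) --
    -- with these numbers, over 4.5GB of potential data loss
    (@log_gen_kb_per_sec - @link_kb_per_sec) * 3600 / 1024 AS backlog_mb_after_one_hour;
```

If the backlog number is positive, the second site can never catch up, no matter which log-based technology you pick.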

What’s the hardware at the second site?

Can you alter the application at all? If you can’t alter the application then you may have a hard time getting it to fail over gracefully to a redundant server. You also won’t be able to use explicit redirection with database mirroring.
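For reference, explicit redirection with database mirroring is specified in the client connection string using the SqlClient Failover Partner keyword (the server and database names below are made up):

```
Data Source=PrincipalServer;Failover Partner=MirrorServer;
Initial Catalog=SalesDB;Integrated Security=True;
```

If you can’t change the application’s connection string, the client has no way of knowing where the mirror is.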

What’s the application eco-system? In other words, what else has to fail over so the application can run properly?

All of these figure into the choice of HA solution. Work these out, prioritize them, and then evaluate HA technologies (or combinations of technologies) to see which requirements you can meet. Don’t just jump at failover clustering first!

Over the next few months I’ll be posting more on designing for high-availability – let me know if there’s anything in particular you want to see.

One Response to HA: Where do you start when choosing a high-availability solution?

We’re a pretty small company setting up a DR site a few states away from our main headquarters. We’re doing near-real-time replication of SQL and two Exchange boxes (more servers to come) using a product called DoubleTake. It seems to be working as advertised for us. But the thing that I really wanted to comment on was the Riverbed WAN accelerators. We are seeing improvements of about 6-7x in our WAN traffic (ours is a 3Mb connection). The Riverbed boxes are expensive, but it sure does make the bandwidth go a lot further.