I have 6 instances of sql server that I need to be highly available by being hosted on two sites (one as principal, one as DR). Transparent automatic failover is required. Most are fairly boring and don't do a lot, two are heavily used for writes.

The busiest databases process on average 400,000 tran/min, peaking at 900,000 and generate ~10 gig a day of new data each.

Both sites are the same, both have netapps 15k SAS iscsi sans.

Currently we have win server 2003, sql server standard 2005. Sync mirroring is too slow, and adds 2ms to each transaction.

I've managed to persuade the customer to pay for windows server 2008 and sql server enterprise 2008 with the view of possibly having a failover cluster (with one node in each cluster) between the two sites with netapps doing the cross site replication.

Pretty much agree with all of the answers here. There's no silver bullet for this stuff on a Windows platform. All options have downsides and you'll have to decide what is acceptable to you and what is not. BTW, are you specifically tied to using MSSQL? It's certainly not as robust with instant failover/clustering features, as say, MYSQL.
– TatasFeb 15 '11 at 18:49

3 Answers
3

If truly transparent automatic failover is required and you can't handle a 2ms overhead per transaction, your best option is probably bidirectional replication with a very good load balancer front end (like F5 Big-IPs). That way both SQL Servers are available at all times.

There's a lot of drawbacks associated with bidirectional replication - schema changes can be tricky, you have to rework your identity fields (different seeds, odd/even increments), and it's not self-healing. You need monitoring to know when replication goes down, and you need a 24/7 team to act fast when it does, because customers won't be happy if they don't see consistent data across the two nodes. Certain failures can also cause data to be on one node and not the other. For example, if replication stops working, and then 10 minutes later the whole server goes down, the second server won't see that 10 minutes of data until you get the primary back up and fix replication.

What if I say that some, say 5 min, data loss would be acceptable in a fail over? Provided that the failover was still transparent.
– BlootacFeb 15 '11 at 14:23

@Blootac - just to be clear, you're saying you can't tolerate 2ms of query slowdown, but you can tolerate 5 minutes of data loss? That seems a little odd, but in that case, log shipping is probably the easiest solution. Set up the client's connection strings to use the database mirroring failover partner option, and they'll automatically try to connect to the secondary server. You'll need to script something to do a RESTORE WITH RECOVERY on the secondary server if the primary has been down more than a couple of minutes.
– Brent OzarFeb 16 '11 at 13:15

You have no idea how odd my project is... its more important for the database to be available than it is for it to be totally intact. I didn't realise you could use the failover partner connection string attribute for databases not configured as mirrored...
– BlootacFeb 18 '11 at 9:06

OK, if it's more important to have availability than have it intact, I'd go the bidirectional replication route, but just know that there are a lot of ways it can break. Fortunately, it usually breaks "up" - both sides still accept reads & writes while it's broken, but the changes just aren't making it to the other node.
– Brent OzarFeb 18 '11 at 10:57

Nothing is 100% transparent. Failover clustering has downtime associated during the stop and start on the other node. If the app is cluster aware and/or has retry logic, not a problem. The instance name remains the same, so on that level clustering is transparent.

Replication and log shipping have different server names (source/destination), so you need to employ some sort of technology/alias/whatever to abstract the name change and/or be able to change the application configuration (assuming they didn't hardcode names). Database mirroring has a somewhat similar story, but if the app is coded to SNAC you may be able to use the automatic failover with Witness.

DBM/log shipping/repl also require you to sync objects not outside the database and make sure the standby has everything it needs to be running (including logins at the instance level).

So only failover clustering and DBM with a Witness and high safety will give you automatic failover. That doesn't mean transparent necessarily.

There is no one absolute right way to do this. It's based on your requirements (including overall SLAs, RTOs, and RPOs).

If you are having issues with mirroring, it could be I/O and/or network related. It may have nothing to do with SQL. So as you look at a new architecture, make sure you evaluate all levels of the solution.

Allan - why wouldn't replication combined with a geographic load balancer (like an F5 Big-IP) work? Clients connect to a DNS name that points to the Big-IP, and that device automatically routes them to whatever server is up. That's totally transparent to the clients.
– Brent OzarFeb 16 '11 at 13:17

I suggest that you look at using a file share witness in conjunction with the clustering in Windows Server 2008 (I can't remember off of the top of my head if you need R2 or not.) I haven't personally done this, but I spent some time looking at it last summer.

If one believes the marketing and Netapp has the proper software, it should provide site-to-site failover. Whether or not it can handle those transaction volumes is a worry, but the onus would be on the netapps and whatever connects them together.

Note that "transparent automatic failover" might not be achievable without support from the user applications themselves. If you are already doing mirroring, this might not be an issue.