Thursday, April 15, 2010

SQL Server Database Mirroring Woes

I'm a huge fan of SQL Server's database mirroring concept. We've been using it on our application (60GB DB over 220 tables, 10's to 100's of millions of rows) for almost 3 years on SQL 2005. Log shipping has its place (it's pivotal to our offsite disaster recovery plan), and clustering is great if you have a huge replicated SAN, but, at least on paper, DB mirroring is the lowest-cost and most approachable option. In reality however, it has some warts.

We started out with synchronous mirroring in a high safety + witness configuration. This is great, as we could easily take down the primary DB server for maintenance during "slow" periods with minimal effect on the running application (a few transactions might fail, which we recover from gracefully). As our database grew, though, we started seeing massive system slowdowns during peak usage periods. Investigation showed that the lag was coming from the commit overhead on the mirror, which might grow to 30s or more causing timeouts (high safety mode requires that the transaction be fully stored on the mirror server before returning control to the client). More investigation revealed that the disk write I/O on the mirror server's data volume was between 10x-500x the principal, which outstripped the disk array's ability to keep up. With a lot of angry customers and idled operators waiting around, we didn't have a lot of time to invest in root-cause analysis, so we switched over to asynchronous mirroring to keep the doors open (async mirroring doesn't hold up the client transaction waiting for the log to copy to the mirror). Luckily, Microsoft Startup Accelerator (now Bizspark) had hooked us up with SQL Enterprise licenses, so async mirroring was an option for us- it's not on SQL Standard! With async mirroring, a catastrophic loss of the primary server pretty much guarantees some data loss, so it's not ideal.Awhile back, we upgraded all our DB server storage to SSDs in a RAID10 config, resulting in a massive performance boost on our aging hardware. We figured this would allow us to go back to synchronous mirroring mode with no problems. While not as severe, we still experienced painful slowdowns during peak write periods, and had to switch back to async mirroring again. Even with async mirroring, the write volume to the mirror data disk was still consistently hundreds of times that of the primary. As we hadn't planned for these ridiculous mirror write volumes, we were starting to worry about our mirror server's SSDs burning out prematurely (SSDs have a limited write volume before the flash cells start to fail).

Flash forward to last month- we've purchased spanking new 12-core DB servers with the latest and greatest SSDs in RAID10, 64G of memory, and SQL 2008 on Windows Server 2008R2. We wanted to spend the time to get high safety synchronous mirroring in place again, so we wrote a little simulator app to see if SQL 2008 on our new servers had the same nasty I/O issues. It did. On average, the data write volume was constant, and 250-500x higher on the mirror (writing constant 3-7MB/s 24/7 is a quick death sentence for an SSD rated at 5GB/day for 5 years)!

Time to call in Microsoft. After explaining the situation, the first response was "as designed". Really? Our write volumes aren't all that high, so if this is true, I have a hard time believing that database mirroring is useful on a database of any size. In any case, had we gone live this way, our mirror machine's SSDs would've been shot within a matter of months. After an initial call of "BS!", I got a little more detail: apparently SQL Server not only ships the log data over in real-time, it also performs recovery on the DB for every transaction to minimize the failover time (which IS nice and snappy, usually <1s). Turns out, there is an undocumented trace flag that disables the per-transaction recovery process, at the cost of a higher failover delay. This sounded like exactly what I needed. So what is this magic trace flag?

DBCC TRACEON(3499, -1)

This should be run on both the primary and mirror DBs, since they switch roles during failover. It worked exactly as advertised for us. The mirror server's disk I/O was now in lock-step with the primary, and we could once again use full-safety mirroring with a witness. The failover times were definitely increased, but in our testing, they're still sub-10s, which is perfectly workable for us.

I've only found two references to this trace flag online- one in a presentation by an MS employee that says you should test extensively (which we are), the other in an unrelated KB article about upgrading DBs with fulltext indexes to 2008 from 2005. I've found a handful of people griping about this problem in forums over the years, with no responses. Hopefully this will take care of others' issues as well as it did ours. We were within inches of switching to a sub-second log shipping scenario to replace mirroring because of this issue, and now it's looking like we won't have to. Just wish it was a little better documented.

6 comments:

Just a quick note to say I was experiencing the exact same issue re I/O issues on the mirror server. I am mirroring 12 databases in our test environment and disk I/O on the mirror was significantly higher than was required. This was causing me no end of grief until I came across your article.....I implemented it in our test environment and everything is running very smoothly. Thank you so much for all your hard work and for sharing your findings. Now that the database layer is fully configured and working great I would really appreciate your advice on one last issue. You make reference in your article to recovering gracefully from failed transactions. I have found out during this project that the developers here historically only 'retry' if a deadlock occurs. Therefore, if the system is mid-transaction on failover we need to retry that transaction against the mirror (partner) server. I have looked far and wide for a paper of recommendations as to what error codes you should capture and then retry on but alas I have had no luck - is there a 'best practice' of what we should capture for. Thus far, I have asked them to code retries for;

It seems like everyone of these failures throws its own unique error code and we are catching and retrying on these codes. I am sure this is a novice 'school boy' error way of handling this and would really like some guidance as to how this should be handled?

Please, if you could spare the time any assistance would be greatly received.

Glad this helped someone- I was surprised at the number of times I found this problem referenced in forums with no resolution.

From a development perspective, it's difficult-to-nearly impossible to just reconstruct and retry a transaction "on the spot" if your system is at all complex, spans multiple layers, etc. Without a LOT of complex supporting infrastructure, you usually can't just go back and replay all the pieces just right as a direct response to a particular failure. I've tried doing what you describe in former lives, and it just ends up complicating the code for not a lot of benefit (and sometimes to the overall detriment of the system, since issues can be hidden). We "fail gracefully" in that we're using transactions for everything so that the DB stays consistent (you'd be amazed how many people screw that up), and that any work can be retried as a unit in the case of ANY failure. For that part, we're not looking for specific error codes, we just roll back the transaction and in the case of automated processes (background scheduled tasks, queued work, etc), auto-retry the work item later once or twice before flagging it as suspect and needing human intervention. User-initiated requests should be idempotent until their transaction commits, so if a user request fails, we don't try to hide it- just say "sorry, it broke", and ask them to try again. Large blocks of failures are logged and alerted on, so if we get alerted to a bunch of failures and we DIDN'T just do a failover, we know we've got a problem that needs attention.

Might not be what you wanted to hear, but it's served us well... Good luck!

Very much appreciate the quick response. The only other issue I am experiencing since turning on the trace flag (and I’m not saying here it is because of the trace flag – it’s more to see if you guys had any issue) is I intermittently get the following error (I say intermittently) sometimes it happens first time, sometimes it happens on the 5th time of failover!

When running this on our enterprise domain (without the trace flag switched on) it fails over seamlessly without error. However, now I am running it at our hosting company for 'real-life' testing I am receiving the following error when performing failover (manual and automatic).

Microsoft SQL Server Native Client 10.0 error '80004005'

Cannot open database "DatabaseName" requested by the login. The login failed.

I am using a standard domain user account for each client for the SQL Server logins, removing any issues with mapped SID's etc when failover occurs. So when running the application and performing failover 'sometimes' it works!?! - So I'll quite happily be running a job from the app into the database and failing the database backwards and forwards with no problem at all and then suddenly BAM...i get this error. I am totally lost as the logs aren't giving me any information, hence me requesting some Jedi assistance.

I’m not sure whether this is due to the failover times increasing and therefore the connection from the app isn’t performing correctly. This only happens when I fail to the mirror (Failover Partner server). When I fail back to the principal, I can refresh the app and all is well. I appreciate this is your blog and will not fire random questions re my mirroring issues, it is just I would like to discount this as a trace flag issue.

Thank you for your assistance on this. I really do appreciate your help.

It doesn't seem right that the trace flag would directly affect login behavior, since all it does is change whether or not recovery is done after a log write. I'd suspect that connection pool state corruption due to the longer failover time might be a likely culprit, but we only use the .NET SqlClient (not the native client), so I don't have any experience there. If you're doing a planned failover, you can always DBCC TRACEOFF the 3499 flag on the secondary right before the failover. That gets you a fast failover and you're only paying high disk I/O for the length of time it takes to do the switch.

Apologies for the delay in responding, but i wanted to be sure before i did. It now seems (fingers crossed) that we have resolved the issue. It transpires that it was a connection pooling. After working with the dev team, they have issued a fix in testing that seems to have resolved the issue...

I thank you very much for all your valuable input, time and for the excellent initial post.

I came across this issue in 2008 R2 Enterprise, with synchronous mirroring. Our issue is lots of timeouts when indexing (online). The mirror can't keep up with the live database, I've seen the Disk I/O on the mirrored server continue to be pegged over a minute after the indexing finished on the live server, trying to catch up. This causes blocking, then massive timeouts for the customers, even when indexing during our slowest time period. Small indexes under 1MB will still cause issues (avg size was 5-6 MB index), as even only taking minute to index causes issues. I tried going with more reorganize indexes, instead of rebuild, as these make less work for the transactions logs, but even that caused timeouts. The tables causing timeouts have 40 to 210 million rows.

I was very hopeful this Trace Flag would work, but it didn't. I can only assume this is synchronous mirroring "as designed" by Microsoft - which doesn't work well at all even in moderate loads. I may try async mirroring. I'll probably try partitioning some of these tables.