I don't work for a huge company and we don't have terabyte-sized databases, but we have multiple mission-critical 120GB databases and an awesome NetOps group. Instead of buying everything brand spanking new, they bought some awesome hardware that had been refurbished. That allowed us to afford not just 2 but 3 identical systems. 2 of them are clustered onsite. We regularly test them by forcing a failover. The "outage" is usually something less than 5 seconds. The third system is in another state about 500 miles away. It's the DR system. I don't know what tools they're using for all of this, but the failover to the DR site is also measured in very few seconds and it's all automatic. They've made my job as a DBA a proverbial cakewalk when it comes to HA and DR.

During the "failover", most users lose no work and most don't even know the failover occurred.
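The poster doesn't say which HA stack his shop uses, but for anyone wanting to run the same kind of regular failover drill, here's a minimal sketch of a planned manual failover with a SQL Server Availability Group. The AG name [MyAG] is hypothetical, and this assumes a synchronous-commit secondary in a SYNCHRONIZED state:

```sql
-- Connect to the SECONDARY replica you want to promote, then issue:
-- (a planned failover with no data loss requires the secondary to be
--  synchronous-commit and fully synchronized)
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;
```

Timing how long applications lose connectivity during a drill like this is a good way to verify the "less than 5 seconds" claim for your own environment.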

Living on the bleeding edge is expensive. Instead of buying the latest and greatest, which also commands the most expense, they bought the latest and greatest of the refurb world. For a little more than what most people would have paid for just 1 system, we have 3. And, I have to tell you, these systems aren't some beat-up ol' relics. They might not beat "state of the art" but they do a damn fine job of keeping up.

--Jeff Moden

"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

I figure it's not a bad idea to chime in here. My current DBA (I'm playing little ol' developer at the moment) is beating his head on the walls creating a proper DR system. I don't often respect restrictive DBAs, but he has my respect; it makes sense. With that in mind...

It's been a political nightmare. The gear is not the issue. The setup is not the issue. The volume of wtf outside of our domain is the issue.

In some ways I want to go back to the 1990s... when you could piss on the sysadmins and they ran so you could get your DB setup properly. One LUN goes sideways now and you've got four parties screaming 'Not ME!' until you can nail one to the floor with a sledgehammer.

And then they cry.

I want to go back to when we OWNED our damned work. I want to go back to the time when three people stood up at a meeting and said "Whoops, that's mine, sorry."

You ask about the decision of when to fail. I have an issue with that, but not with the question. My issue is "can we fail?!". Too often, the answer is "No. We die."

I realize this seems like a bit of a soapbox, but this is five companies deep where I've seen this. Seriously, own up.

I own my failures. They should too. DR is not a one-system show. I can't DR anything that doesn't have proper backup from all parties. Currently DR is like pissing uphill and upwind anywhere that doesn't have staff dedicated to their components. There's no way for things to end well.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

Evil Kraig F (6/7/2013): I own my failures. They should too. DR is not a one-system show. I can't DR anything that doesn't have proper backup from all parties. Currently DR is like pissing uphill and upwind anywhere that doesn't have staff dedicated to their components. There's no way for things to end well.

My current company thinks that the backup tapes that they send to Iron Mountain are enough. We have over 500 hosted companies that use RDP to access their data/apps.

I've been doing my best to create alternate backup solutions. But if our building were hit by a tornado, we'd be screwed.

Failover clusters are a good idea, but don't let your SAN be a single point of failure. Years ago, I worked at a place which had actually invested in a failover cluster for all of their production SQL Server instances. Then there was a SAN failure, and we were down for a week. That's how long it took them to get a replacement SAN from the vendor and restore 10 TB of data from backup tape. I'm not a hardware/server guy, so I don't know exactly what they did wrong to set themselves up for that disaster.