Recovery Design Part 5 - Wrap-up

Doug's Oracle Blog

It's been a very mini-series but I hope I've highlighted some of the challenges when designing systems that need to recover from failures quickly. Here are a few ideas and I'm sure others could add their own.

Careful planning is essential.

Ask tough questions and imagine the worst.

Always be aware of the requirements. The design is entirely dependent on them.

You need to consider every single point of failure, and RAC still implies a single shared database that can fail (and probably will some day).

Test.

Keep things as simple as possible and minimise human intervention. Humans make mistakes.

Test.

Document. It reduces the number of mistakes humans might make.

Test.

Employ technically strong, responsible humans, and I swear that isn't an advert, just recognition that the more complex the configuration, the better your people need to be.

I mentioned in the previous part that I would reach some unsatisfactory conclusions. Some of those are above - because I'm not telling you what to do, just mentioning some considerations - and here's another.

There really is no one-size-fits-all solution and (slay me for saying this, purists) the chances are that you will not deliver what the business wants, or what you hoped you would, but a compromise resulting from the balance of risks, costs and expectations. Far better to turn around and admit that if things go very wrong it might mean four hours down-time than come up with a pretty design document with the buzz-words that your managers are looking for and then have to spend four hours panicking when disaster strikes after the system's live and five days explaining to managers why 'it didn't work'.

As for the satisfactory sources of information, here are a few.

I already mentioned Mogens' first-class critique of RAC, and I only recently came across another excellent RAC paper by James Morle that looks at RAC Connection Management. Take a look at James' paper and ask yourself if you understand what 'Transparent Application Failover' really means. It sounds perfect, but I suspect it doesn't do what most of you would expect (unless you've worked with it already).
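To give a flavour of why, TAF is configured at connect time in the client's tnsnames.ora, so here's a minimal sketch of what an entry might look like (the alias, service and host names below are made up for illustration, not anything from James' paper or my own systems):

    # Hypothetical TAF entry - alias, hosts and service name are examples only
    MYSERVICE =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
          (LOAD_BALANCE = yes)
        )
        (CONNECT_DATA =
          (SERVICE_NAME = myservice)
          (FAILOVER_MODE =
            (TYPE = SELECT)
            (METHOD = BASIC)
            (RETRIES = 20)
            (DELAY = 5)
          )
        )
      )

The catch is in the small print. TYPE=SELECT means in-flight queries can be resumed on a surviving instance, but any open transaction is rolled back and session state (PL/SQL package variables, temporary tables and the like) is lost, so the application still has to notice and cope. 'Transparent' is doing a lot of work in that name.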

I've recently started reading the eBook version of Julian Dyke and Steve Shaw's Pro Oracle Database 10g RAC on Linux. Although I haven't got into the guts of it yet, I've read all of the appendices and the first few chapters and the tone is pretty much what I've used in this series, but much more detailed. It's very strong on high availability concepts so if you've never worked in that type of environment before, it's a nice way to get started.

Oh, one more thing. Vidya asked about disk-to-disk synchronisation, and I thought I'd include a link for anyone who's interested in that kind of thing, although it's clearly vendor-specific.

The series is finished for now, but I think there'll be future blogs about how the implementation of the new system goes, particularly any problems I run into.

Another way of winning the VHA war is to use good old "divide and conquer". This is precisely what was done at my previous job:

We received billions of "clicks" from various search engines. Each consisted of a target URL, a source URL, the search terms used and an optional cookie to track sales, pages travelled, clients, etc. Each click was charged to us by the engines, and each was in turn charged back to the clients owning the target sites.

As you can expect with those volumes, the need to track all clicks was of paramount importance. This is not 4x9s HA; this is zero loss, 100% availability, exact integer counts, period! No excuses. We could lose the odd "click" here and there, but definitely no total service outages, none whatsoever.

Needless to say, making it work is not the only issue. What about new versions? And testing? And maintenance? And so on.

The solution turned out to be Apache servers, heaps of them, behind load balancers. Some nifty add-on code made Apache log more than it usually does, then offline batch loading of the logs into analytics databases let us do all sorts of fancy footwork with search terms, effectiveness, global marketing campaigns and so on.
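To make the "log more than it usually does" bit concrete, here's the sort of thing it might have looked like in httpd.conf (the format string, cookie name and file path are my guesses, not the poster's actual setup): a custom log format capturing the target URL and query string, the referring (source) URL and the tracking cookie, written to its own log for the batch loaders to sweep up later.

    # Hypothetical click-tracking format: client, time, target URL + query
    # string, referring (source) URL, tracking cookie and response status
    LogFormat "%h %t \"%U%q\" \"%{Referer}i\" \"%{TRACKING_ID}C\" %>s" clicklog
    CustomLog logs/clicks.log clicklog

Because the loading into the analytics databases happens offline, the web tier never waits on a database, which is exactly what makes the "lose the odd click, never the service" trade-off work.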

If any of the servers failed, we'd lose the current "click" being processed but nothing more: all load would be diverted to the next ones in the round robin and, bingo, away it ticks.

In 24 months, we lost 10 minutes of clicks. Even that was due to a misconfiguration of one of the load balancers, not our application code.

That's not bad, but at what cost?
Well, all commodity blade servers, RHAS3, Apache and some in-house brain sweat.

"...come up with a pretty design document with the buzz-words that your managers are looking for and then have to spend four hours panicking when disaster strikes after the system's live and five days explaining to managers why 'it didn't work'"

Ahh ... yes ... RealLife(tm). A beautiful summary.

It fits nicely with my other favorite/pet-hate observation: the manager who tells you to choose the (technically) wrong option for reasons of cost/time, telling you that in the event of a disaster he will take full responsibility. Which he does, in a fashion, by ordering you to do overtime to fix the problem when the disaster does strike.

Otherwise, back to the main point, your Recovery Design mini-series: yes, yes, true, done that, OK, yes, good way of clarifying that, yes, OK ... (and so on). I agree with it.

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.