Fighting off RAC

As mentioned, my current client is upgrading from 9i Standard Edition on Windows to 11g Enterprise Edition on Linux.

The timeframe for this upgrade is quite aggressive. While normal development continues at a blistering pace, automated regression testing began part-time in January for an upgrade originally in March but now re-arranged for April.

Also in January, I was told that it had been decided that we would be going live on a 2-node RAC cluster. Alarm bells rang, sirens went off, I was bewildered as to how such a decision could be made.

This was a decision that management had apparently made in conjunction with the DBAs, with RAC sold as a black box database clustering solution with no question marks over its suitability.

There’s no architecture department/function and in terms of Oracle and databases, it’s a role I have previously filled elsewhere. I immediately went into overdrive, briefing against the wisdom of going to RAC before it was too late.

For us, there are numerous issues that I see with RAC:

The first is the upgrade timeline. The proposal, in January, was that in March we would be changing hardware, changing OS, changing database version, changing database edition and now also moving to RAC. There are plenty of references out there that suggest RAC alone can take a few months to stabilise and to nail your proper configuration. For a business that was committed to a March (even April now) deadline, this seemed to me to be a big risk.

Secondly, I am far from convinced about the suitability of RAC for our system. It’s one thing for senior management to upgrade the business criticality of a system and give the green light for some spending to upgrade the platform. It’s another thing to upgrade your Ford Focus to a Ferrari only to stay stuck in the same commuting traffic. That’s my analogy for our system: lots of bad SQL, lots of statements that cope with 1001 different inputs, far too many lookups in functions called from SQL. At the end of the day there’s no getting away from the fact that our system has something like an average of 8 concurrent requests. Historically, it has been maxed out or close to being maxed out on CPU because the SQL is bad.

More importantly, I remain very concerned about the suitability of RAC for the profile of our system. I think of RAC as suitable for high-end OLTP systems or for multiple systems, with non-OLTP systems restricting themselves to one of the nodes (unless non-conflicting activity can be determined somehow and allocated to relevant nodes). I would probably categorise our single system as a hybrid system with a very small element of OLTP and a large element of reporting. Furthermore, a large proportion of that reporting revolves around the top X% of a single table. The RAC architecture is such that repeated requests for the same blocks will cause significant waits as the instances coordinate and pass the blocks back and forth.
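To make the block-passing concern concrete, here is a deliberately crude toy model in Python. It is not Oracle's actual cache coordination protocol: the rule that a write invalidates every other node's copy, the hot set of 10 blocks, the 20% write ratio and the function names are all assumptions chosen purely for illustration. It only counts how often a node has to fetch a hot block from the other node.

```python
import random


def count_transfers(accesses):
    """Count cross-node block transfers in a toy two-node cache model.

    accesses: iterable of (node, block, is_write) tuples.
    A node fetches a block from the other node when it holds no valid copy
    and some other node does. A write invalidates all other copies.
    (Assumed rules for illustration only, not Oracle's real protocol.)
    """
    holders = {}  # block id -> set of nodes holding a valid copy
    transfers = 0
    for node, block, is_write in accesses:
        nodes = holders.setdefault(block, set())
        if node not in nodes:
            if nodes:  # another node holds it, so ship it across
                transfers += 1
            nodes.add(node)
        if is_write:
            holders[block] = {node}  # invalidate every other copy
    return transfers


def make_workload(n, hot_blocks, write_ratio, route, seed=1):
    """Generate n accesses against a small hot set of blocks.

    route(block, rng) decides which node serves each request.
    """
    rng = random.Random(seed)
    workload = []
    for _ in range(n):
        block = rng.randrange(hot_blocks)
        is_write = rng.random() < write_ratio
        workload.append((route(block, rng), block, is_write))
    return workload


# Requests for the same hot blocks land on either node at random.
spread = make_workload(10_000, 10, 0.2, lambda block, rng: rng.randrange(2))
# The same kind of workload pinned to a single node.
pinned = make_workload(10_000, 10, 0.2, lambda block, rng: 0)

print("transfers, spread over 2 nodes:", count_transfers(spread))
print("transfers, pinned to 1 node:   ", count_transfers(pinned))
```

In this model the pinned workload never needs a transfer, while the spread workload pays for one every time a node touches a block that the other node has modified since. How expensive each transfer is on real hardware is a separate question that only testing would answer.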

The bottom line is that RAC is complex and there’s no getting away from the fact complex is bad. Complexity eventually comes back round and bites you. And complexity usually costs more.

For now, my briefing has been a partial success, although really the only point that had any effective leverage was that the upgrade date was under threat. A 2-node RAC solution is currently on hold for the initial upgrade. The idea is that we upgrade to 11g on the new hardware and then expand to a 2-node cluster once that initial upgrade phase is stable.

But, and here’s the sticking point, the upgrade is going to be to a 1-node RAC “cluster”. I can understand that there is a once-in-a-blue-moon opportunity to go to the new hardware, but I don’t like the smell of a 1-node RAC cluster at all. For a start, there surely has to be some overhead to the RAC installation, and we’re not even testing that. What’s more, there is the cost. There is a significant licensing overhead to RAC. It’s an overhead that surely means that going to a 2-node cluster is a formality regardless. I’m just glad that it’s not my name on the cheque. It’s one of those things where I just hope that someone down the line doesn’t say that we’re moving off Oracle because it’s so expensive.

If your application doesn’t scale well, then you will have a lot of fun with a 2-node RAC. On a 1-node RAC it will run fine, but on 2 nodes it will be much slower.
But I would start with a non-RAC install, which you can quite easily convert to a RAC install later on.

Good luck! It’s always interesting to hear which documented requirement can only be satisfied by using RAC. A rational analysis should be able to determine if the cost of not having the requirement is equal to the cost of implementing and supporting RAC.

Thanks Dom. Unfortunately, I’ve lost the battle already – the only concession was that we won’t be upgrading onto RAC immediately. I only heard about this on the rumour mill when it was too late.

“If the Oracle sales people say it’s right, and the Oracle professional consultants say it’s right, it must be right, right?”

I think a lot of the problem is that management wanted to buy/build a Rolls-Royce platform without considering whether there were any elements of the Rolls-Royce that were actually unsuitable. And once the business has agreed to spend the money, what are you going to do? Turn around and say that you don’t need as much and that your thinking was flawed? I am not a massive fan of technical decisions made by management in a void of appropriate technical input (your article).

It’s also an attitude to cost that I thought might have changed already with the ongoing financial credit crunch crisis.

Oliver, could you expand on your comment a bit? I agree that RAC isn’t a good solution to poorly tuned applications, but I fail to see how doubling the hardware resources available to it could make it go slower.

In situations where there’s a lot of poorly tuned SQL, I like to use the Database Resource Manager to shift those long-running SQL statements to a lower priority. Plus, you can very quickly see those long-running statements in v$session.resource_consumer_group. I’m going to blog about this in the near future. This feature works in both RAC and non-RAC environments.

Once you have your 2 node RAC cluster setup, you could use services to put your good SQL on one node and your poorly tuned apps on the other. As you tune your apps 1 by 1, you can move them to the “good” node. This should be no worse than your current situation and might provide some incentive for your developers to tune their apps.

Another option is to use services to put OLTP on one node and reporting on the other, since the nature of the workloads is very different.

I do agree that it will extend your implementation timeline, as it’s a lot harder to install than single-node.

1 last comment. Since you’re moving to 11g, look into tuning those SQL statements that call functions with function result cache. I’m sure you already know about this feature, but I can tell you from personal experience that it’s a very simple, high-yield tuning technique. The APEX team included it immediately in the 3.0.1 release for all of the calls to functions that return translation strings.

Obviously, taking functions out of SQL statements completely is the best way to tune them, but this can often take a lot of work, whereas function result cache allows you to simply add a line to the function and leave the query unchanged until you have time to circle back and really tune it.
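The “add a line and leave the query unchanged” idea can be sketched with a Python analogy. This is not Oracle code: in PL/SQL the change is the RESULT_CACHE clause on the function declaration, whereas here functools.lru_cache stands in for it. The translate function, its lookup table and the row data are hypothetical, made up only to show the effect on call counts.

```python
from functools import lru_cache

lookup_calls = 0  # counts how often the expensive body actually runs


def translate(code):
    """Hypothetical stand-in for a lookup function called once per row."""
    global lookup_calls
    lookup_calls += 1
    return {"GB": "United Kingdom", "FR": "France"}.get(code, "Unknown")


# A "query" that calls the function for each of 5,000 rows,
# but with only two distinct input values.
rows = ["GB", "FR", "GB", "GB", "FR"] * 1000

uncached = [translate(code) for code in rows]
calls_without_cache = lookup_calls

# The one-line change: wrap the function in a cache. The loop that
# plays the role of the query stays exactly as it was.
lookup_calls = 0
cached_translate = lru_cache(maxsize=None)(translate)
cached = [cached_translate(code) for code in rows]
calls_with_cache = lookup_calls

print(calls_without_cache, calls_with_cache)  # prints "5000 2"
```

The results are identical either way; only the number of times the function body executes changes. The benefit depends on the same inputs recurring often, which is the case for lookups and translation strings.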

RAC is a way of spreading the processing load over more CPUs. Interestingly though, the processor technology industry is going multi-core apace, meaning a lot of the advantages of RAC (although not those around redundancy) could be seen to be eroded by these new multi-core CPUs (take the UltraSPARC T2 for example). Why spread the load over multiple machines when one processor now has 4+ cores and 64 threads of execution? With this trend in mind I think your caution around RAC is well founded.