RE: 64 node Oracle RAC Cluster (The reality of...)

Just because a CFS is supported doesn't mean it is the most reliable
service of an OS. If a given vintage of ASM or straight shared raw has fewer
"moving parts" (shall we say less code path?) than a given CFS, why expose
yourself to the increased chances of a SPOF? Actually that's a "Where on the
slippery slope between ease of admin and maximum availability do you choose
to be?" question. There is room for an honest argument (absent differential
measurements sufficient to trust) about whether outages due to CFS failures
would be more or fewer than outages due to the logical complexity of having
multiple copies of the same ORACLE_HOME. Presumably the answer will
vary with OS, CFS, quality of shared disk infrastructure, quality of
internode communications, number of nodes, and acumen of the persons
involved. (Okay, as to the persons it may have more to do with compulsive
attention to detail than it has to do with acumen.)

Now let's just suppose you have several nodes in a grid/RAC. Of course,
first you're going to test the new release/patch on your isolated wee little
2-3 node test "grid" and make sure it doesn't do bad things. Then you take
two or three of the headroom nodes (nodes with capacity above what you need
at peak load) offline (i.e., you stop their instances of your production
database nicely and politely). Now, being sure to have in place on these offline
nodes the files that your reboot routines check to prevent unwanted instance
restart on reboot, you apply the patch/upgrade to these nodes along with any
database changes to the wee little test database you have spanning your
production grid (of course the test database instances on the online nodes
of the grid are down and locked off in the same manner the production
instances are locked off on the nodes you are upgrading). So now you have a
few nodes of your test database on the production environment up and running
and you make sure it works, running the regression suite you've developed to
avoid severe load on the production shared disk farm but fully testing the
functionality that must work to avoid losing your company enough money to
land your butt in the unemployment line. (I hope you have no actuals on
that, so you'll have to make an old-fashioned guess. Either you have CJLAD
(Compulsive Job Loss Avoidance Disorder) or you really didn't want to stay
there anyway, if the answer is less than a few thousand bucks.) If you're
going to this much trouble there are probably a few more digits involved,
unless we're betting on donuts. After your test "in situ" works out okay, then you
prepare all the nodes for the new ORACLE_HOME and schedule your clean
bounce, third plex split or other quick backup method, pause application of
standby logs, and pull the trigger, at which point any required database
upgrade component of the release/patch takes place. After your "Deer Hunter" moment
you relax, remove and throw away the cork from a nice bottle of Scotch, and
do what any honest fellow (person, for the PC, but "any honest fellow" is
from a rare poem that I know) does with a bottle of Scotch that can't be
corked. (Black Adder in Pete's case, if I recall correctly; I'll take
Dalwhinnie; some may stray toward Cabo Wabo, but that's not even Scotch...)
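
For what it's worth, here is a minimal sketch of the "files that your reboot
routines check" idea from the procedure above, in Python only because it is
easy to read. Everything in it is my assumption for illustration: the lock
directory, the file naming, and the function names are hypothetical, not an
Oracle convention or anybody's actual startup script. Your own reboot
routine would call something like may_autostart() before starting an
instance.

```python
from pathlib import Path

# Hypothetical location for "do not auto-start" marker files.
# This path and naming scheme are assumptions, not an Oracle convention.
LOCK_DIR = Path("/etc/oracle/no_autostart")


def lock_path(sid, lock_dir=LOCK_DIR):
    """Return the marker file path for one instance (SID)."""
    return Path(lock_dir) / f"{sid}.lock"


def lock_instance(sid, lock_dir=LOCK_DIR):
    """Create the marker so a reboot will not restart this instance."""
    Path(lock_dir).mkdir(parents=True, exist_ok=True)
    marker = lock_path(sid, lock_dir)
    marker.touch()
    return marker


def unlock_instance(sid, lock_dir=LOCK_DIR):
    """Remove the marker once the node is patched and verified."""
    marker = lock_path(sid, lock_dir)
    if marker.exists():
        marker.unlink()


def may_autostart(sid, lock_dir=LOCK_DIR):
    """True only when no marker file exists for this instance."""
    return not lock_path(sid, lock_dir).exists()


def instances_to_start(sids, lock_dir=LOCK_DIR):
    """Filter the instances a reboot routine is allowed to bring up."""
    return [sid for sid in sids if may_autostart(sid, lock_dir)]
```

The point of the design is only that the check is dumb and local: a reboot in
the middle of the patch finds the marker and leaves the instance down, and
nothing comes back up until a human removes the file on purpose.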