Listener Error from addNode.sh with Second Network

Recently I ran into a problem with 11.2.0.3 RAC. I observed this on a system patched to PSU6 and it looks like a bug to me. But the interesting part isn’t the problem – it’s an impressive and creative workaround that my colleague found over the weekend. I should add that this teammate doesn’t have much background with Oracle RAC, though he does have lots of experience with other technologies. His email this weekend surprised me and also gave me a good laugh – hope you find it equally useful and enjoyable!

The problem originated with a requirement I was given when designing this particular cluster system: I was asked to run Data Guard traffic over the backup network instead of the public network. This sounds simple enough if you haven’t worked with RAC. But if you’ve worked with Oracle clusters you realize that nothing is simple anymore. (A big reason I often encourage people to wait on moving to RAC, especially if the main driver is high availability…)

In an Oracle cluster, networks aren’t just networks. Each component (listeners & ports, IPs, subnets) is a “resource” that must be defined and managed by the cluster management software and must not be tinkered with outside of the cluster management software. I’ve seen many occasions where sysadmins new to RAC were surprised when their server would suddenly reboot itself after they stopped and restarted the network. Welcome to the newly complicated world of clusters!

Data Guard, of course, needs to use SQLNet connections in both directions between the two mirrored databases: from primary to standby to ship changes and from standby to primary to retrieve any gaps of missing changes. SQLNet connections require listeners. On a RAC cluster with IP networking you must always connect to the listener by using a special virtual IP (VIP). And the official Oracle documentation only seems to support having a single VIP for each node on a single public network. It doesn’t give any clues about having listeners (and VIPs) on a second network.

However, with a little searching I found notes 1063571.1 and 1349977.1 on Oracle’s support site, which have instructions to add a second network. FYI, there are also a few blogs (like Linda Smith’s blog) which have published the general process outlined by these notes. This is the process I followed to set up listeners on a second network for our cluster. But this is where things got interesting. After adding the second network I proceeded to add a new node to the cluster… FAIL! And the root cause seems to be that adding a node simply breaks if there’s a listener on a second network! More specifically, root.sh – which actually joins the new node into the cluster – fails.
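For readers who haven't seen it, the second-network setup comes down to roughly three kinds of srvctl calls: define network 2, add a VIP on it for each node, and add a listener bound to it. Treat the following as a sketch under stated assumptions rather than the exact commands from the notes: the interface name eth1, the node name collabn1 and its VIP address, the listener name LISTENER_NET2 and port 1522 are all made-up values for illustration. Only the grid home path, the netmask and the network number 2 come from this post. The script defaults to a dry run that just prints the commands.

```shell
#!/bin/sh
# Sketch only: eth1, collabn1, 192.168.221.168, LISTENER_NET2 and port 1522
# are hypothetical values; substitute your own.
GRID_HOME=/u01/grid/oracle/product/11.2.0/grid_1
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands, anything else = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

add_second_network() {
    # 1. Define network number 2 as subnet/netmask/interface.
    run "$GRID_HOME/bin/srvctl" add network -k 2 -S 192.168.220.0/255.255.254.0/eth1
    # 2. Add a VIP on network 2 (repeat once per existing node).
    run "$GRID_HOME/bin/srvctl" add vip -n collabn1 -A 192.168.221.168/255.255.254.0 -k 2
    # 3. Add a listener bound to network 2 on its own port.
    run "$GRID_HOME/bin/srvctl" add listener -l LISTENER_NET2 -p 1522 -k 2
}

add_second_network
```

The database also has to be told to register with the new listener, which the notes cover; that part is left out here.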

The clusterware root.sh creates a log file under the cfgtoollogs/crsconfig directory in the grid home. Looking a little deeper, we can see that this was a fatal error and that the cluster setup process actually died:

You can see from the excerpts above that the Oracle CRS root.sh script has failed to create the VIP for the second network. Looks like a simple bug to me, should be pretty easy to reproduce too. This was where I left the case last Friday afternoon.

This weekend I received the following email from my coworker:

Jeremy –

I spent a little bit of time digging into what this Oracle VIP is and if there was a way to fool it. Since this is not a standard OS definable thing, I checked to see if the srvctl command already existed on the machine. Since it was there, I gave it a try and it wouldn’t work because you had removed the clusterware. So I tried this:

– Opened 2 windows
– In Window 1 ran the command: /u01/grid/oracle/product/11.2.0/grid_1/root.sh
– In window 2 I kept trying the command: /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl add vip -n collabn3 -A 192.168.221.170/255.255.254.0 -k 2
– Eventually the clusterware came up and the node was defined in the cluster and services were fine and the command finally succeeded.

…

I assume this means everything went fine. Check it out and let me know if it really is in the cluster.

And there you have it. I had a look Monday morning and sure enough, it looked like everything had succeeded. Creative, brilliant solution from one of my coworkers! Of course we will still file a bug report with Oracle and work to get it resolved, but this seemed worth sharing in the meantime.
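If you'd rather not sit in the second window hitting up-arrow, the "keep trying" step from the email could be scripted along these lines. This is a minimal sketch, not what my coworker actually ran: the helper name retry_until_ok, the attempt limit and the delay are all hypothetical choices. The srvctl command in the usage comment is the one from his email.

```shell
#!/bin/sh
# retry_until_ok MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND repeatedly until it exits 0
# or MAX_TRIES attempts have been used up.
retry_until_ok() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "gave up after $max_tries attempts" >&2
    return 1
}

# Usage, in window 2 while root.sh runs in window 1
# (120 tries and a 15 second delay are arbitrary values):
#
#   retry_until_ok 120 15 \
#       /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl \
#       add vip -n collabn3 -A 192.168.221.170/255.255.254.0 -k 2
```

The loop stops at the first success, so the VIP is only added once. Whether root.sh reliably picks it up from there is exactly the part we observed once and haven't proven.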