Multipath timeout issues with an extended 11.2.0.2 cluster setup

We were setting up a two-node extended Oracle Grid Infrastructure (RAC) cluster on top of RHEL 5.5 according to the standard Oracle documentation, with, of course, a third NFS node as voting node. We also use ASM to create “host-based” mirrored block devices for the Oracle software.

We chose this configuration instead of one with Data Guard because of our strict failover-time requirement in case of a node or SAN disaster: failover should complete within 30 seconds. This post raises the question of whether we made the right decision…

The following analysis and testing, by the way, were the work of my colleague Chris Verhoef, a former Red Hat consultant:

With this setup we face the issue that if we lose a complete SAN, the I/Os to the ASM disk groups are blocked for approximately 3 to 4 minutes. Oracle does not like this: about 70 seconds after a freeze, the RDBMS starts to reboot (expected behaviour). To shorten the block time we did some testing with the following parameters:

checker timeout

no_path_retry

dev_loss_tmo

The first test (expected to be an extreme one with regard to no_path_retry)
– checker timeout from 60000 ms to 30000 ms (udev rule change within 50-udev.rules)
– no_path_retry from 12 to “fail”
– dev_loss_tmo unchanged (16, the default from scsi_transport_fc)
This results in an I/O block, from the ASM point of view, of a little more than 1 minute (1.05), while we expected about 30 seconds from the checker timeout.

The second test (somewhat more reliable, with some queueing)
– checker timeout from 30000 ms to 15000 ms (udev rule change within 50-udev.rules)
– no_path_retry from “fail” to 2
– dev_loss_tmo from 16 to 7
This results in an I/O block, from the ASM point of view, of a little more than 1 minute (1.15), while we expected 15 seconds from the checker timeout plus an additional 7×2 seconds, so approximately 30 seconds in total.
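
The expectation used in both tests can be written out as a small calculation. This is only a sketch of the back-of-the-envelope estimate from this post (checker timeout plus no_path_retry times dev_loss_tmo), not a model of how multipathd actually behaves; the function name and the treatment of the fail-immediately setting as zero retries are assumptions made for illustration.

```python
def expected_block_seconds(checker_timeout_ms, no_path_retry, dev_loss_tmo_s):
    """Estimate the I/O block time the way this post does.

    Assumption: a fail-immediately no_path_retry counts as 0 retries,
    and each retry is assumed to cost one dev_loss_tmo interval.
    """
    if no_path_retry in ("fail", "failed"):
        retries = 0
    else:
        retries = int(no_path_retry)
    return checker_timeout_ms / 1000 + retries * dev_loss_tmo_s


# First test: 30000 ms checker timeout, no queueing, dev_loss_tmo 16
first = expected_block_seconds(30000, "fail", 16)

# Second test: 15000 ms checker timeout, 2 retries, dev_loss_tmo 7
second = expected_block_seconds(15000, 2, 7)

print(f"first test:  expected {first:.0f} s")
print(f"second test: expected {second:.0f} s")
```

Both estimates come out at roughly 30 seconds, while the measured block time was a little more than a minute in each test, so some other timeout in the stack appears to dominate.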

We opened a service request with Red Hat to lower the I/O block time, and also one with Oracle, just to be sure.

After an extended investigation, Red Hat's first advice was to install the latest versions of the following packages, in our case: