Issue

During this event, you had 6 resource type timeouts. They were:

cvm.vxconfigd, cfgnic, datadg, ocrdg, cvm_clus, and vxfsckd.

Whenever a system's resources are being pushed to their limits, you will see multiple VCS resource timeouts of this type in a short period of time. Each VCS resource type has its own agent, and each agent has a monitor, online, offline, and clean component. These are just UNIX processes doing interprocess communication, and they compete for system resources like any other UNIX process. If the system is running at its limits, that contention causes the timeouts you saw.

The timeouts themselves are not the issue; the problem arises when a timeout causes the clean component to be called, which is essentially a forced offline by VCS. Every VCS resource type has tunables that can be adjusted to increase the amount of time VCS will wait before calling clean.
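Before changing anything, it helps to see what the current values are. As a sketch (standard VCS `hatype` syntax; run as root on a cluster node with VCS running), the timeout-related attributes for a resource type can be inspected like this:

```shell
# Display the timeout-related attributes for the CVMVolDg resource type.
# Each command prints the attribute name and its current value.
hatype -display CVMVolDg -attribute MonitorTimeout
hatype -display CVMVolDg -attribute FaultOnMonitorTimeouts
hatype -display CVMVolDg -attribute ToleranceLimit
```

The same commands work for any other resource type (CFSMount, CVMCluster, and so on) by substituting the type name.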

In your case, clean was called on the CVMVolDg resource datadg, which controls all of your shared volumes and filesystems. Because clean was called, the resource was forced offline, and because all of the CFSMount resources depend on it, this is the root cause of your issue.

The best fix is to lighten the load on these cluster nodes so that CPU is not saturated, as this will prevent the timeouts in the first place. If that can't be done, then I would recommend the following tuning for the CVMVolDg resource type.

FaultOnMonitorTimeouts - This tunable defines the number of consecutive monitor timeouts that can occur before clean is called.

Current value - 4
Recommended value - 8

MonitorTimeout - This tunable defines how long (in seconds) the monitor can run before VCS declares the resource timed out.

Current value - 60
Recommended value - 120

ToleranceLimit - This tunable defines how many consecutive times the monitor can report the resource offline before the agent declares the resource faulted.
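If you decide to apply the recommended values above, a sketch of the change using the standard VCS `haconf`/`hatype` commands (adjust to your environment, and make the change during a maintenance window) would be:

```shell
# Open the VCS configuration for writing.
haconf -makerw

# Raise the tunables on the CVMVolDg resource type per the
# recommendations above (MonitorTimeout is in seconds).
hatype -modify CVMVolDg MonitorTimeout 120
hatype -modify CVMVolDg FaultOnMonitorTimeouts 8

# Save the configuration and return it to read-only.
haconf -dump -makero
```

These type-level changes apply to every resource of type CVMVolDg in the cluster, so verify afterwards with `hatype -display CVMVolDg` that the new values took effect.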