Is there documentation on what should and should not happen when all local
disks fail on a cluster node?

We had a problem with a two-node Veritas 3.5 cluster running Solaris 8:
two SF6800s with FC shared storage and a local D240 (with four disks) each.
Due to mis-cabling, the D240 on the first node failed during a power outage.

The log files don't really explain what happened, and I only have the logs
from the second node (of course).

Can someone explain what happened and how it could have been prevented?

OK, that's more or less clear, but how did the first node get faulted? There
is no message other than "RUNNING TO FAILED": no failing resources,
nothing. How did the other node decide that quickly that all groups were
offline on node one? Did it probe? Did it panic the system?

OK, so port "h" went south.
Now, port "h" is used by the "had" process (the main VCS program) to
communicate with GAB. It looks like "had" died here (or the whole box died).
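
If you still have access to that node, the GAB port memberships can be
checked directly (a quick sketch; the generation numbers will differ on
your cluster):

    # Show current GAB port memberships on this node.
    # Port a is GAB itself, port h is had's membership; a healthy
    # two-node cluster shows "membership 01" on both ports.
    gabconfig -a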

If you had a look at the node that failed, you might see that it did.
Most of the time, when a disk dies it will "hang" the system: a lot of
I/O gets blocked in the kernel, and the kernel tries very hard to retry
and get some I/O going, so a lot of processes or threads end up waiting
on I/O.
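
If the box is hung but the console still responds, something like this
can show whether threads are stuck in the disk I/O path (a sketch,
assuming mdb and the usual Solaris tools are available):

    # List kernel threads with their stacks; many threads sitting
    # in biowait() or sd driver routines points to blocked disk I/O.
    echo "::threadlist -v" | mdb -k

    # Per-device I/O statistics with error counts; a dead local
    # disk often shows up as busy with errors and no throughput.
    iostat -xne 5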

You must also remember that the main VCS process (had) runs in
userland, so anything running in the kernel at that stage will get a lot
more time on the CPU(s). GAB also runs in the kernel and will try to
communicate with "had". If it cannot talk to "had", it will actually kill
the process (and hashadow will then restart it, eventually). All of
these messages get logged in the /var/adm/messages file (on Solaris).
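
On the failed node, those GAB and had events should therefore be
recoverable from the system log, for example:

    # Look for GAB killing had, and hashadow restarting it,
    # around the time of the incident.
    grep -i gab /var/adm/messages
    egrep -i 'had|hashadow' /var/adm/messages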

It is a shame you don't have any info regarding what happened on the
sr005 system. Even a crash dump could be analysed by Veritas support to
tell you what happened there.
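
If sr005 panicked and savecore was enabled, a crash dump may still be
sitting on that node; a quick check (paths assume the Solaris defaults):

    # Show where crash dumps go and whether savecore is enabled.
    dumpadm

    # The default savecore directory; unix.N/vmcore.N pairs are
    # what Veritas support would want for analysis.
    ls -l /var/crash/`hostname`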

The other system will stay autodisabled until it gets autoenabled or
"had" restarts (and communicates the state of the system and the
resources back to the remaining node).
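
For reference, the autodisable state can be inspected, and cleared by
hand once you are certain the node is really down (a sketch; substitute
your own group name):

    # Show whether a group is autodisabled, and on which system.
    hagrp -display <group> -attribute AutoDisabled

    # Clear it manually for the failed node. Only do this when you
    # are sure sr005 is down, or you risk a concurrency violation
    # on the shared storage.
    hagrp -autoenable <group> -sys sr005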

When the hagrp -online command was executed on sr006, the sr005 system
was not back yet (maybe at the ok prompt?). It is also a pity you don't
have the log files beyond this point. The failure reason would have been
stated (most likely that the group was still autodisabled on the sr005
node).
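
Even without the engine log, the state at that moment is the kind of
thing these commands would have shown (a sketch):

    # Cluster-wide summary: group states per node, including the
    # autodisabled flag that would block the online attempt.
    hastatus -sum

    # Retry the online and watch the engine log for the stated reason.
    hagrp -online <group> -sys sr006
    tail -f /var/VRTSvcs/log/engine_A.log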

Now, lastly, the human factor. Most of the time when someone messes up,
they try to cover their tracks (by modifying or deleting the
/var/adm/messages and /var/VRTSvcs/log/engine_A.log files).
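
Edited log files are often detectable, because an after-the-fact edit
updates timestamps that casual tampering does not fake:

    # mtime (ls -l) can be reset with touch, but ctime (ls -lc)
    # cannot easily be; a ctime long after the last log entry
    # suggests the file was altered later.
    ls -l  /var/adm/messages /var/VRTSvcs/log/engine_A.log
    ls -lc /var/adm/messages /var/VRTSvcs/log/engine_A.log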

There are some other indicators to look for: reboot times, or logs for
other applications or agents in /var/VRTSvcs/log that all stop at the
same time and resume later (after a reboot or a "go" at the ok prompt).
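
A few things to compare (a sketch; exact log names depend on which
agents you run):

    # When did the node actually boot?
    last reboot | head
    who -b

    # Do all the agent logs stop and resume at the same moment?
    ls -l /var/VRTSvcs/log
    tail /var/VRTSvcs/log/*.log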

Also check whether any core or crash files were generated, and get them
analysed.
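
Finding and triaging them is quick (a sketch):

    # Recent core files from user processes (had, the agents).
    find / -name core -type f -mtime -7 2>/dev/null

    # A first look at which process died; the full analysis is
    # one for Veritas support.
    file core
    pstack core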

Sorry for the long post, but I hope that explains it a bit more. I
really suggest you get your hands on more log files or other evidence
from the sr005 system.