Saturday, January 25, 2014

Core dump issues can be notoriously difficult to troubleshoot. I got a call this morning from one of my customers saying that, after a power outage, Grid Infrastructure was not able to fully come up on some nodes of their Exadata cluster. After examining the situation further, it turned out that the crsd.bin binary was simply core dumping upon startup.

Troubleshooting Grid Infrastructure startup issues can be a chore even when nothing is core dumping, so what could be more fun than a cluster that cannot fully start because a major daemon keeps core dumping?

One of the useful things to do when a binary core dumps is to get a stack trace to see which function raised the exception (you can examine the core file with gdb, for example, in order to do that). Let's see what the stack trace holds for us:

We can see that the source of the exception is in Acl::Acl, from where it is then propagated through the standard libraries. Moreover, the function SrvResource::initUserId appears in the stack trace as well, which makes you wonder whether there is an issue with one of the resources' Access Control Lists, in particular with its user id setting.
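For anyone who wants to pull such a backtrace without an interactive gdb session, here is a minimal sketch in Python. It assumes gdb is installed on the node and that you know where the binary and its core file live; the function names are mine and the paths are placeholders, not the actual locations on an Exadata system.

```python
import subprocess


def gdb_backtrace_command(binary_path, core_path):
    """Build a gdb command line that prints a backtrace for every thread, then exits."""
    return [
        "gdb",
        "--batch",                     # run the -ex commands and exit, no prompt
        "-ex", "thread apply all bt",  # backtrace of every thread in the core
        binary_path,                   # the executable that produced the core
        core_path,                     # the core file itself
    ]


def get_backtrace(binary_path, core_path):
    """Run gdb and return whatever it printed (requires gdb on the PATH)."""
    result = subprocess.run(
        gdb_backtrace_command(binary_path, core_path),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero yet still print a usable trace
    )
    return result.stdout
```

Calling get_backtrace("/path/to/crsd.bin", "/path/to/core") with the real paths substituted should return the trace as text, which you can then search for frames such as Acl::Acl.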

Armed with that knowledge, you can now sift through the Grid Infrastructure logs much more effectively, which matters because these logs are notoriously big and "chatty" (I think my worst nightmare is the database alert log becoming like the GI alert log, thereby making it much less useful). And there we have it:

Exception: ACL entry creation failed for: owner:ggate:rwx
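The sifting itself is nothing fancy. A rough sketch of the idea in Python, assuming you simply want every line that mentions one of the names seen in the stack trace (the helper name is mine, and the log path is whatever applies on your system):

```python
import re


def find_matching_lines(lines, keywords):
    """Return (line_number, text) for every line containing any keyword, ignoring case."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]
```

Opening the crsd log and passing the file object together with keywords such as ["ACL", "initUserId"] narrows a very chatty log down to the handful of lines worth reading.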

It turned out that the nodes which were core dumping had recently been added to the cluster, and the user ggate, which is the owner of the GoldenGate resource, simply did not exist on these nodes. Apparently that was enough to cause crsd.bin to core dump. Yikes!
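A check like this is easy to script before adding nodes next time. The sketch below, in Python, assumes the ACL entry always has the owner:name:permissions shape seen in the single log line above (I have not verified other shapes) and uses the standard pwd module to see whether the OS user resolves on the node where it runs; the function names are hypothetical.

```python
import pwd


def acl_owner(acl_entry):
    """Extract the user name from an entry shaped like 'owner:ggate:rwx'."""
    kind, name, _permissions = acl_entry.split(":")
    if kind != "owner":
        raise ValueError("not an owner entry: " + acl_entry)
    return name


def user_exists(name):
    """Return True if the OS user can be resolved on this node."""
    try:
        pwd.getpwnam(name)  # raises KeyError when the user is unknown
        return True
    except KeyError:
        return False
```

Running user_exists(acl_owner("owner:ggate:rwx")) on each node would have flagged the newly added ones right away.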