Abort RegionServer Immediately on OOME

Description

Currently, when the HRegionServer runs out of memory, it calls the master. That call causes more heap allocations, which throw a second OutOfMemoryError. The easiest and safest way to avoid this OOME storm is to abort the RegionServer immediately when it hits the memory limit. Part of the 89-fb to trunk port.
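
The idea can be sketched roughly as follows. This is not the attached patch: the class `OomeGuard`, the method `haltIfOome`, and the injected halt hook are hypothetical names, introduced only so the example can run without killing the JVM. A real server would presumably pass `status -> Runtime.getRuntime().halt(status)` as the hook, which terminates the JVM without running shutdown hooks.

```java
import java.util.function.IntConsumer;

public class OomeGuard {
    // Hypothetical halt hook. Production code would likely pass
    // status -> Runtime.getRuntime().halt(status).
    private final IntConsumer halter;

    public OomeGuard(IntConsumer halter) {
        this.halter = halter;
    }

    // Walks the cause chain looking for an OutOfMemoryError.
    // The depth bound protects against cyclic cause chains.
    static boolean isOome(Throwable t) {
        int depth = 0;
        while (t != null && depth++ < 32) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    // Invokes the halt hook with status 1 if t is, or wraps, an OOME.
    // Nothing else is attempted first (no RPC to the master, no cleanup),
    // so the error path itself does not need more heap.
    // Returns true only when the hook was invoked and returned, which
    // happens in tests; a real halt never returns.
    public boolean haltIfOome(Throwable t) {
        if (!isOome(t)) {
            return false;
        }
        halter.accept(1);
        return true;
    }
}
```

A caller would run this check at the top of its error handling, before any reporting code that might allocate.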

Guido Serra aka Zeph
added a comment - 20/Feb/13 11:28
guys... this is so stupid... I lost the whole morning because HBase's RegionServer was dying with no logs, no nothing... how am I supposed to debug the issue if you do not even generate a core dump? or a log message? ... argh

Nicolas Spiegelberg
added a comment - 10/Nov/11 23:00
It looks like the other locations just call Runtime.getRuntime().halt(1) directly. Maybe we should do the same? BTW: the patch was originally done by Liyin and reviewed by Kannan, so I'm not 100% sure what their reasoning was.

stack
added a comment - 10/Nov/11 22:52
+1 on patch (fix 'itseft' on commit).
Outside of this patch, do we need to fix our 'abort'? It seems odd that we have 'abort' and then this more radical abort, here called 'forceAbort', where we call halt. Maybe we should have a 'halt' method? This patch would use it, and I believe there is another halt call over in the distributed code?
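
The split between an ordinary abort and a dedicated 'halt' method might look roughly like this. It is only a sketch under assumptions: `ServerStopper`, `addCleanupStep`, and the injected halt hook are hypothetical and are not taken from the HBase code base or from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

public class ServerStopper {
    private final List<Runnable> cleanupSteps = new ArrayList<>();
    // Hypothetical hook; a real server would likely use Runtime.getRuntime()::halt.
    private final IntConsumer halter;

    public ServerStopper(IntConsumer halter) {
        this.halter = halter;
    }

    public void addCleanupStep(Runnable step) {
        cleanupSteps.add(step);
    }

    // Ordinary abort: best-effort cleanup first, then terminate.
    public void abort(String why) {
        try {
            System.err.println("ABORTING: " + why);
            for (Runnable step : cleanupSteps) {
                try {
                    step.run();
                } catch (RuntimeException e) {
                    // One failed step should not block the remaining steps.
                }
            }
        } finally {
            halter.accept(1);
        }
    }

    // Radical abort: no cleanup at all. The single log line is wrapped in
    // try/finally so that a failure while logging cannot prevent the halt.
    public void halt(String why) {
        try {
            System.err.println("HALTING: " + why);
        } finally {
            halter.accept(1);
        }
    }
}
```

With a single `halt` method, every call site that currently invokes `Runtime.getRuntime().halt(1)` directly could go through one place, which is also where a last log line could be emitted.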

Hadoop QA
added a comment - 10/Nov/11 22:42
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12503296/HBASE-4769.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/228//console
This message is automatically generated.