After finishing the open of META, the RS goes to update location in ROOT and gets:

2010-10-27 22:47:29,208 ERROR [RS_OPEN_META-192.168.0.44,59735,1288244843993-0] executor.EventHandler(154): Caught throwable while processing event M_RS_OPEN_META
java.lang.NullPointerException: No server for -ROOT-
at org.apache.hadoop.hbase.catalog.MetaEditor.updateMetaLocation(MetaEditor.java:127)
at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1271)
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:156)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)

This doesn't actually kill the RS; it's just a caught exception up in the generic EventHandler. But we get left in a weird state. Eventually the master does the right thing and times out the OPENING:

1. Address the race condition when we get the connection to the root server (could exist for meta too). The blocking call thinks we have a location, but when we then read the cached location we don't get one.

2. If we do get this NPE writing to root (or maybe meta too), we should not just throw the exception all the way up to the EventHandler, log it, and continue. That just stops our META_OPEN in its tracks. We complete the open but not the edit, and we don't roll back in any way.

3. If the master is assigning stuff out and a region says, hey, I'm already hosting this region... something must be up. In this case, it would not have been good for the RS to tell the master that it was already hosting it because it was missing the root edit. So maybe if this happens, the master asks the RS to close the region in question? Dunno.
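The race in item 1 can be sketched roughly as follows. This is a minimal illustration, not the HBase code: `LocationTracker` and its methods are hypothetical names. It assumes the fix is to have the blocking call return the location it actually observed, rather than having the caller re-read a cache that may have been cleared in between.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a catalog location tracker; not the HBase class.
class LocationTracker {
    private final AtomicReference<String> location = new AtomicReference<>();

    void setLocation(String server) { location.set(server); }

    void clearLocation() { location.set(null); }

    // Racy pattern: block until a location exists, then read the cache again.
    // The location can be cleared between the two steps, so this may return null.
    String waitThenReadCache(long timeoutMs) throws IOException {
        waitForLocation(timeoutMs);
        return location.get();
    }

    // Safer pattern: return the value seen by the wait itself, or fail with
    // an IOException that callers already know how to handle.
    String waitForLocation(long timeoutMs) throws IOException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            String seen = location.get();
            if (seen != null) {
                return seen;
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new IOException("Timed out waiting for -ROOT- location");
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException("Interrupted waiting for -ROOT- location");
            }
        }
    }
}
```

With this shape a caller never sees a null location: it either gets the server it waited for or an IOException it can treat as a failed open.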

There are probably more issues to think about around this.

This seems to be extremely rare. I have been running this TestRollingRestart script constantly, and it only happens when I do a concurrent kill of the server hosting ROOT and then the server hosting META. Even then it only happens sometimes; it works more often than not.

Activity

Jonathan Gray
added a comment - 28/Oct/10 18:44 So, we do actually catch exceptions and roll back if we fail to update root/meta. The issue is that we throw an NPE here, which we don't catch. Instead we should throw some form of IOException (we throw the NPE ourselves; it's not an actual null dereference).

Jonathan Gray
added a comment - 29/Oct/10 22:14 Trivial patch that changes exception to an IOException.
This is an interesting JIRA because it showed some race conditions around availability and tracking of catalogs. Though I think this patch is sufficient for closing this JIRA, we do need to get better at this. I think the first step to making this stuff easier is to get rid of ROOT over in HBASE-3171.
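To illustrate why the exception type matters here, a rough sketch follows. It is not the actual patch: `CatalogUpdateSketch`, `updateMetaLocation`, and `openRegion` are hypothetical names, and the `throwNpe` flag exists only to contrast the old and new behavior. It assumes the caller's rollback lives in a `catch (IOException ...)` block, as the earlier comment describes.

```java
import java.io.IOException;

// Hypothetical sketch; names and structure are illustrative only.
class CatalogUpdateSketch {

    // Stand-in for the catalog edit. With throwNpe=true it mimics the old
    // behavior (an explicitly thrown NPE); otherwise it throws an IOException.
    static void updateMetaLocation(String rootServer, boolean throwNpe) throws IOException {
        if (rootServer == null) {
            if (throwNpe) {
                throw new NullPointerException("No server for -ROOT-");
            }
            throw new IOException("No server for -ROOT-");
        }
        // The real edit to the catalog would happen here.
    }

    // Returns true if the open completed, false if it was rolled back.
    static boolean openRegion(String rootServer, boolean throwNpe) {
        try {
            updateMetaLocation(rootServer, throwNpe);
            return true;
        } catch (IOException e) {
            // Rollback path: only reached for IOException, so an NPE
            // bypasses it and propagates to the generic event handler.
            return false;
        }
    }
}
```

In this sketch, `openRegion(null, false)` takes the rollback path and returns false, while `openRegion(null, true)` lets the NullPointerException escape past the rollback, which matches the stuck-open symptom described in the report.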

Lars Francke
added a comment - 20/Nov/15 12:43 This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).