rare race condition can take down a regionserver.


Description

This happened after > 24 hours of heavy import load on my cluster. Luckily, the shutdown seemed to be clean:

java.lang.IllegalAccessError: Call open first
at org.apache.hadoop.hbase.regionserver.StoreFile.getReader(StoreFile.java:356)
at org.apache.hadoop.hbase.regionserver.Store.getStorefilesIndexSize(Store.java:1378)
at org.apache.hadoop.hbase.regionserver.HRegionServer.doMetrics(HRegionServer.java:1075)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:454)
at java.lang.Thread.run(Thread.java:619)


ryan rawson
added a comment - 23/Jun/09 08:18

This is a race condition; here is how it happens:

doMetrics() calls getStorefilesIndexSize(), which gets a view of the storefiles ConcurrentSkipListMap at some point in time. Working on this snapshot, it calls each store file in turn, asking for its index size.

In another thread, the compaction completion code finishes. The first things it does are:

1. remove the store files from the storefiles list
2. do some other work
3. close the aforementioned store files, which causes this.reader to become null

Back in thread #1, we run into this.reader == null and throw the exception.

So we need to do one of the following:

- sync on this map, or use a synced version of the map
- allow these metrics to be checked without causing a RS abort when we hit an exception: either catch it, or prevent it from happening
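The interleaving above can be sketched deterministically. All class and method names below are simplified stand-ins for illustration, not the real HBase code: the metrics thread copies its point-in-time view, the "compaction" steps run in between, and the iteration then trips over the nulled reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

public class StoreFileRaceSketch {
    static class Reader {
        long indexSize() { return 42; }
    }

    static class StoreFile {
        private volatile Reader reader = new Reader();
        Reader getReader() {
            // Pre-patch behavior: throw if the file was already closed.
            if (reader == null) {
                throw new IllegalAccessError("Call open first");
            }
            return reader;
        }
        void close() { reader = null; }
    }

    static String simulate() {
        ConcurrentSkipListMap<Long, StoreFile> storefiles = new ConcurrentSkipListMap<>();
        StoreFile sf = new StoreFile();
        storefiles.put(1L, sf);

        // Thread #1 (metrics): takes a point-in-time view of the storefiles.
        List<StoreFile> snapshot = new ArrayList<>(storefiles.values());

        // Thread #2 (compaction completion): removes the file, does some
        // work, then closes it, which nulls out its reader.
        storefiles.remove(1L);
        sf.close();

        // Back in thread #1: iterating the snapshot hits the null reader.
        try {
            long total = 0;
            for (StoreFile f : snapshot) {
                total += f.getReader().indexSize();
            }
            return "index size: " + total;
        } catch (IllegalAccessError e) {
            return "caught: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

Run sequentially like this, the interleaving is forced, so the sketch reproduces the failure every time instead of once per day of import load.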


stack
added a comment - 23/Jun/09 23:32

At first I thought the use of ConcurrentSkipListSet was the problem, but thinking on it more, we instead need to make the code tolerate the fact that a file has been moved or removed. The alternative, syncing around file operations until they complete, is too much to ask.

A good while ago, an issue in metrics got the HRS stuck in an infinite loop.

Let me try to hack up a patch.


stack
added a comment - 24/Jun/09 05:00

Patch does the following:

+ Wraps metrics in a try/catch that catches any exception, logs it, and then moves on rather than letting it out and killing the HRS
+ Changes getReader so it doesn't throw IllegalArgumentException with "must open first" but instead lets out the null Reader
+ Then, changes wherever we get a Reader to check for null. If null, log it so we can see the extent of the phenomenon, but in general just keep going
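The patch's approach can be sketched like so. Again these are simplified hypothetical names, not the actual HBase classes: getReader() hands back whatever is there (possibly null), callers skip null readers after logging, and the whole metrics pass is wrapped so nothing escapes to abort the regionserver.

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class NullTolerantMetricsSketch {
    static class Reader {
        long indexSize() { return 42; }
    }

    static class StoreFile {
        private volatile Reader reader = new Reader();
        // After the patch: no exception, just return the reader, possibly null.
        Reader getReader() { return reader; }
        void close() { reader = null; }
    }

    // Every caller of getReader() now checks for null, logs it, and keeps going.
    static long storefilesIndexSize(Iterable<StoreFile> files) {
        long total = 0;
        for (StoreFile f : files) {
            Reader r = f.getReader();
            if (r == null) {
                System.out.println("WARN: null reader, file closed under us; skipping");
                continue;
            }
            total += r.indexSize();
        }
        return total;
    }

    // The metrics pass itself is wrapped so no exception can escape and abort the RS.
    static long doMetrics(Iterable<StoreFile> files) {
        try {
            return storefilesIndexSize(files);
        } catch (Throwable t) {
            System.out.println("WARN: metrics update failed, skipping: " + t);
            return 0;
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Long, StoreFile> storefiles = new ConcurrentSkipListMap<>();
        StoreFile open = new StoreFile();
        StoreFile closed = new StoreFile();
        storefiles.put(1L, open);
        storefiles.put(2L, closed);
        closed.close(); // simulate compaction closing a file mid-pass
        System.out.println("index size: " + doMetrics(storefiles.values()));
    }
}
```

With one file closed under it, the pass now logs a warning and reports the index size of the surviving file instead of taking down the server.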


Lars George
added a comment - 25/Jun/09 09:49

I ran into the same issue; it killed the HRS that hosted the ROOT partition. The "Catalog Tables" table in the UI was empty afterwards, and the running MR job was failing fast. I synced to trunk, applied this patch, and restarted the cluster. With the above patch, two pieces are failing to update, but they were the removed bloomfilter variables, and it looks like those were removed in trunk already, so no harm done.

Will report if I run into the new log outputs or if anything else happens.