Keith Turner
added a comment - 09/Apr/14 18:04 When I added scan interruption I first tried using Thread.interrupt(), but that did not work so well because HDFS client code ate the interrupts. So I switched to the strategy of using an atomic boolean and checking it in certain places. HDFS eating interrupts was in an older version; maybe it does not anymore? We could possibly try Thread.interrupt() in addition to checking an atomic boolean.
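
A minimal sketch of the combined approach described in this comment, under stated assumptions: `InterruptibleScan` and its methods are hypothetical names for illustration, not Accumulo's `ScanDataSource`. The atomic boolean stays the source of truth, and `Thread.interrupt()` is sent in addition, so a library that swallows the interrupt cannot hide the cancellation request.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical illustration only; not the Accumulo implementation. */
class InterruptibleScan {
    // Cancellation flag that survives even if a library clears the thread's interrupt status.
    private final AtomicBoolean interruptFlag = new AtomicBoolean(false);
    // Thread currently running the scan, or null when no scan is active.
    private volatile Thread worker;

    /** Requests cancellation: set the flag first, then also interrupt the worker thread. */
    void interrupt() {
        interruptFlag.set(true);
        Thread t = worker;
        if (t != null) {
            t.interrupt(); // may wake a blocked read, if the callee honors interrupts
        }
    }

    /** Reads up to maxEntries and returns how many were read before cancellation was seen. */
    int scan(int maxEntries) {
        worker = Thread.currentThread();
        int read = 0;
        try {
            while (read < maxEntries) {
                // Check both signals; the flag is the one that cannot be "eaten".
                if (interruptFlag.get() || Thread.currentThread().isInterrupted()) {
                    break;
                }
                read++; // stand-in for reading one key/value pair
            }
        } finally {
            worker = null;
        }
        return read;
    }
}
```

The flag is only observed at the points where the loop checks it, so a read that blocks forever inside a library call would still need the interrupt (or a timeout) to get back to a check.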

Josh Elser
added a comment - 09/Apr/14 17:56 Keith Turner, yes, thanks for mentioning. I re-read what you had mentioned there, but I will look at the code. My comments above were mostly from this standpoint: if reading/writing data via the HDFS API can prevent a scan from being interrupted, maybe there's something more we need to do. I have not yet substantiated this against what the implementation does.

Keith Turner
added a comment - 09/Apr/14 17:53 Josh Elser I made some comments on ACCUMULO-2542 about scan interruption that may be helpful. Tablet.completeClose(...) calls ScanDataSource.interrupt() which sets the atomic boolean mentioned in ACCUMULO-2542 to true.

Josh Elser
added a comment - 09/Apr/14 17:28 I am writing a test for this theory.
Neat!
Current theory is that this interrupt is being caught by the HDFS library, which indirectly causes the request to the NN to hang forever.
Yeah, this is what I was getting at. I wonder if there is something we could design into SKVI or the interruption call to ensure the interruption actually propagates, so that the scan sees it and takes action. Just a thought.

Eric Newton
added a comment - 09/Apr/14 17:03 The unloader attempts to interrupt the scans. Current theory is that this interrupt is being caught by the HDFS library, which indirectly causes the request to the NN to hang forever. I am writing a test for this theory.

Josh Elser
added a comment - 09/Apr/14 16:48 the monitor could display the number of unload requests outstanding in the tserver
That would be cool. I could see the general premise being otherwise useful too.
Perhaps related: does tablet unload interrupt running scans? Or can a scan block unloads indefinitely? Perhaps the tserver should try to unload for some amount of time and, if the tablet still hasn't unloaded because a scan is running, forcefully abort the scan? That also raises the question, in the case of custom iterators: can we make something that will gracefully abort a scan using such iterators, or are we reliant on users implementing exception handling properly to avoid the "9 day query"?
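
The "try for a while, then abort" idea could be sketched roughly as follows. Everything here is an assumption for illustration: `BoundedUnload`, its `Result` values, and the `abortScans` callback are hypothetical, and the latch merely stands in for "all scans on the tablet have finished".

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not how the tserver unloads tablets. */
class BoundedUnload {
    enum Result { UNLOADED, SCAN_ABORTED, INTERRUPTED }

    /**
     * Waits up to the timeout for running scans to finish.
     * If they do not finish in time, runs the abort action instead of waiting forever.
     */
    static Result unload(CountDownLatch scansDone, Runnable abortScans,
                         long timeout, TimeUnit unit) {
        try {
            if (scansDone.await(timeout, unit)) {
                return Result.UNLOADED; // scans finished on their own
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Result.INTERRUPTED;
        }
        abortScans.run(); // escalation step: forcefully stop the scans
        return Result.SCAN_ABORTED;
    }
}
```

What `abortScans` can safely do is the open question raised above; with custom iterators it may amount to nothing more than setting a flag and hoping the iterator checks it.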

Eric Newton
added a comment - 07/Apr/14 20:53 Possible ways of detecting this problem in the future:
UnloadTabletHandler could issue a warning if a tablet does not unload
master could generate warnings about unload requests that are old
the monitor could display the number of unload requests outstanding in the tserver
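
As a rough sketch of how the detection ideas above might fit together, assuming a hypothetical `UnloadRequestTracker` (not an existing Accumulo class): record when each unload request starts, expose the outstanding count for display, and list requests older than a threshold so something can warn about them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only; names and structure are assumptions. */
class UnloadRequestTracker {
    // Tablet identifier -> time (ms) at which its unload request started.
    private final Map<String, Long> startedAt = new ConcurrentHashMap<>();

    void started(String tablet, long nowMillis) {
        startedAt.put(tablet, nowMillis);
    }

    void finished(String tablet) {
        startedAt.remove(tablet);
    }

    /** Number of unload requests still outstanding, e.g. for a monitor page. */
    int outstanding() {
        return startedAt.size();
    }

    /** Tablets whose unload request has been outstanding for longer than maxAgeMillis. */
    List<String> olderThan(long maxAgeMillis, long nowMillis) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : startedAt.entrySet()) {
            if (nowMillis - e.getValue() > maxAgeMillis) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```

A periodic task could call `olderThan(...)` with the current time and log a warning for each entry returned.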

* master failed to balance
* custom balancer refused to balance while migrations were in place
* tablet server was not unloading the tablet
* tablet server was otherwise serving tablets, providing status
* memory dump determined that there were 21K UnloadTabletHandler objects
* jstack showed UnloadTabletHandler in Tablet.completeClose, line 2674
* the last print of the debug "completeClose(safeState=true, completeClose=true)" occurred 9 days ago
* there was a query that had been running for 9 days