[replication] ReplicationSource can miss a log after RS comes out of GC

Description

This is from Hudson build 1738. If a log is about to be rolled and the ZK connection is already closed, the replication code will fail to add the new log in ZK, but the log will still be rolled, and it's possible that some edits will make it into that log.

Activity

Jean-Daniel Cryans
added a comment - 08/Feb/11 19:24 This patch simply adds a check for whether the region server is going down before closing the file. In the replication case that fixes the issue, since a replication failure sets that flag to true.
The issue with throwing an exception from i.logRolled(newPath) is that there are potentially many listeners, so throwing midway would mean having to implement a rollback. Nothing at the moment requires that level of complexity.
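
To make the rollback concern concrete, here is a minimal sketch of what happens when one listener throws partway through the loop. `RollListener` and `PartialRollDemo` are hypothetical stand-ins written for this illustration, not the actual HBase types; only the `logRolled(newPath)` call shape comes from the comment above.

```java
import java.util.List;

// Hypothetical stand-in for the listener interface; not HBase's actual type.
interface RollListener {
    void logRolled(String newPath) throws Exception;
}

class PartialRollDemo {
    // Notifies listeners in order and stops at the first failure.
    // Returns how many listeners had already accepted the new log.
    static int notifyListeners(List<RollListener> listeners, String newPath) {
        int accepted = 0;
        for (RollListener i : listeners) {
            try {
                i.logRolled(newPath);
                accepted++;
            } catch (Exception e) {
                // Every listener counted in 'accepted' has already recorded
                // newPath. Undoing that would need a rollback hook, which
                // this interface does not have.
                return accepted;
            }
        }
        return accepted;
    }
}
```

With three listeners where the second one throws, the first has already recorded the new path and the third never hears about it, which is the inconsistent state a rollback would have to repair.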

Jean-Daniel Cryans
added a comment - 14/Feb/11 23:11 I mistakenly thought that hlog.closed was set when the region server was on its way out, but I was wrong: it's only set when the HLog is closing. I need to find an alternative.

Jean-Daniel Cryans
added a comment - 15/Mar/11 00:35 I'm thinking of a more radical way of solving this issue. The problem is that the RS is able to roll a log even though we have already lost our session, so I'm thinking we should call fs.close() from inside HRS.abort(), thus preventing any other call from reaching HDFS. The downside is that it's going to make a big BOOOM, and every call to close regions will fail in the ugliest fashion.
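
A rough sketch of that idea, under heavy assumptions: `ClosableFs` below is a toy stand-in for the filesystem handle (it is not Hadoop's FileSystem), and `AbortClosesFs` is a hypothetical stand-in for the region server. The point is only that once abort has closed the handle, a later roll attempt cannot create a new file.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy stand-in for a filesystem handle; every call fails once it is closed.
class ClosableFs {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    void close() {
        closed.set(true);
    }

    String create(String path) throws IOException {
        if (closed.get()) {
            throw new IOException("Filesystem closed");
        }
        return path;
    }
}

// Hypothetical stand-in for the region server.
class AbortClosesFs {
    private final ClosableFs fs;

    AbortClosesFs(ClosableFs fs) {
        this.fs = fs;
    }

    // Closing the handle first means nothing after this point reaches storage.
    void abort() {
        fs.close();
    }

    // Returns true only if the new log file could be created.
    boolean tryRoll(String newPath) {
        try {
            fs.create(newPath);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

The same gate that blocks the late roll also blocks every other caller sharing the handle, which is the "big BOOOM" downside mentioned above.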

Jean-Daniel Cryans
added a comment - 25/Oct/11 23:19 To reiterate the problem: it's possible to fail to add an HLog for replication if the session has expired when the log is rolled. HLog currently doesn't get any feedback from the WALActionListeners, even if they fail at doing their job.
One way of fixing it would be to throw an exception and stop the log rolling, but if there are many listeners, some may already have processed the addition of the log. We could also kill the region server, plain and simple, if it happens.
I'm in favor of the latter.
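
The second option could look roughly like the sketch below. All names here (`WalListener`, `ServerAbort`, `AbortingRoller`) are hypothetical and chosen for illustration; this is not the committed fix, just one way to give the roller feedback from its listeners and abort on the first failure instead of rolling back.

```java
import java.util.List;

// Hypothetical listener interface that is allowed to report failure.
interface WalListener {
    void logRolled(String newPath) throws Exception;
}

// Hypothetical hook for taking the whole server down.
interface ServerAbort {
    void abort(String why, Throwable cause);
}

class AbortingRoller {
    private final List<WalListener> listeners;
    private final ServerAbort server;

    AbortingRoller(List<WalListener> listeners, ServerAbort server) {
        this.listeners = listeners;
        this.server = server;
    }

    // Returns true if every listener accepted the new log.
    // On the first failure it aborts the server and returns false,
    // so no rollback of earlier listeners is attempted.
    boolean rollTo(String newPath) {
        for (WalListener l : listeners) {
            try {
                l.logRolled(newPath);
            } catch (Exception e) {
                server.abort("Listener failed to register " + newPath, e);
                return false;
            }
        }
        return true;
    }
}
```

Aborting sidesteps the partial-notification problem: listeners that already accepted the log do not need to be undone, because the server that owns them is going away.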

Lars Francke
added a comment - 20/Nov/15 12:42 This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).