[replication] The replication-executor should make sure the file that it is replicating is closed before declaring success on that file

Details

Description

I have seen cases where the replication-executor would lose data to replicate since the file hasn't been closed yet. Upon closing, the new data becomes visible. Before that happens the ZK node shouldn't be deleted in ReplicationSourceManager.logPositionAndCleanOldLogs. Changes need to be made in ReplicationSource.processEndOfFile as well (currentPath related).
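The race described here can be sketched as a toy model (class and method names below are illustrative, not HBase's actual API): a ZK-tracked queue of WAL files where removal must be refused while the writer still has the file open, since entries only become visible on close.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the race: entries written to a WAL file only become visible
// to the replication reader once the file is closed, so the ZK node for a
// log must not be removed while the writer still has that file open.
class ReplicationQueueModel {
    private final Deque<String> zkQueue = new ArrayDeque<>();
    private String currentlyOpenLog; // the log the WAL writer has open, if any

    void enqueue(String log)    { zkQueue.addLast(log); }
    void markOpen(String log)   { currentlyOpenLog = log; }
    void markClosed(String log) { if (log.equals(currentlyOpenLog)) currentlyOpenLog = null; }

    // Mirrors the fix described above: refuse to drop the ZK node for a log
    // that is still being written to, because more data may appear on close.
    boolean tryRemove(String log) {
        if (log.equals(currentlyOpenLog)) {
            return false;
        }
        return zkQueue.remove(log);
    }
}
```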

Activity


Devaraj Das
added a comment - 17/Sep/12 07:04 Attaching a patch for 0.92. The main idea is that at the beginning of the main loop in the replication executor's run method, it is checked whether the file pointed to by getCurrentPath is presently in use by the WAL for writing. If so, all the methods invoked later in that iteration of the loop skip the operations that would remove the file from the ZK queue or consider the file completely replicated.
With this patch, I haven't observed failures in TestReplication.queueFailover (for the reason mentioned in the jira Description) for 100s of runs.
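The loop-top check described here can be sketched as a toy model (names are illustrative, not the real ReplicationSource code): on EOF, the current file is only treated as fully replicated once the WAL has closed it, so entries flushed at close time get read on a later pass.

```java
import java.util.List;

// Toy model of the executor's main loop per the patch description: on EOF,
// the current file is only treated as fully replicated when the WAL is no
// longer writing to it. Names are illustrative, not the real ReplicationSource.
class ExecutorLoopModel {
    int position = 0;                 // how many entries we have shipped
    boolean finishedCurrentFile = false;

    void iterate(List<String> visibleEntries, boolean fileInUseByWAL) {
        while (position < visibleEntries.size()) {
            position++;               // "replicate" one visible entry
        }
        // EOF: only consider the file done once the WAL has closed it,
        // otherwise data that becomes visible at close time would be skipped.
        if (!fileInUseByWAL) {
            finishedCurrentFile = true;
        }
    }
}
```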

Ted Yu
added a comment - 17/Sep/12 14:59 @Devaraj:
Thanks for your effort.
I got the following at compilation time:
[ERROR] /home/hduser/92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:[317,11] readAllEntriesToReplicateOrNextFile( boolean ) in org.apache.hadoop.hbase.replication.regionserver.ReplicationSource cannot be applied to ()
Do you see a similar error?

stack
added a comment - 17/Sep/12 15:45 Rather than change all new Replication invocations to take a null, why not override the Replication constructor? Your patch would be smaller.
Could there be issues with isFileInUse in multithreaded context? Should currentFilePath be an atomic reference so all threads see the changes when they happen? Do you think this an issue?
Do we have to pass in an HRegionServer instance into ReplicationSourceManager? Can it be one of the Interfaces Server or RegionServerServices? Or looking at why you need it, you want it because you want to get at HLog instance. Can we not pass this? Or better, an Interface that has isFileInUse on it?
Currently, you are passing an HRegionServer instance to ReplicationSourceManager, to which is added a public method that exposes the HRegionServer instance, on which we invoke the getWAL method to call isFileInUse. We're adding a bit of tangle.
Otherwise, I love the fact that you are figuring bugs and fixes in replication just using the test. Painful I'd imagine. Great work.

Devaraj Das
added a comment - 17/Sep/12 17:08 Ted Yu Not sure why you got a compilation error. Will look..
stack Thanks for the detailed comments. Here are the responses.
Rather than change all new Replication invocations to take a null, why not override the Replication constructor? Your patch would be smaller.
I had considered that, but it didn't seem that adding a new constructor was justified in the long run. There are probably no consumers of the constructor outside HBase, and adding another constructor means more code to maintain. So although it makes the patch bigger, I think it's okay.
Could there be issues with isFileInUse in multithreaded context? Should currentFilePath be an atomic reference so all threads see the changes when they happen? Do you think this an issue?
There shouldn't be any multithreading issues here. Each ReplicationExecutor thread has its own copy of everything (including currentFilePath), and the getters/setters are in the same thread context.
Do we have to pass in an HRegionServer instance into ReplicationSourceManager? Can it be one of the Interfaces Server or RegionServerServices? Or looking at why you need it, you want it because you want to get at HLog instance. Can we not pass this? Or better, an Interface that has isFileInUse on it?
Yes, I tried to pass the HLog instance to Replication's constructor call within HRegionServer. But the code is kind of tangled up. HRegionServer instantiates a Replication object (in setupWALAndReplication). HLog is instantiated in instantiateHLog, and the constructor of HLog invokes rollWriter. If the Replication object was not registered prior to rollWriter call, things don't work (which means the Replication object needs to be constructed first but the HLog instance is not available yet). I tried fixing it but then I ran into other issues...
But yeah, I like the interface idea. Will try to refactor the code in that respect.

Jean-Daniel Cryans
added a comment - 17/Sep/12 17:57 My understanding of this patch is that it reduces the race condition but it still leaves a small window eg you can take the "fileNotInUse" snapshot, get "false", and the moment after that the log could roll. If this is correct, I'm not sure it's worth the added complexity.
It seems to me this is a case where we'd need to lock HLog.cacheFlushLock for the time we read the log to be 100% sure log rolling doesn't happen. This has multiple side effects like delaying flushes and log rolls for a few ms while replication is reading the log. It would also require having a way to get to the WAL from ReplicationSource.
<blue skying>While I'm thinking about this, it just occurred to me that when we read a log that's not being written to then we don't need the open/close file dance since the new data is already available. Possible optimization here!</blue skying>
Anyways, one solution I can think of that doesn't involve leaking HRS into replication would be giving the log a "second chance". Basically if you get an EOF, flip the secondChance bit. If it's on then you don't get rid of that log yet. Reset the bit when you loop back to read, now if there was new data added you should get it else go to the next log.
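The "second chance" idea can be sketched as a small state machine (hypothetical names, not actual HBase code): on an EOF with no new data, flip the bit and keep the log; only a second consecutive empty EOF lets the log be given up, so data synced at close gets one more pass.

```java
// Sketch of the "second chance" bit with hypothetical names. onEOF is called
// each time the reader hits end-of-file on the current log.
class SecondChance {
    private boolean secondChance = false;

    // Returns true when it is safe to move on to the next log.
    boolean onEOF(boolean newDataAppeared) {
        if (newDataAppeared) {
            secondChance = false; // the re-read found entries: keep reading
            return false;
        }
        if (!secondChance) {
            secondChance = true;  // first empty EOF: hold on to the log
            return false;
        }
        return true;              // second consecutive empty EOF: move on
    }
}
```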

Devaraj Das
added a comment - 17/Sep/12 20:39 Jean-Daniel Cryans Thanks for looking. Responses below.
My understanding of this patch is that it reduces the race condition but it still leaves a small window eg you can take the "fileNotInUse" snapshot, get "false", and the moment after that the log could roll. If this is correct, I'm not sure it's worth the added complexity.
I don't think there is ever that window. The replication executor thread picks up a path that the LogRoller puts in the replicator's queue BEFORE the log roll happens (and the HLog constructor puts the first path there before the replication executor starts). The replication executor is always trailing, so when the HLog guy says that a path is not in use (being written to), it seems to me a fact that it indeed is not being written to and any writes that ever happened were in the past. Also note that the currentPath is reset AFTER a log roll, which is kind of delayed.
It seems to me this is a case where we'd need to lock HLog.cacheFlushLock for the time we read the log to be 100% sure log rolling doesn't happen. This has multiple side effects like delaying flushes and log rolls for a few ms while replication is reading the log. It would also require having a way to get to the WAL from ReplicationSource.
Yeah, I tried my best to avoid taking that crucial lock!
Anyways, one solution I can think of that doesn't involve leaking HRS into replication would be giving the log a "second chance". Basically if you get an EOF, flip the secondChance bit. If it's on then you don't get rid of that log yet. Reset the bit when you loop back to read, now if there was new data added you should get it else go to the next log.
I considered some variant of this. However, I gave it up and took a more conservative approach - make sure that the replication-executor thread gets at least one pass at a CLOSED file. All other solutions seemed incomplete to me and prone to races...
stack forgot to answer one of your previous questions.
Should currentFilePath be an atomic reference so all threads see the changes when they happen?
I think volatile suffices for the use case here.
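The visibility argument can be sketched as follows; field and method names are illustrative. A volatile reference suffices here because the log-roll side only publishes a new path and the executor side only reads it, with no compound check-then-act on the field itself that would call for an AtomicReference.

```java
// Sketch of the point above: currentFilePath is written on the log-roll path
// and read by the executor thread, so a volatile reference gives the needed
// cross-thread visibility. Names are illustrative.
class SourceState {
    private volatile String currentFilePath;

    void setCurrentPath(String path) { currentFilePath = path; } // roller side
    String getCurrentPath()          { return currentFilePath; } // executor side
}
```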

Jean-Daniel Cryans
added a comment - 17/Sep/12 21:20 I see, all that double-negation (e.g. !fileNotInUse) confused me.
So in layman's terms, your patch short circuits all the checks to change the current path if we know for sure that the file we are replicating from is being written to. The side effect is that we won't quit the current file unless it has aged right?
The replication executor is always trailing, and so when the HLog guy says that a path is not in use (being written to), it seems to me a fact that it indeed is not being written to and any writes that ever happened was in the past.
FWIW that might not be totally true, at least in 0.94 HLog.postLogRoll is called before HLog.cleanupCurrentWriter which does issue a sync().

Devaraj Das
added a comment - 17/Sep/12 22:44 I see, all that double-negation (eg !fileNotInUse) confused me
Sorry about that. I'll see if I can change it to single negation
So in layman's terms, your patch short circuits all the checks to change the current path if we know for sure that the file we are replicating from is being written to. The side effect is that we won't quit the current file unless it has aged right?
Yes ..
FWIW that might not be totally true, at least in 0.94 HLog.postLogRoll is called before HLog.cleanupCurrentWriter which does issue a sync().
I don't get this, JD. Could you please clarify a bit more? Given the fact that the currentPath would be updated only after the call to cleanupCurrentWriter, I don't see a difference in the behavior between 0.92 and 0.94... (maybe I am missing something though).

Devaraj Das
added a comment - 18/Sep/12 00:30 Otherwise, I love the fact that you are figuring bugs and fixes in replication just using the test. Painful I'd imagine. Great work.
Thanks, Stack. Yes, I have burnt some midnight oil on these issues. Fun though.

Devaraj Das
added a comment - 18/Sep/12 17:55 Ted Yu Hey thanks for taking the patch for a spin.
Talk about races! Here it seems like the splitter didn't complete within the expected time, and the replication didn't happen for some data.
Here are the relevant log snippets (look for "considering dumping" where the file got dropped before the splitter completed). But in this case, the issue can be addressed by increasing the number of retries (which is already configurable). The patch attached here doesn't attempt to solve this problem.
2012-09-17 18:13:03,665 WARN [ReplicationExecutor-0.replicationSource,2-sea-lab-0,41831,1347930742751] regionserver.ReplicationSource(555): 2-sea-lab-0,41831,1347930742751 Got:
java.io.IOException: File from recovered queue is nowhere to be found
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:537)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:304)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:41196/user/hduser/hbase/.oldlogs/sea-lab-0%2C41831%2C1347930742751.1347930771911
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:58)
at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:166)
at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:689)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:503)
... 1 more
2012-09-17 18:13:03,665 WARN [ReplicationExecutor-0.replicationSource,2-sea-lab-0,41831,1347930742751] regionserver.ReplicationSource(559): Waited too long for this file, considering dumping
2012-09-17 18:13:03,665 INFO [ReplicationExecutor-0.replicationSource,2-sea-lab-0,41831,1347930742751] regionserver.ReplicationSourceManager(365): Done with the recovered queue 2-sea-lab-0,41831,1347930742751
2012-09-17 18:13:04,305 DEBUG [main-EventThread] wal.HLogSplitter(657): Archived processed log hdfs://localhost:41196/user/hduser/hbase/.logs/sea-lab-0,41831,1347930742751-splitting/sea-lab-0%2C41831%2C1347930742751.1347930771911 to hdfs://localhost:41196/user/hduser/hbase/.oldlogs/sea-lab-0%2C41831%2C1347930742751.1347930771911
2012-09-17 18:13:04,306 INFO [main-EventThread] master.SplitLogManager(392): Done splitting /1/splitlog/hdfs%3A%2F%2Flocalhost%3A41196%2Fuser%2Fhduser%2Fhbase%2F.logs%2Fsea-lab-0%2C41831%2C1347930742751-splitting%2Fsea-lab-0%252C41831%252C1347930742751.1347930771911

Devaraj Das
added a comment - 21/Sep/12 18:01 stack I have already responded to Ted's comment. In summary, the problem is that the log-splitter couldn't complete its work soon enough, and hence the file wasn't moved to .oldlogs soon enough. The replicator did the maxRetries and gave up. So this is a different issue (and maybe solved by increasing the value of maxRetries in the config.)

Jean-Daniel Cryans
added a comment - 03/Oct/12 18:33 I really don't like that we have to pass down another instance of HRS (through RegionServerServices). The fact that we're now doing this:
- new Replication(this, this.fs, logdir, oldLogDir) : null;
+ new Replication(this, this.fs, logdir, oldLogDir, this) : null;
is making me sad. Also it leaks all over the code. It seems to me that there should be another way to handle this just in ReplicationSource.
At the moment I'd be +1 for commit only to trunk and on commit this logging will need to cleaned up:
LOG.info("File " + getCurrentPath() + " in use");
Is that ok with you, Devaraj Das?

Devaraj Das
added a comment - 03/Oct/12 19:13 Thanks, Jean-Daniel Cryans for looking at the patch. Actually, upon looking at the RegionServerServices interface closely, I see that it extends the Server interface. So the problem you pointed out could be addressed by making the affected constructors and methods (the ones that I changed to have the new RegionServerServices argument) to have only RegionServerServices instead of Server/Stoppable instances.
Will submit a patch soon. Hope that will look better.

Devaraj Das
added a comment - 03/Oct/12 20:30 Can we not pass down RegionServerServices? Can we pass a narrow Interface instead?
I think we can (I can pull out the getWAL() method from the interface RegionServerServices into a new interface and have RegionServerServices extend that..). But in that case we will pass two instances of HRS still (as pointed out by JD earlier). But thinking about it, that probably makes downstream methods' abstractions cleaner (when compared with the approach of having them accept a fat interface).

Devaraj Das
added a comment - 03/Oct/12 23:43 In the trunk case, I think something better can be done (and the interface changes can be avoided). Replication.postLogRoll could do the enqueue of the new path in the ReplicationSource's queue. The Replication.preLogRoll would do everything else (creating ZK entries, etc.) except the enqueuing of the path in the queue..
The postLogRoll is currently called before the writer is reset (to nextWriter) in FSHLog.rollWriter. I propose that it be called after the writer is reset. That seems to me a more precise place for calling postLogRoll.
Thoughts?
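The proposed ordering can be sketched as an event trace (a toy model with illustrative event names): preLogRoll still registers the new log in ZK before any writes, but postLogRoll, the enqueue for replication, moves to after the old writer is closed, so the executor can never pick up the new path while the old file still has unsynced data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy event trace of the proposed FSHLog.rollWriter ordering. Event names
// are illustrative; the real methods are named in the discussion above.
class RollOrderModel {
    final List<String> events = new ArrayList<>();

    void rollWriter() {
        events.add("preLogRoll");     // new log registered in ZK first
        events.add("createWriter");   // new file exists before any append
        events.add("closeOldWriter"); // cleanupCurrentWriter: close + sync
        events.add("postLogRoll");    // moved here: enqueue only after close
    }
}
```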

Devaraj Das
added a comment - 04/Oct/12 00:06 In case it is not clear what's the deal with delaying the enqueueing of the new WAL file, the problem described in this jira happens because the new WAL file is enqueued too early (before the last WAL file is closed).

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

Devaraj Das
added a comment - 04/Oct/12 23:30 Attaching a more complete patch based on the new approach. The changes made to ReplicationSource.java in the earlier patch to do with "fileInUse" are still needed here (though I renamed the variable "fileInUse" to be more reflective of what it is actually used for).

Jean-Daniel Cryans
added a comment - 10/Oct/12 04:30 The last time I played around with postLogRoll in HBASE-3515, I found that we must ensure that we have that log up in ZK before we start writing to it, because it would be possible for writers to append and at the same time not be able to add the log in ZK because the session timed out.
The current code blocks talking to ZK.

Devaraj Das
added a comment - 10/Oct/12 05:44 Hey Jean-Daniel Cryans , this patch doesn't change that behavior at all (new log is put up in ZK before the log is being written to, and blocks talking to ZK..). This patch only changes the postLogRoll placement and that deterministically ensures the previous log file is really closed before enqueuing the new log for replication. The code changes in the replicator thread (ReplicationSource.java) makes sure that the entire iteration of the loop "sees" a closed log file at least once (and hence takes care of the problem reported in the jira).

Jean-Daniel Cryans
added a comment - 10/Oct/12 18:03 Ah I see, I didn't fully grok the new preRoll/postRoll dance in my first review. That's clever.
My one last concern before committing would be what happens when we are able to compute a new HLog name and put it up in ZK, but then fail to create the HLog and the RS dies. Will the recovered queue hang or will it abandon that HLog? FWIW there's another jira regarding that problem but this could be a new failure case.

Devaraj Das
added a comment - 10/Oct/12 21:11 Ah I see, I didn't fully grok the new preRoll/postRoll dance in my first review. That's clever.
Cool. Thanks for taking a pass at this.
Will the recovered queue hang or will it abandon that HLog? FWIW there's another jira regarding that problem but this could be a new failure case.
The change made to the placement of the postLogRoll call in the patch will not affect recovered queues. It will only affect files that the RS in question is creating itself. The changes in ReplicationSource.java only take effect for non-recovered files (there is a check of !this.queueRecovered before setting currentWALisBeingWrittenTo to true). So I think we are covered (please let me know if I missed something or misunderstood your concern).
I'll submit a patch shortly with the nits pointed out by Ted Yu fixed.

Jean-Daniel Cryans
added a comment - 10/Oct/12 21:31 please let me know if I missed something or misunderstood your concern
Consider this scenario. First this runs:
Path newPath = computeFilename();
Then with your patch we add this file in ZK during:
i.preLogRoll(oldPath, newPath);
Now let's say HDFS becomes unavailable or the RS fails and never gets to this line:
HLog.Writer nextWriter = this.createWriterInstance(fs, newPath, conf);
You end up with a log tracked in ZK that doesn't exist. This RS's queue will be recovered by another RS that will eventually try to read from that non-existing file. My concern is how we're going to treat that file.

Devaraj Das
added a comment - 10/Oct/12 21:49 Jean-Daniel Cryans, this sequence of events could happen currently too, couldn't it? The lines of code that I moved have to do with postLogRoll, which happens after the sequence you are talking about. This problem exists with or without this patch.
You end up with a log tracked in ZK that doesn't exist. This RS's queue will be recovered by another RS that will eventually try to read from that non-existing file. My concern is how we're going to treat that file.
To answer your question, I think the RS that picks this queue up will dump the file after a couple of retries (since the file doesn't exist and will never show up in the recovered logs directory).

Jean-Daniel Cryans
added a comment - 10/Oct/12 23:00 The lines of code that I moved are to do with postLogRoll which happens after the sequence that you are talking about. This problem exists with/without this patch.
I disagree. Right now we add the log in ZK under postLogRoll() and createWriterInstance will run before that so the file should exist at least.
I think the RS that picks this queue up will dump the file after a couple of retries
Yeah the fact that it's the last file and that the multiplier would go to the max and that it's a recovered queue should take care of that.
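The give-up behavior being discussed can be sketched as a small policy object (illustrative names, not the real openReader logic): a missing file is retried, and only a recovered queue is allowed to dump it once maxRetries is reached, so a live RS's own current file is never abandoned this way.

```java
// Sketch of the "waited too long, considering dumping" behavior from the
// log snippet above. Names are illustrative; maxRetries stands in for the
// configurable retry count mentioned in the discussion.
class MissingFilePolicy {
    private final int maxRetries;
    private int attempts = 0;

    MissingFilePolicy(int maxRetries) { this.maxRetries = maxRetries; }

    // Returns true when the file should be abandoned.
    boolean onFileNotFound(boolean queueRecovered) {
        attempts++;
        // only a recovered queue may dump a file that never shows up
        return queueRecovered && attempts >= maxRetries;
    }
}
```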

Devaraj Das
added a comment - 10/Oct/12 23:20 I disagree. Right now we add the log in ZK under postLogRoll() and createWriterInstance will run before that so the file should exist at least.
Ah, and oops! I forgot that I changed the code so that preLogRoll is no longer ignored in the replication handler. Sorry, all along I was thinking about the change in the placement of the call to postLogRoll. So yes, it could happen that the logfile is up in ZK before the file exists, but it appears (as we just discussed) that the issue would take care of itself (the RS that picks this file up would dump it after some retries).