Description

BlockReceiver.receivePacket calls replicaInfo.setBytesOnDisk() before calling flush(). If there is a concurrent reader, this is a race: the reader can see the new length while those bytes are still sitting in BlockReceiver's buffers, so the client may see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not yet stable.
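The safe ordering implied here can be sketched as follows. This is a simplified illustration of the fix being discussed (flush first, then publish the length together with the last chunk's checksum), not the actual BlockReceiver code; the Replica class, the receivePacket signature, and the log parameter are hypothetical stand-ins:

```java
public class ReplicaPublishSketch {

    /** Stand-in for ReplicaBeingWritten: the length and last chunk CRC move together. */
    static class Replica {
        private long bytesOnDisk;
        private byte[] lastChunkChecksum = new byte[0];

        synchronized void setLastChecksumAndDataLen(long len, byte[] crc) {
            this.bytesOnDisk = len;
            this.lastChunkChecksum = crc;
        }

        synchronized long getBytesOnDisk() { return bytesOnDisk; }
    }

    /** Simulated receive path: flush to disk BEFORE publishing the new length. */
    static long receivePacket(Replica r, byte[] packet, byte[] crc, StringBuilder log) {
        log.append("flush;");                      // 1. the packet's bytes hit disk
        long newLen = r.getBytesOnDisk() + packet.length;
        r.setLastChecksumAndDataLen(newLen, crc);  // 2. only then may readers see them
        return newLen;
    }

    public static void main(String[] args) {
        Replica r = new Replica();
        StringBuilder log = new StringBuilder();
        System.out.println(receivePacket(r, new byte[512], new byte[]{1}, log));
    }
}
```

With the buggy ordering (publish, then flush), a reader that observes the new length between the two steps would read past the flushed data, which is exactly the checksum-error/EOF symptom described above.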


Todd Lipcon
added a comment - 19/May/11 23:08 aha! I think I understand what's going on here!
The test has a thread which continually re-opens the file which is being written to. Since the file's in the middle of being written, it makes an RPC to the DataNode in order to determine the visible length of the file. This RPC is authenticated using the block token which came back in the LocatedBlocks object as the security ticket.
When this RPC hits the IPC layer, it looks at its existing connections and sees none that can be re-used, since the block token differs between the two requesters. Hence, it reconnects, and we end up with hundreds or thousands of IPC connections to the datanode.
This also explains why Sam doesn't see it on his 0.20 append branch – there are no block tokens there, so the RPC connection is getting reused properly.
I'll file another JIRA about this issue.
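The reuse failure Todd describes can be illustrated with a toy connection cache. ConnectionId below is loosely modeled on the IPC layer's cache key; the class and its methods are hypothetical, not the real Hadoop client code:

```java
import java.util.HashMap;
import java.util.Map;

public class ConnectionCacheSketch {

    /** Cache key modeled on IPC's ConnectionId: remote address plus security ticket. */
    record ConnectionId(String address, String ticket) {}

    private final Map<ConnectionId, Object> connections = new HashMap<>();

    /** Returns a cached connection for this (address, ticket), or creates a new one. */
    Object getConnection(String address, String ticket) {
        return connections.computeIfAbsent(
            new ConnectionId(address, ticket), id -> new Object());
    }

    int openConnections() { return connections.size(); }

    public static void main(String[] args) {
        ConnectionCacheSketch cache = new ConnectionCacheSketch();
        // Same DN address, but a fresh block token per re-open of the file:
        for (int i = 0; i < 1000; i++) {
            cache.getConnection("127.0.0.1:50020", "blockToken-" + i);
        }
        System.out.println(cache.openConnections());
    }
}
```

Because the ticket is part of the key, every re-open with a new block token misses the cache and opens a fresh connection to the same datanode, which matches the hundreds of ESTABLISHED sockets seen in the test.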

Todd Lipcon
added a comment - 19/May/11 22:30 Actually, it looks like the leak in 7146 "pushed this over the edge". But, even with that patch, if I "lsof" the java process as it runs, I see it hit 800 or so localhost TCP connections in ESTABLISHED state while running this test case. So, needs more investigation yet.

sam rash
added a comment - 19/May/11 19:14 if it helps, there is only ever 1 writer + 1 reader in the test. 1 reader 'tails' by opening and closing the file repeatedly, up to 1000 times (hence exposing socket leaks in the past)

Tsz Wo Nicholas Sze
added a comment - 19/May/11 19:09 > I believe this test mostly fails on the build infrastructure ...
It seems that the machines on the build infrastructure are slow/old/heavily loaded. Tests may fail there more easily than locally, so choosing the test parameters, e.g. how many concurrent writers, becomes non-trivial.

sam rash
added a comment - 19/May/11 02:45 Todd: on a different issue, one test in here that looks suspicious is the testImmediateReadOfNewFile.
It repeatedly opens and closes a file right away, requiring 1k successful opens (at least in our copy)

Todd Lipcon
added a comment - 19/May/11 02:28 I believe this test mostly fails on the build infrastructure and not on our local boxes. So that limits the number of people who can investigate. I will try to take a look by running tests there.

sam rash
added a comment - 19/May/11 01:21 i assume a similar problem as before. The problem was that code that opened RPC proxies to DNs did not get closed in a finally block. The test failure output indicates a socket/fd leak ("Too many open files").
https://issues.apache.org/jira/browse/HDFS-1310
the test was succeeding 8 months ago, 2010-09-10, so I'd look at commits that came after that.

Tsz Wo Nicholas Sze
added a comment - 19/May/11 01:09 Hey Sam, do you know exactly what the "underlying problem" is, or is it just a hypothesis?
Could you work on a better one? It seems that no one is working on it, and the test has been failing for almost a year. So I suggest removing it. It would be great if you could improve it.

sam rash
added a comment - 19/May/11 00:53 the test opens/closes files for read/write. that exposed a slow leak last time.
I suggest anyone concerned with resource leaks in hadoop investigate. we don't see the test failure in our open-sourced 0.20 fork
removing the test is an option; or coming up with a better one (this was my first hdfs feature + test).

Tsz Wo Nicholas Sze
added a comment - 19/May/11 00:46 Hi Sam, I agree that there are bugs in Hadoop, as in any other software. However, I don't see any value in having a test fail every build while not fixing the so-called "underlying problem".
BTW, it is questionable whether the test is valid. Are we supporting that many concurrent readers? How was the number chosen? Machines have finite resources. It is definitely easy to create extreme tests and make them fail.

sam rash
added a comment - 19/May/11 00:23 the last time I debugged the test failure, it exposed a socket/fd leak in a completely unrelated part of the code. The test failing here also has nothing to do with the added feature--because it closes/opens files in rapid succession, it is prone to expose resource leaks.
Removing this test (or feature) won't take away the underlying problem that should be looked at.

Tsz Wo Nicholas Sze
added a comment - 18/May/11 23:26 TestFileConcurrentReader has a long history of failing. There are multiple JIRAs: HDFS-1401, HDFS-1310, HDFS-1679, HDFS-1885, but the problem is still not fixed. I suggest simply removing it since no one is maintaining it.

Tanping Wang
added a comment - 31/Aug/10 02:26 Hi, Sam, I have just commented on HDFS-1310. Would you please take a look at the Hudson results for reference:
https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorTransferToVerySmallWrite/
https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorNormalTransfer/
https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite/
Thanks!

sam rash
added a comment - 25/Jun/10 19:44 -returns 0 length only if all DNs are missing the replica (any other io exception will cause client to get exception and it can retry)
-my diff viewer does not show any whitespace or indentation changes, but please advise if you see any

sam rash
added a comment - 25/Jun/10 19:09 per our offline discussion, it seems the NN doesn't know when the pipeline is created, but the writer does, so the NN has to return the replicas for the current block in this case.
I will change it so we check all DNs for a replica before using the default of 0. I need to think about whether we require all DNs to throw ReplicaNotFound (versus some other exception).
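The check-all-DNs policy sam describes might look like the following sketch. Every name here (VisibleLengthSketch, LengthRpc) is illustrative, and the real client deals in IOExceptions; unchecked exceptions are used only to keep the sketch small:

```java
import java.util.List;

public class VisibleLengthSketch {

    static class ReplicaNotFoundException extends RuntimeException {}

    /** Per-datanode RPC stand-in. */
    interface LengthRpc { long call(String dn); }

    /**
     * Asks each DN in turn for the visible length. Returns 0 only if every
     * DN threw ReplicaNotFoundException; any other exception propagates so
     * the client can retry.
     */
    static long getVisibleLength(List<String> datanodes, LengthRpc rpc) {
        for (String dn : datanodes) {
            try {
                return rpc.call(dn);       // first DN that has the replica wins
            } catch (ReplicaNotFoundException e) {
                // keep trying the remaining replicas
            }
        }
        if (datanodes.isEmpty()) {
            throw new IllegalStateException("no datanodes supplied");
        }
        return 0;                          // brand-new file: no DN has data yet
    }

    public static void main(String[] args) {
        long len = getVisibleLength(List.of("dn1", "dn2", "dn3"),
                                    dn -> { throw new ReplicaNotFoundException(); });
        System.out.println(len);
    }
}
```

Note that a non-ReplicaNotFound exception from any DN is deliberately not caught, matching the "any other io exception will cause client to get exception and it can retry" semantics.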

Hairong Kuang
added a comment - 25/Jun/10 19:08 Yeah, NN does not know the block construction stages.
> can you clarify what you want?
The patch returns 0 if one datanode throws ReplicaNotFoundException. Should it check all replicas before returning 0?

sam rash
added a comment - 25/Jun/10 18:17 sorry, i don't understand. this is a race condition where the namenode has assigned locations to the block, but the client hasn't sent data yet. the NN cannot know that the DNs don't have data on disk yet unless we add additional NN coordination? our choice in this condition is to return 0 or let the exception be. I had done the latter, but you asked for the former unless I misunderstood.
can you clarify what you want?

Hairong Kuang
added a comment - 25/Jun/10 17:54
> translating a ReplicaNotFoundException into a 0-length block within DFSInputStream for under construction files
Should the code check all replicas before returning 0?
Ideally NN should not return block locations to readers if the block construction is still in the pipeline construction stage.

sam rash
added a comment - 23/Jun/10 00:59 includes requested changes by hairong. also handles immediate reading of new files by translating a ReplicaNotFoundException into a 0-length block within DFSInputStream for under construction files

sam rash
added a comment - 22/Jun/10 23:34 i have an updated patch, but it does not yet handle the missing replicas as 0-sized for under construction. there may be other 0.20 patches to port to make this happen.

sam rash
added a comment - 18/Jun/10 19:43 So removing setBytesOnDisk() means:
if (replicaInfo instanceof ReplicaBeingWritten) {
((ReplicaBeingWritten) replicaInfo)
.setLastChecksumAndDataLen(offsetInBlock, lastChunkChecksum);
}
replicaInfo.setBytesOnDisk(offsetInBlock);
will not have the latter. So all other implementations of Replica will have a valid value for getBytesOnDisk()?
Does this also mean that the impl of getBytesOnDisk for ReplicaInPipeline will move to ReplicaBeingWritten ?

Hairong Kuang
added a comment - 16/Jun/10 20:52 > they aren't guaranteed to be since there are methods to change the bytesOnDisk separate from the lastCheckSum bytes.
I do not see any place that updates bytesOnDisk except for BlockReceiver. That's why I suggested to remove setBytesOnDisk from ReplicaInPipeline etc.

sam rash
added a comment - 16/Jun/10 20:01 1. they aren't guaranteed to be since there are methods to change the bytesOnDisk separate from the lastCheckSum bytes. It's entirely conceivable that something could update the bytes on disk w/o updating the lastChecksum with the current set of methods
If we are ok with a loosely coupled guarantee, then we can use bytesOnDisk and be careful never to call setBytesOnDisk() for any RBW
2. oh--your previous comments indicated we shouldn't change either ReplicaInPipelineInterface or ReplicaInPipeline. If that's not the case and we can do this, then my comment above doesn't hold. we use bytesOnDisk and guarantee it's in sync with the checksum in a single synchronized method (I like this)
3. will make the update to treat missing last blocks as 0-length and re-instate the unit test.
thanks for all the help on this
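Option 2 above, a single synchronized update so bytesOnDisk and the last checksum stay in lockstep, can be sketched like this. ChunkChecksum and the simplified fields are illustrative; only the method names getLastChecksumAndDataLen/setLastChecksumAndDataLen follow the thread:

```java
import java.util.Arrays;

public class RbwSketch {

    /** Immutable snapshot so callers always get a matching (length, crc) pair. */
    record ChunkChecksum(long dataLength, byte[] checksum) {}

    static class ReplicaBeingWritten {
        private long bytesOnDisk;
        private byte[] lastChecksum = new byte[0];

        /** Single synchronized writer: the length can never outrun its checksum. */
        synchronized void setLastChecksumAndDataLen(long dataLength, byte[] crc) {
            this.bytesOnDisk = dataLength;
            this.lastChecksum = crc.clone();
        }

        /** Single synchronized reader: no torn (length from one write, crc from another). */
        synchronized ChunkChecksum getLastChecksumAndDataLen() {
            return new ChunkChecksum(bytesOnDisk, lastChecksum.clone());
        }
    }

    public static void main(String[] args) {
        ReplicaBeingWritten rbw = new ReplicaBeingWritten();
        rbw.setLastChecksumAndDataLen(1024, new byte[]{42});
        ChunkChecksum snap = rbw.getLastChecksumAndDataLen();
        System.out.println(snap.dataLength() + " " + Arrays.toString(snap.checksum()));
    }
}
```

The point of returning a snapshot object rather than exposing two getters is exactly the "loosely coupled guarantee" concern in point 1: two separate reads could interleave with a concurrent write.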

Hairong Kuang
added a comment - 16/Jun/10 19:52 Sam, the patch is in good shape. Thanks for working on this. A few minor comments:
1. ReplicaBeingWritten.java: dataLength and bytesOnDisk are the same, right? We do not need to introduce another field dataLength. I am also hesitant to declare dataLength & lastChecksum as volatile. Accesses to them are already synchronized, and the normal case is writing without reading.
2. We probably should remove setBytesOnDisk from ReplicaInPipelineInterface & ReplicaInPipeline.
> In 0.20, I made it so that client just treats this as a 0-length file. one of our internal tools saw this rather frequently in 0.20.
Good to know this. Then could you please handle this case in the trunk the same as well? Thanks again, Sam.

sam rash
added a comment - 07/Jun/10 22:11 1. endOffset is either bytesOnDisk or the chunkChecksum.getDataLength()
2. if tmpLen >= endOffset && this is a write in progress, use the in-memory checksum (else this is a finalized block not ending on a chunk boundary)
3. fixed up whitespace

Hairong Kuang
added a comment - 07/Jun/10 21:10 The append design document discussed this scenario, in which it says it is OK to throw an exception because the chance of hitting this is low. But the semantics you made in 0.20 seems fine to me too.

sam rash
added a comment - 07/Jun/10 20:44 I am addressing the last comments. I have one more question, though, as I have one test that it still fails and I want to see what you think the expected behavior should be:
immediate read of a new file:
1. writer creates a file and starts to write and hence blocks are assigned in the NN
2. a reader gets these locations and contacts DN
3. DN has not yet put the replica in the volumeMap and FSDataset.getVisibleLength() throws a MissingReplicaException
In 0.20, I made it so that client just treats this as a 0-length file. What should the behavior in trunk be?

Hairong Kuang
added a comment - 07/Jun/10 18:47 The new patch is in good shape! More comments are listed below:
1. please remove all unnecessary indentation changes;
2. BlockSender: endOffset = chunkChecksum.getDataLength(); this may cause an NPE if the replica being read is not an RBW;
3. BlockSender: if tmpLen >= endOffset, it might be because the replica is a finalized replica and the reader is trying to read to the end of the file, which is not aligned at a chunk boundary.

sam rash
added a comment - 06/Jun/10 22:21 1. new endOffset calc includes determining if in-memory checksum is needed
2. added methods to RBW only to set/get last checksum and data length
-track this dataLength separately, as setBytesOnDisk may be called independently and make the length/byte[] not match (in theory bytes on disk could be set to more and we still want a checksum + the corresponding length kept)
3. appropriate changes around waiting for start + length
did not remove all replicaVisibleLength uses yet--want to clarify what to replace them with in pre-existing code.

sam rash
added a comment - 05/Jun/10 01:02 Thanks for the quick review.
I understand most of the comments, but have a couple of questions:
1. replicaVisibleLength was here before I made any changes. Why is it not valid? I understood it to be an upper bound on the bytes that could be read from this block. Is it the case that start + length <= replicaVisibleLength and we want to optimize?
(the for loop to wait for bytes on disk >= visible length was here before, I just moved it earlier in the constructor)
2. not sure I understand endOffset. This was again a variable that already existed. What I thought you were getting at was the condition to decide if we should use the in-memory checksum or not (which is what you describe).
3. If we don't put the sync set/get method in ReplicaInPipelineInterface, we will have to use an if/else construct on instanceof in BlockReceiver and call one or the other. I can see the argument for keeping the method out of the interface since it is RBW-specific, but on the other hand, it's effectively a no-op for other implementers of the interface and leads to cleaner code (better natural polymorphism than if-else constructs to force it).
either way, just wanted to throw that out there as a question of style

Hairong Kuang
added a comment - 03/Jun/10 20:59 Thanks, Sam, for working on the patch for trunk. Here are my comments:
BlockSender.java:
the condition replica.getBytesOnDisk() < replicaVisibleLength should be getBytesOnDisk() < startOffset + length. This guarantees that the bytes to be read have already been flushed to disk.
When the while loop exits and the bytes still have not been flushed to disk, BlockSender should throw an IOException.
It seems to me that we should remove the use of replicaVisibleLength from BlockSender;
the way to calculate endOffset should be
if (startOffset + length falls into the same chunk where chunkChecksum.getDataLength() is located) {
endOffset = chunkChecksum.getDataLength(); --- case 1
} else {
endOffset = chunk boundary where (startOffset + length) is located
}
In case 1, the last chunk's checksum does not need to be read from disk.
ReplicaInPipeline, ReplicaInPipelineInterface, and ReplicaBeingWritten
I do not think we need to make any change to ReplicaInPipeline and ReplicaInPipelineInterface
We just need to add the attribute lastChecksum and two synchronized methods to ReplicaBeingWritten. Would it be more readable if we used the method names getLastChecksumAndDataLen and setLastChecksumAndDataLen?
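Hairong's endOffset rule can be rendered concretely as below. The chunk size and method names are illustrative, and the real BlockSender handles more cases (transferTo, finalized replicas) than this sketch:

```java
public class EndOffsetSketch {

    static final int CHUNK_SIZE = 512;

    /** Index of the chunk containing the given byte offset. */
    static long chunkOf(long offset) {
        return offset / CHUNK_SIZE;
    }

    /**
     * Case 1: the request ends in the same chunk as the in-memory
     * checksum's data length, so serve up to that length and use the
     * in-memory CRC. Otherwise, round the request end up to a chunk
     * boundary so only fully checksummed chunks are read from disk.
     */
    static long endOffset(long startOffset, long length, long checksummedDataLength) {
        long requestEnd = startOffset + length;
        if (chunkOf(requestEnd) == chunkOf(checksummedDataLength)) {
            return checksummedDataLength;                              // case 1
        }
        return ((requestEnd + CHUNK_SIZE - 1) / CHUNK_SIZE) * CHUNK_SIZE;
    }

    public static void main(String[] args) {
        // Request ends in the same chunk as the checksummed length (chunk 2).
        System.out.println(endOffset(0, 1100, 1200));
        // Request ends two chunks earlier: round up to the next chunk boundary.
        System.out.println(endOffset(0, 100, 1200));
    }
}
```

In case 1 the last chunk's CRC comes from memory, which is exactly why the partial-chunk checksum never has to be re-read from the meta file while a write is in flight.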

Hairong Kuang
added a comment - 26/May/10 18:03 When I thought more about it, the replica length in my previous comment should be the replica's disk length. So after a chunk is flushed to disk, a block receiver updates its disk length (as in trunk) and should also update its last chunk crc. When a read request comes in, a block reader waits until its requested bytes are on disk and then returns all the bytes up to the replica's disk length.

Hairong Kuang
added a comment - 26/May/10 05:07 Sam, the replica length in my comment meant the visible length so no new length field is needed. The read algorithm in trunk simplifies the read algorithm in the design doc.
A block reader waits until its disk length >= the requested # of bytes. To solve the last chunk CRC problem, I am thinking of storing the last chunk's crc in ReplicaBeingWritten after the chunk is received. A block reader returns the visible length of bytes and the last chunk crc from memory. This should work because # of requested bytes <= disk length <= visible length.

sam rash
added a comment - 25/May/10 19:47 hmm, in looking at the code more, I see that this depends on what # of bytes we want to make available to readers.
-visible length (bytes acked) : needed for consistent view of data I think
-bytes on disk : this seems like we'll get inconsistent reads; and in theory, acked data may be more than on disk for a given node (if I read the doc + code right). However, how can a DN send data that's not on disk unless it's made available via memory? (very complex)

sam rash
added a comment - 25/May/10 19:38 @hairong:
I'm looking a little at implementing this in trunk (reading your append/hflush doc from hdfs-265), and I have a question. From above:
"In each ReplcaBeingWritten, we could have two more fields to keep track of the last consistent state: replica length and the last chunk's crc"
why does there need to be another length field? the getVisibleLenght() == acked bytes isn't sufficient? if the crc stored in the RBW is for that length, you only need the additional byte[] field which is the last chunk's crc I think.
ReplicaBeingWritten.setBytesAcked() could take the crc and atomically set the len + bytes

sam rash
added a comment - 05/May/10 01:59 ah, sorry, meant to respond to that: I ran the unit tests I have that exercise that path in a debugger and set breakpoints to verify the code path is exercised.
i agree it's a complicated bit of code - I cleaned it up a tiny bit. I think in reality it means breaking out into 2 subclasses/methods that do the sending of packets (sendChunks method): one for transferTo and one for the regular path. probably separate classes (non-static inner classes so as to use parent class state maybe - haven't thought the details through yet).

Todd Lipcon
added a comment - 05/May/10 00:36 That sounds fine - I was making those comments so you could incorporate them into the trunk patch when you get there.
I am still curious about whether you verified that transferTo still works (either via benchmark or instrumentation/strace)

sam rash
added a comment - 04/May/10 23:23 hey todd,
regarding the 0.20 patch, i was planning on leaving it as-is and working on the patch for trunk. that one will be up to snuff and have all the headers and hopefully be cleaner. otherwise, I'll have 3 concurrent branches to keep going (internal 0.20, 0.20 'refined', and trunk)
does that work for you? or do you need a cleaned up 0.20 version for something in particular?

Todd Lipcon
added a comment - 04/May/10 20:32 Some comments:
IOUtils.readFileChannelFully:
IOUtils.readFully spelled "premature" wrong, but we might as well not duplicate the typo
no need to increment the 'off' variable since the bytebuffer internally deals with that
can you use ByteBuffer.hasRemaining() in the loop instead of updating the toRead variable?
ChecksumUtil:
Missing apache license header
Since it only contains static methods, either make it an abstract class or give it a private constructor
Perhaps add an assert like: assert checksumOff + numChunks * checksumSize < dataOff; (to make sure there's enough space in the checksum buffer)
Generally, have you verified through strace or other means that transferto is still being used properly for normal transfers? The logic in BlockSender is getting very complicated, it's hard to verify by looking at it.
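Todd's readFileChannelFully suggestion (let the ByteBuffer track its own position, loop on hasRemaining(), no manual off bookkeeping) might look like the sketch below. The method name follows the comment; the real IOUtils signature may differ:

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ReadFullySketch {

    /** Reads until buf is full, throwing EOFException on premature end of stream. */
    static void readFileChannelFully(ReadableByteChannel ch, ByteBuffer buf)
            throws IOException {
        while (buf.hasRemaining()) {
            // read() advances buf's position itself; no offset arithmetic needed.
            if (ch.read(buf) < 0) {
                throw new EOFException("Premature EOF from channel");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes();
        ReadableByteChannel ch =
            Channels.newChannel(new java.io.ByteArrayInputStream(data));
        ByteBuffer buf = ByteBuffer.allocate(10);
        readFileChannelFully(ch, buf);
        System.out.println(new String(buf.array()));
    }
}
```

Relying on hasRemaining() avoids both the unneeded off increment and the separately maintained toRead counter mentioned in the review.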

sam rash
added a comment - 01/May/10 05:53 -includes all patches around concurrent reader crc problems for the 0.20 port
-fixed so patch should include full unit tests (previously depended on other commit not included)

sam rash
added a comment - 01/May/10 05:40 note: this patch is incomplete--there's effectively a 1-line change needed if you happen to deal with very small writes + sync (if you sync lots of writes < chunk size).
i have a 2nd patch that includes test + fix for this for 0.20--need to merge into a unified patch and will post

Todd Lipcon
added a comment - 28/Apr/10 20:07 Hey Sam. No urgent need, except that we'd like to close some of the HDFS blockers for the upcoming release that Tom is working on. If you're on it, great! Let me know if I can be of help.
Regarding the 20 version, I'd be happy to take a look, thanks.

sam rash
added a comment - 28/Apr/10 20:01

I have a 0.20 patch we've tested, and I plan to port it forward to trunk and adopt some changes similar to what Hairong laid out (hopefully in the next couple of weeks). Is there a more urgent need for it in trunk? Do you want me to post the 0.20-fb version?

sam rash
added a comment - 23/Apr/10 01:03

Sorry, as noted, that's an awful name: "metadata length, which is the amount of data for which we have metadata" -- i.e., the length of data for which we have valid CRC data. The data length always grows first, so the solution here was to track the metadata length as the 'visible block length'. This doesn't cleanly solve the unstable last-checksum problem, which is why we detect that change and recompute.

Hairong Kuang
added a comment - 22/Apr/10 23:56

In trunk I think there is a more elegant way of solving the problem.

In each ReplicaBeingWritten, we could have two more fields to keep track of the last consistent state: the replica length and the last chunk's CRC. The last consistent state is defined as the replica length right after the most recent packet has been flushed to disk. So when a read request comes in, if the last byte requested falls into a being-written last chunk, the datanode returns the last chunk up to the last consistent state. The CRC is read from memory instead of from the disk.

This will also make the replica recovery problem filed in HDFS-1103 easier to solve.
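As a rough illustration of the two-field idea above, the sketch below keeps the last consistent length and last chunk's CRC in the replica object and updates them atomically. The class shape, field names, and method names are hypothetical, not the actual HDFS code:

```java
// Hypothetical sketch: the writer publishes a consistent (length, CRC)
// pair only after flushing both data and checksum to disk, and readers
// snapshot that pair instead of trusting raw on-disk state.
public class ReplicaBeingWritten {
    private long lastConsistentLength;  // length as of the last flushed packet
    private byte[] lastChunkChecksum;   // in-memory CRC of the partial last chunk

    // Writer calls this only AFTER flushing data and metadata to disk.
    public synchronized void setLastConsistentState(long length, byte[] crc) {
        this.lastConsistentLength = length;
        this.lastChunkChecksum = (crc == null) ? null : crc.clone();
    }

    // Readers use this in place of the raw bytes-on-disk value.
    public synchronized long getVisibleLength() {
        return lastConsistentLength;
    }

    public synchronized byte[] getLastChunkChecksum() {
        return (lastChunkChecksum == null) ? null : lastChunkChecksum.clone();
    }
}
```

Because both fields are read and written under the same monitor, a reader can never observe a length that outruns its matching checksum.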

sam rash
added a comment - 15/Apr/10 05:04

Update: problem refinement/issues.

First, we can't recompute all partial-chunk CRCs without a regression: blocks on disk that are not being written to may have errors that we would miss. This means we have to be more careful and effectively only recompute a checksum that we can guarantee is mismatched, or is very likely to be wrong.

After trying out various ways of detecting whether blocks changed and recomputing the CRC for partial chunks, I generalized the problem to this: there is the data length, which is the size of the data on disk, and the metadata length, which is the amount of data for which we have metadata (a poor name--need a better one). The problem with concurrent reads/writes is that these get out of sync.

Problems that could occur or have actually occurred:
1. CRC error 1: BlockSender is created when data length > metadata length. This produces a CRC error in that the CRC in the last chunk covers less data.
2. CRC error 2: when reading from disk, more metadata is available than blockLength (the data size is in fact greater than when BlockSender was created).
3. EOF error: similar to 1, but in theory an EOF could be encountered when reading checksumIn (haven't seen this yet).

Solution:
1. we need to guarantee that when we create a BlockSender (or anything that will direct reading of data), blockLength <= metadata length <= data length (blockLength being what we consider available to read)
2. we detect when the block changes after we get the blockLength (create BlockSender), in order to know whether we should recompute partial-chunk CRCs

In this way, if we guarantee #1 and implement #2, we can know when the CRC will be invalid and recompute them without any regression.

For #1, we already have the ongoing-creates concept, which keeps some in-memory information about blocks being written to (new or append). We add a 'visible length' attribute, which we also expose in FSDatasetInterface.getVisibleLength(Block b). This is an in-memory length of the block that we update only after flushing both data and metadata to disk. For blocks not being appended to, this function delegates to the original FSDatasetInterface.getLength(Block b), which is unchanged. In this way, if BlockSender uses this for the blockLength, #1 above holds.

For #2, we 'memoize' some info about the block when we create it. In particular, we get the actual length of the block on disk and store it. On each packet send, we compare the expected length of our input stream to what is actually there. If there is more data, the block has changed and the CRC data can't be trusted for the last partial chunk.

Test cases: inspired by Todd's (I could not apply the patch to 0.20), I create two threads and have one write very small data sizes and call sync. Another thread opens the file, reads to the end, remembers that position, and starts off there again and resumes. This reliably produces the errors within a few seconds, and runs 10-15s when there is no error (tunable). In order to detect data corruption (which I saw in some particular bugs), I made the data written such that you can tell byte X is X % 127. This helps catch issues that CRC recomputation might hide.
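The memoization idea in #2 can be abstracted into a tiny sketch: snapshot the on-disk length when the sender is created, and before trusting the stored CRC of a partial last chunk, check whether the length has grown. The class and method names are illustrative only:

```java
// Illustrative sketch of change detection (#2): if the block grew after we
// snapshotted its length, the stored CRC of the last partial chunk may
// cover less data than is now on disk and cannot be trusted.
public class BlockChangeDetector {
    private final long lengthAtCreation; // memoized when the "sender" is built

    public BlockChangeDetector(long currentOnDiskLength) {
        this.lengthAtCreation = currentOnDiskLength;
    }

    // True only when the block has not grown since the snapshot, meaning the
    // last chunk's on-disk checksum still matches the data being sent.
    public boolean lastChunkCrcIsTrustworthy(long currentOnDiskLength) {
        return currentOnDiskLength == lengthAtCreation;
    }
}
```

In the real patch the comparison would be against the input stream's actual length at each packet send, as the comment above describes.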

sam rash
added a comment - 08/Apr/10 23:30

Re: recovery code -- I will check with Dhruba about how 0.20 handles the recovery (I'm not familiar with that part of the code yet).

Re: tests -- great, thanks. I was able to construct a case that deterministically fails:
1. writer opens a file, writes and syncs some number of bytes that includes a partial chunk
2. reader opens that stream, reads some bytes (to make the datanode open the metadata and block data streams)
3. writer writes additional bytes that fill out the partial chunk
4. reader continues reading to the end of the file as of when it opened
5. reader throws a CRC error

I will see if I can construct deterministic test cases for the other ones, or use those directly as well.

Thanks again.
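The five steps above can be replayed in miniature without a cluster, using java.util.zip.CRC32 as a stand-in for HDFS's per-chunk checksums: a reader that pairs the grown chunk with the stale checksum sees a mismatch, mirroring the ChecksumException.

```java
import java.util.zip.CRC32;

// Self-contained simulation of the deterministic failure: the checksum on
// "disk" still covers only the synced 100-byte prefix after the writer has
// filled the chunk out to 512 bytes.
public class StaleChecksumDemo {
    static long crcOf(byte[] data, int len) {
        CRC32 c = new CRC32();
        c.update(data, 0, len);
        return c.getValue();
    }

    static boolean readerSeesMismatch() {
        byte[] chunk = new byte[512];
        for (int i = 0; i < 100; i++) chunk[i] = (byte) (i % 127);   // step 1: sync partial chunk
        long staleCrc = crcOf(chunk, 100);                           // CRC as flushed to "disk"
        for (int i = 100; i < 512; i++) chunk[i] = (byte) (i % 127); // step 3: fill out the chunk
        return crcOf(chunk, 512) != staleCrc;                        // steps 4-5: reader verifies
    }
}
```

The byte pattern `i % 127` matches the corruption-detection trick described in the comment further up the thread.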

Hairong Kuang
added a comment - 08/Apr/10 22:32

> need to add to the recovery code to recompute the last chunk checksum during rbw recovery, right?

In trunk, this is not an issue. When the DN starts up, an rbw replica is changed to an rwr (ReplicaWaitingToBeRecovered), during which it checks the last chunk and finds the number of bytes in the block that match its CRC. If the CRC does not match, the last chunk gets thrown away. This does not violate hflush (sync) semantics, because in this case some error occurred and other replicas may still have a good copy.

But I am not sure about 0.20. I think it does something similar. Anyway, I think we can ignore startup for this issue.

> is there an existing unit test for this case yet?

Todd posted some good concurrent reader tests at HDFS-1060. Please check whether you could use them.

sam rash
added a comment - 08/Apr/10 19:12

Thanks for the additional info.

FWIW, I have an impl of #2 that does it in a slightly more efficient way: BlockSender.sendChunks() examines 'len' and, if len % bytesPerChecksum != 0, truncates len to a chunk boundary. It can do this since it returns len. The result is that what was a single packet to the receiver is now two, but the first one can be done with transferToFully() using the existing checksums, and the lone partial chunk gets its own packet (in this case with the extra buffer copy needed to recompute the checksum).

For this attempt, I punted on figuring out whether the block is in progress or not--I'm OK with the slight inefficiency if it avoids race conditions. I believe we might be able to address this with some synchronization around state changes to a block being an ongoing create or not.

Can you comment on where the recovery code is that I also need to tweak? (I'm very new to working on HDFS.)

I still need to do some more testing on it and clean it up; also, is there an existing unit test for this case yet? (There certainly isn't in 0.20.) I am also going to try to construct one for that.
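The truncation step described above is simple to show in isolation. The sketch below is a hypothetical helper, not the real sendChunks() code: a caller would send bytes [0, truncated) via transferTo with the stored checksums, and handle [truncated, len) as a separate packet.

```java
// Sketch of the packet-splitting idea: cut a requested length back to a
// chunk boundary so full chunks keep the efficient transferTo path and
// the trailing partial chunk travels alone with a recomputed checksum.
public class ChunkBoundary {
    static final long BYTES_PER_CHECKSUM = 512; // HDFS default chunk size

    // Largest multiple of the chunk size that is <= len.
    static long truncateToChunkBoundary(long len) {
        return len - (len % BYTES_PER_CHECKSUM);
    }
}
```

For example, a 1300-byte request splits into 1024 bytes of full chunks plus a 276-byte partial chunk.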

Todd Lipcon
added a comment - 08/Apr/10 18:55

Solution 2 should work, but we still need to add to the recovery code to recompute the last chunk checksum during rbw recovery, right? (i.e., a DN could crash in between flushing data and metadata)

Hairong Kuang
added a comment - 08/Apr/10 18:51

> Would be interested to hear from the Yahoo folks on this one.

Thanks to Sam for his initiative on this problem; I had a brief off-line discussion with him. I personally like solution 2: recompute the checksum for the partial last chunk. This was also what came to mind when Todd filed the jira.

Solution 1 violates semantics. Solution 3 needs to be extended with Todd's suggestion; however, that means readers may significantly slow down the writer if there is a large number of read requests.

Todd Lipcon
added a comment - 08/Apr/10 18:29

BTW, this issue is actually really dangerous: it doesn't just cause checksum exceptions for the reader -- the reader then goes and reports the block as corrupt to the NN, and the entire block becomes unreadable!

Would be interested to hear from the Yahoo folks on this one.

Todd Lipcon
added a comment - 08/Apr/10 18:26

Hey Sam,

Thanks for posting some thoughts here. I've been thinking about this one over the past couple of weeks and don't have any particularly great solutions either :-/

> 1. truncate in progress blocks to chunk boundaries. This solved this problem, but fails as technically sync'd data is not available (partial chunk at the end)

I agree this isn't very acceptable -- a common use case is following an edit log, and we don't want to lose the last 512 bytes of edits from the reader.

> 2. when reader's request results in an artificial partial chunk (in BlockSender), recompute the checksum for the partial chunk in the packet

So then we're throwing away disk checksumming on the last chunk of a file? I thought of this one too, and I think it's OK, though it's not great. I think that so long as we only do this when the file is known to be under construction, it's acceptable. We'll have to be careful, though, about races between checking whether we're reading the last checksum chunk and appending more to the file.

> 3. have datanode 'refresh' the length when it actually starts reading

Not sure I follow -- as you noted, it still seems to have bug potential.

The other idea I had, which isn't fully fleshed out, is this:
- The DN already knows which blocks are "in progress".
- For each "in progress" block, we add a lock updatingLastChecksumChunkLock.
- When the writer detects that it's appending mid-checksum-chunk to the last chunk, it takes this lock before doing the rewind and rewrite of the meta.
- The writer releases the lock after writing both the checksum and the data.
- When the reader is reading a range that includes the last checksum chunk of an under-construction file, it needs to acquire this lock first, read both the checksum and the data, then release the lock.

A couple of potential issues:
- We'll have to be careful about races between the under-construction check and the read -- perhaps some more coordination with FSDataset is necessary here (we already have to coordinate this a little bit to figure out whether to read out of the rbw directory or the finalized directory, right?).
- We still have a potential issue with failure recovery -- I think we need to add some code to the DN recovery that cleans up the last partial chunk in RBW replicas.

One other, more complicated idea: next to each RBW file, in addition to the meta/CRC32 file, we could add a blk_N.rbwMetaData file -- we can keep some extra state in this file about the last partial checksum chunk, and use it for getting consistent reads.
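The lock proposal above can be sketched in plain Java: writer and reader serialize on the same lock around the (data, checksum) pair, so a reader never observes one without the matching other. This is a toy model with hypothetical names, not the DN implementation:

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.zip.CRC32;

// Minimal model of updatingLastChecksumChunkLock: the writer rewrites the
// last chunk's data and CRC together under the lock; a reader whose range
// covers the last chunk takes the same lock to snapshot a consistent pair.
public class LastChunkGuard {
    private final ReentrantLock updatingLastChecksumChunkLock = new ReentrantLock();
    private byte[] lastChunk = new byte[0];
    private long lastChunkCrc = 0L; // CRC32 of empty input is 0

    private static long crc(byte[] b) {
        CRC32 c = new CRC32();
        c.update(b);
        return c.getValue();
    }

    // Writer path: data and checksum are replaced atomically w.r.t. readers.
    public void writerRewriteLastChunk(byte[] newChunk) {
        updatingLastChecksumChunkLock.lock();
        try {
            lastChunk = newChunk.clone();
            lastChunkCrc = crc(lastChunk);
        } finally {
            updatingLastChecksumChunkLock.unlock();
        }
    }

    // Reader path: under the lock, the stored CRC always matches the data.
    public boolean readerVerifies() {
        updatingLastChecksumChunkLock.lock();
        try {
            return crc(lastChunk) == lastChunkCrc;
        } finally {
            updatingLastChecksumChunkLock.unlock();
        }
    }
}
```

The open questions from the comment (the under-construction check racing with the read, and crash recovery) sit outside what this lock alone can cover.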

sam rash
added a comment - 08/Apr/10 05:00

Hi,

I'm working on getting this functionality into an internal 0.20-based rev as well. Here are some solutions I've thought of (I've tried 1 & 2), in case it helps discussion toward a good solution:

1. truncate in-progress blocks to chunk boundaries. This solved this problem, but fails because technically sync'd data is not available (the partial chunk at the end)
2. when a reader's request results in an artificial partial chunk (in BlockSender), recompute the checksum for the partial chunk in the packet
- functionally solves the problem
- introduces inefficiency, since the socket-to-socket copy can't be done
- could possibly improve efficiency by having the client break packet requests for anything that has multiple chunks with a partial chunk at the end into two: the partial chunk goes as a lone request
- results in one more packet going back and forth, but the preceding full chunks can do more efficient socket-to-socket copies
3. have the datanode 'refresh' the length when it actually starts reading
- will send more data than the client requests
- checksum will match data
- client does truncation of data after the CRC check
- still has a race condition, with the checksum for the last chunk being out of sync with the data from a given reader's perspective
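The core of solution 2 is recomputing the checksum over exactly the bytes placed in the packet, rather than trusting the possibly stale CRC on disk. HDFS checksums data per bytesPerChecksum-sized chunk with CRC32; the helper below is an illustrative sketch, not the BlockSender code:

```java
import java.util.zip.CRC32;

// Sketch of solution 2: when shipping an "artificial" partial chunk, the
// sender recomputes that chunk's CRC from the data it is actually sending.
// This costs an extra buffer copy (the bytes must pass through user space
// instead of going socket-to-socket) but guarantees the packet's checksum
// matches its data.
public class PartialChunkChecksum {
    static long recompute(byte[] buf, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(buf, off, len);
        return crc.getValue();
    }
}
```

This is the inefficiency point 2 above concedes: full chunks can still use transferTo, but the recomputed partial chunk cannot.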

Concurrent readers hit ChecksumExceptions if following a writer to very end of file

Description

In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable.
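The ordering the description calls out can be shown with simplified stand-in classes (not the real BlockReceiver): flush first, then publish the length, so the advertised length never exceeds what a reader can actually see. Note that, as the comment below explains, reordering alone does not fully close the race, because the last checksum chunk is still unstable.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Toy model of the receivePacket ordering: the buggy version would set
// bytesOnDisk BEFORE flush(), advertising bytes still in the buffer.
public class ReceiveOrdering {
    private long bytesOnDisk;
    private final ByteArrayOutputStream disk = new ByteArrayOutputStream(); // stands in for the block file
    private final OutputStream out = new BufferedOutputStream(disk, 4096);

    // Flush-first variant: data reaches "disk" before the length is published.
    public void receivePacket(byte[] data) throws IOException {
        out.write(data);
        out.flush();               // make the bytes durable/visible first...
        bytesOnDisk = disk.size(); // ...then advertise the new length
    }

    public long getBytesOnDisk() { return bytesOnDisk; }
    public int actuallyOnDisk()  { return disk.size(); }
}
```

With this ordering, getBytesOnDisk() can lag reality but never lead it, which is the safe direction for concurrent readers.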

Todd Lipcon
added a comment - 22/Mar/10 02:30

This problem is worse than originally reported. Switching the order of flush and setBytesOnDisk doesn't solve the problem, because the last checksum in the meta file is still changing. So, since we don't access the data synchronously with the checksum, a client trying to read the last several bytes of a file under construction will get checksum errors.

Solving this is likely to be very tricky...