Description

I have seen one situation with an HBase cluster.

The scenario is as follows:

1) 1.5 blocks had been written and synced.

2) Suddenly the cluster was restarted.

A reader opened the file and tried to get its length. By this time, the DNs containing the partial block had not yet reported to the NN, so the number of locations for this partial block was 0. In this case, DFSInputStream assumes that 1 block size is the final size.

The reader likewise assumes that 1 block size is the final length and sets its end marker there. As a result, the reader ends up reading only partial data, and because of this the HMaster could not replay the complete edits.

This actually happened on the 20 version. Looking at the code, the same issue should be present in trunk as well.

int replicaNotFoundCount = locatedblock.getLocations().length;
for (DatanodeInfo datanode : locatedblock.getLocations()) {
  ..........
  ..........
}
// Namenode told us about these locations, but none know about the replica
// means that we hit the race between pipeline creation start and end.
// we require all 3 because some other exception could have happened
// on a DN that has it. we want to report that error
if (replicaNotFoundCount == 0) {
  return 0;
}
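To make the failure mode concrete, here is a minimal, self-contained sketch of the logic described above (the class and method names are hypothetical, not the actual DFSInputStream code): a reader that treats "no reported replicas for the last block" as length 0 silently truncates the visible file.

```java
// Hypothetical sketch of the length-determination bug described above.
// If the last (partial) block has no reported locations, the reader
// falls back to length 0 for it, hiding the data that was synced.
public class PartialBlockLengthSketch {

    /**
     * Computes the length the reader would see: the sum of finalized
     * block lengths, plus the last partial block's length only if at
     * least one DN has reported a replica of it.
     */
    static long visibleLength(long[] finalizedBlockLengths,
                              int reportedReplicasOfLastBlock,
                              long lastBlockSyncedBytes) {
        long length = 0;
        for (long l : finalizedBlockLengths) {
            length += l;
        }
        // Bug: when no DN has reported a replica of the partial block,
        // its contribution is assumed to be 0 even though data was synced.
        long lastBlockLength =
            (reportedReplicasOfLastBlock == 0) ? 0 : lastBlockSyncedBytes;
        return length + lastBlockLength;
    }

    public static void main(String[] args) {
        // 1.5 blocks written: one full 64MB block plus 32MB synced.
        long full = 64L << 20;
        long half = 32L << 20;
        // After restart, DNs holding the partial block have not reported.
        long seen = visibleLength(new long[] {full}, 0, half);
        System.out.println(seen == full);        // the 32MB are invisible
        System.out.println(seen == full + half); // reader truncates
    }
}
```

With 0 reported replicas the reader sees only the finalized 64MB; once a replica is reported, the full 96MB becomes visible.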

Activity

Todd Lipcon added a comment - 06/Apr/12 20:59

Nice catch, Uma. I think we can use the length field of the block in the NN metadata to solve this, right? The first hflush()/sync() call from the client will cause persistBlocks() to be called, which should write down the block with a non-zero length. Then on restart, we can use this length instead of 0 when the replicas aren't found.

Uma Maheswara Rao G added a comment - 07/Apr/12 03:09

Yes, I think that should solve the problem.
As we are persisting the number of bytes for the block, we can use the block size of that partial block. Let me write some tests to replicate this.

public static void writeCompactBlockArray(
    Block[] blocks, DataOutputStream out) throws IOException {
  WritableUtils.writeVInt(out, blocks.length);
  Block prev = null;
  for (Block b : blocks) {
    long szDelta = b.getNumBytes() -
        (prev != null ? prev.getNumBytes() : 0);
    long gsDelta = b.getGenerationStamp() -
        (prev != null ? prev.getGenerationStamp() : 0);
    out.writeLong(b.getBlockId()); // blockid is random
    WritableUtils.writeVLong(out, szDelta);
    WritableUtils.writeVLong(out, gsDelta);
    prev = b;
  }
}

There are 2 cases here:
1) We found the locations, but could not connect to any of them. Then replicaNotFoundCount is decremented on each trial, and if replicaNotFoundCount == 0, it returns 0.
2) There are no locations for that block. Then replicaNotFoundCount is obviously 0 and the length is returned as 0.

I think for both cases we can go for this persisted block size.
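For reference, the inverse of the delta encoding above can be sketched as a round trip (a simplified, self-contained illustration: fixed-width longs stand in for Hadoop's WritableUtils varints, and the Block record here is a hypothetical stand-in for the real class):

```java
import java.io.*;
import java.util.*;

// Simplified round-trip of the delta scheme used by writeCompactBlockArray:
// each block's numBytes and generationStamp are stored as deltas from the
// previous block, while the random block id is stored as-is.
public class CompactBlockArraySketch {
    record Block(long id, long numBytes, long genStamp) {}

    static byte[] write(List<Block> blocks) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(blocks.size());
        Block prev = null;
        for (Block b : blocks) {
            out.writeLong(b.id()); // block id is random, no delta
            out.writeLong(b.numBytes() - (prev != null ? prev.numBytes() : 0));
            out.writeLong(b.genStamp() - (prev != null ? prev.genStamp() : 0));
            prev = b;
        }
        return bos.toByteArray();
    }

    static List<Block> read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int n = in.readInt();
        List<Block> blocks = new ArrayList<>();
        long prevSz = 0, prevGs = 0;
        for (int i = 0; i < n; i++) {
            long id = in.readLong();
            prevSz += in.readLong(); // undo the size delta
            prevGs += in.readLong(); // undo the genstamp delta
            blocks.add(new Block(id, prevSz, prevGs));
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        List<Block> blocks = List.of(
            new Block(101, 64L << 20, 1001),
            new Block(102, 32L << 20, 1002)); // the partial half block
        System.out.println(read(write(blocks)).equals(blocks)); // true
    }
}
```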

Uma Maheswara Rao G added a comment - 09/Apr/12 13:28

@Todd,
I think our proposal won't work here, because by the time of the hsync, the DN will not have reported to the NN anyway. So, the NN cannot know the length and will persist that length as 0 only. So, even if we return the block size when we find 0 locations, it will not help solve the problem. Am I missing something?

I updated the patch to simulate the issue. Even if we fix it in our proposed way, the test fails.
So, we may have to think of an alternative approach for this issue.

Todd Lipcon added a comment - 09/Apr/12 17:43

I think our proposal won't work here, because by the time of hsync, DN will not report to NN anyway.

On the first hflush() for a block, it calls NN.fsync(), which internally calls persistBlocks(). Currently, the fsync call doesn't give a length, but perhaps it could?

The other thought is that, after a restart, a block that was previously being written would be in the under construction state, but with no expectedTargets. This differs from the case where a block has been allocated but not yet written to replicas. We could use that to set a new flag in the LocatedBlock response indicating that it's not a 0-length, but instead that it's corrupt.

Uma Maheswara Rao G added a comment - 09/Apr/12 18:34

On the first hflush() for a block, it calls NN.fsync(), which internally calls persistBlocks(). Currently, the fsync call doesn't give a length, but perhaps it could?

My point is, even though the client flushed the data, the DNs will not report to the NN, right? Did you check the test above?
I have changed the code as per our proposal and debugged it as well. It was persisting the length as 0.

The other thought is that, after a restart, a block that was previously being written would be in the under construction state, but with no expectedTargets. This differs from the case where a block has been allocated but not yet written to replicas. We could use that to set a new flag in the LocatedBlock response indicating that it's not a 0-length,

I was thinking along the same lines. I got your point. I feel this would be possible to do. But my actual question is, after we distinguish the cases, what can we do from the client?
You mean we will retry until we get the locations? If yes, there would be another problem, because when
1) the client wants to read some partial data which exists in the first block itself,
2) open may try to get the complete length, and that will block if we retry until the DNs report to the NN.
3) But those DNs may really be down for a long time.
In this case, we cannot read even up to the specified length, which is less than the start offset of the partial block.
Your suggestion might be something else here. What is your thought?

Todd Lipcon added a comment - 09/Apr/12 18:51

My point is, even though the client flushed the data, the DNs will not report to the NN, right? Did you check the test above?

Right, but the client reports to the NN. So, the client could report the number of bytes hflushed, and the NN could fill in the last block with that information when it persists it.

You mean we will retry until we get the locations?

Yea – treat it the same as we treat a corrupt file.

1) the client wants to read some partial data which exists in the first block itself,
2) open may try to get the complete length, and that will block if we retry until the DNs report to the NN.
3) But those DNs may really be down for a long time.
In this case, we cannot read even up to the specified length, which is less than the start offset of the partial block.

That's true. Is it possible for us to change the client code to defer this code path until either (a) the client wants to read from the partial block, or (b) the client explicitly asks for the file length?

Alternatively, maybe this is so rare that it doesn't matter, and it's OK to disallow reading from an unrecovered file whose last block is missing all of its block locations after a restart.

Uma Maheswara Rao G added a comment - 09/Apr/12 19:15

Thanks a lot, Todd, for your time.

Yea – treat it the same as we treat a corrupt file.

It may be difficult for clients to distinguish whether this is real corruption or whether it will be recovered after the DNs report to the NN.

Alternatively, maybe this is so rare that it doesn't matter, and it's OK to disallow reading from an unrecovered file whose last block is missing all of its block locations after a restart.

Let's take others' opinions on this as well. Others, please comment on this point.

That's true. Is it possible for us to change the client code to defer this code path until either (a) the client wants to read from the partial block, or (b) the client explicitly asks for the file length?

You mean the reader will pass the option, (a) or (b)?

Todd Lipcon added a comment - 09/Apr/12 19:26

It may be difficult for clients to distinguish whether this is real corruption or whether it will be recovered after the DNs report to the NN.

What's the difference? If none of the DNs holding a block have reported a replica, it's missing/corrupt. The same is true of finalized blocks - if three DNs crash, and we have no replicas anymore, it still might come back if an admin fixes one of the DNs.

You mean the reader will pass the option, (a) or (b)?

Sorry, I wasn't clear. Right now, the behavior is that, when we call open() on a file which is under construction, we always go to the DNs holding the last block to find the length. My proposal is the following:

- on open(), do not determine the visible length of the file. Set the member variable to something like "-1" to indicate it's still unknown

- in the code that opens a block reader, change it to check if it's about to read from the last block. If it is, try to determine the visible length.

- in the explicit getVisibleLength() call, if it's not determined yet, try to determine the visible length

With the above changes, we can allow a client who only wants to access the first blocks of a file to do so without having to contact the DNs holding the last block. But as soon as the client wants to access the under-construction block, or explicitly wants to know the visible length, then we go to the DNs.
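The deferred-length proposal above can be sketched roughly as follows (a hedged illustration only: the class, the probes counter, and the simulated DN call are hypothetical, not the actual DFSInputStream API; -1 is the sentinel for an undetermined length):

```java
// Hypothetical sketch of deferring visible-length determination:
// open() leaves the length unknown; it is resolved only when the reader
// touches the last (under-construction) block or asks for the length.
public class LazyLengthReaderSketch {
    private static final long UNKNOWN = -1;

    private final long finalizedLength;   // sum of finalized block lengths
    private long visibleLength = UNKNOWN; // resolved lazily
    int probes = 0;                       // how many times we contacted DNs

    LazyLengthReaderSketch(long finalizedLength) {
        this.finalizedLength = finalizedLength;
    }

    /** Simulates asking the DNs of the last block for its visible length. */
    private long fetchLastBlockLengthFromDataNodes() {
        probes++;
        return 32L << 20; // pretend the DNs report a 32MB partial block
    }

    long getVisibleLength() {
        if (visibleLength == UNKNOWN) {
            visibleLength = finalizedLength + fetchLastBlockLengthFromDataNodes();
        }
        return visibleLength;
    }

    /** Reads within the finalized region never need to contact the DNs. */
    boolean needsLastBlock(long readOffset) {
        return readOffset >= finalizedLength;
    }

    public static void main(String[] args) {
        LazyLengthReaderSketch r = new LazyLengthReaderSketch(64L << 20);
        System.out.println(r.needsLastBlock(10)); // false: first block only
        System.out.println(r.probes);             // 0: no DN contact yet
        System.out.println(r.getVisibleLength() == (96L << 20)); // true
        System.out.println(r.probes);             // 1: resolved on demand
    }
}
```

A reader that stays within the finalized blocks never triggers the DN probe, which is exactly the property the proposal relies on.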

Uma Maheswara Rao G added a comment - 09/Apr/12 19:50

What's the difference? If none of the DNs holding a block have reported a replica, it's missing/corrupt. The same is true of finalized blocks - if three DNs crash, and we have no replicas anymore, it still might come back if an admin fixes one of the DNs.

You convinced me. I like your other proposal, but when the reader makes an explicit getVisibleLength() call and the DNs still have not reported, the situation would be the same, right? So, I am OK with retrying 3 times and then throwing an exception to the user as corrupt in this situation. Let the user take action.
What is your opinion?

Uma Maheswara Rao G added a comment - 10/Apr/12 06:24

Before moving on:

The other thought is that, after a restart, a block that was previously being written would be in the under construction state, but with no expectedTargets. This differs from the case where a block has been allocated but not yet written to replicas. We could use that to set a new flag in the LocatedBlock response indicating that it's not a 0-length, but instead that it's corrupt.

In the restart case, if the block is there but there are no locations, there must have been a sync call earlier, so some data must have been written to it. If there was no sync, the NN would not have noted that partial block at all.

In your comment above, what is the other case you are pointing to?

So, we need not set any flag from the namenode, right? I can decide this directly from the client itself.

Tsz Wo Nicholas Sze added a comment - 25/Apr/12 21:06

Hi Uma, thanks for the work on this. openInfo() updates locatedBlocks and other state. How about we move the retry logic outside openInfo() and retry the entire openInfo() call? Otherwise, we have to check oldIter and newIter in the retry.

Uma Maheswara Rao G added a comment - 26/Apr/12 19:04

Thanks a lot, Nicholas, for taking a look!
Yep, you are right, I have to compare with the older iterator values. Now I am doing the retries outside of the previous openInfo API.
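The retry-outside-openInfo approach might look roughly like this (a hedged sketch only: the fetcher interface, the retry count, and the exception message are illustrative, not the actual HDFS patch):

```java
import java.io.IOException;

// Hypothetical sketch of retrying the whole location fetch a fixed number
// of times before declaring the last block's replicas missing/corrupt,
// rather than silently treating the partial block as 0-length.
public class OpenInfoRetrySketch {
    static final int MAX_RETRIES = 3;

    interface LocatedBlocksFetcher {
        /** Returns the number of locations known for the last block. */
        int fetchLastBlockLocationCount();
    }

    static int openInfoWithRetries(LocatedBlocksFetcher fetcher)
            throws IOException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            int locations = fetcher.fetchLastBlockLocationCount();
            if (locations > 0) {
                return locations; // DNs have reported; proceed normally
            }
        }
        // Surface the problem to the caller instead of returning length 0.
        throw new IOException("Could not obtain last block locations after "
            + MAX_RETRIES + " retries; treating the block as corrupt");
    }

    public static void main(String[] args) {
        int[] reports = {0, 0, 2}; // DNs report only on the third attempt
        int[] i = {0};
        try {
            System.out.println(openInfoWithRetries(() -> reports[i[0]++])); // 2
        } catch (IOException e) {
            System.out.println("corrupt");
        }
    }
}
```

Retrying the entire fetch, as Nicholas suggested, avoids having to reconcile the old and new located-block iterators inside the retry loop.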

Uma Maheswara Rao G added a comment - 26/Apr/12 21:03

It seems Jenkins did not paste the results here:
https://builds.apache.org/job/PreCommit-HDFS-Build/2336/
But it actually ran the tests. There are no failures due to this patch.
Thanks a lot, Nicholas, for the reviews!