ArrayIndexOutOfBoundsException during fsck

Details

Description

After observing a lot of corrupted blocks, I suddenly started to get a lot of ArrayIndexOutOfBoundsException.

It appears to be an issue very similar to HADOOP-3649, which is supposed to be fixed in 0.18.1.

2008-10-06 08:48:43,241 WARN /: /fsck?path=%2F:
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.hadoop.dfs.FSNamesystem.getBlockLocationsInternal(FSNamesystem.java:789)
at org.apache.hadoop.dfs.FSNamesystem.getBlockLocations(FSNamesystem.java:727)
at org.apache.hadoop.dfs.NamenodeFsck.check(NamenodeFsck.java:167)
at org.apache.hadoop.dfs.NamenodeFsck.check(NamenodeFsck.java:162)
at org.apache.hadoop.dfs.NamenodeFsck.check(NamenodeFsck.java:162)
at org.apache.hadoop.dfs.NamenodeFsck.check(NamenodeFsck.java:162)
at org.apache.hadoop.dfs.NamenodeFsck.check(NamenodeFsck.java:162)
at org.apache.hadoop.dfs.NamenodeFsck.fsck(NamenodeFsck.java:128)
at org.apache.hadoop.dfs.FsckServlet.doGet(FsckServlet.java:48)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Hi Konstantin, the patch that I uploaded at 4:20pm on 10/10 addressed all your review comments including last two that you commented after 5:00pm. That's also the one that I committed this morning. But looks like that I accidentally deleted it when I was managing the attachments. I just uploaded the one that I committed. Please double check.

Hairong Kuang
added a comment - 13/Oct/08 22:54 I just committed this.
Hi Konstantin, the patch that I uploaded at 4:20pm on 10/10 addressed all your review comments including last two that you commented after 5:00pm. That's also the one that I committed this morning. But looks like that I accidentally deleted it when I was managing the attachments. I just uploaded the one that I committed. Please double check.

Konstantin Shvachko
added a comment - 11/Oct/08 01:42
Hairong you forgot to include a warning in FSNamesystem.adStoredBlock().
And corruptReplicas.numCorruptReplicas(blk) can be used both in FSNamesystem.adStoredBlock() and FSNamesystem.getBlockLocations(). It is better to use the method rather than replicating common code.
+1 other than that.

Konstantin Shvachko
added a comment - 11/Oct/08 00:03 Some comments:
FSNamesystem.markBlockAsCorrupt()
inode cannot be null after you verified that the block is in the blocksMap. So I propose to replace condition inode != null by an assert.
variable NumberReplicas num is redundant you can directly calculate int numLiveReplicas instead.
Totally agree with the changes in removeStoredBlock() .
FSNamesystem.adStoredBlock()
You should use corruptReplicas.numCorruptReplicas(blk)
Since this is sort of a self-corrective code I would also include a warning message here the same as in getBlockLocation().

This patch fixes a bug in markBlockAsCorrupt:
"updateNeededReplications(blk, 0, 1);" should be "updateNeededReplications(blk, -1, 0);"
because replication factor does not change but the current number of replicas is decremented.

Hairong Kuang
added a comment - 10/Oct/08 21:59 This patch fixes a bug in markBlockAsCorrupt:
"updateNeededReplications(blk, 0, 1);" should be "updateNeededReplications(blk, -1, 0);"
because replication factor does not change but the current number of replicas is decremented.

This new patch adds two more changes:
1. Prevent ArrayIndexOutOfBoundException being thrown even when inconsistency happens. Instead it prints a warning in the log.
2. Allow a file's blocks to exit the inconsistency state by increasing the file's replication factor.

Hairong Kuang
added a comment - 10/Oct/08 18:39 This new patch adds two more changes:
1. Prevent ArrayIndexOutOfBoundException being thrown even when inconsistency happens. Instead it prints a warning in the log.
2. Allow a file's blocks to exit the inconsistency state by increasing the file's replication factor.

Hairong also mentioned that she conducted a test where the fault was observed without the patch, and the fault was not observed with the patch. Incidentally, the fault is only observed when the intended replication factor for the block is greater than one. Otherwise the block is reported as missing rather than having a bad replicas.

Robert Chansler
added a comment - 10/Oct/08 01:39 Hairong also mentioned that she conducted a test where the fault was observed without the patch, and the fault was not observed with the patch. Incidentally, the fault is only observed when the intended replication factor for the block is greater than one. Otherwise the block is reported as missing rather than having a bad replicas.

Thanks Brian! The information you reported is very helpful. I checked the code, I saw two scenarios that might cause the inconsistency: a replica is in corruptedBlockMap but not in blockMap.

1. In FSNamesystem.markBlockAsCorrupted, a replica is added to corruptedBlocksMap even if it is not in the blocksMap;
2. In FSNamesystem.removeStoredBlock, a replica may be removed from the blocksMap but not from the corruptedBlocksMap.

Hairong Kuang
added a comment - 08/Oct/08 04:52 Thanks Brian! The information you reported is very helpful. I checked the code, I saw two scenarios that might cause the inconsistency: a replica is in corruptedBlockMap but not in blockMap.
1. In FSNamesystem.markBlockAsCorrupted, a replica is added to corruptedBlocksMap even if it is not in the blocksMap;
2. In FSNamesystem.removeStoredBlock, a replica may be removed from the blocksMap but not from the corruptedBlocksMap.

Unfortunately, the admins moved the namenode to a more permanent home and clobbered the existing logfiles in the process. (And I fixed the corrupt blocks: I didn't want to live with a problematic file system for too long!)

The code printed out the replicas in the blocksMap (node 10, 145, 117) and the corrupt entries in the corruptReplicas vairable (node16). The existing code calculates that 3 - 1 = 2 replicas must be good (this is a mistake as corruptReplicas is not a subset of blocksMap); however, when it starts to populate the machineSet, it only gets as far as nodes 10 and 145, then throws the exception on 117.

You're right - there's possibly an underlying problem here which this patches a symptom of. However, it still is a useful thing to fix: it is rather painful to go through filesystem cleanup when fsck dies instantly.

I believe there log message was something like this one:

2008-10-07 05:04:24,021 INFO org.apache.hadoop.dfs.StateChange: BLOCK NameSystem.markBlockAsCorrupt: block blk_-5420894356244363410_2169 could not be marked as corrupt as it does not exists in blocksMap

I think if a data node reports it has a corrupt block, then it gets added to the corrupt map and not the blocksMap. I took a peek at the underlying issue, and wasn't able to make much progress - an expert on FSNamesystem will be needed to find the underlying problem.

Brian

PS: I looked at the patch again - silly me leaked several changes I have made into this patch; please disregard any changes to files that are NOT FSNamesystem.java

Brian Bockelman
added a comment - 08/Oct/08 00:15 Hey Hairong,
Unfortunately, the admins moved the namenode to a more permanent home and clobbered the existing logfiles in the process. (And I fixed the corrupt blocks: I didn't want to live with a problematic file system for too long!)
The code printed out the replicas in the blocksMap (node 10, 145, 117) and the corrupt entries in the corruptReplicas vairable (node16). The existing code calculates that 3 - 1 = 2 replicas must be good (this is a mistake as corruptReplicas is not a subset of blocksMap); however, when it starts to populate the machineSet, it only gets as far as nodes 10 and 145, then throws the exception on 117.
You're right - there's possibly an underlying problem here which this patches a symptom of. However, it still is a useful thing to fix: it is rather painful to go through filesystem cleanup when fsck dies instantly.
I believe there log message was something like this one:
2008-10-07 05:04:24,021 INFO org.apache.hadoop.dfs.StateChange: BLOCK NameSystem.markBlockAsCorrupt: block blk_-5420894356244363410_2169 could not be marked as corrupt as it does not exists in blocksMap
I think if a data node reports it has a corrupt block, then it gets added to the corrupt map and not the blocksMap. I took a peek at the underlying issue, and wasn't able to make much progress - an expert on FSNamesystem will be needed to find the underlying problem.
Brian
PS: I looked at the patch again - silly me leaked several changes I have made into this patch; please disregard any changes to files that are NOT FSNamesystem.java

Hairong Kuang
added a comment - 07/Oct/08 23:28 Hi Brian, the debugging output that you provided is very informative. Can you still reproduce the problem? Can you print which replicas are in curruptReplicasMap?

Hairong Kuang
added a comment - 07/Oct/08 22:58 > it appears that the error happens when there are corrupt entries which are not in the blocksMap, but they are in the corruptReplicas object.
The solutions discussed avoid the exception to be thrown, but it does not fix the above fundamental problem.

> However, I thought that, because everything was written in terms of arrays, this might be a section where performance is critical.

You made a good point here. It will affect the performance if we use ArrayList. When I take a look at the codes again, it seems that there are bugs somewhere. Otherwise, we won't have ArrayIndexOutOfBoundsException. So we better fix the bugs.

Tsz Wo Nicholas Sze
added a comment - 07/Oct/08 19:26 > However, I thought that, because everything was written in terms of arrays, this might be a section where performance is critical.
You made a good point here. It will affect the performance if we use ArrayList. When I take a look at the codes again, it seems that there are bugs somewhere. Otherwise, we won't have ArrayIndexOutOfBoundsException. So we better fix the bugs.

Brian Bockelman
added a comment - 07/Oct/08 18:55 Hey Nicholas,
Per your second point: That would have been the natural thing to do. However, I thought that, because everything was written in terms of arrays, this might be a section where performance is critical.
If performance is not critical in this section, then rewriting things as an ArrayList/Set would be appropriate. Do you know if it is or not?
Brian

the expression "blockCorrupt || (!blockCorrupt && !replicaCorrupt)" is equivalent to "blockCorrupt || !replicaCorrupt". So we should use the second one.

use ArrayList<DatanodeDescriptor> for machineSet instead of DatanodeDescriptor[]. Then, we won't have ArrayIndexOutOfBoundsException anymore. (We probably should use Set<DatanodeDescriptor> as the variable name suggested.)

Tsz Wo Nicholas Sze
added a comment - 06/Oct/08 18:42 I suggest to do the following improvements:
the expression "blockCorrupt || (!blockCorrupt && !replicaCorrupt)" is equivalent to "blockCorrupt || !replicaCorrupt". So we should use the second one.
use ArrayList<DatanodeDescriptor> for machineSet instead of DatanodeDescriptor[]. Then, we won't have ArrayIndexOutOfBoundsException anymore. (We probably should use Set<DatanodeDescriptor> as the variable name suggested.)

The attached file fixes this problem (using svn diff against trunk); it allocates a full-sized array, and later truncates the array as necessary using System.arraycopy.

However, the whole thing smells of over-optimization. I would think that a better approach would be to rewrite the entire block: Convert everything to sets, and do the calculations based on the intersection of the sets.

The initial problem was due to the fact that there were corrupt blocks that aren't in the block map. I'm not sure if this is an allowed situation (i.e., is this situation due to a bug elsewhere in the code?)

Brian Bockelman
added a comment - 06/Oct/08 18:14 The attached file fixes this problem (using svn diff against trunk); it allocates a full-sized array, and later truncates the array as necessary using System.arraycopy.
However, the whole thing smells of over-optimization. I would think that a better approach would be to rewrite the entire block: Convert everything to sets, and do the calculations based on the intersection of the sets.
The initial problem was due to the fact that there were corrupt blocks that aren't in the block map. I'm not sure if this is an allowed situation (i.e., is this situation due to a bug elsewhere in the code?)

So, nodes 10, 145, and 117 hold good replicas; 16 held a bad one. However, because the allocated array size is (# of replicas) - (# of bad replicas), machineSet is allocated a size of 2, but we try to write 3 elements into it.