Details

Description

The 2NN currently has logic to detect when its on-disk FS metadata needs an upgrade with respect to the NN's metadata (i.e. the layout versions are different) and in this case it will proceed with the checkpoint despite storage signatures not matching precisely if the BP ID and Cluster ID do match exactly. However, in situations where we're upgrading from versions of HDFS prior to federation, which had no BP IDs or Cluster IDs, checkpoints will always fail with an error like the following:

Activity

Here's a patch which addresses the issue by only comparing the namespace IDs to determine if this is the same NN/2NN pair, instead of the NS IDs and BP IDs/Cluster IDs. NS IDs predate federation so this allows upgrades from pre-federation versions of HDFS. This patch also makes sure that we definitely reload the downloaded fsimage from disk in the case that we've identified that the 2NN's metadata does not match the layout version of the NN's.

Aaron T. Myers
added a comment - 01/Feb/13 01:36 Here's a patch which addresses the issue by only comparing the namespace IDs to determine if this is the same NN/2NN pair, instead of the NS IDs and BP IDs/Cluster IDs. NS IDs predate federation so this allows upgrades from pre-federation versions of HDFS. This patch also makes sure that we definitely reload the downloaded fsimage from disk in the case that we've identified that the 2NN's metadata does not match the layout version of the NN's.

Arun C Murthy Blocker? Probably not. Pretty good to have? I think so. There's a pretty simple work-around: when upgrading from a pre-federation version of HDFS, blow away your 2NN checkpoint dirs before starting up your 2NN again. A problem will arise if an admin doesn't notice that all of their 2NN checkpoints are failing post-upgrade.

Considering that namespace ID is an integer, whereas cluster ID is based on a GUID, it seems there is higher likelihood of accidental collision. Then, CheckpointSignature#validateStorageInfo could misidentify a match. It's still highly unlikely (but non-zero).

I'm wondering if a safer change would be (pseudo-code):

if namespace ID + cluster ID + blockpool ID are defined on both
compare all 3 fields
elseif only namespace ID is defined on one of them
compare only namespace ID

This would keep the logic the same for upgrades between 2 post-federation versions, and just change the logic for the case of pre-fed -> post-fed.

Chris Nauroth
added a comment - 01/Feb/13 20:44 Hi, Aaron. The code looks good. I applied the patch to branch-2 and ran multiple test suites related to checkpoints and 2NN.
- boolean isSameCluster(FSImage si) {
- return namespaceID == si.getStorage().namespaceID &&
- clusterID.equals(si.getClusterID()) &&
- blockpoolID.equals(si.getBlockPoolID());
+ boolean namespaceIdMatches(FSImage si) {
+ return namespaceID == si.getStorage().namespaceID;
}
Considering that namespace ID is an integer, whereas cluster ID is based on a GUID, it seems there is higher likelihood of accidental collision. Then, CheckpointSignature#validateStorageInfo could misidentify a match. It's still highly unlikely (but non-zero).
I'm wondering if a safer change would be (pseudo-code):
if namespace ID + cluster ID + blockpool ID are defined on both
compare all 3 fields
else if only namespace ID is defined on one of them
compare only namespace ID
This would keep the logic the same for upgrades between 2 post-federation versions, and just change the logic for the case of pre-fed -> post-fed.
Or am I being too paranoid?

Thanks a lot for the review, Chris, and for running those additional tests.

Your suggestion does seem pretty paranoid (odds are 1 over 2^31), but better to be overly conservative in cases such as this.

Please take a look at the updated patch. This patch expressly checks to see if the local metadata's layout version supports federation or not, and only compares the namespace IDs if it doesn't support federation. If federation is supported, all three fields are compared.

Aaron T. Myers
added a comment - 01/Feb/13 21:38 Thanks a lot for the review, Chris, and for running those additional tests.
Your suggestion does seem pretty paranoid (odds are 1 over 2^31), but better to be overly conservative in cases such as this.
Please take a look at the updated patch. This patch expressly checks to see if the local metadata's layout version supports federation or not, and only compares the namespace IDs if it doesn't support federation. If federation is supported, all three fields are compared.