Details

Description

Currently the version check for DN/NN communication is strict (it checks the exact svn revision or git hash; Storage#getBuildVersion calls VersionInfo#getRevision), which prevents rolling upgrades across any releases. Once we have the PB-based RPC in place (coming soon to branch-23) we'll have the necessary pieces to loosen this restriction, though it may take another 0.23 minor release or so before we're ready to commit to making the minor versions compatible.

Todd Lipcon
added a comment - 22/Feb/12 20:10 Would be nice to, at the same time or separately, add a "cluster version summary" section on the web UIs and JMX. I'm thinking of a table like:

Build version   | Build rev | Number of nodes | Node list
0.23.1          | r294294   | 99              | too many to list
0.23.1-SNAPSHOT | r294295   | 1               | dn034.example.com

and perhaps some kind of banner at the top of the NN UI indicating that a rolling upgrade is in progress (i.e. that the cluster has mixed versions).

Eli Collins
added a comment - 19/Mar/12 22:53 The motivation for the current behavior, from HADOOP-702:
New code enforces stricter version checking: if a data-node has a build version different from the name-node's, it fails, even if the layout and protocol versions are the same.
The build version is checked during the handshake - a new RPC call which happens before registration.
(The patch actually checked the revision, not the build version.)
Currently the only way to do a rolling upgrade, e.g. for a security update, is to lie about the build revision.
I propose the following:
For branch-1 we allow different build revisions but require the version to match strictly (and the layout version must match). I think we should do this for MR in branch-1 as well.
For trunk/branch-23 we allow different versions (for HDFS) since we're adding wire compatibility. The layout version must still match.
Reasonable?

Aaron T. Myers
added a comment - 05/Apr/12 00:29 Here's a patch which should address the issue. I'm going to do some manual testing of this, but I wanted to get a patch up for people to take a look at.

Aaron T. Myers
added a comment - 05/Apr/12 02:39 I did some manual testing of this patch and in doing so discovered a bug which blocks this issue: HDFS-3202.
After applying the patch on HDFS-3202, I confirmed that this patch works. Without this patch, DNs will fail to start if they try to register with an NN with a different build version. With this patch, a warning is printed that the build versions do not match, but the DN starts up just fine.

Hadoop QA
added a comment - 05/Apr/12 02:44 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12521419/HDFS-2983.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2188//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2188//console
This message is automatically generated.

Todd Lipcon
added a comment - 05/Apr/12 19:30 Maybe I'm misunderstanding, but I thought the plan for this JIRA was to add a more structured version number with major/minor/patch components. Then have the check still verify that the major/minor match up, but not verify the patch level and svn revision? That is to say, we should loosen the restriction but not entirely drop it.

Tsz Wo Nicholas Sze
added a comment - 05/Apr/12 19:47 Or how about adding a conf for the upgrade version? The NN version could be either the upgrade version or the same as the DN version. Throw exception for the other case.

Aaron T. Myers
added a comment - 06/Apr/12 01:32 Maybe I'm misunderstanding, but I thought the plan for this JIRA was to add a more structured version number with major/minor/patch components. Then have the check still verify that the major/minor match up, but not verify the patch level and svn revision? That is to say, we should loosen the restriction but not entirely drop it.
That makes sense. Do folks think that enforcing that the major/minor versions match is appropriate? Or should we only be checking major?
Or how about adding a conf for the upgrade version? The NN version could be either the upgrade version or the same as the DN version. Throw exception for the other case.
Not a bad idea, but it seems a shame to have to update all DN configs with a list of acceptable versions in order to do an upgrade, especially since the project has policies in place to try to maintain compatibility between patch releases.

Tsz Wo Nicholas Sze
added a comment - 06/Apr/12 02:03 > Not a bad idea, but it seems a shame to have to update all DN configs with a list of acceptable versions ...
It is a shame, but this is not what I mean. During an upgrade, the upgrade version is known and unique. Only that version is put in the conf, not a list of acceptable versions. This also prevents accidentally upgrading to a different version.

Aaron T. Myers
added a comment - 06/Apr/12 02:41 It is a shame, but this is not what I mean. During an upgrade, the upgrade version is known and unique. Only that version is put in the conf, not a list of acceptable versions.
Sorry, I misunderstood your proposal. Regardless, I think the point still remains - it seems a shame to have to update all of the DN configs when doing an upgrade in order to allow for rolling upgrade. I think it also implies that the user should again update all of the DN configs after the rolling upgrade is completed to remove the entry for the other acceptable version. It seems better to me to just check the major/minor (or just major?) version numbers, than require these conf changes.
Does that make sense?
This also prevents accidentally upgrade to a different version.
Does it? Unless I'm still misunderstanding you, if the version we're currently running is X, but we want to do a rolling upgrade to Y, I think we'd have to put "X" in the DN configs. The upgraded DNs will know that they're version Y, but they will have to know to accept a response from NN which is still running version X.

Tsz Wo Nicholas Sze
added a comment - 06/Apr/12 02:56 I think we should upgrade the NN first. Suppose we upgrade from X to Y as in your example. Then the upgrade can be done by (1) adding Y to the DNs' conf, (2) upgrading the NN to Y, and (3) rolling-upgrading the DNs to Y.
Comparing major/minor does not require changing the conf, but it also limits rolling upgrade. Say v1.1 and v1.2 are actually compatible; we still cannot perform the upgrade due to the version check.

Aaron T. Myers
added a comment - 06/Apr/12 03:23 I think we should upgrade the NN first. Suppose we upgrade from X to Y as in your example. Then the upgrade can be done by (1) adding Y to the DNs' conf, (2) upgrading the NN to Y, and (3) rolling-upgrading the DNs to Y.
What about the case (as mentioned in this comment) where a change is only scoped to the DN, and thus we'd like to do a rolling restart of DNs without ever bouncing the NN? I don't think your proposal allows for that.
Also, I think that the steps you describe above would have to have "perform a rolling restart of DNs to pick up the conf change" inserted between steps 1 and 2. It seems pretty kludgy to me to require 2 separate rolling restarts of DNs in order to perform a rolling upgrade.
Comparing major/minor does not require changing the conf, but it also limits rolling upgrade. Say v1.1 and v1.2 are actually compatible; we still cannot perform the upgrade due to the version check.
One thing I've proposed is that we could check only the major version number.
Another possibility is that we could by default check major and minor versions, but have a config which optionally disables the minor version check, to account for the case you describe above. How does that sound to you?

Tsz Wo Nicholas Sze
added a comment - 06/Apr/12 19:29 Another possibility is that we could by default check major and minor versions, but have a config which optionally disables the minor version check, to account for the case you describe above. How does that sound to you?
It sounds good.
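The check being discussed could look roughly like the following - a hypothetical sketch only, not the actual patch; the class name, method name, and the boolean flag standing in for the proposed config key are all made up for illustration:

```java
// Hypothetical sketch of the check discussed above: require matching
// major and minor version components by default, with an option to
// relax the check to major-only. Names are illustrative, not the
// actual Hadoop code.
public class VersionCheck {
  public static boolean isCompatible(String local, String remote,
                                     boolean checkMinor) {
    String[] l = local.split("\\.");
    String[] r = remote.split("\\.");
    if (l.length < 2 || r.length < 2 || !l[0].equals(r[0])) {
      return false; // major version must always match
    }
    if (checkMinor && !l[1].equals(r[1])) {
      return false; // minor version must match unless the check is relaxed
    }
    return true;    // patch level and build revision are never compared
  }
}
```

With checkMinor=true (the proposed default), 2.0.1 and 2.0.5 would interoperate while 2.0.x and 2.1.x would not; setting it to false covers the v1.1/v1.2 case Nicholas raises above.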

Todd Lipcon
added a comment - 06/Apr/12 19:58 Here's another proposal which I think makes sense:
1) Ensure that version compatibility is checked on both the NN side and the DN side. So, when the DN first connects to the NN, the NN verifies the DN's version. If it is deemed incompatible, it is rejected. Then the DN verifies the NN version in the response. If it is deemed incompatible, it does not proceed with registration.
2) Add a function to compare two version numbers in the straightforward manner: split the numbers on ".", then, componentwise, do comparisons according to "string numerical value" (like sort -n). Some examples: 2.0.1 > 2.0.0. 10.0 > 2.0.0. 2.0.0a > 2.0.0. 2.0.0b > 2.0.0a. (This is the comparison mechanism package managers tend to use.)
3) In hdfs-default.xml, add a configuration like cluster.min.supported.version. In branch-2, we set this to "2.0.0". So, by default, any 2.x.x can talk to any other 2.x.x. When we release 3.x.x, if it is incompatible with 2.x.x, we just need to bump that config in 3.0's hdfs-default.xml.
This supports the following use cases/requirements:
rolling upgrade can be done for most users without having to change any configs.
new versions of Hadoop can be marked incompatible with old versions of Hadoop.
cluster admins can still override it if they want to disallow older nodes from connecting. For example, imagine there is a critical security bug fixed in 2.0.0a - the admin can set the config to 2.0.0a, and then 2.0.0 nodes may no longer join the cluster (even though they are protocol-wise compatible).
Thoughts?
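The componentwise comparison in item 2 could be sketched as follows - a hypothetical illustration under the rules Todd describes, not the VersionUtil class from the actual patch; the class and method names are made up:

```java
// Sketch of the componentwise version comparison described in item 2:
// split on ".", then compare each component, treating runs of digits
// numerically (so 10.0 > 2.0.0) and trailing letters lexically (so
// 2.0.0b > 2.0.0a > 2.0.0). Illustrative only; not the actual patch.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionComparator {
  private static final Pattern GROUPS = Pattern.compile("\\d+|\\D+");

  public static int compareVersions(String v1, String v2) {
    String[] c1 = v1.split("\\.");
    String[] c2 = v2.split("\\.");
    int n = Math.max(c1.length, c2.length);
    for (int i = 0; i < n; i++) {
      // A missing component compares as empty, so 2.0.0 > 2.0.
      String a = i < c1.length ? c1[i] : "";
      String b = i < c2.length ? c2[i] : "";
      int c = compareComponent(a, b);
      if (c != 0) {
        return c;
      }
    }
    return 0;
  }

  // Compare one dot-separated component by alternating digit and
  // non-digit runs: digit runs compare numerically, others lexically.
  private static int compareComponent(String a, String b) {
    List<String> ga = groupsOf(a);
    List<String> gb = groupsOf(b);
    int n = Math.max(ga.size(), gb.size());
    for (int i = 0; i < n; i++) {
      String x = i < ga.size() ? ga.get(i) : "";
      String y = i < gb.size() ? gb.get(i) : "";
      if (x.equals(y)) {
        continue;
      }
      if (x.matches("\\d+") && y.matches("\\d+")) {
        int c = Integer.compare(Integer.parseInt(x), Integer.parseInt(y));
        if (c != 0) {
          return c;
        }
      } else {
        return x.compareTo(y); // non-numeric runs fall back to string order
      }
    }
    return 0;
  }

  private static List<String> groupsOf(String s) {
    List<String> out = new ArrayList<>();
    Matcher m = GROUPS.matcher(s);
    while (m.find()) {
      out.add(m.group());
    }
    return out;
  }
}
```

The minimum-version check in item 3 would then be a one-liner: reject a peer when compareVersions(peerVersion, minSupportedVersion) < 0.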

Sanjay Radia
added a comment - 06/Apr/12 21:16 Upgrades involve snapshots on the NN and DNs. Snapshots are not just for layout changes but also for data protection against a version change that introduces a bug with the potential to wipe out a portion of the data. (Other systems ask the admin to do a backup before starting the upgrade - that does not make sense for HDFS.)
HDFS's snapshot feature is a unique and powerful part of HDFS's upgrade.
Yahoo does a snapshot as part of all upgrades, except for EBFs that do not touch the core of the file system.
At this stage I don't know what snapshots mean for rolling upgrades.
This requires further thought.

Todd Lipcon
added a comment - 06/Apr/12 21:22 Hey Sanjay. I agree that the snapshot-on-upgrade feature is really important, and I don't think this work precludes/breaks that. Here's my line of thinking:
even with the ability to do rolling upgrade, there is no restriction that you must do upgrades like this. So, you could still decide to use the current upgrade process as a policy decision.
As you mentioned, many upgrades/hotfixes/EBFs don't touch core code, so for those, most people would prefer a rolling upgrade without downtime.
separately, after this is committed, we can work on figuring out a strategy that allows you to do an upgrade-style snapshot before starting the rolling upgrade. It looks like you just filed HDFS-3225 for this, so let's continue this discussion there.
Agree?

Sanjay Radia
added a comment - 07/Apr/12 03:00 Most operators will not know when to take the snapshot. Hence there must be conditions under which the system forces an upgrade snapshot. Any layout version change should force a snapshot. Perhaps a minor version change should also force one.
Perhaps each version of HDFS should specify whether a snapshot is required when upgrading from the previous version.

Sanjay Radia
added a comment - 07/Apr/12 03:05 The proposal seems to suggest that the NN does not need to be updated if desired. Correct?
I see why that is desirable, but can we simplify things or make upgrades safer if we drop that requirement?

Todd Lipcon
added a comment - 07/Apr/12 07:08 The proposal seems to suggest that the NN does not need to be updated if desired. Correct?
Yes, I think that's correct, and desired. Sometimes upgrades only address the slave nodes, so there's no sense having to change the NN. Of course, with HA, upgrading the NN isn't as big a problem, but even so it is a more complicated/delicate operation.
I see why it is desirable, but can we simplify things or make upgrades safer if we drop that requirement?
I don't know if it makes things much simpler. I think adding a requirement that the NN upgrade before the DNs is quite inconvenient for operators. But I am not 100% sure of this, and am willing to be convinced.

Hadoop QA
added a comment - 08/Apr/12 13:13 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12521882/HDFS-2983.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified test files.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2224//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2224//console
This message is automatically generated.

+ * This function splits the two versions on "." and performs a lexical
+ * comparison of the resulting components.

Technically the quote characters inside the Javadoc should be &quot; - or you could just use single quotes instead to avoid the hassle.

VersionUtil should be doing numeric comparison rather than straight string comparison. For example, "10.0.0" should be considered greater than "2.0", but I think the current implementation doesn't implement this correctly.

Todd Lipcon
added a comment - 09/Apr/12 02:46 erg, JIRA went and formatted my explanation. The quote characters should be &quot; without the space.
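Todd's point about numeric comparison is easy to see with plain Java string comparison (this snippet is only an illustration of the pitfall, not code from the patch):

```java
// Demonstrates why straight string comparison mis-orders versions:
// lexically '1' < '2', so "10.0.0" sorts before "2.0" even though
// version 10 is newer. Comparing the leading numeric components as
// integers gives the intended order.
public class StringVsNumericCompare {
  public static void main(String[] args) {
    boolean lexicalSaysOlder = "10.0.0".compareTo("2.0") < 0;
    System.out.println(lexicalSaysOlder); // true: string order is wrong here

    int major1 = Integer.parseInt("10.0.0".split("\\.")[0]);
    int major2 = Integer.parseInt("2.0".split("\\.")[0]);
    System.out.println(major1 > major2);  // true: numeric order is correct
  }
}
```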

Hadoop QA
added a comment - 09/Apr/12 06:45 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12521907/HDFS-2983.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified test files.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2227//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2227//console
This message is automatically generated.

Aaron T. Myers
added a comment - 09/Apr/12 08:07 I realized that the previous patch had a bug in handling numeric groups of characters after non-numeric groups in version components. Here's an updated patch with added test cases to cover this case.

Hadoop QA
added a comment - 09/Apr/12 09:24 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12521915/HDFS-2983.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified test files.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2228//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2228//console
This message is automatically generated.

John George
added a comment - 09/Apr/12 16:09 I am +1 on Todd's proposal.
A couple of very minor comments on the code. It's up to you to take them or leave them.
I like the proposed name MIN_SUPPORTED_NN_VERSION better than MINIMUM_NN_VERSION.
In cases where the version numbers do not match but are compatible (e.g. when a DN with a higher version than MINIMUM_*_VERSION connects), it would be nice to print a LOG message that states that version.
Good work, Aaron.

Hadoop QA
added a comment - 09/Apr/12 18:32 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12521968/HDFS-2983.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified test files.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2229//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2229//console
This message is automatically generated.

Konstantin Shvachko
added a comment - 09/Apr/12 18:50 I'd like to reiterate Sanjay's point. If you start a rolling upgrade with snapshot creation, it can get very messy. Some nodes will be creating snapshots with the new version and some will continue with the old. Then rollback becomes tricky, if possible at all. Two questions in this regard:
What is the procedure for rolling upgrade? I mean the exact sequence of steps admins will follow to upgrade a cluster.
Can we distinguish between rolling and old-style upgrades and never allow them simultaneously? It feels like this should be the precondition for relaxing the version-matching condition.

In a running cluster, if you restart the NN without the -upgrade flag, then the DataNodes will happily re-register without exiting.

If you restart the NN with -upgrade, then when the DN next heartbeats, it will fail the verifyRequest() check, since the registration ID's namespace fields no longer match (the ctime has been incremented by the upgrade). This causes the DataNode to exit.

Of course, restarting the DN at this point makes it take the snapshot and participate in the upgrade as expected.

So, to try to respond to Konstantin's questions, here are a couple example scenarios:

On each DN, in succession: (1) shutdown DN, (2) upgrade software to the new version, (3) start DN

The above is sufficient if the changes are scoped only to DNs. If the change also affects the NN, then you will need to add the following step, either at the beginning or end of the process:

shutdown NN. upgrade installed software. start NN on new version

In the case of an HA setup, we can do the NN upgrade without downtime:

shutdown SBN. upgrade SBN software. start SBN.

failover to SBN running new version.

Shutdown previous active. Upgrade software. Start previous active

Optionally fail back

Scenario 2: upgrade to a version with a new layout version (LV)

In this case, a "snapshot" style upgrade is required – the NN will not restart without the "-upgrade" flag, and a DN will not connect to a NN with a different LV. So the scenario is the same as today:

Shutdown entire cluster

Upgrade all software in teh clsuter

Start cluster with -upgrade flag

any nodes that missed the software upgrade will fail to connect, since their LV does not match (this patch retains that behavior)

Scenario 3: upgrade to a version with same layout version, but some data risk (for example upgrading to a version with bug fixes pertaining to replication policies, corrupt block detection, etc)

In this scenario, the NN does not mandate a -upgrade flag, but as Sanjay mentioned above, it can still be useful for data protection. As with today, if the user does not want the extra protection, this scenario can be treated identically to scenario 1. If the user does want the protection, it can be treated identically to scenario 2. Scenario 2 remains safe because of the check against the NameNode's ctime matching the DN's ctime. As soon as you restart the NN with the -upgrade flag, all running DNs will exit. Any newly started DN will noticethe new namespace ctime and take part in the snapshot upgrade.

Does the above description address your concerns? Another idea would be to add a new configuration option like dfs.allow.rolling.upgrades which enables the new behavior, so an admin who prefers not to use the feature can disallow it completely.

Todd Lipcon
added a comment - 09/Apr/12 21:29 I did a little investigation to try to answer Konstantin's questions above.
First, I'll summarize our current behavior, verified on 0.23.1 release (I didn't understand this thoroughly before trying it out):
In a running cluster, if you restart the NN without the -upgrade flag, then the DataNodes will happily re-register without exiting.
If you restart the NN with -upgrade , then when the DN next heartbeats, it will fail the verifyRequest() check, since the registration ID's namespace fields no longer match (the ctime has been incremented by the upgrade). This causes the DataNode to exit.
Of course, restarting the DN at this point makes it take the snapshot and participate in the upgrade as expected.
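The ctime-based rejection described above can be sketched as follows. This is an illustrative sketch only; the class and method names are hypothetical, not actual Hadoop source:

```java
// Illustrative sketch (hypothetical names, not actual Hadoop source) of the
// behavior described above: restarting the NN with -upgrade increments the
// namespace ctime, so heartbeats from already-registered DataNodes fail the
// verifyRequest()-style check and the DNs exit.
public class HandshakeSketch {
  static final class NamespaceInfo {
    final long cTime;          // bumped when a snapshot-style upgrade starts
    final int layoutVersion;
    NamespaceInfo(long cTime, int layoutVersion) {
      this.cTime = cTime;
      this.layoutVersion = layoutVersion;
    }
  }

  // A DN's stored registration must still match the NN's current namespace.
  public static boolean heartbeatAccepted(NamespaceInfo dnReg, NamespaceInfo nn) {
    return dnReg.cTime == nn.cTime && dnReg.layoutVersion == nn.layoutVersion;
  }

  public static void main(String[] args) {
    NamespaceInfo atRegistration = new NamespaceInfo(100L, -40);
    NamespaceInfo afterUpgrade = new NamespaceInfo(101L, -40); // ctime bumped
    System.out.println(heartbeatAccepted(atRegistration, atRegistration)); // true
    System.out.println(heartbeatAccepted(atRegistration, afterUpgrade));   // false: DN exits
  }
}
```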
So, to try to respond to Konstantin's questions, here are a couple example scenarios:
Scenario 1 : rolling upgrade without doing a "snapshot" upgrade (for emergency bug fixes, hot fixes, MR fixes, other fixes which we don't expect to affect data reliability):
Leave the NN running, on the old version.
On each DN, in succession: (1) shutdown DN, (2) upgrade software to the new version, (3) start DN
The above is sufficient if the changes are scoped only to DNs. If the change also affects the NN, then you will need to add the following step, either at the beginning or end of the process:
shutdown NN. upgrade installed software. start NN on new version
In the case of an HA setup, we can do the NN upgrade without downtime:
shutdown SBN. upgrade SBN software. start SBN.
failover to SBN running new version.
Shutdown previous active. Upgrade software. Start previous active
Optionally fail back
Scenario 2 : upgrade to a version with a new layout version (LV)
In this case, a "snapshot" style upgrade is required – the NN will not restart without the "-upgrade" flag, and a DN will not connect to a NN with a different LV. So the scenario is the same as today:
Shutdown entire cluster
Upgrade all software in the cluster
Start cluster with -upgrade flag
any nodes that missed the software upgrade will fail to connect, since their LV does not match (this patch retains that behavior)
Scenario 3 : upgrade to a version with same layout version, but some data risk (for example upgrading to a version with bug fixes pertaining to replication policies, corrupt block detection, etc)
In this scenario, the NN does not mandate a -upgrade flag, but as Sanjay mentioned above, it can still be useful for data protection. As with today, if the user does not want the extra protection, this scenario can be treated identically to scenario 1. If the user does want the protection, it can be treated identically to scenario 2. Scenario 2 remains safe because of the check against the NameNode's ctime matching the DN's ctime. As soon as you restart the NN with the -upgrade flag, all running DNs will exit. Any newly started DN will notice the new namespace ctime and take part in the snapshot upgrade.
Does the above description address your concerns? Another idea would be to add a new configuration option like dfs.allow.rolling.upgrades which enables the new behavior, so an admin who prefers not to use the feature can disallow it completely.

Todd Lipcon
added a comment - 09/Apr/12 22:49
+    if (!dnVersion.equals(nnVersion)) {
+      LOG.info("Reported DataNode version '" + dnVersion + "' does not match " +
+          "NameNode version '" + nnVersion + "' but is within acceptable " +
+          "limits. Note: This is normal during a rolling upgrade.");
+    }
Can you also please include the DN IP address in this log message?
Nice lengthy javadoc on VersionUtil.compareVersions. Can you please add something like:
"This method of comparison is similar to the method used by package versioning systems like deb and RPM"
and also maybe give one example of what you mean? eg add "For example, Hadoop 0.3 < Hadoop 0.20 even though naive string comparison would consider it larger."
Otherwise, looks great. +1 from my standpoint. Konstantin/Sanjay - can you please comment regarding the above discussion? While I agree that there are more improvements to be made, I don't think this patch will hurt things. Or, if you are nervous about it, can we commit this with a flag to allow rolling upgrade if the operator permits it?
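For illustration, the package-manager-style comparison Todd asks to be documented might look like the sketch below. The class name and the restriction to dotted numeric components are assumptions of this sketch; the real VersionUtil.compareVersions is more general:

```java
// Simplified sketch of component-wise version comparison in the spirit of
// VersionUtil.compareVersions. Handles only dotted numeric versions; the
// actual Hadoop method is more general.
public class VersionCompareSketch {
  // Returns negative if v1 < v2, zero if equal, positive if v1 > v2.
  public static int compareVersions(String v1, String v2) {
    String[] a = v1.split("\\.");
    String[] b = v2.split("\\.");
    int n = Math.max(a.length, b.length);
    for (int i = 0; i < n; i++) {
      // Missing trailing components compare as 0, so "0.23" == "0.23.0".
      int x = i < a.length ? Integer.parseInt(a[i]) : 0;
      int y = i < b.length ? Integer.parseInt(b[i]) : 0;
      if (x != y) {
        return Integer.compare(x, y);
      }
    }
    return 0;
  }

  public static void main(String[] args) {
    // 0.3 < 0.20 numerically, even though "0.3".compareTo("0.20") > 0.
    System.out.println(compareVersions("0.3", "0.20"));    // negative
    System.out.println(compareVersions("0.23.1", "0.23")); // positive
  }
}
```

This is why naive lexicographic comparison is wrong for versions: it compares "3" against "2" character by character instead of 3 against 20 as numbers.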

Aaron T. Myers
added a comment - 09/Apr/12 23:19 Thanks a lot for the review, Todd. Here's an updated patch which addresses your latest two comments.
PS. I thought you'd like that javadoc. I put a lot of thought into it.

Hadoop QA
added a comment - 10/Apr/12 00:32 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12522027/HDFS-2983.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified test files.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2234//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2234//console
This message is automatically generated.

Konstantin Shvachko
added a comment - 10/Apr/12 02:13 Yes, ctime will prevent DNs from previous versions connecting to NN started with -upgrade.
Can we add a condition to allow rolling-upgrades only if there are no upgrades-snapshots in progress.
That is one should first finalize or rollback the upgrade-snapshot, then do rolling upgrades.
The scenario that scares me is if somebody does a snapshot, then several rolling upgrades, and then decides to rollback. This may be possible, but seems to be very much error-prone.
Also that way we will have clear separation between snapshots and rolling upgrades.

Todd Lipcon
added a comment - 10/Apr/12 02:17 The scenario that scares me is if somebody does a snapshot, then several rolling upgrades, and then decides to rollback. This may be possible, but seems to be very much error-prone.
Why is this scenario different than if somebody does a snapshot, then several non-rolling upgrades, then decides to rollback? In both cases, we have the case of a newer version trying to do a rollback to an older version snapshot. Right?

Konstantin Shvachko
added a comment - 10/Apr/12 19:36 The scenario is not different if you do rolling or non-rolling upgrades. The difference is that the patch removes build version verification, which means that after 3 upgrades (rolling or not) you are guaranteed to have 3 sets of nodes running different versions of Hadoop talking to your NameNode.
While in today's world the strict version verification guarantees that un-upgraded nodes never talk to NN. If you add to this snapshots, block pools, HA NameNodes it becomes an operability horror.
Separating snapshots from upgrades straightens things up, imho.
Todd, your proposal to add DN versions to the Web UI is also very important. Otherwise it is hard to monitor cluster consistency.

Todd Lipcon
added a comment - 10/Apr/12 19:51 How about the following proposal:
change the check for DN registration so that, if the DN's ctime differs from the NN's ctime (i.e. the NN has started a snapshot-style upgrade), then the version check will be strict
file a follow-up JIRA to add a cluster version summary to the web UI and to the NN metrics, allowing ops to monitor whether they have machines that might have "missed" a rolling upgrade
Does that address your concern? I agree with your point that it can be confusing to manage, but not sure what the specific change you're asking for is.
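The proposed registration policy can be sketched as below. All names are hypothetical illustrations of the proposal, not actual Hadoop source:

```java
// Sketch (hypothetical names) of the proposal above: tolerate mixed build
// versions during normal operation, but fall back to strict version checking
// whenever the DN's ctime differs from the NN's, i.e. while a snapshot-style
// upgrade is in progress.
public class RegistrationPolicySketch {
  public static boolean registrationAllowed(long dnCTime, long nnCTime,
                                            String dnVersion, String nnVersion) {
    if (dnCTime != nnCTime) {
      // Snapshot upgrade underway: only identically versioned builds may join.
      return dnVersion.equals(nnVersion);
    }
    // Normal operation: rolling upgrades with mixed versions are allowed.
    return true;
  }

  public static void main(String[] args) {
    // Mixed versions fine while no upgrade is in progress.
    System.out.println(registrationAllowed(100L, 100L, "0.23.1", "0.23.2")); // true
    // Version mismatch rejected mid-upgrade (ctime differs).
    System.out.println(registrationAllowed(100L, 101L, "0.23.1", "0.23.2")); // false
    // Matching versions may still join mid-upgrade.
    System.out.println(registrationAllowed(100L, 101L, "0.23.2", "0.23.2")); // true
  }
}
```

This keeps Konstantin's invariant (no mixed builds between -upgrade and -finalize/-rollback) while permitting rolling upgrades the rest of the time.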

Hadoop QA
added a comment - 11/Apr/12 04:31 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12522193/HDFS-2983.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified test files.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2248//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2248//console
This message is automatically generated.

Konstantin Shvachko
added a comment - 12/Apr/12 08:43 To clarify my point for the record.
From the time a snapshot is started (NN -upgrade) and until it is completed by -finalize or -rollback no DataNodes with different build versions are allowed on the cluster.
Todd's comment captures it technically.