Description of problem: There existed a previous bug where, if a volume failed to unmount (timeout or otherwise), kubelet would recursively delete the mount point regardless. This led to data being deleted on volumes.
This has been fixed so that the recursive delete will not cross file systems. However, if a user running the old (unfixed) kubelet upgrades to a fixed version of OpenShift via the documented upgrade path, draining the node may trigger this data-deletion code path, so data loss may still occur. For example: https://bugzilla.redhat.com/show_bug.cgi?id=1443623
Talking with sjennings on IRC, he said that for z-stream releases, while risky, we could replace the node binary and restart the service instead of draining the node. This approach is not advisable for major releases, nor as a permanent upgrade process for z-streams, and should only be used as required.
To add context to sjennings' concern about using it permanently: he said that kubelet may have trouble rebuilding its state from the previous version if paths or file naming changed.
The data deletion bug is fixed in these versions:
v3.2.1.31
v3.3.1.28
v3.4.1.27
v3.5.5.18
v3.6
Any z-stream older than the ones listed should do the binary replace, at least to the version listed; then the node-drain upgrade process can be followed.
Additionally, the upgrade docs need to be updated to instruct users that they must upgrade to the most recent z-stream patch before upgrading to the next major release (e.g. 3.2.1.29 --> 3.2.1.31, then 3.3, 3.4, 3.5, etc.).
REF:
https://docs.openshift.com/container-platform/3.3/install_config/upgrading/automated_upgrades.html
https://bugzilla.redhat.com/show_bug.cgi?id=1447764
https://bugzilla.redhat.com/show_bug.cgi?id=1447312
https://bugzilla.redhat.com/show_bug.cgi?id=1447767

The proposed solution is that, as part of the node upgrade, we update the package or container and restart the service if the current version falls into one of the problematic ranges below, then perform the normal upgrade procedures.
3.2.0 < $current_version < 3.2.1.31
3.3.0 < $current_version < 3.3.1.28
3.4.0 < $current_version < 3.4.1.27
3.5.0 < $current_version < 3.5.5.18
$current_version < 3.5.5.18 AND $target_version >= 3.6
This means that if the version is in one of the problematic ranges, the node process will be restarted twice during an upgrade: once for this special task, and then once as part of the normal drain-and-upgrade process.
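The range checks above could be sketched roughly as follows. This is only an illustration, not the actual playbook logic; the `version_lt` and `needs_binary_replace` helper names are hypothetical, and it relies on GNU `sort -V` for dot-separated version comparison.

```shell
#!/bin/sh
# version_lt A B: true if version A sorts strictly before version B.
# Uses sort -V (GNU coreutils) for dotted-version ordering.
version_lt() {
    [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# needs_binary_replace VERSION: true if VERSION falls in one of the
# problematic ranges listed above (exclusive on both ends).
needs_binary_replace() {
    v="$1"
    for range in "3.2.0 3.2.1.31" "3.3.0 3.3.1.28" \
                 "3.4.0 3.4.1.27" "3.5.0 3.5.5.18"; do
        set -- $range   # $1 = lower bound, $2 = first fixed version
        if version_lt "$1" "$v" && version_lt "$v" "$2"; then
            return 0
        fi
    done
    return 1
}

needs_binary_replace "3.2.1.29" && echo "replace binary first"
```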

Isn't it the new node which, on start, finds things in /var/lib/origin-volumes (or whatever) and actually does the bad rm? If so, isn't this fixed simply by updating to the latest? It doesn't matter where you come from, it only matters where you go to.
Also, if that is the case, we could more easily work around this by stopping the node, running umount -l on everything still mounted, then starting the new node. That way the new node could not see those mounts and delete them. (It also could never properly unmount or re-use them, but that's a different problem, and one I'd rather have than known data deletion...)
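The workaround described above could look roughly like this sketch. The volume root path and the `volume_mounts`/`workaround` helper names are assumptions (the default kubelet volume directory on an Origin install); adjust for your environment before using anything like this.

```shell
#!/bin/sh
# List mount points under the kubelet volume root, read from a mounts
# table ($1, defaulting to /proc/mounts). The path is an assumption.
volume_mounts() {
    awk -v root="/var/lib/origin/openshift.local.volumes/" \
        'index($2, root) == 1 {print $2}' "${1:-/proc/mounts}"
}

# Sketch of the workaround: stop the node, lazy-unmount leftover volume
# mounts so the new node never sees them, then start the upgraded node.
workaround() {
    systemctl stop atomic-openshift-node
    volume_mounts | while read -r m; do
        umount -l "$m"   # lazy unmount: detach now, clean up when unused
    done
    systemctl start atomic-openshift-node
}
```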

After some pre-defined interval, stopping the node will automatically trigger migration of pods to another node. The migration may trigger code that cleans up the orphaned pod directory, thereby triggering the bug.

Minor correction: stopping the atomic-openshift-node process will trigger migration of pods to another node, but it will not trigger the orphaned-pod-directory deletion code (because the node process is not running!). Sorry, I had a brain f**t for a moment.
Anyway, stopping the node will not stop the containers running on it, so we would be performing the unmount on volumes that are still in use.

Yes, but drain causes pod eviction, and that triggers the CleanOrphanPodDir routine (for all evicted pods). Evicted pods are treated differently from deleted pods, and somehow we are calling CleanOrphanPodDir for all evicted pods. I have only verified this through experimentation; I haven't dug through the code yet.
So if an NFS volume somehow fails to unmount properly during drain, the CleanOrphanPodDir routine *will* delete data on versions without the fix. We have two bugs here:
a. Volumes not getting unmounted on pod eviction.
b. Data getting deleted on CleanOrphanPodDir call.
So, it matters which version we are upgrading from as well.

The decision is that we'll do the forced upgrade and then follow with the normal upgrade steps afterwards. I think this means we'll need to refactor the playbook to move the drain operation into the openshift_node_upgrade role, but I think we should do that anyway. I'll get someone working on this tomorrow.

I just realized the proposal does not account for blue-green upgrades, where the service is never upgraded; instead, new hosts are provisioned running the new version and then the old hosts are removed from the cluster as follows (per the docs; we don't actually automate this):
$ oadm manage-node --selector=color=blue --evacuate
$ oc delete node --selector=color=blue
I assume this would trigger the problem?