Commit Message

Andrei,
This fixes the issue you have reported and through a refactoring
makes the code simpler and easier to verify. That said I find your
last test case very interesting. While looking at it in detail
I have realized I don't fully understand why we have both lookup_mnt and
lookup_mnt_last, so I can't say that this change is fully correct.
Outside of propogate_umount I am don't have concerns but I am not 100%
convinced that my change to lookup_mnt_last does the right thing
in the case of propagate_umount.
I do see why your last test case scales badly. Long chains of shared
mounts that we can't skip. At the same time I don't really understand
that case. Part of it has to do with multiple child mounts of the same
mount on the same mountpoint.
So I am working through my concerns. In the mean time I figured it
would be useful to post this version. As this version is clearly better
than the version of this change that have come before it.
Eric
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Thu, 13 Oct 2016 13:27:19 -0500
Adrei Vagin pointed out that time to executue propagate_umount can go
non-linear (and take a ludicrious amount of time) when the mount
propogation trees of the mounts to be unmunted by a lazy unmount
overlap.
While investigating the horrible performance I realized that in
the case overlapping mount trees since the addition of locked
mount support the code has been failing to unmount all of the
mounts it should have been unmounting.
Make the walk of the mount propagation trees nearly linear by using
MNT_MARK to mark pieces of the mount propagation trees that have
already been visited, allowing subsequent walks to skip over
subtrees.
Make the processing of mounts order independent by adding a list of
mount entries that need to be unmounted, and simply adding a mount to
that list when it becomes apparent the mount can safely be unmounted.
For mounts that are locked on other mounts but otherwise could be
unmounted move them from their parnets mnt_mounts to mnt_umounts so
that if and when their parent becomes unmounted these mounts can be
added to the list of mounts to unmount.
Add a final pass to clear MNT_MARK and to restore mnt_mounts
from mnt_umounts for anything that did not get unmounted.
Add the functions propagation_visit_next and propagation_revisit_next
to coordinate walking of the mount tree and setting and clearing the
mount mark.
The skipping of already unmounted mounts has been moved from
__lookup_mnt_last to mark_umount_candidates, so that the new
propagation functions can notice when the propagation tree passes
through the initial set of unmounted mounts. Except in umount_tree as
part of the unmounting process the only place where unmounted mounts
should be found are in unmounted subtrees. All of the other callers
of __lookup_mnt_last are from mounted subtrees so the not checking for
unmounted mounts should not affect them.
A script to generate overlapping mount propagation trees:
$ cat run.sh
mount -t tmpfs test-mount /mnt
mount --make-shared /mnt
for i in `seq $1`; do
mkdir /mnt/test.$i
mount --bind /mnt /mnt/test.$i
done
cat /proc/mounts | grep test-mount | wc -l
time umount -l /mnt
$ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
Here are the performance numbers with and without the patch:
mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | 104857
mounts | before | after | after (sys) | after | after (sys) | after | after (sys)
-------------------------------------------------------------------------------------
1024 | 0.071s | 0.020s | 0.000s | 0.022s | 0.004s | 0.020s | 0.004s
2048 | 0.184s | 0.022s | 0.004s | 0.023s | 0.004s | 0.022s | 0.008s
4096 | 0.604s | 0.025s | 0.020s | 0.029s | 0.008s | 0.026s | 0.004s
8912 | 4.471s | 0.053s | 0.020s | 0.051s | 0.024s | 0.047s | 0.016s
16384 | 34.826s | 0.088s | 0.060s | 0.081s | 0.048s | 0.082s | 0.052s
32768 | | 0.216s | 0.172s | 0.160s | 0.124s | 0.160s | 0.096s
65536 | | 0.819s | 0.726s | 0.330s | 0.260s | 0.338s | 0.256s
131072 | | 4.502s | 4.168s | 0.707s | 0.580s | 0.709s | 0.592s
Andrei Vagin reports fixing the performance problem is part of the
work to fix CVE-2016-6213.
A script for a pathlogical set of mounts:
$ cat pathological.sh
mount -t tmpfs base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b
mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10
mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20
mount --rbind /mnt/b /mnt/b/10/20
unshare -Urm sleep 2
umount -l /mnt/b
wait %%
$ unshare -Urm pathlogical.sh
Cc: stable@vger.kernel.org
Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount")
Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
Reported-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
fs/mount.h | 1 +
fs/namespace.c | 7 +--
fs/pnode.c | 179 +++++++++++++++++++++++++++++++++++++++++----------------
fs/pnode.h | 2 +-
4 files changed, 133 insertions(+), 56 deletions(-)