Commit Message

Adrei Vagin pointed out that time to executue propagate_umount can go
non-linear (and take a ludicrious amount of time) when the mount
propogation trees of the mounts to be unmunted by a lazy unmount
overlap.
While investigating the horrible performance I realized that in
the case overlapping mount trees since the addition of locked
mount support the code has been failing to unmount all of the
mounts it should have been unmounting.
Make the walk of the mount propagation trees nearly linear by using
MNT_MARK to mark pieces of the mount propagation trees that have
already been visited, allowing subsequent walks to skip over
subtrees.
Make the processing of mounts order independent by adding a list of
mount entries that need to be unmounted, and simply adding a mount to
that list when it becomes apparent the mount can safely be unmounted.
For mounts that are locked on other mounts but otherwise could be
unmounted move them from their parnets mnt_mounts to mnt_umounts so
that if and when their parent becomes unmounted these mounts can be
added to the list of mounts to unmount.
Add a final pass to clear MNT_MARK and to restore mnt_mounts
from mnt_umounts for anything that did not get unmounted.
Add the functions propagation_visit_next and propagation_revisit_next
to coordinate walking of the mount tree and setting and clearing the
mount mark.
The skipping of already unmounted mounts has been moved from
__lookup_mnt_last to mark_umount_candidates, so that the new
propagation functions can notice when the propagation tree passes
through the initial set of unmounted mounts. Except in umount_tree as
part of the unmounting process the only place where unmounted mounts
should be found are in unmounted subtrees. All of the other callers
of __lookup_mnt_last are from mounted subtrees so the not checking for
unmounted mounts should not affect them.
A script to generate overlapping mount propagation trees:
$ cat run.sh
mount -t tmpfs test-mount /mnt
mount --make-shared /mnt
for i in `seq $1`; do
mkdir /mnt/test.$i
mount --bind /mnt /mnt/test.$i
done
cat /proc/mounts | grep test-mount | wc -l
time umount -l /mnt
$ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
Here are the performance numbers with and without the patch:
mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | 104857
mounts | before | after | after (sys) | after | after (sys) | after | after (sys)
-------------------------------------------------------------------------------------
1024 | 0.071s | 0.020s | 0.000s | 0.022s | 0.004s | 0.020s | 0.004s
2048 | 0.184s | 0.022s | 0.004s | 0.023s | 0.004s | 0.022s | 0.008s
4096 | 0.604s | 0.025s | 0.020s | 0.029s | 0.008s | 0.026s | 0.004s
8912 | 4.471s | 0.053s | 0.020s | 0.051s | 0.024s | 0.047s | 0.016s
16384 | 34.826s | 0.088s | 0.060s | 0.081s | 0.048s | 0.082s | 0.052s
32768 | | 0.216s | 0.172s | 0.160s | 0.124s | 0.160s | 0.096s
65536 | | 0.819s | 0.726s | 0.330s | 0.260s | 0.338s | 0.256s
131072 | | 4.502s | 4.168s | 0.707s | 0.580s | 0.709s | 0.592s
Andrei Vagin reports fixing the performance problem is part of the
work to fix CVE-2016-6213.
A script for a pathlogical set of mounts:
$ cat pathological.sh
mount -t tmpfs base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b
mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10
mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20
mount --rbind /mnt/b /mnt/b/10/20
unshare -Urm sleep 2
umount -l /mnt/b
wait %%
$ unsahre -Urm pathlogical.sh
Cc: stable@vger.kernel.org
Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount")
Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
Reported-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
Barring some stupid mistake this looks like it fixes both the performance
and the correctness issues I was able to spot earlier. Andrei if you
could give this version a look over I would appreciate it.
Unless we can find a problem I am going to call this the final version.
fs/mount.h | 1 +
fs/namespace.c | 7 +-
fs/pnode.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++-----------
fs/pnode.h | 2 +-
4 files changed, 165 insertions(+), 43 deletions(-)

Andrei Vagin <avagin@virtuozzo.com> writes:
>> Barring some stupid mistake this looks like it fixes both the performance>> and the correctness issues I was able to spot earlier. Andrei if you>> could give this version a look over I would appreciate it.>> Eric, could you try out this script:
[snip script]
> It hangs up on my host with this patch.
Ugh. I am seeing the hang as well digging into it.
Thanks for keeping me honest.
Eric