Greetings!
Preface:
--------
This is patchset v8 to modernize procfs and allow it to support multiple
private instances within the same pid namespace.
This patchset can be applied on top of v5.4-rc7-49-g0e3f1ad80fc8
Procfs modernization:
---------------------
Historically, procfs has always been tied to pid namespaces: during pid
namespace creation we internally create a procfs mount for it. The effect
is that all new procfs mounts are just a mirror of that internal one; any
change, any mount option update, any newly introduced feature propagates
to all other procfs mounts in the same pid namespace.
This may have solved several use cases at the time. Today, however, we
face new requirements, and making procfs able to support private
instances inside the same pid namespace has become a major point. If we
want to introduce new features and security mechanisms, we first have to
make sure that we do not break existing use cases. Supporting private
procfs instances lets us introduce new features and behaviour without
propagating them to all other procfs mounts.
Today procfs is more of a burden, especially for Embedded, IoT, sandbox
and container use cases. In user space we over-mount null or inaccessible
files on top of procfs entries to hide files and information. If we want
to hide pids we have to create pid namespaces, because mount options
propagate to all other proc mounts: changing a mount option value in one
mount changes it in all the others. If we introduce new features, they
will propagate to all other mounts too, resulting either in useful new
functionality or in breakage. We should also note that userspace should
not have to work around procfs; the kernel should just provide a sane,
simple interface.
In this regard, several developers and maintainers have pointed out that
procfs has problems and needs to be modernized:
"Here's another one: split up and modernize /proc." by Andy Lutomirski [1]
Discussion about kernel pointer leaks:
"And yes, as Kees and Daniel mentioned, it's definitely not just dmesg.
In fact, the primary things tend to be /proc and /sys, not dmesg
itself." By Linus Torvalds [2]
Lots of other areas of the kernel and filesystems have been updated to
support private instances; devpts is one major example [3].
These private instances will be used for:
1) Embedded systems and IoT: usually we have one supervisor for apps and
some lightweight sandbox support. If we create pid namespaces, however,
we have to manage all the processes inside them too, whereas our goal is
to run a bunch of apps, each inside its own mount namespace, perhaps with
network namespaces for VLAN setups - but right now we only want mount
namespaces, without all the other complexity. We want procfs to behave
more like a real file system and block access to inodes that belong to
other users. 'hidepid=' does not achieve this today, since it is a shared
mount option.
2) Containers, sandboxes and Private instances of file systems - devpts case
Historically, many filesystems in the Linux kernel, when instantiated,
were just a mirror of an already created and mounted filesystem. This was
the case for devpts; it seems that at the time the requirements were to
optimize things and reuse the same memory, etc. This design used to work,
but it no longer fits today's containers, IoT, hostile environments and
all the privacy challenges that Linux faces.
In that regard, devpts was updated so that each new mount is a totally
independent filesystem, by the following patches:
"devpts: Make each mount of devpts an independent filesystem" by
Eric W. Biederman [3] [4]
3) Linux Security Modules have multiple ptrace paths inside some
subsystems; inside procfs, however, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_access_check()
hook will always run. The 'hidepid' mount option can be used to force the
ptrace_may_access() check inside has_pid_permissions() to run. The
problem is that 'hidepid' is per pid namespace rather than attached to
the mount point, so any remount or modification of 'hidepid' propagates
to all other procfs mounts.
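For reference, the check being discussed looks roughly like this in
fs/proc/base.c before this series (a simplified sketch from the v5.4-era
code; with this series, hide_pid and pid_gid move from the pid namespace
into the per-superblock proc_fs_info):

	static bool has_pid_permissions(struct pid_namespace *pid,
					struct task_struct *task,
					int hide_pid_min)
	{
		/* no restriction, or one weaker than the caller asks about */
		if (pid->hide_pid < hide_pid_min)
			return true;
		/* members of the gid= group are exempt from hiding */
		if (in_group_p(pid->pid_gid))
			return true;
		/* otherwise the LSM-visible ptrace access check must pass */
		return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
	}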
This also makes it hard to support the Yama LSM in desktop and user
sessions. Yama's ptrace scope, which restricts ptrace and some other
syscalls so that they are allowed only on inferiors, can be updated to
have a per-task context, where the context is inherited during fork() and
clone() and preserved across execve(). If we support multiple private
procfs instances, we can force the ptrace_may_access() check on
/proc/<pids>/ to always run inside those new procfs instances. This makes
it possible to specify, per user session, whether procfs should be
populated only with pids that the user can ptrace.
By using Yama ptrace scope, some restricted users will only be able to
see inferiors inside /proc; they won't even be able to see their other
processes. Some software, such as Chromium, Firefox's crash handler and
Wine, already uses Yama to restrict which processes can be ptraced. With
this change it becomes possible to restrict /proc/<pids>/ accordingly,
and, more importantly, it gives desktop users a generic and usable way to
specify which users should see all processes and which should not.
Side notes:
* This covers a gap in seccomp, which cannot dereference pointer
arguments: it is easy to install a seccomp filter on direct syscalls
that operate on pids, but /proc/<pid>/ is a Linux ABI reached through
filesystem syscalls. With this change, all LSMs should be able to analyze
open/read/write/close... on /proc/<pid>/
4) This allows new features to be implemented, in the kernel or in
userspace, without having to worry about procfs.
In containers, sandboxes, etc. we currently rely on workarounds to hide
some /proc inodes; this should be supported natively, without extra
complex work. The kernel should offer sane options that work for today's
and future Linux use cases.
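As an illustration, the userspace workarounds in question are over-mounts
along these lines (the paths here are examples only):

	# hide a sensitive proc file by masking it with /dev/null
	mount --bind /dev/null /proc/kcore
	# mask a whole subtree with an empty read-only tmpfs
	mount -t tmpfs -o ro,nosuid,nodev tmpfs /proc/scsi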
5) Creating a new superblock, carrying all procfs options, for each
procfs mount fixes the silent ignoring of mount options. Currently, a
second mount of procfs in the same pid namespace ignores the mount
options; they are ignored without any error until procfs is remounted.
Before:
# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0
# strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
+++ exited with 0 +++
# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=2 0 0
# mount -o remount,hidepid=1 -t proc proc /tmp/proc
# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=1 0 0
proc /tmp/proc proc rw,relatime,hidepid=1 0 0
After:
# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0
# mount -o hidepid=1 -t proc proc /tmp/proc
# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=1 0 0
Introduced changes:
-------------------
Each mount of procfs creates a separate procfs instance with its own
mount options.
This series adds a few new mount options (see the usage sketch below):
* A new 'hidepid=4' mount option to show only ptraceable processes in
procfs. This supports lightweight sandboxes on Embedded Linux, and also
solves the LSM case: with this mount option we make sure that LSMs have
a ptrace path in procfs.
* A new 'subset=pidfs' mount option that hides non-pid inodes from
procfs. It can be used in containers and sandboxes, as these already try
to hide and block access to such procfs inodes anyway.
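Usage would look roughly like this (a sketch using the option names
introduced above):

	# show only the processes that the accessing user can ptrace
	mount -t proc -o hidepid=4 proc /proc
	# expose only the pid directories, hiding the other procfs inodes
	mount -t proc -o subset=pidfs proc /mnt/proc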
ChangeLog:
----------
# v8:
* Started using the RCU lock to clean up dcache entries, as Linus Torvalds suggested.
# v7:
* 'pidonly=1' renamed to 'subset=pidfs' as Alexey Dobriyan suggested.
* HIDEPID_* moved to uapi/ as they are user interface to mount().
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
# v6:
* 'hidepid=' and 'gid=' mount options are moved from pid namespace to superblock.
* 'newinstance' mount option removed as Eric W. Biederman suggested.
Mount of procfs always creates a new instance.
* 'limit_pids' renamed to 'hidepid=3'.
* I took into account the comment of Linus Torvalds [7].
* Documentation added.
# v5:
* Fixed a bug that caused a problem with the Fedora boot.
* The 'pidonly' option is visible among the mount options.
# v2:
* Renamed mount options to 'newinstance' and 'pids='
Suggested-by: Andy Lutomirski <luto@kernel.org>
* Fixed order of commit, Suggested-by: Andy Lutomirski <luto@kernel.org>
* Many bug fixes.
# v1:
* Removed 'unshared' mount option and replaced it with 'limit_pids'
which is attached to the current procfs mount.
Suggested-by: Andy Lutomirski <luto@kernel.org>
* Do not fill dcache with pid entries that we can not ptrace.
* Many bug fixes.
References:
-----------
[1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
[2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
[3] https://lwn.net/Articles/689539/
[4] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14
[5] https://lkml.org/lkml/2017/5/2/407
[6] https://lkml.org/lkml/2017/5/3/357
[7] https://lkml.org/lkml/2018/5/11/505
Alexey Gladkov (11):
proc: Rename struct proc_fs_info to proc_fs_opts
proc: add proc_fs_info struct to store proc information
proc: move /proc/{self|thread-self} dentries to proc_fs_info
proc: move hide_pid, pid_gid from pid_namespace to proc_fs_info
proc: add helpers to set and get proc hidepid and gid mount options
proc: support mounting procfs instances inside same pid namespace
proc: flush task dcache entries from all procfs instances
proc: instantiate only pids that we can ptrace on 'hidepid=4' mount
option
proc: add option to mount only a pids subset
docs: proc: add documentation for "hidepid=4" and "subset=pidfs"
options and new mount behavior
proc: Move hidepid values to uapi as they are user interface to mount
Documentation/filesystems/proc.txt | 53 +++++++++++
fs/locks.c | 6 +-
fs/proc/base.c | 66 ++++++++++----
fs/proc/generic.c | 9 ++
fs/proc/inode.c | 21 +++--
fs/proc/internal.h | 30 ++++++
fs/proc/root.c | 141 +++++++++++++++++++++++------
fs/proc/self.c | 4 +-
fs/proc/thread_self.c | 6 +-
fs/proc_namespace.c | 14 +--
include/linux/pid_namespace.h | 14 +--
include/linux/proc_fs.h | 25 ++++-
include/uapi/linux/proc_fs.h | 13 +++
13 files changed, 324 insertions(+), 78 deletions(-)
create mode 100644 include/uapi/linux/proc_fs.h
--
2.24.1

On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>
> This allows to flush dcache entries of a task on multiple procfs mounts
> per pid namespace.
>
> The RCU lock is used because the number of reads at the task exit time
> is much larger than the number of procfs mounts.
Ok, this looks better to me than the previous version.
But that may be the "pee-in-the-snow" effect, and I _really_ want
others to take a good look at the whole series.
The right people seem to be cc'd, but this is pretty core, and /proc
has a tendency to cause interesting issues because of how it's
involved in a lot of areas indirectly.
Al, Oleg, Andy, Eric?
Linus
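Concretely, the flush path being discussed walks a per-pidns list of
procfs instances under the RCU read lock; a rough sketch, simplified to a
single pid namespace level, and with the list and field names
(proc_mounts, pidns_entry, m_mnt) assumed rather than taken verbatim from
the patch:

	void proc_flush_task(struct task_struct *task)
	{
		struct proc_fs_info *fs_info;

		rcu_read_lock();
		/* every procfs mount of this pid namespace gets flushed */
		list_for_each_entry_rcu(fs_info,
					&task_active_pid_ns(task)->proc_mounts,
					pidns_entry)
			proc_flush_task_mnt(fs_info->m_mnt,
					    task_pid_nr(task),
					    task_tgid_nr(task));
		rcu_read_unlock();
	}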

On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>
> This is a preparation patch that moves /proc/{self|thread-self} dentries
> to be stored inside procfs fs_info struct instead of making them per pid
> namespace. Since we want to support multiple procfs instances we need to
> make sure that these dentries are also per-superblock instead of
> per-pidns,
The changelog makes perfect sense so far...
> unmounting a private procfs won't clash with other procfs
> mounts.
This doesn't parse as part of the previous sentence. I'm also not
convinced that this really involves unmounting per se. Maybe just
delete these words.

On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>
> This is a cleanup patch to add helpers to set and get proc mount
> options instead of directly using them. This makes it easy to track
> what's happening and easy to update in future.
On a cursory inspection, this looks like it obfuscates the code, and I
don't see where it does something useful later in the series. What is
this abstraction for?
--Andy

On Mon, Feb 10, 2020 at 09:46:26AM -0800, Linus Torvalds wrote:
> On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >
> > This allows to flush dcache entries of a task on multiple procfs mounts
> > per pid namespace.
> >
> > The RCU lock is used because the number of reads at the task exit time
> > is much larger than the number of procfs mounts.
>
> Ok, this looks better to me than the previous version.
>
> But that may be the "pee-in-the-snow" effect, and I _really_ want
> others to take a good look at the whole series.
>
> The right people seem to be cc'd, but this is pretty core, and /proc
> has a tendency to cause interesting issues because of how it's
> involved in a lot of areas indirectly.
>
> Al, Oleg, Andy, Eric?
Will check tonight (ears-deep in sorting out the old branches right now)

On Mon, Feb 10, 2020 at 10:30:45AM -0800, Andy Lutomirski wrote:
> On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >
> > This is a cleanup patch to add helpers to set and get proc mount
> > options instead of directly using them. This makes it easy to track
> > what's happening and easy to update in future.
>
> On a cursory inspection, this looks like it obfuscates the code, and I
> don't see where it does something useful later in the series. What is
> this abstraction for?
To be honest, this part is from Djalal Harouni. For me, these wrappers
exist for a slight increase in readability.
I will not cry if you tell me to remove them :)
--
Rgrds, legion

On Mon, Feb 10, 2020 at 10:23:23AM -0800, Andy Lutomirski wrote:
> On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >
> > This is a preparation patch that moves /proc/{self|thread-self} dentries
> > to be stored inside procfs fs_info struct instead of making them per pid
> > namespace. Since we want to support multiple procfs instances we need to
> > make sure that these dentries are also per-superblock instead of
> > per-pidns,
>
> The changelog makes perfect sense so far...
>
> > unmounting a private procfs won't clash with other procfs
> > mounts.
>
> This doesn't parse as part of the previous sentence. I'm also not
> convinced that this really involves unmounting per se. Maybe just
> delete these words.
Sure. I will remove this part.
--
Rgrds, legion

On Wed, Feb 12, 2020 at 7:01 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Fundamentally proc_flush_task is an optimization. Just getting rid of
> dentries earlier. At least at one point it was an important
> optimization because the old process dentries would just sit around
> doing nothing for anyone.
I'm pretty sure it's still important. It's very easy to generate a
_ton_ of dentries with /proc.
> I wonder if instead of invalidating specific dentries we could instead
> fire wake up a shrinker and point it at one or more instances of proc.
It shouldn't be the dentries themselves that are a freeing problem.
They're being RCU-free'd anyway because of lookup. It's the
proc_mounts list that is the problem, isn't it?
So it's just fs_info that needs to be rcu-delayed because it contains
that list. Or is there something else?
Linus

Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Wed, Feb 12, 2020 at 7:01 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Fundamentally proc_flush_task is an optimization. Just getting rid of
>> dentries earlier. At least at one point it was an important
>> optimization because the old process dentries would just sit around
>> doing nothing for anyone.
>
> I'm pretty sure it's still important. It's very easy to generate a
> _ton_ of dentries with /proc.
>
>> I wonder if instead of invalidating specific dentries we could instead
>> fire wake up a shrinker and point it at one or more instances of proc.
>
> It shouldn't be the dentries themselves that are a freeing problem.
> They're being RCU-free'd anyway because of lookup. It's the
> proc_mounts list that is the problem, isn't it?
>
> So it's just fs_info that needs to be rcu-delayed because it contains
> that list. Or is there something else?
The fundamental dcache thing we are playing with is:
	dentry = d_hash_and_lookup(proc_root, &name);
	if (dentry) {
		d_invalidate(dentry);
		dput(dentry);
	}
As Al pointed out upthread dput and d_invalidate can both sleep.
The dput can potentially go away if we use __d_lookup_rcu instead of
d_lookup.
The challenge is d_invalidate.
It has the fundamentally sleeping detach_mounts loop. Even
shrink_dcache_parent has a cond_resched() in there to ensure it doesn't
livelock the system.
We could, and arguably should, set DCACHE_CANT_MOUNT on the proc pid
dentries, which would prevent having to deal with mounts.
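For reference, setting that flag already has a helper; this is the
existing dont_mount() from include/linux/dcache.h (quoted from memory):

	static inline void dont_mount(struct dentry *dentry)
	{
		spin_lock(&dentry->d_lock);
		dentry->d_flags |= DCACHE_CANT_MOUNT;
		spin_unlock(&dentry->d_lock);
	}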
But I don't see an easy way of getting shrink_dcache_parent to run
without sleeping. Ideas?
Eric

On Wed, Feb 12, 2020 at 10:45:06AM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2020 at 7:01 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > Fundamentally proc_flush_task is an optimization. Just getting rid of
> > dentries earlier. At least at one point it was an important
> > optimization because the old process dentries would just sit around
> > doing nothing for anyone.
>
> I'm pretty sure it's still important. It's very easy to generate a
> _ton_ of dentries with /proc.
>
> > I wonder if instead of invalidating specific dentries we could instead
> > fire wake up a shrinker and point it at one or more instances of proc.
>
> It shouldn't be the dentries themselves that are a freeing problem.
> They're being RCU-free'd anyway because of lookup. It's the
> proc_mounts list that is the problem, isn't it?
>
> So it's just fs_info that needs to be rcu-delayed because it contains
> that list. Or is there something else?
Large part of the headache is the possibility that some joker has
done something like mounting tmpfs on /proc/<pid>/map_files, or
binding /dev/null on top of /proc/<pid>/syscall, etc.
IOW, that d_invalidate() can very well have to grab namespace_sem.
And possibly do a full-blown fs shutdown of something NFS-mounted,
etc...

On Wed, Feb 12, 2020 at 11:18 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> > So it's just fs_info that needs to be rcu-delayed because it contains
> > that list. Or is there something else?
>
> The fundamental dcache thing we are playing with is:
>
> 	dentry = d_hash_and_lookup(proc_root, &name);
> 	if (dentry) {
> 		d_invalidate(dentry);
> 		dput(dentry);
> 	}
Ahh. And we can't do that part under the RCU read lock. So it's not
the freeing, it's the list traversal itself.
Fair enough.
Hmm.
I wonder if we could split up d_invalidate(). It already ends up being
two phases: first the unhashing under the d_lock, and then the
recursive shrinking of parents and children.
The recursive shrinking of the parent isn't actually interesting for
the proc shrinking case: we just looked up one child, after all. So we
only care about the d_walk of the children.
So if we only did the first part under the RCU lock, and just
collected the dentries (can we perhaps then re-use the hash list to
collect them to another list?) and then did the child d_walk
afterwards?
Linus

On Wed, Feb 12, 2020 at 11:49:58AM -0800, Linus Torvalds wrote:
> I wonder if we could split up d_invalidate(). It already ends up being
> two phases: first the unhashing under the d_lock, and then the
> recursive shrinking of parents and children.
>
> The recursive shrinking of the parent isn't actually interesting for
> the proc shrinking case: we just looked up one child, after all. So we
> only care about the d_walk of the children.
>
> So if we only did the first part under the RCU lock, and just
> collected the dentries (can we perhaps then re-use the hash list to
> collect them to another list?) and then did the child d_walk
> afterwards?
What's to prevent racing with fs shutdown while you are doing the second part?
We could, after all, just have them[*] on procfs-private list (anchored in
task_struct) from the very beginning; evict on ->d_prune(), walk the list
on exit... How do you make sure the fs instance won't go away right under
you while you are doing the real work? Suppose you are looking at one
of those dentries and you've found something blocking to do. You can't
pin that dentry; you can pin ->s_active on its superblock (if it's already
zero, you can skip it - fs shutdown already in progress will take care of
the damn thing), but that will lead to quite a bit of cacheline pingpong...
[*] only /proc/<pid> and /proc/*/task/<pid> dentries, obviously.

On Wed, Feb 12, 2020 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What's to prevent racing with fs shutdown while you are doing the second part?
I was thinking that only the proc_flush_task() code would do this.
And that holds a ref to the vfsmount through upid->ns.
So I wasn't suggesting doing this in general - just splitting up the
implementation of d_invalidate() so that proc_flush_task_mnt() could
delay the complex part to after having traversed the RCU-protected
list.
But hey - I missed this part of the problem originally, so maybe I'm
just missing something else this time. Wouldn't be the first time.
Linus

On Wed, Feb 12, 2020 at 12:35:04PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2020 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What's to prevent racing with fs shutdown while you are doing the second part?
>
> I was thinking that only the proc_flush_task() code would do this.
>
> And that holds a ref to the vfsmount through upid->ns.
>
> So I wasn't suggesting doing this in general - just splitting up the
> implementation of d_invalidate() so that proc_flush_task_mnt() could
> delay the complex part to after having traversed the RCU-protected
> list.
>
> But hey - I missed this part of the problem originally, so maybe I'm
> just missing something else this time. Wouldn't be the first time.
Wait, I thought the whole point of that had been to allow multiple
procfs instances for the same userns? Confused...

On Wed, Feb 12, 2020 at 08:38:33PM +0000, Al Viro wrote:
> On Wed, Feb 12, 2020 at 12:35:04PM -0800, Linus Torvalds wrote:
> > On Wed, Feb 12, 2020 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > What's to prevent racing with fs shutdown while you are doing the second part?
> >
> > I was thinking that only the proc_flush_task() code would do this.
> >
> > And that holds a ref to the vfsmount through upid->ns.
> >
> > So I wasn't suggesting doing this in general - just splitting up the
> > implementation of d_invalidate() so that proc_flush_task_mnt() could
> > delay the complex part to after having traversed the RCU-protected
> > list.
> >
> > But hey - I missed this part of the problem originally, so maybe I'm
> > just missing something else this time. Wouldn't be the first time.
>
> Wait, I thought the whole point of that had been to allow multiple
> procfs instances for the same userns? Confused...
s/userns/pidns/, sorry

On Wed, Feb 12, 2020 at 12:41 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Feb 12, 2020 at 08:38:33PM +0000, Al Viro wrote:
> >
> > Wait, I thought the whole point of that had been to allow multiple
> > procfs instances for the same userns? Confused...
>
> s/userns/pidns/, sorry
Right, but we still hold the ref to it here...
[ Looks more ]
Oooh. No we don't. Exactly because we don't hold the lock, only the
rcu lifetime, the ref can go away from under us. I see what your
concern is.
Ouch, this is more painful than I expected - the code flow looked so
simple. I really wanted to avoid a new lock during process shutdown,
because that has always been somewhat painful.
Linus

Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Wed, Feb 12, 2020 at 12:41 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> On Wed, Feb 12, 2020 at 08:38:33PM +0000, Al Viro wrote:
>> >
>> > Wait, I thought the whole point of that had been to allow multiple
>> > procfs instances for the same userns? Confused...
>>
>> s/userns/pidns/, sorry
>
> Right, but we still hold the ref to it here...
>
> [ Looks more ]
>
> Oooh. No we don't. Exactly because we don't hold the lock, only the
> rcu lifetime, the ref can go away from under us. I see what your
> concern is.
>
> Ouch, this is more painful than I expected - the code flow looked so
> simple. I really wanted to avoid a new lock during process shutdown,
> because that has always been somewhat painful.
The good news is proc_flush_task isn't exactly called from process exit.
proc_flush_task is called during zombie clean up. AKA release_task.
So proc_flush_task isn't called with any locks held, and it is
called in a context where it can sleep.
Further, after proc_flush_task does its thing, the code goes
and does "write_lock_irq(&tasklist_lock);"
So the code is definitely serialized to one processor already.
What would be the downside of having a mutex for a list of proc superblocks?
A mutex that is taken for both reading and writing the list.
Eric
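A minimal sketch of the idea Eric floats here (all names are invented for
illustration; the real list would presumably be per pid namespace):

	static LIST_HEAD(proc_super_list);
	static DEFINE_MUTEX(proc_super_mutex);	/* taken for reads and writes */

	/* mount side: called from fill_super; kill_sb does the list_del */
	void proc_super_add(struct proc_fs_info *fs_info)
	{
		mutex_lock(&proc_super_mutex);
		list_add(&fs_info->node, &proc_super_list);
		mutex_unlock(&proc_super_mutex);
	}

	/* exit side: release_task() would walk the list under the same mutex */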

On Wed, Feb 12, 2020 at 1:48 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The good news is proc_flush_task isn't exactly called from process exit.
> proc_flush_task is called during zombie clean up. AKA release_task.
Yeah, that at least avoids some of the nasty locking while dying debug problems.
But the one I was more worried about was actually the lock contention
issue with lots of processes. The lock is basically a single global
lock in many situations - yes, it's technically per-ns, but in a lot
of cases you really only have one namespace anyway.
And we've had problems with global locks in this area before, notably
the one you call out:
> Further, after proc_flush_task does its thing, the code goes
> and does "write_lock_irq(&tasklist_lock);"
Yeah, so it's not introducing a new issue, but it is potentially
making something we already know is bad even worse.
> What would be the downside of having a mutex for a list of proc superblocks?
> A mutex that is taken for both reading and writing the list.
That's what the original patch actually was, and I was hoping we could
avoid that thing.
An rwsem would be possibly better, since most cases by far are likely
about reading.
And yes, I'm very aware of the tasklist_lock, but it's literally why
I don't want to make a new one.
I'm _hoping_ we can some day come up with something better than tasklist_lock.
Linus

Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Wed, Feb 12, 2020 at 1:48 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The good news is proc_flush_task isn't exactly called from process exit.
>> proc_flush_task is called during zombie clean up. AKA release_task.
>
> Yeah, that at least avoids some of the nasty locking while dying debug problems.
>
> But the one I was more worried about was actually the lock contention
> issue with lots of processes. The lock is basically a single global
> lock in many situations - yes, it's technically per-ns, but in a lot
> of cases you really only have one namespace anyway.
>
> And we've had problems with global locks in this area before, notably
> the one you call out:
>
>> Further, after proc_flush_task does its thing, the code goes
>> and does "write_lock_irq(&tasklist_lock);"
>
> Yeah, so it's not introducing a new issue, but it is potentially
> making something we already know is bad even worse.
>
>> What would be the downside of having a mutex for a list of proc superblocks?
>> A mutex that is taken for both reading and writing the list.
>
> That's what the original patch actually was, and I was hoping we could
> avoid that thing.
>
> An rwsem would be possibly better, since most cases by far are likely
> about reading.
>
> And yes, I'm very aware of the tasklist_lock, but it's literally why
> I don't want to make a new one.
>
> I'm _hoping_ we can some day come up with something better than
> tasklist_lock.
Yes. I understand that.
I occasionally play with ideas, and have converted all of proc to RCU
to help with the situation, but I haven't come up with anything clearly
better.
All of this is why I was really hoping we could have a change in
strategy and see if we can make the shrinker be able to better prune
proc inodes.
I think I have an alternate idea that could work. Add some extra code
into proc_task_readdir, that would look for dentries that no longer
point to tasks and d_invalidate them. With the same logic probably
being called from a few more places as well like proc_pid_readdir,
proc_task_lookup, and proc_pid_lookup.
We could even optimize it and have a process died flag we set in the
superblock.
That would batch up the freeing work until the next time someone
reads from proc in a way that would create more dentries. So it would
prevent dentries from reaped zombies from growing without bound.
Hmm. Given the existence of proc_fill_cache it would really be a good
idea if readdir and lookup performed some of the freeing work as well.
As on readdir we always populate the dcache for all of the directory
entries.
I am booked solid for the next little while but if no one beats me to it
I will try and code something like that up where at least readdir
looks for and invalidates stale dentries.
Eric

On Wed, Feb 12, 2020 at 10:37:52PM -0600, Eric W. Biederman wrote:
> I think I have an alternate idea that could work. Add some extra code
> into proc_task_readdir, that would look for dentries that no longer
> point to tasks and d_invalidate them. With the same logic probably
> being called from a few more places as well like proc_pid_readdir,
> proc_task_lookup, and proc_pid_lookup.
>
> We could even optimize it and have a process died flag we set in the
> superblock.
>
> That would batch up the freeing work until the next time someone
> reads from proc in a way that would create more dentries. So it would
> prevent dentries from reaped zombies from growing without bound.
>
> Hmm. Given the existence of proc_fill_cache it would really be a good
> idea if readdir and lookup performed some of the freeing work as well.
> As on readdir we always populate the dcache for all of the directory
> entries.
First of all, that won't do a damn thing when nobody is accessing
given superblock. What's more, readdir in root of that procfs instance
is not enough - you need it in task/ of group leader.
What I don't understand is the insistence on getting those dentries
via dcache lookups. _IF_ we are willing to live with cacheline
contention (on ->d_lock of root dentry, if nothing else), why not
do the following:
* put all dentries of such directories ([0-9]* and [0-9]*/task/*)
into a list anchored in task_struct; have non-counting reference to
task_struct stored in them (might simplify part of get_proc_task() users,
BTW - avoids pid-to-task_struct lookups if we have a dentry and not just
the inode; many callers do)
* have ->d_release() remove from it (protecting per-task_struct lock
nested outside of all ->d_lock)
* on exit:
	lock the (per-task_struct) list
	while list is non-empty
		pick the first dentry
		remove from the list
		sb = dentry->d_sb
		try to bump sb->s_active (if non-zero, that is).
		if failed
			continue // move on to the next one - nothing to do here
		grab ->d_lock
		res = handle_it(dentry, &temp_list)
		drop ->d_lock
		unlock the list
		if (!list_empty(&temp_list))
			shrink_dentry_list(&temp_list)
		if (res)
			d_invalidate(dentry)
		dput(dentry)
		deactivate_super(sb)
		lock the list
	unlock the list

handle_it(dentry, temp_list) // ->d_lock held; that one should be in dcache.c
	if ->d_count is negative // unlikely
		return 0;
	if ->d_count is positive,
		increment ->d_count
		return 1;
	// OK, it's still alive, but ->d_count is 0
	__d_drop // equivalent of d_invalidate in this case
	if not on a shrink list // otherwise it's not our headache
		if on lru list
			d_lru_del
		d_shrink_add dentry to temp_list
	return 0;
And yeah, that'll dirty ->s_active for each procfs superblock that
has dentry for our process present in dcache. On exit()...

On Wed, Feb 12, 2020 at 9:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What I don't understand is the insistence on getting those dentries
> via dcache lookups.
I don't think that's an "insistence", it's more of a "historical
behavior" together with "several changes over the years to deal with
dentry-level cleanups and updates".
> _IF_ we are willing to live with cacheline
> contention (on ->d_lock of root dentry, if nothing else), why not
> do the following:
> * put all dentries of such directories ([0-9]* and [0-9]*/task/*)
> into a list anchored in task_struct; have non-counting reference to
> task_struct stored in them (might simplify part of get_proc_task() users,
Hmm.
Right now I don't think we actually create any dentries at all for the
short-lived process case.
Wouldn't your suggestion make fork/exit rather worse?
Or would you create the dentries dynamically still at lookup time, and
then attach them to the process at that point?
What list would you use for the dentry chaining? Would you play games
with the dentry hashing, and "hash" them off the process, and never
hit in the lookup cache?
Am I misunderstanding what you suggest?
Linus

On Thu, Feb 13, 2020 at 01:30:11PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2020 at 9:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What I don't understand is the insistence on getting those dentries
> > via dcache lookups.
>
> I don't think that's an "insistence", it's more of a "historical
> behavior" together with "several changes over the years to deal with
> dentry-level cleanups and updates".
>
> > _IF_ we are willing to live with cacheline
> > contention (on ->d_lock of root dentry, if nothing else), why not
> > do the following:
> > * put all dentries of such directories ([0-9]* and [0-9]*/task/*)
> > into a list anchored in task_struct; have non-counting reference to
> > task_struct stored in them (might simplify part of get_proc_task() users,
>
> Hmm.
>
> Right now I don't think we actually create any dentries at all for the
> short-lived process case.
>
> Wouldn't your suggestion make fork/exit rather worse?
>
> Or would you create the dentries dynamically still at lookup time, and
> then attach them to the process at that point?
>
> What list would you use for the dentry chaining? Would you play games
> with the dentry hashing, and "hash" them off the process, and never
> hit in the lookup cache?
I'd been thinking of ->d_fsdata pointing to a structure with list_head
and a (non-counting) task_struct pointer for those guys. Allocated
on lookup, of course (as well as readdir ;-/) and put on the list
at the same time.
IOW, for short-lived process we simply have an empty (h)list anchored
in task_struct and that's it.
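In other words, something like the following (a sketch of the structure
Al describes; the names are invented):

	struct proc_pid_tag {
		struct hlist_node node;		/* chained on a list in task_struct */
		struct task_struct *task;	/* non-counting reference */
	};

	/* lookup (and readdir) allocate one and chain it on the task's list;
	 * ->d_release() unchains it under a per-task lock nested outside
	 * ->d_lock; exit walks the list as outlined earlier. */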

On Thu, Feb 13, 2020 at 2:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I'd been thinking of ->d_fsdata pointing to a structure with list_head
> and a (non-counting) task_struct pointer for those guys. Allocated
> on lookup, of course (as well as readdir ;-/) and put on the list
> at the same time.
Hmm. That smells like potentially a lot of small allocations, and
making readdir() even nastier.
Do we really want to create the dentries at readdir time? We do now
(with proc_fill_cache()) but do we actually _need_ to?
I guess a lot of readdir users end up doing a stat on it immediately
afterwards. I think right now we do it to get the inode number, and
maybe that is a basic requirement (even if I don't think it's really
stable - an inode could be evicted and then the ino changes, no?)
Ho humm. This all doesn't make me happy. But I guess the proof is in
the pudding - and if you come up with a good patch, I won't complain.
Linus

Al Viro <viro@zeniv.linux.org.uk> writes:
> On Wed, Feb 12, 2020 at 10:37:52PM -0600, Eric W. Biederman wrote:
>
>> I think I have an alternate idea that could work. Add some extra code
>> into proc_task_readdir, that would look for dentries that no longer
>> point to tasks and d_invalidate them. With the same logic probably
>> being called from a few more places as well like proc_pid_readdir,
>> proc_task_lookup, and proc_pid_lookup.
>>
>> We could even optimize it and have a process died flag we set in the
>> superblock.
>>
>> That would batch up the freeing work until the next time someone
>> reads from proc in a way that would create more dentries. So it would
>> prevent dentries from reaped zombies from growing without bound.
>>
>> Hmm. Given the existence of proc_fill_cache it would really be a good
>> idea if readdir and lookup performed some of the freeing work as well.
>> As on readdir we always populate the dcache for all of the directory
>> entries.
>
> First of all, that won't do a damn thing when nobody is accessing
> given superblock. What's more, readdir in root of that procfs instance
> is not enough - you need it in task/ of group leader.
It should give a rough bound on the number of stale dentries a
superblock can have. The same basic concept has been used very
successfully in many incremental garbage collectors: malloc (or the
equivalent) does a finite amount of garbage-collection work to roughly
balance out the amount of memory allocated. I am proposing something
similar for proc instances.
Further if no one is accessing a superblock we don't have a problem
either.
> What I don't understand is the insistence on getting those dentries
> via dcache lookups. _IF_ we are willing to live with cacheline
> contention (on ->d_lock of root dentry, if nothing else), why not
> do the following:
No insistence from this side.
I was not seeing atomic_inc_not_zero(sb->s_active) from RCU
context as an option earlier. But it is an option.
> * put all dentries of such directories ([0-9]* and [0-9]*/task/*)
> into a list anchored in task_struct; have non-counting reference to
> task_struct stored in them (might simplify part of get_proc_task() users,
> BTW - avoids pid-to-task_struct lookups if we have a dentry and not just
> the inode; many callers do)
> * have ->d_release() remove from it (protecting per-task_struct lock
> nested outside of all ->d_lock)
> * on exit:
> 	lock the (per-task_struct) list
> 	while list is non-empty
> 		pick the first dentry
> 		remove from the list
> 		sb = dentry->d_sb
> 		try to bump sb->s_active (if non-zero, that is).
> 		if failed
> 			continue // move on to the next one - nothing to do here
> 		grab ->d_lock
> 		res = handle_it(dentry, &temp_list)
> 		drop ->d_lock
> 		unlock the list
> 		if (!list_empty(&temp_list))
> 			shrink_dentry_list(&temp_list)
> 		if (res)
> 			d_invalidate(dentry)
> 		dput(dentry)
> 		deactivate_super(sb)
> 		lock the list
> 	unlock the list
>
> handle_it(dentry, temp_list) // ->d_lock held; that one should be in dcache.c
> 	if ->d_count is negative // unlikely
> 		return 0;
> 	if ->d_count is positive,
> 		increment ->d_count
> 		return 1;
> 	// OK, it's still alive, but ->d_count is 0
> 	__d_drop // equivalent of d_invalidate in this case
> 	if not on a shrink list // otherwise it's not our headache
> 		if on lru list
> 			d_lru_del
> 		d_shrink_add dentry to temp_list
> 	return 0;
>
> And yeah, that'll dirty ->s_active for each procfs superblock that
> has dentry for our process present in dcache. On exit()...
I would thread the whole thing through the proc_inode instead of coming
up with a new allocation per dentry, so an extra memory allocation isn't
needed. We already have i_dentry, so going from the vfs inode to
the dentry is trivial.
But truthfully I don't like proc_flush_task.
The problem is that proc_flush_task is a layering violation and magic
code that pretty much no one understands. We have some very weird
cases where dput or d_invalidate wound up triggering ext3 code. It has
been fixed for a long time now, but it wasy crazy weird unexpected
stuff.
Al, your logic above just feels very clever, requiring one piece of the
kernel to know how other pieces of the kernel work. If we can find
something stupid and simple that also solves the problem I would be much
happier. Then anyone could understand and fix it if something goes
wrong.
Eric

Al Viro <viro@zeniv.linux.org.uk> writes:
> On Wed, Feb 12, 2020 at 12:35:04PM -0800, Linus Torvalds wrote:
>> On Wed, Feb 12, 2020 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>> >
>> > What's to prevent racing with fs shutdown while you are doing the second part?
>>
>> I was thinking that only the proc_flush_task() code would do this.
>>
>> And that holds a ref to the vfsmount through upid->ns.
>>
>> So I wasn't suggesting doing this in general - just splitting up the
>> implementation of d_invalidate() so that proc_flush_task_mnt() could
>> delay the complex part to after having traversed the RCU-protected
>> list.
>>
>> But hey - I missed this part of the problem originally, so maybe I'm
>> just missing something else this time. Wouldn't be the first time.
>
> Wait, I thought the whole point of that had been to allow multiple
> procfs instances for the same userns? Confused...
Multiple procfs instances for the same pidns. Exactly.
Which would let people have their own set of procfs mount
options without having to worry about stomping on someone else.
The fundamental problem with multiple procfs instances per pidns
is that there isn't an obvious place to put a vfsmount.
...
Which means we need some way to keep the file system from going away
while anyone in the kernel is running proc_flush_task.
One way I can see to solve this, which would give us cheap readers, is
to have a percpu count of the number of processes in proc_flush_task.
That would work something like mnt_count.
Then forbid proc_kill_sb from removing any super block from the list
or otherwise making progress until the proc_flush_task_count goes
to zero.
If we wanted cheap readers and an expensive writer, we could instead use
some kind of flag that proc_kill_sb checks.
Thinking out loud: perhaps we could add a list_head on task_struct
and a list_head in proc_inode. That would let us find the inodes,
and by extension the dentries, we care about quickly.
Then in evict_inode we could remove the proc_inode from the list.
Eric
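As a sketch, that bookkeeping would look something like this (the series
that eventually emerged anchors the list in struct pid rather than
task_struct; the field names below follow that version):

	struct pid {
		/* ... */
		struct hlist_head inodes;		/* proc inodes for this pid */
	};

	struct proc_inode {
		/* ... */
		struct hlist_node sibling_inodes;	/* chained on pid->inodes */
	};

	/* release_task(): walk pid->inodes and invalidate the dentry aliases;
	 * ->evict_inode(): hlist_del_init_rcu(&PROC_I(inode)->sibling_inodes) */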

Linus Torvalds <torvalds@linux-foundation.org> writes:
> I guess a lot of readdir users end up doing a stat on it immediately
> afterwards. I think right now we do it to get the inode number, and
> maybe that is a basic requirement (even if I don't think it's really
> stable - an inode could be evicted and then the ino changes, no?)
All I know is proc_fill_cache seemed like a good idea at the time.
I may have been too clever.
While I think proc_fill_cache probably exacerbates the issue, it isn't
the reason we have the flushing logic. The proc flushing logic was
introduced around 2.5.9, much earlier than the other proc things.
commit 0030633355db2bba32d97655df73b04215018ab9
Author: Alexander Viro <viro@math.psu.edu>
Date: Sun Apr 21 23:03:37 2002 -0700
[PATCH] (3/5) sane procfs/dcache interaction

- sane dentry retention. Namely, we don't kill /proc/<pid> dentries at the
  first opportunity (as the current tree does). Instead we do the following:
	* ->d_delete() kills it only if process is already dead.
	* all ->lookup() in proc/base.c end with checking if process is still
	  alive and unhash if it isn't.
	* proc_pid_lookup() (lookup for /proc/<pid>) caches reference to dentry
	  in task_struct. It's _not_ counted in ->d_count.
	* ->d_iput() resets said reference to NULL.
	* release_task() (burying a zombie) checks if there is a cached
	  reference and if there is - shrinks the subtree.
	* tasklist_lock is used for exclusion.
  That way we are guaranteed that after release_task() all dentries in
  /proc/<pid> will go away as soon as possible; OTOH, before release_task()
  we have normal retention policy - they go away under memory pressure with
  the same rules as for dentries on any other fs.
Tracking down when this logic was introduced I also see that this code
has broken again and again any time proc changes (like now). So it is
definitely subtle and fragile.
Eric

Just because it is less of a fundamental change and needs less testing,
I went and looked at updating proc_flush_task to use a list, as Al
suggested.
If we can stand an sget/deactivate_super pair for every dentry we want
to invalidate I think I have something.
Comments from anyone will be appreciated. I gave this some light
testing, and the code is based on something similar already present in
proc, so I think there is a high chance this code is correct - but I
could easily be wrong.
Linus, does this approach look like something you can stand?
Eric
Eric W. Biederman (7):
proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
proc: Mov rcu_read_(lock|unlock) in proc_prune_siblings_dcache
proc: Use d_invalidate in proc_prune_siblings_dcache
proc: Clear the pieces of proc_inode that proc_evict_inode cares about
proc: Use a list of inodes to flush from proc
proc: Ensure we see the exit of each process tid exactly once
fs/exec.c | 5 +--
fs/proc/base.c | 111 ++++++++++++++++--------------------------------
fs/proc/inode.c | 60 +++++++++++++++++++++++---
fs/proc/internal.h | 4 +-
fs/proc/proc_sysctl.c | 45 +++-----------------
include/linux/pid.h | 2 +
include/linux/proc_fs.h | 4 +-
kernel/exit.c | 4 +-
kernel/pid.c | 16 +++++++
9 files changed, 124 insertions(+), 127 deletions(-)

On Thu, Feb 20, 2020 at 12:51 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Don't make it look like rcu_read_lock is held over the entire loop
> instead just take the rcu_read_lock over the part of the loop that
> matters. This makes the intent of the code a little clearer.
No, this is horrid.
Maybe it makes the intent clearer, but it also causes that "continue"
case to unlock and relock immediately.
And maybe that case never triggers, and that's ok. But then it needs a
big comment about it.
Linus

On Thu, Feb 20, 2020 at 12:51 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> To use d_invalidate replace d_prune_aliases with d_find_alias
> followed by d_invalidate and dput. This is safe and complete
> because no inode in proc has any hardlinks or aliases.
Are you sure you can't create them some way? This makes me go "what if
we had multiple dentries associated with that inode?" Then the code
would just invalidate the first one.
I guess we don't have export_operations or anything like that, but
this makes me worry...
Linus

On Thu, Feb 20, 2020 at 02:49:53PM -0600, Eric W. Biederman wrote:
>
> The function d_prune_aliases has the problem that it will only prune
> aliases that are completely unused. It will not remove aliases for
> the dcache or even think of removing mounts from the dcache. For that
> behavior d_invalidate is needed.
>
> To use d_invalidate replace d_prune_aliases with d_find_alias
> followed by d_invalidate and dput. This is safe and complete
> because no inode in proc has any hardlinks or aliases.
s/no inode.*/it's a fucking directory inode./

On Thu, Feb 20, 2020 at 12:48 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Linus, does this approach look like something you can stand?
A couple of worries, although one of them seems to have already been
resolved by Al.
I think the real gatekeeper should be Al in general. But other than
the small comments I had, I think this might work just fine.
Al?
Linus

On Thu, Feb 20, 2020 at 10:54:20PM +0000, Al Viro wrote:
> On Thu, Feb 20, 2020 at 02:49:53PM -0600, Eric W. Biederman wrote:
> >
> > The function d_prune_aliases has the problem that it will only prune
> > aliases that are completely unused. It will not remove aliases for
> > the dcache or even think of removing mounts from the dcache. For that
> > behavior d_invalidate is needed.
> >
> > To use d_invalidate replace d_prune_aliases with d_find_alias
> > followed by d_invalidate and dput. This is safe and complete
> > because no inode in proc has any hardlinks or aliases.
>
> s/no inode.*/it's a fucking directory inode./
Wait... You are using it for sysctls as well? Ho-hum... The thing is,
for sysctls you are likely to run into consecutive entries with the
same superblock, making for a big pile of useless playing with
->s_active... And yes, that applies to mainline as well.

On Thu, Feb 20, 2020 at 03:02:22PM -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2020 at 12:48 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > Linus, does this approach look like something you can stand?
>
> A couple of worries, although one of them seems to have already been
> resolved by Al.
>
> I think the real gatekeeper should be Al in general. But other than
> the small comments I had, I think this might work just fine.
>
> Al?
I'll need to finish RTFS there; I have initially misread that patch,
actually - Eric _is_ using that thing both for those directories
and for sysctl inodes. And the prototype for that machinery (the
one he'd pulled from proc_sysctl.c) is playing with pinning superblocks
way too much; for per-pid directories that's not an issue, but
for sysctl table removal you are very likely to hit a bunch of
evictees on the same superblock...

Al Viro <viro@zeniv.linux.org.uk> writes:
> On Thu, Feb 20, 2020 at 03:02:22PM -0800, Linus Torvalds wrote:
>> On Thu, Feb 20, 2020 at 12:48 PM Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>> >
>> > Linus, does this approach look like something you can stand?
>>
>> A couple of worries, although one of them seems to have already been
>> resolved by Al.
>>
>> I think the real gatekeeper should be Al in general. But other than
>> the small comments I had, I think this might work just fine.
>>
>> Al?
>
> I'll need to finish RTFS there; I have initially misread that patch,
> actually - Eric _is_ using that thing both for those directories
> and for sysctl inodes. And the prototype for that machinery (the
> one he'd pulled from proc_sysctl.c) is playing with pinning superblocks
> way too much; for per-pid directories that's not an issue, but
> for sysctl table removal you are very likely to hit a bunch of
> evictees on the same superblock...
I saw that that was possible. If the broad strokes look correct, I
don't have a problem at all with optimizing for the case where many of
the entries are for inodes on the same superblock. I just had enough
other details on my mind that I was afraid that if I got a little more
clever I would introduce a typo somewhere.
I wish I could limit the sysctl parts to just directories, but
unfortunately the sysctl tables don't always guarantee that what is
removed will be a directory. But sysctls do have a one-name-per-inode
invariant, like fat. There is no way to express a sysctl table that
doesn't have that invariant.
As for d_find_alias/d_invalidate.
Just for completeness I wanted to write a loop:
	while (dentry = d_find_alias(inode)) {
		d_invalidate(dentry);
		dput(dentry);
	}
Unfortunately that breaks on directories, because for directories
d_find_alias turns into d_find_any_alias, and continues to return aliases
even when they are unhashed.
It might be nice to write a cousin of d_prune_aliases - call it
d_invalidate_aliases - that just does that loop the correct way, in
dcache.c; a rough sketch follows below.
Eric
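Such a helper might look roughly like this (a sketch only; it restarts
after each invalidation and skips unhashed aliases, sidestepping the
d_find_any_alias problem described above):

	void d_invalidate_aliases(struct inode *inode)
	{
		struct dentry *dentry;
	restart:
		spin_lock(&inode->i_lock);
		hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
			spin_lock(&dentry->d_lock);
			if (!d_unhashed(dentry)) {
				dget_dlock(dentry);
				spin_unlock(&dentry->d_lock);
				spin_unlock(&inode->i_lock);
				d_invalidate(dentry);
				dput(dentry);
				goto restart;
			}
			spin_unlock(&dentry->d_lock);
		}
		spin_unlock(&inode->i_lock);
	}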

Al Viro <viro@zeniv.linux.org.uk> writes:
> On Thu, Feb 20, 2020 at 10:54:20PM +0000, Al Viro wrote:
>> On Thu, Feb 20, 2020 at 02:49:53PM -0600, Eric W. Biederman wrote:
>> >
>> > The function d_prune_aliases has the problem that it will only prune
>> > aliases that are completely unused. It will not remove aliases for
>> > the dcache or even think of removing mounts from the dcache. For that
>> > behavior d_invalidate is needed.
>> >
>> > To use d_invalidate replace d_prune_aliases with d_find_alias
>> > followed by d_invalidate and dput. This is safe and complete
>> > because no inode in proc has any hardlinks or aliases.
>>
>> s/no inode.*/it's a fucking directory inode./
>
> Wait... You are using it for sysctls as well? Ho-hum... The thing is,
> for sysctls you are likely to run into consecutive entries with the
> same superblock, making for a big pile of useless playing with
> ->s_active... And yes, that applies to mainline as well.
Which is why I worked to merge the two cases since they were so close.
Fewer things to fix and more eyeballs on the code.
Eric

From: Oleg Nesterov <oleg@redhat.com>
Subject: Re: [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once
Date: 2020-02-21T16:50:53Z

Oleg Nesterov <oleg@redhat.com> writes:
> On 02/20, Eric W. Biederman wrote:
>>
>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> +	/* pid_links[PIDTYPE_PID].next is always NULL */
>> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
>> +	struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(ntask->thread_pid, opid);
>> +	rcu_assign_pointer(otask->thread_pid, npid);
>
> this breaks has_group_leader_pid()...
>
> proc_pid_readdir() can miss a process doing mt-exec but this looks fixable,
> just we need to update ntask->thread_pid before updating ->first.
>
> The more problematic case is __exit_signal() which does
>
> if (unlikely(has_group_leader_pid(tsk)))
> posix_cpu_timers_exit_group(tsk);
Along with the comment:
	/*
	 * This can only happen if the caller is de_thread().
	 * FIXME: this is the temporary hack, we should teach
	 * posix-cpu-timers to handle this case correctly.
	 */
So I suspect this is fixable and the above fix might be part of that.
Hmm looking at your commit:
commit e0a70217107e6f9844628120412cb27bb4cea194
Author: Oleg Nesterov <oleg@redhat.com>
Date: Fri Nov 5 16:53:42 2010 +0100
posix-cpu-timers: workaround to suppress the problems with mt exec
posix-cpu-timers.c correctly assumes that the dying process does
posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD
timers from signal->cpu_timers list.
But, it also assumes that timer->it.cpu.task is always the group
leader, and thus the dead ->task means the dead thread group.
This is obviously not true after de_thread() changes the leader.
After that almost every posix_cpu_timer_ method has problems.
It is not simple to fix this bug correctly. First of all, I think
that timer->it.cpu should use struct pid instead of task_struct.
Also, the locking should be reworked completely. In particular,
tasklist_lock should not be used at all. This all needs a lot of
nontrivial and hard-to-test changes.
Change __exit_signal() to do posix_cpu_timers_exit_group() when
the old leader dies during exec. This is not the fix, just the
temporary hack to hide the problem for 2.6.37 and stable. IOW,
this is obviously wrong but this is what we currently have anyway:
cpu timers do not work after mt exec.
In theory this change adds another race. The exiting leader can
detach the timers which were attached to the new leader. However,
the window between de_thread() and release_task() is small, we
can pretend that sys_timer_create() was called before de_thread().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It looks like the data structures need fixing - possibly to use struct
pid, possibly to move the group data to the signal struct. I think I
played with some of that a while ago.
I am going to move this change to another patchset, so I don't wind up
playing shift-the-bug-around. I thought I would need this to get the
other code working, but it turns out we remain bug-compatible without
it.
Hopefully I can get something out in the next week or so that addresses
the issues you have pointed out.
Eric

I have addressed all of the review comments as I understand them, and
fixed the small oversight the kernel test robot was able to find (I had
failed to initialize the new field pid->inodes).
I did not hear any concerns from the 10,000-foot level last time, so I
am assuming this set of changes (barring bugs) is good to go.
Unless some new issues appear my plan is to put this in my tree
and get this into linux-next. Which will give Alexey something
to build his changes on.
I tested this set of changes by running:
(while ls -1 -f /proc > /dev/null ; do :; done ) &
And monitoring the amount of free memory.
With the flushing disabled I saw the used memory in the system grow by
20M before the shrinker would bring it back down to where it started.
With the patch applied I saw the memory usage stay essentially fixed.
So flushing definitely keeps things working better.
If anyone sees any problems with this code please let me know.
Thank you,
Eric W. Biederman (6):
proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
proc: In proc_prune_siblings_dcache cache an aquired super block
proc: Use d_invalidate in proc_prune_siblings_dcache
proc: Clear the pieces of proc_inode that proc_evict_inode cares about
proc: Use a list of inodes to flush from proc
fs/proc/base.c | 111 ++++++++++++++++--------------------------------
fs/proc/inode.c | 73 ++++++++++++++++++++++++++++---
fs/proc/internal.h | 4 +-
fs/proc/proc_sysctl.c | 45 +++-----------------
include/linux/pid.h | 1 +
include/linux/proc_fs.h | 4 +-
kernel/exit.c | 4 +-
kernel/pid.c | 1 +
8 files changed, 120 insertions(+), 123 deletions(-)

Proc mount option handling is broken, and it has been since I
accidentally broke it in the middle of 2006.
The problem is that, because we perform an internal mount of proc
before user space mounts proc, all of the mount options the user
specifies when mounting proc are ignored.
You can set those mount options with a remount, but that is rather
surprising.
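The symptom is easy to demonstrate. A hypothetical reproducer for an
affected kernel (assumes /tmp/proc exists and root privileges;
illustration only):

/* Mount proc with hidepid=2 at a new mountpoint, then check whether
 * the option actually took effect.  Illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>

int main(void)
{
	/* Ask for hidepid=2 on a fresh proc mount. */
	if (mount("proc", "/tmp/proc", "proc", 0, "hidepid=2")) {
		perror("mount");
		return 1;
	}
	/* On an affected kernel the /tmp/proc line below shows the
	 * shared superblock's options, not hidepid=2. */
	return system("grep /tmp/proc /proc/self/mounts");
}

On an affected kernel the new mount's entry shows the shared
superblock's options instead of hidepid=2; only a subsequent remount
makes the option stick, which is the surprising behavior described
above.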
This most directly affects Android, which is using hidepid=2 by default.
Now that the sysctl system call support has been removed, and we have
settled on a way of flushing proc dentries when a process exits without
using proc_mnt, there is a simple and easy fix:
a) Give UML mconsole its own private mount of proc to use (sketched below).
b) Stop creating the internal mount of proc
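Step (a) amounts to roughly the following sketch. This is
reconstructed from the description above, not the actual patch; the
real code will differ in details such as where the mount is created
and how the filesystem type reference is dropped:

/*
 * Sketch: give mconsole its own proc mount instead of borrowing the
 * internal pid-namespace mount.  Simplified; real error handling and
 * reference management may differ (e.g. using put_filesystem()).
 */
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/mount.h>

static struct vfsmount *proc_mnt;

static int __init mconsole_mount_proc(void)
{
	struct file_system_type *proc_fs_type = get_fs_type("proc");
	struct vfsmount *mnt;

	if (!proc_fs_type)
		return -ENODEV;

	/* Create a mount of proc that is private to this driver. */
	mnt = kern_mount(proc_fs_type);
	module_put(proc_fs_type->owner);	/* drop get_fs_type()'s ref */
	if (IS_ERR(mnt))
		return PTR_ERR(mnt);

	proc_mnt = mnt;	/* mconsole_proc() would use this mount */
	return 0;
}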
We still need Alexey Gladkov's full patch to get proc mount options to
work inside of UML, and to be generally useful. This set of changes
is just enough to get them working as well as they have in the past.
If anyone sees any problem with this code please let me know.
Otherwise I plan to merge this set of fixes through my tree.
Link: https://lore.kernel.org/lkml/87r21tuulj.fsf@x220.int.ebiederm.org/
Link: https://lore.kernel.org/lkml/871rqk2brn.fsf_-_@x220.int.ebiederm.org/
Link: https://lore.kernel.org/lkml/20200210150519.538333-1-gladkov.alexey@gmail.com/
Link: https://lore.kernel.org/lkml/20180611195744.154962-1-astrachan@google.com/
Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.")
Eric W. Biederman (3):
uml: Don't consult current to find the proc_mnt in mconsole_proc
uml: Create a private mount of proc for mconsole
proc: Remove the now unnecessary internal mount of proc
arch/um/drivers/mconsole_kern.c | 28 +++++++++++++++++++++++++++-
fs/proc/root.c | 36 ------------------------------------
include/linux/pid_namespace.h | 2 --
include/linux/proc_ns.h | 5 -----
kernel/pid.c | 8 --------
kernel/pid_namespace.c | 7 -------
6 files changed, 27 insertions(+), 59 deletions(-)
Eric

Oleg wrote a very informative comment, but with the removal of
proc_cleanup_work it is no longer accurate.
Rewrite the comment so that it only talks about the details
that are still relevant, and hopefully is a little clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
kernel/pid_namespace.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 318fcc6ba301..01f8ba32cc0c 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -224,20 +224,27 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
} while (rc != -ECHILD);
/*
- * kernel_wait4() above can't reap the EXIT_DEAD children but we do not
- * really care, we could reparent them to the global init. We could
- * exit and reap ->child_reaper even if it is not the last thread in
- * this pid_ns, free_pid(pid_allocated == 0) calls proc_cleanup_work(),
- * pid_ns can not go away until proc_kill_sb() drops the reference.
+ * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE
+ * process whose parents processes are outside of the pid
+ * namespace. Such processes are created with setns()+fork().
*
- * But this ns can also have other tasks injected by setns()+fork().
- * Again, ignoring the user visible semantics we do not really need
- * to wait until they are all reaped, but they can be reparented to
- * us and thus we need to ensure that pid->child_reaper stays valid
- * until they all go away. See free_pid()->wake_up_process().
+ * If those EXIT_ZOMBIE processes are not reaped by their
+ * parents before their parents exit, they will be reparented
+ * to pid_ns->child_reaper. Thus pidns->child_reaper needs to
+ * stay valid until they all go away.
*
- * We rely on ignored SIGCHLD, an injected zombie must be autoreaped
- * if reparented.
+ * The code relies on the the pid_ns->child_reaper ignoring
+ * SIGCHILD to cause those EXIT_ZOMBIE processes to be
+ * autoreaped if reparented.
+ *
+ * Semantically it is also desirable to wait for EXIT_ZOMBIE
+ * processes before allowing the child_reaper to be reaped, as
+ * that gives the invariant that when the init process of a
+ * pid namespace is reaped all of the processes in the pid
+ * namespace are gone.
+ *
+ * Once all of the other tasks are gone from the pid_namespace
+ * free_pid() will awaken this task.
*/
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
--
2.20.1

On Fri, Feb 28, 2020 at 04:34:20PM -0600, Eric W. Biederman wrote:
>
> Oleg wrote a very informative comment, but with the removal of
> proc_cleanup_work it is no longer accurate.
>
> Rewrite the comment so that it only talks about the details
> that are still relevant, and hopefully is a little clearer.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> kernel/pid_namespace.c | 31 +++++++++++++++++++------------
> 1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 318fcc6ba301..01f8ba32cc0c 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -224,20 +224,27 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> } while (rc != -ECHILD);
>
> /*
> - * kernel_wait4() above can't reap the EXIT_DEAD children but we do not
> - * really care, we could reparent them to the global init. We could
> - * exit and reap ->child_reaper even if it is not the last thread in
> - * this pid_ns, free_pid(pid_allocated == 0) calls proc_cleanup_work(),
> - * pid_ns can not go away until proc_kill_sb() drops the reference.
> + * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE
> + * process whose parents processes are outside of the pid
> + * namespace. Such processes are created with setns()+fork().
> *
> - * But this ns can also have other tasks injected by setns()+fork().
> - * Again, ignoring the user visible semantics we do not really need
> - * to wait until they are all reaped, but they can be reparented to
> - * us and thus we need to ensure that pid->child_reaper stays valid
> - * until they all go away. See free_pid()->wake_up_process().
> + * If those EXIT_ZOMBIE processes are not reaped by their
> + * parents before their parents exit, they will be reparented
> + * to pid_ns->child_reaper. Thus pidns->child_reaper needs to
> + * stay valid until they all go away.
> *
> - * We rely on ignored SIGCHLD, an injected zombie must be autoreaped
> - * if reparented.
> + * The code relies on the the pid_ns->child_reaper ignoring
s/the the/the/
Hm, can we maybe reformulate this to:
"The code relies on having made pid_ns->child_reaper ignore SIGCHLD above
causing EXIT_ZOMBIE processes to be autoreaped if reparented."
Which imho makes it clearer that it was us ensuring that SIGCHLD is
ignored. Someone not too familiar with the exit codepaths might be
looking at zap_pid_ns_processes() not knowing that it is only called
when namespace init is exiting.
Otherwise
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>