Commit Message

A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
result in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases). This is done
in both directions -- hence two helpers -- though it's more common to
have to copy user space structs into kernel space.
Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[1]). A future
patch replaces all of the common uses of this pattern to use the new
copy_struct_{to,from}_user() helpers.
[1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
similar checks to copy_struct_from_user() while rt_sigprocmask(2)
always rejects differently-sized struct arguments.
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
include/linux/uaccess.h | 5 ++
lib/Makefile | 2 +-
lib/struct_user.c | 182 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 188 insertions(+), 1 deletion(-)
create mode 100644 lib/struct_user.c

Comments

On Wed, Sep 4, 2019 at 1:20 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>> A common pattern for syscall extensions is increasing the size of a> struct passed from userspace, such that the zero-value of the new fields> result in the old kernel behaviour (allowing for a mix of userspace and> kernel vintages to operate on one another in most cases).
Ack, this makes the whole series (and a few unrelated system calls) cleaner.
Linus

On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> A common pattern for syscall extensions is increasing the size of a> struct passed from userspace, such that the zero-value of the new fields> result in the old kernel behaviour (allowing for a mix of userspace and> kernel vintages to operate on one another in most cases). This is done> in both directions -- hence two helpers -- though it's more common to> have to copy user space structs into kernel space.> > Previously there was no common lib/ function that implemented> the necessary extension-checking semantics (and different syscalls> implemented them slightly differently or incompletely[1]). A future> patch replaces all of the common uses of this pattern to use the new> copy_struct_{to,from}_user() helpers.> > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do> similar checks to copy_struct_from_user() while rt_sigprocmask(2)> always rejects differently-sized struct arguments.> > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
I would probably split this out into a separate patchset. It can very
well go in before openat2(). Thoughts?
Christian

On 2019-09-05, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:> > A common pattern for syscall extensions is increasing the size of a> > struct passed from userspace, such that the zero-value of the new fields> > result in the old kernel behaviour (allowing for a mix of userspace and> > kernel vintages to operate on one another in most cases). This is done> > in both directions -- hence two helpers -- though it's more common to> > have to copy user space structs into kernel space.> > > > Previously there was no common lib/ function that implemented> > the necessary extension-checking semantics (and different syscalls> > implemented them slightly differently or incompletely[1]). A future> > patch replaces all of the common uses of this pattern to use the new> > copy_struct_{to,from}_user() helpers.> > > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do> > similar checks to copy_struct_from_user() while rt_sigprocmask(2)> > always rejects differently-sized struct arguments.> > > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>> > I would probably split this out into a separate patchset. It can very> well go in before openat2(). Thoughts?
Yeah, I'll split this and the related patches out -- though I will admit
I'm not sure how you're supposed to deal with multiple independent
patchsets that depend on each other. How will folks reviewing openat2(2)
know to include the lib/struct_user.c changes?
Also, whose tree should it go through?

On Thu, Sep 05, 2019 at 09:27:18PM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Christian Brauner <christian.brauner@ubuntu.com> wrote:> > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:> > > A common pattern for syscall extensions is increasing the size of a> > > struct passed from userspace, such that the zero-value of the new fields> > > result in the old kernel behaviour (allowing for a mix of userspace and> > > kernel vintages to operate on one another in most cases). This is done> > > in both directions -- hence two helpers -- though it's more common to> > > have to copy user space structs into kernel space.> > > > > > Previously there was no common lib/ function that implemented> > > the necessary extension-checking semantics (and different syscalls> > > implemented them slightly differently or incompletely[1]). A future> > > patch replaces all of the common uses of this pattern to use the new> > > copy_struct_{to,from}_user() helpers.> > > > > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do> > > similar checks to copy_struct_from_user() while rt_sigprocmask(2)> > > always rejects differently-sized struct arguments.> > > > > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>> > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>> > > > I would probably split this out into a separate patchset. It can very> > well go in before openat2(). Thoughts?> > Yeah, I'll split this and the related patches out -- though I will admit> I'm not sure how you're supposed to deal with multiple independent> patchsets that depend on each other. How will folks reviewing openat2(2)> know to include the lib/struct_user.c changes?
The way I usually deal with this is to make two branches. One with the
changes the other depends on and then merge this branch into the other
and put the changes on top. Then you can provide a complete branch that
people can test when you send the patchset out by just linking to it in
the cover letter.
(But if it's too much hazzle just leave it.)
> > Also, whose tree should it go through?
If people think splitting it out makes sense and we can settle the
technical details I can take it and let it stew in linux-next at least
for a little while.
I have changes to clone3() in there that touch
copy_clone_args_from_user() anyway and there are tests for clone3()
struct copying so we'd catch regressions (for clone3() at least) pretty
quickly.
If we don't see any major issues in the next two weeks it might even be
ok to send for 5.4.
Christian

On 2019-09-05, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:> > A common pattern for syscall extensions is increasing the size of a> > struct passed from userspace, such that the zero-value of the new fields> > result in the old kernel behaviour (allowing for a mix of userspace and> > kernel vintages to operate on one another in most cases). This is done> > in both directions -- hence two helpers -- though it's more common to> > have to copy user space structs into kernel space.> > > > Previously there was no common lib/ function that implemented> > the necessary extension-checking semantics (and different syscalls> > implemented them slightly differently or incompletely[1]). A future> > patch replaces all of the common uses of this pattern to use the new> > copy_struct_{to,from}_user() helpers.> > > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do> > similar checks to copy_struct_from_user() while rt_sigprocmask(2)> > always rejects differently-sized struct arguments.> > > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
[...]
> > + if (unlikely(!access_ok(src, usize)))> > + return -EFAULT;> > +> > + /* Deal with trailing bytes. */> > + if (usize < ksize)> > + memset(dst + size, 0, rest);
[...]
> That's a change in behavior for clone3() and sched at least, no? Unless> - which I guess you might have done - you have moved the "error out when> the struct is too small" part before the call to copy_struct_from_user()> for them.
Yes, I've put the minimum size check to the callers in all of the
cases (in the case of clone3() I've #define'd a CLONE_ARGS_SIZE_VER0 to
match the others -- see patch 2 of the series).

On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> Because every caller of that function right now has that limit set> anyway iirc. So we can either remove it from here and place it back for> the individual callers or leave it in the helper.> Also, I'm really asking, why not? Is it unreasonable to have an upper> bound on the size (for a long time probably) or are you disagreeing with> PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,> bpf, and clone3 and in a few other places.
For a primitive that can be safely used with any size (OK, any within
the usual 2Gb limit)? Why push the random policy into the place where
it doesn't belong?
Seriously, what's the point? If they want to have a large chunk of
userland memory zeroed or checked for non-zeroes - why would that
be a problem?

On Thu, Sep 05, 2019 at 07:28:01PM +0100, Al Viro wrote:
> On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:> > > Because every caller of that function right now has that limit set> > anyway iirc. So we can either remove it from here and place it back for> > the individual callers or leave it in the helper.> > Also, I'm really asking, why not? Is it unreasonable to have an upper> > bound on the size (for a long time probably) or are you disagreeing with> > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,> > bpf, and clone3 and in a few other places.> > For a primitive that can be safely used with any size (OK, any within> the usual 2Gb limit)? Why push the random policy into the place where> it doesn't belong?
Ah, the "not in the helper part" makes sense.
As long as leave the check for the callers themselves.
> > Seriously, what's the point? If they want to have a large chunk of> userland memory zeroed or checked for non-zeroes - why would that> be a problem?

On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:> > > Because every caller of that function right now has that limit set> > anyway iirc. So we can either remove it from here and place it back for> > the individual callers or leave it in the helper.> > Also, I'm really asking, why not? Is it unreasonable to have an upper> > bound on the size (for a long time probably) or are you disagreeing with> > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,> > bpf, and clone3 and in a few other places.> > For a primitive that can be safely used with any size (OK, any within> the usual 2Gb limit)? Why push the random policy into the place where> it doesn't belong?> > Seriously, what's the point? If they want to have a large chunk of> userland memory zeroed or checked for non-zeroes - why would that> be a problem?
Thinking about it some more, there isn't really any r/w amplification --
so there isn't much to gain by passing giant structs. Though, if we are
going to permit 2GB buffers, isn't that also an argument to use
memchr_inv()? :P

On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote:> > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:> > > > > Because every caller of that function right now has that limit set> > > anyway iirc. So we can either remove it from here and place it back for> > > the individual callers or leave it in the helper.> > > Also, I'm really asking, why not? Is it unreasonable to have an upper> > > bound on the size (for a long time probably) or are you disagreeing with> > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,> > > bpf, and clone3 and in a few other places.> > > > For a primitive that can be safely used with any size (OK, any within> > the usual 2Gb limit)? Why push the random policy into the place where> > it doesn't belong?> > > > Seriously, what's the point? If they want to have a large chunk of> > userland memory zeroed or checked for non-zeroes - why would that> > be a problem?> > Thinking about it some more, there isn't really any r/w amplification --> so there isn't much to gain by passing giant structs. Though, if we are> going to permit 2GB buffers, isn't that also an argument to use> memchr_inv()? :P
I'm not sure I understand the last bit. If you look at what copy_from_user()
does on misaligned source/destination, especially on architectures that
really, really do not like unaligned access...
Case in point: alpha (and it's not unusual in that respect). What it boils
down to is
copy bytes until the destination is aligned
if source and destination are both aligned
copy word by word
else
read word by word, storing the mix of two adjacent words
copy the rest byte by byte
The unpleasant case (to and from having different remainders modulo 8) is
basically
if (count >= 8) {
u64 *aligned = (u64 *)(from & ~7);
u64 *dest = (u64 *)to;
int bitshift = (from & 7) * 8;
u64 prev, next;
prev = aligned[0];
do {
next = aligned[1];
prev <<= bitshift;
prev |= next >> (64 - bitshift);
*dest++ = prev;
aligned++;
prev = next;
from += 8;
to += 8;
count -= 8;
} while (count >= 8);
}
Now, mix that with "... and do memchr_inv() on the copy to find if we'd
copied any non-zeroes, nevermind where" and it starts looking really
ridiculous.
We should just read the fscking source, aligned down to word boundary
and check each word being read. The first and the last ones - masked.
All there is to it. On almost all architectures that'll work well
enough; s390 might want something more elaborate (there even word-by-word
copies are costly, but I'd suggest talking to them for details).
Something like bool all_zeroes_user(const void __user *p, size_t count)
would probably be a sane API...

On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote:> > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:> > > > > Because every caller of that function right now has that limit set> > > anyway iirc. So we can either remove it from here and place it back for> > > the individual callers or leave it in the helper.> > > Also, I'm really asking, why not? Is it unreasonable to have an upper> > > bound on the size (for a long time probably) or are you disagreeing with> > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,> > > bpf, and clone3 and in a few other places.> > > > For a primitive that can be safely used with any size (OK, any within> > the usual 2Gb limit)? Why push the random policy into the place where> > it doesn't belong?> > > > Seriously, what's the point? If they want to have a large chunk of> > userland memory zeroed or checked for non-zeroes - why would that> > be a problem?> > Thinking about it some more, there isn't really any r/w amplification --> so there isn't much to gain by passing giant structs. Though, if we are> going to permit 2GB buffers, isn't that also an argument to use> memchr_inv()? :P
I think we should just do a really dumb, easy to understand minimal
thing for now. It could even just be what every caller is doing right
now anyway with the get_user() loop.
The only way to settle memchr_inv() vs all the other clever ways
suggested here is to benchmark it and see if it matters *for the current
users* of this helper. If it does, great we can do it. If it doesn't why
bother having that argument right now?
Once we somehow end up in a possible world where we apparently have
decided it's a great idea to copy 2GB argument structures for a syscall
into or from the kernel we can start optimizing the hell out of this.
Before that and especially with current callers I honestly doubt it
matters whether we use memchr_inv() or while() {get_user()} loops.
Christian