*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-06 16:30 [PATCH v15 00/17] arm64: untag user pointers passed to the kernel Andrey Konovalov
` (16 preceding siblings ...)
2019-05-06 16:31 ` [PATCH v15 17/17] selftests, arm64: add a selftest for passing tagged pointers to kernel Andrey Konovalov
@ 2019-05-17 14:49 ` Catalin Marinas
2019-05-20 23:53 ` Evgenii Stepanov
2019-05-21 18:48 ` Jason Gunthorpe17 siblings, 2 replies; 99+ messages in thread
From: Catalin Marinas @ 2019-05-17 14:49 UTC (permalink / raw)
To: Andrey Konovalov
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
linux-mm, linux-kselftest, Felix Kuehling, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgeniy Stepanov, linux-media,
Kevin Brodsky, Kees Cook, Ruben Ayrapetyan, Ramana Radhakrishnan,
Alex Williamson, Mauro Carvalho Chehab, linux-arm-kernel,
Kostya Serebryany, Greg Kroah-Hartman, Yishai Hadas,
linux-kernel, Jens Wiklander, Lee Smith, Alexander Deucher,
Andrew Morton, Robin Murphy, Christian Koenig,
Luc Van Oostenryck
Hi Andrey,
On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote:
> One of the alternative approaches to untagging that was considered is to
> completely strip the pointer tag as the pointer enters the kernel with
> some kind of a syscall wrapper, but that won't work with the countless
> number of different ioctl calls. With this approach we would need a custom
> wrapper for each ioctl variation, which doesn't seem practical.
The more I look at this problem, the less convinced I am that we can
solve it in a way that results in a stable ABI covering ioctls(). While
for the Android kernel codebase it could be simpler as you don't upgrade
the kernel version every 2.5 months, for the mainline kernel this
doesn't scale. Any run-time checks are relatively limited in terms of
drivers covered. Better static checking would be nice as a long term
solution but we didn't get anywhere with the discussion last year.
IMO (RFC for now), I see two ways forward:
1. Make this a user space problem and do not allow tagged pointers into
the syscall ABI. A libc wrapper would have to convert structures,
parameters before passing them into the kernel. Note that we can
still support the hardware MTE in the kernel by enabling tagged
memory ranges, saving/restoring tags etc. but not allowing tagged
addresses at the syscall boundary.
2. Similar shim to the above libc wrapper but inside the kernel
(arch/arm64 only; most pointer arguments could be covered with an
__SC_CAST similar to the s390 one). There are two differences from
what we've discussed in the past:
a) this is an opt-in by the user which would have to explicitly call
prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
to pass tagged pointers to the kernel. This would probably be the
responsibility of the C lib to make sure it doesn't tag heap
allocations. If the user did not opt-in, the syscalls are routed
through the normal path (no untagging address shim).
b) ioctl() and other blacklisted syscalls (prctl) will not accept
tagged pointers (to be documented in Vicenzo's ABI patches).
It doesn't solve the problems we are trying to address but 2.a saves us
from blindly relaxing the ABI without knowing how to easily assess new
code being merged (over 500K lines between kernel versions). Existing
applications (who don't opt-in) won't inadvertently start using the new
ABI which could risk becoming de-facto ABI that we need to support on
the long run.
Option 1 wouldn't solve the ioctl() problem either and while it makes
things simpler for the kernel, I am aware that it's slightly more
complicated in user space (but I really don't mind if you prefer option
1 ;)).
The tagged pointers (whether hwasan or MTE) should ideally be a
transparent feature for the application writer but I don't think we can
solve it entirely and make it seamless for the multitude of ioctls().
I'd say you only opt in to such feature if you know what you are doing
and the user code takes care of specific cases like ioctl(), hence the
prctl() proposal even for the hwasan.
Comments welcomed.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-17 14:49 ` [PATCH v15 00/17] arm64: untag user pointers passed to the kernel Catalin Marinas
@ 2019-05-20 23:53 ` Evgenii Stepanov
2019-05-21 18:29 ` Catalin Marinas
2019-05-21 18:48 ` Jason Gunthorpe1 sibling, 1 reply; 99+ messages in thread
From: Evgenii Stepanov @ 2019-05-20 23:53 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List,
open list:KERNEL SELFTEST FRAMEWORK, Felix Kuehling,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, Dmitry Vyukov, Dave Martin, linux-media, Kevin Brodsky,
Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Mauro Carvalho Chehab,
Linux ARM, Kostya Serebryany, Greg Kroah-Hartman, Yishai Hadas,
LKML, Jens Wiklander, Lee Smith, Alexander Deucher,
Andrew Morton, Elliott Hughes, Robin Murphy, Christian Koenig,
Luc Van Oostenryck
On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Hi Andrey,
>
> On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote:
> > One of the alternative approaches to untagging that was considered is to
> > completely strip the pointer tag as the pointer enters the kernel with
> > some kind of a syscall wrapper, but that won't work with the countless
> > number of different ioctl calls. With this approach we would need a custom
> > wrapper for each ioctl variation, which doesn't seem practical.
>
> The more I look at this problem, the less convinced I am that we can
> solve it in a way that results in a stable ABI covering ioctls(). While
> for the Android kernel codebase it could be simpler as you don't upgrade
> the kernel version every 2.5 months, for the mainline kernel this
> doesn't scale. Any run-time checks are relatively limited in terms of
> drivers covered. Better static checking would be nice as a long term
> solution but we didn't get anywhere with the discussion last year.
>
> IMO (RFC for now), I see two ways forward:
>
> 1. Make this a user space problem and do not allow tagged pointers into
> the syscall ABI. A libc wrapper would have to convert structures,
> parameters before passing them into the kernel. Note that we can
> still support the hardware MTE in the kernel by enabling tagged
> memory ranges, saving/restoring tags etc. but not allowing tagged
> addresses at the syscall boundary.
>
> 2. Similar shim to the above libc wrapper but inside the kernel
> (arch/arm64 only; most pointer arguments could be covered with an
> __SC_CAST similar to the s390 one). There are two differences from
> what we've discussed in the past:
>
> a) this is an opt-in by the user which would have to explicitly call
> prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> to pass tagged pointers to the kernel. This would probably be the
> responsibility of the C lib to make sure it doesn't tag heap
> allocations. If the user did not opt-in, the syscalls are routed
> through the normal path (no untagging address shim).
>
> b) ioctl() and other blacklisted syscalls (prctl) will not accept
> tagged pointers (to be documented in Vicenzo's ABI patches).
>
> It doesn't solve the problems we are trying to address but 2.a saves us
> from blindly relaxing the ABI without knowing how to easily assess new
> code being merged (over 500K lines between kernel versions). Existing
> applications (who don't opt-in) won't inadvertently start using the new
> ABI which could risk becoming de-facto ABI that we need to support on
> the long run.
>
> Option 1 wouldn't solve the ioctl() problem either and while it makes
> things simpler for the kernel, I am aware that it's slightly more
> complicated in user space (but I really don't mind if you prefer option
> 1 ;)).
>
> The tagged pointers (whether hwasan or MTE) should ideally be a
> transparent feature for the application writer but I don't think we can
> solve it entirely and make it seamless for the multitude of ioctls().
> I'd say you only opt in to such feature if you know what you are doing
> and the user code takes care of specific cases like ioctl(), hence the
> prctl() proposal even for the hwasan.
>
> Comments welcomed.
Any userspace shim approach is problematic for Android because of the
apps that use raw system calls. AFAIK, all apps written in Go are in
that camp - I'm not sure how common they are, but getting them all
recompiled is probably not realistic.
The way I see it, a patch that breaks handling of tagged pointers is
not that different from, say, a patch that adds a wild pointer
dereference. Both are bugs; the difference is that (a) the former
breaks a relatively uncommon target and (b) it's arguably an easier
mistake to make. If MTE adoption goes well, (a) will not be the case
for long.
This is a bit of a chicken-and-egg problem. In a world where memory
allocators on one or several popular platforms generate pointers with
non-zero tags, any such breakage will be caught in testing.
Unfortunately to reach that state we need the kernel to start
accepting tagged pointers first, and then hold on for a couple of
years until userspace catches up.
Perhaps we can start by whitelisting ioctls by driver?
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-20 23:53 ` Evgenii Stepanov@ 2019-05-21 18:29 ` Catalin Marinas
2019-05-22 0:04 ` Kees Cook0 siblings, 1 reply; 99+ messages in thread
From: Catalin Marinas @ 2019-05-21 18:29 UTC (permalink / raw)
To: Evgenii Stepanov
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List,
open list:KERNEL SELFTEST FRAMEWORK, Felix Kuehling,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, Dmitry Vyukov, Dave Martin, linux-media, Kevin Brodsky,
Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Mauro Carvalho Chehab,
Linux ARM, Kostya Serebryany, Greg Kroah-Hartman, Yishai Hadas,
LKML, Jens Wiklander, Lee Smith, Alexander Deucher,
Andrew Morton, Elliott Hughes, Robin Murphy, Christian Koenig,
Luc Van Oostenryck
On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > IMO (RFC for now), I see two ways forward:
> >
> > 1. Make this a user space problem and do not allow tagged pointers into
> > the syscall ABI. A libc wrapper would have to convert structures,
> > parameters before passing them into the kernel. Note that we can
> > still support the hardware MTE in the kernel by enabling tagged
> > memory ranges, saving/restoring tags etc. but not allowing tagged
> > addresses at the syscall boundary.
> >
> > 2. Similar shim to the above libc wrapper but inside the kernel
> > (arch/arm64 only; most pointer arguments could be covered with an
> > __SC_CAST similar to the s390 one). There are two differences from
> > what we've discussed in the past:
> >
> > a) this is an opt-in by the user which would have to explicitly call
> > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > to pass tagged pointers to the kernel. This would probably be the
> > responsibility of the C lib to make sure it doesn't tag heap
> > allocations. If the user did not opt-in, the syscalls are routed
> > through the normal path (no untagging address shim).
> >
> > b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > tagged pointers (to be documented in Vicenzo's ABI patches).
[...]
> Any userspace shim approach is problematic for Android because of the
> apps that use raw system calls. AFAIK, all apps written in Go are in
> that camp - I'm not sure how common they are, but getting them all
> recompiled is probably not realistic.
That's a fair point (I wasn't expecting it would get much traction
anyway ;)). OTOH, it allows upstreaming of the MTE patches while we
continue the discussions around TBI.
> The way I see it, a patch that breaks handling of tagged pointers is
> not that different from, say, a patch that adds a wild pointer
> dereference. Both are bugs; the difference is that (a) the former
> breaks a relatively uncommon target and (b) it's arguably an easier
> mistake to make. If MTE adoption goes well, (a) will not be the case
> for long.
It's also the fact such patch would go unnoticed for a long time until
someone exercises that code path. And when they do, the user would be
pretty much in the dark trying to figure what what went wrong, why a
SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
all the places where it matters in the current kernel codebase (ignoring
future patches).
I think we should revisit the static checking discussions we had last
year. Run-time checking (even with compiler instrumentation and
syzkaller fuzzing) would only cover the code paths specific to a Linux
or Android installation.
> This is a bit of a chicken-and-egg problem. In a world where memory
> allocators on one or several popular platforms generate pointers with
> non-zero tags, any such breakage will be caught in testing.
> Unfortunately to reach that state we need the kernel to start
> accepting tagged pointers first, and then hold on for a couple of
> years until userspace catches up.
Would the kernel also catch up with providing a stable ABI? Because we
have two moving targets.
On one hand, you have Android or some Linux distro that stick to a
stable kernel version for some time, so they have better chance of
clearing most of the problems. On the other hand, we have mainline
kernel that gets over 500K lines every release. As maintainer, I can't
rely on my testing alone as this is on a limited number of platforms. So
my concern is that every kernel release has a significant chance of
breaking the ABI, unless we have a better way of identifying potential
issues.
> Perhaps we can start by whitelisting ioctls by driver?
This was also raised by Ruben in private but without a (static) tool to
to check, manually going through all the drivers doesn't scale. It's
very likely that most drivers don't care, just a get_user/put_user is
already handled by these patches. Searching for find_vma() was
identifying one such use-case but is this sufficient? Are there other
cases we need to explicitly untag a pointer?
The other point I'd like feedback on is 2.a above. I see _some_ value
into having the user opt-in to this relaxed ABI rather than blinding
exposing it to all applications. Dave suggested (in private) a new
personality (e.g. PER_LINUX_TBI) inherited by children. It would be the
responsibility of the C library to check the current personality bits
and only tag pointers on allocation *if* the kernel allowed it. The
kernel could provide the AT_FLAGS bit as in Vincenzo's patches if the
personality was set but can't set it retrospectively if the user called
sys_personality. By default, /sbin/init would not have this personality
and libc would not tag pointers, so we can guarantee that your distro
boots normally with a new kernel version. We could have an envp that
gets caught by /sbin/init so you can pass it on the kernel command line
(or a dynamic loader at run-time). But the default should be the current
ABI behaviour.
We can enforce the current behaviour by having access_ok() check the
personality or a TIF flag but we may relax this enforcement at some
point in the future as we learn more about the implications of TBI.
Thanks.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-17 14:49 ` [PATCH v15 00/17] arm64: untag user pointers passed to the kernel Catalin Marinas
2019-05-20 23:53 ` Evgenii Stepanov@ 2019-05-21 18:48 ` Jason Gunthorpe
2019-05-22 13:49 ` Dave Martin1 sibling, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2019-05-21 18:48 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
linux-mm, linux-kselftest, Felix Kuehling, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgeniy Stepanov, linux-media,
Kevin Brodsky, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Mauro Carvalho Chehab,
linux-arm-kernel, Kostya Serebryany, Greg Kroah-Hartman,
Yishai Hadas, linux-kernel, Jens Wiklander, Lee Smith,
Alexander Deucher, Andrew Morton, Robin Murphy, Christian Koenig,
Luc Van Oostenryck
On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
> The tagged pointers (whether hwasan or MTE) should ideally be a
> transparent feature for the application writer but I don't think we can
> solve it entirely and make it seamless for the multitude of ioctls().
> I'd say you only opt in to such feature if you know what you are doing
> and the user code takes care of specific cases like ioctl(), hence the
> prctl() proposal even for the hwasan.
I'm not sure such a dire view is warrented..
The ioctl situation is not so bad, other than a few special cases,
most drivers just take a 'void __user *' and pass it as an argument to
some function that accepts a 'void __user *'. sparse et al verify
this.
As long as the core functions do the right thing the drivers will be
OK.
The only place things get dicy is if someone casts to unsigned long
(ie for vma work) but I think that reflects that our driver facing
APIs for VMAs are compatible with static analysis (ie I have no
earthly idea why get_user_pages() accepts an unsigned long), not that
this is too hard.
Jason
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-21 18:29 ` Catalin Marinas@ 2019-05-22 0:04 ` Kees Cook
2019-05-22 10:11 ` Catalin Marinas
2019-05-23 17:51 ` Khalid Aziz0 siblings, 2 replies; 99+ messages in thread
From: Kees Cook @ 2019-05-22 0:04 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Elliott Hughes,
Robin Murphy, Christian Koenig, Luc Van Oostenryck
On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
> On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > IMO (RFC for now), I see two ways forward:
> > > [...]
> > > 2. Similar shim to the above libc wrapper but inside the kernel
> > > (arch/arm64 only; most pointer arguments could be covered with an
> > > __SC_CAST similar to the s390 one). There are two differences from
> > > what we've discussed in the past:
> > >
> > > a) this is an opt-in by the user which would have to explicitly call
> > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > > to pass tagged pointers to the kernel. This would probably be the
> > > responsibility of the C lib to make sure it doesn't tag heap
> > > allocations. If the user did not opt-in, the syscalls are routed
> > > through the normal path (no untagging address shim).
> > >
> > > b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > > tagged pointers (to be documented in Vicenzo's ABI patches).
> >
> > The way I see it, a patch that breaks handling of tagged pointers is
> > not that different from, say, a patch that adds a wild pointer
> > dereference. Both are bugs; the difference is that (a) the former
> > breaks a relatively uncommon target and (b) it's arguably an easier
> > mistake to make. If MTE adoption goes well, (a) will not be the case
> > for long.
>
> It's also the fact such patch would go unnoticed for a long time until
> someone exercises that code path. And when they do, the user would be
> pretty much in the dark trying to figure what what went wrong, why a
> SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
> all the places where it matters in the current kernel codebase (ignoring
> future patches).
So, looking forward a bit, this isn't going to be an ARM-specific issue
for long. In fact, I think we shouldn't have arm-specific syscall wrappers
in this series: I think untagged_addr() should likely be added at the
top-level and have it be a no-op for other architectures. So given this
becoming a kernel-wide multi-architecture issue (under the assumption
that x86, RISC-V, and others will gain similar TBI or MTE things),
we should solve it in a way that we can re-use.
We need something that is going to work everywhere. And it needs to be
supported by the kernel for the simple reason that the kernel needs to
do MTE checks during copy_from_user(): having that information stripped
means we lose any userspace-assigned MTE protections if they get handled
by the kernel, which is a total non-starter, IMO.
As an aside: I think Sparc ADI support in Linux actually side-stepped
this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
for kernel code.") I think this was a mistake we should not repeat for
arm64 (we do seem to be at least in agreement about this, I think).
[1] https://lore.kernel.org/patchwork/patch/654481/> > This is a bit of a chicken-and-egg problem. In a world where memory
> > allocators on one or several popular platforms generate pointers with
> > non-zero tags, any such breakage will be caught in testing.
> > Unfortunately to reach that state we need the kernel to start
> > accepting tagged pointers first, and then hold on for a couple of
> > years until userspace catches up.
>
> Would the kernel also catch up with providing a stable ABI? Because we
> have two moving targets.
>
> On one hand, you have Android or some Linux distro that stick to a
> stable kernel version for some time, so they have better chance of
> clearing most of the problems. On the other hand, we have mainline
> kernel that gets over 500K lines every release. As maintainer, I can't
> rely on my testing alone as this is on a limited number of platforms. So
> my concern is that every kernel release has a significant chance of
> breaking the ABI, unless we have a better way of identifying potential
> issues.
I just want to make sure I fully understand your concern about this
being an ABI break, and I work best with examples. The closest situation
I can see would be:
- some program has no idea about MTE
- malloc() starts returning MTE-tagged addresses
- program doesn't break from that change
- program uses some syscall that is missing untagged_addr() and fails
- kernel has now broken userspace that used to work
The trouble I see with this is that it is largely theoretical and
requires part of userspace to collude to start using a new CPU feature
that tickles a bug in the kernel. As I understand the golden rule,
this is a bug in the kernel (a missed ioctl() or such) to be fixed,
not a global breaking of some userspace behavior.
I feel like I'm missing something about this being seen as an ABI
break. The kernel already fails on userspace addresses that have high
bits set -- are there things that _depend_ on this failure to operate?
--
Kees Cook
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 0:04 ` Kees Cook@ 2019-05-22 10:11 ` Catalin Marinas
2019-05-22 15:30 ` enh
2019-05-23 17:51 ` Khalid Aziz1 sibling, 1 reply; 99+ messages in thread
From: Catalin Marinas @ 2019-05-22 10:11 UTC (permalink / raw)
To: Kees Cook
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Elliott Hughes,
Robin Murphy, Christian Koenig, Luc Van Oostenryck
Hi Kees,
Thanks for joining the thread ;).
On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
> > On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> > > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > IMO (RFC for now), I see two ways forward:
> > > > [...]
> > > > 2. Similar shim to the above libc wrapper but inside the kernel
> > > > (arch/arm64 only; most pointer arguments could be covered with an
> > > > __SC_CAST similar to the s390 one). There are two differences from
> > > > what we've discussed in the past:
> > > >
> > > > a) this is an opt-in by the user which would have to explicitly call
> > > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > > > to pass tagged pointers to the kernel. This would probably be the
> > > > responsibility of the C lib to make sure it doesn't tag heap
> > > > allocations. If the user did not opt-in, the syscalls are routed
> > > > through the normal path (no untagging address shim).
> > > >
> > > > b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > > > tagged pointers (to be documented in Vicenzo's ABI patches).
> > >
> > > The way I see it, a patch that breaks handling of tagged pointers is
> > > not that different from, say, a patch that adds a wild pointer
> > > dereference. Both are bugs; the difference is that (a) the former
> > > breaks a relatively uncommon target and (b) it's arguably an easier
> > > mistake to make. If MTE adoption goes well, (a) will not be the case
> > > for long.
> >
> > It's also the fact such patch would go unnoticed for a long time until
> > someone exercises that code path. And when they do, the user would be
> > pretty much in the dark trying to figure what what went wrong, why a
> > SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
> > all the places where it matters in the current kernel codebase (ignoring
> > future patches).
>
> So, looking forward a bit, this isn't going to be an ARM-specific issue
> for long.
I do hope so.
> In fact, I think we shouldn't have arm-specific syscall wrappers
> in this series: I think untagged_addr() should likely be added at the
> top-level and have it be a no-op for other architectures.
That's what the current patchset does, so we have this as a starting
point. Kostya raised another potential issue with the syscall wrappers:
with MTE the kernel will be forced to enable the match-all (wildcard)
pointers for user space accesses since copy_from_user() would only get a
0 tag. So it has wider implications than just uaccess routines not
checking the colour.
> So given this becoming a kernel-wide multi-architecture issue (under
> the assumption that x86, RISC-V, and others will gain similar TBI or
> MTE things), we should solve it in a way that we can re-use.
Can we do any better to aid the untagged_addr() placement (e.g. better
type annotations, better static analysis)? We have to distinguish
between user pointers that may be dereferenced by the kernel (I think
almost fully covered with this patchset) and user addresses represented
as ulong that may:
a) be converted to a user pointer and dereferenced; I think that's the
case for many overloaded ulong/u64 arguments
b) used for address space management, rbtree look-ups etc. where the tag
is no longer relevant and it even gets in the way
We tried last year to identify void __user * casts to unsigned long
using sparse on the assumption that pointers can be tagged while ulong
is about address space management and needs to lose such tag. I think we
could have pushed this further. For example, get_user_pages() takes an
unsigned long but it is perfectly capable of untagging the address
itself. Shall we change its first argument to void __user * (together
with all its callers)?
find_vma(), OTOH, could untag the address but it doesn't help since
vm_start/end don't have such information (that's more about the content
or type that the user decided) and the callers check against it.
Are there any other places where this matters? These patches tracked
down find_vma() as some heuristics but we may need better static
analysis to identify other cases.
> We need something that is going to work everywhere. And it needs to be
> supported by the kernel for the simple reason that the kernel needs to
> do MTE checks during copy_from_user(): having that information stripped
> means we lose any userspace-assigned MTE protections if they get handled
> by the kernel, which is a total non-starter, IMO.
Such feedback is welcomed ;).
> As an aside: I think Sparc ADI support in Linux actually side-stepped
> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> for kernel code.") I think this was a mistake we should not repeat for
> arm64 (we do seem to be at least in agreement about this, I think).
>
> [1] https://lore.kernel.org/patchwork/patch/654481/
I tried to drag the SPARC guys into this discussion but without much
success.
> > > This is a bit of a chicken-and-egg problem. In a world where memory
> > > allocators on one or several popular platforms generate pointers with
> > > non-zero tags, any such breakage will be caught in testing.
> > > Unfortunately to reach that state we need the kernel to start
> > > accepting tagged pointers first, and then hold on for a couple of
> > > years until userspace catches up.
> >
> > Would the kernel also catch up with providing a stable ABI? Because we
> > have two moving targets.
> >
> > On one hand, you have Android or some Linux distro that stick to a
> > stable kernel version for some time, so they have better chance of
> > clearing most of the problems. On the other hand, we have mainline
> > kernel that gets over 500K lines every release. As maintainer, I can't
> > rely on my testing alone as this is on a limited number of platforms. So
> > my concern is that every kernel release has a significant chance of
> > breaking the ABI, unless we have a better way of identifying potential
> > issues.
>
> I just want to make sure I fully understand your concern about this
> being an ABI break, and I work best with examples. The closest situation
> I can see would be:
>
> - some program has no idea about MTE
Apart from some libraries like libc (and maybe those that handle
specific device ioctls), I think most programs should have no idea about
MTE. I wouldn't expect programmers to have to change their app just
because we have a new feature that colours heap allocations.
> - malloc() starts returning MTE-tagged addresses
> - program doesn't break from that change
> - program uses some syscall that is missing untagged_addr() and fails
> - kernel has now broken userspace that used to work
That's one aspect though probably more of a case of plugging in a new
device (graphics card, network etc.) and the ioctl to the new device
doesn't work.
The other is that, assuming we reach a point where the kernel entirely
supports this relaxed ABI, can we guarantee that it won't break in the
future. Let's say some subsequent kernel change (some refactoring)
misses out an untagged_addr(). This renders a previously TBI/MTE-capable
syscall unusable. Can we rely only on testing?
> The trouble I see with this is that it is largely theoretical and
> requires part of userspace to collude to start using a new CPU feature
> that tickles a bug in the kernel. As I understand the golden rule,
> this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> not a global breaking of some userspace behavior.
Yes, we should follow the rule that it's a kernel bug but it doesn't
help the user that a newly installed kernel causes user space to no
longer reach a prompt. Hence the proposal of an opt-in via personality
(for MTE we would need an explicit opt-in by the user anyway since the
top byte is no longer ignored but checked against the allocation tag).
> I feel like I'm missing something about this being seen as an ABI
> break. The kernel already fails on userspace addresses that have high
> bits set -- are there things that _depend_ on this failure to operate?
It's about providing a relaxed ABI which allows non-zero top byte and
breaking it later inadvertently without having something better in place
to analyse the kernel changes.
Thanks.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-21 18:48 ` Jason Gunthorpe@ 2019-05-22 13:49 ` Dave Martin
2019-05-23 0:20 ` Jason Gunthorpe0 siblings, 1 reply; 99+ messages in thread
From: Dave Martin @ 2019-05-22 13:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mark Rutland, kvm, Christian Koenig, Szabolcs Nagy,
Catalin Marinas, Will Deacon, dri-devel, linux-mm, Lee Smith,
linux-kselftest, Vincenzo Frascino, Jacob Bramley,
Leon Romanovsky, linux-rdma, amd-gfx, linux-arm-kernel,
Evgeniy Stepanov, linux-media, Kees Cook, Ruben Ayrapetyan,
Andrey Konovalov, Kevin Brodsky, Alex Williamson,
Mauro Carvalho Chehab, Dmitry Vyukov, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, linux-kernel, Jens Wiklander,
Ramana Radhakrishnan, Alexander Deucher, Andrew Morton,
Robin Murphy, Yishai Hadas, Luc Van Oostenryck
On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
> On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
>
> > The tagged pointers (whether hwasan or MTE) should ideally be a
> > transparent feature for the application writer but I don't think we can
> > solve it entirely and make it seamless for the multitude of ioctls().
> > I'd say you only opt in to such feature if you know what you are doing
> > and the user code takes care of specific cases like ioctl(), hence the
> > prctl() proposal even for the hwasan.
>
> I'm not sure such a dire view is warrented..
>
> The ioctl situation is not so bad, other than a few special cases,
> most drivers just take a 'void __user *' and pass it as an argument to
> some function that accepts a 'void __user *'. sparse et al verify
> this.
>
> As long as the core functions do the right thing the drivers will be
> OK.
>
> The only place things get dicy is if someone casts to unsigned long
> (ie for vma work) but I think that reflects that our driver facing
> APIs for VMAs are compatible with static analysis (ie I have no
> earthly idea why get_user_pages() accepts an unsigned long), not that
> this is too hard.
If multiple people will care about this, perhaps we should try to
annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
structures.
For example, we could have a couple of mutually exclusive modifiers
T __object *
T __vaddr * (or U __vaddr)
In the first case the pointer points to an object (in the C sense)
that the call may dereference but not use for any other purpose.
In the latter case the pointer (or other type) is a virtual address
that the call does not dereference but my do other things with.
Also
U __really(T)
to tell static analysers the real type of pointers smuggled through
UAPI disguised as other types (*cough* KVM, etc.)
We could gradually make sparse more strict about the presence of
annotations and allowed conversions, add get/put_user() variants
that demand explicit annotation, etc.
find_vma() wouldn't work with a __object pointer, for example. A
get_user_pages_for_dereference() might be needed for __object pointers
(embodying a promise from the caller that only the object will be
dereferenced within the mapped pages).
Thoughts?
This kind of thing would need widespread buy-in in order to be viable.
Cheers
---Dave
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 10:11 ` Catalin Marinas@ 2019-05-22 15:30 ` enh
2019-05-22 16:35 ` Catalin Marinas
2019-05-22 19:21 ` Kees Cook0 siblings, 2 replies; 99+ messages in thread
From: enh @ 2019-05-22 15:30 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Hi Kees,
>
> Thanks for joining the thread ;).
>
> On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
> > > On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> > > > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > IMO (RFC for now), I see two ways forward:
> > > > > [...]
> > > > > 2. Similar shim to the above libc wrapper but inside the kernel
> > > > > (arch/arm64 only; most pointer arguments could be covered with an
> > > > > __SC_CAST similar to the s390 one). There are two differences from
> > > > > what we've discussed in the past:
> > > > >
> > > > > a) this is an opt-in by the user which would have to explicitly call
> > > > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > > > > to pass tagged pointers to the kernel. This would probably be the
> > > > > responsibility of the C lib to make sure it doesn't tag heap
> > > > > allocations. If the user did not opt-in, the syscalls are routed
> > > > > through the normal path (no untagging address shim).
> > > > >
> > > > > b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > > > > tagged pointers (to be documented in Vicenzo's ABI patches).
> > > >
> > > > The way I see it, a patch that breaks handling of tagged pointers is
> > > > not that different from, say, a patch that adds a wild pointer
> > > > dereference. Both are bugs; the difference is that (a) the former
> > > > breaks a relatively uncommon target and (b) it's arguably an easier
> > > > mistake to make. If MTE adoption goes well, (a) will not be the case
> > > > for long.
> > >
> > > It's also the fact such patch would go unnoticed for a long time until
> > > someone exercises that code path. And when they do, the user would be
> > > pretty much in the dark trying to figure what what went wrong, why a
> > > SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
> > > all the places where it matters in the current kernel codebase (ignoring
> > > future patches).
> >
> > So, looking forward a bit, this isn't going to be an ARM-specific issue
> > for long.
>
> I do hope so.
>
> > In fact, I think we shouldn't have arm-specific syscall wrappers
> > in this series: I think untagged_addr() should likely be added at the
> > top-level and have it be a no-op for other architectures.
>
> That's what the current patchset does, so we have this as a starting
> point. Kostya raised another potential issue with the syscall wrappers:
> with MTE the kernel will be forced to enable the match-all (wildcard)
> pointers for user space accesses since copy_from_user() would only get a
> 0 tag. So it has wider implications than just uaccess routines not
> checking the colour.
>
> > So given this becoming a kernel-wide multi-architecture issue (under
> > the assumption that x86, RISC-V, and others will gain similar TBI or
> > MTE things), we should solve it in a way that we can re-use.
>
> Can we do any better to aid the untagged_addr() placement (e.g. better
> type annotations, better static analysis)? We have to distinguish
> between user pointers that may be dereferenced by the kernel (I think
> almost fully covered with this patchset) and user addresses represented
> as ulong that may:
>
> a) be converted to a user pointer and dereferenced; I think that's the
> case for many overloaded ulong/u64 arguments
>
> b) used for address space management, rbtree look-ups etc. where the tag
> is no longer relevant and it even gets in the way
>
> We tried last year to identify void __user * casts to unsigned long
> using sparse on the assumption that pointers can be tagged while ulong
> is about address space management and needs to lose such tag. I think we
> could have pushed this further. For example, get_user_pages() takes an
> unsigned long but it is perfectly capable of untagging the address
> itself. Shall we change its first argument to void __user * (together
> with all its callers)?
>
> find_vma(), OTOH, could untag the address but it doesn't help since
> vm_start/end don't have such information (that's more about the content
> or type that the user decided) and the callers check against it.
>
> Are there any other places where this matters? These patches tracked
> down find_vma() as some heuristics but we may need better static
> analysis to identify other cases.
>
> > We need something that is going to work everywhere. And it needs to be
> > supported by the kernel for the simple reason that the kernel needs to
> > do MTE checks during copy_from_user(): having that information stripped
> > means we lose any userspace-assigned MTE protections if they get handled
> > by the kernel, which is a total non-starter, IMO.
>
> Such feedback is welcomed ;).
>
> > As an aside: I think Sparc ADI support in Linux actually side-stepped
> > this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> > be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> > for kernel code.") I think this was a mistake we should not repeat for
> > arm64 (we do seem to be at least in agreement about this, I think).
> >
> > [1] https://lore.kernel.org/patchwork/patch/654481/
>
> I tried to drag the SPARC guys into this discussion but without much
> success.
>
> > > > This is a bit of a chicken-and-egg problem. In a world where memory
> > > > allocators on one or several popular platforms generate pointers with
> > > > non-zero tags, any such breakage will be caught in testing.
> > > > Unfortunately to reach that state we need the kernel to start
> > > > accepting tagged pointers first, and then hold on for a couple of
> > > > years until userspace catches up.
> > >
> > > Would the kernel also catch up with providing a stable ABI? Because we
> > > have two moving targets.
> > >
> > > On one hand, you have Android or some Linux distro that stick to a
> > > stable kernel version for some time, so they have better chance of
> > > clearing most of the problems. On the other hand, we have mainline
> > > kernel that gets over 500K lines every release. As maintainer, I can't
> > > rely on my testing alone as this is on a limited number of platforms. So
> > > my concern is that every kernel release has a significant chance of
> > > breaking the ABI, unless we have a better way of identifying potential
> > > issues.
> >
> > I just want to make sure I fully understand your concern about this
> > being an ABI break, and I work best with examples. The closest situation
> > I can see would be:
> >
> > - some program has no idea about MTE
>
> Apart from some libraries like libc (and maybe those that handle
> specific device ioctls), I think most programs should have no idea about
> MTE. I wouldn't expect programmers to have to change their app just
> because we have a new feature that colours heap allocations.
obviously i'm biased as a libc maintainer, but...
i don't think it helps to move this to libc --- now you just have an
extra dependency where to have a guaranteed working system you need to
update your kernel and libc together. (or at least update your libc to
understand new ioctls etc _before_ you can update your kernel.)
> > - malloc() starts returning MTE-tagged addresses
> > - program doesn't break from that change
> > - program uses some syscall that is missing untagged_addr() and fails
> > - kernel has now broken userspace that used to work
>
> That's one aspect though probably more of a case of plugging in a new
> device (graphics card, network etc.) and the ioctl to the new device
> doesn't work.
>
> The other is that, assuming we reach a point where the kernel entirely
> supports this relaxed ABI, can we guarantee that it won't break in the
> future. Let's say some subsequent kernel change (some refactoring)
> misses out an untagged_addr(). This renders a previously TBI/MTE-capable
> syscall unusable. Can we rely only on testing?
>
> > The trouble I see with this is that it is largely theoretical and
> > requires part of userspace to collude to start using a new CPU feature
> > that tickles a bug in the kernel. As I understand the golden rule,
> > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > not a global breaking of some userspace behavior.
>
> Yes, we should follow the rule that it's a kernel bug but it doesn't
> help the user that a newly installed kernel causes user space to no
> longer reach a prompt. Hence the proposal of an opt-in via personality
> (for MTE we would need an explicit opt-in by the user anyway since the
> top byte is no longer ignored but checked against the allocation tag).
but realistically would this actually get used in this way? or would
any given system either be MTE or non-MTE. in which case a kernel
configuration option would seem to make more sense. (because either
way, the hypothetical user basically needs to recompile the kernel to
get back on their feet. or all of userspace.)
i'm not sure i see this new way for a kernel update to break my system
and need to be fixed forward/rolled back as any different from any of
the existing ways in which this can happen :-) as an end-user i have
to rely on whoever's sending me software updates to test adequately
enough that they find the problems. as an end user, there isn't any
difference between "my phone rebooted when i tried to take a photo
because of a kernel/driver leak", say, and "my phone rebooted when i
tried to take a photo because of missing untagging of a pointer passed
via ioctl".
i suspect you and i have very different people in mind when we say "user" :-)
> > I feel like I'm missing something about this being seen as an ABI
> > break. The kernel already fails on userspace addresses that have high
> > bits set -- are there things that _depend_ on this failure to operate?
>
> It's about providing a relaxed ABI which allows non-zero top byte and
> breaking it later inadvertently without having something better in place
> to analyse the kernel changes.
>
> Thanks.
>
> --
> Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 15:30 ` enh@ 2019-05-22 16:35 ` Catalin Marinas
2019-05-22 16:58 ` enh
2019-05-22 20:47 ` Kees Cook
2019-05-22 19:21 ` Kees Cook1 sibling, 2 replies; 99+ messages in thread
From: Catalin Marinas @ 2019-05-22 16:35 UTC (permalink / raw)
To: enh
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > I just want to make sure I fully understand your concern about this
> > > being an ABI break, and I work best with examples. The closest situation
> > > I can see would be:
> > >
> > > - some program has no idea about MTE
> >
> > Apart from some libraries like libc (and maybe those that handle
> > specific device ioctls), I think most programs should have no idea about
> > MTE. I wouldn't expect programmers to have to change their app just
> > because we have a new feature that colours heap allocations.
>
> obviously i'm biased as a libc maintainer, but...
>
> i don't think it helps to move this to libc --- now you just have an
> extra dependency where to have a guaranteed working system you need to
> update your kernel and libc together. (or at least update your libc to
> understand new ioctls etc _before_ you can update your kernel.)
That's not what I meant (or I misunderstood you). If we have a relaxed
ABI in the kernel and a libc that returns tagged pointers on malloc() I
wouldn't expect the programmer to do anything different in the
application code like explicit untagging. Basically the program would
continue to run unmodified irrespective of whether you use an old libc
without tagged pointers or a new one which tags heap allocations.
What I do expect is that the libc checks for the presence of the relaxed
ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a
HWCAP_MTE), and only tag the malloc() pointers if the kernel supports
the relaxed ABI. As you said, you shouldn't expect that the C library
and kernel are upgraded together, so they should be able to work in any
new/old version combination.
> > > The trouble I see with this is that it is largely theoretical and
> > > requires part of userspace to collude to start using a new CPU feature
> > > that tickles a bug in the kernel. As I understand the golden rule,
> > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > not a global breaking of some userspace behavior.
> >
> > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > help the user that a newly installed kernel causes user space to no
> > longer reach a prompt. Hence the proposal of an opt-in via personality
> > (for MTE we would need an explicit opt-in by the user anyway since the
> > top byte is no longer ignored but checked against the allocation tag).
>
> but realistically would this actually get used in this way? or would
> any given system either be MTE or non-MTE. in which case a kernel
> configuration option would seem to make more sense. (because either
> way, the hypothetical user basically needs to recompile the kernel to
> get back on their feet. or all of userspace.)
The two hard requirements I have for supporting any new hardware feature
in Linux are (1) a single kernel image binary continues to run on old
hardware while making use of the new feature if available and (2) old
user space continues to run on new hardware while new user space can
take advantage of the new feature.
The distro user space usually has a hard requirement that it continues
to run on (certain) old hardware. We can't enforce this in the kernel
but we offer the option to user space developers of checking feature
availability through HWCAP bits.
The Android story may be different as you have more control about which
kernel configurations are deployed on specific SoCs. I'm looking more
from a Linux distro angle where you just get an off-the-shelf OS image
and install it on your hardware, either taking advantage of new features
or just not using them if the software was not updated. Or, if updated
software is installed on old hardware, it would just run.
For MTE, we just can't enable it by default since there are applications
who use the top byte of a pointer and expect it to be ignored rather
than failing with a mismatched tag. Just think of a hwasan compiled
binary where TBI is expected to work and you try to run it with MTE
turned on.
I would also expect the C library or dynamic loader to check for the
presence of a HWCAP_MTE bit before starting to tag memory allocations,
otherwise it would get SIGILL on the first MTE instruction it tries to
execute.
> i'm not sure i see this new way for a kernel update to break my system
> and need to be fixed forward/rolled back as any different from any of
> the existing ways in which this can happen :-) as an end-user i have
> to rely on whoever's sending me software updates to test adequately
> enough that they find the problems. as an end user, there isn't any
> difference between "my phone rebooted when i tried to take a photo
> because of a kernel/driver leak", say, and "my phone rebooted when i
> tried to take a photo because of missing untagging of a pointer passed
> via ioctl".
>
> i suspect you and i have very different people in mind when we say "user" :-)
Indeed, I think we have different users in mind. I didn't mean the end
user who doesn't really care which C library version it's running on
their phone but rather advanced users (not necessarily kernel
developers) that prefer to build their own kernels with every release.
We could extend this to kernel developers who don't have time to track
down why a new kernel triggers lots of SIGSEGVs during boot.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 16:35 ` Catalin Marinas@ 2019-05-22 16:58 ` enh
2019-05-23 15:21 ` Catalin Marinas
2019-05-22 20:47 ` Kees Cook1 sibling, 1 reply; 99+ messages in thread
From: enh @ 2019-05-22 16:58 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 9:35 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > > I just want to make sure I fully understand your concern about this
> > > > being an ABI break, and I work best with examples. The closest situation
> > > > I can see would be:
> > > >
> > > > - some program has no idea about MTE
> > >
> > > Apart from some libraries like libc (and maybe those that handle
> > > specific device ioctls), I think most programs should have no idea about
> > > MTE. I wouldn't expect programmers to have to change their app just
> > > because we have a new feature that colours heap allocations.
> >
> > obviously i'm biased as a libc maintainer, but...
> >
> > i don't think it helps to move this to libc --- now you just have an
> > extra dependency where to have a guaranteed working system you need to
> > update your kernel and libc together. (or at least update your libc to
> > understand new ioctls etc _before_ you can update your kernel.)
>
> That's not what I meant (or I misunderstood you). If we have a relaxed
> ABI in the kernel and a libc that returns tagged pointers on malloc() I
> wouldn't expect the programmer to do anything different in the
> application code like explicit untagging. Basically the program would
> continue to run unmodified irrespective of whether you use an old libc
> without tagged pointers or a new one which tags heap allocations.
>
> What I do expect is that the libc checks for the presence of the relaxed
> ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a
> HWCAP_MTE), and only tag the malloc() pointers if the kernel supports
> the relaxed ABI. As you said, you shouldn't expect that the C library
> and kernel are upgraded together, so they should be able to work in any
> new/old version combination.
yes, that part makes sense. i do think we'd use the AT_FLAGS bit, for
exactly this.
i was questioning the argument about the ioctl issues, and saying that
from my perspective, untagging bugs are not really any different than
any other kind of kernel bug.
> > > > The trouble I see with this is that it is largely theoretical and
> > > > requires part of userspace to collude to start using a new CPU feature
> > > > that tickles a bug in the kernel. As I understand the golden rule,
> > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > > not a global breaking of some userspace behavior.
> > >
> > > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > > help the user that a newly installed kernel causes user space to no
> > > longer reach a prompt. Hence the proposal of an opt-in via personality
> > > (for MTE we would need an explicit opt-in by the user anyway since the
> > > top byte is no longer ignored but checked against the allocation tag).
> >
> > but realistically would this actually get used in this way? or would
> > any given system either be MTE or non-MTE. in which case a kernel
> > configuration option would seem to make more sense. (because either
> > way, the hypothetical user basically needs to recompile the kernel to
> > get back on their feet. or all of userspace.)
>
> The two hard requirements I have for supporting any new hardware feature
> in Linux are (1) a single kernel image binary continues to run on old
> hardware while making use of the new feature if available and (2) old
> user space continues to run on new hardware while new user space can
> take advantage of the new feature.
>
> The distro user space usually has a hard requirement that it continues
> to run on (certain) old hardware. We can't enforce this in the kernel
> but we offer the option to user space developers of checking feature
> availability through HWCAP bits.
>
> The Android story may be different as you have more control about which
> kernel configurations are deployed on specific SoCs. I'm looking more
> from a Linux distro angle where you just get an off-the-shelf OS image
> and install it on your hardware, either taking advantage of new features
> or just not using them if the software was not updated. Or, if updated
> software is installed on old hardware, it would just run.
>
> For MTE, we just can't enable it by default since there are applications
> who use the top byte of a pointer and expect it to be ignored rather
> than failing with a mismatched tag. Just think of a hwasan compiled
> binary where TBI is expected to work and you try to run it with MTE
> turned on.
>
> I would also expect the C library or dynamic loader to check for the
> presence of a HWCAP_MTE bit before starting to tag memory allocations,
> otherwise it would get SIGILL on the first MTE instruction it tries to
> execute.
(a bit off-topic, but i thought the MTE instructions were encoded in
the no-op space, to avoid this?)
> > i'm not sure i see this new way for a kernel update to break my system
> > and need to be fixed forward/rolled back as any different from any of
> > the existing ways in which this can happen :-) as an end-user i have
> > to rely on whoever's sending me software updates to test adequately
> > enough that they find the problems. as an end user, there isn't any
> > difference between "my phone rebooted when i tried to take a photo
> > because of a kernel/driver leak", say, and "my phone rebooted when i
> > tried to take a photo because of missing untagging of a pointer passed
> > via ioctl".
> >
> > i suspect you and i have very different people in mind when we say "user" :-)
>
> Indeed, I think we have different users in mind. I didn't mean the end
> user who doesn't really care which C library version it's running on
> their phone but rather advanced users (not necessarily kernel
> developers) that prefer to build their own kernels with every release.
> We could extend this to kernel developers who don't have time to track
> down why a new kernel triggers lots of SIGSEGVs during boot.
i still don't see how this isn't just a regular testing/CI issue, the
same as any other kind of kernel bug. it's already the case that i can
get a bad kernel...
> --
> Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 15:30 ` enh
2019-05-22 16:35 ` Catalin Marinas@ 2019-05-22 19:21 ` Kees Cook
2019-05-22 20:15 ` enh
2019-05-23 15:08 ` Catalin Marinas1 sibling, 2 replies; 99+ messages in thread
From: Kees Cook @ 2019-05-22 19:21 UTC (permalink / raw)
To: enh
Cc: Mark Rutland, kvm, Szabolcs Nagy, Catalin Marinas, Will Deacon,
dri-devel, Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > I just want to make sure I fully understand your concern about this
> > > being an ABI break, and I work best with examples. The closest situation
> > > I can see would be:
> > >
> > > - some program has no idea about MTE
> >
> > Apart from some libraries like libc (and maybe those that handle
> > specific device ioctls), I think most programs should have no idea about
> > MTE. I wouldn't expect programmers to have to change their app just
> > because we have a new feature that colours heap allocations.
Right -- things should Just Work from the application perspective.
> obviously i'm biased as a libc maintainer, but...
>
> i don't think it helps to move this to libc --- now you just have an
> extra dependency where to have a guaranteed working system you need to
> update your kernel and libc together. (or at least update your libc to
> understand new ioctls etc _before_ you can update your kernel.)
I think (hope?) we've all agreed that we shouldn't pass this off to
userspace. At the very least, it reduces the utility of MTE, and at worst
it complicates userspace when this is clearly a kernel/architecture issue.
>
> > > - malloc() starts returning MTE-tagged addresses
> > > - program doesn't break from that change
> > > - program uses some syscall that is missing untagged_addr() and fails
> > > - kernel has now broken userspace that used to work
> >
> > That's one aspect though probably more of a case of plugging in a new
> > device (graphics card, network etc.) and the ioctl to the new device
> > doesn't work.
I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will
be glitches, and we can disable stuff either via CONFIG or (as is more
common now) via a kernel commandline with untagged_addr() containing a
static branch, etc. But I actually don't think we need to go this route
(see below...)
> > The other is that, assuming we reach a point where the kernel entirely
> > supports this relaxed ABI, can we guarantee that it won't break in the
> > future. Let's say some subsequent kernel change (some refactoring)
> > misses out an untagged_addr(). This renders a previously TBI/MTE-capable
> > syscall unusable. Can we rely only on testing?
> >
> > > The trouble I see with this is that it is largely theoretical and
> > > requires part of userspace to collude to start using a new CPU feature
> > > that tickles a bug in the kernel. As I understand the golden rule,
> > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > not a global breaking of some userspace behavior.
> >
> > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > help the user that a newly installed kernel causes user space to no
> > longer reach a prompt. Hence the proposal of an opt-in via personality
> > (for MTE we would need an explicit opt-in by the user anyway since the
> > top byte is no longer ignored but checked against the allocation tag).
>
> but realistically would this actually get used in this way? or would
> any given system either be MTE or non-MTE. in which case a kernel
> configuration option would seem to make more sense. (because either
> way, the hypothetical user basically needs to recompile the kernel to
> get back on their feet. or all of userspace.)
Right: the point is to design things so that we do our best to not break
userspace that is using the new feature (which I think this series has
done well). But supporting MTE/TBI is just like supporting PAN: if someone
refactors a driver and swaps a copy_from_user() to a memcpy(), it's going
to break under PAN. There will be the same long tail of these bugs like
any other, but my sense is that they are small and rare. But I agree:
they're going to be pretty weird bugs to track down. The final result,
however, will be excellent annotation in the kernel for where userspace
addresses get used and people make assumptions about them.
The sooner we get the series landed and gain QEMU support (or real
hardware), the faster we can hammer out these missed corner-cases.
What's the timeline for either of those things, BTW?
> > > I feel like I'm missing something about this being seen as an ABI
> > > break. The kernel already fails on userspace addresses that have high
> > > bits set -- are there things that _depend_ on this failure to operate?
> >
> > It's about providing a relaxed ABI which allows non-zero top byte and
> > breaking it later inadvertently without having something better in place
> > to analyse the kernel changes.
It sounds like the question is how to switch a process in or out of this
ABI (but I don't think that's the real issue: I think it's just a matter
of whether or not a process uses tags at all). Doing it at the prctl()
level doesn't make sense to me, except maybe to detect MTE support or
something. ("Should I tag allocations?") And that state is controlled
by the kernel: the kernel does it or it doesn't.
If a process wants to not tag, that's also up to the allocator where
it can decide not to ask the kernel, and just not tag. Nothing breaks in
userspace if a process is NOT tagging and untagged_addr() exists or is
missing. This, I think, is the core way this doesn't trip over the
golden rule: an old system image will run fine (because it's not
tagging). A *new* system may encounter bugs with tagging because it's a
new feature: this is The Way Of Things. But we don't break old userspace
because old userspace isn't using tags.
So the agreement appears to be between the kernel and the allocator.
Kernel says "I support this" or not. Telling the allocator to not tag if
something breaks sounds like an entirely userspace decision, yes?
--
Kees Cook
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 19:21 ` Kees Cook@ 2019-05-22 20:15 ` enh
2019-05-23 15:08 ` Catalin Marinas1 sibling, 0 replies; 99+ messages in thread
From: enh @ 2019-05-22 20:15 UTC (permalink / raw)
To: Kees Cook
Cc: Mark Rutland, kvm, Szabolcs Nagy, Catalin Marinas, Will Deacon,
dri-devel, Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 12:21 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > > I just want to make sure I fully understand your concern about this
> > > > being an ABI break, and I work best with examples. The closest situation
> > > > I can see would be:
> > > >
> > > > - some program has no idea about MTE
> > >
> > > Apart from some libraries like libc (and maybe those that handle
> > > specific device ioctls), I think most programs should have no idea about
> > > MTE. I wouldn't expect programmers to have to change their app just
> > > because we have a new feature that colours heap allocations.
>
> Right -- things should Just Work from the application perspective.
>
> > obviously i'm biased as a libc maintainer, but...
> >
> > i don't think it helps to move this to libc --- now you just have an
> > extra dependency where to have a guaranteed working system you need to
> > update your kernel and libc together. (or at least update your libc to
> > understand new ioctls etc _before_ you can update your kernel.)
>
> I think (hope?) we've all agreed that we shouldn't pass this off to
> userspace. At the very least, it reduces the utility of MTE, and at worst
> it complicates userspace when this is clearly a kernel/architecture issue.
>
> >
> > > > - malloc() starts returning MTE-tagged addresses
> > > > - program doesn't break from that change
> > > > - program uses some syscall that is missing untagged_addr() and fails
> > > > - kernel has now broken userspace that used to work
> > >
> > > That's one aspect though probably more of a case of plugging in a new
> > > device (graphics card, network etc.) and the ioctl to the new device
> > > doesn't work.
>
> I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will
> be glitches, and we can disable stuff either via CONFIG or (as is more
> common now) via a kernel commandline with untagged_addr() containing a
> static branch, etc. But I actually don't think we need to go this route
> (see below...)
>
> > > The other is that, assuming we reach a point where the kernel entirely
> > > supports this relaxed ABI, can we guarantee that it won't break in the
> > > future. Let's say some subsequent kernel change (some refactoring)
> > > misses out an untagged_addr(). This renders a previously TBI/MTE-capable
> > > syscall unusable. Can we rely only on testing?
> > >
> > > > The trouble I see with this is that it is largely theoretical and
> > > > requires part of userspace to collude to start using a new CPU feature
> > > > that tickles a bug in the kernel. As I understand the golden rule,
> > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > > not a global breaking of some userspace behavior.
> > >
> > > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > > help the user that a newly installed kernel causes user space to no
> > > longer reach a prompt. Hence the proposal of an opt-in via personality
> > > (for MTE we would need an explicit opt-in by the user anyway since the
> > > top byte is no longer ignored but checked against the allocation tag).
> >
> > but realistically would this actually get used in this way? or would
> > any given system either be MTE or non-MTE. in which case a kernel
> > configuration option would seem to make more sense. (because either
> > way, the hypothetical user basically needs to recompile the kernel to
> > get back on their feet. or all of userspace.)
>
> Right: the point is to design things so that we do our best to not break
> userspace that is using the new feature (which I think this series has
> done well). But supporting MTE/TBI is just like supporting PAN: if someone
> refactors a driver and swaps a copy_from_user() to a memcpy(), it's going
> to break under PAN. There will be the same long tail of these bugs like
> any other, but my sense is that they are small and rare. But I agree:
> they're going to be pretty weird bugs to track down. The final result,
> however, will be excellent annotation in the kernel for where userspace
> addresses get used and people make assumptions about them.
>
> The sooner we get the series landed and gain QEMU support (or real
> hardware), the faster we can hammer out these missed corner-cases.
> What's the timeline for either of those things, BTW?
>
> > > > I feel like I'm missing something about this being seen as an ABI
> > > > break. The kernel already fails on userspace addresses that have high
> > > > bits set -- are there things that _depend_ on this failure to operate?
> > >
> > > It's about providing a relaxed ABI which allows non-zero top byte and
> > > breaking it later inadvertently without having something better in place
> > > to analyse the kernel changes.
>
> It sounds like the question is how to switch a process in or out of this
> ABI (but I don't think that's the real issue: I think it's just a matter
> of whether or not a process uses tags at all). Doing it at the prctl()
> level doesn't make sense to me, except maybe to detect MTE support or
> something. ("Should I tag allocations?") And that state is controlled
> by the kernel: the kernel does it or it doesn't.
>
> If a process wants to not tag, that's also up to the allocator where
> it can decide not to ask the kernel, and just not tag. Nothing breaks in
> userspace if a process is NOT tagging and untagged_addr() exists or is
> missing. This, I think, is the core way this doesn't trip over the
> golden rule: an old system image will run fine (because it's not
> tagging). A *new* system may encounter bugs with tagging because it's a
> new feature: this is The Way Of Things. But we don't break old userspace
> because old userspace isn't using tags.
>
> So the agreement appears to be between the kernel and the allocator.
> Kernel says "I support this" or not. Telling the allocator to not tag if
> something breaks sounds like an entirely userspace decision, yes?
sgtm, and the AT_FLAGS suggestion sounds fine for our needs in that regard.
> --
> Kees Cook
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 16:35 ` Catalin Marinas
2019-05-22 16:58 ` enh@ 2019-05-22 20:47 ` Kees Cook
2019-05-22 23:03 ` Evgenii Stepanov
2019-05-23 14:44 ` Catalin Marinas1 sibling, 2 replies; 99+ messages in thread
From: Kees Cook @ 2019-05-22 20:47 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, enh, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> The two hard requirements I have for supporting any new hardware feature
> in Linux are (1) a single kernel image binary continues to run on old
> hardware while making use of the new feature if available and (2) old
> user space continues to run on new hardware while new user space can
> take advantage of the new feature.
Agreed! And I think the series meets these requirements, yes?
> For MTE, we just can't enable it by default since there are applications
> who use the top byte of a pointer and expect it to be ignored rather
> than failing with a mismatched tag. Just think of a hwasan compiled
> binary where TBI is expected to work and you try to run it with MTE
> turned on.
Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
conflicting with MTE. And anything that starts using TBI suddenly can't
run in the future because it's being interpreted as MTE bits? (Is that
the ABI concern? I feel like we got into the weeds about ioctl()s and
one-off bugs...)
So there needs to be some way to let the kernel know which of three
things it should be doing:
1- leaving userspace addresses as-is (present)
2- wiping the top bits before using (this series)
3- wiping the top bits for most things, but retaining them for MTE as
needed (the future)
I expect MTE to be the "default" in the future. Once a system's libc has
grown support for it, everything will be trying to use MTE. TBI will be
the special case (but TBI is effectively a prerequisite).
AFAICT, the only difference I see between 2 and 3 will be the tag handling
in usercopy (all other places will continue to ignore the top bits). Is
that accurate?
Is "1" a per-process state we want to keep? (I assume not, but rather it
is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
To choose between "2" and "3", it seems we need a per-process flag to
opt into TBI (and out of MTE). For userspace, how would a future binary
choose TBI over MTE? If it's a library issue, we can't use an ELF bit,
since the choice may be "late" after ELF load (this implies the need
for a prctl().) If it's binary-only ("built with HWKASan") then an ELF
bit seems sufficient. And without the marking, I'd expect the kernel to
enforce MTE when there are high bits.
> I would also expect the C library or dynamic loader to check for the
> presence of a HWCAP_MTE bit before starting to tag memory allocations,
> otherwise it would get SIGILL on the first MTE instruction it tries to
> execute.
I've got the same question as Elliot: aren't MTE instructions just NOP
to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
gets entirely ignored: checking is only needed to satisfy curiosity
or behavioral expectations.
To me, the conflict seems to be using TBI in the face of expecting MTE to
be the default state of the future. (But the internal changes needed
for TBI -- this series -- is a prereq for MTE.)
--
Kees Cook
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 20:47 ` Kees Cook@ 2019-05-22 23:03 ` Evgenii Stepanov
2019-05-22 23:09 ` enh
2019-05-23 14:44 ` Catalin Marinas1 sibling, 1 reply; 99+ messages in thread
From: Evgenii Stepanov @ 2019-05-22 23:03 UTC (permalink / raw)
To: Kees Cook
Cc: Mark Rutland, kvm, Szabolcs Nagy, Catalin Marinas, Will Deacon,
dri-devel, Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Felix Kuehling,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, Dmitry Vyukov, Dave Martin, linux-media, Kevin Brodsky,
Ruben Ayrapetyan, Andrey Konovalov, Ramana Radhakrishnan,
Alex Williamson, Mauro Carvalho Chehab, Linux ARM,
Kostya Serebryany, Greg Kroah-Hartman, Yishai Hadas, LKML,
Jens Wiklander, Lee Smith, Alexander Deucher, Andrew Morton, enh,
Robin Murphy, Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 1:47 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > The two hard requirements I have for supporting any new hardware feature
> > in Linux are (1) a single kernel image binary continues to run on old
> > hardware while making use of the new feature if available and (2) old
> > user space continues to run on new hardware while new user space can
> > take advantage of the new feature.
>
> Agreed! And I think the series meets these requirements, yes?
>
> > For MTE, we just can't enable it by default since there are applications
> > who use the top byte of a pointer and expect it to be ignored rather
> > than failing with a mismatched tag. Just think of a hwasan compiled
> > binary where TBI is expected to work and you try to run it with MTE
> > turned on.
>
> Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> conflicting with MTE. And anything that starts using TBI suddenly can't
> run in the future because it's being interpreted as MTE bits? (Is that
> the ABI concern? I feel like we got into the weeds about ioctl()s and
> one-off bugs...)
>
> So there needs to be some way to let the kernel know which of three
> things it should be doing:
> 1- leaving userspace addresses as-is (present)
> 2- wiping the top bits before using (this series)
> 3- wiping the top bits for most things, but retaining them for MTE as
> needed (the future)
>
> I expect MTE to be the "default" in the future. Once a system's libc has
> grown support for it, everything will be trying to use MTE. TBI will be
> the special case (but TBI is effectively a prerequisite).
>
> AFAICT, the only difference I see between 2 and 3 will be the tag handling
> in usercopy (all other places will continue to ignore the top bits). Is
> that accurate?
>
> Is "1" a per-process state we want to keep? (I assume not, but rather it
> is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
>
> To choose between "2" and "3", it seems we need a per-process flag to
> opt into TBI (and out of MTE). For userspace, how would a future binary
> choose TBI over MTE? If it's a library issue, we can't use an ELF bit,
> since the choice may be "late" after ELF load (this implies the need
> for a prctl().) If it's binary-only ("built with HWKASan") then an ELF
> bit seems sufficient. And without the marking, I'd expect the kernel to
> enforce MTE when there are high bits.
>
> > I would also expect the C library or dynamic loader to check for the
> > presence of a HWCAP_MTE bit before starting to tag memory allocations,
> > otherwise it would get SIGILL on the first MTE instruction it tries to
> > execute.
>
> I've got the same question as Elliot: aren't MTE instructions just NOP
> to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
> gets entirely ignored: checking is only needed to satisfy curiosity
> or behavioral expectations.
MTE instructions are not NOP. Most of them have side effects (changing
register values, zeroing memory).
This only matters for stack tagging, though. Heap tagging is a runtime
decision in the allocator.
If an image needs to run on old hardware, it will have to do heap tagging only.
> To me, the conflict seems to be using TBI in the face of expecting MTE to
> be the default state of the future. (But the internal changes needed
> for TBI -- this series -- is a prereq for MTE.)
>
> --
> Kees Cook
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 23:03 ` Evgenii Stepanov@ 2019-05-22 23:09 ` enh
2019-05-23 7:34 ` Catalin Marinas0 siblings, 1 reply; 99+ messages in thread
From: enh @ 2019-05-22 23:09 UTC (permalink / raw)
To: Evgenii Stepanov
Cc: Mark Rutland, kvm, Szabolcs Nagy, Catalin Marinas, Will Deacon,
dri-devel, Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Felix Kuehling,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, Dmitry Vyukov, Dave Martin, linux-media, Kevin Brodsky,
Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Mauro Carvalho Chehab,
Linux ARM, Kostya Serebryany, Greg Kroah-Hartman, Yishai Hadas,
LKML, Jens Wiklander, Lee Smith, Alexander Deucher,
Andrew Morton, Robin Murphy, Christian Koenig,
Luc Van Oostenryck
On Wed, May 22, 2019 at 4:03 PM Evgenii Stepanov <eugenis@google.com> wrote:
>
> On Wed, May 22, 2019 at 1:47 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > > The two hard requirements I have for supporting any new hardware feature
> > > in Linux are (1) a single kernel image binary continues to run on old
> > > hardware while making use of the new feature if available and (2) old
> > > user space continues to run on new hardware while new user space can
> > > take advantage of the new feature.
> >
> > Agreed! And I think the series meets these requirements, yes?
> >
> > > For MTE, we just can't enable it by default since there are applications
> > > who use the top byte of a pointer and expect it to be ignored rather
> > > than failing with a mismatched tag. Just think of a hwasan compiled
> > > binary where TBI is expected to work and you try to run it with MTE
> > > turned on.
> >
> > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> > conflicting with MTE. And anything that starts using TBI suddenly can't
> > run in the future because it's being interpreted as MTE bits? (Is that
> > the ABI concern? I feel like we got into the weeds about ioctl()s and
> > one-off bugs...)
> >
> > So there needs to be some way to let the kernel know which of three
> > things it should be doing:
> > 1- leaving userspace addresses as-is (present)
> > 2- wiping the top bits before using (this series)
> > 3- wiping the top bits for most things, but retaining them for MTE as
> > needed (the future)
> >
> > I expect MTE to be the "default" in the future. Once a system's libc has
> > grown support for it, everything will be trying to use MTE. TBI will be
> > the special case (but TBI is effectively a prerequisite).
> >
> > AFAICT, the only difference I see between 2 and 3 will be the tag handling
> > in usercopy (all other places will continue to ignore the top bits). Is
> > that accurate?
> >
> > Is "1" a per-process state we want to keep? (I assume not, but rather it
> > is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
> >
> > To choose between "2" and "3", it seems we need a per-process flag to
> > opt into TBI (and out of MTE). For userspace, how would a future binary
> > choose TBI over MTE? If it's a library issue, we can't use an ELF bit,
> > since the choice may be "late" after ELF load (this implies the need
> > for a prctl().) If it's binary-only ("built with HWKASan") then an ELF
> > bit seems sufficient. And without the marking, I'd expect the kernel to
> > enforce MTE when there are high bits.
> >
> > > I would also expect the C library or dynamic loader to check for the
> > > presence of a HWCAP_MTE bit before starting to tag memory allocations,
> > > otherwise it would get SIGILL on the first MTE instruction it tries to
> > > execute.
> >
> > I've got the same question as Elliot: aren't MTE instructions just NOP
> > to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
> > gets entirely ignored: checking is only needed to satisfy curiosity
> > or behavioral expectations.
>
> MTE instructions are not NOP. Most of them have side effects (changing
> register values, zeroing memory).
no, i meant "they're encoded in a space that was previously no-ops, so
running on MTE code on old hardware doesn't cause SIGILL".
> This only matters for stack tagging, though. Heap tagging is a runtime
> decision in the allocator.
>
> If an image needs to run on old hardware, it will have to do heap tagging only.
>
> > To me, the conflict seems to be using TBI in the face of expecting MTE to
> > be the default state of the future. (But the internal changes needed
> > for TBI -- this series -- is a prereq for MTE.)
> >
> > --
> > Kees Cook
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 13:49 ` Dave Martin@ 2019-05-23 0:20 ` Jason Gunthorpe
2019-05-23 10:42 ` Dave Martin0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 0:20 UTC (permalink / raw)
To: Dave Martin
Cc: Mark Rutland, kvm, Christian Koenig, Szabolcs Nagy,
Catalin Marinas, Will Deacon, dri-devel, linux-mm, Lee Smith,
linux-kselftest, Vincenzo Frascino, Jacob Bramley,
Leon Romanovsky, linux-rdma, amd-gfx, linux-arm-kernel,
Evgeniy Stepanov, linux-media, Kees Cook, Ruben Ayrapetyan,
Andrey Konovalov, Kevin Brodsky, Alex Williamson,
Mauro Carvalho Chehab, Dmitry Vyukov, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, linux-kernel, Jens Wiklander,
Ramana Radhakrishnan, Alexander Deucher, Andrew Morton,
Robin Murphy, Yishai Hadas, Luc Van Oostenryck
On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
> >
> > > The tagged pointers (whether hwasan or MTE) should ideally be a
> > > transparent feature for the application writer but I don't think we can
> > > solve it entirely and make it seamless for the multitude of ioctls().
> > > I'd say you only opt in to such feature if you know what you are doing
> > > and the user code takes care of specific cases like ioctl(), hence the
> > > prctl() proposal even for the hwasan.
> >
> > I'm not sure such a dire view is warrented..
> >
> > The ioctl situation is not so bad, other than a few special cases,
> > most drivers just take a 'void __user *' and pass it as an argument to
> > some function that accepts a 'void __user *'. sparse et al verify
> > this.
> >
> > As long as the core functions do the right thing the drivers will be
> > OK.
> >
> > The only place things get dicy is if someone casts to unsigned long
> > (ie for vma work) but I think that reflects that our driver facing
> > APIs for VMAs are compatible with static analysis (ie I have no
> > earthly idea why get_user_pages() accepts an unsigned long), not that
> > this is too hard.
>
> If multiple people will care about this, perhaps we should try to
> annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> structures.
>
> For example, we could have a couple of mutually exclusive modifiers
>
> T __object *
> T __vaddr * (or U __vaddr)
>
> In the first case the pointer points to an object (in the C sense)
> that the call may dereference but not use for any other purpose.
How would you use these two differently?
So far the kernel has worked that __user should tag any pointer that
is from userspace and then you can't do anything with it until you
transform it into a kernel something
> to tell static analysers the real type of pointers smuggled through
> UAPI disguised as other types (*cough* KVM, etc.)
Yes, that would help alot, we often have to pass pointers through a
u64 in the uAPI, and there is no static checker support to make sure
they are run through the u64_to_user_ptr() helper.
Jason
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-23 0:20 ` Jason Gunthorpe@ 2019-05-23 10:42 ` Dave Martin
2019-05-23 16:57 ` Catalin Marinas0 siblings, 1 reply; 99+ messages in thread
From: Dave Martin @ 2019-05-23 10:42 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mark Rutland, kvm, Szabolcs Nagy, Catalin Marinas, Will Deacon,
dri-devel, linux-mm, linux-kselftest, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Ramana Radhakrishnan, Evgeniy Stepanov,
linux-media, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Kevin Brodsky, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, linux-arm-kernel, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, linux-kernel, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> > On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
> > >
> > > > The tagged pointers (whether hwasan or MTE) should ideally be a
> > > > transparent feature for the application writer but I don't think we can
> > > > solve it entirely and make it seamless for the multitude of ioctls().
> > > > I'd say you only opt in to such feature if you know what you are doing
> > > > and the user code takes care of specific cases like ioctl(), hence the
> > > > prctl() proposal even for the hwasan.
> > >
> > > I'm not sure such a dire view is warrented..
> > >
> > > The ioctl situation is not so bad, other than a few special cases,
> > > most drivers just take a 'void __user *' and pass it as an argument to
> > > some function that accepts a 'void __user *'. sparse et al verify
> > > this.
> > >
> > > As long as the core functions do the right thing the drivers will be
> > > OK.
> > >
> > > The only place things get dicy is if someone casts to unsigned long
> > > (ie for vma work) but I think that reflects that our driver facing
> > > APIs for VMAs are compatible with static analysis (ie I have no
> > > earthly idea why get_user_pages() accepts an unsigned long), not that
> > > this is too hard.
> >
> > If multiple people will care about this, perhaps we should try to
> > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> > structures.
> >
> > For example, we could have a couple of mutually exclusive modifiers
> >
> > T __object *
> > T __vaddr * (or U __vaddr)
> >
> > In the first case the pointer points to an object (in the C sense)
> > that the call may dereference but not use for any other purpose.
>
> How would you use these two differently?
>
> So far the kernel has worked that __user should tag any pointer that
> is from userspace and then you can't do anything with it until you
> transform it into a kernel something
Ultimately it would be good to disallow casting __object pointers execpt
to compatible __object pointer types, and to make get_user etc. demand
__object.
__vaddr pointers / addresses would be freely castable, but not to
__object and so would not be dereferenceable even indirectly.
Or that's the general idea. Figuring out a sane set of rules that we
could actually check / enforce would require a bit of work.
(Whether the __vaddr base type is a pointer or an integer type is
probably moot, due to the restrictions we would place on the use of
these anyway.)
> > to tell static analysers the real type of pointers smuggled through
> > UAPI disguised as other types (*cough* KVM, etc.)
>
> Yes, that would help alot, we often have to pass pointers through a
> u64 in the uAPI, and there is no static checker support to make sure
> they are run through the u64_to_user_ptr() helper.
Agreed.
Cheers
---Dave
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 20:47 ` Kees Cook
2019-05-22 23:03 ` Evgenii Stepanov@ 2019-05-23 14:44 ` Catalin Marinas
2019-05-23 15:44 ` enh
2019-05-23 16:38 ` Kees Cook1 sibling, 2 replies; 99+ messages in thread
From: Catalin Marinas @ 2019-05-23 14:44 UTC (permalink / raw)
To: Kees Cook
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, enh, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote:
> On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > The two hard requirements I have for supporting any new hardware feature
> > in Linux are (1) a single kernel image binary continues to run on old
> > hardware while making use of the new feature if available and (2) old
> > user space continues to run on new hardware while new user space can
> > take advantage of the new feature.
>
> Agreed! And I think the series meets these requirements, yes?
Yes. I mentioned this just to make sure people don't expect different
kernel builds for different hardware features.
There is also the obvious requirement which I didn't mention: new user
space continues to run on new/subsequent kernel versions. That's one of
the points of contention for this series (ignoring MTE) with the
maintainers having to guarantee this without much effort. IOW, do the
500K+ new lines in a subsequent kernel version break any user space out
there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
syskaller sufficient? Better static analysis would definitely help.
> > For MTE, we just can't enable it by default since there are applications
> > who use the top byte of a pointer and expect it to be ignored rather
> > than failing with a mismatched tag. Just think of a hwasan compiled
> > binary where TBI is expected to work and you try to run it with MTE
> > turned on.
>
> Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> conflicting with MTE. And anything that starts using TBI suddenly can't
> run in the future because it's being interpreted as MTE bits? (Is that
> the ABI concern?
That's another aspect to figure out when we add the MTE support. I don't
think we'd be able to do this without an explicit opt-in by the user.
Or, if we ever want MTE to be turned on by default (i.e. tag checking),
even if everything is tagged with 0, we have to disallow TBI for user
and this includes hwasan. There were a small number of programs using
the TBI (I think some JavaScript compilers tried this). But now we are
bringing in the hwasan support and this can be a large user base. Shall
we add an ELF note for such binaries that use TBI/hwasan?
This series is still required for MTE but we may decide not to relax the
ABI blindly, therefore the opt-in (prctl) or personality idea.
> I feel like we got into the weeds about ioctl()s and one-off bugs...)
This needs solving as well. Most driver developers won't know why
untagged_addr() is needed unless we have more rigorous types or type
annotations and a tool to check them (we should probably revive the old
sparse thread).
> So there needs to be some way to let the kernel know which of three
> things it should be doing:
> 1- leaving userspace addresses as-is (present)
> 2- wiping the top bits before using (this series)
(I'd say tolerating rather than wiping since get_user still uses the tag
in the current series)
The current series does not allow any choice between 1 and 2, the
default ABI basically becomes option 2.
> 3- wiping the top bits for most things, but retaining them for MTE as
> needed (the future)
2 and 3 are not entirely compatible as a tagged pointer may be checked
against the memory colour by the hardware. So you can't have hwasan
binary with MTE enabled.
> I expect MTE to be the "default" in the future. Once a system's libc has
> grown support for it, everything will be trying to use MTE. TBI will be
> the special case (but TBI is effectively a prerequisite).
The kernel handling of tagged pointers is indeed a prerequisite. The ABI
distinction between the above 2 and 3 needs to be solved.
> AFAICT, the only difference I see between 2 and 3 will be the tag handling
> in usercopy (all other places will continue to ignore the top bits). Is
> that accurate?
Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan
binary, it will SEGFAULT (either in user space or in kernel uaccess).
How does the kernel choose between 2 and 3?
> Is "1" a per-process state we want to keep? (I assume not, but rather it
> is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
Possibly, though not necessarily per process. For testing or if
something goes wrong during boot, a command line option with a static
label would do. The AT_FLAGS bit needs to be checked by user space. My
preference would be per-process.
> To choose between "2" and "3", it seems we need a per-process flag to
> opt into TBI (and out of MTE).
Or leave option 2 the default and get it to opt in to MTE.
> For userspace, how would a future binary choose TBI over MTE? If it's
> a library issue, we can't use an ELF bit, since the choice may be
> "late" after ELF load (this implies the need for a prctl().) If it's
> binary-only ("built with HWKASan") then an ELF bit seems sufficient.
> And without the marking, I'd expect the kernel to enforce MTE when
> there are high bits.
The current plan is that a future binary issues a prctl(), after
checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
are not in the current NOP space). I'd expect this to be done by the
libc or dynamic loader under the assumption that the binaries it loads
do _not_ use the top pointer byte for anything else. With hwasan
compiled objects this gets more confusing (any ELF note to identify
them?).
(there is also the risk of existing applications using TBI already but
I'm not aware of any still using this feature other than hwasan)
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-22 19:21 ` Kees Cook
2019-05-22 20:15 ` enh@ 2019-05-23 15:08 ` Catalin Marinas1 sibling, 0 replies; 99+ messages in thread
From: Catalin Marinas @ 2019-05-23 15:08 UTC (permalink / raw)
To: Kees Cook
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, enh, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Wed, May 22, 2019 at 12:21:27PM -0700, Kees Cook wrote:
> If a process wants to not tag, that's also up to the allocator where
> it can decide not to ask the kernel, and just not tag. Nothing breaks in
> userspace if a process is NOT tagging and untagged_addr() exists or is
> missing. This, I think, is the core way this doesn't trip over the
> golden rule: an old system image will run fine (because it's not
> tagging). A *new* system may encounter bugs with tagging because it's a
> new feature: this is The Way Of Things. But we don't break old userspace
> because old userspace isn't using tags.
With this series and hwasan binaries, at some point in the future they
will be considered "old userspace" and they do use pointer tags which
expect to be ignored by both the hardware and the kernel. MTE breaks
this assumption.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-23 14:44 ` Catalin Marinas@ 2019-05-23 15:44 ` enh
2019-05-23 17:00 ` Catalin Marinas
2019-05-23 16:38 ` Kees Cook1 sibling, 1 reply; 99+ messages in thread
From: enh @ 2019-05-23 15:44 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Thu, May 23, 2019 at 7:45 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote:
> > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > > The two hard requirements I have for supporting any new hardware feature
> > > in Linux are (1) a single kernel image binary continues to run on old
> > > hardware while making use of the new feature if available and (2) old
> > > user space continues to run on new hardware while new user space can
> > > take advantage of the new feature.
> >
> > Agreed! And I think the series meets these requirements, yes?
>
> Yes. I mentioned this just to make sure people don't expect different
> kernel builds for different hardware features.
>
> There is also the obvious requirement which I didn't mention: new user
> space continues to run on new/subsequent kernel versions. That's one of
> the points of contention for this series (ignoring MTE) with the
> maintainers having to guarantee this without much effort. IOW, do the
> 500K+ new lines in a subsequent kernel version break any user space out
> there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
> syskaller sufficient? Better static analysis would definitely help.
>
> > > For MTE, we just can't enable it by default since there are applications
> > > who use the top byte of a pointer and expect it to be ignored rather
> > > than failing with a mismatched tag. Just think of a hwasan compiled
> > > binary where TBI is expected to work and you try to run it with MTE
> > > turned on.
> >
> > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> > conflicting with MTE. And anything that starts using TBI suddenly can't
> > run in the future because it's being interpreted as MTE bits? (Is that
> > the ABI concern?
>
> That's another aspect to figure out when we add the MTE support. I don't
> think we'd be able to do this without an explicit opt-in by the user.
>
> Or, if we ever want MTE to be turned on by default (i.e. tag checking),
> even if everything is tagged with 0, we have to disallow TBI for user
> and this includes hwasan. There were a small number of programs using
> the TBI (I think some JavaScript compilers tried this). But now we are
> bringing in the hwasan support and this can be a large user base. Shall
> we add an ELF note for such binaries that use TBI/hwasan?
>
> This series is still required for MTE but we may decide not to relax the
> ABI blindly, therefore the opt-in (prctl) or personality idea.
>
> > I feel like we got into the weeds about ioctl()s and one-off bugs...)
>
> This needs solving as well. Most driver developers won't know why
> untagged_addr() is needed unless we have more rigorous types or type
> annotations and a tool to check them (we should probably revive the old
> sparse thread).
>
> > So there needs to be some way to let the kernel know which of three
> > things it should be doing:
> > 1- leaving userspace addresses as-is (present)
> > 2- wiping the top bits before using (this series)
>
> (I'd say tolerating rather than wiping since get_user still uses the tag
> in the current series)
>
> The current series does not allow any choice between 1 and 2, the
> default ABI basically becomes option 2.
>
> > 3- wiping the top bits for most things, but retaining them for MTE as
> > needed (the future)
>
> 2 and 3 are not entirely compatible as a tagged pointer may be checked
> against the memory colour by the hardware. So you can't have hwasan
> binary with MTE enabled.
>
> > I expect MTE to be the "default" in the future. Once a system's libc has
> > grown support for it, everything will be trying to use MTE. TBI will be
> > the special case (but TBI is effectively a prerequisite).
>
> The kernel handling of tagged pointers is indeed a prerequisite. The ABI
> distinction between the above 2 and 3 needs to be solved.
>
> > AFAICT, the only difference I see between 2 and 3 will be the tag handling
> > in usercopy (all other places will continue to ignore the top bits). Is
> > that accurate?
>
> Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan
> binary, it will SEGFAULT (either in user space or in kernel uaccess).
> How does the kernel choose between 2 and 3?
>
> > Is "1" a per-process state we want to keep? (I assume not, but rather it
> > is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
>
> Possibly, though not necessarily per process. For testing or if
> something goes wrong during boot, a command line option with a static
> label would do. The AT_FLAGS bit needs to be checked by user space. My
> preference would be per-process.
>
> > To choose between "2" and "3", it seems we need a per-process flag to
> > opt into TBI (and out of MTE).
>
> Or leave option 2 the default and get it to opt in to MTE.
>
> > For userspace, how would a future binary choose TBI over MTE? If it's
> > a library issue, we can't use an ELF bit, since the choice may be
> > "late" after ELF load (this implies the need for a prctl().) If it's
> > binary-only ("built with HWKASan") then an ELF bit seems sufficient.
> > And without the marking, I'd expect the kernel to enforce MTE when
> > there are high bits.
>
> The current plan is that a future binary issues a prctl(), after
> checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
> are not in the current NOP space). I'd expect this to be done by the
> libc or dynamic loader under the assumption that the binaries it loads
> do _not_ use the top pointer byte for anything else.
yeah, it sounds like to support hwasan and MTE, the dynamic linker
will need to not use either itself.
> With hwasan
> compiled objects this gets more confusing (any ELF note to identify
> them?).
no, at the moment code that wants to know checks for the presence of
__hwasan_init. (and bionic doesn't actually look at any ELF notes
right now.) but we can always add something if we need to.
> (there is also the risk of existing applications using TBI already but
> I'm not aware of any still using this feature other than hwasan)
>
> --
> Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-23 14:44 ` Catalin Marinas
2019-05-23 15:44 ` enh@ 2019-05-23 16:38 ` Kees Cook
2019-05-23 17:43 ` Catalin Marinas1 sibling, 1 reply; 99+ messages in thread
From: Kees Cook @ 2019-05-23 16:38 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, enh, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Thu, May 23, 2019 at 03:44:49PM +0100, Catalin Marinas wrote:
> There is also the obvious requirement which I didn't mention: new user
> space continues to run on new/subsequent kernel versions. That's one of
> the points of contention for this series (ignoring MTE) with the
> maintainers having to guarantee this without much effort. IOW, do the
> 500K+ new lines in a subsequent kernel version break any user space out
> there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
> syskaller sufficient? Better static analysis would definitely help.
We can't have perfect coverage of people actively (or accidentally)
working to trick static analyzers (and the compiler) into "forgetting"
about a __user annotation. We can certainly improve analysis (I see
the other sub-thread on that), but I'd like that work not to block
this series.
What on this front would you be comfortable with? Given it's a new
feature isn't it sufficient to have a CONFIG (and/or boot option)?
> Or, if we ever want MTE to be turned on by default (i.e. tag checking),
> even if everything is tagged with 0, we have to disallow TBI for user
> and this includes hwasan. There were a small number of programs using
> the TBI (I think some JavaScript compilers tried this). But now we are
> bringing in the hwasan support and this can be a large user base. Shall
> we add an ELF note for such binaries that use TBI/hwasan?
Just to be clear, you say "disallow TBI for user" -- you mean a
particular process, yes? i.e. there is no architectural limitation that
says once we're using MTE nothing can switch to TBI. i.e. a process is
either doing MTE or TBI (or nothing, but that's the same as TBI).
> This needs solving as well. Most driver developers won't know why
> untagged_addr() is needed unless we have more rigorous types or type
> annotations and a tool to check them (we should probably revive the old
> sparse thread).
This seems like a parallel concern: we can do that separately from this
series. Without landing it, is it much harder for people to test it,
look for bugs, help with types/annotations, etc.
> > So there needs to be some way to let the kernel know which of three
> > things it should be doing:
> > 1- leaving userspace addresses as-is (present)
> > 2- wiping the top bits before using (this series)
>
> (I'd say tolerating rather than wiping since get_user still uses the tag
> in the current series)
>
> The current series does not allow any choice between 1 and 2, the
> default ABI basically becomes option 2.
What about testing tools that intentionally insert high bits for syscalls
and are _expecting_ them to fail? It seems the TBI series will break them?
In that case, do we need to opt into TBI as well?
> > 3- wiping the top bits for most things, but retaining them for MTE as
> > needed (the future)
>
> 2 and 3 are not entirely compatible as a tagged pointer may be checked
> against the memory colour by the hardware. So you can't have hwasan
> binary with MTE enabled.
Right: a process must be either MTE or TBI, not both.
> > I expect MTE to be the "default" in the future. Once a system's libc has
> > grown support for it, everything will be trying to use MTE. TBI will be
> > the special case (but TBI is effectively a prerequisite).
>
> The kernel handling of tagged pointers is indeed a prerequisite. The ABI
> distinction between the above 2 and 3 needs to be solved.
Does that need solving now or when the MTE series appears? As there is
no reason to distinguish between "normal" and "TBI", that doesn't seem
to need solving at this point?
> > AFAICT, the only difference I see between 2 and 3 will be the tag handling
> > in usercopy (all other places will continue to ignore the top bits). Is
> > that accurate?
>
> Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan
> binary, it will SEGFAULT (either in user space or in kernel uaccess).
> How does the kernel choose between 2 and 3?
Right -- that was my question as well.
> > Is "1" a per-process state we want to keep? (I assume not, but rather it
> > is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
>
> Possibly, though not necessarily per process. For testing or if
> something goes wrong during boot, a command line option with a static
> label would do. The AT_FLAGS bit needs to be checked by user space. My
> preference would be per-process.
I would agree.
> > To choose between "2" and "3", it seems we need a per-process flag to
> > opt into TBI (and out of MTE).
>
> Or leave option 2 the default and get it to opt in to MTE.
Given that MTE has to "start" at some point in the binary lifetime, I'm
fine with opting into MTE. I do expect, though, this will feel redundant
in a couple years as everything will immediately opt-in. But, okay, this
is therefore an issue for the MTE series.
> The current plan is that a future binary issues a prctl(), after
> checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
> are not in the current NOP space). I'd expect this to be done by the
> libc or dynamic loader under the assumption that the binaries it loads
> do _not_ use the top pointer byte for anything else. With hwasan
> compiled objects this gets more confusing (any ELF note to identify
> them?).
Okay, sounds fine.
> (there is also the risk of existing applications using TBI already but
> I'm not aware of any still using this feature other than hwasan)
Correct.
Alright, the tl;dr appears to be:
- you want more assurances that we can find __user stripping in the
kernel more easily. (But this seems like a parallel problem.)
- we might need to opt in to TBI with a prctl()
- all other concerns are for the future MTE series (though it sounds
like HWCAP_MTE and a prctl() solve those issues too).
Is this accurate? What do you see as the blockers for this series at
this point?
--
Kees Cook
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-23 10:42 ` Dave Martin@ 2019-05-23 16:57 ` Catalin Marinas
2019-05-24 14:23 ` Dave Martin0 siblings, 1 reply; 99+ messages in thread
From: Catalin Marinas @ 2019-05-23 16:57 UTC (permalink / raw)
To: Dave Martin
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
linux-mm, linux-kselftest, Vincenzo Frascino, Jacob Bramley,
Leon Romanovsky, linux-rdma, amd-gfx, Jason Gunthorpe,
Dmitry Vyukov, Ramana Radhakrishnan, Evgeniy Stepanov,
linux-media, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Kevin Brodsky, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, linux-arm-kernel, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, linux-kernel, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote:
> On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> > > If multiple people will care about this, perhaps we should try to
> > > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> > > structures.
> > >
> > > For example, we could have a couple of mutually exclusive modifiers
> > >
> > > T __object *
> > > T __vaddr * (or U __vaddr)
> > >
> > > In the first case the pointer points to an object (in the C sense)
> > > that the call may dereference but not use for any other purpose.
> >
> > How would you use these two differently?
> >
> > So far the kernel has worked that __user should tag any pointer that
> > is from userspace and then you can't do anything with it until you
> > transform it into a kernel something
>
> Ultimately it would be good to disallow casting __object pointers execpt
> to compatible __object pointer types, and to make get_user etc. demand
> __object.
>
> __vaddr pointers / addresses would be freely castable, but not to
> __object and so would not be dereferenceable even indirectly.
I think it gets too complicated and there are ambiguous cases that we
may not be able to distinguish. For example copy_from_user() may be used
to copy a user data structure into the kernel, hence __object would
work, while the same function may be used to copy opaque data to a file,
so __vaddr may be a better option (unless I misunderstood your
proposal).
We currently have T __user * and I think it's a good starting point. The
prior attempt [1] was shut down because it was just hiding the cast
using __force. We'd need to work through those cases again and rather
start changing the function prototypes to avoid unnecessary casting in
the callers (e.g. get_user_pages(void __user *) or come up with a new
type) while changing the explicit casting to a macro where it needs to
be obvious that we are converting a user pointer, potentially typed
(tagged), to an untyped address range. We may need a user_ptr_to_ulong()
macro or similar (it seems that we have a u64_to_user_ptr, wasn't aware
of it).
It may actually not be far from what you suggested but I'd keep the
current T __user * to denote possible dereference.
[1] https://lore.kernel.org/lkml/5d54526e5ff2e5ad63d0dfdd9ab17cf359afa4f2.1535629099.git.andreyknvl@google.com/
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-23 16:38 ` Kees Cook@ 2019-05-23 17:43 ` Catalin Marinas
2019-05-23 21:31 ` Kees Cook0 siblings, 1 reply; 99+ messages in thread
From: Catalin Marinas @ 2019-05-23 17:43 UTC (permalink / raw)
To: Kees Cook
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, enh, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
> On Thu, May 23, 2019 at 03:44:49PM +0100, Catalin Marinas wrote:
> > There is also the obvious requirement which I didn't mention: new user
> > space continues to run on new/subsequent kernel versions. That's one of
> > the points of contention for this series (ignoring MTE) with the
> > maintainers having to guarantee this without much effort. IOW, do the
> > 500K+ new lines in a subsequent kernel version break any user space out
> > there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
> > syskaller sufficient? Better static analysis would definitely help.
>
> We can't have perfect coverage of people actively (or accidentally)
> working to trick static analyzers (and the compiler) into "forgetting"
> about a __user annotation. We can certainly improve analysis (I see
> the other sub-thread on that), but I'd like that work not to block
> this series.
I don't want to block this series either as it's a prerequisite for MTE
but at the moment I'm not confident we tracked down all the places where
things can break and what the future effort will be to analyse new
kernel changes.
We've made lots of progress since March last year (I think when the
first version was posted) by both identifying new places in the kernel
that need untagging and limiting the ranges that user space can tag
(e.g. an mmap() on a device should not allow tagged pointers). Can we
say we are done or more investigation is needed?
> What on this front would you be comfortable with? Given it's a new
> feature isn't it sufficient to have a CONFIG (and/or boot option)?
I'd rather avoid re-building kernels. A boot option would do, unless we
see value in a per-process (inherited) personality or prctl. The
per-process option has the slight advantage that I can boot my machine
normally while being able to run LTP with a relaxed ABI (and a new libc
which tags pointers). I admit I don't have a very strong argument on a
per-process opt-in.
Another option would be a sysctl control together with a cmdline option.
> > Or, if we ever want MTE to be turned on by default (i.e. tag checking),
> > even if everything is tagged with 0, we have to disallow TBI for user
> > and this includes hwasan. There were a small number of programs using
> > the TBI (I think some JavaScript compilers tried this). But now we are
> > bringing in the hwasan support and this can be a large user base. Shall
> > we add an ELF note for such binaries that use TBI/hwasan?
>
> Just to be clear, you say "disallow TBI for user" -- you mean a
> particular process, yes? i.e. there is no architectural limitation that
> says once we're using MTE nothing can switch to TBI. i.e. a process is
> either doing MTE or TBI (or nothing, but that's the same as TBI).
I may have answered this below. The architecture does not allow TBI
(i.e. faults) if MTE is enabled for a process and address range.
> > > So there needs to be some way to let the kernel know which of three
> > > things it should be doing:
> > > 1- leaving userspace addresses as-is (present)
> > > 2- wiping the top bits before using (this series)
> >
> > (I'd say tolerating rather than wiping since get_user still uses the tag
> > in the current series)
> >
> > The current series does not allow any choice between 1 and 2, the
> > default ABI basically becomes option 2.
>
> What about testing tools that intentionally insert high bits for syscalls
> and are _expecting_ them to fail? It seems the TBI series will break them?
> In that case, do we need to opt into TBI as well?
If there are such tools, then we may need a per-process control. It's
basically an ABI change.
> > > 3- wiping the top bits for most things, but retaining them for MTE as
> > > needed (the future)
> >
> > 2 and 3 are not entirely compatible as a tagged pointer may be checked
> > against the memory colour by the hardware. So you can't have hwasan
> > binary with MTE enabled.
>
> Right: a process must be either MTE or TBI, not both.
Indeed.
> > > I expect MTE to be the "default" in the future. Once a system's libc has
> > > grown support for it, everything will be trying to use MTE. TBI will be
> > > the special case (but TBI is effectively a prerequisite).
> >
> > The kernel handling of tagged pointers is indeed a prerequisite. The ABI
> > distinction between the above 2 and 3 needs to be solved.
>
> Does that need solving now or when the MTE series appears? As there is
> no reason to distinguish between "normal" and "TBI", that doesn't seem
> to need solving at this point?
We don't need to solve 2 vs 3 as long as option 3 is an explicit opt-in
(prctl) by the user. Otherwise the kernel cannot guess whether the user
is using TBI or not (and if it does and still opts in to MTE, we can say
it's a user problem ;)).
> > > To choose between "2" and "3", it seems we need a per-process flag to
> > > opt into TBI (and out of MTE).
> >
> > Or leave option 2 the default and get it to opt in to MTE.
>
> Given that MTE has to "start" at some point in the binary lifetime, I'm
> fine with opting into MTE. I do expect, though, this will feel redundant
> in a couple years as everything will immediately opt-in. But, okay, this
> is therefore an issue for the MTE series.
Unless Google phases out hwasan soon ;), they may live in parallel for a
while, so a choice of TBI vs MTE.
> Alright, the tl;dr appears to be:
> - you want more assurances that we can find __user stripping in the
> kernel more easily. (But this seems like a parallel problem.)
Yes, and that we found all (most) cases now. The reason I don't see it
as a parallel problem is that, as maintainer, I promise an ABI to user
and I'd rather stick to it. I don't want, for example, ncurses to stop
working because of some ioctl() rejecting tagged pointers.
If it's just the occasional one-off bug I'm fine to deal with it. But
no-one convinced me yet that this is the case.
As for the generic driver code (filesystems or other subsystems),
without some clear direction for developers, together with static
checking/sparse, on how user pointers are cast to longs (one example),
it would become my responsibility to identify and fix them up with any
kernel release. This series is not providing such guidance, just adding
untagged_addr() in some places that we think matter.
> - we might need to opt in to TBI with a prctl()
Yes, although still up for discussion.
> - all other concerns are for the future MTE series (though it sounds
> like HWCAP_MTE and a prctl() solve those issues too).
Indeed.
> Is this accurate? What do you see as the blockers for this series at
> this point?
Well, see my comment on your first bullet point above ;).
And thanks for summarising it.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-23 17:43 ` Catalin Marinas@ 2019-05-23 21:31 ` Kees Cook
2019-05-24 11:20 ` Catalin Marinas
2019-05-28 17:02 ` Catalin Marinas0 siblings, 2 replies; 99+ messages in thread
From: Kees Cook @ 2019-05-23 21:31 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, enh, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
> > What on this front would you be comfortable with? Given it's a new
> > feature isn't it sufficient to have a CONFIG (and/or boot option)?
>
> I'd rather avoid re-building kernels. A boot option would do, unless we
> see value in a per-process (inherited) personality or prctl. The
I think I've convinced myself that the normal<->TBI ABI control should
be a boot parameter. More below...
> > What about testing tools that intentionally insert high bits for syscalls
> > and are _expecting_ them to fail? It seems the TBI series will break them?
> > In that case, do we need to opt into TBI as well?
>
> If there are such tools, then we may need a per-process control. It's
> basically an ABI change.
syzkaller already attempts to randomly inject non-canonical and
0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
able to write directly to kernel memory[1].
It seems that using TBI by default and not allowing a switch back to
"normal" ABI without a reboot actually means that userspace cannot inject
kernel pointers into syscalls any more, since they'll get universally
stripped now. Is my understanding correct, here? i.e. exploiting
CVE-2017-5123 would be impossible under TBI?
If so, then I think we should commit to the TBI ABI and have a boot
flag to disable it, but NOT have a process flag, as that would allow
attackers to bypass the masking. The only flag should be "TBI or MTE".
If so, can I get top byte masking for other architectures too? Like,
just to strip high bits off userspace addresses? ;)
(Oh, in looking I see this is implemented with sign-extension... why
not just a mask? So it'll either be valid userspace address or forced
into the non-canonical range?)
[1] https://salls.github.io/Linux-Kernel-CVE-2017-5123/> > Alright, the tl;dr appears to be:
> > - you want more assurances that we can find __user stripping in the
> > kernel more easily. (But this seems like a parallel problem.)
>
> Yes, and that we found all (most) cases now. The reason I don't see it
> as a parallel problem is that, as maintainer, I promise an ABI to user
> and I'd rather stick to it. I don't want, for example, ncurses to stop
> working because of some ioctl() rejecting tagged pointers.
But this is what I don't understand: it would need to be ncurses _using
TBI_, that would stop working (having started to work before, but then
regress due to a newly added one-off bug). Regular ncurses will be fine
because it's not using TBI. So The Golden Rule isn't violated, and by
definition, it's a specific regression caused by some bug (since TBI
would have had to have worked _before_ in the situation to be considered
a regression now). Which describes the normal path for kernel
development... add feature, find corner cases where it doesn't work,
fix them, encounter new regressions, fix those, repeat forever.
> If it's just the occasional one-off bug I'm fine to deal with it. But
> no-one convinced me yet that this is the case.
You believe there still to be some systemic cases that haven't been
found yet? And even if so -- isn't it better to work on that
incrementally?
> As for the generic driver code (filesystems or other subsystems),
> without some clear direction for developers, together with static
> checking/sparse, on how user pointers are cast to longs (one example),
> it would become my responsibility to identify and fix them up with any
> kernel release. This series is not providing such guidance, just adding
> untagged_addr() in some places that we think matter.
What about adding a nice bit of .rst documentation that describes the
situation and shows how to use untagged_addr(). This is the kind of
kernel-wide change that "everyone" needs to know about, and shouldn't
be the arch maintainer's sole responsibility to fix.
> > - we might need to opt in to TBI with a prctl()
>
> Yes, although still up for discussion.
I think I've talked myself out of it. I say boot param only! :)
So what do you say to these next steps:
- change untagged_addr() to use a static branch that is controlled with
a boot parameter.
- add, say, Documentation/core-api/user-addresses.rst to describe
proper care and handling of user space pointers with untagged_addr(),
with examples based on all the cases seen so far in this series.
- continue work to improve static analysis.
Thanks for wading through this with me! :)
--
Kees Cook
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-23 21:31 ` Kees Cook@ 2019-05-24 11:20 ` Catalin Marinas
2019-05-28 17:02 ` Catalin Marinas1 sibling, 0 replies; 99+ messages in thread
From: Catalin Marinas @ 2019-05-24 11:20 UTC (permalink / raw)
To: Kees Cook
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, enh, Robin Murphy,
Christian Koenig, Luc Van Oostenryck
On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
> On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote:
> > On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
> > > What about testing tools that intentionally insert high bits for syscalls
> > > and are _expecting_ them to fail? It seems the TBI series will break them?
> > > In that case, do we need to opt into TBI as well?
> >
> > If there are such tools, then we may need a per-process control. It's
> > basically an ABI change.
>
> syzkaller already attempts to randomly inject non-canonical and
> 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
> find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
> able to write directly to kernel memory[1].
>
> It seems that using TBI by default and not allowing a switch back to
> "normal" ABI without a reboot actually means that userspace cannot inject
> kernel pointers into syscalls any more, since they'll get universally
> stripped now. Is my understanding correct, here? i.e. exploiting
> CVE-2017-5123 would be impossible under TBI?
Unless the kernel is also using TBI (khwasan), in which case masking out
the top byte wouldn't help. Anyway, as per this discussion, we want the
tagged pointer to remain intact all the way to put_user(), so nothing
gets masked out. I don't think this would have helped with the waitid()
bug.
> If so, then I think we should commit to the TBI ABI and have a boot
> flag to disable it, but NOT have a process flag, as that would allow
> attackers to bypass the masking. The only flag should be "TBI or MTE".
>
> If so, can I get top byte masking for other architectures too? Like,
> just to strip high bits off userspace addresses? ;)
But you didn't like my option 2 shim proposal which strips the tag on
kernel entry because it lowers the value of MTE ;).
> (Oh, in looking I see this is implemented with sign-extension... why
> not just a mask? So it'll either be valid userspace address or forced
> into the non-canonical range?)
The TTBR0/1 selection on memory accesses is done based on bit 63 if TBI
is disabled and bit 55 when enabled. Sign-extension allows us to use the
same macro for both user and kernel tagged pointers. With MTE tag 0
would be match-all for TTBR0 and 0xff for TTBR1 (so that we don't modify
the virtual address space of the kernel; I need to check the latest spec
to be sure). Note that the VA space for both user and kernel is limited
to 52-bit architecturally so, on access, bits 55..52 must be the same, 0
or 1, otherwise you get a fault.
Since the syzkaller tests would also need to set bits 55-52 (actually 48
for kernel addresses, we haven't merged the 52-bit kernel VA patches
yet) to hit a valid kernel address, I don't think ignoring the top byte
makes much difference to the expected failure scenario.
> > > Alright, the tl;dr appears to be:
> > > - you want more assurances that we can find __user stripping in the
> > > kernel more easily. (But this seems like a parallel problem.)
> >
> > Yes, and that we found all (most) cases now. The reason I don't see it
> > as a parallel problem is that, as maintainer, I promise an ABI to user
> > and I'd rather stick to it. I don't want, for example, ncurses to stop
> > working because of some ioctl() rejecting tagged pointers.
>
> But this is what I don't understand: it would need to be ncurses _using
> TBI_, that would stop working (having started to work before, but then
> regress due to a newly added one-off bug). Regular ncurses will be fine
> because it's not using TBI. So The Golden Rule isn't violated,
Once we introduced TBI and the libc starts tagging heap allocations,
this becomes the new "regular" user space behaviour (i.e. using TBI). So
a new bug would break the golden rule. It could also be an old bug that
went unnoticed (i.e. you changed the graphics card and its driver gets
confused by tagged pointers coming from user-space).
> and by definition, it's a specific regression caused by some bug
> (since TBI would have had to have worked _before_ in the situation to
> be considered a regression now). Which describes the normal path for
> kernel development... add feature, find corner cases where it doesn't
> work, fix them, encounter new regressions, fix those, repeat forever.
>
> > If it's just the occasional one-off bug I'm fine to deal with it. But
> > no-one convinced me yet that this is the case.
>
> You believe there still to be some systemic cases that haven't been
> found yet? And even if so -- isn't it better to work on that
> incrementally?
I want some way to systematically identify potential issues (sparse?).
Since problems are most likely in drivers, I don't have all devices to
check and not all users have the knowledge to track down why something
failed.
I think we can do this incrementally as long the TBI ABI is not the
default. Even better if we made it per process.
> > As for the generic driver code (filesystems or other subsystems),
> > without some clear direction for developers, together with static
> > checking/sparse, on how user pointers are cast to longs (one example),
> > it would become my responsibility to identify and fix them up with any
> > kernel release. This series is not providing such guidance, just adding
> > untagged_addr() in some places that we think matter.
>
> What about adding a nice bit of .rst documentation that describes the
> situation and shows how to use untagged_addr(). This is the kind of
> kernel-wide change that "everyone" needs to know about, and shouldn't
> be the arch maintainer's sole responsibility to fix.
This works (if people read it) but we also need to be more prescriptive
in how casting is done and how we differentiate between a pointer for
dereference (T __user *) and address space management (usually unsigned
long). On top of that, we'd get sparse to check for such conversions and
maybe even checkpatch for some low-hanging fruit.
> > > - we might need to opt in to TBI with a prctl()
> >
> > Yes, although still up for discussion.
>
> I think I've talked myself out of it. I say boot param only! :)
I hope I talked you in again ;). I don't see TBI as improving kernel
security.
> So what do you say to these next steps:
>
> - change untagged_addr() to use a static branch that is controlled with
> a boot parameter.
access_ok() as well.
> - add, say, Documentation/core-api/user-addresses.rst to describe
> proper care and handling of user space pointers with untagged_addr(),
> with examples based on all the cases seen so far in this series.
We have u64_to_user_ptr(). What about the reverse? And maybe changing
get_user_pages() to take void __user *.
> - continue work to improve static analysis.
Andrew Murray in the ARM kernel team started revisiting the old sparse
threads, let's see how it goes.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-23 16:57 ` Catalin Marinas@ 2019-05-24 14:23 ` Dave Martin0 siblings, 0 replies; 99+ messages in thread
From: Dave Martin @ 2019-05-24 14:23 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Christian Koenig, Szabolcs Nagy, Will Deacon,
dri-devel, linux-mm, Lee Smith, linux-kselftest,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, Jason Gunthorpe, linux-arm-kernel, Evgeniy Stepanov,
linux-media, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Kevin Brodsky, Alex Williamson, Mauro Carvalho Chehab,
Dmitry Vyukov, Kostya Serebryany, Greg Kroah-Hartman,
Yishai Hadas, linux-kernel, Jens Wiklander, Ramana Radhakrishnan,
Alexander Deucher, Andrew Morton, Robin Murphy, Felix Kuehling,
Luc Van Oostenryck
On Thu, May 23, 2019 at 05:57:09PM +0100, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote:
> > On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> > > > If multiple people will care about this, perhaps we should try to
> > > > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> > > > structures.
> > > >
> > > > For example, we could have a couple of mutually exclusive modifiers
> > > >
> > > > T __object *
> > > > T __vaddr * (or U __vaddr)
> > > >
> > > > In the first case the pointer points to an object (in the C sense)
> > > > that the call may dereference but not use for any other purpose.
> > >
> > > How would you use these two differently?
> > >
> > > So far the kernel has worked that __user should tag any pointer that
> > > is from userspace and then you can't do anything with it until you
> > > transform it into a kernel something
> >
> > Ultimately it would be good to disallow casting __object pointers execpt
> > to compatible __object pointer types, and to make get_user etc. demand
> > __object.
> >
> > __vaddr pointers / addresses would be freely castable, but not to
> > __object and so would not be dereferenceable even indirectly.
>
> I think it gets too complicated and there are ambiguous cases that we
> may not be able to distinguish. For example copy_from_user() may be used
> to copy a user data structure into the kernel, hence __object would
> work, while the same function may be used to copy opaque data to a file,
> so __vaddr may be a better option (unless I misunderstood your
> proposal).
Can you illustrate? I'm not sure of your point here.
> We currently have T __user * and I think it's a good starting point. The
> prior attempt [1] was shut down because it was just hiding the cast
> using __force. We'd need to work through those cases again and rather
> start changing the function prototypes to avoid unnecessary casting in
> the callers (e.g. get_user_pages(void __user *) or come up with a new
> type) while changing the explicit casting to a macro where it needs to
> be obvious that we are converting a user pointer, potentially typed
> (tagged), to an untyped address range. We may need a user_ptr_to_ulong()
> macro or similar (it seems that we have a u64_to_user_ptr, wasn't aware
> of it).
>
> It may actually not be far from what you suggested but I'd keep the
> current T __user * to denote possible dereference.
This may not have been clear, but __object and __vaddr would be
orthogonal to __user. Since __object and __vaddr strictly constrain
what can be done with an lvalue, they could be cast on, but not be
cast off without __force.
Syscall arguments and pointer in ioctl structs etc. would typically
be annotated as __object __user * or __vaddr __user *. Plain old
__user * would work as before, but would be more permissive and give
static analysers less information to go on.
Conversion or use or __object or __vaddr pointers would require specific
APIs in the kernel, so that we can be clear about the semantics.
Doing things this way would allow migration to annotation of most or all
ABI pointers with __object or __vaddr over time, but we wouldn't have to
do it all in one go. Problem cases (which won't be the majority) could
continue to be plain __user.
This does not magically solve the challenges of MTE, but might provide
tools that are useful to help avoid bitrot and regressions over time.
I agree though that there might be a fair number of of cases that don't
conveniently fall under __object or __vaddr semantics. It's hard to
know without trying it.
_Most_ syscall arguments seem to be fairly obviously one or another
though, and this approach has some possibility of scaling to ioctls
and other odd interfaces.
Cheers
---Dave
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-24 10:11 ` Catalin Marinas@ 2019-05-24 14:25 ` Khalid Aziz
2019-05-28 14:14 ` Andrey Konovalov0 siblings, 1 reply; 99+ messages in thread
From: Khalid Aziz @ 2019-05-24 14:25 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Ramana Radhakrishnan, Alex Williamson, Yishai Hadas,
Mauro Carvalho Chehab, Linux ARM, Kostya Serebryany,
Greg Kroah-Hartman, Felix Kuehling, LKML, Jens Wiklander,
Lee Smith, Alexander Deucher, Andrew Morton, Elliott Hughes,
Robin Murphy, Christian Koenig, Luc Van Oostenryck
On 5/24/19 4:11 AM, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 03:49:05PM -0600, Khalid Aziz wrote:
>> On 5/23/19 2:11 PM, Catalin Marinas wrote:
>>> On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
>>>> On 5/21/19 6:04 PM, Kees Cook wrote:
>>>>> As an aside: I think Sparc ADI support in Linux actually side-stepped
>>>>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
>>>>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
>>>>> for kernel code.") I think this was a mistake we should not repeat for
>>>>> arm64 (we do seem to be at least in agreement about this, I think).
>>>>>
>>>>> [1] https://lore.kernel.org/patchwork/patch/654481/
>>>>
>>>> That is a very early version of the sparc ADI patch. Support for tagged
>>>> addresses in syscalls was added in later versions and is in the patch
>>>> that is in the kernel.
>>>
>>> I tried to figure out but I'm not familiar with the sparc port. How did
>>> you solve the tagged address going into various syscall implementations
>>> in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
>>> ends up deeper in the core code?
>>
>> Another spot I should point out in ADI patch - Tags are not stored in
>> VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped
>> before userspace addresses are passed to IOMMU in the following snippet
>> from the patch:
>>
>> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
>> index 5335ba3c850e..357b6047653a 100644
>> --- a/arch/sparc/mm/gup.c
>> +++ b/arch/sparc/mm/gup.c
>> @@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int
>> nr_pages
>> , int write,
>> pgd_t *pgdp;
>> int nr = 0;
>>
>> +#ifdef CONFIG_SPARC64
>> + if (adi_capable()) {
>> + long addr = start;
>> +
>> + /* If userspace has passed a versioned address, kernel
>> + * will not find it in the VMAs since it does not store
>> + * the version tags in the list of VMAs. Storing version
>> + * tags in list of VMAs is impractical since they can be
>> + * changed any time from userspace without dropping into
>> + * kernel. Any address search in VMAs will be done with
>> + * non-versioned addresses. Ensure the ADI version bits
>> + * are dropped here by sign extending the last bit before
>> + * ADI bits. IOMMU does not implement version tags.
>> + */
>> + addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
>> + start = addr;
>> + }
>> +#endif
>> start &= PAGE_MASK;
>> addr = start;
>> len = (unsigned long) nr_pages << PAGE_SHIFT;
>
> Thanks Khalid. I missed that sparc does not enable HAVE_GENERIC_GUP, so
> you fix this case here. If we add the generic untagged_addr() macro in
> the generic code, I think sparc can start making use of it rather than
> open-coding the shifts.
Hi Catalin,
Yes, that will be good. Right now addresses are untagged in sparc code
in only two spots but that will expand as we expand use of tags.
Scalabale solution is definitely better.
>
> There are a few other other places where tags can leak and the core code
> would get confused (for example, madvise()). I presume your user space
> doesn't exercise them. On arm64 we plan to just allow the C library to
> tag any new memory allocation, so those core code paths would need to be
> covered.
>
> And similarly, devices, IOMMU, any DMA would ignore tags.
>
Right. You are doing lot more with tags than sparc code intended to do.
I had looked into implementing just malloc(), mmap() and possibly
shmat() in library that automatically tags pointers. Expanding tags to
any pointers in C library will require covering lot more paths in kernel.
--
Khalid
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls
2019-05-28 15:40 ` Catalin Marinas
2019-05-28 15:56 ` Dave Martin@ 2019-05-28 23:33 ` Khalid Aziz
2019-05-29 14:20 ` Catalin Marinas1 sibling, 1 reply; 99+ messages in thread
From: Khalid Aziz @ 2019-05-28 23:33 UTC (permalink / raw)
To: Catalin Marinas, Andrew Murray
Cc: Mark Rutland, kvm, Christian Koenig, Szabolcs Nagy, Will Deacon,
dri-devel, linux-mm, Lee Smith, linux-kselftest,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, linux-arm-kernel, Dave Martin, Evgeniy Stepanov,
linux-media, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Kevin Brodsky, Alex Williamson, Mauro Carvalho Chehab,
Dmitry Vyukov, Kostya Serebryany, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, Jens Wiklander,
Ramana Radhakrishnan, Alexander Deucher, Andrew Morton,
Robin Murphy, Yishai Hadas, Luc Van Oostenryck
On Tue, 2019-05-28 at 16:40 +0100, Catalin Marinas wrote:
> On Tue, May 28, 2019 at 03:54:11PM +0100, Andrew Murray wrote:
> > On Mon, May 27, 2019 at 03:37:20PM +0100, Catalin Marinas wrote:
> > > On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> > > > This patch is a part of a series that extends arm64 kernel ABI
> > > > to allow to
> > > > pass tagged user pointers (with the top byte set to something
> > > > else other
> > > > than 0x00) as syscall arguments.
> > > >
> > > > This patch allows tagged pointers to be passed to the following
> > > > memory
> > > > syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock,
> > > > mlock2,
> > > > mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> > > > remap_file_pages, shmat and shmdt.
> > > >
> > > > This is done by untagging pointers passed to these syscalls in
> > > > the
> > > > prologues of their handlers.
> > > >
> > > > Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
> > >
> > > Actually, I don't think any of these wrappers get called (have
> > > you
> > > tested this patch?). Following commit 4378a7d4be30 ("arm64:
> > > implement
> > > syscall wrappers"), I think we have other macro names for
> > > overriding the
> > > sys_* ones.
> >
> > What is the value in adding these wrappers?
>
> Not much value, initially proposed just to keep the core changes
> small.
> I'm fine with moving them back in the generic code (but see below).
>
> I think another aspect is how we define the ABI. Is allowing tags to
> mlock() for example something specific to arm64 or would sparc ADI
> need
> the same? In the absence of other architectures defining such ABI, my
> preference would be to keep the wrappers in the arch code.
>
> Assuming sparc won't implement untagged_addr(), we can place the
> macros
> back in the generic code but, as per the review here, we need to be
> more
> restrictive on where we allow tagged addresses. For example, if
> mmap()
> gets a tagged address with MAP_FIXED, is it expected to return the
> tag?
I would recommend against any ABI differences between ARM64 MTE/TBI and
sparc ADI unless it simply can not be helped. My understanding of
MTE/TBI is limited, so I will explain how sparc ADI works. On sparc, a
tagged address has no meaning until following steps happen:
1. set the user mode PSTATE.mcde bit. This acts as the master switch to
enable ADI for a process.
2. set TTE.mcd bit on TLB entries that match the address range ADI is
being enabled on.
3. Store version tag for the range of addresses userspace wants ADI
enabled on using "stxa" instruction. These tags are stored in physical
memory by the memory controller.
Steps 1 and 2 are accomplished by userspace by calling mprotect() with
PROT_ADI. Tags are set by storing tags in a loop, for example:
version = 10;
tmp_addr = shmaddr;
end = shmaddr + BUFFER_SIZE;
while (tmp_addr < end) {
asm volatile(
"stxa %1, [%0]0x90\n\t"
:
: "r" (tmp_addr), "r" (version));
tmp_addr += adi_blksz;
}
With these semantics, giving mmap() or shamat() a tagged address is
meaningless since no tags have been stored at the addresses mmap() will
allocate and one can not store tags before memory range has been
allocated. If we choose to allow tagged addresses to come into mmap()
and shmat(), sparc code can strip the tags unconditionally and that may
help simplify ABI and/or code.
>
> My thoughts on allowing tags (quick look):
>
> brk - no
> get_mempolicy - yes
> madvise - yes
> mbind - yes
> mincore - yes
> mlock, mlock2, munlock - yes
> mmap - no (we may change this with MTE but not for TBI)
> mmap_pgoff - not used on arm64
> mprotect - yes
> mremap - yes for old_address, no for new_address (on par with mmap)
> msync - yes
> munmap - probably no (mmap does not return tagged ptrs)
> remap_file_pages - no (also deprecated syscall)
> shmat, shmdt - shall we allow tagged addresses on shared memory?
>
> The above is only about the TBI ABI while ignoring hardware MTE. For
> the
> latter, we may want to change the mmap() to allow pre-colouring on
> page
> fault which means that munmap()/mprotect() should also support tagged
> pointers. Possibly mremap() as well but we need to decide whether it
> should allow re-colouring the page (probably no, in which case
> old_address and new_address should have the same tag). For some of
> these
> we'll end up with arm64 specific wrappers again, unless sparc ADI
> adopts
> exactly the same ABI restrictions.
>
Let us keep any restrictions common across ARM64 and sparc. pre-
coloring on sparc in the kernel would mean kernel will have to execute
stxa instructions in a loop for each page being faulted in. Not that
big a deal but doesn't that assume the entire page has the same tag
which is dedcued from the upper bits of address? Shouldn't we support
tags at the same granularity level as what the hardware supports? We
went through this discussion for sparc and decision was to support tags
at the same granularity as hardware. That means we can not deduce tags
from the first address that pioints into an mmap or shmat region. Those
tags and the upper bytes of colored address could change for every
cacheline sized block (64-bytes on sparc M7). We can try to store tags
for an entire region in vma but that is expensive, plus on sparc tags
are set in userspace with no participation from kernel and now we need
a way for userspace to communicate the tags to kernel. From sparc point
of view, making kernel responsible for assigning tags to a page on page
fault is full of pitfalls.
Thanks,
Khalid
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls
2019-05-28 23:33 ` Khalid Aziz@ 2019-05-29 14:20 ` Catalin Marinas
2019-05-29 19:16 ` Khalid Aziz0 siblings, 1 reply; 99+ messages in thread
From: Catalin Marinas @ 2019-05-29 14:20 UTC (permalink / raw)
To: Khalid Aziz
Cc: Mark Rutland, kvm, Christian Koenig, Szabolcs Nagy, Will Deacon,
dri-devel, linux-mm, Lee Smith, linux-kselftest,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, linux-arm-kernel, Dave Martin, Evgeniy Stepanov,
linux-media, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Kevin Brodsky, Alex Williamson, Mauro Carvalho Chehab,
Dmitry Vyukov, Kostya Serebryany, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, Jens Wiklander,
Ramana Radhakrishnan, Alexander Deucher, Andrew Murray,
Andrew Morton, Robin Murphy, Yishai Hadas, Luc Van Oostenryck
Hi Khalid,
On Tue, May 28, 2019 at 05:33:04PM -0600, Khalid Aziz wrote:
> On Tue, 2019-05-28 at 16:40 +0100, Catalin Marinas wrote:
> > I think another aspect is how we define the ABI. Is allowing tags to
> > mlock() for example something specific to arm64 or would sparc ADI
> > need the same? In the absence of other architectures defining such
> > ABI, my preference would be to keep the wrappers in the arch code.
> >
> > Assuming sparc won't implement untagged_addr(), we can place the
> > macros back in the generic code but, as per the review here, we need
> > to be more restrictive on where we allow tagged addresses. For
> > example, if mmap() gets a tagged address with MAP_FIXED, is it
> > expected to return the tag?
>
> I would recommend against any ABI differences between ARM64 MTE/TBI and
> sparc ADI unless it simply can not be helped. My understanding of
> MTE/TBI is limited, so I will explain how sparc ADI works. On sparc, a
> tagged address has no meaning until following steps happen:
Before we go into the MTE/ADI similarities or differences, just to
clarify that TBI is something that we supported from the start of the
arm64 kernel port. TBI (top byte ignore) allows a user pointer to have
non-zero top byte and dereference it without causing a fault (the
hardware masks it out). The user/kernel ABI does not allow such tagged
pointers into the kernel, nor would the kernel return any such tagged
addresses.
With MTE (memory tagging extensions), the top-byte meaning is changed
from no longer being ignored to actually being checked against a tag in
the physical RAM (we call it allocation tag).
> 1. set the user mode PSTATE.mcde bit. This acts as the master switch to
> enable ADI for a process.
>
> 2. set TTE.mcd bit on TLB entries that match the address range ADI is
> being enabled on.
Something close enough for MTE, with the difference that enabling it is
not a PSTATE bit but rather a system control bit (SCTLR_EL1 register),
so only the kernel can turn it on/off for the user.
> 3. Store version tag for the range of addresses userspace wants ADI
> enabled on using "stxa" instruction. These tags are stored in physical
> memory by the memory controller.
Do you have an "ldxa" instruction to load the tags from physical memory?
> Steps 1 and 2 are accomplished by userspace by calling mprotect() with
> PROT_ADI. Tags are set by storing tags in a loop, for example:
>
> version = 10;
> tmp_addr = shmaddr;
> end = shmaddr + BUFFER_SIZE;
> while (tmp_addr < end) {
> asm volatile(
> "stxa %1, [%0]0x90\n\t"
> :
> : "r" (tmp_addr), "r" (version));
> tmp_addr += adi_blksz;
> }
On arm64, a sequence similar to the above would live in the libc. So a
malloc() call will tag the memory and return the tagged address to the
user.
We were not planning for a PROT_ADI/MTE but rather have MTE enabled for
all user memory ranges. We may revisit this before we upstream the MTE
support (probably some marginal benefit for the hardware not fetching
the tags from memory if we don't need to, e.g. code sections).
Given that we already have the TBI feature and with MTE enabled the top
byte is no longer ignored, we are planning for an explicit opt-in by the
user via prctl() to enable MTE.
> With these semantics, giving mmap() or shamat() a tagged address is
> meaningless since no tags have been stored at the addresses mmap() will
> allocate and one can not store tags before memory range has been
> allocated. If we choose to allow tagged addresses to come into mmap()
> and shmat(), sparc code can strip the tags unconditionally and that may
> help simplify ABI and/or code.
We could say that with TBI (pre-MTE support), the top byte is actually
ignored on mmap(). Now, if you pass a MAP_FIXED with a tagged address,
should the user expect the same tagged address back or stripping the tag
is acceptable? If we want to keep the current mmap() semantics, I'd say
the same tag is returned. However, with MTE this also implies that the
memory was coloured.
> > My thoughts on allowing tags (quick look):
> >
> > brk - no
> > get_mempolicy - yes
> > madvise - yes
> > mbind - yes
> > mincore - yes
> > mlock, mlock2, munlock - yes
> > mmap - no (we may change this with MTE but not for TBI)
> > mmap_pgoff - not used on arm64
> > mprotect - yes
> > mremap - yes for old_address, no for new_address (on par with mmap)
> > msync - yes
> > munmap - probably no (mmap does not return tagged ptrs)
> > remap_file_pages - no (also deprecated syscall)
> > shmat, shmdt - shall we allow tagged addresses on shared memory?
> >
> > The above is only about the TBI ABI while ignoring hardware MTE. For
> > the latter, we may want to change the mmap() to allow pre-colouring
> > on page fault which means that munmap()/mprotect() should also
> > support tagged pointers. Possibly mremap() as well but we need to
> > decide whether it should allow re-colouring the page (probably no,
> > in which case old_address and new_address should have the same tag).
> > For some of these we'll end up with arm64 specific wrappers again,
> > unless sparc ADI adopts exactly the same ABI restrictions.
>
> Let us keep any restrictions common across ARM64 and sparc. pre-
> coloring on sparc in the kernel would mean kernel will have to execute
> stxa instructions in a loop for each page being faulted in.
Since the user can probe the pre-existing colour in a faulted-in page
(either with some 'ldxa' instruction or by performing a tag-checked
access), the kernel should always pre-colour (even if colour 0) any
allocated page. There might not be an obvious security risk but I feel
uneasy about letting colours leak between address spaces (different user
processes or between kernel and user).
Since we already need such loop in the kernel, we might as well allow
user space to require a certain colour. This comes in handy for large
malloc() and another advantage is that the C library won't be stuck
trying to paint the whole range (think GB).
> Not that big a deal but doesn't that assume the entire page has the
> same tag which is dedcued from the upper bits of address? Shouldn't we
> support tags at the same granularity level as what the hardware
> supports?
That's mostly about large malloc() optimisation via mmap(), the latter
working on page granularity already. There is another use-case for
pre-coloured thread stacks, also allocated via anonymous mmap().
> We went through this discussion for sparc and decision was to support
> tags at the same granularity as hardware. That means we can not deduce
> tags from the first address that pioints into an mmap or shmat region.
> Those tags and the upper bytes of colored address could change for
> every cacheline sized block (64-bytes on sparc M7).
It's 16-byte for arm64, so smaller than the cacheline.
> We can try to store tags for an entire region in vma but that is
> expensive, plus on sparc tags are set in userspace with no
> participation from kernel and now we need a way for userspace to
> communicate the tags to kernel.
We can't support finer granularity through the mmap() syscall and, as
you said, the vma is not the right thing to store the individual tags.
With the above extension to mmap(), we'd have to store a colour per vma
and prevent merging if different colours (we could as well use the
pkeys mechanism we already have in the kernel but use a colour per vma
instead of a key).
Of course, the user is allowed to change the in-memory colours at a
finer granularity and the kernel will preserve them during swapping
out/in, page migration etc. The above mmap() proposal is just for the
first fault-in of a page in a given range/vma.
> From sparc point of view, making kernel responsible for assigning tags
> to a page on page fault is full of pitfalls.
This could be just some arm64-specific but if you plan to deploy it more
generically for sparc (at the C library level), you may find this
useful.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls
2019-05-29 14:20 ` Catalin Marinas@ 2019-05-29 19:16 ` Khalid Aziz
2019-05-30 15:11 ` Catalin Marinas0 siblings, 1 reply; 99+ messages in thread
From: Khalid Aziz @ 2019-05-29 19:16 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Christian Koenig, Szabolcs Nagy, Will Deacon,
dri-devel, linux-mm, Lee Smith, linux-kselftest,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, linux-arm-kernel, Dave Martin, Evgeniy Stepanov,
linux-media, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Kevin Brodsky, Alex Williamson, Mauro Carvalho Chehab,
Dmitry Vyukov, Kostya Serebryany, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, Jens Wiklander,
Ramana Radhakrishnan, Alexander Deucher, Andrew Murray,
Andrew Morton, Robin Murphy, Yishai Hadas, Luc Van Oostenryck
On 5/29/19 8:20 AM, Catalin Marinas wrote:
> Hi Khalid,
>
> On Tue, May 28, 2019 at 05:33:04PM -0600, Khalid Aziz wrote:
>> On Tue, 2019-05-28 at 16:40 +0100, Catalin Marinas wrote:
>>> I think another aspect is how we define the ABI. Is allowing tags to
>>> mlock() for example something specific to arm64 or would sparc ADI
>>> need the same? In the absence of other architectures defining such
>>> ABI, my preference would be to keep the wrappers in the arch code.
>>>
>>> Assuming sparc won't implement untagged_addr(), we can place the
>>> macros back in the generic code but, as per the review here, we need
>>> to be more restrictive on where we allow tagged addresses. For
>>> example, if mmap() gets a tagged address with MAP_FIXED, is it
>>> expected to return the tag?
>>
>> I would recommend against any ABI differences between ARM64 MTE/TBI and
>> sparc ADI unless it simply can not be helped. My understanding of
>> MTE/TBI is limited, so I will explain how sparc ADI works. On sparc, a
>> tagged address has no meaning until following steps happen:
>
> Before we go into the MTE/ADI similarities or differences, just to
> clarify that TBI is something that we supported from the start of the
> arm64 kernel port. TBI (top byte ignore) allows a user pointer to have
> non-zero top byte and dereference it without causing a fault (the
> hardware masks it out). The user/kernel ABI does not allow such tagged
> pointers into the kernel, nor would the kernel return any such tagged
> addresses.
>
> With MTE (memory tagging extensions), the top-byte meaning is changed
> from no longer being ignored to actually being checked against a tag in
> the physical RAM (we call it allocation tag).
>
>> 1. set the user mode PSTATE.mcde bit. This acts as the master switch to
>> enable ADI for a process.
>>
>> 2. set TTE.mcd bit on TLB entries that match the address range ADI is
>> being enabled on.
>
> Something close enough for MTE, with the difference that enabling it is
> not a PSTATE bit but rather a system control bit (SCTLR_EL1 register),
> so only the kernel can turn it on/off for the user.
>
>> 3. Store version tag for the range of addresses userspace wants ADI
>> enabled on using "stxa" instruction. These tags are stored in physical
>> memory by the memory controller.
>
> Do you have an "ldxa" instruction to load the tags from physical memory?
Yes, "ldxa" can be used to read current tag for any memory location.
Kernel uses it to read the tags for a physical page being swapped out
and restores those tags when the page is swapped back in.
>
>> Steps 1 and 2 are accomplished by userspace by calling mprotect() with
>> PROT_ADI. Tags are set by storing tags in a loop, for example:
>>
>> version = 10;
>> tmp_addr = shmaddr;
>> end = shmaddr + BUFFER_SIZE;
>> while (tmp_addr < end) {
>> asm volatile(
>> "stxa %1, [%0]0x90\n\t"
>> :
>> : "r" (tmp_addr), "r" (version));
>> tmp_addr += adi_blksz;
>> }
>
> On arm64, a sequence similar to the above would live in the libc. So a
> malloc() call will tag the memory and return the tagged address to thePre-coloring could easily be done by
> user.
>
> We were not planning for a PROT_ADI/MTE but rather have MTE enabled for
> all user memory ranges. We may revisit this before we upstream the MTE
> support (probably some marginal benefit for the hardware not fetching
> the tags from memory if we don't need to, e.g. code sections).
>
> Given that we already have the TBI feature and with MTE enabled the top
> byte is no longer ignored, we are planning for an explicit opt-in by the
> user via prctl() to enable MTE.
OK. I had initially proposed enabling ADI for a process using prctl().
Feedback I got was prctl was not a desirable interface and I ended up
making mprotect() with PROT_ADI enable ADI on the process instead. Just
something to keep in mind.
>
>> With these semantics, giving mmap() or shamat() a tagged address is
>> meaningless since no tags have been stored at the addresses mmap() will
>> allocate and one can not store tags before memory range has been
>> allocated. If we choose to allow tagged addresses to come into mmap()
>> and shmat(), sparc code can strip the tags unconditionally and that may
>> help simplify ABI and/or code.
>
> We could say that with TBI (pre-MTE support), the top byte is actually
> ignored on mmap(). Now, if you pass a MAP_FIXED with a tagged address,
> should the user expect the same tagged address back or stripping the tag
> is acceptable? If we want to keep the current mmap() semantics, I'd say
> the same tag is returned. However, with MTE this also implies that the
> memory was coloured.
>
Is assigning a tag aprivileged operationon ARM64? I am thinking not
since you mentioned libc could do it in a loop for malloc'd memory.
mmap() can return the same tagged address but I am uneasy about kernel
pre-coloring the pages. Database can mmap 100's of GB of memory. That is
lot of work being offloaded to the kernel to pre-color the page even if
done in batches as pages are faulted in.
>>> My thoughts on allowing tags (quick look):
>>>
>>> brk - no
>>> get_mempolicy - yes
>>> madvise - yes
>>> mbind - yes
>>> mincore - yes
>>> mlock, mlock2, munlock - yes
>>> mmap - no (we may change this with MTE but not for TBI)
>>> mmap_pgoff - not used on arm64
>>> mprotect - yes
>>> mremap - yes for old_address, no for new_address (on par with mmap)
>>> msync - yes
>>> munmap - probably no (mmap does not return tagged ptrs)
>>> remap_file_pages - no (also deprecated syscall)
>>> shmat, shmdt - shall we allow tagged addresses on shared memory?
>>>
>>> The above is only about the TBI ABI while ignoring hardware MTE. For
>>> the latter, we may want to change the mmap() to allow pre-colouring
>>> on page fault which means that munmap()/mprotect() should also
>>> support tagged pointers. Possibly mremap() as well but we need to
>>> decide whether it should allow re-colouring the page (probably no,
>>> in which case old_address and new_address should have the same tag).
>>> For some of these we'll end up with arm64 specific wrappers again,
>>> unless sparc ADI adopts exactly the same ABI restrictions.
>>
>> Let us keep any restrictions common across ARM64 and sparc. pre-
>> coloring on sparc in the kernel would mean kernel will have to execute
>> stxa instructions in a loop for each page being faulted in.
>
> Since the user can probe the pre-existing colour in a faulted-in page
> (either with some 'ldxa' instruction or by performing a tag-checked
> access), the kernel should always pre-colour (even if colour 0) any
> allocated page. There might not be an obvious security risk but I feel
> uneasy about letting colours leak between address spaces (different user
> processes or between kernel and user).
On sparc, tags 0 and 15 are special in that 0 means untagged memory and
15 means match any tag in the address. Colour 0 is the default for any
newly faulted in page on sparc.
>
> Since we already need such loop in the kernel, we might as well allow
> user space to require a certain colour. This comes in handy for large
> malloc() and another advantage is that the C library won't be stuck
> trying to paint the whole range (think GB).
If kernel is going to pre-color all pages in a vma, we will need to
store the default tag in the vma. It will add more time to page fault
handling code. On sparc M7, kernel will need to execute additional 128
stxa instructions to set the tags on a normal page.
>
>> Not that big a deal but doesn't that assume the entire page has the
>> same tag which is dedcued from the upper bits of address? Shouldn't we
>> support tags at the same granularity level as what the hardware
>> supports?
>
> That's mostly about large malloc() optimisation via mmap(), the latter
> working on page granularity already. There is another use-case for
> pre-coloured thread stacks, also allocated via anonymous mmap().
>
>> We went through this discussion for sparc and decision was to support
>> tags at the same granularity as hardware. That means we can not deduce
>> tags from the first address that pioints into an mmap or shmat region.
>> Those tags and the upper bytes of colored address could change for
>> every cacheline sized block (64-bytes on sparc M7).
>
> It's 16-byte for arm64, so smaller than the cacheline.
>
>> We can try to store tags for an entire region in vma but that is
>> expensive, plus on sparc tags are set in userspace with no
>> participation from kernel and now we need a way for userspace to
>> communicate the tags to kernel.
>
> We can't support finer granularity through the mmap() syscall and, as
> you said, the vma is not the right thing to store the individual tags.
> With the above extension to mmap(), we'd have to store a colour per vma
> and prevent merging if different colours (we could as well use the
> pkeys mechanism we already have in the kernel but use a colour per vma
> instead of a key).
Since tags can change on any part of mmap region on sparc at any time
without kernel being involved, I am not sure I see much reason for
kernel to enforce any tag related restrictions.
>
> Of course, the user is allowed to change the in-memory colours at a
> finer granularity and the kernel will preserve them during swapping
> out/in, page migration etc. The above mmap() proposal is just for the
> first fault-in of a page in a given range/vma.
>
>> From sparc point of view, making kernel responsible for assigning tags
>> to a page on page fault is full of pitfalls.
>
> This could be just some arm64-specific but if you plan to deploy it more
> generically for sparc (at the C library level), you may find this
> useful.
>
Common semantics from app developer point of view will be very useful to
maintain. If arm64 says mmap with MAP_FIXED and a tagged address will
return a pre-colored page, I would rather have it be the same on any
architecture. Is there a use case that justifies kernel doing this extra
work?
--
Khalid
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls
2019-05-29 19:16 ` Khalid Aziz@ 2019-05-30 15:11 ` Catalin Marinas
2019-05-30 16:05 ` Khalid Aziz0 siblings, 1 reply; 99+ messages in thread
From: Catalin Marinas @ 2019-05-30 15:11 UTC (permalink / raw)
To: Khalid Aziz
Cc: Mark Rutland, kvm, Christian Koenig, Szabolcs Nagy, Will Deacon,
dri-devel, linux-mm, Lee Smith, linux-kselftest,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, linux-arm-kernel, Dave Martin, Evgeniy Stepanov,
linux-media, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Kevin Brodsky, Alex Williamson, Mauro Carvalho Chehab,
Dmitry Vyukov, Kostya Serebryany, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, Jens Wiklander,
Ramana Radhakrishnan, Alexander Deucher, Andrew Murray,
Andrew Morton, Robin Murphy, Yishai Hadas, Luc Van Oostenryck
On Wed, May 29, 2019 at 01:16:37PM -0600, Khalid Aziz wrote:
> On 5/29/19 8:20 AM, Catalin Marinas wrote:
> > On Tue, May 28, 2019 at 05:33:04PM -0600, Khalid Aziz wrote:
> >> Steps 1 and 2 are accomplished by userspace by calling mprotect() with
> >> PROT_ADI. Tags are set by storing tags in a loop, for example:
> >>
> >> version = 10;
> >> tmp_addr = shmaddr;
> >> end = shmaddr + BUFFER_SIZE;
> >> while (tmp_addr < end) {
> >> asm volatile(
> >> "stxa %1, [%0]0x90\n\t"
> >> :
> >> : "r" (tmp_addr), "r" (version));
> >> tmp_addr += adi_blksz;
> >> }
> >
> > On arm64, a sequence similar to the above would live in the libc. So a
> > malloc() call will tag the memory and return the tagged address to thePre-coloring could easily be done by
> > user.
> >
> > We were not planning for a PROT_ADI/MTE but rather have MTE enabled for
> > all user memory ranges. We may revisit this before we upstream the MTE
> > support (probably some marginal benefit for the hardware not fetching
> > the tags from memory if we don't need to, e.g. code sections).
> >
> > Given that we already have the TBI feature and with MTE enabled the top
> > byte is no longer ignored, we are planning for an explicit opt-in by the
> > user via prctl() to enable MTE.
>
> OK. I had initially proposed enabling ADI for a process using prctl().
> Feedback I got was prctl was not a desirable interface and I ended up
> making mprotect() with PROT_ADI enable ADI on the process instead. Just
> something to keep in mind.
Thanks for the feedback. We'll keep this in mind when adding MTE
support. In the way we plan to deploy this, it would be a libc decision
to invoke the mmap() with the right flag.
This could actually simplify the automatic page faulting below brk(),
basically no tagged/coloured memory allowed implicitly. It needs
feedback from the bionic/glibc folk.
> >> With these semantics, giving mmap() or shamat() a tagged address is
> >> meaningless since no tags have been stored at the addresses mmap() will
> >> allocate and one can not store tags before memory range has been
> >> allocated. If we choose to allow tagged addresses to come into mmap()
> >> and shmat(), sparc code can strip the tags unconditionally and that may
> >> help simplify ABI and/or code.
> >
> > We could say that with TBI (pre-MTE support), the top byte is actually
> > ignored on mmap(). Now, if you pass a MAP_FIXED with a tagged address,
> > should the user expect the same tagged address back or stripping the tag
> > is acceptable? If we want to keep the current mmap() semantics, I'd say
> > the same tag is returned. However, with MTE this also implies that the
> > memory was coloured.
>
> Is assigning a tag aprivileged operationon ARM64? I am thinking not
> since you mentioned libc could do it in a loop for malloc'd memory.
Indeed it's not, the user can do it.
> mmap() can return the same tagged address but I am uneasy about kernel
> pre-coloring the pages. Database can mmap 100's of GB of memory. That is
> lot of work being offloaded to the kernel to pre-color the page even if
> done in batches as pages are faulted in.
For anonymous mmap() for example, the kernel would have to zero the
faulted in pages anyway. We can handle the colouring at the same time in
clear_user_page() (as I said below, we have to clear the colour anyway
from previous uses, so it's simply extending this to support something
other than tag/colour 0 by default with no additional overhead).
> > Since the user can probe the pre-existing colour in a faulted-in page
> > (either with some 'ldxa' instruction or by performing a tag-checked
> > access), the kernel should always pre-colour (even if colour 0) any
> > allocated page. There might not be an obvious security risk but I feel
> > uneasy about letting colours leak between address spaces (different user
> > processes or between kernel and user).
>
> On sparc, tags 0 and 15 are special in that 0 means untagged memory and
> 15 means match any tag in the address. Colour 0 is the default for any
> newly faulted in page on sparc.
With MTE we don't have match-all/any tag in memory, only in the virtual
address/pointer. So if we turn on MTE for all pages and the user
accesses an address with a 0 tag, the underlying memory needs to be
coloured with the same 0 value.
> > Since we already need such loop in the kernel, we might as well allow
> > user space to require a certain colour. This comes in handy for large
> > malloc() and another advantage is that the C library won't be stuck
> > trying to paint the whole range (think GB).
>
> If kernel is going to pre-color all pages in a vma, we will need to
> store the default tag in the vma. It will add more time to page fault
> handling code. On sparc M7, kernel will need to execute additional 128
> stxa instructions to set the tags on a normal page.
As I said, since the user can retrieve an old colour using ldxa, the
kernel should perform this operation anyway on any newly allocated page
(unless you clear the existing colour on page freeing).
> >> We can try to store tags for an entire region in vma but that is
> >> expensive, plus on sparc tags are set in userspace with no
> >> participation from kernel and now we need a way for userspace to
> >> communicate the tags to kernel.
> >
> > We can't support finer granularity through the mmap() syscall and, as
> > you said, the vma is not the right thing to store the individual tags.
> > With the above extension to mmap(), we'd have to store a colour per vma
> > and prevent merging if different colours (we could as well use the
> > pkeys mechanism we already have in the kernel but use a colour per vma
> > instead of a key).
>
> Since tags can change on any part of mmap region on sparc at any time
> without kernel being involved, I am not sure I see much reason for
> kernel to enforce any tag related restrictions.
It's not enforcing a tag, more like the default colour for a faulted in
page. Anyway, if sparc is going with default 0/untagged, that's fine as
well. We may add this mmap() option to arm64 only.
> >> From sparc point of view, making kernel responsible for assigning tags
> >> to a page on page fault is full of pitfalls.
> >
> > This could be just some arm64-specific but if you plan to deploy it more
> > generically for sparc (at the C library level), you may find this
> > useful.
>
> Common semantics from app developer point of view will be very useful to
> maintain. If arm64 says mmap with MAP_FIXED and a tagged address will
> return a pre-colored page, I would rather have it be the same on any
> architecture. Is there a use case that justifies kernel doing this extra
> work?
So if a database program is doing an anonymous mmap(PROT_TBI) of 100GB,
IIUC for sparc the faulted-in pages will have random colours (on 64-byte
granularity). Ignoring the information leak from prior uses of such
pages, it would be the responsibility of the db program to issue the
stxa. On arm64, since we also want to do this via malloc(), any large
allocation would require all pages to be faulted in so that malloc() can
set the write colour before being handed over to the user. That's what
we want to avoid and the user is free to repaint the memory as it likes.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls
2019-05-30 15:11 ` Catalin Marinas@ 2019-05-30 16:05 ` Khalid Aziz
2019-05-30 16:57 ` Catalin Marinas0 siblings, 1 reply; 99+ messages in thread
From: Khalid Aziz @ 2019-05-30 16:05 UTC (permalink / raw)
To: Catalin Marinas
Cc: Mark Rutland, kvm, Christian Koenig, Szabolcs Nagy, Will Deacon,
dri-devel, linux-mm, Lee Smith, linux-kselftest,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, linux-arm-kernel, Dave Martin, Evgeniy Stepanov,
linux-media, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Kevin Brodsky, Alex Williamson, Mauro Carvalho Chehab,
Dmitry Vyukov, Kostya Serebryany, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, Jens Wiklander,
Ramana Radhakrishnan, Alexander Deucher, Andrew Murray,
Andrew Morton, Robin Murphy, Yishai Hadas, Luc Van Oostenryck
On 5/30/19 9:11 AM, Catalin Marinas wrote:
> On Wed, May 29, 2019 at 01:16:37PM -0600, Khalid Aziz wrote:
>> mmap() can return the same tagged address but I am uneasy about kernel
>> pre-coloring the pages. Database can mmap 100's of GB of memory. That is
>> lot of work being offloaded to the kernel to pre-color the page even if
>> done in batches as pages are faulted in.
>
> For anonymous mmap() for example, the kernel would have to zero the
> faulted in pages anyway. We can handle the colouring at the same time in
> clear_user_page() (as I said below, we have to clear the colour anyway
> from previous uses, so it's simply extending this to support something
> other than tag/colour 0 by default with no additional overhead).
>
On sparc M7, clear_user_page() ends up in M7clear_user_page defined in
arch/sparc/lib/M7memset.S. M7 code use Block Init Store (BIS) to clear
the page. BIS on M7 clears the memory tags as well and no separate
instructions are needed to clear the tags. As a result when kernel
clears a page before returning it to user, the page is not only zeroed
out, its tags are also cleared to 0.
>>> Since we already need such loop in the kernel, we might as well allow
>>> user space to require a certain colour. This comes in handy for large
>>> malloc() and another advantage is that the C library won't be stuck
>>> trying to paint the whole range (think GB).
>>
>> If kernel is going to pre-color all pages in a vma, we will need to
>> store the default tag in the vma. It will add more time to page fault
>> handling code. On sparc M7, kernel will need to execute additional 128
>> stxa instructions to set the tags on a normal page.
>
> As I said, since the user can retrieve an old colour using ldxa, the
> kernel should perform this operation anyway on any newly allocated page
> (unless you clear the existing colour on page freeing).>
Tags are not cleared on sparc on freeing. They get cleared when the page
is allocated again.
>>>> We can try to store tags for an entire region in vma but that is
>>>> expensive, plus on sparc tags are set in userspace with no
>>>> participation from kernel and now we need a way for userspace to
>>>> communicate the tags to kernel.
>>>
>>> We can't support finer granularity through the mmap() syscall and, as
>>> you said, the vma is not the right thing to store the individual tags.
>>> With the above extension to mmap(), we'd have to store a colour per vma
>>> and prevent merging if different colours (we could as well use the
>>> pkeys mechanism we already have in the kernel but use a colour per vma
>>> instead of a key).
>>
>> Since tags can change on any part of mmap region on sparc at any time
>> without kernel being involved, I am not sure I see much reason for
>> kernel to enforce any tag related restrictions.
>
> It's not enforcing a tag, more like the default colour for a faulted in
> page. Anyway, if sparc is going with default 0/untagged, that's fine as
> well. We may add this mmap() option to arm64 only.
>
>>>> From sparc point of view, making kernel responsible for assigning tags
>>>> to a page on page fault is full of pitfalls.
>>>
>>> This could be just some arm64-specific but if you plan to deploy it more
>>> generically for sparc (at the C library level), you may find this
>>> useful.
>>
>> Common semantics from app developer point of view will be very useful to
>> maintain. If arm64 says mmap with MAP_FIXED and a tagged address will
>> return a pre-colored page, I would rather have it be the same on any
>> architecture. Is there a use case that justifies kernel doing this extra
>> work?
>
> So if a database program is doing an anonymous mmap(PROT_TBI) of 100GB,
> IIUC for sparc the faulted-in pages will have random colours (on 64-byte
> granularity). Ignoring the information leak from prior uses of such
> pages, it would be the responsibility of the db program to issue the
> stxa. On arm64, since we also want to do this via malloc(), any large
> allocation would require all pages to be faulted in so that malloc() can
> set the write colour before being handed over to the user. That's what
> we want to avoid and the user is free to repaint the memory as it likes.
>
On sparc, any newly allocated page is cleared along with any old tags on
it. Since clearing tag happens automatically when page is cleared on
sparc, clear_user_page() will need to execute additional stxa
instructions to set a new tag. It is doable. In a way it is done already
if page is being pre-colored with tag 0 always ;) Where would the
pre-defined tag be stored - as part of address stored in vm_start or a
new field in vm_area_struct?
--
Khalid
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls
2019-05-30 16:05 ` Khalid Aziz@ 2019-05-30 16:57 ` Catalin Marinas0 siblings, 0 replies; 99+ messages in thread
From: Catalin Marinas @ 2019-05-30 16:57 UTC (permalink / raw)
To: Khalid Aziz
Cc: Mark Rutland, kvm, Christian Koenig, Szabolcs Nagy, Will Deacon,
dri-devel, linux-mm, Lee Smith, linux-kselftest,
Vincenzo Frascino, Jacob Bramley, Leon Romanovsky, linux-rdma,
amd-gfx, linux-arm-kernel, Dave Martin, Evgeniy Stepanov,
linux-media, Kees Cook, Ruben Ayrapetyan, Andrey Konovalov,
Kevin Brodsky, Alex Williamson, Mauro Carvalho Chehab,
Dmitry Vyukov, Kostya Serebryany, Greg Kroah-Hartman,
Felix Kuehling, linux-kernel, Jens Wiklander,
Ramana Radhakrishnan, Alexander Deucher, Andrew Murray,
Andrew Morton, Robin Murphy, Yishai Hadas, Luc Van Oostenryck
On Thu, May 30, 2019 at 10:05:55AM -0600, Khalid Aziz wrote:
> On 5/30/19 9:11 AM, Catalin Marinas wrote:
> > So if a database program is doing an anonymous mmap(PROT_TBI) of 100GB,
> > IIUC for sparc the faulted-in pages will have random colours (on 64-byte
> > granularity). Ignoring the information leak from prior uses of such
> > pages, it would be the responsibility of the db program to issue the
> > stxa. On arm64, since we also want to do this via malloc(), any large
> > allocation would require all pages to be faulted in so that malloc() can
> > set the write colour before being handed over to the user. That's what
> > we want to avoid and the user is free to repaint the memory as it likes.
>
> On sparc, any newly allocated page is cleared along with any old tags on
> it. Since clearing tag happens automatically when page is cleared on
> sparc, clear_user_page() will need to execute additional stxa
> instructions to set a new tag. It is doable. In a way it is done already
> if page is being pre-colored with tag 0 always ;)
Ah, good to know. On arm64 we'd have to use different instructions,
although the same loop.
> Where would the pre-defined tag be stored - as part of address stored
> in vm_start or a new field in vm_area_struct?
I think we can discuss the details when we post the actual MTE patches.
In our internal hack we overloaded the VM_HIGH_ARCH_* flags and selected
CONFIG_ARCH_USES_HIGH_VMA_FLAGS (used for pkeys on x86).
For the time being, I'd rather restrict tagged addresses passed to
mmap() until we agreed that they have any meaning. If we allowed them
now but get ignored (though probably no-one would be doing this), I feel
it's slightly harder to change the semantics afterwards.
Thanks.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread

*Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel
2019-05-28 14:14 ` Andrey Konovalov
2019-05-29 6:11 ` Christoph Hellwig@ 2019-05-30 17:15 ` Catalin Marinas
2019-05-31 14:29 ` Andrey Konovalov1 sibling, 1 reply; 99+ messages in thread
From: Catalin Marinas @ 2019-05-30 17:15 UTC (permalink / raw)
To: Andrey Konovalov
Cc: Mark Rutland, kvm, Szabolcs Nagy, Will Deacon, dri-devel,
Linux Memory Management List, Khalid Aziz,
open list:KERNEL SELFTEST FRAMEWORK, Vincenzo Frascino,
Jacob Bramley, Leon Romanovsky, linux-rdma, amd-gfx,
Dmitry Vyukov, Dave Martin, Evgenii Stepanov, linux-media,
Kevin Brodsky, Kees Cook, Ruben Ayrapetyan, Ramana Radhakrishnan,
Alex Williamson, Yishai Hadas, Mauro Carvalho Chehab, Linux ARM,
Kostya Serebryany, Greg Kroah-Hartman, Felix Kuehling, LKML,
Jens Wiklander, Lee Smith, Alexander Deucher, Andrew Morton,
Elliott Hughes, Robin Murphy, Christian Koenig,
Luc Van Oostenryck
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> Thanks for a lot of valuable input! I've read through all the replies
> and got somewhat lost. What are the changes I need to do to this
> series?
>
> 1. Should I move untagging for memory syscalls back to the generic
> code so other arches would make use of it as well, or should I keep
> the arm64 specific memory syscalls wrappers and address the comments
> on that patch?
Keep them generic again but make sure we get agreement with Khalid on
the actual ABI implications for sparc.
> 2. Should I make untagging opt-in and controlled by a command line argument?
Opt-in, yes, but per task rather than kernel command line option.
prctl() is a possibility of opting in.
> 3. Should I "add Documentation/core-api/user-addresses.rst to describe
> proper care and handling of user space pointers with untagged_addr(),
> with examples based on all the cases seen so far in this series"?
> Which examples specifically should it cover?
I think we can leave 3 for now as not too urgent. What I'd like is for
Vincenzo's TBI user ABI document to go into a more common place since we
can expand it to cover both sparc and arm64. We'd need an arm64-specific
doc as well for things like prctl() and later MTE that sparc may support
differently.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel^permalinkrawreply [flat|nested] 99+ messages in thread