*[RFC v1 PATCH 00/17] prmem: protected memory
2018-10-23 21:34 Igor Stoppa
  2018-10-23 21:34 ` [PATCH 01/17] prmem: linker section for static write rare Igor Stoppa
  ` (17 more replies)
0 siblings, 18 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, kernel-hardening, linux-integrity,
linux-security-module
Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott
-- Summary --
Preliminary version of the memory protection patchset, including a sample use
case: turning the IMA measurement list into write-rare memory.
The core idea is to introduce two new types of memory protection, besides
const and __ro_after_init, which will support:
- statically allocated "write rare" memory
- dynamically allocated "read only" and "write rare" memory
On top of that follows a set of patches which create a "write rare"
counterpart of the kernel infrastructure used in the example chosen for
hardening: the IMA measurement list.
-- Mechanism --
Statically allocated protected memory is identified by the __wr_after_init
tag, which will cause the linker to place it in a special section.
Dynamically allocated memory is obtained through vmalloc, but each
allocation is compacted, where possible, into the most recently obtained
vmap_area.
The write rare mechanism is implemented by creating a temporary alternate
writable mapping, applying the change through this mapping and then
removing it.
All of this is possible thanks to the system MMU, which must be able to
provide write protection.
-- Brief history --
I sent out various versions of memory protection over the last year or so,
however this patchset is significantly expanded, including several helper
data structures and a use case, so I decided to reset the numbering to v1.
For reference, the latest "old" version is here [1].
The current version is not yet ready for merging, but I think it is
sufficiently complete to support an end-to-end discussion.
Eventually, I plan to write a white paper, once the code is in better shape.
In the meantime, an overview can be gathered from these slides [2], which are
the support material for my presentation at the Linux Security Summit 2018
Europe.
-- Validation --
Most of the testing is done on a Fedora image, with QEMU x86_64,
however the code has also been tested on a real x86_64 PC, yielding
similar positive results.
For ARM64, I use a custom Debian installation, still with QEMU, but I
have observed similar failures when testing on a real device, using a
Kirin 970.
I have written some test cases for the most basic parts; the behaviour
of IMA, and of the Fedora image in general, does not seem to be
negatively affected when used in conjunction with this patchset.
However, this is far from exhaustive testing, and the RCU torture test
is completely missing.
-- Known Issues --
As mentioned, this version is preliminary and certain parts need rework.
This is a short and incomplete list of known issues:
* arm64 support is broken for __wr_after_init
I must create a separate section with proper mappings, similar to the
ones used for vmalloc()
* alignment of data structures has not been thoroughly checked
There are probably several redundant forced alignments
* there is no fallback for platforms missing MMU write protection
* some additional care might be needed when dealing with double mapping vs
data cache coherency, on multicore systems
* lots of additional (stress) tests are needed
* memory reuse (object caches) is probably needed, to support converting
more use cases, and with them additional data structures.
* credits for original code: I have reimplemented various data structures,
and I am not sure whether I have credited the original authors correctly.
* documentation for the re-implemented data structures is missing
* confirm that the hardened usercopy logic is correct
-- Q&As --
During reviews of the older patchset, several objections and questions
were formulated.
They are collected here in Q&A format, with both some old and new answers:
1 - "The protection can still be undone"
Yes, it is true. Using a hypervisor, as is done in certain Huawei and
Samsung phones, provides a better level of protection.
However, even without that, it still gives a significantly better level of
protection than not protecting the memory at all.
The main advantage of this patchset is that the attack now has to focus on
the page table, which is a significantly smaller area than the whole of
kernel data.
It is my intention, eventually, to also provide support for interaction
with a FOSS hypervisor (e.g. KVM), but this patchset should also support
those cases where it is not even possible to have a hypervisor.
So it seems simpler to start from there. The hypervisor is not mandatory.
2 - "Do not provide a chain of trust, but protect some memory and refer to
it with a writable pointer."
This might be ok for protecting against bugs, but in the case of an
attacker trying to compromise the system, the unprotected pointer has
become the new target. It doesn't change much.
Samsung does use a similar implementation for protecting LSM hooks;
however, that solution also adds a pointer from the protected memory back
to the writable memory, as a validation loop. The price to pay is that
every time the unprotected pointer must be used, it first has to be
validated: it must point into a certain memory range and have a specific
alignment. It is an alternative to the full chain of trust, and each
approach has its specific advantages, depending on the data structures
one wants to protect.
3 - "Do not use a secondary mapping, unprotect the current one"
The purpose of the secondary mapping is to create a hard-to-spot window of
writability at a random address, which cannot be easily exploited.
Unprotecting the primary mapping would allow an attack where a core
busy-loops, waiting for a specific location to become writable, racing
against the legitimate writer. For the same reason, interrupts are
disabled on the core performing the write-rare operation.
4 - "Do not create another allocator over vmalloc(), use it plain"
This is not good for various reasons:
a) vmalloc() allocates at least one page for every request it receives,
typically leaving most of the page unused. While this might not be a big
deal on large systems, on IoT-class devices it is common to find
relatively powerful cores paired with relatively little memory.
Taking as an example a system using SELinux, a relatively small set of
rules can generate a few thousand allocations (SELinux is deny-by-default).
Modeling each allocation as roughly 64 bytes, on a system with 4 kB pages,
and assuming a grand total of 100k allocations, one page per allocation
means:
100k * 4 kB = 390 MB
while packing the allocations into 64-byte slots within pages yields:
100k * 64 B = 6 MB
The first case would not be very compatible with a system having only
512MB or 1GB.
b) even worse, the amount of TLB thrashing would be terrible, with each
allocation having its own translation.
--
Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
-- References --
[1]: https://lkml.org/lkml/2018/4/23/508
[2]: https://events.linuxfoundation.org/wp-content/uploads/2017/12/Kernel-Hardening-Protecting-the-Protection-Mechanisms-Igor-Stoppa-Huawei.pdf
-- List of patches --
[PATCH 01/17] prmem: linker section for static write rare
[PATCH 02/17] prmem: write rare for static allocation
[PATCH 03/17] prmem: vmalloc support for dynamic allocation
[PATCH 04/17] prmem: dynamic allocation
[PATCH 05/17] prmem: shorthands for write rare on common types
[PATCH 06/17] prmem: test cases for memory protection
[PATCH 07/17] prmem: lkdtm tests for memory protection
[PATCH 08/17] prmem: struct page: track vmap_area
[PATCH 09/17] prmem: hardened usercopy
[PATCH 10/17] prmem: documentation
[PATCH 11/17] prmem: llist: use designated initializer
[PATCH 12/17] prmem: linked list: set alignment
[PATCH 13/17] prmem: linked list: disable layout randomization
[PATCH 14/17] prmem: llist, hlist, both plain and rcu
[PATCH 15/17] prmem: test cases for prlist and prhlist
[PATCH 16/17] prmem: pratomic-long
[PATCH 17/17] prmem: ima: turn the measurements list write rare
-- Diffstat --
Documentation/core-api/index.rst | 1 +
Documentation/core-api/prmem.rst | 172 +++++
MAINTAINERS | 14 +
drivers/misc/lkdtm/core.c | 13 +
drivers/misc/lkdtm/lkdtm.h | 13 +
drivers/misc/lkdtm/perms.c | 248 +++++++
include/asm-generic/vmlinux.lds.h | 20 +
include/linux/cache.h | 17 +
include/linux/list.h | 5 +-
include/linux/mm_types.h | 25 +-
include/linux/pratomic-long.h | 73 ++
include/linux/prlist.h | 934 ++++++++++++++++++++++++
include/linux/prmem.h | 446 +++++++++++
include/linux/prmemextra.h | 133 ++++
include/linux/types.h | 20 +-
include/linux/vmalloc.h | 11 +-
lib/Kconfig.debug | 9 +
lib/Makefile | 1 +
lib/test_prlist.c | 252 +++++++
mm/Kconfig | 6 +
mm/Kconfig.debug | 9 +
mm/Makefile | 2 +
mm/prmem.c | 273 +++++++
mm/test_pmalloc.c | 629 ++++++++++++++++
mm/test_write_rare.c | 236 ++++++
mm/usercopy.c | 5 +
mm/vmalloc.c | 7 +
security/integrity/ima/ima.h | 18 +-
security/integrity/ima/ima_api.c | 29 +-
security/integrity/ima/ima_fs.c | 12 +-
security/integrity/ima/ima_main.c | 6 +
security/integrity/ima/ima_queue.c | 28 +-
security/integrity/ima/ima_template.c | 14 +-
security/integrity/ima/ima_template_lib.c | 14 +-
34 files changed, 3635 insertions(+), 60 deletions(-)

*Re: [PATCH 02/17] prmem: write rare for static allocation
2018-10-25  0:24 ` Dave Hansen
@ 2018-10-29 18:03 ` Igor Stoppa
0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 18:03 UTC (permalink / raw)
To: Dave Hansen, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, kernel-hardening, linux-integrity,
linux-security-module
Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
Pavel Tatashin, linux-mm, linux-kernel
On 25/10/2018 01:24, Dave Hansen wrote:
>> +static __always_inline bool __is_wr_after_init(const void *ptr, size_t size)
>> +{
>> + size_t start = (size_t)&__start_wr_after_init;
>> + size_t end = (size_t)&__end_wr_after_init;
>> + size_t low = (size_t)ptr;
>> + size_t high = (size_t)ptr + size;
>> +
>> + return likely(start <= low && low < high && high <= end);
>> +}
>
> size_t is an odd type choice for doing address arithmetic.
it seemed more portable than unsigned long
>> +/**
>> + * wr_memset() - sets n bytes of the destination to the c value
>> + * @dst: beginning of the memory to write to
>> + * @c: byte to replicate
>> + * @size: amount of bytes to copy
>> + *
>> + * Returns true on success, false otherwise.
>> + */
>> +static __always_inline
>> +bool wr_memset(const void *dst, const int c, size_t n_bytes)
>> +{
>> + size_t size;
>> + unsigned long flags;
>> + uintptr_t d = (uintptr_t)dst;
>> +
>> + if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
>> + return false;
>> + while (n_bytes) {
>> + struct page *page;
>> + uintptr_t base;
>> + uintptr_t offset;
>> + uintptr_t offset_complement;
>
> Again, these are really odd choices for types. vmap() returns a void*
> pointer, on which you can do arithmetic.
I wasn't sure how much I could rely on the compiler not performing
unwanted optimizations.
> Why bother keeping another
> type to which you have to cast to and from?
For the above reason. If I'm worrying unnecessarily, I can switch back
to void *.
It is certainly easier to use.
> BTW, our usual "pointer stored in an integer type" is 'unsigned long',
> if a pointer needs to be manipulated.
Yes, I noticed that, but it seemed strange ...
size_t corresponds to unsigned long, afaik, but it seems that I have not
fully understood where each is meant to be used.
Anyway, I can stick to the convention and use unsigned long.
>
>> + local_irq_save(flags);
>
> Why are you doing the local_irq_save()?
The idea was to avoid the case where an attack would somehow freeze the
core doing the write-rare operation while the temporary mapping is
accessible.
I have seen comments about using mappings that are private to the
current core (and I will reply to those comments as well), but this
approach seems architecture-dependent, while I was looking for a
solution that, albeit not 100% reliable, would work on any system with
an MMU. This would not prevent each arch from coming up with its own
custom implementation that provides better coverage, performance, etc.
>> + page = virt_to_page(d);
>> + offset = d & ~PAGE_MASK;
>> + offset_complement = PAGE_SIZE - offset;
>> + size = min(n_bytes, offset_complement);
>> + base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
>
> Can you even call vmap() (which sleeps) with interrupts off?
I had accidentally disabled the sleep-inside-atomic debugging, so I
totally missed this problem :-(
However, to answer your question: nothing exploded while I was testing
(without that type of debugging).
I suspect I was just "lucky", or perhaps I simply did not trigger the
sleeping sub-case.
As I understand the code, sleeping _might_ happen, but it is not going
to happen systematically.
I wonder if I could split vmap() into two parts: first the sleeping one,
with interrupts enabled, then the non-sleeping one, with interrupts
disabled.
I need to read the code more carefully, but it seems that sleeping might
happen when memory for the mapping metadata is not immediately available.
BTW, wouldn't the might_sleep() call belong more to the part which
really sleeps, rather than to the whole of vmap()?
>> + if (WARN(!base, WR_ERR_PAGE_MSG)) {
>> + local_irq_restore(flags);
>> + return false;
>> + }
>
> You really need some kmap_atomic()-style accessors to wrap this stuff
> for you. This little pattern is repeated over and over.
I really need to learn more about the way the kernel works and is
structured. It's a work in progress. Thanks for the advice.
> ...
>> +const char WR_ERR_RANGE_MSG[] = "Write rare on invalid memory range.";
>> +const char WR_ERR_PAGE_MSG[] = "Failed to remap write rare page.";
>
> Doesn't the compiler de-duplicate duplicated strings for you? Is there
> any reason to declare these like this?
I noticed I had made some accidental modifications in a couple of
cases, when replicating the command.
So I thought: if I really want to use the same string, why not do it
explicitly? It also seemed easier in case I want to tweak the message:
I need to change it in only one place.
--
igor

*Re: [PATCH 05/17] prmem: shorthands for write rare on common types
2018-10-25  0:28 ` Dave Hansen
@ 2018-10-29 18:12 ` Igor Stoppa
0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 18:12 UTC (permalink / raw)
To: Dave Hansen, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, kernel-hardening, linux-integrity,
linux-security-module
Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
Pavel Tatashin, linux-mm, linux-kernel
On 25/10/2018 01:28, Dave Hansen wrote:
> On 10/23/18 2:34 PM, Igor Stoppa wrote:
>> Wrappers around the basic write rare functionality, addressing several
>> common data types found in the kernel, allowing to specify the new
>> values through immediates, like constants and defines.
>
> I have to wonder whether this is the right way, or whether a
> size-agnostic function like put_user() is the right way. put_user() is
> certainly easier to use.
I definitely did not like it either.
But it was the best that came to my mind ...
The main purpose of this code was to show what I wanted to do.
Once more, thanks for pointing out a better way to do it.
--
igor

*Re: [PATCH 08/17] prmem: struct page: track vmap_area
2018-10-25  2:13 ` Matthew Wilcox
@ 2018-10-29 18:21 ` Igor Stoppa
0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 18:21 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Mimi Zohar, Kees Cook, Dave Chinner, James Morris, Michal Hocko,
kernel-hardening, linux-integrity, linux-security-module,
igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
Pavel Tatashin, linux-mm, linux-kernel
On 25/10/2018 03:13, Matthew Wilcox wrote:
> On Thu, Oct 25, 2018 at 02:01:02AM +0300, Igor Stoppa wrote:
>>>> @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>>>> if (!addr)
>>>> return NULL;
>>>> + va = __find_vmap_area((unsigned long)addr);
>>>> + for (i = 0; i < va->vm->nr_pages; i++)
>>>> + va->vm->pages[i]->area = va;
>>>
>>> I don't like it that you're calling this for _every_ vmalloc() caller
>>> when most of them will never use this. Perhaps have page->va be initially
>>> NULL and then cache the lookup in it when it's accessed for the first time.
>>>
>>
>> If __find_vmap_area() was part of the API, this loop could be left out from
>> __vmalloc_node_range() and the user of the allocation could initialize the
>> field, if needed.
>>
>> What is the reason for keeping __find_vmap_area() private?
>
> Well, for one, you're walking the rbtree without holding the spinlock,
> so you're going to get crashes. I don't see why we shouldn't export
> find_vmap_area() though.
Argh, yes, sorry. But find_vmap_area() would be enough for what I need.
> Another way we could approach this is to embed the vmap_area in the
> vm_struct. It'd require a bit of juggling of the alloc/free paths in
> vmalloc, but it might be worthwhile.
I have a feeling of unease about the whole vmap_area / vm_struct
duality. They clearly are different types, with different purposes, yet
here and there functions are named after some "area" while actually
referring to vm_struct pointers.
I wouldn't mind spending some time understanding the reason for this
apparently bizarre choice, but only after I have emerged from the prmem
swamp.
For now I'd stick to find_vmap_area().
--
igor

*[PATCH 10/17] prmem: documentation
2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
` (8 preceding siblings ...)
2018-10-23 21:34 ` [PATCH 09/17] prmem: hardened usercopy Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
2018-10-24 3:48 ` Randy Dunlap
` (2 more replies)
2018-10-23 21:34 ` [PATCH 11/17] prmem: llist: use designated initializer Igor Stoppa
` (7 subsequent siblings)
17 siblings, 3 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, kernel-hardening, linux-integrity,
linux-security-module
Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, linux-doc, linux-kernel
Documentation for protected memory.
Topics covered:
* static memory allocation
* dynamic memory allocation
* write-rare
Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Randy Dunlap <rdunlap@infradead.org>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
Documentation/core-api/index.rst | 1 +
Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++
MAINTAINERS | 1 +
3 files changed, 174 insertions(+)
create mode 100644 Documentation/core-api/prmem.rst
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 26b735cefb93..1a90fa878d8d 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
gfp_mask-from-fs-io
timekeeping
boot-time-mm
+ prmem
Interfaces for kernel debugging
===============================
diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
new file mode 100644
index 000000000000..16d7edfe327a
--- /dev/null
+++ b/Documentation/core-api/prmem.rst
@@ -0,0 +1,172 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _prmem:
+
+Memory Protection
+=================
+
+:Date: October 2018
+:Author: Igor Stoppa <igor.stoppa@huawei.com>
+
+Foreword
+--------
+- In a typical system using some sort of RAM as execution environment,
+ **all** memory is initially writable.
+
+- It must be initialized with the appropriate content, be it code or data.
+
+- Said content typically undergoes modifications, i.e. relocations or
+ relocation-induced changes.
+
+- The present document doesn't address such transient.
+
+- Kernel code is protected at system level and, unlike data, it doesn't
+ require special attention.
+
+Protection mechanism
+--------------------
+
+- When available, the MMU can write protect memory pages that would be
+ otherwise writable.
+
+- The protection has page-level granularity.
+
+- An attempt to overwrite a protected page will trigger an exception.
+- **Write protected data must go exclusively to write protected pages**
+- **Writable data must go exclusively to writable pages**
+
+Available protections for kernel data
+-------------------------------------
+
+- **constant**
+ Labelled as **const**, the data is never supposed to be altered.
+ It is statically allocated - if it has any memory footprint at all.
+ The compiler can even optimize it away, where possible, by replacing
+ references to a **const** with its actual value.
+
+- **read only after init**
+ By tagging an otherwise ordinary statically allocated variable with
+ **__ro_after_init**, it is placed in a special segment that will
+ become write protected, at the end of the kernel init phase.
+ The compiler has no notion of this restriction and it will treat any
+ write operation on such variable as legal. However, assignments that
+ are attempted after the write protection is in place, will cause
+ exceptions.
+
+- **write rare after init**
+ This can be seen as variant of read only after init, which uses the
+ tag **__wr_after_init**. It is also limited to statically allocated
+ memory. It is still possible to alter this type of variables, after
+ the kernel init phase is complete, however it can be done exclusively
+ with special functions, instead of the assignment operator. Using the
+ assignment operator after conclusion of the init phase will still
+ trigger an exception. It is not possible to transition a certain
+ variable from __wr_after_init to a permanent read-only status, at
+ runtime.
+
+- **dynamically allocated write-rare / read-only**
+ After defining a pool, memory can be obtained through it, primarily
+ through the **pmalloc()** allocator. The exact writability state of the
+ memory obtained from **pmalloc()** and friends can be configured when
+ creating the pool. At any point it is possible to transition to a less
+ permissive write status the memory currently associated to the pool.
+ Once memory has become read-only, the only valid operation, besides
+ reading, is to release it, by destroying the pool it belongs to.
+
+
+Protecting dynamically allocated memory
+---------------------------------------
+
+When dealing with dynamically allocated memory, three options are
+ available for configuring its writability state:
+
+- **Options selected when creating a pool**
+ When creating the pool, it is possible to choose one of the following:
+ - **PMALLOC_MODE_RO**
+ - Writability at allocation time: *WRITABLE*
+ - Writability at protection time: *NONE*
+ - **PMALLOC_MODE_WR**
+ - Writability at allocation time: *WRITABLE*
+ - Writability at protection time: *WRITE-RARE*
+ - **PMALLOC_MODE_AUTO_RO**
+ - Writability at allocation time:
+ - the latest allocation: *WRITABLE*
+ - every other allocation: *NONE*
+ - Writability at protection time: *NONE*
+ - **PMALLOC_MODE_AUTO_WR**
+ - Writability at allocation time:
+ - the latest allocation: *WRITABLE*
+ - every other allocation: *WRITE-RARE*
+ - Writability at protection time: *WRITE-RARE*
+ - **PMALLOC_MODE_START_WR**
+ - Writability at allocation time: *WRITE-RARE*
+ - Writability at protection time: *WRITE-RARE*
+
+ **Remarks:**
+ - The "AUTO" modes perform automatic protection of the content, whenever
+ the current vmap_area is used up and a new one is allocated.
+ - At that point, the vmap_area being phased out is protected.
+ - The size of the vmap_area depends on various parameters.
+ - It might not be possible to know for sure *when* certain data will
+ be protected.
+ - The functionality is provided as tradeoff between hardening and speed.
+ - Its usefulness depends on the specific use case at hand
+ - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.
+
+- **Protecting the pool**
+ This is achieved with **pmalloc_protect_pool()**
+ - Any vmap_area currently in the pool is write-protected according to its initial configuration.
+ - Any residual space still available from the current vmap_area is lost, as the area is protected.
+ - **protecting a pool after every allocation will likely be very wasteful**
+ - Using PMALLOC_MODE_START_WR is likely a better choice.
+
+- **Upgrading the protection level**
+ This is achieved with **pmalloc_make_pool_ro()**
+ - it turns the present content of a write-rare pool into read-only
+ - can be useful when the content of the memory has settled
+
+
+Caveats
+-------
+- Freeing of memory is not supported. Pages will be returned to the
+ system upon destruction of their memory pool.
+
+- The address range available for vmalloc (and thus for pmalloc too) is
+ limited, on 32-bit systems. However it shouldn't be an issue, since not
+ much data is expected to be dynamically allocated and turned into
+ write-protected.
+
+- Regarding SMP systems, changing state of pages and altering mappings
+ requires performing cross-processor synchronizations of page tables.
+ This is an additional reason for limiting the use of write rare.
+
+- Not only the pmalloc memory must be protected, but also any reference to
+ it that might become the target for an attack. The attack would replace
+ a reference to the protected memory with a reference to some other,
+ unprotected, memory.
+
+- The users of rare write must take care of ensuring the atomicity of the
+ action, respect to the way they use the data being altered; for example,
+ take a lock before making a copy of the value to modify (if it's
+ relevant), then alter it, issue the call to rare write and finally
+ release the lock. Some special scenario might be exempt from the need
+ for locking, but in general rare-write must be treated as an operation
+ that can incur into races.
+
+- pmalloc relies on virtual memory areas and will therefore use more
+ TLB entries. It still does a better job of it, compared to invoking
+ vmalloc for each allocation, but it is undeniably less optimized for
+ TLB use than using the physmap directly, through kmalloc or similar.
+
+
+Utilization
+-----------
+
+**add examples here**
+
+API
+---
+
+.. kernel-doc:: include/linux/prmem.h
+.. kernel-doc:: mm/prmem.c
+.. kernel-doc:: include/linux/prmemextra.h
diff --git a/MAINTAINERS b/MAINTAINERS
index ea979a5a9ec9..246b1a1cc8bb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9463,6 +9463,7 @@ F: include/linux/prmemextra.h
F: mm/prmem.c
F: mm/test_write_rare.c
F: mm/test_pmalloc.c
+F: Documentation/core-api/prmem.rst
MEMORY MANAGEMENT
L: linux-mm@kvack.org
--
2.17.1

*Re: [PATCH 10/17] prmem: documentation
2018-10-23 21:34 ` [PATCH 10/17] prmem: documentation Igor Stoppa
@ 2018-10-24 3:48 ` Randy Dunlap
2018-10-24 14:30 ` Igor Stoppa
2018-10-24 23:04 ` Mike Rapoport
2018-10-26  9:26 ` Peter Zijlstra
2 siblings, 1 reply; 140+ messages in thread
From: Randy Dunlap @ 2018-10-24 3:48 UTC (permalink / raw)
To: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, kernel-hardening, linux-integrity,
linux-security-module
Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Mike Rapoport, linux-doc, linux-kernel
Hi,
On 10/23/18 2:34 PM, Igor Stoppa wrote:
> Documentation for protected memory.
>
> Topics covered:
> * static memory allocation
> * dynamic memory allocation
> * write-rare
>
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Jonathan Corbet <corbet@lwn.net>
> CC: Randy Dunlap <rdunlap@infradead.org>
> CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> CC: linux-doc@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
> Documentation/core-api/index.rst | 1 +
> Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++
> MAINTAINERS | 1 +
> 3 files changed, 174 insertions(+)
> create mode 100644 Documentation/core-api/prmem.rst
> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa@huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> + **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, i.e. relocations or
> + relocation-induced changes.
> +
> +- The present document doesn't address such transient.
transience.
> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> + require special attention.
> +
> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> + otherwise writable.
> +
> +- The protection has page-level granularity.
> +
> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**
> +
> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> + Labelled as **const**, the data is never supposed to be altered.
> + It is statically allocated - if it has any memory footprint at all.
> + The compiler can even optimize it away, where possible, by replacing
> + references to a **const** with its actual value.
> +
> +- **read only after init**
> + By tagging an otherwise ordinary statically allocated variable with
> + **__ro_after_init**, it is placed in a special segment that will
> + become write protected, at the end of the kernel init phase.
> + The compiler has no notion of this restriction and it will treat any
> + write operation on such variable as legal. However, assignments that
> + are attempted after the write protection is in place, will cause
no comma.
> + exceptions.
> +
> +- **write rare after init**
> + This can be seen as variant of read only after init, which uses the
> + tag **__wr_after_init**. It is also limited to statically allocated
> + memory. It is still possible to alter this type of variables, after
> + the kernel init phase is complete, however it can be done exclusively
> + with special functions, instead of the assignment operator. Using the
> + assignment operator after conclusion of the init phase will still
> + trigger an exception. It is not possible to transition a certain
> + variable from __wr_ater_init to a permanent read-only status, at
> + runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> + After defining a pool, memory can be obtained through it, primarily
> + through the **pmalloc()** allocator. The exact writability state of the
> + memory obtained from **pmalloc()** and friends can be configured when
> + creating the pool. At any point it is possible to transition to a less
> + permissive write status the memory currently associated to the pool.
> + Once memory has become read-only, it the only valid operation, beside
> + reading, is to released it, by destroying the pool it belongs to.
> +
> +
> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> + available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> + When creating the pool, it is possible to choose one of the following:
> + - **PMALLOC_MODE_RO**
> + - Writability at allocation time: *WRITABLE*
> + - Writability at protection time: *NONE*
> + - **PMALLOC_MODE_WR**
> + - Writability at allocation time: *WRITABLE*
> + - Writability at protection time: *WRITE-RARE*
> + - **PMALLOC_MODE_AUTO_RO**
> + - Writability at allocation time:
> + - the latest allocation: *WRITABLE*
> + - every other allocation: *NONE*
> + - Writability at protection time: *NONE*
> + - **PMALLOC_MODE_AUTO_WR**
> + - Writability at allocation time:
> + - the latest allocation: *WRITABLE*
> + - every other allocation: *WRITE-RARE*
> + - Writability at protection time: *WRITE-RARE*
> + - **PMALLOC_MODE_START_WR**
> + - Writability at allocation time: *WRITE-RARE*
> + - Writability at protection time: *WRITE-RARE*
> +
> + **Remarks:**
> + - The "AUTO" modes perform automatic protection of the content, whenever
> + the current vmap_area is used up and a new one is allocated.
> + - At that point, the vmap_area being phased out is protected.
> + - The size of the vmap_area depends on various parameters.
> + - It might not be possible to know for sure *when* certain data will
> + be protected.
> + - The functionality is provided as tradeoff between hardening and speed.
> + - Its usefulness depends on the specific use case at hand
end above sentence with a period, please, like all of the others above it.
> + - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.
Please try to keep the line above and a few below to < 80 characters in length.
(because some of us read rst files as text files, with a text editor, and line
wrap is ugly)
> +
> +- **Protecting the pool**
> + This is achieved with **pmalloc_protect_pool()**
> + - Any vmap_area currently in the pool is write-protected according to its initial configuration.
> + - Any residual space still available from the current vmap_area is lost, as the area is protected.
> + - **protecting a pool after every allocation will likely be very wasteful**
> + - Using PMALLOC_MODE_START_WR is likely a better choice.
> +
> +- **Upgrading the protection level**
> + This is achieved with **pmalloc_make_pool_ro()**
> + - it turns the present content of a write-rare pool into read-only
> + - can be useful when the content of the memory has settled
> +
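
Taken together, the lifecycle of a pool might look like this (pseudocode;
pmalloc(), pmalloc_protect_pool() and pmalloc_make_pool_ro() are named in
the document, while the create/destroy function names and the pmalloc()
signature are guesses based on the text):

```
struct pmalloc_pool *pool;

pool = pmalloc_create_pool(PMALLOC_MODE_START_WR); /* protected from the start */
struct my_data *d = pmalloc(pool, sizeof(*d));

/* ... fill in d through write-rare helpers ... */

pmalloc_protect_pool(pool);     /* protect any still-writable vmap_area */
pmalloc_make_pool_ro(pool);     /* optional: upgrade write-rare to RO   */

/* reading remains valid; freeing happens only wholesale: */
pmalloc_destroy_pool(pool);
```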
> +
> +Caveats
> +-------
> +- Freeing of memory is not supported. Pages will be returned to the
> + system upon destruction of their memory pool.
> +
> +- The address range available for vmalloc (and thus for pmalloc too) is
> + limited, on 32-bit systems. However it shouldn't be an issue, since not
> + much data is expected to be dynamically allocated and turned into
> + write-protected.
> +
> +- Regarding SMP systems, changing state of pages and altering mappings
> + requires performing cross-processor synchronizations of page tables.
> + This is an additional reason for limiting the use of write rare.
> +
> +- Not only the pmalloc memory must be protected, but also any reference to
> + it that might become the target for an attack. The attack would replace
> + a reference to the protected memory with a reference to some other,
> + unprotected, memory.
> +
> +- The users of rare write must take care of ensuring the atomicity of the
s/rare write/write rare/ ?
> + action, respect to the way they use the data being altered; for example,
This .. "respect to the way" is awkward, but I don't know what to
change it to.
> + take a lock before making a copy of the value to modify (if it's
> + relevant), then alter it, issue the call to rare write and finally
> + release the lock. Some special scenario might be exempt from the need
> + for locking, but in general rare-write must be treated as an operation
It seemed to me that "write-rare" (or write rare) was the going name, but now
it's being called "rare write" (or rare-write). Just be consistent, please.
> + that can incur into races.
> +
> +- pmalloc relies on virtual memory areas and will therefore use more
> + tlb entries. It still does a better job of it, compared to invoking
TLB
> + vmalloc for each allocation, but it is undeniably less optimized wrt to
s/wrt/with respect to/
> + TLB use than using the physmap directly, through kmalloc or similar.
> +
> +
> +Utilization
> +-----------
> +
> +**add examples here**
> +
> +API
> +---
> +
> +.. kernel-doc:: include/linux/prmem.h
> +.. kernel-doc:: mm/prmem.c
> +.. kernel-doc:: include/linux/prmemextra.h
Thanks for the documentation.
--
~Randy

*Re: [PATCH 10/17] prmem: documentation
From: Igor Stoppa @ 2018-10-24 14:30 UTC (permalink / raw)
To: Randy Dunlap, Mimi Zohar, Kees Cook, Matthew Wilcox,
Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
linux-integrity, linux-security-module
Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Mike Rapoport, linux-doc, linux-kernel
Hi,
On 24/10/18 06:48, Randy Dunlap wrote:
> Hi,
>
> On 10/23/18 2:34 PM, Igor Stoppa wrote:
[...]
>> +- The present document doesn't address such transient.
>
> transience.
ok
[...]
>> + are attempted after the write protection is in place, will cause
>
> no comma.
ok
[...]
>> + - Its usefulness depends on the specific use case at hand
>
> end above sentence with a period, please, like all of the others above it.
ok
>> + - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.
>
> Please try to keep the line above and a few below to < 80 characters in length.
> (because some of us read rst files as text files, with a text editor, and line
> wrap is ugly)
ok, I still have to master .rst :-(
[...]
>> +- The users of rare write must take care of ensuring the atomicity of the
>
> s/rare write/write rare/ ?
thanks
>> + action, respect to the way they use the data being altered; for example,
>
> This .. "respect to the way" is awkward, but I don't know what to
> change it to.
>
>> + take a lock before making a copy of the value to modify (if it's
>> + relevant), then alter it, issue the call to rare write and finally
>> + release the lock. Some special scenario might be exempt from the need
>> + for locking, but in general rare-write must be treated as an operation
>
> It seemed to me that "write-rare" (or write rare) was the going name, but now
> it's being called "rare write" (or rare-write). Just be consistent, please.
write-rare it is, because it can be shortened as wr_xxx
rare_write becomes rw_xxx
which wrongly hints at read/write, which it definitely is not
>> + tlb entries. It still does a better job of it, compared to invoking
>
> TLB
ok
>> + vmalloc for each allocation, but it is undeniably less optimized wrt to
>
> s/wrt/with respect to/
yes
> Thanks for the documentation.
thanks for the review :-)
--
igor

*Re: [PATCH 10/17] prmem: documentation
From: Igor Stoppa @ 2018-10-29 19:05 UTC (permalink / raw)
To: Mike Rapoport
Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, kernel-hardening, linux-integrity,
linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
linux-kernel
Hello,
On 25/10/2018 00:04, Mike Rapoport wrote:
> I feel that foreword should include a sentence or two saying why we need
> the memory protection and when it can/should be used.
I was somewhat lost about the actual content of this sort of document.
In past reviews of older versions of the docs, I got the impression that
this should focus more on the API and how to use it.
And it seems that different people expect to find different types of
information here.
What is the best place to put a more extensive description of the
feature?
--
igor

*Re: [PATCH 10/17] prmem: documentation
From: Peter Zijlstra @ 2018-10-26 9:26 UTC (permalink / raw)
To: Igor Stoppa
Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, kernel-hardening, linux-integrity,
linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
linux-kernel
Jon,
So the below document is a prime example for why I think RST sucks. As a
text document readability is greatly diminished by all the markup
nonsense.
This stuff should not become write-only content like html and other
gunk. The actual text file is still the primary means of reading this.
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index 26b735cefb93..1a90fa878d8d 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
> gfp_mask-from-fs-io
> timekeeping
> boot-time-mm
> + prmem
>
> Interfaces for kernel debugging
> ===============================
> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa@huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> + **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, i.e. relocations or
> + relocation-induced changes.
> +
> +- The present document doesn't address such transient.
> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> + require special attention.
What does this even mean?
> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> + otherwise writable.
Again; what does this really want to say?
> +- The protection has page-level granularity.
I don't think Linux supports non-paging MMUs.
> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**
WTH is with all those ** ?
> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> + Labelled as **const**, the data is never supposed to be altered.
> + It is statically allocated - if it has any memory footprint at all.
> + The compiler can even optimize it away, where possible, by replacing
> + references to a **const** with its actual value.
> +
> +- **read only after init**
> + By tagging an otherwise ordinary statically allocated variable with
> + **__ro_after_init**, it is placed in a special segment that will
> + become write protected, at the end of the kernel init phase.
> + The compiler has no notion of this restriction and it will treat any
> + write operation on such variable as legal. However, assignments that
> + are attempted after the write protection is in place, will cause
> + exceptions.
> +
> +- **write rare after init**
> + This can be seen as variant of read only after init, which uses the
> + tag **__wr_after_init**. It is also limited to statically allocated
> + memory. It is still possible to alter this type of variables, after
> + the kernel init phase is complete, however it can be done exclusively
> + with special functions, instead of the assignment operator. Using the
> + assignment operator after conclusion of the init phase will still
> + trigger an exception. It is not possible to transition a certain
> + variable from __wr_ater_init to a permanent read-only status, at
> + runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> + After defining a pool, memory can be obtained through it, primarily
> + through the **pmalloc()** allocator. The exact writability state of the
> + memory obtained from **pmalloc()** and friends can be configured when
> + creating the pool. At any point it is possible to transition to a less
> + permissive write status the memory currently associated to the pool.
> + Once memory has become read-only, the only valid operation, besides
> + reading, is to release it, by destroying the pool it belongs to.
Can we ditch all the ** nonsense and put whitespace in there? More paragraphs
and whitespace are more good.
Also, I really don't like how you differentiate between static and
dynamic wr.
> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> + available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> + When creating the pool, it is possible to choose one of the following:
> + - **PMALLOC_MODE_RO**
> + - Writability at allocation time: *WRITABLE*
> + - Writability at protection time: *NONE*
> + - **PMALLOC_MODE_WR**
> + - Writability at allocation time: *WRITABLE*
> + - Writability at protection time: *WRITE-RARE*
> + - **PMALLOC_MODE_AUTO_RO**
> + - Writability at allocation time:
> + - the latest allocation: *WRITABLE*
> + - every other allocation: *NONE*
> + - Writability at protection time: *NONE*
> + - **PMALLOC_MODE_AUTO_WR**
> + - Writability at allocation time:
> + - the latest allocation: *WRITABLE*
> + - every other allocation: *WRITE-RARE*
> + - Writability at protection time: *WRITE-RARE*
> + - **PMALLOC_MODE_START_WR**
> + - Writability at allocation time: *WRITE-RARE*
> + - Writability at protection time: *WRITE-RARE*
That's just unreadable gibberish from here. Also what?
We already have RO, why do you need more RO?
> +
> + **Remarks:**
> + - The "AUTO" modes perform automatic protection of the content, whenever
> + the current vmap_area is used up and a new one is allocated.
> + - At that point, the vmap_area being phased out is protected.
> + - The size of the vmap_area depends on various parameters.
> + - It might not be possible to know for sure *when* certain data will
> + be protected.
Surely that is a problem?
> + - The functionality is provided as tradeoff between hardening and speed.
Which you fail to explain.
> + - Its usefulness depends on the specific use case at hand
How about you write sensible text inside the option descriptions
instead?
This is not a presentation; less bullets, more content.
> +- Not only the pmalloc memory must be protected, but also any reference to
> + it that might become the target for an attack. The attack would replace
> + a reference to the protected memory with a reference to some other,
> + unprotected, memory.
I still don't really understand the whole write-rare thing; how does it
really help? If we can write in kernel memory, we can write to
page-tables too.
And I don't think this document even begins to explain _why_ you're
doing any of this. How does it help?
> +- The users of rare write must take care of ensuring the atomicity of the
> + action, respect to the way they use the data being altered; for example,
> + take a lock before making a copy of the value to modify (if it's
> + relevant), then alter it, issue the call to rare write and finally
> + release the lock. Some special scenario might be exempt from the need
> + for locking, but in general rare-write must be treated as an operation
> + that can incur into races.
What?!

*Re: [PATCH 10/17] prmem: documentation
From: Kees Cook @ 2018-10-26 10:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Igor Stoppa, Mimi Zohar, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML
On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> I still don't really understand the whole write-rare thing; how does it
> really help? If we can write in kernel memory, we can write to
> page-tables too.
I can speak to the general goal. The specific implementation here may
not fit all the needs, but it's a good starting point for discussion.
One aspect of hardening the kernel against attack is reducing the
internal attack surface. Not all flaws are created equal, so there is
variation in what limitations an attacker may have when exploiting
flaws (not many flaws end up being a fully controlled "write anything,
anywhere, at any time"). By making the more sensitive data structures
of the kernel read-only, we reduce the risk of an attacker finding a
path to manipulating the kernel's behavior in a significant way.
Examples of typical sensitive targets are function pointers, security
policy, and page tables. Having these "read only at rest" makes them
much harder to control by an attacker using memory integrity flaws. We
already have the .rodata section for "const" things. Those are
trivially read-only. For example there is a long history of making
sure that function pointer tables are marked "const". However, some
things need to stay writable. Some of those in .data aren't changed
after __init, so we added the __ro_after_init which made them
read-only for the life of the system after __init. However, we're
still left with a lot of sensitive structures either in .data or
dynamically allocated, that need a way to be made read-only for most
of the time, but get written to during very specific times.
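To make the three static flavors concrete, here is a declaration-site
sketch (kernel-style pseudocode; wr_assign() is an invented placeholder
for the write-rare helpers, not an API from the series):

```
static const int max_entries = 64;    /* const: in .rodata, never written   */
static int boot_cfg __ro_after_init;  /* writable only until init completes */
static int policy __wr_after_init;    /* writable after init, but only via  */
                                      /* write-rare helpers                 */

/* After init:
 *   boot_cfg = 1;            -> faults, page is read-only forever
 *   policy = 1;              -> faults as well
 *   wr_assign(&policy, 1);   -> allowed (placeholder helper name)
 */
```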
The "write rarely" name itself may not sufficiently describe what is
wanted either (I'll take the blame for the inaccurate name), so I'm
open to new ideas there. The implementation requirements for the
"sensitive data read-only at rest" feature are rather tricky:
- allow writes only from specific places in the kernel
- keep those locations inline to avoid making them trivial ROP targets
- keep the writeability window open only to a single uninterruptable CPU
- fast enough to deal with page table updates
The proposal I made a while back only covered .data things (and used
x86-specific features). Igor's proposal builds on this by including a
way to do this with dynamic allocation too, which greatly expands the
scope of structures that can be protected. Given that the x86-only
method of write-window creation was firmly rejected, this is a new
proposal for how to do it (vmap window). Using switch_mm() has also
been suggested, etc.
We need to find a good way to do the write-windowing that works well
for static and dynamic structures _and_ for the page tables... this
continues to be tricky.
Making it resilient against ROP-style targets makes it difficult to
deal with certain data structures (like list manipulation). In my
earlier RFC, I tried to provide enough examples of where this could
get used to let people see some of the complexity[1]. Igor's series
expands this to even more examples using dynamic allocation.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/write-rarely
--
Kees Cook

*Re: [PATCH 10/17] prmem: documentation
From: Peter Zijlstra @ 2018-10-28 18:31 UTC (permalink / raw)
To: Kees Cook
Cc: Igor Stoppa, Mimi Zohar, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML
On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
> On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > I still don't really understand the whole write-rare thing; how does it
> > really help? If we can write in kernel memory, we can write to
> > page-tables too.
> One aspect of hardening the kernel against attack is reducing the
> internal attack surface. Not all flaws are created equal, so there is
> variation in what limitations an attacker may have when exploiting
> flaws (not many flaws end up being a fully controlled "write anything,
> anywhere, at any time"). By making the more sensitive data structures
> of the kernel read-only, we reduce the risk of an attacker finding a
> path to manipulating the kernel's behavior in a significant way.
>
> Examples of typical sensitive targets are function pointers, security
> policy, and page tables. Having these "read only at rest" makes them
> much harder to control by an attacker using memory integrity flaws.
Because 'write-anywhere' exploits are easier than (and the typical first
step to) arbitrary code execution thingies?
> The "write rarely" name itself may not sufficiently describe what is
> wanted either (I'll take the blame for the inaccurate name), so I'm
> open to new ideas there. The implementation requirements for the
> "sensitive data read-only at rest" feature are rather tricky:
>
> - allow writes only from specific places in the kernel
> - keep those locations inline to avoid making them trivial ROP targets
> - keep the writeability window open only to a single uninterruptable CPU
The current patch set does not achieve that because it uses a global
address space for the alias mapping (vmap) which is equally accessible
from all CPUs.
> - fast enough to deal with page table updates
The proposed implementation needs page-tables for the alias; I don't see
how you could ever do R/O page-tables when you need page-tables to
modify your page-tables.
And this is entirely irrespective of performance.
> The proposal I made a while back only covered .data things (and used
> x86-specific features).
Oh, right, that CR0.WP stuff.
> Igor's proposal builds on this by including a
> way to do this with dynamic allocation too, which greatly expands the
> scope of structures that can be protected. Given that the x86-only
> method of write-window creation was firmly rejected, this is a new
> proposal for how to do it (vmap window). Using switch_mm() has also
> been suggested, etc.
Right... /me goes find the patches we did for text_poke. Hmm, those
never seem to have made it:
https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com
like that. That approach will in fact work and not be a completely
broken mess like this thing.
> We need to find a good way to do the write-windowing that works well
> for static and dynamic structures _and_ for the page tables... this
> continues to be tricky.
>
> Making it resilient against ROP-style targets makes it difficult to
> deal with certain data structures (like list manipulation). In my
> earlier RFC, I tried to provide enough examples of where this could
> get used to let people see some of the complexity[1]. Igor's series
> expands this to even more examples using dynamic allocation.
Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
fast enough for much of anything.
And I don't suppose we can take the WP fault and then fix up from there,
because if we're doing R/O page-tables, that'll increase the fault depth
and we'll double fault all the time, and triple fault where we
currently double fault. And we all know how _awesome_ triple faults
are.
But duplicating (and wrapping in gunk) whole APIs is just not going to
work.

*Re: [PATCH 10/17] prmem: documentation
From: Igor Stoppa @ 2018-10-29 21:04 UTC (permalink / raw)
To: Peter Zijlstra, Kees Cook
Cc: Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
Michal Hocko, Kernel Hardening, linux-integrity,
linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML
On 28/10/2018 18:31, Peter Zijlstra wrote:
> On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
>> On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> I still don't really understand the whole write-rare thing; how does it
>>> really help? If we can write in kernel memory, we can write to
>>> page-tables too.
>
>> One aspect of hardening the kernel against attack is reducing the
>> internal attack surface. Not all flaws are created equal, so there is
>> variation in what limitations an attacker may have when exploiting
>> flaws (not many flaws end up being a fully controlled "write anything,
>> anywhere, at any time"). By making the more sensitive data structures
>> of the kernel read-only, we reduce the risk of an attacker finding a
>> path to manipulating the kernel's behavior in a significant way.
>>
>> Examples of typical sensitive targets are function pointers, security
>> policy, and page tables. Having these "read only at rest" makes them
>> much harder to control by an attacker using memory integrity flaws.
>
> Because 'write-anywhere' exploits are easier than (and the typical first
> step to) arbitrary code execution thingies?
>
>> The "write rarely" name itself may not sufficiently describe what is
>> wanted either (I'll take the blame for the inaccurate name), so I'm
>> open to new ideas there. The implementation requirements for the
>> "sensitive data read-only at rest" feature are rather tricky:
>>
>> - allow writes only from specific places in the kernel
>> - keep those locations inline to avoid making them trivial ROP targets
>> - keep the writeability window open only to a single uninterruptable CPU
>
> The current patch set does not achieve that because it uses a global
> address space for the alias mapping (vmap) which is equally accessible
> from all CPUs.
I never claimed to achieve 100% resilience to attacks.
While it's true that the address space is accessible to all the CPUs, it
is also true that the access has a limited duration in time, and the
location where the access can be performed is not fixed.
Iow, assuming that the CPUs not involved in the write-rare operations
are compromised and that they are trying to perform a concurrent access
to the data in the writable page, they have a limited window of opportunity.
That said, what I have posted is just a tentative implementation.
My primary intent was to at least give an idea of what I'd like to do:
alter some data in a way that is not easily exploitable.
>> - fast enough to deal with page table updates
>
> The proposed implementation needs page-tables for the alias; I don't see
> how you could ever do R/O page-tables when you need page-tables to
> modify your page-tables.
It's not all-or-nothing.
I hope we agree at least on the reasoning that having only a limited
amount of address space directly attackable, instead of the whole set of
pages containing exploitable data, reduces the attack surface.
Furthermore, if we think about possible limitations that the attack
might have (maximum reach), the level of protection might be even
higher. I have to use "might" because I cannot foresee the vulnerability.
Furthermore, taking a different angle: your average attacker is not
necessarily very social and inclined to share the vulnerability found.
It is safe to assume that in most cases each attacker has to identify
the attack strategy autonomously.
Reducing the number of individuals who can perform an attack, by
increasing the expertise required, is also a way of doing damage control.
> And this is entirely irrespective of performance.
I have not completely given up on performance, but, being write-rare, I
see improved performance as just a way of widening the range of possible
recipients for the hardening.
>> The proposal I made a while back only covered .data things (and used
>> x86-specific features).
>
> Oh, right, that CR0.WP stuff.
>
>> Igor's proposal builds on this by including a
>> way to do this with dynamic allocation too, which greatly expands the
>> scope of structures that can be protected. Given that the x86-only
>> method of write-window creation was firmly rejected, this is a new
>> proposal for how to do it (vmap window). Using switch_mm() has also
>> been suggested, etc.
>
> Right... /me goes find the patches we did for text_poke. Hmm, those
> never seem to have made it:
>
> https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com
>
> like that. That approach will in fact work and not be a completely
> broken mess like this thing.
That approach is x86 specific.
I preferred to start with something that would work on a broader set of
architectures, even if with less resilience.
It can be done so that individual architectures can implement their own
specific way and obtain better results.
>> We need to find a good way to do the write-windowing that works well
>> for static and dynamic structures _and_ for the page tables... this
>> continues to be tricky.
>>
>> Making it resilient against ROP-style targets makes it difficult to
>> deal with certain data structures (like list manipulation). In my
>> earlier RFC, I tried to provide enough examples of where this could
>> get used to let people see some of the complexity[1]. Igor's series
>> expands this to even more examples using dynamic allocation.
>
> Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
> fast enough for much of anything.
Part of the reason for the duplication is that, even if the wr_ API I
came up with could be collapsed into the regular API, the implementation
needs to be different, to be faster.
Ex:
* force the list_head structure to be aligned so that it will always be
fully contained by a single page
* create alternate mapping for all the pages involved (max 1 page for
each list_head)
* do the list operation on the remapped list_heads
* destroy all the mappings
I can try to introduce some wrapper similar to kmap_atomic(), as
suggested by Dave Hansen, which can improve the coding, but it will not
change the actual set of operations performed.
> And I don't suppose we can take the WP fault and then fix up from there,
> because if we're doing R/O page-tables, that'll increase the fault depth
> and we'll double fault all the time, and triple fault where we
> currently double fault. And we all know how _awesome_ triple faults
> are.
>
> But duplicating (and wrapping in gunk) whole APIs is just not going to
> work.
Would something like kmap_atomic() be acceptable?
Do you have some better proposal, now that (I hope) it should be more
clear what I'm trying to do and why?
--
igor

*Re: [PATCH 10/17] prmem: documentation
From: Peter Zijlstra @ 2018-10-30 15:26 UTC (permalink / raw)
To: Igor Stoppa
Cc: Kees Cook, Mimi Zohar, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Andy Lutomirski, Thomas Gleixner
On Mon, Oct 29, 2018 at 11:04:01PM +0200, Igor Stoppa wrote:
> On 28/10/2018 18:31, Peter Zijlstra wrote:
> > On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
> > > On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > I still don't really understand the whole write-rare thing; how does it
> > > > really help? If we can write in kernel memory, we can write to
> > > > page-tables too.
> >
> > > One aspect of hardening the kernel against attack is reducing the
> > > internal attack surface. Not all flaws are created equal, so there is
> > > variation in what limitations an attacker may have when exploiting
> > > flaws (not many flaws end up being a fully controlled "write anything,
> > > anywhere, at any time"). By making the more sensitive data structures
> > > of the kernel read-only, we reduce the risk of an attacker finding a
> > > path to manipulating the kernel's behavior in a significant way.
> > >
> > > Examples of typical sensitive targets are function pointers, security
> > > policy, and page tables. Having these "read only at rest" makes them
> > > much harder to control by an attacker using memory integrity flaws.
> >
> > Because 'write-anywhere' exploits are easier than (and the typical first
> > step to) arbitrary code execution thingies?
> >
> > > The "write rarely" name itself may not sufficiently describe what is
> > > wanted either (I'll take the blame for the inaccurate name), so I'm
> > > open to new ideas there. The implementation requirements for the
> > > "sensitive data read-only at rest" feature are rather tricky:
> > >
> > > - allow writes only from specific places in the kernel
> > > - keep those locations inline to avoid making them trivial ROP targets
> > > - keep the writeability window open only to a single uninterruptable CPU
> >
> > The current patch set does not achieve that because it uses a global
> > address space for the alias mapping (vmap) which is equally accessible
> > from all CPUs.
>
> I never claimed to achieve 100% resilience to attacks.
You never even began explaining what you were defending against. Let
alone how well it achieved its stated goals.
> While it's true that the address space is accessible to all the CPUs, it is
> also true that the access has a limited duration in time, and the location
> where the access can be performed is not fixed.
>
> Iow, assuming that the CPUs not involved in the write-rare operations are
> compromised and that they are trying to perform a concurrent access to the
> data in the writable page, they have a limited window of opportunity.
That sounds like security through obscurity. Sure it makes it harder,
but it doesn't stop anything.
> That said, what I have posted is just a tentative implementation.
> My primary intent was to at least give an idea of what I'd like to do: alter
> some data in a way that is not easily exploitable.
Since you need to modify page-tables in order to achieve this, the
page-tables are also there for the attacker to write to.
> > > - fast enough to deal with page table updates
> >
> > The proposed implementation needs page-tables for the alias; I don't see
> > how you could ever do R/O page-tables when you need page-tables to
> > modify your page-tables.
>
> It's not all-or-nothing.
I really am still struggling with what it is this thing is supposed to
do. As it stands, I see very little value. Yes it makes a few things a
little harder, but it doesn't really do away with things.
> I hope we agree at least on the reasoning that having only a limited amount
> of address space directly attackable, instead of the whole set of pages
> containing exploitable data, is reducing the attack surface.
I do not in fact agree. Most pages are not interesting for an attack at
all. So simply reducing the set of pages you can write to isn't
sufficient.
IOW. removing all noninteresting pages from the writable set will
satisfy your criteria, but it avoids exactly 0 exploits.
What you need to argue is that we remove common exploit targets, and
you've utterly failed to do so.
Kees mentioned: function pointers, page-tables and a few other things.
You mentioned nothing.
> Furthermore, if we think about possible limitations that the attack might
> have (maximum reach), the level of protection might be even higher. I have
> to use "might" because I cannot foresee the vulnerability.
That's just a bunch of words together that don't really say anything
afaict.
> Furthermore, taking a different angle: your average attacker is not
> necessarily very social and inclined to share the vulnerability found.
> It is safe to assume that in most cases each attacker has to identify the
> attack strategy autonomously.
> Reducing the number of individuals who can perform an attack, by increasing
> the expertise required, is also a way of doing damage control.
If you make RO all noninteresting pages, you in fact increase the
density of interesting targets and make things easier.
> > And this is entirely irrespective of performance.
>
> I have not completely given up on performance, but, being write-rare, I see
> improved performance as just a way of widening the range of possible
> recipients for the hardening.
What?! Are you saying you don't care about performance?
> > Right... /me goes find the patches we did for text_poke. Hmm, those
> > never seem to have made it:
> >
> > https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com
> >
> > like that. That approach will in fact work and not be a completely
> > broken mess like this thing.
>
> That approach is x86 specific.
It is not. Every architecture does switch_mm() with IRQs disabled, as
that is exactly what the scheduler does.
And keeping a second init_mm around also isn't very architecture
specific I feel.
> > > We need to find a good way to do the write-windowing that works well
> > > for static and dynamic structures _and_ for the page tables... this
> > > continues to be tricky.
> > >
> > > Making it resilient against ROP-style targets makes it difficult to
> > > deal with certain data structures (like list manipulation). In my
> > > earlier RFC, I tried to provide enough examples of where this could
> > > get used to let people see some of the complexity[1]. Igor's series
> > > expands this to even more examples using dynamic allocation.
> >
> > Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
> > fast enough for much of anything.
>
> Part of the reason for the duplication is that, even if the wr_API I came up
> with can be collapsed with the regular API, the implementation needs to be
> different in order to be faster.
>
> Ex:
> * force the list_head structure to be aligned so that it will always be
> fully contained by a single page
> * create alternate mapping for all the pages involved (max 1 page for each
> list_head)
> * do the list operation on the remapped list_heads
> * destroy all the mappings
Those constraints are due to single page aliases.
> I can try to introduce some wrapper similar to kmap_atomic(), as suggested
> by Dave Hansen, which can improve the coding, but it will not change the
> actual set of operations performed.
See below, I don't think kmap_atomic() is either correct or workable.
One thing that might be interesting is teaching objtool about the
write-enable write-disable markers and making it guarantee we reach a
disable after every enable. IOW, ensure we never leak this state.
I think that was part of why we hated on that initial thing Kees did --
well, that and the fact that disabling all WP is of course completely insane ;-).
> > And I don't suppose we can take the WP fault and then fix up from there,
> because if we're doing R/O page-tables, that'll increase the fault depth
> and we'll double fault all the time, and triple fault where we
> currently double fault. And we all know how _awesome_ triple faults
> > are.
> >
> > But duplicating (and wrapping in gunk) whole APIs is just not going to
> > work.
>
> Would something like kmap_atomic() be acceptable?
Don't think so. kmap_atomic() on x86_32 (64bit doesn't have it at all)
only does the TLB invalidate on the one CPU, which we've established is
incorrect.
Also, kmap_atomic is still page-table based, which means not all
page-tables can be read-only.
> Do you have some better proposal, now that (I hope) it should be more clear
> what I'm trying to do and why?
You've still not talked about any actual attack vectors and how they're
mitigated by these patches.
I suppose the 'normal' attack goes like:
1) find buffer-overrun / bound check failure
2) use that to write to 'interesting' location
3) that write results in arbitrary code execution
4) win
Of course, if the store of 2 is to the current cred structure, and
simply sets the effective uid to 0, we can skip 3.
Which seems to suggest all cred structures should be made r/o like this.
But I'm not sure I remember these patches doing that.
Also, there is an inverse situation with all this. If you make
everything R/O, then you need this allow-write for everything you do,
which then is about to include a case with an overflow / bound check
fail, and we're back to square 1.
What are you doing to avoid that?

*Re: [PATCH 10/17] prmem: documentation
From: Kees Cook @ 2018-10-30 16:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Igor Stoppa, Mimi Zohar, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Andy Lutomirski, Thomas Gleixner
On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> I suppose the 'normal' attack goes like:
>
> 1) find buffer-overrun / bound check failure
> 2) use that to write to 'interesting' location
> 3) that write results in arbitrary code execution
> 4) win
>
> Of course, if the store of 2 is to the current cred structure, and
> simply sets the effective uid to 0, we can skip 3.
In most cases, yes, gaining root is game over. However, I don't want
to discount other threat models: some systems have been designed not
to trust root, so a cred attack doesn't always get an attacker full
control (e.g. lockdown series, signed modules, encrypted VMs, etc).
> Which seems to suggest all cred structures should be made r/o like this.
> But I'm not sure I remember these patches doing that.
There are things that attempt to protect cred (and other things, like
page tables) via hypervisors (see Samsung KNOX) or via syscall
boundary checking (see Linux Kernel Runtime Guard). They're pretty
interesting, but I'm not sure if there is a clear way forward on it
working in upstream, but that's why I think these discussions are
useful.
> Also, there is an inverse situation with all this. If you make
> everything R/O, then you need this allow-write for everything you do,
> which then is about to include a case with an overflow / bound check
> fail, and we're back to square 1.
Sure -- this is the fine line in trying to build these defenses. The
point is to narrow the scope of attack. Stupid metaphor follows: right
now we have only a couple of walls; if we add walls we can focus on making
sure the doors and windows are safe. If we make the relatively
easy-to-find-in-memory page tables read-only-at-rest then a whole
class of very powerful exploits that depend on page table attacks go
away.
Part of all of this is the observation that there are two types of
things clearly worth protecting: that which is updated rarely (no need
to leave it writable for so much of its lifetime), and that which is
especially sensitive (page tables, security policy, function pointers,
etc). Finding a general purpose way to deal with these (like we have
for other data-lifetime cases like const and __ro_after_init) would be
very nice. I don't think there is a slippery slope here.
-Kees
--
Kees Cook

*Re: [PATCH 10/17] prmem: documentation
From: Andy Lutomirski @ 2018-10-30 17:06 UTC (permalink / raw)
To: Kees Cook
Cc: Peter Zijlstra, Igor Stoppa, Mimi Zohar, Matthew Wilcox,
Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>
>> On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> I suppose the 'normal' attack goes like:
>>
>> 1) find buffer-overrun / bound check failure
>> 2) use that to write to 'interesting' location
>> 3) that write results in arbitrary code execution
>> 4) win
>>
>> Of course, if the store of 2 is to the current cred structure, and
>> simply sets the effective uid to 0, we can skip 3.
>
> In most cases, yes, gaining root is game over. However, I don't want
> to discount other threat models: some systems have been designed not
> to trust root, so a cred attack doesn't always get an attacker full
> control (e.g. lockdown series, signed modules, encrypted VMs, etc).
>
>> Which seems to suggest all cred structures should be made r/o like this.
>> But I'm not sure I remember these patches doing that.
>
> There are things that attempt to protect cred (and other things, like
> page tables) via hypervisors (see Samsung KNOX) or via syscall
> boundary checking (see Linux Kernel Runtime Guard). They're pretty
> interesting, but I'm not sure if there is a clear way forward on it
> working in upstream, but that's why I think these discussions are
> useful.
>
>> Also, there is an inverse situation with all this. If you make
>> everything R/O, then you need this allow-write for everything you do,
>> which then is about to include a case with an overflow / bound check
>> fail, and we're back to square 1.
>
> Sure -- this is the fine line in trying to build these defenses. The
> point is to narrow the scope of attack. Stupid metaphor follows: right
> now we have only a couple of walls; if we add walls we can focus on making
> sure the doors and windows are safe. If we make the relatively
> easy-to-find-in-memory page tables read-only-at-rest then a whole
> class of very powerful exploits that depend on page table attacks go
> away.
>
> Part of all of this is the observation that there are two types of
> things clearly worth protecting: that which is updated rarely (no need
> to leave it writable for so much of its lifetime), and that which is
> especially sensitive (page tables, security policy, function pointers,
> etc). Finding a general purpose way to deal with these (like we have
> for other data-lifetime cases like const and __ro_after_init) would be
> very nice. I don't think there is a slippery slope here.
>
>
Since I wasn’t cc’d on this series:
I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
If anyone wants to use CR0.WP instead, I’ll remind them that they have to fix up the entry code and justify the added complexity. And make performance not suck in a VM (i.e. CR0 reads on entry are probably off the table). None of these will be easy.
If anyone wants to use kmap_atomic-like tricks, I’ll point out that we already have enough problems with dangling TLB entries due to SMP issues. The last thing we need is more of them. If someone proposes a viable solution that doesn’t involve CR3 fiddling, I’ll be surprised.
Keep in mind that switch_mm() is actually decently fast on modern CPUs. It’s probably considerably faster than writing CR0, although I haven’t benchmarked it. It’s certainly faster than writing CR4. It’s also faster than INVPCID, surprisingly, which means that it will be quite hard to get better performance using any sort of trickery.
Nadav’s patch set would be an excellent starting point.
P.S. EFI is sort of grandfathered in as a hackish alternate page table hierarchy. We’re not adding another one.

*Re: [PATCH 10/17] prmem: documentation
From: Matthew Wilcox @ 2018-10-30 17:58 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Kees Cook, Peter Zijlstra, Igor Stoppa, Mimi Zohar, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> I support the addition of a rare-write mechanism to the upstream kernel.
> And I think that there is only one sane way to implement it: using an
> mm_struct. That mm_struct, just like any sane mm_struct, should only
> differ from init_mm in that it has extra mappings in the *user* region.
I'd like to understand this approach a little better. In a syscall path,
we run with the user task's mm. What you're proposing is that when we
want to modify rare data, we switch to rare_mm which contains a
writable mapping to all the kernel data which is rare-write.
So the API might look something like this:
void *p = rare_alloc(...); /* writable pointer */
p->a = x;
q = rare_protect(p); /* read-only pointer */
To subsequently modify q,
p = rare_modify(q);
q->a = y;
rare_protect(p);
Under the covers, rare_modify() would switch to the rare_mm and return
(void *)((unsigned long)q + ARCH_RARE_OFFSET). All of the rare data
would then be modifiable, although you don't have any other pointers
to it. rare_protect() would switch back to the previous mm and return
(p - ARCH_RARE_OFFSET).
Does this satisfy Igor's requirements? We wouldn't be able to
copy_to/from_user() while rare_mm was active. I think that's a feature
though! It certainly satisfies my interests (kernel code being able to
mark things as dynamically-allocated-and-read-only-after-initialisation)

*Re: [PATCH 10/17] prmem: documentation
From: Peter Zijlstra @ 2018-10-31 9:18 UTC (permalink / raw)
To: Dave Hansen
Cc: Matthew Wilcox, Andy Lutomirski, Kees Cook, Igor Stoppa,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, linux-security-module,
Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
On Tue, Oct 30, 2018 at 11:03:51AM -0700, Dave Hansen wrote:
> On 10/30/18 10:58 AM, Matthew Wilcox wrote:
> > Does this satisfy Igor's requirements? We wouldn't be able to
> > copy_to/from_user() while rare_mm was active. I think that's a feature
> > though! It certainly satisfies my interests (kernel code being able to
> > mark things as dynamically-allocated-and-read-only-after-initialisation)
>
> It has to be more than copy_to/from_user(), though, I think.
>
> rare_modify(q) either has to preempt_disable(), or we need to teach the
> context-switching code when and how to switch in/out of the rare_mm.
> preempt_disable() would also keep us from sleeping.
Yes, I think we want to have preempt disabled at the very least. We could
indeed make the context switch code smart and teach it about this state,
but I think allowing preemption in such sections is a bad idea. We want
to keep these sections short and simple (and bounded), such that they
can be analyzed for correctness.
Once you allow preemption, things tend to grow large and complex.
Ideally we'd even disable interrupts over this, to further limit what
code runs in the rare_mm context. NMIs need special care anyway.
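A hedged sketch of the discipline described above, in kernel-style pseudocode (rare_mm and the begin/end names are placeholders from this discussion, not existing interfaces):

```
rare_write_begin():
        preempt_disable()       /* the rare_mm must never leak across a
                                   context switch */
        local_irq_disable()     /* ideally nothing else runs in this mm;
                                   NMIs still need special care */
        switch_mm(current->active_mm, &rare_mm)

rare_write_end():
        switch_mm(&rare_mm, current->active_mm)
        local_irq_enable()
        preempt_enable()
```

Keeping the begin/end pair short and non-preemptible is what makes the section analyzable for correctness, per the argument above.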

*Re: [PATCH 10/17] prmem: documentation
From: Igor Stoppa @ 2018-10-30 20:43 UTC (permalink / raw)
To: Matthew Wilcox, Tycho Andersen
Cc: Andy Lutomirski, Kees Cook, Peter Zijlstra, Mimi Zohar,
Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On 30/10/2018 21:20, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>> And I think that there is only one sane way to implement it: using an
>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>
>>> I'd like to understand this approach a little better. In a syscall path,
>>> we run with the user task's mm. What you're proposing is that when we
>>> want to modify rare data, we switch to rare_mm which contains a
>>> writable mapping to all the kernel data which is rare-write.
>>>
>>> So the API might look something like this:
>>>
>>> void *p = rare_alloc(...); /* writable pointer */
>>> p->a = x;
>>> q = rare_protect(p); /* read-only pointer */
With pools and memory allocated from vmap_areas, I was able to say
protect(pool)
and that would do a sweep over all the pages currently in use.
In the SELinux policyDB, for example, one doesn't really want to
individually protect each allocation.
The loading phase happens usually at boot, when the system can be
assumed to be sane (one might even preload a bare-bone set of rules from
initramfs and then replace it later on, with the full blown set).
There is no need to process each of these tens of thousands of allocations
and initializations as write-rare.
Would it be possible to do the same here?
>>>
>>> To subsequently modify q,
>>>
>>> p = rare_modify(q);
>>> q->a = y;
>>
>> Do you mean
>>
>> p->a = y;
>>
>> here? I assume the intent is that q isn't writable ever, but that's
>> the one we have in the structure at rest.
>
> Yes, that was my intent, thanks.
>
> To handle the list case that Igor has pointed out, you might want to
> do something like this:
>
> list_for_each_entry(x, &xs, entry) {
> struct foo *writable = rare_modify(entry);
Would this mapping be impossible to spoof by other cores?
I'm asking this because, from what I understand, local interrupts are
enabled here, so an attack could freeze the core performing the
write-rare operation, while another scrapes the memory.
But blocking interrupts for the entire body of the loop would make RT
latency unpredictable.
> kref_get(&writable->ref);
> rare_protect(writable);
> }
>
> but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
This seems suspiciously close to the duplication of kernel interfaces
that I was roasted for :-)
--
igor
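For what it's worth, the list_for_each_rare_entry() wrapper floated above might be shaped roughly like this (illustrative pseudocode, not a working macro; rare_modify()/rare_protect() are the names proposed in this thread):

```c
/* pseudocode: open a writable window around one node at a time */
#define list_for_each_rare_entry(pos, head, member)                      \
	for (pos = rare_modify(list_first_entry(head, typeof(*pos),      \
						member));                \
	     &pos->member != (head);                                     \
	     pos = rare_next_entry(pos, member))
/* where rare_next_entry() would rare_protect() the current entry and
 * rare_modify() the next one before returning it */
```

Whether this counts as the kind of API duplication objected to earlier, or as an acceptable thin wrapper, is exactly the open question here.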

*Re: [PATCH 10/17] prmem: documentation
From: Andy Lutomirski @ 2018-10-30 21:02 UTC (permalink / raw)
To: Igor Stoppa
Cc: Matthew Wilcox, Tycho Andersen, Kees Cook, Peter Zijlstra,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, linux-security-module,
Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>>>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>>> And I think that there is only one sane way to implement it: using an
>>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>>
>>>> I'd like to understand this approach a little better. In a syscall path,
>>>> we run with the user task's mm. What you're proposing is that when we
>>>> want to modify rare data, we switch to rare_mm which contains a
>>>> writable mapping to all the kernel data which is rare-write.
>>>>
>>>> So the API might look something like this:
>>>>
>>>> void *p = rare_alloc(...); /* writable pointer */
>>>> p->a = x;
>>>> q = rare_protect(p); /* read-only pointer */
>
> With pools and memory allocated from vmap_areas, I was able to say
>
> protect(pool)
>
> and that would do a sweep over all the pages currently in use.
> In the SELinux policyDB, for example, one doesn't really want to individually protect each allocation.
>
> The loading phase happens usually at boot, when the system can be assumed to be sane (one might even preload a bare-bone set of rules from initramfs and then replace it later on, with the full blown set).
>
> There is no need to process each of these tens of thousands of allocations and initializations as write-rare.
>
> Would it be possible to do the same here?
I don’t see why not, although getting the API right will be a tad complicated.
>
>>>>
>>>> To subsequently modify q,
>>>>
>>>> p = rare_modify(q);
>>>> q->a = y;
>>>
>>> Do you mean
>>>
>>> p->a = y;
>>>
>>> here? I assume the intent is that q isn't writable ever, but that's
>>> the one we have in the structure at rest.
>> Yes, that was my intent, thanks.
>> To handle the list case that Igor has pointed out, you might want to
>> do something like this:
>> list_for_each_entry(x, &xs, entry) {
>> struct foo *writable = rare_modify(entry);
>
> Would this mapping be impossible to spoof by other cores?
>
Indeed. Only the core with the special mm loaded could see it.
But I dislike allowing regular writes in the protected region. We really only need four write primitives:
1. Just write one value. Call at any time (except NMI).
2. Just copy some bytes. Same as (1) but any number of bytes.
3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.
Actually getting a modifiable pointer should be disallowed for two reasons:
1. Some architectures may want to use a special write-different-address-space operation. Heck, x86 could, too: make the actual offset be a secret and shove the offset into FSBASE or similar. Then %fs-prefixed writes would do the rare writes.
2. Alternatively, x86 could set the U bit. Then the actual writes would use the uaccess helpers, giving extra protection via SMAP.
We don’t really want a situation where an unchecked pointer in the rare write region completely defeats the mechanism.
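As an interface, the four primitives could be declared roughly as follows (a sketch with invented names; nothing here is an existing kernel API):

```c
/* 1, 2: standalone primitives, callable at any time except NMI;
 * each performs its own switch to the rare mm and back */
void rare_write_ulong(unsigned long *dst, unsigned long val);
void rare_memcpy(void *dst, const void *src, size_t len);

/* 3, 4: batched variants, valid only between begin/end; purely an
 * optimization that amortizes the mm switch over several writes */
void rare_write_begin(void);
void __rare_write_ulong(unsigned long *dst, unsigned long val);
void __rare_memcpy(void *dst, const void *src, size_t len);
void rare_write_end(void);
```

Note that none of these hand a writable pointer back to the caller, which is the point of the two numbered objections above.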

*Re: [PATCH 10/17] prmem: documentation
From: Kees Cook @ 2018-10-30 21:07 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Igor Stoppa, Matthew Wilcox, Tycho Andersen, Peter Zijlstra,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, linux-security-module,
Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
On Tue, Oct 30, 2018 at 2:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>
>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>>>>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>>>> And I think that there is only one sane way to implement it: using an
>>>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>>>
>>>>> I'd like to understand this approach a little better. In a syscall path,
>>>>> we run with the user task's mm. What you're proposing is that when we
>>>>> want to modify rare data, we switch to rare_mm which contains a
>>>>> writable mapping to all the kernel data which is rare-write.
>>>>>
>>>>> So the API might look something like this:
>>>>>
>>>>> void *p = rare_alloc(...); /* writable pointer */
>>>>> p->a = x;
>>>>> q = rare_protect(p); /* read-only pointer */
>>
>> With pools and memory allocated from vmap_areas, I was able to say
>>
>> protect(pool)
>>
>> and that would do a sweep over all the pages currently in use.
>> In the SELinux policyDB, for example, one doesn't really want to individually protect each allocation.
>>
>> The loading phase happens usually at boot, when the system can be assumed to be sane (one might even preload a bare-bone set of rules from initramfs and then replace it later on, with the full blown set).
>>
>> There is no need to process each of these tens of thousands of allocations and initializations as write-rare.
>>
>> Would it be possible to do the same here?
>
> I don’t see why not, although getting the API right will be a tad complicated.
>
>>
>>>>>
>>>>> To subsequently modify q,
>>>>>
>>>>> p = rare_modify(q);
>>>>> q->a = y;
>>>>
>>>> Do you mean
>>>>
>>>> p->a = y;
>>>>
>>>> here? I assume the intent is that q isn't writable ever, but that's
>>>> the one we have in the structure at rest.
>>> Yes, that was my intent, thanks.
>>> To handle the list case that Igor has pointed out, you might want to
>>> do something like this:
>>> list_for_each_entry(x, &xs, entry) {
>>> struct foo *writable = rare_modify(entry);
>>
>> Would this mapping be impossible to spoof by other cores?
>>
>
> Indeed. Only the core with the special mm loaded could see it.
>
> But I dislike allowing regular writes in the protected region. We really only need four write primitives:
>
> 1. Just write one value. Call at any time (except NMI).
>
> 2. Just copy some bytes. Same as (1) but any number of bytes.
>
> 3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.
>
> Actually getting a modifiable pointer should be disallowed for two reasons:
>
> 1. Some architectures may want to use a special write-different-address-space operation. Heck, x86 could, too: make the actual offset be a secret and shove the offset into FSBASE or similar. Then %fs-prefixed writes would do the rare writes.
>
> 2. Alternatively, x86 could set the U bit. Then the actual writes would use the uaccess helpers, giving extra protection via SMAP.
>
> We don’t really want a situation where an unchecked pointer in the rare write region completely defeats the mechanism.
We still have to deal with certain structures under the write-rare
window. For example, see:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=60430b4d3b113aae4adab66f8339074986276474
They are wrappers to non-inline functions that have the same sanity-checking.
--
Kees Cook
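Taken together, the four primitives Andy lists could be sketched as below. This is a userspace model with made-up names (rare_write_ulong(), rare_region_begin(), ...): the real kernel versions would write through a temporary alternate mapping rather than directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model of the four primitives; plain writes stand in for
 * writes through the temporary alternate mapping. All names are
 * hypothetical. */

/* 1. Write one value; callable at any time (except NMI). */
static void rare_write_ulong(unsigned long *target, unsigned long val)
{
	*target = val;
}

/* 2. Copy some bytes; same constraints as (1). */
static void rare_memcpy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/* 3, 4. Batched variants, valid only inside an open rare-write region;
 * purely an optimization that amortizes the map/unmap cost. */
struct rare_region { int open; };

static void rare_region_begin(struct rare_region *r) { r->open = 1; }
static void rare_region_end(struct rare_region *r)   { r->open = 0; }

static void rare_write_ulong_batched(struct rare_region *r,
				     unsigned long *target, unsigned long val)
{
	assert(r->open);	/* must be inside the region */
	*target = val;
}
```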

*Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:07 ` Kees Cook
@ 2018-10-30 21:25 ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-30 21:25 UTC (permalink / raw)
To: Kees Cook, Andy Lutomirski
Cc: Matthew Wilcox, Tycho Andersen, Peter Zijlstra, Mimi Zohar,
Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On 30/10/2018 23:07, Kees Cook wrote:
> We still have to deal with certain structures under the write-rare
> window. For example, see:
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=60430b4d3b113aae4adab66f8339074986276474
> They are wrappers to non-inline functions that have the same sanity-checking.
Even though I also did something similar, it was not intended to be final,
just a stop-gap solution until the write-rare mechanism is settled.
If the size of the whole list_head is used as alignment, then the whole
list_head structure can be given an alternate mapping and the plain list
functions can be used on this alternate mapping.
That can cut the overhead in half, or more.
The typical case is when not just one list_head but both are contained
in the same page, as when allocating and queuing multiple items from the
same pool.
A single temporary mapping would then be enough.
But it becomes tricky to do without generating code that is
almost-but-not-quite identical to the plain list helpers.
--
igor
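A sketch of the alignment idea above, as a userspace model: aligning each protected list_head to its own size keeps it inside one page, so a single writable alias suffices and the unmodified list primitive can run on it. wr_alias() is hypothetical and identity-mapped here.

```c
#include <assert.h>
#include <stddef.h>

/* Align each protected list_head to its own size (two pointers), so it
 * never straddles a page boundary and one alternate mapping covers it. */
struct list_head {
	struct list_head *next, *prev;
} __attribute__((aligned(2 * sizeof(void *))));

/* Model: the writable alias is the object itself; the kernel version
 * would return an address in the temporary alternate mapping. This only
 * stays simple when every node touched lives in mapped pages. */
static struct list_head *wr_alias(struct list_head *ro)
{
	return ro;
}

/* The plain, unmodified list primitive. */
static void plain_list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

/* One temporary mapping, then the plain function on the alias. */
static void wr_list_add(struct list_head *new, struct list_head *head)
{
	plain_list_add(wr_alias(new), wr_alias(head));
}
```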

*Re: [PATCH 10/17] prmem: documentation
2018-10-30 21:02 ` Andy Lutomirski
  2018-10-30 21:07 ` Kees Cook
@ 2018-10-30 22:15 ` Igor Stoppa
2018-10-31 10:11 ` Peter Zijlstra
  2018-10-31 9:45 ` Peter Zijlstra
  2 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-30 22:15 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Matthew Wilcox, Tycho Andersen, Kees Cook, Peter Zijlstra,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, linux-security-module,
Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
On 30/10/2018 23:02, Andy Lutomirski wrote:
>
>
>> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>> There is no need to process each of these tens of thousands allocations and initialization as write-rare.
>>
>> Would it be possible to do the same here?
>
> I don’t see why not, although getting the API right will be a tad complicated.
yes, I have some first-hand experience with this :-/
>>
>>>>>
>>>>> To subsequently modify q,
>>>>>
>>>>> p = rare_modify(q);
>>>>> q->a = y;
>>>>
>>>> Do you mean
>>>>
>>>> p->a = y;
>>>>
>>>> here? I assume the intent is that q isn't writable ever, but that's
>>>> the one we have in the structure at rest.
>>> Yes, that was my intent, thanks.
>>> To handle the list case that Igor has pointed out, you might want to
>>> do something like this:
>>> list_for_each_entry(x, &xs, entry) {
>>> struct foo *writable = rare_modify(entry);
>>
>> Would this mapping be impossible to spoof by other cores?
>>
>
> Indeed. Only the core with the special mm loaded could see it.
>
> But I dislike allowing regular writes in the protected region. We really only need four write primitives:
>
> 1. Just write one value. Call at any time (except NMI).
>
> 2. Just copy some bytes. Same as (1) but any number of bytes.
>
> 3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.
Atomic? RCU?
Yes, they are technically just memory writes, but shouldn't the "normal"
operation be executed on the writable mapping?
It is possible to sandwich any call between a rare_modify/rare_protect
pair, but isn't that pretty close to having a write-rare version of
these plain functions?
--
igor
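The sandwich in question might look like the sketch below (userspace model; rare_modify()/rare_protect() are the hypothetical mapping switch, modeled as no-ops). The resulting wrapper is, as noted, a write-rare version of the plain atomic in all but name.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical mapping switch, modeled here as identity / no-op. */
static atomic_long *rare_modify(atomic_long *ro)	{ return ro; }
static void rare_protect(atomic_long *ro)		{ (void)ro; }

/* Sandwiching the plain atomic between rare_modify()/rare_protect()
 * yields exactly a write-rare twin of atomic_fetch_add(). */
static long wr_atomic_inc_return(atomic_long *ro)
{
	atomic_long *w = rare_modify(ro);
	long v = atomic_fetch_add(w, 1) + 1;

	rare_protect(ro);
	return v;
}
```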

*Re: [PATCH 10/17] prmem: documentation
  2018-10-31 10:11 ` Peter Zijlstra
@ 2018-10-31 20:38 ` Andy Lutomirski
  2018-10-31 20:53 ` Andy Lutomirski
  0 siblings, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 20:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Igor Stoppa, Matthew Wilcox, Tycho Andersen, Kees Cook,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, linux-security-module,
Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
> On Oct 31, 2018, at 3:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
>> On 30/10/2018 23:02, Andy Lutomirski wrote:
>
>>> But I dislike allowing regular writes in the protected region. We
>>> really only need four write primitives:
>>>
>>> 1. Just write one value. Call at any time (except NMI).
>>>
>>> 2. Just copy some bytes. Same as (1) but any number of bytes.
>>>
>>> 3,4: Same as 1 and 2 but must be called inside a special rare write
>>> region. This is purely an optimization.
>>
>> Atomic? RCU?
>
> RCU can be done, that's not really a problem. Atomics otoh are a
> problem. Having pointers makes them just work.
>
> Andy; I understand your reason for not wanting them, but I really don't
> want to duplicate everything. Is there something we can do with static
> analysis to make you more comfortable with the pointer thing?
I’m sure we could do something with static analysis, but I think seeing a real use case where all this fanciness makes sense would be good.
And I don’t know if s390 *can* have an efficient implementation that uses pointers. OTOH they have all kinds of magic stuff, so who knows?

*Re: [PATCH 10/17] prmem: documentation
  2018-10-31 20:38 ` Andy Lutomirski
@ 2018-10-31 20:53 ` Andy Lutomirski
  0 siblings, 0 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 20:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Igor Stoppa, Matthew Wilcox, Tycho Andersen, Kees Cook,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, linux-security-module,
Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
> On Oct 31, 2018, at 1:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
>>> On Oct 31, 2018, at 3:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
>>> On 30/10/2018 23:02, Andy Lutomirski wrote:
>>
>>>> But I dislike allowing regular writes in the protected region. We
>>>> really only need four write primitives:
>>>>
>>>> 1. Just write one value. Call at any time (except NMI).
>>>>
>>>> 2. Just copy some bytes. Same as (1) but any number of bytes.
>>>>
>>>> 3,4: Same as 1 and 2 but must be called inside a special rare write
>>>> region. This is purely an optimization.
>>>
>>> Atomic? RCU?
>>
>> RCU can be done, that's not really a problem. Atomics otoh are a
>> problem. Having pointers makes them just work.
>>
>> Andy; I understand your reason for not wanting them, but I really don't
>> want to duplicate everything. Is there something we can do with static
>> analysis to make you more comfortable with the pointer thing?
>
> I’m sure we could do something with static analysis, but I think seeing a real use case where all this fanciness makes sense would be good.
>
> And I don’t know if s390 *can* have an efficient implementation that uses pointers. OTOH they have all kinds of magic stuff, so who knows?
Also, if we’re using a hypervisor, then there are a couple ways it could be done:
1. VMFUNC. Pointers work fine. This is stronger than any amount of CR3 trickery because it can’t be defeated by page table attacks.
2. A hypercall to do the write. No pointers.
Basically, I think that if we can get away without writable pointers, we get more flexibility and less need for fancy static analysis. If we do need pointers, then so be it.
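The no-pointer shape could be sketched like this: callers never see a writable address, only an opaque region plus an offset, and the single entry point validates every write (in the hypercall variant its body would be a hypercall instead). Names and the bounds policy are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct wr_region {
	unsigned char *base;	/* hidden from callers in a real design */
	size_t size;
};

/* The only write entry point: bounds-checked, so no writable pointer
 * ever escapes to the caller. In the hypercall variant, the memcpy
 * below would be performed by the hypervisor instead. */
static int wr_write(struct wr_region *r, size_t off,
		    const void *src, size_t len)
{
	if (off > r->size || len > r->size - off)
		return -1;		/* reject out-of-region writes */
	memcpy(r->base + off, src, len);
	return 0;
}
```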

*Re: [PATCH 10/17] prmem: documentation
2018-10-30 20:43 ` Igor Stoppa
  2018-10-30 21:02 ` Andy Lutomirski
@ 2018-10-30 21:35 ` Matthew Wilcox
2018-10-30 21:49 ` Igor Stoppa
  2018-10-31 4:41 ` Andy Lutomirski
  1 sibling, 2 replies; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-30 21:35 UTC (permalink / raw)
To: Igor Stoppa
Cc: Tycho Andersen, Andy Lutomirski, Kees Cook, Peter Zijlstra,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, linux-security-module,
Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
> On 30/10/2018 21:20, Matthew Wilcox wrote:
> > > > So the API might look something like this:
> > > >
> > > > void *p = rare_alloc(...); /* writable pointer */
> > > > p->a = x;
> > > > q = rare_protect(p); /* read-only pointer */
>
> With pools and memory allocated from vmap_areas, I was able to say
>
> protect(pool)
>
> and that would do a swipe on all the pages currently in use.
> In the SELinux policyDB, for example, one doesn't really want to
> individually protect each allocation.
>
> The loading phase happens usually at boot, when the system can be assumed to
> be sane (one might even preload a bare-bone set of rules from initramfs and
> then replace it later on, with the full blown set).
>
> There is no need to process each of these tens of thousands allocations and
> initialization as write-rare.
>
> Would it be possible to do the same here?
What Andy is proposing effectively puts all rare allocations into
one pool. Although I suppose it could be generalised to multiple pools
... one mm_struct per pool. Andy, what do you think to doing that?
> > but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
>
> This seems suspiciously close to the duplication of kernel interfaces that I
> was roasted for :-)
Can you not see the difference between adding one syntactic sugar function
and duplicating the entire infrastructure?

*Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:35 ` Matthew Wilcox
@ 2018-10-30 21:49 ` Igor Stoppa
  2018-10-31 4:41 ` Andy Lutomirski
  1 sibling, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-30 21:49 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Tycho Andersen, Andy Lutomirski, Kees Cook, Peter Zijlstra,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, linux-security-module,
Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
On 30/10/2018 23:35, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>> Would it be possible to do the same here?
>
> What Andy is proposing effectively puts all rare allocations into
> one pool. Although I suppose it could be generalised to multiple pools
> ... one mm_struct per pool. Andy, what do you think to doing that?
The reason to have pools is that, continuing the SELinux example, it
supports reloading the policyDB.
In this case it seems to me that it would be faster to drop the entire
pool in one go, and create a new one when re-initializing the rules.
Or maybe the pool could be flushed, without destroying the metadata.
One more reason for having pools is to assign certain properties to the
pool and then rely on them being applied to every subsequent allocation.
I've also been wondering if pools can be expected to have some well
defined property.
One might be that they do not need to be created on the fly.
The number of pools should be known at compilation time.
At least the meta-data could be static, but I do not know how this would
work with modules.
>>> but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
>>
>> This seems suspiciously close to the duplication of kernel interfaces that I
>> was roasted for :-)
>
> Can you not see the difference between adding one syntactic sugar function
> and duplicating the entire infrastructure?
And list_add()? Or any of the other list-related functions?
Don't they end up receiving similar treatment?
I might have misunderstood your proposal, but it seems to me that they
too will need the same type of rare_modify/rare_protect pairs. No?
And same for hlist, including the _rcu variants.
--
igor
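One way list_add() and its relatives could receive "similar treatment" without hand-writing each almost-identical twin is to generate the modify/protect sandwich with a single macro, as in this sketch (rare_modify()/rare_protect() are hypothetical, modeled as no-ops):

```c
#include <assert.h>

/* Hypothetical mapping switch, modeled here as identity / no-op. */
static void *rare_modify(void *ro)	{ return ro; }
static void rare_protect(void *ro)	{ (void)ro; }

/* Wrap any plain helper in the rare_modify()/rare_protect() pair: one
 * macro instead of one handwritten wr_*() twin per list function. */
#define WR_SANDWICH(fn, target, ...)			\
	do {						\
		void *__w = rare_modify(target);	\
		fn(__w, __VA_ARGS__);			\
		rare_protect(target);			\
	} while (0)

/* Example plain helper standing in for list_add() and friends. */
static void set_long(void *p, long v)
{
	*(long *)p = v;
}
```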

*Re: [PATCH 10/17] prmem: documentation
2018-10-30 21:35 ` Matthew Wilcox
  2018-10-30 21:49 ` Igor Stoppa
@ 2018-10-31 4:41 ` Andy Lutomirski
2018-10-31 9:08 ` Igor Stoppa
  2018-10-31 10:02 ` Peter Zijlstra
  1 sibling, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 4:41 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Igor Stoppa, Tycho Andersen, Kees Cook, Peter Zijlstra,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
> > On 30/10/2018 21:20, Matthew Wilcox wrote:
> > > > > So the API might look something like this:
> > > > >
> > > > > void *p = rare_alloc(...); /* writable pointer */
> > > > > p->a = x;
> > > > > q = rare_protect(p); /* read-only pointer */
> >
> > With pools and memory allocated from vmap_areas, I was able to say
> >
> > protect(pool)
> >
> > and that would do a swipe on all the pages currently in use.
> > In the SELinux policyDB, for example, one doesn't really want to
> > individually protect each allocation.
> >
> > The loading phase happens usually at boot, when the system can be assumed to
> > be sane (one might even preload a bare-bone set of rules from initramfs and
> > then replace it later on, with the full blown set).
> >
> > There is no need to process each of these tens of thousands allocations and
> > initialization as write-rare.
> >
> > Would it be possible to do the same here?
>
> What Andy is proposing effectively puts all rare allocations into
> one pool. Although I suppose it could be generalised to multiple pools
> ... one mm_struct per pool. Andy, what do you think to doing that?
Hmm. Let's see.
To clarify some of this thread, I think that the fact that rare_write
uses an mm_struct and alias mappings under the hood should be
completely invisible to users of the API. No one should ever be
handed a writable pointer to rare_write memory (except perhaps during
bootup or when initializing a large complex data structure that will
be rare_write but isn't yet, e.g. the policy db).
For example, there could easily be architectures where having a
writable alias is problematic. On such architectures, an entirely
different mechanism might work better. And, if a tool like KNOX ever
becomes a *part* of the Linux kernel (hint hint!)
If you have multiple pools and one mm_struct per pool, you'll need a
way to find the mm_struct from a given allocation. Regardless of how
the mm_structs are set up, changing rare_write memory to normal memory
or vice versa will require a global TLB flush (all ASIDs and global
pages) on all CPUs, so having extra mm_structs doesn't seem to buy
much.
(It's just possible that changing rare_write back to normal might be
able to avoid the flush if the spurious faults can be handled
reliably.)

*Re: [PATCH 10/17] prmem: documentation
  2018-10-31 4:41 ` Andy Lutomirski
@ 2018-10-31 9:08 ` Igor Stoppa
2018-10-31 19:38 ` Igor Stoppa
  2018-10-31 10:02 ` Peter Zijlstra
  1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-31 9:08 UTC (permalink / raw)
To: Andy Lutomirski, Matthew Wilcox, Paul Moore, Stephen Smalley
Cc: Tycho Andersen, Kees Cook, Peter Zijlstra, Mimi Zohar,
Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
linux-integrity, LSM List, Igor Stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner, selinux
Adding SELinux folks and the SELinux ml
I think it's better if they participate in this discussion.
On 31/10/2018 06:41, Andy Lutomirski wrote:
> On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>>>> So the API might look something like this:
>>>>>>
>>>>>> void *p = rare_alloc(...); /* writable pointer */
>>>>>> p->a = x;
>>>>>> q = rare_protect(p); /* read-only pointer */
>>>
>>> With pools and memory allocated from vmap_areas, I was able to say
>>>
>>> protect(pool)
>>>
>>> and that would do a swipe on all the pages currently in use.
>>> In the SELinux policyDB, for example, one doesn't really want to
>>> individually protect each allocation.
>>>
>>> The loading phase happens usually at boot, when the system can be assumed to
>>> be sane (one might even preload a bare-bone set of rules from initramfs and
>>> then replace it later on, with the full blown set).
>>>
>>> There is no need to process each of these tens of thousands allocations and
>>> initialization as write-rare.
>>>
>>> Would it be possible to do the same here?
>>
>> What Andy is proposing effectively puts all rare allocations into
>> one pool. Although I suppose it could be generalised to multiple pools
>> ... one mm_struct per pool. Andy, what do you think to doing that?
>
> Hmm. Let's see.
>
> To clarify some of this thread, I think that the fact that rare_write
> uses an mm_struct and alias mappings under the hood should be
> completely invisible to users of the API.
I agree.
> No one should ever be
> handed a writable pointer to rare_write memory (except perhaps during
> bootup or when initializing a large complex data structure that will
> be rare_write but isn't yet, e.g. the policy db).
The policy db doesn't need to be write rare.
Actually, it really shouldn't be write rare.
Maybe it's just a matter of wording, but effectively the policyDB can be
treated with this sequence:
1) allocate various data structures in writable form
2) initialize them
3) go back to 1 as needed
4) lock down everything that has been allocated, as Read-Only
The reason why I stress ReadOnly is that differentiating what is really
ReadOnly from what is WriteRare provides an extra edge against attacks,
because attempts to alter ReadOnly data through a WriteRare API could be
detected
5) read any part of the policyDB during regular operations
6) in case of update, create a temporary new version, using steps 1..3
7) if update successful, use the new one and destroy the old one
8) if the update failed, destroy the new one
The destruction at points 7 and 8 is not so much a write operation, as
it is a release of the memory.
So, we might have a bit different interpretation of what write-rare
means wrt destroying the memory and its content.
To clarify: I've been using write-rare to indicate primarily small
operations that one would otherwise achieve with "=", memcpy or memset
or more complex variants, like atomic ops, rcu pointer assignment, etc.
Tearing down an entire set of allocations like the policyDB doesn't fit
very well with that model.
The only part which _needs_ to be write rare, in the policyDB, is the
set of pointers which are used to access all the dynamically allocated
data set.
These pointers must be updated when a new policyDB is allocated.
> For example, there could easily be architectures where having a
> writable alias is problematic. On such architectures, an entirely
> different mechanism might work better. And, if a tool like KNOX ever
> becomes a *part* of the Linux kernel (hint hint!)
Something related, albeit not identical is going on here [1]
Eventually, it could be expanded to deal also with write rare.
> If you have multiple pools and one mm_struct per pool, you'll need a
> way to find the mm_struct from a given allocation.
Indeed. In my patchset, based on vmas, I do the following:
* a private field from the page struct points to the vma using that page
* inside the vma there is a list_head used only during deletion
- one pointer is used to chain vmas from the same pool
- one pointer points to the pool struct
* the pool struct has the property to use for all the associated
allocations: is it write-rare, read-only, does it auto protect, etc.
> Regardless of how
> the mm_structs are set up, changing rare_write memory to normal memory
> or vice versa will require a global TLB flush (all ASIDs and global
> pages) on all CPUs, so having extra mm_structs doesn't seem to buy
> much.
1) it supports different levels of protection:
temporarily unprotected vs read-only vs write-rare
2) the change of write permission should be possible only toward more
restrictive rules (writable -> write-rare -> read-only) and only to the
point that was specified while creating the pool, to avoid DOS attacks,
where a write-rare is flipped into read-only and further updates fail
(ex: prevent IMA from registering modifications to a file, by not
letting it store new information - I'm not 100% sure this would work,
but it gives the idea, I think)
3) being able to track all the allocations related to a pool would make
it possible to perform mass operations, like reducing the writability or
destroying all the allocations.
> (It's just possible that changing rare_write back to normal might be
> able to avoid the flush if the spurious faults can be handled
> reliably.)
I do not see the need for such a case of degrading the write permissions
of an allocation, unless it refers to the release of a pool of
allocations (see updating the SELinux policy DB)
[1] https://www.openwall.com/lists/kernel-hardening/2018/10/26/11
--
igor
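The pool metadata described in points 1-3 might be sketched as follows: the protection level can only be tightened, and never past a ceiling fixed at creation, which is the guard against the DoS scenario of flipping a write-rare pool to read-only. All names and fields are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

enum wr_prot {
	WR_WRITABLE,	/* temporarily unprotected */
	WR_RARE,	/* write-rare              */
	WR_RO,		/* read-only               */
};

struct wr_pool {
	enum wr_prot cur;	/* applies to every allocation in the pool */
	enum wr_prot max;	/* ceiling chosen when the pool is created */
	bool auto_protect;	/* protect each allocation right away?     */
	/* + chained vmas, etc. */
};

/* Permissions may only ratchet toward more restrictive states, and only
 * up to the ceiling declared at creation time. */
static int wr_pool_set_prot(struct wr_pool *p, enum wr_prot new)
{
	if (new < p->cur || new > p->max)
		return -1;
	p->cur = new;
	return 0;
}
```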

*Re: [PATCH 10/17] prmem: documentation
  2018-10-31 9:08 ` Igor Stoppa
@ 2018-10-31 19:38 ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-31 19:38 UTC (permalink / raw)
To: Andy Lutomirski, Matthew Wilcox, Paul Moore, Stephen Smalley
Cc: Tycho Andersen, Kees Cook, Peter Zijlstra, Mimi Zohar,
Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
linux-integrity, LSM List, Igor Stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner, selinux
On 31/10/2018 11:08, Igor Stoppa wrote:
> Adding SELinux folks and the SELinux ml
>
> I think it's better if they participate in this discussion.
>
> On 31/10/2018 06:41, Andy Lutomirski wrote:
>> On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org>
>> wrote:
>>>
>>> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>>>>> So the API might look something like this:
>>>>>>>
>>>>>>> void *p = rare_alloc(...); /* writable pointer */
>>>>>>> p->a = x;
>>>>>>> q = rare_protect(p); /* read-only pointer */
>>>>
>>>> With pools and memory allocated from vmap_areas, I was able to say
>>>>
>>>> protect(pool)
>>>>
>>>> and that would do a swipe on all the pages currently in use.
>>>> In the SELinux policyDB, for example, one doesn't really want to
>>>> individually protect each allocation.
>>>>
>>>> The loading phase happens usually at boot, when the system can be
>>>> assumed to
>>>> be sane (one might even preload a bare-bone set of rules from
>>>> initramfs and
>>>> then replace it later on, with the full blown set).
>>>>
>>>> There is no need to process each of these tens of thousands
>>>> allocations and
>>>> initialization as write-rare.
>>>>
>>>> Would it be possible to do the same here?
>>>
>>> What Andy is proposing effectively puts all rare allocations into
>>> one pool. Although I suppose it could be generalised to multiple pools
>>> ... one mm_struct per pool. Andy, what do you think to doing that?
>>
>> Hmm. Let's see.
>>
>> To clarify some of this thread, I think that the fact that rare_write
>> uses an mm_struct and alias mappings under the hood should be
>> completely invisible to users of the API.
>
> I agree.
>
>> No one should ever be
>> handed a writable pointer to rare_write memory (except perhaps during
>> bootup or when initializing a large complex data structure that will
>> be rare_write but isn't yet, e.g. the policy db).
>
> The policy db doesn't need to be write rare.
> Actually, it really shouldn't be write rare.
>
> Maybe it's just a matter of wording, but effectively the policyDB can be
> treated with this sequence:
>
> 1) allocate various data structures in writable form
>
> 2) initialize them
>
> 3) go back to 1 as needed
>
> 4) lock down everything that has been allocated, as Read-Only
> The reason why I stress ReadOnly is that differentiating what is really
> ReadOnly from what is WriteRare provides an extra edge against attacks,
> because attempts to alter ReadOnly data through a WriteRare API could be
> detected
>
> 5) read any part of the policyDB during regular operations
>
> 6) in case of update, create a temporary new version, using steps 1..3
>
> 7) if update successful, use the new one and destroy the old one
>
> 8) if the update failed, destroy the new one
>
> The destruction at points 7 and 8 is not so much a write operation, as
> it is a release of the memory.
>
> So, we might have a bit different interpretation of what write-rare
> means wrt destroying the memory and its content.
>
> To clarify: I've been using write-rare to indicate primarily small
> operations that one would otherwise achieve with "=", memcpy or memset
> or more complex variants, like atomic ops, rcu pointer assignment, etc.
>
> Tearing down an entire set of allocations like the policyDB doesn't fit
> very well with that model.
>
> The only part which _needs_ to be write rare, in the policyDB, is the
> set of pointers which are used to access all the dynamically allocated
> data set.
>
> These pointers must be updated when a new policyDB is allocated.
>
>> For example, there could easily be architectures where having a
>> writable alias is problematic. On such architectures, an entirely
>> different mechanism might work better. And, if a tool like KNOX ever
>> becomes a *part* of the Linux kernel (hint hint!)
>
> Something related, albeit not identical is going on here [1]
> Eventually, it could be expanded to deal also with write rare.
>
>> If you have multiple pools and one mm_struct per pool, you'll need a
>> way to find the mm_struct from a given allocation.
>
> Indeed. In my patchset, based on vmas, I do the following:
> * a private field from the page struct points to the vma using that page
> * inside the vma there is a list_head used only during deletion
> - one pointer is used to chain vmas from the same pool
> - one pointer points to the pool struct
> * the pool struct has the property to use for all the associated
> allocations: is it write-rare, read-only, does it auto protect, etc.
>
>> Regardless of how
>> the mm_structs are set up, changing rare_write memory to normal memory
>> or vice versa will require a global TLB flush (all ASIDs and global
>> pages) on all CPUs, so having extra mm_structs doesn't seem to buy
>> much.
>
> 1) it supports different levels of protection:
> temporarily unprotected vs read-only vs write-rare
>
> 2) the change of write permission should be possible only toward more
> restrictive rules (writable -> write-rare -> read-only) and only to the
> point that was specified while creating the pool, to avoid DOS attacks,
> where a write-rare is flipped into read-only and further updates fail
> (ex: prevent IMA from registering modifications to a file, by not
> letting it store new information - I'm not 100% sure this would work,
> but it gives the idea, I think)
>
> 3) being able to track all the allocations related to a pool would make
> it possible to perform mass operations, like reducing the writability or
> destroying all the allocations.
>
>> (It's just possible that changing rare_write back to normal might be
>> able to avoid the flush if the spurious faults can be handled
>> reliably.)
>
> I do not see the need for such a case of degrading the write permissions
> of an allocation, unless it refers to the release of a pool of
> allocations (see updating the SELinux policy DB)
A few more thoughts about pools. Not sure if they are all correct.
Note: I stick to "pool", instead of mm_struct, because what I'll say is
mostly independent from the implementation.
- As Peter Zijlstra wrote earlier, protecting a target moves the focus
of the attack to something else. In this case, probably, the "something
else" would be the metadata of the pool(s).
- The amount of pools needed should be known at compilation time, so the
metadata used for pools could be statically allocated and any change to
it could be treated as write-rare.
- only certain fields of a pool structure would be writable, even as
write-rare, after the pool is initialized.
In case the pool is a mm_struct or a superset (containing also
additional properties, like the type of writability: RO or WR), the field
struct vm_area_struct *mmap;
is an example of what could be protected. It should be alterable only
when creating/destroying the pool and making the first initialization.
- to speed up and also improve the validation of the target of a
write-rare operation, it would be really desirable if the target had
some intrinsic property which clearly differentiates it from
non-write-rare memory. Its address, for example. The amount of write
rare data needed by a full kernel should not exceed a few tens of
megabytes. On a 64 bit system it shouldn't be so bad to reserve an
address range maybe one order of magnitude larger than that.
It could even become a parameter for the creation of a pool.
SELinux, for example, should fit within 10-20MB. Or it could be a
command line parameter.
- even if a hypervisor were present, it would be preferable to use it
exclusively as extra protection, which triggers an exception only when
something abnormal happens. The hypervisor should not become aware of
the actual meaning of kernel (meta)data. Ideally, it would be mostly
used for trapping unexpected writes to pages which are not supposed to
be modified.
- one more reason for using pools is that, if each pool also acted as a
memory cache for its users, attacks relying on use-after-free would not
gain access to a possible vulnerability, because the memory and
addresses associated with a pool would stay with it.
--
igor
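The "intrinsic property" could be as simple as a dedicated virtual range, making target validation a single range compare, as in this sketch (the base address and the 64 MB size are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reserved range holding all write-rare data. */
#define WR_START 0xffffa00000000000ULL
#define WR_SIZE  (64ULL << 20)		/* a few tens of megabytes */

/* A valid write-rare target must lie entirely inside the range. */
static int is_wr_address(uint64_t addr, uint64_t len)
{
	return addr >= WR_START && len <= WR_SIZE &&
	       addr - WR_START <= WR_SIZE - len;
}
```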

*Re: [PATCH 10/17] prmem: documentation
2018-10-31 4:41 ` Andy Lutomirski
  2018-10-31 9:08 ` Igor Stoppa
@ 2018-10-31 10:02 ` Peter Zijlstra
  2018-10-31 20:36 ` Andy Lutomirski
  1 sibling, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31 10:02 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Matthew Wilcox, Igor Stoppa, Tycho Andersen, Kees Cook,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
> To clarify some of this thread, I think that the fact that rare_write
> uses an mm_struct and alias mappings under the hood should be
> completely invisible to users of the API. No one should ever be
> handed a writable pointer to rare_write memory (except perhaps during
> bootup or when initializing a large complex data structure that will
> be rare_write but isn't yet, e.g. the policy db).
Being able to use pointers would make it far easier to do atomics and
other things though.
> For example, there could easily be architectures where having a
> writable alias is problematic.
Mostly we'd just have to be careful of cache aliases, alignment should
be able to sort that I think.
> If you have multiple pools and one mm_struct per pool, you'll need a
> way to find the mm_struct from a given allocation.
Or keep track of it externally. For example by context. If you modify
page-tables you pick the page-table pool, if you modify selinux state,
you pick the selinux pool.
> Regardless of how the mm_structs are set up, changing rare_write
> memory to normal memory or vice versa will require a global TLB flush
> (all ASIDs and global pages) on all CPUs, so having extra mm_structs
> doesn't seem to buy much.
The way I understand it, the point is that if you stick page-tables and
selinux state in different pools, a stray write in one will never affect
the other.

*Re: [PATCH 10/17] prmem: documentation
2018-10-31 10:02 ` Peter Zijlstra
@ 2018-10-31 20:36 ` Andy Lutomirski
2018-10-31 21:00 ` Peter Zijlstra
0 siblings, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 20:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, Igor Stoppa, Tycho Andersen, Kees Cook,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
> On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
>> To clarify some of this thread, I think that the fact that rare_write
>> uses an mm_struct and alias mappings under the hood should be
>> completely invisible to users of the API. No one should ever be
>> handed a writable pointer to rare_write memory (except perhaps during
>> bootup or when initializing a large complex data structure that will
>> be rare_write but isn't yet, e.g. the policy db).
>
> Being able to use pointers would make it far easier to do atomics and
> other things though.
This stuff is called *rare* write for a reason. Do we really want to allow atomics beyond just store-release? Taking a big lock and then writing in the right order should cover everything, no?
>
>> For example, there could easily be architectures where having a
>> writable alias is problematic.
>
> Mostly we'd just have to be careful of cache aliases, alignment should
> be able to sort that I think.
>
>> If you have multiple pools and one mm_struct per pool, you'll need a
>> way to find the mm_struct from a given allocation.
>
> Or keep track of it externally. For example by context. If you modify
> page-tables you pick the page-table pool, if you modify selinux state,
> you pick the selinux pool.
>
>> Regardless of how the mm_structs are set up, changing rare_write
>> memory to normal memory or vice versa will require a global TLB flush
>> (all ASIDs and global pages) on all CPUs, so having extra mm_structs
>> doesn't seem to buy much.
>
> The way I understand it, the point is that if you stick page-tables and
> selinux state in different pools, a stray write in one will never affect
> the other.
>
Hmm. That’s not totally crazy, but the API would need to be carefully designed. And some argument would have to be made as to why it’s better to use a different address space as opposed to checking in software along the lines of the uaccess checking.

*Re: [PATCH 10/17] prmem: documentation
2018-10-31 20:36 ` Andy Lutomirski
@ 2018-10-31 21:00 ` Peter Zijlstra
2018-10-31 22:57 ` Andy Lutomirski
0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31 21:00 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Matthew Wilcox, Igor Stoppa, Tycho Andersen, Kees Cook,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:
>
> > On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
> >> To clarify some of this thread, I think that the fact that rare_write
> >> uses an mm_struct and alias mappings under the hood should be
> >> completely invisible to users of the API. No one should ever be
> >> handed a writable pointer to rare_write memory (except perhaps during
> >> bootup or when initializing a large complex data structure that will
> >> be rare_write but isn't yet, e.g. the policy db).
> >
> > Being able to use pointers would make it far easier to do atomics and
> > other things though.
>
> This stuff is called *rare* write for a reason. Do we really want to
> allow atomics beyond just store-release? Taking a big lock and then
> writing in the right order should cover everything, no?
Ah, so no. That naming is very misleading.
We modify page-tables a _lot_. The point is that only a few sanctioned
sites are allowed writing to it, not everybody.
I _think_ the use-case for atomics is updating the reference counts of
objects that are in this write-rare domain. But I'm not entirely clear
on that myself either. I just really want to avoid duplicating that
stuff.

*Re: [PATCH 10/17] prmem: documentation
2018-10-31 21:00 ` Peter Zijlstra
@ 2018-10-31 22:57 ` Andy Lutomirski
2018-10-31 23:10 ` Igor Stoppa
2018-11-01 17:08 ` Peter Zijlstra
0 siblings, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 22:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matthew Wilcox, Igor Stoppa, Tycho Andersen, Kees Cook,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:
>>
>>>> On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
>>>> To clarify some of this thread, I think that the fact that rare_write
>>>> uses an mm_struct and alias mappings under the hood should be
>>>> completely invisible to users of the API. No one should ever be
>>>> handed a writable pointer to rare_write memory (except perhaps during
>>>> bootup or when initializing a large complex data structure that will
>>>> be rare_write but isn't yet, e.g. the policy db).
>>>
>>> Being able to use pointers would make it far easier to do atomics and
>>> other things though.
>>
>> This stuff is called *rare* write for a reason. Do we really want to
>> allow atomics beyond just store-release? Taking a big lock and then
>> writing in the right order should cover everything, no?
>
> Ah, so no. That naming is very misleading.
>
> We modify page-tables a _lot_. The point is that only a few sanctioned
> sites are allowed writing to it, not everybody.
>
> I _think_ the use-case for atomics is updating the reference counts of
> objects that are in this write-rare domain. But I'm not entirely clear
> on that myself either. I just really want to avoid duplicating that
> stuff.
Sounds nuts. Doing a rare-write is many hundreds of cycles at best. Using that for a reference count sounds wacky.
Can we see a *real* use case before we over complicate the API?

*Re: [PATCH 10/17] prmem: documentation
2018-10-30 21:25 ` Matthew Wilcox
@ 2018-10-30 21:55 ` Igor Stoppa
2018-10-30 22:08 ` Matthew Wilcox
2018-10-31 9:29 ` Peter Zijlstra
1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-30 21:55 UTC (permalink / raw)
To: Matthew Wilcox, Andy Lutomirski
Cc: nadav.amit, Kees Cook, Peter Zijlstra, Mimi Zohar, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On 30/10/2018 23:25, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
>> Finally, one issue: rare_alloc() is going to utterly suck
>> performance-wise due to the global IPI when the region gets zapped out
>> of the direct map or otherwise made RO. This is the same issue that
>> makes all existing XPO efforts so painful. We need to either optimize
>> the crap out of it somehow or we need to make sure it’s not called
>> except during rare events like device enumeration.
>
> Batching operations is kind of the whole point of the VM ;-) Either
> this rare memory gets used a lot, in which case we'll want to create slab
caches for it, make it a MM zone and the whole nine yards, or it's not
> used very much in which case it doesn't matter that performance sucks.
>
> For now, I'd suggest allocating 2MB chunks as needed, and having a
> shrinker to hand back any unused pieces.
One of the prime candidates for this sort of protection is IMA.
In the IMA case, there are ever-growing lists which are populated when
accessing files.
It's something that ends up on the critical path of any usual
performance critical use case, when accessing files for the first time,
like at boot/application startup.
Also the SELinux AVC is based on lists. It uses an object cache, but it
is still something that grows and is on the critical path of evaluating
the callbacks from the LSM hooks. A lot of them.
These are the main two reasons, so far, for me advocating an
optimization of the write-rare version of the (h)list.
--
igor

*Re: [PATCH 10/17] prmem: documentation
2018-10-30 21:55 ` Igor Stoppa
@ 2018-10-30 22:08 ` Matthew Wilcox
0 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-30 22:08 UTC (permalink / raw)
To: Igor Stoppa
Cc: Andy Lutomirski, nadav.amit, Kees Cook, Peter Zijlstra,
Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, linux-security-module,
Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
On Tue, Oct 30, 2018 at 11:55:46PM +0200, Igor Stoppa wrote:
> On 30/10/2018 23:25, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> > > Finally, one issue: rare_alloc() is going to utterly suck
> > > performance-wise due to the global IPI when the region gets zapped out
> > > of the direct map or otherwise made RO. This is the same issue that
> > > makes all existing XPO efforts so painful. We need to either optimize
> > > the crap out of it somehow or we need to make sure it’s not called
> > > except during rare events like device enumeration.
> >
> > Batching operations is kind of the whole point of the VM ;-) Either
> > this rare memory gets used a lot, in which case we'll want to create slab
> > caches for it, make it a MM zone and the whole nine yards, or it's not
> > used very much in which case it doesn't matter that performance sucks.
> >
> > For now, I'd suggest allocating 2MB chunks as needed, and having a
> > shrinker to hand back any unused pieces.
>
> One of the prime candidates for this sort of protection is IMA.
> In the IMA case, there are ever-growing lists which are populated when
> accessing files.
> It's something that ends up on the critical path of any usual performance
> critical use case, when accessing files for the first time, like at
> boot/application startup.
>
> Also the SELinux AVC is based on lists. It uses an object cache, but it is
> still something that grows and is on the critical path of evaluating the
> callbacks from the LSM hooks. A lot of them.
>
> These are the main two reasons, so far, for me advocating an optimization of
> the write-rare version of the (h)list.
I think these are both great examples of why doubly-linked lists _suck_.
You have to modify three cachelines to add an entry to a list. Walking a
linked list is an exercise in cache misses. Far better to use an XArray /
IDR for this purpose.

*Re: [PATCH 10/17] prmem: documentation
2018-10-30 21:25 ` Matthew Wilcox
2018-10-30 21:55 ` Igor Stoppa
@ 2018-10-31 9:29 ` Peter Zijlstra
1 sibling, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31 9:29 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andy Lutomirski, nadav.amit, Kees Cook, Igor Stoppa, Mimi Zohar,
Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On Tue, Oct 30, 2018 at 02:25:51PM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> > Finally, one issue: rare_alloc() is going to utterly suck
> > performance-wise due to the global IPI when the region gets zapped out
> > of the direct map or otherwise made RO. This is the same issue that
> > makes all existing XPO efforts so painful. We need to either optimize
> > the crap out of it somehow or we need to make sure it’s not called
> > except during rare events like device enumeration.
>
> Batching operations is kind of the whole point of the VM ;-) Either
> this rare memory gets used a lot, in which case we'll want to create slab
> caches for it, make it a MM zone and the whole nine yards, or it's not
> used very much in which case it doesn't matter that performance sucks.
Yes, for the dynamic case something along those lines would be needed.
If we have a single rare zone, we could even have __GFP_RARE or whatever
that manages this.
The page allocator would have to grow a rare memblock type, and every
rare alloc would allocate from a rare memblock, when none is available,
creation of a rare block would set up the mappings etc..
> For now, I'd suggest allocating 2MB chunks as needed, and having a
> shrinker to hand back any unused pieces.
Something like the percpu allocator?

*Re: [PATCH 10/17] prmem: documentation
2018-10-30 17:06 ` Andy Lutomirski
2018-10-30 17:58 ` Matthew Wilcox
@ 2018-11-13 14:25 ` Igor Stoppa
2018-11-13 17:16 ` Andy Lutomirski
1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 14:25 UTC (permalink / raw)
To: Andy Lutomirski, Kees Cook, Peter Zijlstra, Nadav Amit
Cc: Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
Michal Hocko, Kernel Hardening, linux-integrity,
linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
Hello,
I've been studying v4 of the patch-set [1] that Nadav has been working on.
Incidentally, I think it would be useful to cc also the
security/hardening ml.
The patch-set seems to be close to final, so I am resuming this discussion.
On 30/10/2018 19:06, Andy Lutomirski wrote:
> I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
After reading the code, I see what you meant.
I think I can work with it.
But I have a couple of questions wrt the use of this mechanism, in the
context of write rare.
1) mm_struct.
Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
(live patch?), which seems to happen sequentially and in a relatively
standardized way, like replacing the NOPs specifically placed in the
functions that need patching.
This is a bit different from the more generic write-rare case, applied
to data.
As example, I have in mind a system where both IMA and SELinux are in use.
In this system, a file is accessed for the first time.
That would trigger 2 things:
- evaluation of the SELinux rules and probably update of the AVC cache
- IMA measurement and update of the measurements
Both of them could be write protected, meaning that they would both have
to be modified through the write rare mechanism.
While the events, for 1 specific file, would be sequential, it's not
difficult to imagine that multiple files could be accessed at the same time.
If the update of the data structures in both IMA and SELinux must use
the same mm_struct, that would have to be somehow regulated and it would
introduce an unnecessary (imho) dependency.
How about having one mm_struct for each writer (core or thread)?
2) Iiuc, the purpose of the 2 pages being remapped is that the target of
the patch might spill across the page boundary, however if I deal with
the modification of generic data, I shouldn't (shouldn't I?) assume that
the data will not span across multiple pages.
If the data spans across multiple pages, in unknown amount, I suppose
that I should not keep interrupts disabled for an unknown time, as it
would hurt preemption.
What I thought, in my initial patch-set, was to iterate over each page
that must be written to, in a loop, re-enabling interrupts in-between
iterations, to give pending interrupts a chance to be served.
This would mean that the data being written to would not be consistent,
but it's a problem that would have to be addressed anyways, since it can
be still read by other cores, while the write is ongoing.
Is this a valid concern/approach?
--
igor
[1] https://lkml.org/lkml/2018/11/11/110

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 14:25 ` Igor Stoppa
@ 2018-11-13 17:16 ` Andy Lutomirski
2018-11-13 17:43 ` Nadav Amit
2018-11-13 18:26 ` Igor Stoppa
0 siblings, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 17:16 UTC (permalink / raw)
To: Igor Stoppa
Cc: Kees Cook, Peter Zijlstra, Nadav Amit, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
> Hello,
> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> Incidentally, I think it would be useful to cc also the
> security/hardening ml.
> The patch-set seems to be close to final, so I am resuming this discussion.
>
> On 30/10/2018 19:06, Andy Lutomirski wrote:
>
> > I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>
> After reading the code, I see what you meant.
> I think I can work with it.
>
> But I have a couple of questions wrt the use of this mechanism, in the
> context of write rare.
>
>
> 1) mm_struct.
>
> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> (live patch?), which seems to happen sequentially and in a relatively
> standardized way, like replacing the NOPs specifically placed in the
> functions that need patching.
>
> This is a bit different from the more generic write-rare case, applied
> to data.
>
> As example, I have in mind a system where both IMA and SELinux are in use.
>
> In this system, a file is accessed for the first time.
>
> That would trigger 2 things:
> - evaluation of the SELinux rules and probably update of the AVC cache
> - IMA measurement and update of the measurements
>
> Both of them could be write protected, meaning that they would both have
> to be modified through the write rare mechanism.
>
> While the events, for 1 specific file, would be sequential, it's not
> difficult to imagine that multiple files could be accessed at the same time.
>
> If the update of the data structures in both IMA and SELinux must use
> the same mm_struct, that would have to be somehow regulated and it would
> introduce an unnecessary (imho) dependency.
>
> How about having one mm_struct for each writer (core or thread)?
>
I don't think that helps anything. I think the mm_struct used for
prmem (or rare_write or whatever you want to call it) should be
entirely abstracted away by an appropriate API, so neither SELinux nor
IMA need to be aware that there's an mm_struct involved. It's also
entirely possible that some architectures won't even use an mm_struct
behind the scenes -- x86, for example, could have avoided it if there
were a kernel equivalent of PKRU. Sadly, there isn't.
>
>
> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> the patch might spill across the page boundary, however if I deal with
> the modification of generic data, I shouldn't (shouldn't I?) assume that
> the data will not span across multiple pages.
The reason for the particular architecture of text_poke() is to avoid
memory allocation to get it working. I think that prmem/rare_write
should have each rare-writable kernel address map to a unique user
address, possibly just by offsetting everything by a constant. For
rare_write, you don't actually need it to work as such until fairly
late in boot, since the rare_writable data will just be writable early
on.
>
> If the data spans across multiple pages, in unknown amount, I suppose
> that I should not keep interrupts disabled for an unknown time, as it
> would hurt preemption.
>
> What I thought, in my initial patch-set, was to iterate over each page
> that must be written to, in a loop, re-enabling interrupts in-between
> iterations, to give pending interrupts a chance to be served.
>
> This would mean that the data being written to would not be consistent,
> but it's a problem that would have to be addressed anyways, since it can
> be still read by other cores, while the write is ongoing.
This probably makes sense, except that enabling and disabling
interrupts means you also need to restore the original mm_struct (most
likely), which is slow. I don't think there's a generic way to check
whether an interrupt is pending without turning interrupts on.

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 17:16 ` Andy Lutomirski
@ 2018-11-13 17:43 ` Nadav Amit
2018-11-13 17:47 ` Andy Lutomirski
2018-11-13 18:26 ` Igor Stoppa
1 sibling, 1 reply; 140+ messages in thread
From: Nadav Amit @ 2018-11-13 17:43 UTC (permalink / raw)
To: Andy Lutomirski, Igor Stoppa
Cc: Kees Cook, Peter Zijlstra, Mimi Zohar, Matthew Wilcox,
Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
linux-integrity, LSM List, Igor Stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
From: Andy Lutomirski
Sent: November 13, 2018 at 5:16:09 PM GMT
> To: Igor Stoppa <igor.stoppa@gmail.com>
> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
>
>
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>> Hello,
>> I've been studying v4 of the patch-set [1] that Nadav has been working on.
>> Incidentally, I think it would be useful to cc also the
>> security/hardening ml.
>> The patch-set seems to be close to final, so I am resuming this discussion.
>>
>> On 30/10/2018 19:06, Andy Lutomirski wrote:
>>
>>> I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>>
>> After reading the code, I see what you meant.
>> I think I can work with it.
>>
>> But I have a couple of questions wrt the use of this mechanism, in the
>> context of write rare.
>>
>>
>> 1) mm_struct.
>>
>> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
>> (live patch?), which seems to happen sequentially and in a relatively
>> standardized way, like replacing the NOPs specifically placed in the
>> functions that need patching.
>>
>> This is a bit different from the more generic write-rare case, applied
>> to data.
>>
>> As example, I have in mind a system where both IMA and SELinux are in use.
>>
>> In this system, a file is accessed for the first time.
>>
>> That would trigger 2 things:
>> - evaluation of the SELinux rules and probably update of the AVC cache
>> - IMA measurement and update of the measurements
>>
>> Both of them could be write protected, meaning that they would both have
>> to be modified through the write rare mechanism.
>>
>> While the events, for 1 specific file, would be sequential, it's not
>> difficult to imagine that multiple files could be accessed at the same time.
>>
>> If the update of the data structures in both IMA and SELinux must use
>> the same mm_struct, that would have to be somehow regulated and it would
>> introduce an unnecessary (imho) dependency.
>>
>> How about having one mm_struct for each writer (core or thread)?
>
> I don't think that helps anything. I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it) should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved. It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU. Sadly, there isn't.
>
>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
>
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working. I think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant. For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.
>
>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>>
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>>
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
>
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow. I don't think there's a generic way to check
> whether an interrupt is pending without turning interrupts on.
I guess that enabling IRQs might break some hidden assumptions in the code,
but is there a fundamental reason that IRQs need to be disabled? use_mm()
got them enabled, although it is only suitable for kernel threads.

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 17:43 ` Nadav Amit
@ 2018-11-13 17:47 ` Andy Lutomirski
2018-11-13 18:06 ` Nadav Amit
2018-11-13 18:31 ` Igor Stoppa
0 siblings, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 17:47 UTC (permalink / raw)
To: Nadav Amit
Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Andy Lutomirski
> Sent: November 13, 2018 at 5:16:09 PM GMT
> > To: Igor Stoppa <igor.stoppa@gmail.com>
> > Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> > Subject: Re: [PATCH 10/17] prmem: documentation
> >
> >
> > On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
> >> Hello,
> >> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> >> Incidentally, I think it would be useful to cc also the
> >> security/hardening ml.
> >> The patch-set seems to be close to final, so I am resuming this discussion.
> >>
> >> On 30/10/2018 19:06, Andy Lutomirski wrote:
> >>
> >>> I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
> >>
> >> After reading the code, I see what you meant.
> >> I think I can work with it.
> >>
> >> But I have a couple of questions wrt the use of this mechanism, in the
> >> context of write rare.
> >>
> >>
> >> 1) mm_struct.
> >>
> >> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> >> (live patch?), which seems to happen sequentially and in a relatively
> >> standardized way, like replacing the NOPs specifically placed in the
> >> functions that need patching.
> >>
> >> This is a bit different from the more generic write-rare case, applied
> >> to data.
> >>
> >> As example, I have in mind a system where both IMA and SELinux are in use.
> >>
> >> In this system, a file is accessed for the first time.
> >>
> >> That would trigger 2 things:
> >> - evaluation of the SELinux rules and probably update of the AVC cache
> >> - IMA measurement and update of the measurements
> >>
> >> Both of them could be write protected, meaning that they would both have
> >> to be modified through the write rare mechanism.
> >>
> >> While the events, for 1 specific file, would be sequential, it's not
> >> difficult to imagine that multiple files could be accessed at the same time.
> >>
> >> If the update of the data structures in both IMA and SELinux must use
> >> the same mm_struct, that would have to be somehow regulated and it would
> >> introduce an unnecessary (imho) dependency.
> >>
> >> How about having one mm_struct for each writer (core or thread)?
> >
> > I don't think that helps anything. I think the mm_struct used for
> > prmem (or rare_write or whatever you want to call it) should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved. It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU. Sadly, there isn't.
> >
> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working. I think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant. For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
> >
> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow. I don't think there's a generic way to check
> > whether an interrupt is pending without turning interrupts on.
>
> I guess that enabling IRQs might break some hidden assumptions in the code,
> but is there a fundamental reason that IRQs need to be disabled? use_mm()
> got them enabled, although it is only suitable for kernel threads.
>
For general rare-writish stuff, I don't think we want IRQs running
with them mapped anywhere for write. For AVC and IMA, I'm less sure.
--Andy

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 17:47 ` Andy Lutomirski
@ 2018-11-13 18:06 ` Nadav Amit
2018-11-13 18:31 ` Igor Stoppa
1 sibling, 0 replies; 140+ messages in thread
From: Nadav Amit @ 2018-11-13 18:06 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
From: Andy Lutomirski
Sent: November 13, 2018 at 5:47:16 PM GMT
> To: Nadav Amit <nadav.amit@gmail.com>
> Cc: Igor Stoppa <igor.stoppa@gmail.com>, Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
>
>
> On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Andy Lutomirski
>> Sent: November 13, 2018 at 5:16:09 PM GMT
>>> To: Igor Stoppa <igor.stoppa@gmail.com>
>>> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
>>> Subject: Re: [PATCH 10/17] prmem: documentation
>>>
>>>
>>> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>>> Hello,
>>>> I've been studying v4 of the patch-set [1] that Nadav has been working on.
>>>> Incidentally, I think it would be useful to cc also the
>>>> security/hardening ml.
>>>> The patch-set seems to be close to final, so I am resuming this discussion.
>>>>
>>>> On 30/10/2018 19:06, Andy Lutomirski wrote:
>>>>
>>>>> I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>>>>
>>>> After reading the code, I see what you meant.
>>>> I think I can work with it.
>>>>
>>>> But I have a couple of questions wrt the use of this mechanism, in the
>>>> context of write rare.
>>>>
>>>>
>>>> 1) mm_struct.
>>>>
>>>> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
>>>> (live patch?), which seems to happen sequentially and in a relatively
>>>> standardized way, like replacing the NOPs specifically placed in the
>>>> functions that need patching.
>>>>
>>>> This is a bit different from the more generic write-rare case, applied
>>>> to data.
>>>>
>>>> As example, I have in mind a system where both IMA and SELinux are in use.
>>>>
>>>> In this system, a file is accessed for the first time.
>>>>
>>>> That would trigger 2 things:
>>>> - evaluation of the SELinux rules and probably update of the AVC cache
>>>> - IMA measurement and update of the measurements
>>>>
>>>> Both of them could be write protected, meaning that they would both have
>>>> to be modified through the write rare mechanism.
>>>>
>>>> While the events, for 1 specific file, would be sequential, it's not
>>>> difficult to imagine that multiple files could be accessed at the same time.
>>>>
>>>> If the update of the data structures in both IMA and SELinux must use
>>>> the same mm_struct, that would have to be somehow regulated and it would
>>>> introduce an unnecessary (imho) dependency.
>>>>
>>>> How about having one mm_struct for each writer (core or thread)?
>>>
>>> I don't think that helps anything. I think the mm_struct used for
>>> prmem (or rare_write or whatever you want to call it) should be
>>> entirely abstracted away by an appropriate API, so neither SELinux nor
>>> IMA need to be aware that there's an mm_struct involved. It's also
>>> entirely possible that some architectures won't even use an mm_struct
>>> behind the scenes -- x86, for example, could have avoided it if there
>>> were a kernel equivalent of PKRU. Sadly, there isn't.
>>>
>>>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>>>> the patch might spill across the page boundary, however if I deal with
>>>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>>>> the data will not span across multiple pages.
>>>
>>> The reason for the particular architecture of text_poke() is to avoid
>>> memory allocation to get it working. I think that prmem/rare_write
>>> should have each rare-writable kernel address map to a unique user
>>> address, possibly just by offsetting everything by a constant. For
>>> rare_write, you don't actually need it to work as such until fairly
>>> late in boot, since the rare_writable data will just be writable early
>>> on.
>>>
>>>> If the data spans across multiple pages, in unknown amount, I suppose
>>>> that I should not keep interrupts disabled for an unknown time, as it
>>>> would hurt preemption.
>>>>
>>>> What I thought, in my initial patch-set, was to iterate over each page
>>>> that must be written to, in a loop, re-enabling interrupts in-between
>>>> iterations, to give pending interrupts a chance to be served.
>>>>
>>>> This would mean that the data being written to would not be consistent,
>>>> but it's a problem that would have to be addressed anyways, since it can
>>>> be still read by other cores, while the write is ongoing.
>>>
>>> This probably makes sense, except that enabling and disabling
>>> interrupts means you also need to restore the original mm_struct (most
>>> likely), which is slow. I don't think there's a generic way to check
>>> whether an interrupt is pending without turning interrupts on.
>>
>> I guess that enabling IRQs might break some hidden assumptions in the code,
>> but is there a fundamental reason that IRQs need to be disabled? use_mm()
>> got them enabled, although it is only suitable for kernel threads.
>
> For general rare-writish stuff, I don't think we want IRQs running
> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
Oh.. Of course. It is sort of a measure to prevent a single malicious/faulty
write from corrupting the sensitive memory. Doing so limits the code that
can compromise the security by a write to the protected data-structures
(rephrasing for myself).
I think I should add it as a comment in your patch.

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 17:47 ` Andy Lutomirski
2018-11-13 18:06 ` Nadav Amit
@ 2018-11-13 18:31 ` Igor Stoppa
2018-11-13 18:33 ` Igor Stoppa
2018-11-13 18:48 ` Andy Lutomirski
1 sibling, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 18:31 UTC (permalink / raw)
To: Andy Lutomirski, Nadav Amit
Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On 13/11/2018 19:47, Andy Lutomirski wrote:
> For general rare-writish stuff, I don't think we want IRQs running
> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
Why would these be less sensitive?
But I see a big difference between my initial implementation and this one.
In my case, by using a shared mapping, visible to all cores, freezing
the core that is performing the write would have exposed the writable
mapping to a potential attack run from another core.
If the mapping is private to the core performing the write, even if it
is frozen, it's much harder to figure out what it had mapped and where,
from another core.
To access that mapping, the attack should be performed from the ISR, I
think.
--
igor

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 18:31 ` Igor Stoppa
@ 2018-11-13 18:33 ` Igor Stoppa
2018-11-13 18:36 ` Andy Lutomirski
2018-11-13 18:48 ` Andy Lutomirski
1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 18:33 UTC (permalink / raw)
To: Andy Lutomirski, Nadav Amit
Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
I forgot one sentence :-(
On 13/11/2018 20:31, Igor Stoppa wrote:
> On 13/11/2018 19:47, Andy Lutomirski wrote:
>
>> For general rare-writish stuff, I don't think we want IRQs running
>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>
> Why would these be less sensitive?
>
> But I see a big difference between my initial implementation and this one.
>
> In my case, by using a shared mapping, visible to all cores, freezing
> the core that is performing the write would have exposed the writable
> mapping to a potential attack run from another core.
>
> If the mapping is private to the core performing the write, even if it
> is frozen, it's much harder to figure out what it had mapped and where,
> from another core.
>
> To access that mapping, the attack should be performed from the ISR, I
> think.
Unless the secondary mapping is also available to other cores, through
the shared mm_struct ?
--
igor

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 18:33 ` Igor Stoppa
@ 2018-11-13 18:36 ` Andy Lutomirski
2018-11-13 19:03 ` Igor Stoppa
2018-11-21 16:34 ` Igor Stoppa
0 siblings, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 18:36 UTC (permalink / raw)
To: Igor Stoppa
Cc: Nadav Amit, Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> I forgot one sentence :-(
>
> On 13/11/2018 20:31, Igor Stoppa wrote:
> > On 13/11/2018 19:47, Andy Lutomirski wrote:
> >
> >> For general rare-writish stuff, I don't think we want IRQs running
> >> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
> >
> > Why would these be less sensitive?
> >
> > But I see a big difference between my initial implementation and this one.
> >
> > In my case, by using a shared mapping, visible to all cores, freezing
> > the core that is performing the write would have exposed the writable
> > mapping to a potential attack run from another core.
> >
> > If the mapping is private to the core performing the write, even if it
> > is frozen, it's much harder to figure out what it had mapped and where,
> > from another core.
> >
> > To access that mapping, the attack should be performed from the ISR, I
> > think.
>
> Unless the secondary mapping is also available to other cores, through
> the shared mm_struct ?
>
I don't think this matters much. The other cores will only be able to
use that mapping when they're doing a rare write.

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 18:36 ` Andy Lutomirski
2018-11-13 19:03 ` Igor Stoppa
@ 2018-11-21 16:34 ` Igor Stoppa
2018-11-21 17:36 ` Nadav Amit
2018-11-21 18:15 ` Andy Lutomirski
1 sibling, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-21 16:34 UTC (permalink / raw)
To: Andy Lutomirski, Igor Stoppa
Cc: Nadav Amit, Kees Cook, Peter Zijlstra, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
Hi,
On 13/11/2018 20:36, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>
>> I forgot one sentence :-(
>>
>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>
>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>>>
>>> Why would these be less sensitive?
>>>
>>> But I see a big difference between my initial implementation and this one.
>>>
>>> In my case, by using a shared mapping, visible to all cores, freezing
>>> the core that is performing the write would have exposed the writable
>>> mapping to a potential attack run from another core.
>>>
>>> If the mapping is private to the core performing the write, even if it
>>> is frozen, it's much harder to figure out what it had mapped and where,
>>> from another core.
>>>
>>> To access that mapping, the attack should be performed from the ISR, I
>>> think.
>>
>> Unless the secondary mapping is also available to other cores, through
>> the shared mm_struct ?
>>
>
> I don't think this matters much. The other cores will only be able to
> use that mapping when they're doing a rare write.
I'm still mulling over this.
There might be other reasons for replicating the mm_struct.
If I understand correctly how the text patching works, it happens
sequentially, because of the text_mutex used by arch_jump_label_transform
Which might be fine for this specific case, but I think I shouldn't
introduce a global mutex, when it comes to data.
Most likely, if two or more cores want to perform a write rare
operation, there is no correlation between them, they could proceed in
parallel. And if there really is, then the user of the API should
introduce its own locking, for that specific case.
A bit unrelated question, related to text patching: I see that each
patching operation is validated, but wouldn't it be more robust to first
validate all of them, and only after they are all found to be
compliant, to proceed with the actual modifications?
And about the actual implementation of the write rare for the statically
allocated variables, is it expected that I use Nadav's function?
Or that I refactor the code?
The name, referring to text would definitely not be ok for data.
And I would have to also generalize it, to deal with larger amounts of data.
I would find it easier, as first cut, to replicate its behavior and
refactor only later, once it has stabilized and possibly Nadav's patches
have been acked.
--
igor

*Re: [PATCH 10/17] prmem: documentation
2018-11-21 16:34 ` Igor Stoppa
@ 2018-11-21 17:36 ` Nadav Amit
2018-11-21 18:01 ` Igor Stoppa
2018-11-21 18:15 ` Andy Lutomirski
1 sibling, 1 reply; 140+ messages in thread
From: Nadav Amit @ 2018-11-21 17:36 UTC (permalink / raw)
To: Igor Stoppa
Cc: Andy Lutomirski, Igor Stoppa, Kees Cook, Peter Zijlstra,
Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
Michal Hocko, Kernel Hardening, linux-integrity, LSM List,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
> On Nov 21, 2018, at 8:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
> Hi,
>
> On 13/11/2018 20:36, Andy Lutomirski wrote:
>> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>> I forgot one sentence :-(
>>>
>>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>>
>>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>>>>
>>>> Why would these be less sensitive?
>>>>
>>>> But I see a big difference between my initial implementation and this one.
>>>>
>>>> In my case, by using a shared mapping, visible to all cores, freezing
>>>> the core that is performing the write would have exposed the writable
>>>> mapping to a potential attack run from another core.
>>>>
>>>> If the mapping is private to the core performing the write, even if it
>>>> is frozen, it's much harder to figure out what it had mapped and where,
>>>> from another core.
>>>>
>>>> To access that mapping, the attack should be performed from the ISR, I
>>>> think.
>>>
>>> Unless the secondary mapping is also available to other cores, through
>>> the shared mm_struct ?
>> I don't think this matters much. The other cores will only be able to
>> use that mapping when they're doing a rare write.
>
>
> I'm still mulling over this.
> There might be other reasons for replicating the mm_struct.
>
> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
>
> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce its own locking, for that specific case.
I think that if you create per-CPU temporary mms as you proposed, you do not
need a global lock. You would need to protect against module unloading, and
maybe refactor the code. Specifically, I’m not sure whether protection
against IRQs is something that you need. I’m also not familiar with this
patch-set so I’m not sure what atomicity guarantees you need.
> A bit unrelated question, related to text patching: I see that each patching operation is validated, but wouldn't it be more robust to first validate all of them, and only after they are all found to be compliant, to proceed with the actual modifications?
>
> And about the actual implementation of the write rare for the statically allocated variables, is it expected that I use Nadav's function?
It’s not “my” function. ;-)
IMHO the code is in relatively good and stable state. The last couple of
versions were due to me being afraid to add BUG_ONs as Peter asked me to.
The code is rather simple, but there are a couple of pitfalls that hopefully
I avoided correctly.
Regards,
Nadav

*Re: [PATCH 10/17] prmem: documentation
2018-11-21 17:36 ` Nadav Amit
@ 2018-11-21 18:01 ` Igor Stoppa
0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-21 18:01 UTC (permalink / raw)
To: Nadav Amit
Cc: Andy Lutomirski, Igor Stoppa, Kees Cook, Peter Zijlstra,
Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
Michal Hocko, Kernel Hardening, linux-integrity, LSM List,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
On 21/11/2018 19:36, Nadav Amit wrote:
>> On Nov 21, 2018, at 8:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
[...]
>> There might be other reasons for replicating the mm_struct.
>>
>> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
>>
>> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
>> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce its own locking, for that specific case.
>
> I think that if you create per-CPU temporary mms as you proposed, you do not
> need a global lock. You would need to protect against module unloading,
yes, it's unlikely to happen and probably a bug in the module, if it
tries to write while being unloaded, but I can do it
> and
> maybe refactor the code. Specifically, I’m not sure whether protection
> against IRQs is something that you need.
With the initial way I used to do write rare, which was done by creating
a temporary mapping visible to every core, disabling IRQs was meant to
prevent that the "writer" core would be frozen and then the mappings
scrubbed for the page in writable state.
Without shared mapping of the page, the only way to attack it should be
to generate an interrupt on the "writer" core, while the writing is
ongoing, and to perform the attack from the interrupt itself, because it
is on the same core that has the writable mapping.
Maybe it's possible, but it seems to have become quite a corner case.
> I’m also not familiar with this
> patch-set so I’m not sure what atomicity guarantees you need.
At the very least, I think I need to ensure that pointers are updated
atomically, like with WRITE_ONCE() and spinlocks.
Maybe atomic types can be left out.
>> A bit unrelated question, related to text patching: I see that each patching operation is validated, but wouldn't it be more robust to first validate all of them, and only after they are all found to be compliant, to proceed with the actual modifications?
>>
>> And about the actual implementation of the write rare for the statically allocated variables, is it expected that I use Nadav's function?
>
> It’s not “my” function. ;-)
:-P
ok, what I meant is that the signature of the __text_poke() function is
quite specific to what it's meant to do.
I do not rule out that it might be eventually refactored as a special
case of a more generic __write_rare() function, that would operate on
different targets, but I'd rather do the refactoring after I have a
clear understanding of how to alter write-protected data.
The refactoring could be the last patch of the write rare patchset.
> IMHO the code is in relatively good and stable state. The last couple of
> versions were due to me being afraid to add BUG_ONs as Peter asked me to.
>
> The code is rather simple, but there are a couple of pitfalls that hopefully
> I avoided correctly.
Yes, I did not mean to question the quality of the code, but I'd prefer
to not have to carry also this patchset, while it's not yet merged.
I actually hope it gets merged soon :-)
--
igor

*Re: [PATCH 10/17] prmem: documentation
2018-11-21 16:34 ` Igor Stoppa
2018-11-21 17:36 ` Nadav Amit
@ 2018-11-21 18:15 ` Andy Lutomirski
2018-11-22 19:27 ` Igor Stoppa
1 sibling, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-21 18:15 UTC (permalink / raw)
To: Igor Stoppa
Cc: Andy Lutomirski, Igor Stoppa, Nadav Amit, Kees Cook,
Peter Zijlstra, Mimi Zohar, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
LSM List, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
> On Nov 21, 2018, at 9:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
> Hi,
>
>>> On 13/11/2018 20:36, Andy Lutomirski wrote:
>>> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>>
>>> I forgot one sentence :-(
>>>
>>>>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>>>
>>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>>>>
>>>> Why would these be less sensitive?
>>>>
>>>> But I see a big difference between my initial implementation and this one.
>>>>
>>>> In my case, by using a shared mapping, visible to all cores, freezing
>>>> the core that is performing the write would have exposed the writable
>>>> mapping to a potential attack run from another core.
>>>>
>>>> If the mapping is private to the core performing the write, even if it
>>>> is frozen, it's much harder to figure out what it had mapped and where,
>>>> from another core.
>>>>
>>>> To access that mapping, the attack should be performed from the ISR, I
>>>> think.
>>>
>>> Unless the secondary mapping is also available to other cores, through
>>> the shared mm_struct ?
>> I don't think this matters much. The other cores will only be able to
>> use that mapping when they're doing a rare write.
>
>
> I'm still mulling over this.
> There might be other reasons for replicating the mm_struct.
>
> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
>
> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce its own locking, for that specific case.
Text patching uses the same VA for different physical addresses, so it needs a mutex to avoid conflicts. I think that, for rare writes, you should just map each rare-writable address at a *different* VA. You’ll still need a mutex (mmap_sem) to synchronize allocation and freeing of rare-writable ranges, but that shouldn’t have much contention.
>
> A bit unrelated question, related to text patching: I see that each patching operation is validated, but wouldn't it be more robust to first validate all of them, and only after they are all found to be compliant, to proceed with the actual modifications?
>
> And about the actual implementation of the write rare for the statically allocated variables, is it expected that I use Nadav's function?
> Or that I refactor the code?
I would either refactor it or create a new function to handle the write. The main thing that Nadav is adding that I think you’ll want to use is the infrastructure for temporarily switching mms from a non-kernel-thread context.
>
> The name, referring to text would definitely not be ok for data.
> And I would have to also generalize it, to deal with larger amounts of data.
>
> I would find it easier, as first cut, to replicate its behavior and refactor only later, once it has stabilized and possibly Nadav's patches have been acked.
>
Sounds reasonable. You’ll still want Nadav’s code for setting up the mm in the first place, though.
> --
> igor

*Re: [PATCH 10/17] prmem: documentation
2018-11-21 18:15 ` Andy Lutomirski
@ 2018-11-22 19:27 ` Igor Stoppa
2018-11-22 20:04 ` Matthew Wilcox
0 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-11-22 19:27 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andy Lutomirski, Igor Stoppa, Nadav Amit, Kees Cook,
Peter Zijlstra, Mimi Zohar, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
LSM List, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
On 21/11/2018 20:15, Andy Lutomirski wrote:
>> On Nov 21, 2018, at 9:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
[...]
>> There might be other reasons for replicating the mm_struct.
>>
>> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
>>
>> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
>> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce its own locking, for that specific case.
>
> Text patching uses the same VA for different physical addresses, so it needs a mutex to avoid conflicts. I think that, for rare writes, you should just map each rare-writable address at a *different* VA. You’ll still need a mutex (mmap_sem) to synchronize allocation and freeing of rare-writable ranges, but that shouldn’t have much contention.
I have studied the code involved with Nadav's patchset.
I am perplexed about these sentences you wrote.
More to the point (to the best of my understanding):
poking_init()
-------------
1. it gets one random poking address and ensures to have at least 2
consecutive PTEs from the same PMD
2. it then proceeds to map/unmap an address from the first of the 2
consecutive PTEs, so that, later on, there will be no need to
allocate pages, which might fail, if poking from atomic context.
3. at this point, the page tables are populated, for the address that
was obtained at point 1, and this is ok, because the address is fixed
write_rare
----------
4. it can happen on any available core / thread at any time, therefore
each of them needs a different address
5. CPUs support hotplug, but from what I have read, I might get away
with having up to nr_cpus different addresses (determined at init)
and I would follow the same technique used by Nadav, of forcing the
mapping of 1 or 2 (1 could be enough, I have to loop anyway, at
some point) pages at each address, to ensure the population of the
page tables
so far, so good, but ...
6. the addresses used by each CPU are fixed
7. I do not understand the reference you make to
"allocation and freeing of rare-writable ranges", because if I
treat the range as such, then there is a risk that I need to
populate more entries in the page table, which would have problems
with the atomic context, unless write_rare from atomic is ruled
out. If write_rare from atomic can be avoided, then I can also have
one-use randomized addresses at each write-rare operation, instead
of fixed ones, like in point 6.
and, apologies for being dense: the following is still not clear to me:
8. you wrote:
> You’ll still need a mutex (mmap_sem) to synchronize allocation
> and freeing of rare-writable ranges, but that shouldn’t have much
> contention.
What causes the contention? It's the fact that the various cores
are using the same mm, if I understood correctly.
However, if there was one mm for each core, wouldn't that make it
unnecessary to have any mutex?
I feel there must be some obvious reason why multiple mms are not a
good idea, yet I cannot grasp it :-(
switch_mm_irqs_off() seems to have lots of references to
"this_cpu_something"; if there is any optimization from having the
same next across multiple cores, I'm missing it
[...]
> I would either refactor it or create a new function to handle the write. The main thing that Nadav is adding that I think you’ll want to use is the infrastructure for temporarily switching mms from a non-kernel-thread context.
yes
[...]
> You’ll still want Nadav’s code for setting up the mm in the first place, though.
yes
--
thanks, igor

*Re: [PATCH 10/17] prmem: documentation
2018-11-22 19:27 ` Igor Stoppa
@ 2018-11-22 20:04 ` Matthew Wilcox
2018-11-22 20:53 ` Andy Lutomirski
0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2018-11-22 20:04 UTC (permalink / raw)
To: Igor Stoppa
Cc: Andy Lutomirski, Andy Lutomirski, Igor Stoppa, Nadav Amit,
Kees Cook, Peter Zijlstra, Mimi Zohar, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
LSM List, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
> I have studied the code involved with Nadav's patchset.
> I am perplexed about these sentences you wrote.
>
> More to the point (to the best of my understanding):
>
> poking_init()
> -------------
> 1. it gets one random poking address and ensures to have at least 2
> consecutive PTEs from the same PMD
> 2. it then proceeds to map/unmap an address from the first of the 2
> consecutive PTEs, so that, later on, there will be no need to
> allocate pages, which might fail, if poking from atomic context.
> 3. at this point, the page tables are populated, for the address that
> was obtained at point 1, and this is ok, because the address is fixed
>
> write_rare
> ----------
> 4. it can happen on any available core / thread at any time, therefore
> each of them needs a different address
No? Each CPU has its own CR3 (eg each CPU might be running a different
user task). If you have _one_ address for each allocation, it may or
may not be mapped on other CPUs at the same time -- you simply don't care.
The writable address can even be a simple formula to calculate from
the read-only address, you don't have to allocate an address in the
writable mapping space.

*Re: [PATCH 10/17] prmem: documentation
2018-11-22 20:04 ` Matthew Wilcox
@ 2018-11-22 20:53 ` Andy Lutomirski
2018-12-04 12:34 ` Igor Stoppa
0 siblings, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-22 20:53 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Igor Stoppa, Andrew Lutomirski, Igor Stoppa, Nadav Amit,
Kees Cook, Peter Zijlstra, Mimi Zohar, Dave Chinner,
James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
LSM List, Dave Hansen, Jonathan Corbet, Laura Abbott,
Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
Thomas Gleixner
On Thu, Nov 22, 2018 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
> > I have studied the code involved with Nadav's patchset.
> > I am perplexed about these sentences you wrote.
> >
> > More to the point (to the best of my understanding):
> >
> > poking_init()
> > -------------
> > 1. it gets one random poking address and ensures to have at least 2
> > consecutive PTEs from the same PMD
> > 2. it then proceeds to map/unmap an address from the first of the 2
> > consecutive PTEs, so that, later on, there will be no need to
> > allocate pages, which might fail, if poking from atomic context.
> > 3. at this point, the page tables are populated, for the address that
> > was obtained at point 1, and this is ok, because the address is fixed
> >
> > write_rare
> > ----------
> > 4. it can happen on any available core / thread at any time, therefore
> > each of them needs a different address
>
> No? Each CPU has its own CR3 (eg each CPU might be running a different
> user task). If you have _one_ address for each allocation, it may or
> may not be mapped on other CPUs at the same time -- you simply don't care.
>
> The writable address can even be a simple formula to calculate from
> the read-only address, you don't have to allocate an address in the
> writable mapping space.
>
Agreed. I suggest the formula:
writable_address = readable_address - rare_write_offset. For
starters, rare_write_offset can just be a constant. If we want to get
fancy later on, it can be randomized.
If we do it like this, then we don't need to modify any pagetables at
all when we do a rare write. Instead we can set up the mapping at
boot or when we allocate the rare write space, and the actual rare
write code can just switch mms and do the write.

*Re: [PATCH 10/17] prmem: documentation
2018-11-22 20:53 ` Andy Lutomirski
@ 2018-12-04 12:34 ` Igor Stoppa
0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-12-04 12:34 UTC (permalink / raw)
To: Andy Lutomirski, Matthew Wilcox
Cc: Andrew Lutomirski, Igor Stoppa, Nadav Amit, Kees Cook,
Peter Zijlstra, Mimi Zohar, Dave Chinner, James Morris,
Michal Hocko, Kernel Hardening, linux-integrity, LSM List,
Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner
Hello,
apologies for the delayed answer.
Please find my reply to both last mails in the thread, below.
On 22/11/2018 22:53, Andy Lutomirski wrote:
> On Thu, Nov 22, 2018 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
>>> I have studied the code involved with Nadav's patchset.
>>> I am perplexed about these sentences you wrote.
>>>
>>> More to the point (to the best of my understanding):
>>>
>>> poking_init()
>>> -------------
>>> 1. it gets one random poking address and ensures to have at least 2
>>> consecutive PTEs from the same PMD
>>> 2. it then proceeds to map/unmap an address from the first of the 2
>>> consecutive PTEs, so that, later on, there will be no need to
>>> allocate pages, which might fail, if poking from atomic context.
>>> 3. at this point, the page tables are populated, for the address that
>>> was obtained at point 1, and this is ok, because the address is fixed
>>>
>>> write_rare
>>> ----------
>>> 4. it can happen on any available core / thread at any time, therefore
>>> each of them needs a different address
>>
>> No? Each CPU has its own CR3 (eg each CPU might be running a different
>> user task). If you have _one_ address for each allocation, it may or
>> may not be mapped on other CPUs at the same time -- you simply don't care.
Yes, somehow I lost track of that aspect.
>> The writable address can even be a simple formula to calculate from
>> the read-only address, you don't have to allocate an address in the
>> writable mapping space.
>>
>
> Agreed. I suggest the formula:
>
> writable_address = readable_address - rare_write_offset. For
> starters, rare_write_offset can just be a constant. If we want to get
> fancy later on, it can be randomized.
ok, I hope I captured it here [1]
> If we do it like this, then we don't need to modify any pagetables at
> all when we do a rare write. Instead we can set up the mapping at
> boot or when we allocate the rare write space, and the actual rare
> write code can just switch mms and do the write.
I did it. I have little sense of the actual amount of data
involved, but there is a (probably very remote) chance that the remap
wouldn't work, at least in the current implementation.
It's a bit different from what I had in mind initially, since I was
thinking of having one single approach to both statically allocated
memory (is there a better way to describe it?) and what is provided by
the allocator that would come next.
As I wrote, I do not particularly like the way I implemented multiple
functionality vs remapping, but I couldn't figure out any better way to
do it, so eventually I kept this one, hoping to get some advice on how
to improve it.
I have not provided an example yet, but IMA has some flags that are
probably very suitable, since they depend on policy reloading, which can
happen multiple times, but could also be used to disable it.
[1] https://www.openwall.com/lists/kernel-hardening/2018/12/04/3
--
igor

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 18:31 ` Igor Stoppa
2018-11-13 18:33 ` Igor Stoppa
@ 2018-11-13 18:48 ` Andy Lutomirski
2018-11-13 19:35 ` Igor Stoppa
1 sibling, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 18:48 UTC (permalink / raw)
To: Igor Stoppa
Cc: Nadav Amit, Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> On 13/11/2018 19:47, Andy Lutomirski wrote:
>
> > For general rare-writish stuff, I don't think we want IRQs running
> > with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>
> Why would these be less sensitive?
I'm not really saying they're less sensitive so much as that the
considerations are different. I think the original rare-write code is
based on ideas from grsecurity, and it was intended to protect static
data like structs full of function pointers. Those targets have some
different properties:
- Static targets are at addresses that are much more guessable, so
they're easier targets for most attacks. (Not spraying attacks like
the ones you're interested in, though.)
- Static targets are higher value. No offense to IMA or AVC, but
outright execution of shellcode, hijacking of control flow, or complete
disablement of core security features is higher impact than bypassing
SELinux or IMA. Why would you bother corrupting the AVC if you could
instead just set enforcing=0? (I suppose that corrupting the AVC is
less likely to be noticed by monitoring tools.)
- Static targets are small. This means that the interrupt latency
would be negligible, especially in comparison to the latency of
replacing the entire SELinux policy object.
Anyway, I'm not all that familiar with SELinux under the hood, but I'm
wondering if a different approach to things like the policy database
might be appropriate. When the policy is changed, rather than
allocating rare-write memory and writing to it, what if we instead
allocated normal memory, wrote to it, write-protected it, and then
used the rare-write infrastructure to do a much smaller write to
replace the pointer?
Admittedly, this creates a window where another core could corrupt the
data as it's being written. That may not matter so much if an
attacker can't force a policy update. Alternatively, the update code
could re-verify the policy after write-protecting it, or there could
be a fancy API to allocate some temporarily-writable memory (by
creating a whole new mm_struct, mapping the memory writable just in
that mm_struct, and activating it) so that only the actual policy
loader could touch the memory. But I'm mostly speculating here, since
I'm not familiar with the code in question.
Anyway, I tend to think that the right way to approach mainlining all
this is to first get the basic rare write support for static data into
place and then to build on that. I think it's great that you're
pushing this effort, but doing this for SELinux and IMA is a bigger
project than doing it for static data, and it might make sense to do
it in bite-sized pieces.
Does any of this make sense?

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 18:48 ` Andy Lutomirski
@ 2018-11-13 19:35 ` Igor Stoppa
0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 19:35 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Nadav Amit, Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On 13/11/2018 20:48, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>
>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>
>>> For general rare-writish stuff, I don't think we want IRQs running
>>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>>
>> Why would these be less sensitive?
>
> I'm not really saying they're less sensitive so much as that the
> considerations are different. I think the original rare-write code is
> based on ideas from grsecurity, and it was intended to protect static
> data like structs full of function pointers. Those targets have some
> different properties:
>
> - Static targets are at addresses that are much more guessable, so
> they're easier targets for most attacks. (Not spraying attacks like
> the ones you're interested in, though.)
>
> - Static targets are higher value. No offense to IMA or AVC, but
> outright execution of shellcode, hijacking of control flow, or complete
> disablement of core security features is higher impact than bypassing
> SELinux or IMA. Why would you bother corrupting the AVC if you could
> instead just set enforcing=0? (I suppose that corrupting the AVC is
> less likely to be noticed by monitoring tools.)
>
> - Static targets are small. This means that the interrupt latency
> would be negligible, especially in comparison to the latency of
> replacing the entire SELinux policy object.
Your analysis is correct.
In my case, having already taken care of those, I was going *also* after
the next target in line.
Admittedly, flipping a bit located at a fixed offset is way easier than
spraying dynamically allocated data structures.
However, once the bit is not easily writable, the only options are to
either find another way to flip it (unprotect it or subvert something
that can write it) or to identify another target that is still writable.
AVC and policyDB fit the latter description.
> Anyway, I'm not all that familiar with SELinux under the hood, but I'm
> wondering if a different approach to things like the policy database
> might be appropriate. When the policy is changed, rather than
> allocating rare-write memory and writing to it, what if we instead
> allocated normal memory, wrote to it, write-protected it, and then
> used the rare-write infrastructure to do a much smaller write to
> replace the pointer?
Actually, that's exactly what I did.
I did not want to overload this discussion, but since you brought it up,
I'm not sure write rare is enough.
* write_rare is for stuff that changes, sometimes all the time, ex: AVC
* dynamic read only is for stuff that at some point should not be
modified anymore, but could still be destroyed. Ex: policyDB
I think it would be good to differentiate, at runtime, between the two,
to minimize the chance that a write_rare function is used against some
read_only data.
Releasing dynamically allocated protected memory is also a big topic.
In some cases it's allocated and released continuously, like in the AVC.
Maybe it can be optimized, or maybe it can be turned into an object
cache of protected objects.
But for releasing, it would be good, I think, to have a mechanism for
freeing all the memory in one loop, like having a pool containing all
the memory that was allocated for a specific use (ex: policyDB)
> Admittedly, this creates a window where another core could corrupt the
> data as it's being written. That may not matter so much if an
> attacker can't force a policy update. Alternatively, the update code
> could re-verify the policy after write-protecting it, or there could
> be a fancy API to allocate some temporarily-writable memory (by
> creating a whole new mm_struct, mapping the memory writable just in
> that mm_struct, and activating it) so that only the actual policy
> loader could touch the memory. But I'm mostly speculating here, since
> I'm not familiar with the code in question.
They are all corner cases ... possible but unlikely.
Another, maybe more critical, one is that the policyDB is not available
at boot.
There is a window of opportunity, before it's loaded. But it could be
mitigated by loading a barebone set of rules, either from initrd or even
as "firmware".
> Anyway, I tend to think that the right way to approach mainlining all
> this is to first get the basic rare write support for static data into
> place and then to build on that. I think it's great that you're
> pushing this effort, but doing this for SELinux and IMA is a bigger
> project than doing it for static data, and it might make sense to do
> it in bite-sized pieces.
>
> Does any of this make sense?
Yes, sure.
I *have* to do SELinux, but I do not necessarily have to wait for the
final version to be merged upstream. And anyways Android is on a
different kernel.
However, I think both SELinux and IMA have a value in being sufficiently
complex cases to be used for validating the design as it evolves.
Each of them has static data that could be the first target for
protection, in a smaller patch.
Lists of write rare data are probably the next big thing, in terms of
defining the API.
But I could start with introducing __wr_after_init.
--
igor

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 17:16 ` Andy Lutomirski
2018-11-13 17:43 ` Nadav Amit
@ 2018-11-13 18:26 ` Igor Stoppa
2018-11-13 18:35 ` Andy Lutomirski
1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 18:26 UTC (permalink / raw)
To: Andy Lutomirski, Igor Stoppa
Cc: Kees Cook, Peter Zijlstra, Nadav Amit, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On 13/11/2018 19:16, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
[...]
>> How about having one mm_struct for each writer (core or thread)?
>>
>
> I don't think that helps anything. I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it)
write_rare / rarely can be shortened to wr_, which is somewhat less
confusing than rare_write, since the latter would become rw_ and be
easier to confuse with R/W.
Any advice for better naming is welcome.
> should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved.
Yes, that is fine. In my proposal I was thinking about tying it to the
core/thread that performs the actual write.
The high level API could be something like:
wr_memcpy(void *src, void *dst, uint_t size)
> It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU. Sadly, there isn't.
The mm_struct - or whatever is the means to do the write on that
architecture - can be kept hidden from the API.
But the reason why I was proposing to have one mm_struct per writer is
that, iiuc, the secondary mapping is created in the secondary mm_struct
for each writer using it.
So the updating of IMA measurements would have, theoretically, also
write access to the SELinux AVC. Which I was trying to avoid.
And similarly any other write rare updater. Is this correct?
>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
>
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working. I think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant. For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.
Yes, that is true. I think it's safe to assume, from an attack pattern,
that as long as user space is not started, the system can be considered
ok. Even user-space code run from initrd should be ok, since it can be
bundled (and signed) as a single binary with the kernel.
Modules loaded from a regular filesystem are a bit more risky, because
an attack might inject a rogue key in the key-ring and use it to load
malicious modules.
>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>>
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>>
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
>
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow. I don't think there's a generic way to check
> whether an interrupt is pending without turning interrupts on.
The only "excuse" I have is that write_rare is opt-in and is "rare".
Maybe the enabling/disabling of interrupts - and the consequent switch
of mm_struct - could be somehow tied to the latency configuration?
If preemption is disabled, the expectations on the system latency are
anyway more relaxed.
But I'm not sure how it would work against I/O.
--
igor

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 18:26 ` Igor Stoppa
@ 2018-11-13 18:35 ` Andy Lutomirski
2018-11-13 19:01 ` Igor Stoppa
0 siblings, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 18:35 UTC (permalink / raw)
To: Igor Stoppa
Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Nadav Amit, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> On 13/11/2018 19:16, Andy Lutomirski wrote:
>
> > should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved.
>
> Yes, that is fine. In my proposal I was thinking about tying it to the
> core/thread that performs the actual write.
>
> The high level API could be something like:
>
> wr_memcpy(void *src, void *dst, uint_t size)
>
> > It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU. Sadly, there isn't.
>
> The mm_struct - or whatever is the means to do the write on that
> architecture - can be kept hidden from the API.
>
> But the reason why I was proposing to have one mm_struct per writer is
> that, iiuc, the secondary mapping is created in the secondary mm_struct
> for each writer using it.
>
> So the updating of IMA measurements would have, theoretically, also
> write access to the SELinux AVC. Which I was trying to avoid.
> And similarly any other write rare updater. Is this correct?
If you call a wr_memcpy() function with the signature you suggested,
then you can overwrite any memory of this type. Having a different
mm_struct under the hood makes no difference. As far as I'm
concerned, for *dynamically allocated* rare-writable memory, you might
as well allocate all the paging structures at the same time, so the
mm_struct will always contain the mappings. If there are serious bugs
in wr_memcpy() that cause it to write to the wrong place, we have
bigger problems.
I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
but that can wait for v2. And it still doesn't obviously need
multiple mm_structs.
>
> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working. I think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant. For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
>
> Yes, that is true. I think it's safe to assume, from an attack pattern,
> that as long as user space is not started, the system can be considered
> ok. Even user-space code run from initrd should be ok, since it can be
> bundled (and signed) as a single binary with the kernel.
>
> Modules loaded from a regular filesystem are a bit more risky, because
> an attack might inject a rogue key in the key-ring and use it to load
> malicious modules.
If a malicious module is loaded, the game is over.
>
> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow. I don't think there's a generic way to check
> > whether an interrupt is pending without turning interrupts on.
>
> The only "excuse" I have is that write_rare is opt-in and is "rare".
> Maybe the enabling/disabling of interrupts - and the consequent switch
> of mm_struct - could be somehow tied to the latency configuration?
>
> If preemption is disabled, the expectations on the system latency are
> anyway more relaxed.
>
> But I'm not sure how it would work against I/O.
I think it's entirely reasonable for the API to internally break up
very large memcpys.

*Re: [PATCH 10/17] prmem: documentation
2018-11-13 18:35 ` Andy Lutomirski
@ 2018-11-13 19:01 ` Igor Stoppa
0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 19:01 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Nadav Amit, Mimi Zohar,
Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Thomas Gleixner
On 13/11/2018 20:35, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
[...]
>> The high level API could be something like:
>>
>> wr_memcpy(void *src, void *dst, uint_t size)
[...]
> If you call a wr_memcpy() function with the signature you suggested,
> then you can overwrite any memory of this type. Having a different
> mm_struct under the hood makes no difference. As far as I'm
> concerned, for *dynamically allocated* rare-writable memory, you might
> as well allocate all the paging structures at the same time, so the
> mm_struct will always contain the mappings. If there are serious bugs
> in wr_memcpy() that cause it to write to the wrong place, we have
> bigger problems.
Beside bugs, I'm also thinking about possible vulnerability.
It might be overthinking, though.
I do not think it's possible to really protect against control flow
attacks, unless there is some support from the HW and/or the compiler.
What is left, are data-based attacks. In this case, it would be an
attacker using one existing wr_ invocation with doctored parameters.
However, there is always the objection that it would be possible to come
up with a "writing kit" for plowing through the page tables and
unprotect anything that might be of value.
Ideally, that should be the only type of data-based attack left.
In practice, it might just be an excess of paranoia from my side.
> I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
> but that can wait for v2. And it still doesn't obviously need
> multiple mm_structs.
I left that out, for now, but yes, typing would play some role here.
[...]
> I think it's entirely reasonable for the API to internally break up
> very large memcpys.
The criteria for deciding if/how to break it down are not clear to me,
though. The single page was easy, but might be (probably is) too much.
--
igor

*Re: [PATCH 10/17] prmem: documentation
2018-10-30 16:37 ` Kees Cook
2018-10-30 17:06 ` Andy Lutomirski
@ 2018-10-31 9:27 ` Igor Stoppa
1 sibling, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-31 9:27 UTC (permalink / raw)
To: Kees Cook, Peter Zijlstra
Cc: Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
Michal Hocko, Kernel Hardening, linux-integrity,
linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport,
open list:DOCUMENTATION, LKML, Andy Lutomirski, Thomas Gleixner
On 30/10/2018 18:37, Kees Cook wrote:
> On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> I suppose the 'normal' attack goes like:
>>
>> 1) find buffer-overrun / bound check failure
>> 2) use that to write to 'interesting' location
>> 3) that write results in arbitrary code execution
>> 4) win
>>
>> Of course, if the store of 2 is to the current cred structure, and
>> simply sets the effective uid to 0, we can skip 3.
>
> In most cases, yes, gaining root is game over. However, I don't want
> to discount other threat models: some systems have been designed not
> to trust root, so a cred attack doesn't always get an attacker full
> control (e.g. lockdown series, signed modules, encrypted VMs, etc).
Reading these points I see where I was not clear.
Maybe I can fix that. SELinux is a good example of a safeguard against a
takeover of root, so it is a primary target. Unfortunately it contains
some state variables that can be used to disable it.
Another attack that comes to mind, for executing code within the
kernel without sweating too much over ROP, is the following:
1) find the rabbit hole to write kernel data, whatever it might be
2) spray the keyring with one's own key
3) use the module loading infrastructure to load one's own module, signed
with the key from point 2)
Just to say that direct arbitrary code execution is becoming harder to
perform, but there are ways around it which rely instead on overwriting
unprotected data.
That is lower-hanging fruit for an attacker.
--
igor

*Re: [PATCH 10/17] prmem: documentation
In reply to: 2018-10-26 9:26 ` Peter Zijlstra
From: Igor Stoppa @ 2018-10-29 20:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, kernel-hardening, linux-integrity,
linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
linux-kernel
On 26/10/2018 10:26, Peter Zijlstra wrote:
>> +- Kernel code is protected at system level and, unlike data, it doesn't
>> + require special attention.
>
> What does this even mean?
I was trying to convey the notion that the pages containing kernel code
do not require any special handling by the author of a generic kernel
component, for example a kernel driver.
Pages containing either statically or dynamically allocated data,
instead, are not automatically protected.
But yes, the sentence is far from being clear.
>
>> +Protection mechanism
>> +--------------------
>> +
>> +- When available, the MMU can write protect memory pages that would be
>> + otherwise writable.
>
> Again; what does this really want to say?
That it is possible to use the MMU also for write-protecting pages
containing data which was not declared as constant.
>
>> +- The protection has page-level granularity.
>
> I don't think Linux supports non-paging MMUs.
This probably came from a model I had in mind where a separate execution
environment, such as a hypervisor, could trap and filter writes to
certain parts of a page, rejecting some and performing others,
effectively emulating sub-page granularity.
>
>> +- An attempt to overwrite a protected page will trigger an exception.
>> +- **Write protected data must go exclusively to write protected pages**
>> +- **Writable data must go exclusively to writable pages**
>
> WTH is with all those ** ?
[...]
> Can we ditch all the ** nonsense and put whitespace in there? More paragraphs
> and whitespace are more good.
Yes
> Also, I really don't like how you differentiate between static and
> dynamic wr.
ok, but why? what would you suggest, instead?
[...]
> We already have RO, why do you need more RO?
I can explain, but I'm at a loss about the best place: I was under
the impression that this sort of document should focus mostly on the API
and its use. I was even considering removing most of the explanation and
putting it in a separate document instead.
>
>> +
>> + **Remarks:**
>> + - The "AUTO" modes perform automatic protection of the content, whenever
>> + the current vmap_area is used up and a new one is allocated.
>> + - At that point, the vmap_area being phased out is protected.
>> + - The size of the vmap_area depends on various parameters.
>> + - It might not be possible to know for sure *when* certain data will
>> + be protected.
>
> Surely that is a problem?
>
>> + - The functionality is provided as tradeoff between hardening and speed.
>
> Which you fail to explain.
>
>> + - Its usefulness depends on the specific use case at hand
>
> How about you write sensible text inside the option descriptions
> instead?
>
> This is not a presentation; less bullets, more content.
I tried to say something without saying too much, but it was clearly a
bad choice.
Where should I put a thoroughly detailed explanation?
Here or in a separate document?
>
>> +- Not only the pmalloc memory must be protected, but also any reference to
>> + it that might become the target for an attack. The attack would replace
>> + a reference to the protected memory with a reference to some other,
>> + unprotected, memory.
>
> I still don't really understand the whole write-rare thing; how does it
> really help? If we can write in kernel memory, we can write to
> page-tables too.
Kees has already answered this, but I'll also provide my own answer:
the exploits used to write to kernel memory are specific to certain
products and SW builds, so it's not possible to generalize too much;
however, there might be some limitation in the reach of a specific
vulnerability. For example, if a memory location is referred to as an
offset from a base address, the maximum size of the offset limits the
scope of the attack. That might make it impossible to use that specific
vulnerability for writing directly to the page table. But something
else, say a statically allocated variable, might be within reach.
That said, there is also the almost orthogonal use case of providing
increased robustness by trapping accidental modifications caused by bugs.
> And I don't think this document even begins to explain _why_ you're
> doing any of this. How does it help?
ok, point taken
>> +- The users of rare write must take care of ensuring the atomicity of the
>> + action, respect to the way they use the data being altered; for example,
>> + take a lock before making a copy of the value to modify (if it's
>> + relevant), then alter it, issue the call to rare write and finally
>> + release the lock. Some special scenario might be exempt from the need
>> + for locking, but in general rare-write must be treated as an operation
>> + that can incur into races.
>
> What?!
Probably something along the lines of:
"users of write-rare are responsible for using mechanisms that allow
reading/writing data in a consistent way"
and if it seems obvious, I can just drop it.
One of the problems I have faced is deciding what level of knowledge or
understanding I should expect from the reader of such a document: what I
can take for granted and what I should explain.
--
igor

*Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
In reply to: 2018-10-24 11:37 ` Mathieu Desnoyers
From: Igor Stoppa @ 2018-10-24 14:03 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
James Morris, Michal Hocko, kernel-hardening, linux-integrity,
linux-security-module, igor stoppa, Dave Hansen, Jonathan Corbet,
Laura Abbott, Thomas Gleixner, Kate Stewart, David S. Miller,
Greg Kroah-Hartman, Philippe Ombredanne, Paul E. McKenney,
Josh Triplett, rostedt, Lai Jiangshan, linux-kernel
On 24/10/18 14:37, Mathieu Desnoyers wrote:
> I could not find a description of the overall context of this patch
> (e.g. a patch 00/17 ?) that would explain the attack vectors this aims
> to protect against.
Apologies, I have to admit I was a bit baffled about what to do: the
patchset spans several subsystems and I was reluctant to spam
all the mailing lists.
I was hoping that by CC-ing kernel.org, the explicit recipients would
get both the mail directly intended for them (as subsystem
maintainers/supporters) and the rest.
The next time I'll err in the opposite direction.
In the meanwhile, please find the whole set here:
https://www.openwall.com/lists/kernel-hardening/2018/10/23/3
> This might help figuring out whether this added complexity in the core
> kernel is worth it.
I hope it will.
> Also, is it the right approach to duplicate existing APIs, or should we
> rather hook into page fault handlers and let the kernel do those "shadow"
> mappings under the hood ?
This question is probably a good candidate for the small Q&A section I
have in the 00/17.
> Adding a new GFP flags for dynamic allocation, and a macro mapping to
> a section attribute might suffice for allocation or definition of such
> mostly-read-only/seldom-updated data.
I think what you are proposing makes sense from a pure hardening
standpoint. From a more defensive one, I'd rather minimise the chances
of giving a free pass to an attacker.
Maybe there is a better implementation of this than what I have in
mind. But, based on my current understanding of what you are describing,
there would be a few issues:
1) where would the pool go? The pool is a way to manage multiple vmas
and express the common properties they share, even before a vma is
associated with the pool.
2) there would be more code that can seamlessly deal with both protected
and regular data. Based on what? Some parameter, I suppose.
That parameter would be the new target.
If the code is "duplicated", as you say, the actual differences are
baked in at compile time. The "duplication" would also allow having
always-inlined functions for write-rare, while leaving more freedom to
the compiler for their non-protected version.
Besides, I think the separate wr version also makes it very clear, to
the user of the API, that there will be a price to pay in terms of
performance. The more seamless alternative might make this price less
obvious.
--
igor

*Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
In reply to: 2018-10-24 14:03 ` Igor Stoppa
From: Tycho Andersen @ 2018-10-24 14:56 UTC (permalink / raw)
To: Igor Stoppa
Cc: Mathieu Desnoyers, Mimi Zohar, Kees Cook, Matthew Wilcox,
Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
linux-integrity, linux-security-module, igor stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Thomas Gleixner, Kate Stewart,
David S. Miller, Greg Kroah-Hartman, Philippe Ombredanne,
Paul E. McKenney, Josh Triplett, rostedt, Lai Jiangshan,
linux-kernel
On Wed, Oct 24, 2018 at 05:03:01PM +0300, Igor Stoppa wrote:
> On 24/10/18 14:37, Mathieu Desnoyers wrote:
> > Also, is it the right approach to duplicate existing APIs, or should we
> > rather hook into page fault handlers and let the kernel do those "shadow"
> > mappings under the hood ?
>
> This question is probably a good candidate for the small Q&A section I have
> in the 00/17.
>
>
> > Adding a new GFP flags for dynamic allocation, and a macro mapping to
> > a section attribute might suffice for allocation or definition of such
> > mostly-read-only/seldom-updated data.
>
> I think what you are proposing makes sense from a pure hardening standpoint.
> From a more defensive one, I'd rather minimise the chances of giving a free
> pass to an attacker.
>
> Maybe there is a better implementation of this, than what I have in mind.
> But, based on my current understanding of what you are describing, there
> would be few issues:
>
> 1) where would the pool go? The pool is a way to manage multiple vmas and
> express common property they share. Even before a vma is associated to the
> pool.
>
> 2) there would be more code that can seamlessly deal with both protected and
> regular data. Based on what? Some parameter, I suppose.
> That parameter would be the new target.
> If the code is "duplicated", as you say, the actual differences are baked in
> at compile time. The "duplication" would also allow to have always inlined
> functions for write-rare and leave more freedom to the compiler for their
> non-protected version.
>
> Besides, I think the separate wr version also makes it very clear, to the
> user of the API, that there will be a price to pay, in terms of performance.
> The more seamless alternative might make this price less obvious.
What about something in the middle, where we move list to list_impl.h,
and add a few macros where you have list_set_prev() in prlist now, so
we could do,
// prlist.h
#define list_set_next(head, next) wr_ptr(&head->next, next)
#define list_set_prev(head, prev) wr_ptr(&head->prev, prev)
#include <linux/list_impl.h>
// list.h
#define list_set_next(head, next) ((head)->next = (next))
#define list_set_prev(head, prev) ((head)->prev = (prev))
#include <linux/list_impl.h>
I wonder then if you can get rid of some of the type punning too? It's
not clear exactly why that's necessary from the series, but perhaps
I'm missing something obvious :)
I also wonder how much the actual differences being baked in at
compile time makes. Most (all?) of this code is inlined.
Cheers,
Tycho

*Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
In reply to: 2018-10-24 14:56 ` Tycho Andersen
From: Igor Stoppa @ 2018-10-24 22:52 UTC (permalink / raw)
To: Tycho Andersen
Cc: Mathieu Desnoyers, Mimi Zohar, Kees Cook, Matthew Wilcox,
Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
linux-integrity, linux-security-module, igor stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Thomas Gleixner, Kate Stewart,
David S. Miller, Greg Kroah-Hartman, Philippe Ombredanne,
Paul E. McKenney, Josh Triplett, rostedt, Lai Jiangshan,
linux-kernel
On 24/10/2018 17:56, Tycho Andersen wrote:
> On Wed, Oct 24, 2018 at 05:03:01PM +0300, Igor Stoppa wrote:
>> On 24/10/18 14:37, Mathieu Desnoyers wrote:
>>> Also, is it the right approach to duplicate existing APIs, or should we
>>> rather hook into page fault handlers and let the kernel do those "shadow"
>>> mappings under the hood ?
>>
>> This question is probably a good candidate for the small Q&A section I have
>> in the 00/17.
>>
>>
>>> Adding a new GFP flags for dynamic allocation, and a macro mapping to
>>> a section attribute might suffice for allocation or definition of such
>>> mostly-read-only/seldom-updated data.
>>
>> I think what you are proposing makes sense from a pure hardening standpoint.
>> From a more defensive one, I'd rather minimise the chances of giving a free
>> pass to an attacker.
>>
>> Maybe there is a better implementation of this, than what I have in mind.
>> But, based on my current understanding of what you are describing, there
>> would be few issues:
>>
>> 1) where would the pool go? The pool is a way to manage multiple vmas and
>> express common property they share. Even before a vma is associated to the
>> pool.
>>
>> 2) there would be more code that can seamlessly deal with both protected and
>> regular data. Based on what? Some parameter, I suppose.
>> That parameter would be the new target.
>> If the code is "duplicated", as you say, the actual differences are baked in
>> at compile time. The "duplication" would also allow to have always inlined
>> functions for write-rare and leave more freedom to the compiler for their
>> non-protected version.
>>
>> Besides, I think the separate wr version also makes it very clear, to the
>> user of the API, that there will be a price to pay, in terms of performance.
>> The more seamless alternative might make this price less obvious.
>
> What about something in the middle, where we move list to list_impl.h,
> and add a few macros where you have list_set_prev() in prlist now, so
> we could do,
>
> // prlist.h
>
> #define list_set_next(head, next) wr_ptr(&head->next, next)
> #define list_set_prev(head, prev) wr_ptr(&head->prev, prev)
>
> #include <linux/list_impl.h>
>
> // list.h
>
> #define list_set_next(head, next) ((head)->next = (next))
> #define list_set_prev(head, prev) ((head)->prev = (prev))
>
> #include <linux/list_impl.h>
>
> I wonder then if you can get rid of some of the type punning too? It's
> not clear exactly why that's necessary from the series, but perhaps
> I'm missing something obvious :)
nothing obvious, probably there is only half a reference in the slides I
linked to in the cover letter :-)
So far I have minimized the number of "intrinsic" write rare functions,
mostly because I would want first to reach an agreement on the
implementation of the core write-rare.
However, once that is done, it might be good to convert the prlists
to "intrinsics" too. A list node is 2 pointers.
If that was the alignment, i.e. __align(sizeof(list_head)), it might be
possible to speed up the list handling a lot, even as write rare.
Taking the insertion operation as an example, it would probably be
sufficient, in most cases, to have only two remappings:
- one covering the page with the latest two nodes
- one covering the page with the list head
That is 2 vs 8 remappings, and a good deal fewer memory barriers.
This would be incompatible with what you are proposing, yet it would be
justifiable, I think, because it would give prlist better performance,
potentially widening its adoption where performance is a concern.
> I also wonder how much the actual differences being baked in at
> compile time makes. Most (all?) of this code is inlined.
If the inlined function expects to receive a prlist_head *, instead of a
list_head *, doesn't it help turn runtime bugs into build-time bugs?
Or maybe I miss your point?
--
igor

*Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
2018-10-24 22:52 ` Igor Stoppa@ 2018-10-25 8:11 ` Tycho Andersen0 siblings, 0 replies; 140+ messages in thread
From: Tycho Andersen @ 2018-10-25 8:11 UTC (permalink / raw)
To: Igor Stoppa
Cc: Mathieu Desnoyers, Mimi Zohar, Kees Cook, Matthew Wilcox,
Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
linux-integrity, linux-security-module, igor stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Thomas Gleixner, Kate Stewart,
David S. Miller, Greg Kroah-Hartman, Philippe Ombredanne,
Paul E. McKenney, Josh Triplett, rostedt, Lai Jiangshan,
linux-kernel
On Thu, Oct 25, 2018 at 01:52:11AM +0300, Igor Stoppa wrote:
> On 24/10/2018 17:56, Tycho Andersen wrote:
> > On Wed, Oct 24, 2018 at 05:03:01PM +0300, Igor Stoppa wrote:
> > > On 24/10/18 14:37, Mathieu Desnoyers wrote:
> > > > Also, is it the right approach to duplicate existing APIs, or should we
> > > > rather hook into page fault handlers and let the kernel do those "shadow"
> > > > mappings under the hood ?
> > >
> > > This question is probably a good candidate for the small Q&A section I have
> > > in the 00/17.
> > >
> > >
> > > > Adding a new GFP flags for dynamic allocation, and a macro mapping to
> > > > a section attribute might suffice for allocation or definition of such
> > > > mostly-read-only/seldom-updated data.
> > >
> > > I think what you are proposing makes sense from a pure hardening standpoint.
> > > From a more defensive one, I'd rather minimise the chances of giving a free
> > > pass to an attacker.
> > >
> > > Maybe there is a better implementation of this, than what I have in mind.
> > > But, based on my current understanding of what you are describing, there
> > > would be few issues:
> > >
> > > 1) where would the pool go? The pool is a way to manage multiple vmas and
> > > express common property they share. Even before a vma is associated to the
> > > pool.
> > >
> > > 2) there would be more code that can seamlessly deal with both protected and
> > > regular data. Based on what? Some parameter, I suppose.
> > > That parameter would be the new target.
> > > If the code is "duplicated", as you say, the actual differences are baked in
> > > at compile time. The "duplication" would also allow to have always inlined
> > > functions for write-rare and leave more freedom to the compiler for their
> > > non-protected version.
> > >
> > > Besides, I think the separate wr version also makes it very clear, to the
> > > user of the API, that there will be a price to pay, in terms of performance.
> > > The more seamless alternative might make this price less obvious.
> >
> > What about something in the middle, where we move list to list_impl.h,
> > and add a few macros where you have list_set_prev() in prlist now, so
> > we could do,
> >
> > // prlist.h
> >
> > #define list_set_next(head, next) wr_ptr(&head->next, next)
> > #define list_set_prev(head, prev) wr_ptr(&head->prev, prev)
> >
> > #include <linux/list_impl.h>
> >
> > // list.h
> >
> > #define list_set_next(head, next) ((head)->next = (next))
> > #define list_set_prev(head, prev) ((head)->prev = (prev))
> >
> > #include <linux/list_impl.h>
> >
> > I wonder then if you can get rid of some of the type punning too? It's
> > not clear exactly why that's necessary from the series, but perhaps
> > I'm missing something obvious :)
>
> nothing obvious, probably there is only half a reference in the slides I
> linked-to in the cover letter :-)
>
> So far I have minimized the number of "intrinsic" write rare functions,
> mostly because I would want first to reach an agreement on the
> implementation of the core write-rare.
>
> However, once that is done, it might be good to convert also the prlists to
> be "intrinsics". A list node is 2 pointers.
> If that was the alignment, i.e. __align(sizeof(list_head)), it might be
> possible to speed up a lot the list handling even as write rare.
>
> Taking as example the insertion operation, it would be probably sufficient,
> in most cases, to have only two remappings:
> - one covering the page with the latest two nodes
> - one covering the page with the list head
>
> That is 2 vs 8 remappings, and a good deal of memory barriers less.
>
> This would be incompatible with what you are proposing, yet it would be
> justifiable, I think, because it would provide better performance to prlist,
> potentially widening its adoption, where performance is a concern.
I guess the writes to these are rare, right? So perhaps it's not such
a big deal :)
> > I also wonder how much the actual differences being baked in at
> > compile time makes. Most (all?) of this code is inlined.
>
> If the inlined function expects to receive a prlist_head *, instead of a
> list_head *, doesn't it help turning runtime bugs into buildtime bugs?
In principle it's not a bug to use the prmem helpers where the regular
ones would do, it's just slower (assuming the types are the same). But
mostly, it's a way to avoid actually copying and pasting most of the
implementations of most of the data structures. I see some other
replies in the thread already, but this seems not so good to me.
Tycho

*Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
In reply to: 2018-10-24 14:03 ` Igor Stoppa
From: Steven Rostedt @ 2018-10-28 9:52 UTC (permalink / raw)
To: Igor Stoppa
Cc: Mathieu Desnoyers, Mimi Zohar, Kees Cook, Matthew Wilcox,
Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
linux-integrity, linux-security-module, igor stoppa, Dave Hansen,
Jonathan Corbet, Laura Abbott, Thomas Gleixner, Kate Stewart,
David S. Miller, Greg Kroah-Hartman, Philippe Ombredanne,
Paul E. McKenney, Josh Triplett, Lai Jiangshan, linux-kernel
On Wed, 24 Oct 2018 17:03:01 +0300
Igor Stoppa <igor.stoppa@gmail.com> wrote:
> I was hoping that by CC-ing kernel.org, the explicit recipients would
> get both the mail directly intended for them (as subsystem
> maintainers/supporters) and the rest.
>
> The next time I'll err in the opposite direction.
Please don't.
>
> In the meanwhile, please find the whole set here:
>
> https://www.openwall.com/lists/kernel-hardening/2018/10/23/3
Note, it is critical that every change log stands on its own. You do
not need to Cc the entire patch set to everyone. Each patch should have
enough information in it to know exactly what the patch does.
It's OK if each change log has duplicate information from other
patches. The important part is that one should be able to look at the
change log of a specific patch and understand exactly what the patch is
doing.
This is because 5 years from now, if someone does a git blame and comes
up with this commit, they won't have access to the patch series. All
they will have is this single commit log to explain why these changes
were made.
If a change log depends on other commits for context, it is
insufficient.
Thanks,
-- Steve