Monday, May 15, 2017

Are we doing memory corruption mitigations wrong?

Introduction

Before we get into it, let's start by stating that the progression of memory corruption mitigations over the years has been intensely valuable. The progression of mitigations continues to make exploiting bugs harder and more time consuming. The pool of people who have both the skill and commitment to exploit any given bug (either reliably or at all) is shrinking.

The list of mitigations to credit is long, but includes:

Non-executable virtual memory regions.

ASLR (address space layout randomization).

Stack canaries.

Stack variable re-ordering.

Heap metadata hardening.

Heap partitioning.

Control flow integrity (Microsoft CFG, Clang CFI).

etc.

The items in the above list have this in common: they target the side effects of memory corruption, as opposed to the memory corruption itself. Some of the side effects targeted are close to the time of the memory corruption, such as a stack canary being checked when a function that caused the corruption returns. Other side effects targeted are potentially very far from the time of the memory corruption, such as control flow integrity on a virtual call, after the attacker has read and written memory arbitrarily.

When targeting side effects of a problem, as opposed to the problem itself, it's often hard to prevent the problem.

I started working on this problem but unfortunately have to put it aside due to other commitments at this time. This blog post documents my thoughts and notes for anyone interested in this problem space.

An alternative?

Should we invest more time in technologies that attempt to stop memory corruption in the first place? I think we should. It's certainly an under researched area.

Let's first clarify that we're talking about defenses for existing C and C++ programs. There are plenty of essays advocating the replacement of C programs with those written in something safer, such as Go. This is a good longer term suggestion but in the shorter term, there's a whole boatload of C legacy that no-one is rushing to move away from.

Performance

The main challenge immediately raised when talking about blocking memory corruption at the source is performance. For sure, this is going to be challenging, but here are the tools and situations that may help us:

Modern compiler frameworks. A modern compiler framework such as LLVM allows security enhancement transforms to be applied before the optimization passes, so the compiler can lift security checks out of loops, etc.

64-bit processors. Newer processors tend to offer more registers and also "spare" bits in pointer values, both of which may lessen the performance impact of schemes previously tried on 32-bit.

New processor instructions. As yet unexplored, but do processor extensions such as Intel MPX offer any paths to thorough yet high performance coverage?

Rollback of mitigations. With a C runtime that does not permit buffer bounds to be exceeded, certain mitigations become redundant and can be rolled back. For example, if pointers are bounds checked, there's no urgent need for stack canary checking (which cost a percent or two), because it becomes hard to get the return address corrupted. Likewise for heap metadata hardening and possibly CFI (which also cost a percent or two).

Faster heap implementation. Some of the ideas below rely on fast mapping of pointers to heap chunk metadata, such as is done in partitionAlloc() in Chrome. It just so happens that partitionAlloc is faster than almost anything else (including tcmalloc and certainly glibc malloc), so some performance could be clawed back here.

Stricter C / C++ semantics

In the quest for safer compiled C / C++, we might consider placing stricter requirements on the program. For example, while it's already undefined behavior to create certain out-of-bounds pointers, we could compile to a stricter variant of this.

Or, if we decide to dabble in reference counting, it could be mandatory to NULL out all pointer references to a chunk before calling free().

Reserving the top unused bits of 64-bit pointers might also be useful.

Scheme #1: Safe memcpy() with fast pointer lineage checking

This first scheme illustrates how there are some low hanging fruits with a moderate impact and trivial performance overhead. Imagine if memcpy() overflow from one heap chunk to another would be a thing of the past? It fairly trivially can.

Imagine a runtime system where the heap implementation is partitionAlloc(). This heap implementation can take an arbitrary pointer that is part way into an allocation chunk, and reliably and quickly return the chunk size, and the chunk pointer base. So you can have a memcpy() that does this (pseudocode):

void* memcpy(void* dst, const void* src, size_t len) {

char* dst_base;

char* dst_extent;

size_t dst_size;

partitionAllocGetPtrDetails(&dst_base, &dst_size, dst);

dst_extent = dst_base + dst_size;

if (len > dst_extent - dst) {

TAKE_A_DUMP();

}

/* Add a similar check for src, why not. */

...

}

Pretty neat huh? However, there are a lot of challenges and polish to apply before this could be 100% awesome:

partitionAlloc() currently couldn't provide a working partitionAllocGetPtrDetails() for arbitrary middle-of-chunk pointers inside allocations greater than about 1MB or so. This would need to be fixed, which might require a rejig of where metadata is stored, to be more ASAN-like.

You'd need to stop the compiler thinking it knows all about memcpy() and inlining memcpy() for small fixed size values.

This scheme will crash horribly if a stack, BSS or other non-heap pointer value is passed to memcpy(). A generic method of mapping arbitrary pointers to memory metadata needs to be extended to all pointers -- or the checks need to only fire for heap locations.

This scheme won't protect for intra-chunk overflows, e.g. if the memcpy() destination is a buffer within a heap allocated struct, and the struct contains sensitive data after the buffer. Perhaps a candidate here for compiler struct re-ordering?

But back to awesomeness, note that memcpy() is just a simple loop that iterates memory so you can imagine adding compiler support to decorate more complicated loops with high performance bounds checks.

Scheme #2: Pointer invalidation with pointer arithmetic checking

We can make the observation that a lot of existing memory checker tools perform checks on every pointer dereference (e.g. ASAN). However, the time when a pointer goes from "good" to "bad" is often when pointer arithmetic is applied to a pointer, such as bumping a pointer along at the end of a loop.

Given that pointer arithmetic is a little less common that pointer dereference, would it be higher performance to instead do checks when a pointer is incremented to see if it crosses a chunk boundary? As a bonus, this would provide stronger guarantees than the ASAN model because in the ASAN world, e.g. *(ptr + 70) might not fault if that jumps over a redzone.

Scheme #3: Use-after-free detection with pointer cookies

Somewhat specific to 64-bit, but let's take x86-64 and presume that we reserve the upper 16 bits of every pointer value to stamp in a cookie. Then:

At malloc() time, a random 16 bit cookie is assigned to the heap chunk and stored as metadata associated with the chunk.

The returned pointer has the cookie value as the upper 16 bits.

Copying a pointer around, or doing pointer arithmetic, proceeds at full speed and preserves the upper 16 bits.

Dereferencing a pointer requires masking off the upper 16 bits and comparing them to the cookie value associated with the referenced heap chunk.

As an optimization, the cookie value only needs to be re-checked if free() has been called, or a function is called that might call free(). So tight loops proceed with only the one check up front.

When a heap chunk is freed, the cookie value is cleared. The cookie values for existing pointers into the chunk will not match while the chunk is free, and also will not match if the chunk is reallocated.

Scheme #n: Your scheme here

I'm sure there are cleverer people than me out there who can propose better schemes. My point is that we need to try, because there are a ton of little tricks we can pull in this space to make some defenses that are both meaningful and fast.

Conclusion

I really think we can stop memory corruption at the source for some vulnerability classes. There's going to be a performance hit but we can claw some of it back by retiring existing mitigations as they become less relevant.

And is there any reason not to burn a couple years of Moore's law to kill off some memory corruption classes, instead of repeatedly trying to patch up their symptoms?

4 comments:

In SPARC, we've added to the architecture Application Data Integrity (part of Silicon Secured Memory), which allows us to check at runtime for linear memory corruption, with fairly low impact (to the point that we can enable it at large). The idea behind ADI stems from one of the things you point out above: the VA space is larger than the physical space, so we get extra bits of information that we can use to 'tag' memory. This is has been theorized for a long time, but has always hit a wall when the backing memory left the caching hierarchy. On SPARC, we've solved this by storing the metadata all the way down to the physical DIMM.

and I'm looking forward to the next release of Solaris to showcase how we've leveraged it system wide for different bug classes ;) (the amount of incorrect handling that just popped up across the board by throwing test suites to run under our new defenses has been pretty fascinating)

This idea isn't exactly new, but why don't we simply split the stack? Like having one for data and one for return addresses. So even if the data stack overflows, we wouldn't need such things as control flow integrety, because return addresses can not be overwritten with a simple memcpy stack overflow. And to protect against data corruption we still have stack cookies. That should even be possible with x86 cpus. Basepointer for return addresses and stackpointer for data. I mean why are there even 2 pointers? arm can do perfectly fine addressing local variables with stackpointer only.

The split stack stuff is what some CFI solutions (e.g. CPI) do, and basically also what Intel CET does (shadow stack for return addresses), but it only covers the case where you actually want to mount a ROP attack. There are a number of data-only attacks that rely on memory corruption and do not need to corrupt the saved return address.

You also ask why there are two pointers: that's mostly to ease debugging when chaining together frames. The use of the base pointer is not required by the architecture, x86 can do just as fine using EBP/RBP as a general purpose register: e.g. see -fomit-frame-pointer GCC option.