This document describes the inner workings of PaX's reference counter protection and aims to build a larger community around the project.

It begins with an overview of the PaX Project and reference counter protection, goes into a higher-level explanation of how it is implemented, and then goes deep inside the implementation's code, enabling readers to more easily port the feature to new platforms.

Special focus is placed on the reasoning for the PowerPC implementation, developed by the author while learning the internals of this protection mechanism.

PaX is a Linux kernel patch project that provides stronger security and limits the exploitation primitives available in a system. For some bug classes (for example, user-mode pointer dereferences), PaX is able to fully prevent exploitation of the entire class. The project was created in 2000.

PaX can be considered a host intrusion prevention system (HIPS) and focuses on the prevention of memory corruption bugs.

Reference counters (refcounts for short) are a way for an operating system to control access to allocated objects [2].

When there is a path inside the kernel where a refcount is incremented more often than it is decremented [3], we have what is called a reference counter overflow bug. It can be exploited by repeatedly forcing the leaky path to execute until the counter reaches INT_MAX. The next increment then wraps the counter to INT_MIN, and each subsequent increment brings it closer to zero. Normally, a reference counter reaching zero is a sign that the object containing it no longer has any users and can be safely freed. In the case of our exploit, however, the object still has several legitimate users. So ideally we want the counter to wrap all the way around to the value of one and then trigger a legitimate release of the object. This exercises the code path that decrements the reference counter and checks its new value against zero; since the new value is zero, the object is freed while it is still in use. We now have a classic use-after-free vulnerability, which can be exploited to achieve code execution, information leaks, and more [4].
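The wraparound arithmetic can be illustrated with a small user-space sketch. This is not kernel code: leaky_inc(), incs_until_one() and put_frees_object() are hypothetical names invented for this illustration, and the additions are done on unsigned values because signed overflow is undefined behaviour in C (the conversion back to int assumes the usual two's complement behaviour of GCC and Clang).

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical model of an unprotected increment: it wraps silently,
 * like a plain atomic increment without overflow detection. */
static int leaky_inc(int refcount)
{
    /* Add in unsigned arithmetic; signed overflow is undefined in C. */
    return (int)((uint32_t)refcount + 1u);
}

/* How many further increments are needed to bring `refcount` to one. */
static uint32_t incs_until_one(int refcount)
{
    return 1u - (uint32_t)refcount;
}

/* Model of a release path: decrement, then report whether the new
 * value is zero (in which case the object would be freed). */
static int put_frees_object(int *refcount)
{
    *refcount = (int)((uint32_t)*refcount - 1u);
    return *refcount == 0;
}
```

Starting from INT_MIN, incs_until_one() gives 2^31 + 1 further executions of the leaky path, after which a single legitimate release frees the object.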

It is not difficult to see that other integer elements in the system also have potential for overflows and could be protected in the same way. PaX's SIZE_OVERFLOW plugin can achieve this in certain common and security-relevant cases [5].

Assume that all other paths touching the same refcount leave it unchanged and that an attacker starts exploiting a path that forces the reference counter to increment repeatedly. Once the counter hits INT_MAX (2^31-1), under PAX_REFCOUNT the next increment triggers a signed overflow in the CPU. The assembly logic in PaX reacts by reverting the operation (x86) or by not allowing it (ARM/MIPS/POWERPC/SPARC), so the counter remains at INT_MAX (see Appendix A for a special case on x86), and reports the event.

With this enhancement, any path (leaky or normal) that causes an increment will keep (saturate) the counter at INT_MAX (Appendix A explains special cases) and report an overflow event, which, when PaX is used together with Grsecurity [6], invokes the lockout mechanism. Decrements will bring the counter to INT_MAX-<small integer, at most NR_CPUS>.

This means that normal paths will simply preserve INT_MAX (or something a bit smaller), while the leaky path will keep it at INT_MAX. As a consequence, there is no permanent decrement (or further increment) towards zero; the counter value at most oscillates at or just below INT_MAX. There is thus no longer a use-after-free situation, only an unavoidable memory leak, which is better than the alternative of code execution after a successful exploitation.
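The saturating behaviour can be modelled in portable C as follows. This is only a user-space analogy, not the PaX implementation (which works on processor flags in inline assembly): checked_add() is a hypothetical helper, and it relies on the __builtin_add_overflow builtin available in GCC and Clang.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model of a protected counter update.
 * Returns false when the addition would overflow; in that case the
 * result is not committed and the caller would report the event. */
static bool checked_add(int *counter, int delta)
{
    int result;

    if (__builtin_add_overflow(*counter, delta, &result))
        return false;       /* overflow: counter is left untouched */

    *counter = result;      /* no overflow: commit the new value */
    return true;
}
```

With the counter at INT_MAX, every further checked_add(&c, 1) fails and leaves the value in place, while a decrement followed by an increment simply brings it back to INT_MAX.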

The refcount underflow problem, unlike the overflow problem described above, is not solvable with the same approach, since the vulnerability occurs on the transition from one to zero, without crossing the boundary between INT_MAX and INT_MIN. Also, detecting a decrement attempt on a zero refcount can only be done after the object has already been freed and possibly reused (exploited). Detection of such underflows would require some sort of delayed refcount decrement operation (where the delay could be indefinite), and the author knows of no proposed method to accomplish that at the moment.

In this part of the article we will analyze at a higher level the implementation decisions, trying to be as general (architecture-independent) as possible, while in the next part of the article we will dig into the architecture-specific details of the implementation.

The first insight for the automated handling of refcount overflows is that in Linux most such variables have a special type (atomic_t, atomic64_t, atomic_long_t) that comes with accessor functions. Were it not for this fact, we would have to manually patch every refcount access, which is clearly not a scalable or maintainable solution.

This also means that refcounts of regular integer types are not covered by this feature. In this article we do not cover the changes made by the PaX Project in the kernel to expand the usage of the modified functions (atomic operations) in other parts of the kernel where such usage makes sense (switching certain structure fields to atomic_t such as fs_struct.users, tty_port.count, tty_ldisc_ops.refcount, pipe_inode_info.{readers|writers|files|waiting_writers}, kmem_cache.refcount (SLAB and SLUB allocators), etc).

The second insight of PAX_REFCOUNT is that the large majority of atomic_t usage is for reference counters with a small minority for other purposes (statistics, unique identifiers, etc). This means that we have to restrict the usage of atomic_t types to reference counters and introduce new types (atomic_unchecked_t, atomic64_unchecked_t, atomic_long_unchecked_t) to handle non-reference counter uses where overflow is expected and harmless.
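The split between checked and unchecked types can be sketched in user space as below. All names carry a model_ prefix to make clear that they are invented for this illustration and are not the kernel's definitions; the point is only that two distinct struct types let the compiler keep the two families of accessors apart.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-ins for atomic_t and atomic_unchecked_t. */
typedef struct { int counter; } model_atomic_t;            /* reference counters  */
typedef struct { int counter; } model_atomic_unchecked_t;  /* statistics, ids ... */

/* Checked increment: refuses to go past INT_MAX.
 * Returns 0 on success, -1 when an overflow was detected. */
static int model_atomic_inc(model_atomic_t *v)
{
    if (v->counter == INT_MAX)
        return -1;          /* would overflow: keep the saturated value */
    v->counter++;
    return 0;
}

/* Unchecked increment: wraparound is expected and harmless here.
 * The addition is done on unsigned values to avoid undefined behaviour. */
static void model_atomic_inc_unchecked(model_atomic_unchecked_t *v)
{
    v->counter = (int)((uint32_t)v->counter + 1u);
}
```

Passing a model_atomic_unchecked_t pointer to model_atomic_inc() is a type error, which mirrors how a separate type forces every non-refcount user to be converted explicitly.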

The implementation first locates the reference counter accessor functions in the kernel and replaces their arithmetic operations with equivalents that update the processor flags in the event of an overflow.

This paper will only mention the implementation of the _unchecked versions of the accessor functions, but will not explain where they are replaced in the kernel, except to say that the replacement happens in situations where the kernel uses atomic operations where overflows are expected, as in certain fields related to keeping statistics on system events.

Even though PaX provides a configuration option to activate the feature (CONFIG_PAX_REFCOUNT), it patches the arithmetic instructions regardless of whether the option is enabled. This is because the performance impact relative to the original operations is negligible, whereas making the replacement conditional would add complexity to the patch. The difference is that when the option is off, the detection logic itself is never triggered.

The way the detection logic works is relatively straightforward:

* In arithmetic operations, with the modified instructions that update flags on signed overflows, it is possible to check for the occurrence of the overflow in the operation
* The check generates an exception in case of overflow, which is then handled by PaX (the way to define if this is a REFCOUNT protection exception is architecture specific)
  - For that, an instruction that causes a conditional exception is needed (a conditional jump over an instruction that generates an identifiable exception can also be used).
  - In some architectures (please see Appendix A for details), the logic needs to revert the operation. On other architectures it just does not commit it to memory.
* The handler will terminate the process that triggered the overflow and trigger the user lockout mechanism in grsecurity

The implementation is elegant and involves both architecture-independent code and obviously some architecture-dependent parts. Here we analyze the architecture-independent part of PaX without considering the expansion of protected reference counter usage in other parts of the kernel (the fs_struct.users, tty_port.count, etc mentioned earlier).

fs/exec.c: new headers and the function pax_report_refcount_overflow(regs)

pax_report_refcount_overflow() is called by the architecture-specific code to handle the reaction to the overflow by:

* Logging the refcount overflow event
* Sending a SIGKILL to the current process

All the functions that operate on reference counters are updated to use arithmetic instructions that flag overflows.

On ARM, the overflow detection logic uses a conditional jump that is taken when no overflow is detected, skipping over a bkpt instruction. If an overflow is detected, the jump is not taken and the bkpt instruction executes, causing an exception. The bkpt instruction on ARM takes a parameter, which the exception handler tests to identify whether the exception came from the REFCOUNT logic. The same parameter value is not used by any other part of the Linux kernel.

x86 uses an overflow exception that nothing else triggers, so the handler can assume that exception was always due to the reference count protection (and even if it was not, this is an abnormal exception in kernel mode).

An example of the logic is here (taken from: arch/x86/include/asm/atomic64_64.h):

    asm volatile(LOCK_PREFIX "addq %1,%0\n"   -> will set the overflow flag in case of an overflow

    #ifdef CONFIG_PAX_REFCOUNT
                 "jno 0f\n"                    -> if no overflow occurred, will jump to label 0:
                 LOCK_PREFIX "subq %1,%0\n"    -> if an overflow occurred, will undo the increment above
                 "int $4\n0:\n"                -> and generate an int 4 trap that is handled
                 _ASM_EXTABLE(0b, 0b)          -> this says that at label 0 backwards (int instruction) an
                                                  exception might occur and the code should continue on
                                                  label 0 backwards (after the int instr). The first label
                                                  isn't before the 'int' instruction because int 4 is a
                                                  trap, not a fault
    #endif

In the PowerPC implementation, PaX uses the lwarx/stwcx. instruction pair to guarantee exclusive access in the counter check code. Since the architecture provides neither conditional traps nor a bkpt-equivalent instruction that a handler can easily identify, a bit of a hack was necessary:

* We defined an illegal instruction (or an illegal value in an instruction, like executing wait with WC=0b10) and patched the illegal instruction handler to catch the exception
* The usual way of choosing the instruction is to see what Linux generates for the BUG macro and try to use something similar (for PowerPC it is in arch/powerpc/include/asm/bug.h and it is 'b00b00'); we use an opcode that triggers the same exception (c00b00).
* PowerPC uses anchored exceptions, which means the exceptions go to a fixed address in memory, so in arch/powerpc/kernel/head_32.S we have the address 0x700 (the handler for the program exception, triggered for invalid instructions) -> This handler calls the function that should be patched

The process to choose the invalid instruction (and validate the idea) was:

    "1: lwarx %0,0,%3 # atomic_add\n"   -> Loads the value in a temporary register (so if we don't
                                            commit the operation, it will not take effect)
    #ifdef CONFIG_PAX_REFCOUNT
    "mcrxr cr0\n"                        -> This instruction copies bits 0...3 from XER to CR0 (SO, OV
                                            and CA flags) and zeroes them out
    "bf 4*cr0+so, 3f\n"                  -> bf is a conditional branch if the bit condition is false.
                                            4*cr0+so means that it is testing the SO bit (each CR field
                                            is 4 bits wide). The jump target is named 3 and is forward.
    "2:.long " "0x00c00b00""\n"          -> this will only happen if the above branch is not taken,
                                            thus meaning in case of an overflow. Since c00b00 is an
                                            invalid instruction, the handler will be activated.
    #else                                -> if CONFIG_PAX_REFCOUNT is not on in the kernel configuration
    "add %0,%2,%0\n"                     -> normal add operation is the one used by the Linux Kernel
    #endif
    3:\n                                 -> this is the branch target in case of no overflow
    PPC405_ERR77(0,%3)                   -> Linux uses that to simulate hardware conditions, this has
                                            nothing to do with the protection
    stwcx. %0,0,%3 \n\                   -> We commit the operation
    ...

    #ifdef CONFIG_PAX_REFCOUNT
    4:\n
    _ASM_EXTABLE(2b, 4b)                 -> This is the exception section update to specify that in the
                                            label 2 backwards we have the possibility of an exception
                                            and to continue after the exception handler the point is
                                            the label 4 backwards (thus, not committing the instruction)
    #endif

PaX recently added PAX_REFCOUNT support for MIPS, based on code contributed by Corey Minyard.

[----- Detection logic

The logic on MIPS is trivial, since the usual instructions naturally generate exceptions on overflow. PaX therefore just needs to replace the instructions that do not generate exceptions with the ones that do:

MIPS provides special instructions that generate a trap on overflow (for example, the instruction 'add' will trap if there is an overflow). If you don't want traps on overflows, you need to use the equivalents that do not trap (e.g., addu).

That said, on MIPS the exception handler uses the fixup_exception() function to differentiate the REFCOUNT overflow:

The fixup_exception() function uses the _ASM_EXTABLE exception lookup available in the Linux kernel. A good write up on the subject can be seen in the Kernel Documentation [7], but for our purposes it is enough to know that _ASM_EXTABLE defines the address of the potential exception and the address of the handler.
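The idea of such a table can be sketched as a list of address pairs plus a lookup. This is a simplified user-space model: the struct layout and the names model_extable_entry and model_search_extable() are assumptions made for this illustration, the real kernel layout differs between architectures and versions, and a linear scan is used here only for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified exception table entry. */
struct model_extable_entry {
    uintptr_t insn;   /* address of the instruction that may trap */
    uintptr_t fixup;  /* address where execution should continue  */
};

/* Returns the fixup address registered for `trap_addr`,
 * or 0 when the address has no entry in the table. */
static uintptr_t model_search_extable(const struct model_extable_entry *table,
                                      size_t count, uintptr_t trap_addr)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i].insn == trap_addr)
            return table[i].fixup;
    }
    return 0;
}
```

A handler built on this model would treat a hit as an expected exception and resume at the fixup address, and a miss as an exception that did not come from an annotated instruction.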

x86 presents a special case for the implementation due to a possible race. On a RISC architecture such as SPARC, for example, there is no support for complex atomic operations on memory operands (the LL/SC mechanism is used instead; please refer to Appendix C). x86, on the other hand, does have that capability, so the result of an overflowed refcount can become visible to other CPUs, even if only for a brief period of time (a few cycles, perhaps a few dozen for dirty cache-line transfers).

The risk here is that an attacker could in theory time two or more threads to execute the leaky path in parallel and by hitting the race window, allow one of them to increment past INT_MAX (so that further increments would not trigger the signed overflow detection logic and hence allow a full wraparound to zero and the use-after-free situation).

To mitigate that risk, instead of using the original amount to revert the overflowing operation, one can use a multiple of it (with at least NR_CPUS as the multiplier). Another approach would be to detect a negative result instead of a signed overflow, but this would need further analysis and special-case handling for refcounts that can legitimately have negative values (e.g., page._mapcount). PaX considered this to be such an impractical case that it never implemented this additional logic.
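The effect of reverting by a multiple can be checked with a small arithmetic model of the race. It is a simplification under stated assumptions: race_outcome() and wrap_add() are hypothetical helpers, every increment is by one, the racing CPUs all land inside the window, and wrapping is modelled with unsigned arithmetic (the conversion back to int assumes two's complement, as with GCC and Clang).

```c
#include <limits.h>
#include <stdint.h>

/* Wrapping 32-bit addition, modelling what the hardware add does. */
static int wrap_add(int value, int delta)
{
    return (int)((uint32_t)value + (uint32_t)delta);
}

/* Model of the race: the counter starts at INT_MAX, one CPU performs
 * the overflowing increment, `racers` other CPUs increment inside the
 * window (seeing no signed overflow), then the first CPU reverts by
 * `revert_amount`. Returns the final counter value. */
static int race_outcome(int racers, int revert_amount)
{
    int value = wrap_add(INT_MAX, 1);        /* overflow is briefly visible */

    for (int i = 0; i < racers; i++)
        value = wrap_add(value, 1);          /* undetected increments */

    return wrap_add(value, -revert_amount);  /* revert by the overflowing CPU */
}
```

With one racer and a revert of exactly one, the counter ends at INT_MIN, so the protection has been bypassed. With a revert of four (standing in for NR_CPUS on a hypothetical four-CPU system), one racer leaves the counter at INT_MAX - 2 and three racers leave it at INT_MAX, so it stays in the saturated range.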

This paper and the implementation are strongly based on conversations with the PaX Team. The whole idea and the explanations of how the protection mechanism works are an interpretation of his explanations.

A brief update regarding the race condition on x86: during the forward port to the 4.8 kernel we decided to reduce the memory footprint and performance impact of the REFCOUNT instrumentation on the x86 architecture, which in turn made it a lot easier to eliminate the exploitation of the race altogether.