Linux kernel user to kernel space range checks

So, I was recently looking at a Linux kernel code that as it turns out it isn’t buggy. I was tricked because of the lack of the classic sequence of access_ok() before performing the actual read operation from the userland. That’s why I decided to make this post here to give an explanation of the range checks for user to kernel space copy (read/write) operations of the Linux kernel.
For convenience I’ll stick with x86 architecture in this post since this is the most widely deployed one. As you probably already know, if there is no check on the user supplied pointer, it could lead to pretty badvulnerabilities because a user could specify a pointer located to kernel space and consequently force the kernel into performing a read or write operation to an arbitrary kernel location.

access_ok()
The most common check to avoid this is the access_ok() macro which can be found at /arch/x86/include/asm/uaccess.h for the x86 architecture.

/**
* access_ok: - Checks if a user space pointer is valid
* @type: Type of access: %VERIFY_READ or %VERIFY_WRITE. Note that
* %VERIFY_WRITE is a superset of %VERIFY_READ - if it is safe
* to write to a block, it is always safe to read from it.
* @addr: User space pointer to start of block to check
* @size: Size of block to check
*
* Context: User context only. This function may sleep.
*
* Checks if a pointer to a block of memory in user space is valid.
*
* Returns true (nonzero) if the memory block may be valid, false (zero)
* if it is definitely invalid.
*
* Note that, depending on architecture, this function probably just
* checks that the pointer is in the user space range - after calling
* this function, memory access functions may still return -EFAULT.
*/
#define access_ok(type, addr, size) (likely(__range_not_ok(addr, size) == 0))

You can read the comment to get an introduction to that useful Linux kernel’s C macro. So, likely() C macro is used for efficiency (branch prediction) and as you can see, access_ok() simply calls __range_not_ok() which can be found in the same source code file.

The rest of the code is a simple GCC inline assembly code that basically does the following:

add size, addr ; get the range of the requested pointer
sbb 0, flag ; substract the carry flag (CF) from 'flag'
cmp addr, current_thread_info()->addr_limit.seg ; compare the thread's address limit with the requested pointer
sbb flag, flag ; this will update the flag based on the compare instruction
; that it could change the CF flag.

So, this is the range check that access_ok() performs. It checks that the given range (address + size) is within the current thread’s limit. For completeness, here is current_thread_info() macro which is defined at arch/x86/include/asm/thread_info.h header file.

And ‘addr_limit’ which is the value containing our thread’s address limit is of ‘mm_segment_t’ type which is defined at arch/x86/include/asm/processor.h.

typedef struct {
unsigned long seg;
} mm_segment_t;

Before moving on to the next one, it’s important to note here that access_ok() on the x86 architecture completely ignores the type of access to be performed (VERIFY_READ or VERIFY_WRITE). It only checks the range of the given address.

get_user()/put_user()

This is a common call for reading or writing variables from user space. Since the aim of this post is to deal with the range checks I’m just going to cover get_user() because the same range checks are being employed by put_user() too. Its code resides at arch/x86/include/asm/uaccess.h header file for the x86 architecture that we’ll be looking at.

Obviously, the whole routine is only enabled if ‘in_atomic’ that checks if this is running in an atomic context is true as we can read at include/linux/hardirq.h header file. If this is the case, it will make an initial check to ensure that preemptable counter hasn’t reached preemptive offset and IRQs aren’t disabled using irqs_disabled() macro. If this is true it will immediately return. Also, if the system’s state is ‘SYSTEM_RUNNING’ or there is a kernel OOPS in progress it will also return since there is either no error or there is an ongoing kernel OOPS. Next, it will update the ‘jiffies’ value and print some debugging information for the sleeping function.
The other macro that will be likely invoked is the might_resched() which resides at include/linux/kernel.h header file.

If the function requires rescheduling (and of course, preemptive kernel is active), it will call __cond_resched() that increments preemptive counter using add_preempt_count(), then schedules the task through schedule() and finally decrements the counter.
We won’t go any deeper to the sleeping and scheduling since it’s out of the scope of this post. Back to get_user() you can see that depending on the size of the given pointer it will invoke __get_user_x() changing its first argument to either 1, 2, 4 , 8 or anything else passed to it through ‘X’ variable. This moves us to arch/x86/include/asm/uaccess.h where this code resides.

As you can read here, this inline assembly code will simply call __get_user_{1,2,4,8,X} depending on the previously given value. Those functions are available at arch/x86/lib/getuser.S and since we only care about the range check I will only discuss the __get_user_1() which is this:

The ‘CFI_STARTPROC’ initializes DWARF-2 debugging section ‘.cfi_startproc’ if ‘CONFIG_AS_CFI’ is enabled. The next call to GET_THREAD_INFO() it will return the thread info structure for the given pointer as we can read at arch/x86/include/asm/thread_info.h.

That moves ‘THREAD_SIZE’ to ‘reg’ and then simply alignes ESP with that mask.
As you can see we then have a compare instruction that checks the user address ‘%_ASM_AX’ with the value of ‘TI_addr_limit’ passing the previously acquired ‘thread_info’ structure’s address to it. The latter symbol is defined at arch/x86/kernel/asm-offsets_32.c like this:

Which defines symbol ‘TI_addr_limit’ having the ‘addr_limit’ value which is contained in ‘thread_info’ structure. Basically, ‘TI_addr_limit’ for the given pointer is another way of accessing current_thread_info()->addr_limit.seg which was discussed earlier.
So, if everything is fine the next ‘jae’ (Jump if Above or Equal) will be skipped and ‘movzb’ will copy a single Byte of the user provided pointer to EDX general purpose register, then zero out EAX (which is the function’s return value) and return. However, if the comparison was incorrect then bad_get_user() will be executed.

That will clear out the destination EDX register, set the user’s return value to ‘-EFAULT’ and return. That was pretty much the range check inside get_user()/put_user().

copy_to_user()/copy_from_user()
Those are the most common functions for reading and writing data to and from user space. Once again, since our goal is to understand the range checks of these routines I’ll just use copy_from_user() since the same range check applies to copy_to_user() routine too. The code of that function is placed at arch/x86/include/asm/uaccess_32.h header file.

So, if compiled with newer GCC releases (4.0 or newer) it will use the __builtin_object_size() function to return the object’s size. Back to copy_from_user() we can read that it checks that object’s size isn’t greater than size of data to be copied and that is neither -1. If this is the case, it will move to the next level and call _copy_from_user(). Otherwise, it will call copy_from_user_overflow() from arch/x86/lib/usercopy_32.c.

It uses access_ok() to check the user space ‘from’ pointer for ‘n’ Bytes and if it’s within the address limit it will continue to the actual copy through __copy_from_user(). If this fails, it will just zero out the kernel memory pointed by ‘to’ pointer and return.

There is a person who I should thank here but I don’t know if he wants to be referenced. :)