Linux Kernel Vmsplice Vulnerability

February 19, 2008, by Sean Caulfield

I spent some time this week analyzing the recently disclosed vulnerability in the Linux kernel's vmsplice system call. Several PoCs have been released, and I was curious how they exploited the kernel.

Background on the vulnerability: vmsplice is a system call that allows a programmer to map an I/O vector (basically, an array of buffers) to a pipe. From the man page:

"The vmsplice() system call maps nr_segs ranges of user memory described by iov into a pipe. The file descriptor fd must refer to a pipe."

The kernel mediates the whole transaction, dutifully mapping or copying the user-specified memory to the pipe's buffers, or vice versa.
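To make the man page description concrete, here is a minimal sketch of ordinary, well-behaved vmsplice() use on Linux: two user buffers are described by an iovec array, mapped into a pipe, and then read back out. The function name vmsplice_roundtrip and the buffer contents are made up for illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice() */
#include <string.h>
#include <sys/types.h>  /* ssize_t */
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>     /* pipe(), read(), close() */

/*
 * Gather two user buffers into a pipe with vmsplice(), then read the
 * bytes back out. Returns the number of bytes read, or -1 on error.
 * The function name and buffer contents are illustrative only.
 */
static ssize_t vmsplice_roundtrip(char *out, size_t out_len)
{
    static char part1[] = "hello, ";
    static char part2[] = "pipe";
    struct iovec iov[2] = {
        { .iov_base = part1, .iov_len = sizeof(part1) - 1 },
        { .iov_base = part2, .iov_len = sizeof(part2) - 1 },
    };
    int fds[2];
    ssize_t queued, got;

    if (pipe(fds) < 0)
        return -1;

    /* fd must be the write end of a pipe; 2 is nr_segs, the iovec count. */
    queued = vmsplice(fds[1], iov, 2, 0);
    if (queued < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    got = read(fds[0], out, out_len);
    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Every iov_base and iov_len here comes straight from the calling process, which is why the kernel has to treat them as untrusted input.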

The trouble is that the sys_vmsplice routine didn't follow best practices for kernel programming: it failed to check the pointers passed from userspace for validity. In at least three places in fs/splice.c, data in the user-specified iov array was copied to or from without verifying its validity via access_ok().

The exploit I examined only worked on kernel versions 2.6.23 to 2.6.24.1. Rafal Wojtczuk has an excellent write-up on the exploit for 2.6.17 and up. You should check it out.

In 2.6.23, code was added to handle copying from the pipe to the user iov.

Unfortunately, there was no check that this destination address was a valid mapping for the user process.

In that code path, the base and len of each iov entry are only checked for being non-zero, rather than getting the more detailed check performed by access_ok(). Thus, we can pass in values that are unmapped (less useful for exploitation) or are mapped but unwritable.
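The difference between the two kinds of check can be modeled in userspace. This is a sketch, not kernel code: nonzero_check mirrors the "non-zero only" test described above, while range_check is written in the spirit of access_ok(), which verifies that an address range lies within the user portion of the address space. The function names and the 0xC0000000 limit (the usual 3 GB / 1 GB split on 32-bit x86) are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed user/kernel boundary for a 32-bit x86 kernel (3 GB / 1 GB split). */
#define MODEL_USER_LIMIT ((uint64_t)0xC0000000u)

/* The weak check: base and len only have to be non-zero. */
static bool nonzero_check(uint64_t base, uint64_t len)
{
    return base != 0 && len != 0;
}

/*
 * A range check in the spirit of access_ok(): the whole interval
 * [base, base + len) must lie below the user address limit.
 * Written so that base + len is never computed and cannot overflow.
 */
static bool range_check(uint64_t base, uint64_t len)
{
    return len <= MODEL_USER_LIMIT && base <= MODEL_USER_LIMIT - len;
}
```

With these definitions, a made-up kernel-range address such as 0xC0105A90 passes nonzero_check but fails range_check, while an ordinary user address such as 0x08048000 passes both.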

This latter case is what qaaz's exploit utilizes. By specifying the entry point of another system call (in this case, the rarely used sys_vm86old) as the "base" for copying, qaaz tricks the kernel into overwriting its own syscall table.

In the exploit, get_target() finds the target system call. TRAMP_CODE is a static buffer containing our privilege escalation syscall. gimmeroot() is a macro that invokes the newly overwritten syscall, which runs a function to set our process's UID/GID to root's.
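The post does not show how get_target() locates the system call. One common way to resolve a kernel symbol's address is to scan text in the /proc/kallsyms format ("address type name" per line), and the sketch below assumes that approach. The find_symbol helper and the addresses in the usage example are hypothetical, not taken from the exploit.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up `name` in text formatted like /proc/kallsyms
 * ("<hex address> <type> <symbol>" on each line).
 * Returns the symbol's address, or 0 if it is not found.
 * Hypothetical helper for illustration only.
 */
static unsigned long find_symbol(const char *table, const char *name)
{
    const char *line = table;

    while (line && *line) {
        unsigned long addr;
        char type;
        char sym[128];

        /* Parse one line; skip it if it does not have all three fields. */
        if (sscanf(line, "%lx %c %127s", &addr, &type, sym) == 3 &&
            strcmp(sym, name) == 0)
            return addr;

        line = strchr(line, '\n');
        if (line)
            line++;     /* step past the newline to the next entry */
    }
    return 0;
}
```

For example, given the made-up table "c0102f40 T sys_fork\nc0105a90 T sys_vm86old\n", looking up "sys_vm86old" yields 0xc0105a90. The match is on the whole symbol name, so "sys_vm86" would not accidentally match "sys_vm86old".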

You would expect the user process to be terminated for an access violation. After all, it is asking the kernel to write to what should be a protected area of kernel memory. Moreover, kernel memory isn't usually mapped into a process's address space. Trying to access it should generate a page fault and subsequent kernel oops, plus a SIGSEGV for the userland process.

However, this is not the case with system calls: they must be mapped in userland as well as kernelspace, since user processes need to call them. This exploit takes advantage of this, as the copy is done with the current process's memory mappings but with the elevated permissions (and reduced access checks) of running in kernel mode.

There are a number of steps that could be taken to prevent this exploit from working:

1. Validate every pointer passed in from userspace with access_ok() before copying to or from it.

2. Protect certain chunks of kernel memory from being overwritten after their initial load. Unfortunately, the syscall table isn't a very good candidate for this tactic: it needs to be modified at runtime, at the very least for registering new system calls from loadable modules.

3. Audit system calls and flag any with unusual parameters. In this case, the user process was passing a pointer into what should have been (from the process's perspective) an unwritable page. This tactic would be very expensive, since you'd essentially be double-checking most of what the kernel already verifies, or at least should verify.

4. Careful code auditing to make sure #1 always happens. Note that the tactic of preventing user processes from mapping very low memory (used in the other vmsplice exploit), as suggested elsewhere, would not have prevented this particular exploit from working. It uses a different path through sys_vmsplice to achieve its code execution, with no zero page mappings or any such funkiness.

By the way, the vulnerability has been patched in 2.6.24.2. Though it is a local-only exploit, it is still a significant risk.
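As a rough triage aid, the version range quoted in this post (2.6.23 through 2.6.24.1) can be turned into a simple check on a kernel release string such as the one printed by uname -r. This is only a sketch: the helper names are made up, and distribution kernels often backport fixes without changing the upstream version number, so the result is a hint rather than a verdict.

```c
#include <stdbool.h>
#include <stdio.h>

/* Pack a four-part version into one comparable number (parts < 1000). */
static long long pack_version(int a, int b, int c, int d)
{
    return (((long long)a * 1000 + b) * 1000 + c) * 1000 + d;
}

/*
 * Returns true if `release` (e.g. "2.6.24.1" or "2.6.24-generic")
 * falls inside the 2.6.23 .. 2.6.24.1 range quoted in this post.
 * Hypothetical helper; any suffix after the numeric part is ignored.
 */
static bool in_quoted_range(const char *release)
{
    int a = 0, b = 0, c = 0, d = 0;

    /* The fourth component is optional: "2.6.23" parses with d = 0. */
    if (sscanf(release, "%d.%d.%d.%d", &a, &b, &c, &d) < 3)
        return false;

    long long v = pack_version(a, b, c, d);
    return v >= pack_version(2, 6, 23, 0) &&
           v <= pack_version(2, 6, 24, 1);
}
```

Under this check "2.6.24.1" is inside the range and "2.6.24.2", the patched release, is outside it.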