ia32 syscall emulation

There are two ways to invoke system calls on the Intel/AMD family of processors:

Software interrupt 0x80.

The sysenter family of instructions.

The sysenter family of instructions provides a faster syscall interface than the traditional int 0x80 interface, but isn’t available on some older 32-bit Intel CPUs.

The Linux kernel has a layer of code to allow syscalls executed via int 0x80 to work on newer kernels. When a system call is invoked with int 0x80, the kernel rearranges state to pass execution off to the desired system call, thus maintaining support for this older system call interface.
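As a sketch of what entering through the legacy gate looks like (this assumes an x86-64 Linux kernel built with ia32 emulation enabled; the function name is mine, and 20 is getpid's number in the 32-bit syscall table, not the 64-bit one):

```c
/* Hedged sketch: invoke getpid through the legacy int 0x80 gate.
 * Works only on x86 Linux kernels with ia32 emulation compiled in;
 * the syscall number in %eax is looked up in the 32-bit table. */
long getpid_via_int80(void) {
    long ret;
    __asm__ volatile ("int $0x80" : "=a"(ret) : "a"(20L) : "memory");
    return ret;
}
```

A 64-bit process can execute this and still land in the ia32 emulation layer described above, which is exactly why that layer has to be correct.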

ptrace(2) and the ia32 syscall emulation layer

From the ptrace(2) man page (emphasis mine):

The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement break-point debugging and system call tracing.

If we examine the IA32 syscall emulation code we see some code in place to support ptrace:

This code is placing a pointer to the thread control block (TCB) into the register r10 and then checking if ptrace is listening for system call notifications. If it is, a secondary code path is entered.

Notice the LOAD_ARGS32 macro and comment above. That macro reloads register values after the ptrace syscall notification has fired. This is really fucking important because the userland parent process listening for ptrace notifications may have modified the registers which were loaded with data to correctly invoke a desired system call. It is crucial that these register values are reloaded so that the system call is invoked with exactly the state the tracer left behind.
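The userland half of that dance can be sketched like this (a hedged, x86-64-Linux-only illustration; the function name is mine, and it only reads orig_rax where a real tracer could also rewrite it with PTRACE_SETREGS):

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch: stop a child at syscall entry with PTRACE_SYSCALL and read
 * orig_rax -- the slot a tracer could rewrite before the kernel reloads
 * the argument registers (the LOAD_ARGS32 step).  Returns the syscall
 * number seen at the first syscall-entry stop, or -1 if tracing is
 * unavailable in this environment. */
long observe_syscall_entry(void) {
    pid_t pid = fork();
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);              /* ptrace forbidden here */
        raise(SIGSTOP);            /* hand control to the tracer */
        syscall(SYS_getpid);       /* a syscall for the tracer to see */
        _exit(0);
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        return -1;                 /* PTRACE_TRACEME failed */
    ptrace(PTRACE_SYSCALL, pid, NULL, NULL);   /* run to syscall entry */
    waitpid(pid, &status, 0);
    long nr = -1;
    struct user_regs_struct regs;
    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == 0)
        nr = (long)regs.orig_rax;  /* a tracer could SETREGS here */
    ptrace(PTRACE_CONT, pid, NULL, NULL);
    waitpid(pid, &status, 0);
    return nr;
}
```

Whatever a tracer writes into the registers at that stop is what LOAD_ARGS32 hands back to the syscall dispatch path.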

Also take note of the sanity check for %eax: cmpl $(IA32_NR_syscalls-1),%eax

This check is ensuring that the value in %eax is less than or equal to (number of syscalls – 1). If it is, it executes ia32_do_call.
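The whole exploit hinges on cmpl examining only %eax, the low 32 bits of %rax. A minimal sketch of the check (IA32_NR_SYSCALLS here is a made-up table size for illustration, not the kernel's real constant):

```c
#include <stdint.h>

#define IA32_NR_SYSCALLS 0x150   /* hypothetical table size */

/* cmpl compares %eax -- the low 32 bits of %rax -- so a 64-bit value
 * like 0x800000101 truncates to 0x101 and sails through the check. */
int passes_eax_check(uint64_t rax) {
    uint32_t eax = (uint32_t)rax;   /* what cmpl actually sees */
    return eax <= IA32_NR_SYSCALLS - 1;
}
```

Any value whose low 32 bits land inside the table passes, no matter what the upper half holds.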

The parent calls wait and blocks until entry into a system call. When a system call is entered, ptrace is invoked to read the value of the rax register. If the value is 0x101, ptrace is invoked to set the value of rax to 0x800000101 to cause an overflow as we’ll see shortly. ptrace is then invoked to resume execution in the child.

While this is happening, the child process is executing. It begins by looking up the addresses of two symbols in the kernel:

This mapping is created at the address tmp. tmp is set earlier to: 0xffffffff80000000 + (0x0000000800000101 * 8) (stored in kern_s in main).

This value actually causes an overflow, and wraps around to: 0x3f80000808. mmap only creates mappings on page-aligned addresses, so the mapping is created at: 0x3f80000000. This mapping is 64 megabytes large (stored in size).
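The wraparound is plain unsigned arithmetic; a sketch using the constants from the writeup (function names are mine):

```c
#include <stdint.h>

/* The table slot address is base + nr*8; with nr = 0x800000101 the
 * 64-bit sum wraps past 2^64 and lands in low, mmap-able memory. */
uint64_t table_slot(uint64_t base, uint64_t nr) {
    return base + nr * 8;            /* wraps modulo 2^64 */
}

/* mmap only creates mappings at page-aligned addresses. */
uint64_t page_align_down(uint64_t addr) {
    return addr & ~(uint64_t)0xfff;  /* assumes 4 KiB pages */
}
```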

Next, the child process writes the address of a function called kernelmodecode which makes use of the symbols commit_creds and prepare_kernel_cred which were looked up earlier:

tying it all together

When system call 0x101 is executed, the parent process (described above) receives a notification that a system call is being entered. The parent process then sets rax to a value which will cause an overflow: 0x800000101 and resumes execution in the child.

Which succeeds, because it is only comparing the lower 32 bits of rax (0x101) to IA32_NR_syscalls-1.

Next, execution continues to ia32_do_call, which causes an overflow, since rax contains a very large value.

call *ia32_sys_call_table(,%rax,8)

Instead of calling the function whose address is stored in the ia32_sys_call_table, the address is pulled from the memory the child process mapped in, which contains the address of the function kernelmodecode.

kernelmodecode is part of the exploit, but the kernel has access to the entire address space and is free to begin executing code wherever it chooses. As a result, kernelmodecode executes in kernel mode, setting the privileges of the process to those of init.

The system has been rooted.

The fix

The fix is to zero the upper half of rax and change the comparison to examine the entire register. You can see the diffs of the fix here and here.

tl;dr

This post will describe how I stumbled upon a code path in the Linux kernel which allows external programs to be launched when a core dump is about to happen. I provide a link to a short and ugly Ruby script which captures a faulting process, runs gdb to get a backtrace (and other information), captures the core dump, and then generates a notification email.

Your processes may get automatically restarted when a fault occurs and you may even get an email letting you know your process died. Both of those things are useful, but it turns out that with just a tiny bit of extra work you can actually get very detailed emails showing a stack trace, register information, and a snapshot of the process’ files in /proc.

random walking the linux kernel

One day I was sitting around wondering how exactly the coredump code paths are wired. I cracked open the kernel source and started reading.

In the above code, core_pattern (which can be set with a sysctl or via /proc/sys/kernel/core_pattern) is examined to determine whether its first character is a |. If so, format_corename returns 1. So | seems relatively important, but at this point I’m still unclear on what it actually means.

Scanning the rest of the code for do_coredump reveals something very interesting (this is more code from the function in the first code snippet above):

WTF? call_usermodehelper_fns? umh_pipe_setup? This is looking pretty interesting. If you follow the code down a few layers, you end up at call_usermodehelper_exec which has the following very enlightening comment:

/**
* call_usermodehelper_exec - start a usermode application
*
* . . .
*
* Runs a user-space application. The application is started
* asynchronously if wait is not set, and runs as a child of keventd.
* (ie. it runs with full root capabilities).
*/

what it all means

All together this is actually pretty fucking sick:

You can set /proc/sys/kernel/core_pattern to run a script when a process is going to dump core.

Your script is run before the process is killed.

A pipe is opened and attached to your script. The kernel writes the coredump to the pipe. Your script can read it and write it to storage.
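The read side of that pipe is just the helper's stdin, so saving the dump is a plain copy loop. A minimal sketch of what such a handler does (the function name is mine, not from the post's Ruby script):

```c
#include <stdio.h>

/* When core_pattern starts with |, the kernel execs the helper and
 * wires the dumping process's core image to the helper's stdin; saving
 * the dump is just streaming stdin to a file.  Returns bytes written,
 * or -1 on error. */
long save_core(FILE *in, const char *path) {
    FILE *out = fopen(path, "wb");
    if (!out)
        return -1;
    char buf[4096];
    size_t n;
    long total = 0;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            fclose(out);
            return -1;
        }
        total += (long)n;
    }
    fclose(out);
    return total;
}
```

A real handler would call this as `save_core(stdin, path)` with a path built from the %p/%e specifiers the kernel passes in.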

Your script can attach GDB, get a backtrace, and gather other information to send a detailed email.

But the coolest part of all:

All of the files in /proc/[pid] for that process remain intact and can be inspected. You can check the open file descriptors, the process’s memory map, and much much more.
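For instance, a handler can scan /proc/<pid>/status while the kernel holds the process open. A sketch that reads our own entry the same way (the helper name is mine):

```c
#include <stdio.h>
#include <string.h>

/* Scan /proc/<pid>/status for a line starting with `key` (e.g. "Pid:").
 * A core_pattern helper would pass the dumping process's pid string. */
int proc_status_has(const char *pid, const char *key) {
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/status", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;
    char line[256];
    int found = 0;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```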

ruby codez to harness this amazing code path

I whipped up a pretty simple, ugly, ruby script. You can get it here. I set up my system to use it by:

Why didn’t you read the documentation instead?

This (as far as I can tell) little-known feature is documented at linux-kernel-source/Documentation/sysctl/kernel.txt under the “core_pattern” section. I didn’t read the documentation because (little known fact) I actually don’t know how to read. I found the code path randomly and it was much more fun and interesting to discover this little feature by diving into the code.

Conclusion

This could/should probably be a feature/plugin/whatever for god/monit/etc instead of a stand-alone script.

Reading code to discover features doesn’t scale very well, but it is a lot more fun than reading documentation all the time. Also, you learn stuff and reading code makes you a better programmer.

The intention of this post is to highlight a subtle GCC optimization bug that leads to slower and larger code being generated than would have been generated without the optimization flag.

UPDATED: Graphs are now 0-based on the y axis. Links are in the tidbits section (below the conclusion) to my ugly test harness, a terminal session of the build of the test case in the bug report, objdump output, and corresponding system information.

Hold the #gccfail tweets, son.

Everyone fucks up. The point of this post is not to rag on GCC. If writing a C compiler was easy then every asshole with a keyboard would write one for fun.

WARNING: THERE IS MATH, SCIENCE, AND GRAPHS BELOW.

Watch yourself.

The original bug report for -fomit-frame-pointer.

I stumbled across a bug report for GCC that was very interesting. It points out a very subtle bug that occurs when the -fomit-frame-pointer flag is passed to GCC. The bug report is for 32-bit code, however after some testing I found that this bug also rears its head in 64-bit code.

What is -fomit-frame-pointer supposed to do?

The -fomit-frame-pointer flag is intended to direct GCC to avoid saving and restoring the frame pointer (%ebp or %rbp). This is supposed to make function calls faster, since the function is doing less work each invocation. It should also make function code take fewer bytes, since fewer instructions are emitted.

A caveat of using -fomit-frame-pointer is that it may make debugging impossible on certain systems. To combat this on Linux, .debug_frame and .eh_frame sections are added to ELF binaries to assist in the stack unwinding process when the frame pointer is omitted.

What is the bug?

The bug is that when -fomit-frame-pointer is used, GCC erroneously uses the frame pointer register as a general purpose register when a different register could be used instead.

wat.

The amd64 and i386 ABIs specify a list of caller and callee saved registers.

The frame pointer register is callee saved. That means that if a function is going to use the frame pointer register, it must save and restore the value in the register.
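You can watch the ABI impose that cost with a sketch like the following (x86-64, GCC/Clang inline asm; the function is mine, purely for illustration). Because the asm clobbers %rbx, a callee-saved register, the compiler must wrap the use in a push/pop pair, exactly the save/restore the bug adds for %rbp:

```c
/* The clobber list names %rbx, so the compiler must preserve it across
 * this function (push %rbx ... pop %rbx in the emitted code); a
 * caller-saved scratch register would need no such bookkeeping. */
long add_clobbering_rbx(long a, long b) {
    long r;
    __asm__("mov %1, %%rbx\n\t"
            "add %2, %%rbx\n\t"
            "mov %%rbx, %0"
            : "=r"(r)
            : "r"(a), "r"(b)
            : "rbx");
    return r;
}
```

Compiling this with -S and comparing against a version that clobbers a caller-saved register like %rcx shows the extra instructions directly.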

The test case provided in the bug report shows that other caller saved registers were available for use.

Had the function used a caller saved register instead, there would be no need for the additional save and restore instructions in the function.

Removing those instructions would take fewer bytes and execute faster.

What are the consequences?

Let’s take a look at two potential pieces of code.

The first piece is the code that would be generated if -fomit-frame-pointer is not used:

The above assembly sequence is what is generated when GCC decides to use the frame pointer register as a general purpose register. Since it is callee saved, it must be saved before being modified and restored after being modified.

So -fomit-frame-pointer makes your binary fatter, but does it make it slower?

Only one way to find out: do science.

I built a simple (and very ugly) testing harness to test the above pieces of code to determine which piece of code is faster. Before we get into the benchmark results, I want to tell you why my benchmark is bullshit.

Yes, bullshit.

You see, it makes me sad when people post benchmarks and neglect to tell others why their benchmark may be inaccurate. So, lemme start the trend.

This benchmark is useless because:

Reading the CPU cycle counter is unreliable (more on this below the conclusion). I also tracked wall clock time.

I don’t have the ideal test environment. I ran this on bare metal hardware, and set the CPU affinity to keep the process pinned to a single CPU… BUT

I could have done better if I had pinned init to CPU0 (thereby forcing all children of init to be pinned to CPU0 – remember child processes inherit the affinity mask). I would have then had an entire CPU for nothing but my benchmark.

I could have done better if I forced the CPU running my benchmark program to not handle any IRQs.

The lfence instruction is a serializing instruction that ensures that all load instructions which were issued before the lfence instruction have been executed before proceeding. I did this to make sure that the cycle counter was being read after all operations in the test functions were executed.
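A sketch of such a fenced counter read (x86-64 only; a rough approximation of the harness's timing read, not its actual code):

```c
#include <stdint.h>

/* lfence keeps earlier loads from drifting past the counter read;
 * without it, out-of-order execution can issue rdtsc before the work
 * being timed has actually finished. */
static inline uint64_t rdtsc_after_lfence(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;   /* rdtsc splits the 64-bit TSC
                                           across %edx:%eax */
}
```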

The values returned by the cycle counter are misleading because the CPU frequency may be scaled at any time. This is why I also measured wall clock time.