Two Bugs, One Func()

Share

> part i: bug 0x1, kernel panic ya’ll!

Aloha it’s Patrick, Chief Security Researcher at Synack. In my free time, I also run a small OS X security website objective-see.com, where I share my personal OS X security tools and blog about OS X security and coding topics. Below is one such post originally published on my site, which discusses one of my tools and more generally once way to monitor processing creation on OS X. Read & enjoy!

Background
While I occasionally hunt for bugs in the macOS kernel, sometimes I get lucky and they simply appear! In this two part blog, we’ll first track down the cause of a kernel panic that was inadvertently triggered by some benign user-mode code I was working on. We’ll then see that while this initial bug, though interesting, does not appear not exploitable – another bug, found within the same function could easily be used to execute arbitrary (unsigned) code in the context of the kernel. Sweet!
One of Objective-See’s most popular tools is RansomWhere?. By continually monitoring the file-system for the creation of encrypted files by suspicious processes, RansomWhere? aims to protect personal files, generically stopping ransomware in its tracks.

To keep things simple, the original version of RansomWhere? did not track or monitor process creations. That is to say, it just monitored the file system for file system events related to encryption. And while the file system events it monitored (e.g. file creations) did have a process identifier (pid) associated with each, this inherently resulted in the following limitations:

Short-lived processes could exit before their file system events would be delivered to RansomWhere?. As such file system events only contain the pid of the responsible process, when RansomWhere? would attempt to map the pid to a process path, this would fail. This was problematic, as the process path to it’s binary image is needed to classify the binary (as trusted, or not).

Parents that spawned separate child processes to encrypt each file, would not be flagged. Again, as the initial versions of RansomWhere? only tracked file system events, while it would see each child process generating a (single) file system event, had no notion of parents or process ancestry, and thus not accumulate an accurate ‘parent-based’ count of total encrypted files to trigger an alert.

In order to determine if a process was untrusted, RansomWhere? checks to see if the process is signed by Apple proper, is from the Mac App Store, etc. Since process creations were not monitored, these (CPU-intensive) checks were performed as the file system events were delivered. While caching the results (i.e. “pid xyz maps to a trusted binary”) helped, this still inherently introduced some latency and inefficiencies into RansomWhere? resulting in CPU spikes and “slower” ransomware detection.

To keep things simple, the original version of RansomWhere? did not track or monitor process creations. That is to say, it just monitored the file system for file system events related to encryption. And while the file system events it monitored (e.g. file creations) did have a process identifier (pid) associated with each, this inherently resulted in the following limitations:

Have a comprehensive and accurate pid to process path mapping.

Track and accumulate the number of encrypted files created by child processes, even if each child only created a single encrypted file.

Classify processes (as trusted or not), at the time of their creation, instead of when they begin creating (possibly encrypted) files.

I previously blogged about methods of monitoring process creation, and incorrectly concluded that “all user-mode options (to the best of my knowledge)” were not adequate, and thus process monitoring should be done from ring-0.

Adding a kernel extension (kext) into RansomWhere? just to perform process monitoring seemed like overkill, so I decided to re-examine various user-mode options. Turns out, a method I had previously discounted (as I couldn’t get it working consistently) – process monitoring via OpenBSM, provides an adequate means to track process creation from user-mode!

An upcoming blog post will go into the details, as OpenBSM auditing on macOS is rather poorly documented and I found it rather non-trivial to get it correctly handle the various ways that processes may be created on macOS (fork, exec, posix_spawn, etc).

Cool, so I had and effective way to monitor process creations in user-mode! I integrated this new code into RansomWhere? and installed the new version on my personal/developer computer….then went back to Twitter so see what collective InfoSec community was arguing about that day 😉 Randomly, about a day later, my box panicked.

This panic, mostly annoyed me (IDA Pro, why you no have auto-save!?!), but I didn’t’ think too much of it, as my Mac locks up or misbehaves from time to time (Russia: fix your EFI implant! jeez). When a second panic happened a few days later though, I decided to investigate.

From Panicked Confusion to Understanding
When the kernel panics, if one is lucky, a usable panic report is created. This is saved in the /Library/Logs/DiagnosticReports directory (look for a file; Kernel_<date>_<computerName>.panic). The Console app also shows such reports – just click on the ‘System Reports’ icon.

Looking at kernel panic report can be somewhat confusing. But once we break it all down, it’s really not too bad!

First, let’s zoom in on the panic type – it’s “type 14=page fault”… which likely means something so do with an invalid memory read or write. The “Fault CR2: 0xffffff803db4f000” shows the memory address that was invalidly accessed triggering the unhandled page fault.

Next we have the registers. For now, just focus on RIP, the 64-bit program counter register. This will hold the address of the actual faulting instruction that caused the panic (here, a page fault). Its value is 0xffffff800892544b.

Next up, we have the backtrace. Think of these as the list of instructions that brought us to the panic; basically the flow of execution (at the function level) that led to the faulting instruction.

Following this, is the process that triggered the crash. syslogd… hrmm, odd. More on this later.

After the kernel version (macOS Sierra 10.12.3), is the kernel slide. This is the value that the kernel was shifted in memory due to ASLR. We need this value to map the faulting instruction in memory, to the faulting instruction on disk (i.e. in the kernel binary image).

Ok, we’ve now got what we need from the kernel panic (most notably a faulting instruction’s address), so time to dive into IDA. Fire it up and disassemble to kernel proper (/System/Library/Kernels/kernel). Kernel disassembly; how exciting, ya!?

First thing, rebase the kernel so that the addresses aligns with the ASLR’d address in the panic crash report. In IDA, to rebase a program, click ‘Edit’ -> ‘Segments’ -> ‘Rebase Program’

At the time of the kernel crash, the panic file shows it was ASLR‘d (slid), by 0x0000000008200000. Turns out you have to add 0x100000 this value to get the ‘on-disk’ file image of the kernel to properly align in IDA. Thus we enter 0xFFFFFF8008300000 as the rebase value in IDA.

Once the kernel disassembly in IDA has been rebased, hit ‘G’ to bring up the ‘Jump to address’ popup. Enter the address of the faulting instruction, that was we extracted (from RIP) in the kernel panic report, 0xffffff800892544b

This should take you straight to the faulting instruction: cmp byte ptr [rbx+r13+2], 0

Cool so now we know the address that caused the kernel to panic. Looks like it’s dereferencing some pointer base-value (RBX) that has an offset (R13) and a 2 added to it, and checking if the byte at that memory address is zero. Recall that the kernel panicked due to an unhandled page fault. Considering the faulting instruction; it seems reasonable to assume that the instruction results in an invalid (likely ‘out of bounds’) read.

From the IDA screenshot, one can see that the faulting instruction is within the audit_arg_sockaddr function. Lucky for us, this part of the macOS XNU kernel is open-sourced in bsd/security/audit/audit_arg.c (mahalo Apple!). Here’s the entire code of the audit_arg_sockaddr function from the macOS 10.12.3 kernel. It’s pretty short:

First thing to note, is that from both the problematic function name “audit_arg_sockaddr”, and location in source code (security/audit/audit_arg.c), it’s easy to see that this code is part of the kernel mode audit ‘framework.’ Though normally not executed, when any user-mode program (such as RansomWhere? or Apple’s praudit utility) begins auditing events, the audit_* functions in the kernel will be invoked. As their name implies, these functions generate audit events (such as when a socket is created, bound, etc). Note that one has to be root in order to consume audit events. This explains that while the kernel panic was triggered by the syslogd process – indirectly the new auditing code in RansomWhere?, was ‘responsible’ as it directed the kernel to ‘turn on’ auditing in ring-0.

So what is the actual bug? Looking at the audit_arg_sockaddr function there is only one line of code where a base-pointer value has two offsets added to it, then is dereferenced to check for 0:

Knowing the layout of a sockaddr_un structure, it’s easy to see that the +2 in the cmp instruction is how the code is getting the offset to the sun_path (which is at offset 2 within a sockaddr_un structure, such as the one stored in the ‘sun’ variable). That is to say, +2 in the disassembly of the faulting instruction ‘matches’ to sun->sun_path in the source code.

What about the RBX and R13 registers in the faulting instruction? One is obviously the pointer to the sockaddr_un structure (‘sun’) while the other is a length (‘slen’).

Looking a few instructions back in IDA disassembly, it’s easy to see that RBX is the pointer to the sockaddr_un structure. For example, it is dereferenced in multiple places to both extract the length of the structure and ‘switch on’ the socket family type. Below is a snippet of disassembly, a few lines before the faulting instruction, which shows the code extracting the socket family type (‘sun->sun_family’), and jumping to the logic which handles sockets of type AF_UNIX. Recall that the ‘sun_family’ member is at offset 0x1 within the sockaddr_un structure:

Since we’ve shown that RBX holds the ‘sun’variable, a pointer to a copy of the sockaddr_un structure that is being audited, this just leaves R13. By process of elimination, we can assume it holds the value of the ‘slen’ variable. Looking back a few instructions before the faulting instruction in the disassembly confirms this:

Since we know that RBX holds a pointer to a sockaddr_un structure, dereferencing the byte at offset 0 (movzx r13d, byte ptr [rbx]) into R13 is extracting the size, or length (‘sun_len’). The next instruction then adds -2 (0FFFFFFFFFFFFFFFEh) to the length. This value is then checked in order to validate that the size of structure (‘sun_len’) is at least 2. (Since the offset to the path of the socket starts at offset 0x2, it makes sense to validate that the size is at least 2!). We can match this these assembly instructions with the following lines in the audit_arg_sockaddr function’s source code:

Ok, so we’ve shown that the R13 register holds the value of ‘slen’. This is the size or length of the socket structure, minus 2 (to account for the ‘sun_len’ and ‘sun_family’ members of the structure). In other words, its the length of the name, or path, of the unix socket (‘sun_path’).

To summarize:

faulting instruction: cmp byte ptr [rbx+r13+2], 0

RBX: pointer to a sockaddr_un structure (‘sun’)

R13: length of the structure, minus 2 (length of ‘sun_path’)

+2: the offset where the name or path of the socket begins (sun->sun_path).

Now that we’ve identified what each register represents in the faulting instruction, let’s return to the panic report to extract their values to understand why the panic occurred.

Since the kernel panicked with an unhandled page fault, due to a cmp instruction, it’s safe to assume that an invalid memory address was accessed. Since we know the value of the registers at the time of the crash as well as the faulting instruction, we can calculate this address:

RBX (0xffffff803db4eff0), is very close to (specifically 0x10 bytes before) the end of page boundary

R13 is two less than 0x10

Adding RBX + R13 + 2 results in an address that falls on the first byte of a new page

…and recall that the panic crash dump contained “Fault CR2: 0xffffff803db4f000“ …which is the invalid memory address that was accessed! This matches the address we just (re)computed 🙂

What is happening should be clear at this point! A sockaddr_un structure of size 0x10 (0xE + 2), is allocated at the very end of a memory page (00xffffff803db4e000 -> 0xffffff803db4efff). The faulting instruction, cmp byte ptr [rbx+r13+2], 0, is trying to access a byte of memory, one byte outside the sockaddr_un structure…which in this case is at an address on a new page that is doesn’t exist! Opps unhandled page fault => kernel panic!

If this isn’t 100% clear, (took me a minute to see) let’s look at an example using the source code:

A socket structure comes into the audit_arg_sockaddr function

It has a size of 16 bytes (0x10), as specified in the ‘sun_len’ member of the structure

The ‘slen’ variable is calculated as follows: slen = sun->sun_len – offsetof(struct sockaddr_un, sun_path), The offfsetof macro simply returns the offset into a structure for a member. For example, the offset of ‘sun_path’ is 2, so 2 is returned. So ‘slen’ is the total length of the structure, minus the space taken up by the sun_len and sun_family. That is to say, it is the length of the socket name (sun_path)

The code checks to make sure the path is NULL-terminated via: sun->sun_path[slen] != 0. This results in a classic ‘off-by-one’! Why? recall the size of this socket structure in this example is 16 bytes, meaning ‘slen’ will be 14 (16 – 2). However, since the code is checking the sun_path member (which starts at 2), the cumulative memory offset that is referenced is: sun[2 (for ->sun_path), + 14] => sun[16]. As computers index starting at 0 (as opposed to 1), dereferencing ‘sun[16]’ will actually try to read the byte 17 bytes from the start of ‘sun’, the socket structure. However the length of the structure is only 16. Thus reading a byte 17 bytes in, is one byte outside the socket structure. And while this generally won’t be an issue, if socket structure is allocated precisely at the end of page, this code will try to access a byte from a memory address on a new page that doesn’t’ exist. Sure, somewhat of a rare condition, but once RansomWhere? was consuming audit events (and thus, causing the audit_arg_sockaddr() to be invoked by the kernel), its a panic was triggered about once a day.

Actually, I think the easiest way to understand this all is diagrammatically!

Ok, so we’ve got classic off-by-one. Is is exploitable? As it’s just a off-by-one in read instruction (versus a write), it doesn’t appear to be. However, malware could use this to crash a system that has auditing enabled – which is kind of neat! And of course, anytime that user-mode code can trigger a kernel panic; ya not good!

You may be wondering if this off-by-one is simply a naive programming mistake. But I think there is more to it as it appears to be intentional. Like, not in malicious sense of the word, but due to the complexity that is name lengths and structure sizes of AF_UNIX sockets.

It turns out that determining the length isn’t trivial! Why? Well for one, the name or path, (‘sun_path’) doesn’t have to be null terminated: “it need not be null delimited, although it is recommended that it is.” The StackOverflow posting, “proper length of an AF_UNIX socket when calling bind()” sheds some insight into the methods that are used (sometimes incorrectly!) to calculate the length of a unix socket’s name:

And moreover, while the macOS man page for unix sockets doesn’t contain this, (man unix) the linux man pages has a section titled ‘bugs’ and states:

“However, there is one case where confusing behavior can result: if 108 non-null bytes are supplied when a socket is bound, then the addition of the null terminator takes the length of the pathname beyond sizeof(sun_path)”

Things really get crazy when we look at linux’s source code. Specifically the implementation of the unix_mkname function which checks a “unix socket name”:

static int unix_mkname(struct sockaddr_un *sunaddr, int len, unsigned int *hashp)
{
… /*
* This may look like an off by one error but it is a bit more
* subtle. 108 is the longest valid AF_UNIX path for a binding.
* sun_path[108] doesn’t as such exist. However in kernel space
* we are guaranteed that it is a valid memory location in our
* kernel address buffer.
*/
((char *)sunaddr)[len] = 0;
len = strlen(sunaddr->sun_path)+1+sizeof(short);
return len;

So we can see they (linux) handle this “this sort of messiness” with an intentional off-by-one. However, they note, “in kernel space we are guaranteed that it is a valid memory location in our kernel address buffer”… so it’s ok?

I’m guessing the code in Apple XNU/BSD audit_arg_sockaddr function is similar to linux’s unix_mkname…in the sense that the off-by-one is intentional. Of course I have no proof of this, but it seems at least somewhat plausible! Unfortunately, on macOS we’ve shown the that “the guarantee” that the memory beyond the socket structure is valid, does not always hold, and that a kernel panic will occur.

Conclusions
After discovering this bug, and uncovering its root cause I reported it to Apple:

They told me, it would be patched in macOS 10.12.4 which was just released today. In the next part of the blog, we’ll briefly discuss how the bug was patched. More excitingly though, we’ll also show how a line of code (bcopy(sa, &ar->k_ar.ar_arg_sockaddr, sa->sa_len)), a just a few lines up from the off-by-one bug, results in an exploitable kernel heap-overflow: