Friday, August 3, 2007

Abstract

What follows is a detailed account of the creation of a stack-overflow exploit targeting TinyOS 2.x on a Tmote Sky wireless sensor node, which uses the Texas Instruments MSP430 microcontroller. An example application is used as a target, rather than one which might be found in the wild. A firm knowledge of C, assembly language, and embedded systems architectures is assumed, but the details of NesC, the MSP430, and TinyOS are reviewed for those new to this platform. Finally, preventative measures are discussed.

Before We Begin

By default, the NesC compiler attaches the inline keyword to every function that it generates, even if that function began as a C function. To prevent this, use __attribute__ ((noinline)). Without this attribute, you'll go through hell trying to understand a function with twenty others embedded within it. Note that inlining isn't what makes these attacks possible; disabling it just makes them easier to understand.

Disassembly

I'll begin with a short example which uses gdb on a local image of a simple TinyOS application. This requires msp430-gdb, but does not require a JTAG debugger or any physical hardware.

Note that the function acts just as its C equivalent would. red+0 loads the constant #1 from the constant generator r3 into r15, the register which contains the first parameter to a C function in the MSP430 version of GCC. A call is then made to 0x53fc, which we know as Leds.set().

Disassemble also accepts an address as its argument, so let's take a peek under the hood and see what Leds.set() does.

which suggests that commands are rendered straight to C functions with the call keyword merely calling the function. In actual usage, call is used not to determine the way in which the function is called, but whether it's allowed. A command may not be called from an interrupt handler unless it also possesses the async keyword. Note that set() would have been automatically inlined if it had not been called from multiple source functions.

Inline Assembly

Inline assembly language is quite easy as well. Suppose that we would like to find the value of the stack pointer:

In the above example, x/bx means "examine as hexadecimal bytes." I could also have used x/hx to examine half-words (in gdb a word is 32 bits, not 16, so a half-word is the MSP430's 16-bit word), but the little-endian nature of the target platform makes that a little confusing, as the bytes are printed out of order.

What this means is that we can declare a C byte array of {0x0f,0x41,0x30,0x41} or a string of "\x0f\x41\x30\x41" at any even address and execute a call to its address in order to execute it. The even addressing is essential, as r0--the PC--cannot hold an unaligned address and unaligned word accesses are not supported by the MSP430. Because this architecture is little endian, the code as a 16-bit integer array is not {0x0f41, 0x3041} but rather {0x410f,0x4130}.
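To make the byte order concrete, here is a small host-side C sketch. It is not one of the original listings, and the helper names le16 and is_word_aligned are made up for illustration. It reassembles the byte string into the 16-bit words the CPU would fetch, and checks the even-address rule.

```c
#include <stdint.h>

/* The four bytes from the text. They decode as MOV r1,r15 followed by RET. */
static const uint8_t code_bytes[4] = {0x0f, 0x41, 0x30, 0x41};

/* Hypothetical helper: read a little-endian 16-bit word, low byte first,
 * independent of the byte order of the machine running this sketch. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns 1 if addr is even, and so usable as an MSP430 code address. */
static int is_word_aligned(uint16_t addr)
{
    return (addr & 1u) == 0;
}
```

Reading code_bytes two bytes at a time with le16() yields 0x410f and then 0x4130, which is the order to use when the code is declared as a 16-bit integer array.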

I've been using gcc and gdb to generate machine code, but the mspgcc project has made a single-instruction assembler available through the web. Remember that it gives results as little-endian words.

Instruction Emulation

It's important to realize that MSP430 assembly language contains many statements which don't exist on the physical chip. Instead they're emulated by translation in the assembler.

For example, suppose we have the following function using inline assembly:

void __attribute__ ((noinline)) setled(){ asm("inv &0x0031");}

INV is an emulated instruction which flips the bits of its destination by XORing them with 0xFFFF. Why should the chip have a separate instruction, when the programmer could simply call XOR #-1,&0x0031? In practice, that is what happens as our disassembly shows:

The above code executes the machine language code to blink the LED by inverting the memory-mapped port at 0x31. The integers of the machlang array may reside anywhere in the memory space, which is to say anywhere in RAM or ROM.

Buffer Overflow Stack Injection

Machine language injection works by virtue of the call stack, which grows downward in TinyOS from nearly the top of RAM (high address) to the bottom of RAM (low address, 0x200). When a function begins, the stack's lowest word contains the return address into the calling function, such that when the function executes the "RET" instruction, the processor copies a value from the address pointed to by R1 (SP) into R0 (PC) and increments R1 by two to shrink the stack by a word.
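The mechanism can be modeled on a workstation. The sketch below is a toy, not MSP430 code: the struct, the 32-byte RAM region based at 0x200, and the function names are all assumptions made for illustration. It only mimics how CALL pushes a return address and how RET pops whatever word sits at SP into the PC.

```c
#include <stdint.h>

#define RAM_BASE  0x0200u
#define RAM_WORDS 16

/* Toy machine state: pc stands in for r0 and sp for r1. */
struct cpu {
    uint16_t pc;
    uint16_t sp;
    uint16_t ram[RAM_WORDS];
};

/* Map a (word-aligned) toy address onto the ram array. */
static uint16_t *word_at(struct cpu *c, uint16_t addr)
{
    return &c->ram[(addr - RAM_BASE) / 2];
}

/* CALL: push the return address, then jump. The stack grows downward. */
static void do_call(struct cpu *c, uint16_t target, uint16_t return_addr)
{
    c->sp -= 2;
    *word_at(c, c->sp) = return_addr;
    c->pc = target;
}

/* RET: copy the word at SP into PC, then add 2 to SP. */
static void do_ret(struct cpu *c)
{
    c->pc = *word_at(c, c->sp);
    c->sp += 2;
}
```

If anything rewrites the word at SP between do_call() and do_ret(), the model "returns" to the new value rather than to the caller, which is the whole trick.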

The following code overwrites the stack's stored copy of the calling PC such that when it returns, control jumps to machlang instead of the calling function:

void __attribute__ ((noinline)) setled(){
  //call it the rude way by overwriting the return address
  int *oldpc=(int*)&oldpc;//point to top of frame
  oldpc++;//inc by 2, not 1
  *oldpc=(int)machlang;//overwrite old PC
  return;//return to machlang, not calling function
}

In the above code, the pointer oldpc is declared and incremented such that it points at the stack value pushed before itself, which is of course the stored PC value that the function will jump to when it returns. When return; is called, the processor jumps not to the calling function but rather to the machine code, causing it to be executed.

Buffer overflow injections work in a similar way, but rather than set the pointer explicitly by C code, they instead have a string that--when copied into a buffer--exceeds the end of the buffer and writes to the next position. The following code does just that, by calling strcpy() on a string composed of the machine language entry address repeated many times.

This is a bit crude, in that it overwrites more than just the stored PC. Note, however, that the string can be dropped in with no specialized code in the copying function. In order to view the success of this, the machlang array must be changed to enter an infinite loop when complete. It cannot successfully return because the copy overwrote more than just the stored program counter it was aimed at. This is an unavoidable side-effect when the stack is as dense as it is on the MSP430, as the null terminator must be copied. Thus, unless the high byte--that is, the latter byte in little endian--of the target address happens to be 0x00, the byte immediately above the address we intend to overwrite must necessarily be clobbered.
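The clobbering can be seen without hardware in a host-side model. The sketch below is an illustration, not the original listing. The frame layout (a 6-byte buffer directly below the stored PC) and all names are assumptions, and the copy is confined to a byte array so that nothing real is overwritten.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame layout, low address first: a 6-byte buffer, the
 * stored PC, then one word belonging to the caller. */
enum { BUF_OFF = 0, BUF_LEN = 6, PC_OFF = 6, NEXT_OFF = 8, FRAME_LEN = 10 };

/* strcpy()-style copy into the simulated frame: no check against BUF_LEN,
 * stops only after writing the terminating zero. The frame_len bound exists
 * only to keep this host-side model memory safe. */
static size_t unchecked_copy(uint8_t *frame, size_t frame_len, const uint8_t *src)
{
    size_t i = 0;
    while (i < frame_len) {
        frame[BUF_OFF + i] = src[i];
        if (src[i++] == 0)
            break;
    }
    return i; /* bytes written, terminator included */
}

/* Read a little-endian word out of the simulated frame. */
static uint16_t frame_word(const uint8_t *frame, size_t off)
{
    return (uint16_t)(frame[off] | (frame[off + 1] << 8));
}
```

Copying the bytes a4 02 repeated four times plus the terminator leaves 0x02a4 in the stored-PC slot, and the terminating zero lands in the low byte of the caller's word just above it.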

Using a JTAG debugger (TI MSP-FET430PIF or TI MSP-FET430UIF), it's trivial to view the stack. You'll notice that the stack is rather shallow, only two functions deep. As 'BlinkC$Timer0$fired' is called without stack parameters--those that don't fit into registers--a simple RET suffices to return past it. If parameters were on the stack, they could be removed with the POP instruction.

Program received signal SIGTRAP, Trace/breakpoint trap.
BlinkC$setled () at BlinkC.nc:99
(gdb) where
#0 BlinkC$setled () at BlinkC.nc:99
#1 0x00004d46 in BlinkC$Timer0$fired () at BlinkC.nc:128
#2 0x00004d46 in BlinkC$Timer0$fired () at BlinkC.nc:128
(gdb)

A Complete Exploit

Now that we've got machine code and a way to force it onto the stack, we are still left with the issue of knowing at which address it will be. On a workstation, desktop, or server, it's common practice to include NOP instructions before the code you intend to execute, such that you can guess at the target address. On x86 processors, this is particularly easy because of support for a byte-length NOP instruction (0x90) and unaligned access.

Wireless sensor nodes and other embedded systems require a different strategy. The payload of a packet is often so small that a single packet barely has room for anything interesting, much less a bunch of word-length NOP instructions (0x4303, which is really MOV r3,r3). Fortunately, these systems emphasize static allocation. Use of malloc() and similar heap allocation is strongly discouraged, to the point that much documentation claims the method doesn't exist.

A consequence of static allocation is that of twenty nodes running the same firmware, all twenty will have every non-stack variable in the same location. This includes the functions which handle reception of an incoming packet. Thus, the easiest way to inject code into a live wireless sensor node with a single 802.15.4 packet is to craft a packet which--when copied over the stack--overwrites the return address with the address of the global copy of itself, not the stack's copy.

Executing the stack's copy is also possible, of course.

Target Application

For an example of a remote exploit, I threw together a simple application that accepts the shortened name of a color--RED, GREN, or BLUE--within a packet and enables the appropriate LED. The code is below:

The first step is to determine where the packet resides in memory on the victim. I suppose it's possible to dig around TinyOS for the symbol of packet, but when debugging symbols might not be available, a more reliable technique is to search for the contents of the last packet sent:

(gdb) x/s 0x2a2
0x2a2: "\006RED"
(gdb)

Trying again for a different packet, at the same address I find

(gdb) x/s 0x2a2
0x2a2: "\006BLUE"
(gdb)

And one last time for the green led, I find

(gdb) x/s 0x2a2
0x2a2: "\006GREN"
(gdb)

At each stage, the light matches the string being given. This gives me both good and bad news. The good news is that my packet gets through; the bad news is that it's misaligned. The "\006" character is at 0x2a2, which means that the packet's string doesn't begin until 0x2a3, which is an odd address. Machine code may only reside at even addresses on the MSP430 and many other processors, with the x86 being a notable exception.

Once the target address is known, crafting an attack is as simple as stuffing the following things into the packet:
1. The executable machine code, even-aligned in the global packet.
2. The entry address of the machine code, even-aligned in the overflow onto the stack, offset such that it overwrites the stored program counter.
3. A terminating null character or word, such that strcpy() or its equivalent doesn't hit flash ROM.

These rules can be quite a juggling act, but expressed for the above example:
1. Executable code should begin at the second letter of the enclosed string, which will be 0x02a4 on the target.
2. 0x02a4 (0xa4 0x02 as bytes) must reside in bytes 7 and 8 of the string.
3. The string must end in zeros.
4. The first letter must not be a zero.
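Most of these rules can be checked mechanically before a packet is sent. The function below is a hypothetical host-side checker, not part of the original exploit. Its name and the minimum length of nine bytes are assumptions. It also extends rule 4 to every byte before the end of the address, since an early zero would stop strcpy() short. Rule 1 depends on the machine code itself and is not checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker for the layout rules above. `s` is the string
 * carried in the packet, `entry` the even address the code will occupy
 * on the target (0x02a4 in the example). Returns 1 if the layout is usable. */
static int payload_ok(const uint8_t *s, size_t len, uint16_t entry)
{
    size_t i;

    if (len < 9 || (entry & 1u))
        return 0;
    /* Rule 4, widened: no zero byte up to and including the address,
     * or strcpy() would stop copying too early. */
    for (i = 0; i < 8; i++)
        if (s[i] == 0)
            return 0;
    /* Rule 2: little-endian entry address in bytes 7 and 8 (index 6, 7). */
    if (s[6] != (entry & 0xff) || s[7] != (entry >> 8))
        return 0;
    /* Rule 3: the string ends in a zero. */
    return s[len - 1] == 0;
}
```

A target address with a 0x00 byte in it would fail this check, which matches the constraint that the address has to survive a string copy.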

While it's not terribly difficult to do these things on paper, it's cleaner to do it in C. First we define our machine code in an array, as we did before:

This has a lot of empty space and doesn't contain as much information as it might. The machine code in the suffix just blinks the LEDs in an infinite while loop, though in practice they blink faster than the human eye can see. The following code packs the machine code and the target address into a single string for sending as a packet. Note that the address and the machine code are differently aligned. This is because the address must be even-aligned within the destination of the strcpy(), while the machine code must be even-aligned in the source.

void __attribute__ ((noinline)) build_exploit(void* vstr){
  char *str=(char*)vstr;
  //machlang has zeroes, so it may not be used before the address.
  //load the machine code, with weird but correct offset.
  memcpy(attack+1,&machlang,16);

  //load the attack address
  memcpy(attack+6,&attackinit,2);

  attack[8]=0;//zero out end, just in case.
  memcpy(str,attack,20);
}

Prevention

Randomizing addresses would make these attacks more difficult to stage, particularly if every node ran a different build, such that no two nodes would store incoming packets at the same address. The compiler could also push an object of random size onto the stack such that it would follow a sort of drunkard's walk to prevent stack code from being jumped to.

A still more effective alternative would be an addition to the MSP430 itself, one that would branch to an exception handler if the program counter were ever outside of flash memory. This isn't a workstation, and there's rarely any reason to execute instructions from RAM. Thus, a simple register configuration which enabled and disabled execution from RAM would make the platform much more difficult to exploit.

As always, null-terminated string functions of unspecified length should never be used. There are other mistakes that make the stack vulnerable to corruption, but this is by and large the most common. strcpy() and its like should have been culled from the C language decades ago, but they're still with us and still being taught in introductory computer science classes.
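As a minimal sketch of the alternative, here is a length-checked copy that can stand in for strcpy() when the destination size is known. The function name and return convention are assumptions rather than any TinyOS API.

```c
#include <stddef.h>

/* Hypothetical bounded replacement for strcpy(): copies at most
 * dst_len - 1 bytes, always terminates dst, and reports truncation.
 * Returns 0 on success, -1 if src did not fit or dst_len is 0. */
static int copy_bounded(char *dst, size_t dst_len, const char *src)
{
    size_t i;

    if (dst_len == 0)
        return -1;
    for (i = 0; i + 1 < dst_len && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';
    return src[i] == '\0' ? 0 : -1;
}
```

With a five-byte destination, "GREN" fits exactly. A longer string is cut to four characters and rejected, and nothing beyond the buffer is written.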

I gave an informal introduction to this technique at the 2007 ACS Control Systems Cyber Security Conference in Knoxville, Tennessee. By far, the most common objections were that cryptography made this un-exploitable in practice or that the fence-line prevented malicious packets from reaching the target system.

Although it's true that cryptographically verifying received packets makes code injection more difficult, it does not make such injection impossible. Key management must have such a strict policy that a stolen node is de-authorized before an attacker can attach a JTAG cable to forcibly grab the key. Further, the JTAG fuse ought to be burned such that the node must be taken off-site for firmware extraction.

Regarding the fence-line, it's just a line in the sand as far as an attacker is concerned. 2.4 GHz amplifiers are rather easy to acquire, and the 150 meter range listed on the radio's spec sheet doesn't apply when an extra amplifier is attached. Even if we assume that the fence-line is effective--such as on a submarine--it's still possible to either bribe or trick an authorized employee into bringing a transmitter within the fence, or a packet sniffer in and then back out.

Thursday, August 2, 2007

Recently, I've been digging into the documentation of Texas Instruments' MSP430 micro-controller family. After covering the CPU itself, I continued into the documentation[1] for the mspgcc project, a port of GCC to the MSP430. After realizing that the ABI used for mspgcc had never been defined in the chip's documentation[3], I dug up the manual[2] for IAR's compiler and compared the two.

I quickly discovered that IAR's compiler wastes registers when passing 16-bit parameters to a C function. By its ABI, the first 16-bit parameter is placed into R12 and the second into R14. R13 and R15 remain unused, as they are reserved for the high words of 32-bit parameters. GCC follows the much more logical route of only assigning a single register to a 16-bit value, such that R15 is used for the first parameter, R14 for the second, R13 for the third, and R12 for the fourth. This allows it to accept four parameters by register, while IAR's compiler will push the third and fourth onto the stack while leaving two clobber registers unused!

To demonstrate this, I compiled a simple C program containing only a function foo(), which returns the sum of its four inputs, and a main() function to call foo(). This was compiled to assembly language using mspgcc 3.2.3 and IAR MSP430 C/C++ Compiler V3.42A/W32.
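A program of the kind described might look like the sketch below. This is a reconstruction. The parameter names, the constants passed in, and the use of a helper in place of main() are assumptions.

```c
/* Reconstruction of the test function described above. On the MSP430 an
 * int is 16 bits, so each parameter needs only a single register. */
int foo(int a, int b, int c, int d)
{
    return a + b + c + d;
}

/* Stand-in for the main() that calls foo(); a plain helper is used here
 * so the sketch can be linked next to another entry point. */
int call_foo(void)
{
    return foo(1, 2, 3, 4);
}
```

Under the mspgcc convention described above, a, b, c, and d would arrive in R15, R14, R13, and R12. Under IAR's, a and b would arrive in R12 and R14, with c and d on the stack.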

In both compilers, four assembly instructions were used to add the values and return the result in the single register of the first parameter, R12 for IAR and R15 for GCC. The table below lists the assembly generated by each compiler, with instructions converted from the GCC format (lowercase, .W omitted) to the IAR format for clear comparison. GCC, by virtue of its more efficient register usage, avoids both having to PUSH.W two parameters onto the stack and having to use the indexed addressing mode, as X(SP), within the function.

Pages 3-72 and 3-73 of the MSP430 Family Guide[3] detail the full cost of these additions, which increase not only the runtime but also the storage requirements of the function. According to those pages, "ADD.W r14,r15" takes 1 cycle and 1 word of memory while "ADD.W 0x2(SP), R12" takes 3 cycles and 2 words of memory. Additionally, each of the two PUSH.W statements required to call foo() in the IAR compiler takes 3 cycles, which are unnecessary in GCC.
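The arithmetic can be tallied in a few lines. The sketch below uses only the per-instruction figures quoted above. It assumes that each stack-passed parameter costs one PUSH.W in the caller plus one indexed ADD.W in place of a register ADD.W, and it ignores any stack cleanup after the call. The names are made up for illustration.

```c
/* Per-instruction costs quoted above from the family guide. */
enum {
    ADD_REG_CYCLES = 1, ADD_REG_WORDS = 1,   /* ADD.W r14,r15      */
    ADD_IDX_CYCLES = 3, ADD_IDX_WORDS = 2,   /* ADD.W 0x2(SP),R12  */
    PUSH_CYCLES = 3                          /* PUSH.W per operand */
};

/* Extra cycles when n parameters travel on the stack instead of in
 * registers: one PUSH.W plus the difference between an indexed and a
 * register ADD.W, per parameter. */
static int extra_cycles(int n)
{
    return n * (PUSH_CYCLES + ADD_IDX_CYCLES - ADD_REG_CYCLES);
}

/* Extra code words inside the callee for the same n parameters. */
static int extra_callee_words(int n)
{
    return n * (ADD_IDX_WORDS - ADD_REG_WORDS);
}
```

For the two parameters that IAR pushes, this comes to 10 extra cycles per call and 2 extra words in foo(), before counting the size of the PUSH.W instructions themselves.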

Texas Instruments' Code Composer Essentials does not suffer from IAR's inefficiency; rather, it uses an ABI similar to but incompatible with GCC. TICCE allocates register R12 for the first parameter, then R13, R14, and R15. The result is returned in R12. GCC uses registers in the opposite order and returns in R15. See the Users Guide[4] for more details.

What's the reasoning behind IAR's design? It makes functions of two 32-bit values easily compatible with those of two 16-bit values, but this compatibility breaks as soon as the third parameter comes into play, which is pushed onto the stack as a single word. If such compatibility were essential, the trick could be maintained by using R13 for the third parameter and R15 for the fourth.