[This post was written by ViRb3, if you want to post on this blog you can! Go here for more information…]

A month ago I was at the annual Hack Cambridge. Apart from all the programming and social fun I had, I also stumbled upon a daunting CTF challenge made by a team from Avast. In fact, it intrigued me so much that I took it home and finished it here. Among the puzzles there was a particularly interesting one: a binary that self-decrypted its code twice to reveal a secret message! We will solve that level today, with the help of x64dbg. More info about the challenge at the end.

We are left with the following text after completing the previous parts of the level:

There are two main ways to approach the situation: we can use an x86 emulator (e.g. Unicorn), or use a debugger to hijack any program’s execution flow and replace its instructions with the ones given in our dump. I found the latter method a lot more fun, and even potentially faster, so in this solution we will stick to it.

First, we have to convert the dump to plain bytes. I did this by hand, and separated the hex dump and memory dump like this:

Next, we have to find ourselves some executable space. We start up x32dbg (not x64dbg, since we are working with 32-bit code), and open any 32-bit executable. Let’s use x32dbg.exe itself.

The process initializes, and we stop at the System breakpoint:

We now have to insert our first dump at the origin (current execution point) using Ctrl+Shift+V, or right-click > Binary > Paste (Ignore Size). We end up with the following block of instructions:

Now, it is very important to paste the hex dump and memory dump bytes exactly 0xFDE bytes apart (the distance between 0x00402000 and 0x00401022), so the original structure stays intact. The easiest way to do so is selecting the last instruction of the first block (HLT), pressing Ctrl+G or Go to > Expression, and appending +FDE. For me the field is: 7714DB02+FDE.
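The offset can be double-checked outside the debugger. What follows is only a minimal sketch: the constant and function names are made up for illustration, 0x00401022 is assumed to be the original address of the last instruction of the first block, and the live address passed in is whatever your own session shows.

```cpp
#include <cassert>
#include <cstdint>

// Addresses taken from the original dump; the names are illustrative only.
constexpr std::uint32_t kOrigLastInstr   = 0x00401022; // last instruction of the first block
constexpr std::uint32_t kOrigSecondBlock = 0x00402000; // start of the second block

// Distance that has to be preserved between the two pasted blocks.
constexpr std::uint32_t block_gap()
{
    return kOrigSecondBlock - kOrigLastInstr;
}

static_assert(block_gap() == 0xFDE, "matches the offset used in the text");

// Where the second dump goes, given where the last instruction of the
// first block ended up in the live process.
inline std::uint32_t second_block_address(std::uint32_t liveLastInstr)
{
    // The target must not wrap around the 32-bit address space.
    assert(liveLastInstr <= UINT32_MAX - block_gap());
    return liveLastInstr + block_gap();
}
```

Calling second_block_address(0x7714DB02) gives the same result as typing 7714DB02+FDE into the Go to dialog.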

This will lead us to the exact location where we should paste the second dump. After pasting it looks like this: (end trimmed in screenshot)

We note the beginning address of this block - 7714EAE0 - for reference, and go back to the origin (Numpad *).

Now, we step through the first instructions, until we reach CALL EAX:

We look at the EAX register: it is 00402000. Does that ring a bell? This is the address of the second block (check the original dump). In our case, however, this address is invalid, and we have to replace it with the real address we wrote down a moment ago. We double-click on the register value and change it. For me that was 7714EAE0. We step into the call and continue stepping over, until we are back in our first block.
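Every address that the dump hard-codes has to be translated in the same way, so it may help to see the rule written down once. This is a sketch under assumptions: the helper name is hypothetical, 0x00401000 is taken as the start of the first block in the original dump, and 0x7714DAE0 as the start of the pasted first block in this particular session.

```cpp
#include <cassert>
#include <cstdint>

// Translate an address from the original dump into the live process.
// origBase: where the first block started in the dump.
// liveBase: where the first block was pasted in the debuggee.
inline std::uint32_t rebase(std::uint32_t origAddr,
                            std::uint32_t origBase,
                            std::uint32_t liveBase)
{
    // Only addresses at or after the start of the dump make sense here.
    assert(origAddr >= origBase);
    return liveBase + (origAddr - origBase);
}
```

With those values, rebase(0x00402000, 0x00401000, 0x7714DAE0) yields 0x7714EAE0, the value we put into EAX.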

Now comes a tricky bit. Notice the following instruction:

It replaces the byte at a given address. This isn’t usually a problem, but in our case it will raise an exception. The reason is that we are currently in the .text section, which is executable code, and it cannot be overwritten! To fix this, we have to select the memory pages that correspond to this section and mark them all as FULL ACCESS, or at least give them WRITE ACCESS.

In x64dbg we do this by right-clicking the above instruction > Follow in Memory Map. We then right-click the highlighted page > Set Page Memory Rights > Select All > FULL ACCESS > Set Rights and close the window.

We can now return to the CPU tab.

Before we continue, we analyze the following instructions:

While not necessary, we can deduce that this is essentially an XOR decryption loop.
The code enclosed between 7714DAF0 - 7714DAF9 will loop 0x64 (100) times. We surely don’t want to step through that manually, so we select the instruction after the jump (sub cl,5F) and press F2 (or right-click > Breakpoint > Toggle) to place a breakpoint. Now Run the program (F9). When we break, we step through the next few instructions until we jump to the second block, which is now decrypted.
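For readers who have not met this pattern before, here is roughly what such a loop does, written in C++ rather than assembly. It is a sketch, not a transcription of the challenge code: a single constant key byte is assumed, and the actual key and how it is derived are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR the first `count` bytes of `data` with `key`, in place.
// XOR is its own inverse, so the same routine both encrypts and decrypts.
inline void xor_in_place(std::vector<std::uint8_t>& data,
                         std::size_t count,
                         std::uint8_t key)
{
    assert(count <= data.size());
    for (std::size_t i = 0; i < count; ++i)
        data[i] = static_cast<std::uint8_t>(data[i] ^ key);
}
```

Running it twice with the same key restores the original bytes, which is why the binary can ship its code in encrypted form and fix it up at run time.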

We see the following instructions:

Clearly even more decryption? If we check EAX+A we see that it leads to the code in the first block!
Pay attention to the end of this routine: a combination of PUSH + RET becomes a JMP, since the RET returns to the value on top of the stack, and here that is the value PUSH just pushed (EAX).
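The PUSH + RET trick is easy to convince yourself of with a toy model. The struct below is purely illustrative (it is not an emulator and models nothing beyond EAX, EIP and a stack of 32-bit values).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Just enough CPU state to compare PUSH EAX + RET with JMP EAX.
struct ToyCpu
{
    std::uint32_t eax = 0;
    std::uint32_t eip = 0;
    std::vector<std::uint32_t> stack;

    // PUSH EAX: put the register value on top of the stack.
    void push_eax() { stack.push_back(eax); }

    // RET: pop the top of the stack into the instruction pointer.
    void ret()
    {
        assert(!stack.empty());
        eip = stack.back();
        stack.pop_back();
    }

    // JMP EAX: load the instruction pointer directly.
    void jmp_eax() { eip = eax; }
};
```

After push_eax() followed by ret(), EIP holds the same value that jmp_eax() would have produced, and the stack is back to where it started.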

Another tricky bit is the XOR instruction just before that. In terms of the original dump, this will set the last two hex digits of the address to 0, and so land at the beginning of the hex dump block (00401000). In our case, however, this XOR will mess things up, since our (real) address doesn’t end with 00. To fix this, we step in until we reach the PUSH instruction, and then change EAX to the address of the first instruction of the first block (7714DAE0 for me).

We step over the PUSH and RET instructions and we land back at the first block.
Again, we look at the code:

We already know what these instructions do. Step over to CALL EAX, change EAX to the address of the second block (7714EAE0), step in once to land at the second block, then step over until you come back in the first block.

Now, we examine the code:

Same decryption, with a different XOR value. We breakpoint directly on the CALL EAX, Run (F9), and step in once. We land at the second block.

We now analyze the final routine in this binary challenge:

By carefully reading the instructions, we notice something unusual: the byte at EBX is overwritten every time in the loop, and in the end even overwritten with F4, which in turn will end the program execution. It is therefore safe to bet that the values of EBX (or DL) will be interesting for us.

To log these values, we set a breakpoint at our point of interest (mov byte ptr ds:[ebx],dl). We then head to the Breakpoints tab, find our breakpoint, and right-click > Edit. We can now specify a Log Text, which will be logged every time x64dbg executes this instruction. In our case, we want it to log the value of DL, so we set Log Text to: {DL}. String formatting occurs inside the curly brackets, where you can insert an expression. The expression here is the DL register. We also set the Break Condition to 0, so we only log, and not break.
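To see why a logging breakpoint is needed at all, consider a small model of the routine. Everything in it is hypothetical: the type and function names are invented, the logged values are assumed to be ASCII characters, and the input is supplied by the caller rather than taken from the challenge.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One byte of "memory" that remembers every value written through write(),
// playing the role of the logging breakpoint on mov byte ptr ds:[ebx],dl.
struct LoggedByte
{
    std::uint8_t value = 0;
    std::string log;

    void write(std::uint8_t dl)
    {
        log.push_back(static_cast<char>(dl));
        value = dl;
    }
};

// Replays a sequence of DL values, then the final F4 store. That last store
// is a different instruction, so it does not produce a log entry.
inline LoggedByte replay(const std::vector<std::uint8_t>& dlValues)
{
    LoggedByte cell;
    for (std::uint8_t dl : dlValues)
        cell.write(dl);
    cell.value = 0xF4; // mov byte ptr ds:[ebx],F4

    // Every loop iteration produced exactly one log entry.
    assert(cell.log.size() == dlValues.size());
    return cell;
}
```

Inspecting memory after the routine finishes only ever shows F4; the interesting bytes exist for one iteration each, so they have to be captured at the moment they are written.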

For more information about string formatting, check the documentation.

We go back to the CPU tab and put an extra breakpoint on the instruction after the JNE (mov byte ptr ds:[ebx],F4).


Yesterday I was debugging some programs and after restarting I saw that the status label stayed stuck on Initializing. At first it didn’t seem to impact anything, but pretty soon after that other things started breaking as well.

Reproduction steps:

Load some debuggee

Hold step for some time

Press restart

Repeat until the bug shows

Observed behaviours:

The label stays stuck on Initializing

The label stays stuck on Paused (appears to be more rare)

A shot in the dark

After getting more or less stable reproductions I started to look into why this could be happening. On the surface the TaskThread appeared to be correct, but since the WakeUp function was probably failing I put an assert on ReleaseSemaphore, which should trigger the TaskThread:

template<typename F, typename... Args>
void TaskThread_<F, Args...>::WakeUp(Args... _args)
{
    ++this->wakeups;
    EnterCriticalSection(&this->access);
    this->args = CompressArguments(std::forward<Args>(_args)...);
    LeaveCriticalSection(&this->access);
    // This will fail silently if it's redundant, which is what we want.
    if(!ReleaseSemaphore(this->wakeupSemaphore, 1, nullptr))
        __debugbreak();
}
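The comment in WakeUp leans on a documented property of Win32 semaphores: a release that would push the count past the maximum fails and leaves the count unchanged. The class below is a portable model of just that counting rule, assuming a maximum count of 1 as in the CreateSemaphoreW call shown further down; it is not the x64dbg implementation and does no actual blocking.

```cpp
#include <cassert>

// Models only the counting rule of a Win32 semaphore, not the waiting.
class SemaphoreModel
{
public:
    SemaphoreModel(long initial, long maximum)
        : count_(initial), max_(maximum)
    {
        assert(initial >= 0 && initial <= maximum);
    }

    // Like ReleaseSemaphore(h, n, nullptr): if the new count would exceed
    // the maximum, nothing changes and the call reports failure.
    bool release(long n = 1)
    {
        if (n <= 0 || count_ + n > max_)
            return false;
        count_ += n;
        return true;
    }

    // Like a wait with zero timeout: consumes one count if one is available.
    bool try_acquire()
    {
        if (count_ == 0)
            return false;
        --count_;
        return true;
    }

    long count() const { return count_; }

private:
    long count_;
    long max_;
};
```

With SemaphoreModel s(0, 1), the first release() succeeds and a second one, issued before anybody has waited, returns false.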

I tried to reproduce the bug and unsurprisingly the assert triggered! At this point I suspected memory corruption, so I inserted a bunch of debug tricks in the TaskThread to store the original handle in a safe memory location:

struct DebugStruct
{
    HANDLE wakeupSemaphore = nullptr;
};

template<int N, typename F, typename... Args>
TaskThread_<N, F, Args...>::TaskThread_(F fn, size_t minSleepTimeMs, DebugStruct* debug)
    : fn(fn), minSleepTimeMs(minSleepTimeMs)
{
    //make the semaphore named to find it more easily in a handles viewer
    wchar_t name[256];
    swprintf_s(name, L"_TaskThread%d_%p", N, debug);
    this->wakeupSemaphore = CreateSemaphoreW(nullptr, 0, 1, name);
    if(debug)
    {
        if(!this->wakeupSemaphore)
            __debugbreak();
        debug->wakeupSemaphore = this->wakeupSemaphore;
    }
    InitializeCriticalSection(&this->access);
    this->thread = std::thread([this, debug]
    {
        this->Loop(debug);
    });
}

Now I started x64dbg and used Process Hacker to find the _TaskThread6_XXXXXXXX semaphore to take note of the handle. I then reproduced and found to my surprise that the value of wakeupSemaphore was 0x640, the same value as on startup!

However when I checked the handle view again, 0x640 was no longer the handle to a semaphore, but rather to a mapped file!

Pushing our luck

This started to smell more and more like bad WinAPI usage. Tools like Application Verifier exist to find these kinds of issues, but I could not get it to work so I had to roll my own.

Winner winner chicken dinner!

The actual bug turned out to be in TitanEngine. The ForceClose function is supposed to close all the DLL handles from the current debug session, but all of these handles were already closed at the end of the same LOAD_DLL_DEBUG_EVENT handler.

But how does the semaphore handle value come to be the same as a previous file handle? The answer to that puzzling question is given when you look at the flow of events:

LOAD_DLL_DEBUG_EVENT gets a file handle that is stored in the library list.

LOAD_DLL_DEBUG_EVENT immediately closes said file handle during the debug session.

The static initializer for the TaskThread is called when the debugger pauses for the first time and the semaphore is created with the same handle value as the (now closed) file handle from the LOAD_DLL_DEBUG_EVENT.

All goes well, until the ForceClose function is called and the file handle from LOAD_DLL_DEBUG_EVENT is closed once again.

Hell breaks loose because the TaskThread breaks.

Now for why this doesn’t happen every single time (sometimes I had to restart the debuggee 20 or more times), the handle value is ‘randomly’ reused from the closed handle pool and it’s kind of a coin toss as to when this happens. I found that you can greatly increase the likelihood of this happening when your PC has been on for a few days and you have 70k handles open. Probably the kernel will use a more aggressive recycling strategy when low on handles, but that’s just my guess.
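The flow of events can be replayed with a toy handle table. This is a deliberately simplified sketch and says nothing about how the Windows kernel really allocates handles: all names are invented, and a freed value is always reused first so that the bug shows up on every run instead of being a coin toss.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy handle table that always recycles the most recently freed value.
class ToyHandleTable
{
public:
    typedef std::uint32_t Handle;

    Handle open(const std::string& kind)
    {
        assert(!kind.empty()); // an empty kind means "not open" in kindOf()
        Handle h;
        if (!freeList_.empty())
        {
            h = freeList_.back();
            freeList_.pop_back();
        }
        else
        {
            h = next_;
            next_ += 4;
        }
        live_[h] = kind;
        return h;
    }

    // Returns false if the value does not refer to an open handle.
    bool close(Handle h)
    {
        std::map<Handle, std::string>::iterator it = live_.find(h);
        if (it == live_.end())
            return false;
        live_.erase(it);
        freeList_.push_back(h);
        return true;
    }

    // Empty string means the value does not refer to an open handle.
    std::string kindOf(Handle h) const
    {
        std::map<Handle, std::string>::const_iterator it = live_.find(h);
        return it == live_.end() ? std::string() : it->second;
    }

private:
    std::map<Handle, std::string> live_;
    std::vector<Handle> freeList_;
    Handle next_ = 4;
};

// Replays the steps from the text and reports what the TaskThread's handle
// refers to once the stale close has happened.
inline std::string replay_double_close()
{
    ToyHandleTable table;
    ToyHandleTable::Handle file = table.open("file");     // LOAD_DLL_DEBUG_EVENT stores this value
    table.close(file);                                    // ...and closes it right away
    ToyHandleTable::Handle sem = table.open("semaphore"); // TaskThread initializer reuses the value
    table.close(file);                                    // ForceClose closes the stale value again
    return table.kindOf(sem);                             // empty: the semaphore is gone
}
```

The second close succeeds because the value is valid again; it just belongs to somebody else by then. Once it is freed, the next allocation can pick it up, which is consistent with 0x640 later showing up as a mapped file.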

If you are interested in trying to reproduce this at home, you can use the handle_gamble branch. You can also take a look at the relevant issue.

So… this all began about a month ago, when mrexodia came into our Gitter, explaining that he’d like to replace Capstone in x64dbg. He asked whether we had considered writing a Capstone emulation interface on top of Zydis, allowing for drop-in replacement. We weren’t opposed to the idea, but after checking out the Capstone interface, decided that full emulation and mapping of all structures, flags and constants would be far from trivial and extremely error prone. This is especially true since nobody in our team had previous experience with Capstone and how it behaves in all the edge cases that might come. So instead, we decided to go on the journey of just contributing the port to x64dbg ourselves!

I checked out the repo and wiki for a guide on how to build the project, located one, followed the instructions and a few minutes later, found myself standing in front of a freshly built x64dbg binary. The port itself was pretty straightforward. I began by reworking the Capstone wrapper class to no longer use Capstone, but Zydis instead. The rest of the work mainly consisted of replacing Capstone constants and structure accesses with their Zydis equivalents in places where the debugger and GUI code didn’t just use the abstraction, but accessed disassembler structures directly. I really won’t bore you with the details here, it was mostly search and replace work.

After completing the basic port, I threw my ass into the x64dbg IRC and had a little chit-chat with mrexodia. He suggested that we should copy & paste the old instruction formatter (CsTokenizer, the part of x64dbg that translates the binary instruction structure to the textual representation you see in the GUI) to a second file, using both Capstone and Zydis simultaneously, comparing their outputs. I quickly implemented that idea and started diffing.

Every time I found a discrepancy between Capstone and Zydis, I added a whitelist entry, recompiled and continued diffing, throwing various different binaries and random data at it. This process not only turned up various issues in my ported CsTokenizer, it also found us 3 bugs in Zydis and >20 in Capstone, some of which have open issues created in 2015 connected to them.
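The diffing loop itself is simple enough to sketch. The code below does not use the Capstone or Zydis APIs; the two formatters are passed in as plain callables, and treating a whitelist entry as a pair of (old output, new output) strings is an assumption made for this example, not a description of the actual whitelist.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;
typedef std::function<std::string(const Bytes&)> Formatter;
typedef std::set<std::pair<std::string, std::string> > Whitelist;

// Runs both formatters on every input and collects the inputs whose outputs
// differ, skipping differences that were already reviewed and whitelisted.
inline std::vector<Bytes> diff_formatters(const Formatter& oldFmt,
                                          const Formatter& newFmt,
                                          const std::vector<Bytes>& inputs,
                                          const Whitelist& whitelist)
{
    assert(oldFmt && newFmt); // both callables must be set

    std::vector<Bytes> mismatches;
    for (std::size_t i = 0; i < inputs.size(); ++i)
    {
        const std::string a = oldFmt(inputs[i]);
        const std::string b = newFmt(inputs[i]);
        if (a == b)
            continue; // both agree, nothing to look at
        if (whitelist.count(std::make_pair(a, b)) != 0)
            continue; // known and accepted difference
        mismatches.push_back(inputs[i]);
    }
    return mismatches;
}
```

Each remaining mismatch is either a bug in one of the two sides or a harmless formatting difference; in the second case it goes on the whitelist and the run is repeated.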

So, what did x64dbg gain from the switch?

Most importantly: significantly more precise disassembly

As such, less room for misleading reversers

Support for more X86 ISA extensions

Support for many lesser known and undocumented instructions

We collected and diffed our data-tables from and against various different sources, such as Intel XED, LLVM and even did a full sweep through the Intel SDM for the sake of checking side-effects of all instructions

However, in a project like x64dbg, that probably only affects the speed of whole module analysis (CTRL+A)

A decrease in binary size

Zydis is about ⅓ the size of Capstone (CS X86 only, all features compiled in)

Not that anyone would practically have to care these days

Nevertheless, low-level people tend to have a thing for small binaries

Finally, aside from all the negativity, I would like to make it clear that we very much appreciate all the work done in Capstone. The project simply has a different focus: it’s a great library if you’re looking into supporting many different target architectures. Zydis, on the other hand, is focused on supporting X86 — and supporting it well.

If you’re interested in checking out our work outside of x64dbg, you can take a look at the repo.