Tuesday, October 28, 2014

This article is about a funny way to obfuscate code that takes advantage of the Windows 64bit capability to manage and run 32bit processes. As we will see, it's a very effective technique that makes analysis really time consuming and annoying.

Windows 64bit natively runs 64bit processes and kernel drivers but, of course, for backward compatibility it can also run old 32bit executables through the WoW64 subsystem. On the x86-64 architecture this is implemented via hardware features offered by the CPU that allow 32bit mode code to switch to 64bit mode and vice versa.

The trick relies on these 32bit/64bit switches: you can craft an executable that contains both 32bit and 64bit code, and you can make the code jump from one to the other at any time. Unfortunately, almost all debuggers seem to be ineffective in dealing with these jumps (only remote kernel debugging using Windbg can step through the code). Disassemblers don't handle the situation very well either, as they are designed to handle only one architecture at a time. Long story short: a real mess and a nightmare for analysis!

2 - 32bit/64bit switch

Let's start by analysing how the switch between 32bit and 64bit works, then we can see how it can be abused and what problems it causes for static analysis tools.

2.1 - The basics: how it works

The best way to understand how Windows 64bit handles 32bit processes is to see it in action: let's start a remote kernel debugging session and see what happens when we debug a 32bit process. In particular, we are going to debug the 32bit API CreateFile to see how the code interfaces with the 64bit operating system. Starting from the API entry point, we will arrive at the following code:

00000000`7698b62b 89450c mov dword ptr [ebp+0Ch],eax

00000000`7698b62e 8d45f8 lea eax,[ebp-8]

00000000`7698b631 50 push eax

00000000`7698b632 ffd6 call esi {ntdll_772b0000!ZwCreateFile}

00000000`7698b634 8bd8 mov ebx,eax

00000000`7698b636 bf220000c0 mov edi,0C0000022h

00000000`7698b63b 3bdf cmp ebx,edi

This is where the library KERNELBASE.dll calls the ntdll.dll API ZwCreateFile. In the good old 32bit Windows, ntdll, among other things, acts as a wrapper providing the transition from usermode to kernelmode (that is, it implements a syscall). Now things are different: we step into the call and we get:

ntdll_772b0000!ZwCreateFile:

00000000`772d00a4 b852000000 mov eax,52h

00000000`772d00a9 33c9 xor ecx,ecx

00000000`772d00ab 8d542404 lea edx,[esp+4]

00000000`772d00af 64ff15c0000000 call dword ptr fs:[0C0h]

00000000`772d00b6 83c404 add esp,4

00000000`772d00b9 c22c00 ret 2Ch

There is no sysenter/syscall/int 2E here, so this code is not calling the kernel yet. Instead, it is calling the following:

wow64cpu!X86SwitchTo64BitMode:

00000000`74c62320 ea1e27c6743300 jmp 0033:74C6271E

A far jump? You don't really see this type of jump very often in 32bit, so why is it used here? Because it is switching to 64bit mode (the normal usermode code segment for 32bit is 0x0023, and this jump is going to segment 0x0033)! In fact, segment 0x0033 has some specific properties, let's have a look:

Segment 0x0033 is a code segment with Read/Execute attributes, usermode privilege (ring 3), and the Long bit set (that is, the segment is for 64bit mode). So now we know how to switch from 32bit to 64bit, but what about the opposite? Since we are executing a 32bit process, it must be possible to switch back to 32bit from 64bit. If we keep debugging, we will pass through the following APIs:

wow64cpu!CpupReturnFromSimulatedCode

wow64cpu!TurboDispatchJumpAddressStart

wow64!Wow64SystemServiceEx

wow64!whNtCreateFile

and finally land on:

ntdll!NtCreateFile

0033:00000000`77121860 4c8bd1 mov r10,rcx

0033:00000000`77121863 b852000000 mov eax,52h

0033:00000000`77121868 0f05 syscall

0033:00000000`7712186a c3 ret

The system call itself happens in 64bit mode: in fact, it is not allowed to use a syscall instruction from 32bit mode, or else an exception will be raised. This is an interesting detail, because it tells us that all the APIs that require a transition to kernelmode must switch to 64bit. (Hint: if you can control the switch to 64bit you can implement a cheap API logger ;))

We finish debugging this API and we get to what we were looking for:

0033:00000000`74c626b0 4489442410 mov dword ptr [rsp+10h],r8d

0033:00000000`74c626b5 458b85c8000000 mov r8d,dword ptr [r13+0C8h]

0033:00000000`74c626bc 4c89442418 mov qword ptr [rsp+18h],r8

0033:00000000`74c626c1 458b85bc000000 mov r8d,dword ptr [r13+0BCh]

0033:00000000`74c626c8 4c890424 mov qword ptr [rsp],r8

0033:00000000`74c626cc 48cf iretq

The iretq instruction is similar to a ret: it returns to the address that is on the top of the stack, but it also pops from there the values used to restore CS, RFLAGS, RSP and SS. We have come full circle.

And this is all we need to know about the mode switches.
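To make the far jump a bit more concrete, here is a minimal Python sketch that unpacks the bytes of the `jmp 0033:74C6271E` shown above and splits the selectors we met (0x23, 0x2B, 0x33) into their fields. The function names are mine and purely illustrative. Note that the Long bit itself lives in the GDT descriptor the selector points to, so it cannot be read from the selector value alone.

```python
import struct

def decode_far_jmp(code: bytes):
    """Decode a 32bit `jmp ptr16:32` (opcode EA): 4-byte offset, then 2-byte selector."""
    if len(code) < 7 or code[0] != 0xEA:
        raise ValueError("not a ptr16:32 far jump")
    offset, selector = struct.unpack_from("<IH", code, 1)
    return selector, offset

def decode_selector(selector: int):
    """Split a segment selector into descriptor index, table indicator and RPL."""
    return {
        "index": selector >> 3,                      # entry number in the table
        "table": "LDT" if selector & 0x4 else "GDT",  # bit 2 is the table indicator
        "rpl": selector & 0x3,                       # requested privilege level
    }

# Bytes of wow64cpu!X86SwitchTo64BitMode as shown in the listing above.
selector, offset = decode_far_jmp(bytes.fromhex("ea1e27c6743300"))
print(f"jmp {selector:04X}:{offset:08X}")
for sel in (0x23, 0x2B, 0x33):
    print(f"{sel:#04x} -> {decode_selector(sel)}")
```

All three selectors turn out to be ring 3 GDT entries (indexes 4, 5 and 6); only the descriptor behind 0x33 marks the segment as 64bit code.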

2.2 - Abusing 32bit/64bit switches

If Windows library code can simply jump back and forth between 32bit and 64bit mode, then why can't we? In fact, we can just fine! As an example I have crafted a 32bit executable that performs a jump to 64bit mode, and then jumps back to 32bit. Here it is:

.text:00401000 _main proc near

.text:00401000 call ds:DebugBreak

...

.text:00401010 jmp far ptr 33h:401019

.text:00401010 _main endp

.text:00401010

...

.text:00401019 db 48h ; sub rsp, 4

.text:0040101A db 83h

.text:0040101B db 0ECh

.text:0040101C db 4

.text:0040101D db 89h ; mov dword ptr [rsp], eax

.text:0040101E db 4

.text:0040101F db 24h

.text:00401020 db 48h ; mov rax, rsp

.text:00401021 db 8Bh

.text:00401022 db 0C4h

.text:00401023 db 50h ; push rax

.text:00401024 db 90h ; nop

.text:00401025 db 90h ; nop

.text:00401026 db 90h ; nop

.text:00401027 db 90h ; nop

.text:00401028 db 5Bh ; pop rbx

.text:00401029 db 48h ; mov rax, 2Bh

.text:0040102A db 0B8h

.text:0040102B db 2Bh

.text:0040102C db 0

.text:0040102D db 0

.text:0040102E db 0

.text:0040102F db 0

.text:00401030 db 0

.text:00401031 db 0

.text:00401032 db 0

.text:00401033 db 50h ; push rax

.text:00401034 db 53h ; push rbx

.text:00401035 db 48h ; mov rax, 246h

.text:00401036 db 0B8h

.text:00401037 db 46h

.text:00401038 db 2

.text:00401039 db 0

.text:0040103A db 0

.text:0040103B db 0

.text:0040103C db 0

.text:0040103D db 0

.text:0040103E db 0

.text:0040103F db 50h ; push rax

.text:00401040 db 48h ; mov rax, 23h

.text:00401041 db 0B8h

.text:00401042 db 23h

.text:00401043 db 0

.text:00401044 db 0

.text:00401045 db 0

.text:00401046 db 0

.text:00401047 db 0

.text:00401048 db 0

.text:00401049 db 0

.text:0040104A db 50h ; push rax

.text:0040104B db 48h ; mov rax, 401080h

.text:0040104C db 0B8h

.text:0040104D db 80h

.text:0040104E db 10h

.text:0040104F db 40h

.text:00401050 db 0

.text:00401051 db 0

.text:00401052 db 0

.text:00401053 db 0

.text:00401054 db 0

.text:00401055 db 50h ; push rax

.text:00401056 db 48h ; iretq

.text:00401057 db 0CFh

...

.text:00401080 pop eax

...

I compiled a simple C program, and in the main() function I put a call to DebugBreak to conveniently spawn the remote debugger, then a series of nops which I later overwrote with the opcodes I needed. You can clearly see the far jump at address 0x00401010: it jumps to the segment 0x0033 and to the virtual address 0x00401019. The code at 0x00401019 is to be read as 64bit instructions, but the executable is loaded in IDA as a 32bit PE, so you see it as data and not as 64bit instructions.

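I have put comments at addresses 0x00401019, 0x0040101d etc. to indicate the 64bit instructions: they simply push the correct values on the stack in order to be able to switch back to 32bit mode. In order, the following values are pushed:

the stack segment selector (0x2B)

the stack pointer (the value RSP had right after EAX was saved on the stack)

the flags (0x246)

the 32bit code segment selector (0x23)

the return address (0x00401080)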

The iretq will restore all these values, starting the execution in 32bit mode from address 0x0023:0x00401080, but bear in mind that the 64bit code also changes the state of the registers in 32bit mode. So it's up to you to preserve the registers that need to be saved across switches.
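To visualise what iretq finds on the stack, here is a small sketch that lays out the five 64bit slots in memory order (lowest address first), using the values from the snippet above. The stack pointer value is a made-up example, and `build_iretq_frame` is just an illustrative helper name.

```python
import struct

def build_iretq_frame(rip, cs, rflags, rsp, ss):
    """Return the 40 bytes iretq consumes, top of stack (lowest address) first."""
    return struct.pack("<5Q", rip, cs, rflags, rsp, ss)

# Values pushed by the 64bit snippet; the stack pointer is a hypothetical example.
example_rsp = 0x0012FF7C
frame = build_iretq_frame(0x00401080, 0x23, 0x246, example_rsp, 0x2B)

names = ("RIP", "CS", "RFLAGS", "RSP", "SS")
for name, value in zip(names, struct.unpack("<5Q", frame)):
    print(f"{name:<6} = {value:#018x}")
```

Since the stack grows downwards, the snippet has to push these values in the opposite order: SS first and the return address last.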

2.3 - Some issues with the disassemblers

Of course you can always open the PE file as a binary file in IDA64, and then manually disassemble those instructions, but there are some issues:

The file is opened as a binary file, which means that if an opcode is referencing a memory location IDA will not show you the x-refs. For instance, if you have "mov rax, 0x00402000", since the file is loaded as a binary file and not as a PE, there will not be a reference to the virtual address 0x00402000.

IDA will not know where the 64bit code snippets are in the file, so you will need to manually get every virtual address from the 32bit PE, translate it to a file offset and then find it in the 64bit binary file loaded in IDA. Annoying!

If you have a complex computation (for example, a decryption routine) that interleaves 32bit and 64bit instructions to perform a task, then following the whole routine through static analysis is really a pain: you need to use two sessions of IDA to understand all the code.
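The address translation mentioned in the second issue is mechanical, so it can at least be scripted. The following is a minimal sketch, assuming you have already read the image base and the section table from the 32bit PE; the `Section` tuple and the example layout are hypothetical values of mine, not taken from the test executable.

```python
from typing import NamedTuple, Sequence

class Section(NamedTuple):
    name: str
    virtual_address: int  # RVA of the section (VirtualAddress)
    virtual_size: int     # VirtualSize
    raw_pointer: int      # PointerToRawData
    raw_size: int         # SizeOfRawData

def va_to_file_offset(va: int, image_base: int, sections: Sequence[Section]) -> int:
    """Translate a virtual address of the loaded PE into an offset in the file."""
    rva = va - image_base
    for s in sections:
        span = max(s.virtual_size, s.raw_size)
        if s.virtual_address <= rva < s.virtual_address + span:
            delta = rva - s.virtual_address
            if delta >= s.raw_size:
                raise ValueError(f"{va:#x} is in {s.name} but has no bytes on disk")
            return s.raw_pointer + delta
    raise ValueError(f"{va:#x} is not inside any section")

# Hypothetical layout: a single .text section mapped at RVA 0x1000, stored at 0x400.
sections = [Section(".text", 0x1000, 0x2D8, 0x400, 0x400)]
print(hex(va_to_file_offset(0x00401019, 0x00400000, sections)))
```

With such a helper you can jump straight to the right offset in the IDA64 binary view instead of doing the arithmetic by hand for every snippet.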

.text:004012D8 retn

Notice that you don't have any cross references for memory locations between segments; even manually using the "offset" command won't work.

These issues show up mainly in static analysis: if you are debugging the code you can just follow it and the obfuscation won't matter. Or will it? Well, it turns out that debuggers don't work very well with 64bit code, and besides, it is common to analyse parts of an executable without having the possibility to run them, so this is a serious issue.

3 - Debuggers

Let's have a quick overview of the debugging problems.

3.1 - Which one works?

I have tested some common debuggers and, as I briefly mentioned in section 2.3, the results are poor:

Ollydbg - It can debug a 32bit process, but it won't be able to trace the far jumps. If you try to step over/into one of those jumps, the debugger will lose control, and will end up somewhere else in the code.

Syser Win32 Debugger - Same as Ollydbg.

Syser kernel debugger - It doesn't run on 64bit Windows.

Windbg local debugger - Same as Ollydbg.

Windbg remote kernel debugger - The only one that works. When doing remote debugging, you can step into the far jumps and the iretqs, so you can debug the code. Unfortunately there are some other limitations, like the code assembler (that is, the "a" command) does not support 64bit instructions, so if you have to patch an executable for any reason, you will have to patch the opcode bytes manually. Not the end of the world, but not nice either.

IDA - You can try to use IDA's built-in debugger, but it won't directly load 64bit PE executables. It requires you to use the dbgsrv component from Windbg and then start a remote debugging session. I have not fully tested this feature, but since it uses dbgsrv it may work. Still, it requires remote debugging.

If you want to debug an executable that switches between 32bit and 64bit you need to use Windbg remote kernel debugging: I have not found another easy way to do it. Luckily, machines nowadays are pretty powerful and capable of running virtual machines, but still, it would be much easier to be able to debug this sort of code locally.

3.2 - A small workaround

I have said that Ollydbg (and basically all other usermode debuggers) is not able to step through far jumps, and that if you try you lose control of the execution, but there is still a way to bypass the problem. If you know the 32bit address to which the 64bit code will return (via an iretq), then you can put a breakpoint (bpx) on it, let the program run, and the debugger will break on it, thus bypassing the 64bit code completely. To explain it more clearly:

you arrive at a far jump that will switch to 64bit mode

you know that the 64bit code will return to 32bit address xyz

you set a bpx on address xyz

you let the program run

the debugger will break on xyz

In this way, you completely bypass the 64bit snippet. But of course, it requires you to have previously analysed that snippet and determined which 32bit address it will return to, which slows everything down.
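Finding the return address can be partly scripted. Below is a rough heuristic sketch that only understands the exact pattern used in the test code of section 2.2 (the last `mov rax, imm64` / `push rax` pair before the first iretq); any other way of building the iretq frame will defeat it. The function name and the byte string are mine, with the bytes transcribed from the listing in section 2.2.

```python
import struct

IRETQ = b"\x48\xcf"          # iretq
MOV_RAX_IMM64 = b"\x48\xb8"  # mov rax, imm64
PUSH_RAX = 0x50              # push rax

def find_iretq_return(snippet: bytes) -> int:
    """Heuristic: last immediate pushed via mov rax/push rax before the first iretq."""
    end = snippet.find(IRETQ)
    if end < 0:
        raise ValueError("no iretq in snippet")
    pos = snippet.rfind(MOV_RAX_IMM64, 0, end)
    while pos >= 0:
        # The 10-byte mov must be followed by push rax, all before the iretq.
        if pos + 11 <= end and snippet[pos + 10] == PUSH_RAX:
            (address,) = struct.unpack_from("<Q", snippet, pos + 2)
            return address
        pos = snippet.rfind(MOV_RAX_IMM64, 0, pos)
    raise ValueError("no 'mov rax, imm64 / push rax' before iretq")

# The 64bit bytes found at 0x00401019 in the listing of section 2.2.
snippet_2_2 = bytes.fromhex(
    "4883ec04" "890424" "488bc4" "50" "90909090" "5b"
    "48b82b00000000000000" "50" "53"
    "48b84602000000000000" "50"
    "48b82300000000000000" "50"
    "48b88010400000000000" "50"
    "48cf"
)
print(f"set the bpx at {find_iretq_return(snippet_2_2):#010x}")
```

This is a plain byte search, not a disassembly, so an immediate that happens to contain the bytes 48 CF would confuse it; treat the result as a hint to verify by hand.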

4 - Some examples of obfuscation

The state of the registers (and of the memory, stack etc.) is maintained across switches, which means you can perform any computation splitting parts of it between 32bit and 64bit.

For example, we can modify the test code in section 2.2 as follows (this time, for clarity, I'm writing the assembly code instead of the opcodes):

------------ 32bit code ------------

.text:00401008 mov edx, 12345678h ; set edx before 64bit

.text:0040100D nop

.text:0040100E nop

.text:0040100F nop

.text:00401010 jmp far ptr 33h:401019h

...

------------ 64bit code ------------

.text:00401019 sub rsp, 4

.text:0040101D mov dword ptr [rsp], eax

.text:00401020 mov rax, rsp

.text:00401023 push rax

.text:00401024 nop

.text:00401025 nop

.text:00401026 nop

.text:00401027 nop

.text:00401028 pop rbx

.text:00401029 mov rax, 2Bh

.text:00401033 push rax

.text:00401034 push rbx

.text:00401035 mov rax, 246h

.text:0040103F push rax

.text:00401040 mov rax, 23h

.text:0040104A push rax

.text:0040104B mov rax, 401080h

.text:00401055 push rax

.text:00401056 add rdx, rdx ; modifies edx

.text:00401059 iretq ; returns to 32bit address 0x00401080

...

------------ 32bit code ------------

.text:00401080 pop eax ; edx is now 0x12345678 + 0x12345678

...

The code starts by setting a value (0x12345678) in the register EDX. Then, it jumps to 64bit mode, and the 64bit instructions simply double the value of EDX. At this point, when the code returns to 32bit mode, EDX contains the value that has been doubled in the 64bit snippet (that is, 0x2468ACF0). The same holds for the stack: you can push 32bit values from 64bit mode, and they will remain on the stack (assuming you don't change the stack pointer with the iretq). This means you can hide stack parameters for API calls. Moreover, you can hide the API call itself: all you need to do is to switch to 64bit mode and call its corresponding 64bit version.

This is an example of how you can call an API from 64bit, but of course you can do it in many other ways, or you can even invoke the SYSCALL yourself.

Another interesting trick is that of using a snippet of code that can be executed in both 32bit and 64bit mode, and it will perform a different computation depending on which mode you are in. For example the sequence of bytes

48 03 D2

can be:

- 64bit

add rdx, rdx

- 32bit

dec eax

add edx, edx

so you can call the same opcodes and have them behave differently. Or, even worse, you can add JMP instructions in your code from both 32bit and 64bit to the same opcodes, but only one of them is really executed at runtime, for example:

.text:00401010 jmp far ptr 33h:401050h

...

.text:00401020 jmp 401050h

...

.text:00401050 48 03 D2 ???

With both jumps in place it becomes difficult to understand which of the two is actually going to be executed at runtime. This is particularly annoying if you are trying to write a tool that automatically finds 64bit code snippets and disassembles them for you: if the tool blindly disassembles the bytes at 0x00401050 as 64bit code, it may be wrong, because the real program may only ever reach them in 32bit mode from the jump at 0x00401020.
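To give an idea of what such a tool would be up against, here is a naive sketch that scans a code buffer for far jumps to selector 0x33 and for near jumps (E9 rel32, EB rel8), and reports the targets that are reachable in both modes. It is a linear byte scan of my own, not a real disassembler, so it will produce false positives on real code; the test buffer mimics the layout of the example above, with a short jump assumed at 0x00401020.

```python
import struct
from collections import defaultdict

def find_jump_targets(code: bytes, base: int):
    """Map each jump target to the set of modes ("32bit"/"64bit") it is reached in."""
    targets = defaultdict(set)
    i = 0
    while i < len(code):
        op = code[i]
        if op == 0xEA and i + 7 <= len(code):        # jmp far ptr16:32
            offset, selector = struct.unpack_from("<IH", code, i + 1)
            if selector == 0x33:
                targets[offset].add("64bit")
            i += 7
        elif op == 0xE9 and i + 5 <= len(code):      # jmp rel32
            (rel,) = struct.unpack_from("<i", code, i + 1)
            targets[(base + i + 5 + rel) & 0xFFFFFFFF].add("32bit")
            i += 5
        elif op == 0xEB and i + 2 <= len(code):      # jmp rel8
            (rel,) = struct.unpack_from("<b", code, i + 1)
            targets[(base + i + 2 + rel) & 0xFFFFFFFF].add("32bit")
            i += 2
        else:
            i += 1
    return targets

# Buffer starting at 0x00401010: far jump, padding, short jump, padding, 48 03 D2.
code = (b"\xea" + struct.pack("<IH", 0x00401050, 0x33) + b"\x90" * 9
        + b"\xeb\x2e" + b"\x90" * 46 + b"\x48\x03\xd2")
targets = find_jump_targets(code, 0x00401010)
ambiguous = [hex(t) for t, modes in targets.items() if len(modes) > 1]
print("targets reached in both modes:", ambiguous)
```

The scan can tell you that 0x00401050 is ambiguous, but not which of the two jumps is live at runtime: for that you still need to understand the control flow.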

5 - Tools

Compilers are not designed to handle this situation either, so developing this trick is not straightforward. Compilers, like debuggers and disassemblers, are designed to handle ONE architecture at a time. Mixing 32bit and 64bit is not easy, but it is not too difficult to write tools or plugins that can generate 64bit snippets to be embedded inside a 32bit executable. You can for example use the "__emit" compiler intrinsic available in old Visual Studio versions, or you can use NASM or other assemblers to generate both 32bit and 64bit code and then merge them in one single executable.

Here are my proposals to help you implement this kind of obfuscation.

5.1 - How to include the obfuscation in a Visual Studio project

To show you how to implement the obfuscation in your own Visual Studio project, I have crafted a POC that you can easily modify.

I have first created a 32bit Visual Studio project called "Asm_C" containing two files: "main.cpp" and "test.asm". "main.cpp" simply executes "run_asm64()", the assembly routine that is located in "test.asm", and demonstrates how this routine modifies the value of the "Key" variable.

In particular, this routine consists of:

opcodes to switch to 64bit mode;

the 64bit opcodes corresponding to the assembly code you want to execute (in this case, the ones that modify the "Key" variable);

opcodes to return to 32bit mode.

This is done to bypass the lack of support for the two architectures together: you can't mix 32bit code and 64 bit code in the same project, but you can use the corresponding opcodes instead!

Note that the far jump into 64bit mode is also emitted as raw opcodes, even though it executes in 32bit mode. This is done because MASM does not seem to support far jumps properly.

To obtain the 64bit opcodes I've created a 64bit Visual Studio project named "Dummy64" containing two files: "main.cpp" and "dummy.asm".

"dummy.asm" contains the 64bit assembly code that we want to compile to obtain the corresponding binary opcodes.

"main.cpp" loops through all the opcodes of the compiled "DummyAsm" routine and prints them but, first, it looks for a jump (opcode 0xE9) and skips it. This is done because some compilers (Visual Studio, for instance) tend to emit a small snippet, called "trampoline area", that jumps to the function body: so, basically, this check is meant to skip the trampoline itself.

The code also supports a sort of relocation procedure: for example, in this POC, we use the variable "Var1" to refer to the "Key" variable in the "Asm_C" project.

Of course, you can use the same trick every time you want to employ in your 64bit code something that has been defined in the 32bit code.

You create a 32bit project ("Asm_C", in this case) containing both the C/C++ files with the 32bit code and the ASM files in which you will put the 64bit routines.

Each 64bit routine must contain proper code to enter and leave 64bit mode.

Each 64bit routine must be encoded as opcodes, using the "Dummy64" project.

If you want to use a portion of memory that has been previously allocated from the 32bit code (a variable, an array, a structure and so on), just use a different one in 64bit and remember to relocate it to the one you are really referring to in 32bit, using the trick we saw in the "Dummy64" project.
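The opcode dumping step does not have to be done in C++. Here is a minimal Python sketch of the same idea: follow a leading `jmp rel32` trampoline if there is one, then print the routine's bytes as MASM "db" lines. Following the jump is my reading of "skipping the trampoline", the routine length is assumed to be known by the caller, and the sample bytes are invented for the example.

```python
import struct

def skip_jump_thunk(image: bytes, start: int) -> int:
    """If the routine starts with `jmp rel32` (E9), return the offset it jumps to."""
    if image[start] == 0xE9:
        (rel,) = struct.unpack_from("<i", image, start + 1)
        return start + 5 + rel
    return start

def to_masm_db(opcodes: bytes, per_line: int = 8) -> str:
    """Format raw bytes as MASM `db` lines, e.g. `db 048h, 003h, 0d2h`."""
    lines = []
    for i in range(0, len(opcodes), per_line):
        chunk = opcodes[i:i + per_line]
        lines.append("db " + ", ".join(f"0{b:02x}h" for b in chunk))
    return "\n".join(lines)

# Invented sample: a trampoline, three padding bytes, then `add rdx, rdx` / `ret`.
image = bytes.fromhex("e903000000" "cccccc" "4803d2c3")
body = skip_jump_thunk(image, 0)
print(to_masm_db(image[body:body + 3]))  # 3 = assumed length of the routine body
```

The output can be pasted between the "enter 64bit" and "leave 64bit" opcodes of a routine in the 32bit project.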

5.2 - How to (nearly) automate the obfuscation

To automate the obfuscation you can take advantage of Visual Studio itself! In fact, you can use the /FA option on the Visual Studio command line (or "Project Properties -> Configuration properties -> C/C++ -> Output files -> Assembler output -> Assembly-Only Listing") and switch Whole Program Optimization off, that is, build without the /GL option (or set "Project properties -> Configuration properties -> C/C++ -> Optimization -> Whole Program Optimization" to "No"), to obtain one assembly source for each file of your project. Finally you can assemble and link the obtained assembly files by typing: "ml file_1.asm ... file_n.asm" in the Visual Studio command line.

N.B. Turning Whole Program Optimization off is crucial, because it keeps the compiler from moving code between the project files: in this way, if a routine is located in "main.cpp", the corresponding assembly one will be in "main.asm", while with /GL enabled, due to optimization, it could be located in any other generated assembly file and the ML command won't work!

So you can:

Create a 32bit Visual Studio C/C++ project and compile it in the way described above.

Select any instruction from the obtained assembly listing and substitute it with a bunch of 64bit opcodes that have the same behavior, taking care to add the code that switches into and out of 64bit mode.

Compile and link the assembly files.

Of course, you can automate step 2 very easily and craft your own obfuscator: it won't take long if you use any programming language that supports regular expressions.

For example, I followed these steps and substituted the assembly instruction "push 14h" with the following assembly code:

db 0EAh; jump to enter 64 bit

dd offset LocEnter

db 033h, 000h

LocEnter:

;------------------------

db 048h, 083h, 0ech, 004h; sub rsp, 4

db 0c7h, 004h, 024h, 014h, 000h, 000h, 000h; mov dword ptr [rsp], 14h

;------------------------

db 048h, 083h, 0ech, 004h; sub rsp, 4

db 089h, 004h, 024h; mov dword ptr [rsp], eax

db 048h, 08bh, 0c4h; mov rax, rsp

db 06ah, 02bh; push stack segment selector

db 50h; push stack pointer (in rax)

db 068h, 046h, 002h, 000h, 000h; push eflags

db 06ah, 023h; push code selector

db 068h

dd offset LocExit

db 048h, 0cfh; iretq

LocExit:

pop eax

I then linked the assembly files and it worked just fine. Moreover I've disassembled the executable obtained before and after that modification; here are the listings.

Totally messy and, as I mentioned before, a very effective way to hide the parameters of a function. Note that this kind of obfuscation is really powerful and, unlike standard packers, the clear code never appears ready to be dumped from memory. Also, you can use this idea to implement any other obfuscation technique. For example, you can easily create a little program that adds a lot of junk code all over the assembly listing. Also spreading the trick at the end of section 4, that is filling your source with pieces of code that can be interpreted in both 32bit and 64bit, will be very frustrating to whoever will have to analyse your program.
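As a rough idea of how the substitution step of section 5.2 could be automated with regular expressions, here is a hedged Python sketch that rewrites every "push <immediate>" line of an assembly listing with the opcode sequence shown above. The numbered labels (needed so that several substitutions can coexist in one file), the accepted operand formats and the function name are my own assumptions; a real /FA listing will need a more careful pattern.

```python
import re

# Matches `push 14h` or `push 20`, optionally followed by a comment.
PUSH_IMM = re.compile(
    r"^\s*push\s+(?:([0-9][0-9A-Fa-f]*)h|(\d+))\s*(?:;.*)?$", re.IGNORECASE)

STUB = """\
    db 0EAh                         ; far jump: enter 64bit mode
    dd offset LocEnter{n}
    db 033h, 000h
LocEnter{n}:
    db 048h, 083h, 0ech, 004h       ; sub rsp, 4
    db 0c7h, 004h, 024h             ; mov dword ptr [rsp], imm32
    dd 0{imm:08x}h
    db 048h, 083h, 0ech, 004h       ; sub rsp, 4
    db 089h, 004h, 024h             ; mov dword ptr [rsp], eax
    db 048h, 08bh, 0c4h             ; mov rax, rsp
    db 06ah, 02bh                   ; push stack segment selector
    db 050h                         ; push stack pointer (in rax)
    db 068h, 046h, 002h, 000h, 000h ; push eflags
    db 06ah, 023h                   ; push code selector
    db 068h                         ; push return address
    dd offset LocExit{n}
    db 048h, 0cfh                   ; iretq
LocExit{n}:
    pop eax"""

def obfuscate_pushes(listing: str) -> str:
    """Replace each `push <imm>` line with the 64bit stub; other lines are kept."""
    out, n = [], 0
    for line in listing.splitlines():
        m = PUSH_IMM.match(line)
        if m is None:
            out.append(line)
            continue
        imm = int(m.group(1), 16) if m.group(1) is not None else int(m.group(2))
        out.append(STUB.format(n=n, imm=imm & 0xFFFFFFFF))
        n += 1
    return "\n".join(out)

print(obfuscate_pushes("push 14h\ncall _foo"))
```

Register and memory operands are deliberately left alone: only immediates can be folded into the "mov dword ptr [rsp], imm32" of the stub without further work.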

6 - Evolutions

This trick alone is very effective, but there are other good obfuscation techniques that have been used in various malware and packers. Well, combine the old obfuscation techniques with this new one and you can obtain code that is nearly impossible to analyze... well, not impossible but very very hard!

7 - (Not) detecting the obfuscation

I tried using Intel's Pin instrumentation toolkit (I used the 32bit version) to trace the test application I created, hoping that Pin would be able to identify and follow the far jumps that go from 32bit to 64bit. Unfortunately, Pin seems to be unable to handle these jumps as well (I also found people reporting this problem in the official Pin forum). This is the source code of the Pintool I have written:

It simply identifies the opcode of a far jump, and if found, prints the address of the instruction that immediately follows it. Running the test produces the following log before making the application crash:

Exception handler address: 772f0124

Starting Pintool

Jump seg 64! eip 748f2320

after jmp 64: eip 773010b2

Jump seg 64! eip 748f2320

after jmp 64: eip 772ffb9a

Jump seg 64! eip 748f2320

after jmp 64: eip 772ffa1a

Jump seg 64! eip 748f2320

after jmp 64: eip 772ffa1a

...

Jump seg 64! eip 01121022

after jmp 64: eip 772f0124

As we can see, the jumps within system DLLs are correctly detected and the problem occurs only at address 0x01121022, that is, at the application's first far jump. We notice this also because the following instruction is located at address 0x772f0124, which is the address of KiUserExceptionDispatcher (one of the functions called by Windows when an exception occurs).

Moreover, the application works perfectly if run normally and crashes only when run under Pin.

I haven't investigated these details deeply, but it seems that something happens within Pin's instrumented code in case of the application far jumps, while Pin may have its own logic to handle Windows internal API calls.

And there goes another tool...!

As a note: you can use the 32bit version of Pin to instrument a 64bit process too (although Pin also exists in 64bit): the process will be running in 32bit mode, but the 64bit module is loaded and can be run without problems. So, I think it should be also possible, from a 64bit mode process, to call 32bit code, but I have not tried this yet.

8 - Conclusion

Legacy software and hardware are always a pain, and this is a good example of why they are. This obfuscation derives from the 32bit legacy in our new shiny 64bit CPUs, and it can present many advantages:

it hides computations mixing operations in 32bit and 64bit modes

it hides parameters for API calls

it hides API calls

it destroys code and data cross references

it makes analysis time consuming

it can only be debugged via remote kernel debugging

it is difficult to have automated tools to solve this obfuscation

64bit support in analysis tools in general is not very good

Note that when I say "hide" I mean that the code is difficult to visualize correctly in the disassembler or in usermode debuggers. The code is there however, but the current tools have difficulties in dealing with it.

Note: I wrote this blog entry about two years ago and I proposed it to Phrack magazine. In the end they declined the offer just a few months ago, and I decided to publish it now anyway. Some of the findings reported here were new at the time of writing, but were later published by other researchers (see the references). Also, even though I took some time to review this material again, some of the limitations I outlined in handling this obfuscation may have been fixed in newer software releases. Hope you enjoyed the article anyway :)