Archive for December, 2015

Brief Overview of WoW64

“Heaven’s Gate” refers to a technique first popularized by the infamous “Roy G. Biv” of 29A fame, and later re-published in Valhalla #1. Cited and improved in various new forms, and even seen in the wild used by the Vawtrak banking malware, it centers around the fact that on a 64-bit Windows OS, seeing as how all kernel-mode components always execute in 64-bit mode, the address space, core OS structures (EPROCESS, PEB, etc…), and code segments for processes are all initially set up for 64-bit “long mode” execution, regardless of the process actually being hosted by a 32-bit executable binary.

In fact, on 64-bit Windows, the first piece of code to execute in *any* process, is always the 64-bit NTDLL, which takes care of initializing the process in user-mode (as a 64-bit process!). It’s only later that the Windows-on-Windows (WoW64) interface takes over, loads a 32-bit NTDLL, and execution begins in 32-bit mode through a far jump to a compatibility code segment. The 64-bit world is never entered again, except whenever the 32-bit code attempts to issue a system call. The 32-bit NTDLL that was loaded, instead of containing the expected SYSENTER instruction, actually contains a series of instructions to jump back into 64-bit mode, so that the system call can be issued with the SYSCALL instruction, and so that parameters can be sent using the x64 ABI, sign-extending as needed.

This process is accurately described in many sources, including in the Windows Internals books, so if you’re interested in reading more, you can do so, but I’ll spare additional details here.

Enter Heaven’s Gate

Heaven’s Gate, then, refers to subverting the fact that a 64-bit NTDLL exists (and a 64-bit heap, PEB and TEB), and manually jumping into the long-mode code segment without having to issue a system call and being subjected to the code flow that WoW64 will attempt to enforce. In other words, it gives one the ability to create “naked” 64-bit code, which will be able to run covertly, including issuing system calls, without the majority of products being able to intercept and/or introspect its execution:

Microsoft’s EMET, as well as a myriad of similar tools and sandboxes, only hook/protect the 32-bit NTDLL for WoW64 processes, under the assumption that the 64-bit NTDLL can’t be reached in any other way. The mitigations can therefore be bypassed using Heaven’s Gate. The same technique has been used by the Phenom malware to bypass AV solutions.

When debugging a 32-bit application with a 64-bit debugger (such as WinDBG), you will initially see the 64-bit state (heap, stack, NTDLL, TEB, etc…). Since this state is uninteresting, as it only contains the WoW64 system call layer, manual commands and extensions must be used to investigate the 32-bit state instead — and so in order to avoid this, even Microsoft often recommends using the 32-bit WinDBG instead, which will provide a much more seamless debugging experience and show the 32-bit state of the process. Other 3rd party debuggers, which are 32-bit only, will also behave the same way. The problem, therefore, is that by using Heaven’s Gate, there IS now interesting 64-bit state, that these debuggers will miss.

Many emulation/detonation engines will, upon seeing a 32-bit executable, emulate it using x86 instructions. They will either ignore or be unable to handle x64 instructions, as they never expect them to run. In fact, this was recently shown by a blog post over at Hexacorn. Heaven’s Gate allows such x64 instructions to run, rendering the x86 code into “dummy” code for misdirection purposes.

Memory Restrictions

These and other “benefits” make Heaven’s Gate a tool of choice for malicious code. However, there always existed an interesting limitation in 32-bit applications running under WoW64: even when executing in 64-bit long-mode, addresses above 4 GB could never be allocated (in fact, addresses above 2 GB could normally never be used for compatibility purposes, unless the image was linked with /LARGEADDRESSAWARE; the switch was originally designed to support /3GB x86 server environments, but outgrew its original intent to allow full 4 GB addresses under WoW64, a fact leveraged by many 32-bit games and browsers even today).
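As a rough illustration of the 2 GB versus 4 GB distinction, the sketch below reads the COFF Characteristics field of a PE header and checks the IMAGE_FILE_LARGE_ADDRESS_AWARE bit (0x0020), which is what /LARGEADDRESSAWARE sets. The helper names (is_large_address_aware, wow64_user_limit, make_fake_header) are hypothetical, and the synthetic header exists only so the example runs without a real binary.

```python
import struct

# COFF Characteristics flag set by the /LARGEADDRESSAWARE linker switch.
IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020

GB = 1 << 30


def is_large_address_aware(image: bytes) -> bool:
    """Return True if the PE image's COFF header carries the flag."""
    if image[:2] != b"MZ":
        raise ValueError("not an MZ image")
    # e_lfanew (offset of the PE signature) lives at 0x3C in the DOS header.
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # Characteristics is the final 2-byte field of the 20-byte COFF header,
    # which starts right after the 4-byte PE signature.
    (characteristics,) = struct.unpack_from("<H", image, e_lfanew + 4 + 18)
    return bool(characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)


def wow64_user_limit(image: bytes) -> int:
    """Upper address bound as described in the text (hypothetical helper)."""
    return 4 * GB if is_large_address_aware(image) else 2 * GB


def make_fake_header(characteristics: int) -> bytes:
    """Build a tiny synthetic DOS + COFF header, for demonstration only."""
    dos = bytearray(0x40)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)
    coff = struct.pack("<HHIIIHH", 0x014C, 0, 0, 0, 0, 0, characteristics)
    return bytes(dos) + b"PE\0\0" + coff
```

Calling wow64_user_limit(make_fake_header(0x0122)) yields the 4 GB bound, while the same header without the 0x0020 bit yields 2 GB. Neither value reaches into the 64-bit portion of the address space, which is the limitation discussed here.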

Using a kernel debugger and the !vad command, it’s simple to see why, such as on this Windows 7 system, where I’ve typed the command before the process has any chance of executing even a single instruction — not even NTDLL has loaded here, folks. This is an interesting view of what are the “earliest” memory structures you can find in a WoW64 process (at least on Windows 7).

Note that a giant VAD at the end, highlighted in teal, occupies the entire 64-bit portion of the address space. Let’s see what !vad has to say about it:

Seeing as how it’s configured as a “NoChange” and “OneSecured” VAD, it cannot be freed or modified in any way. This is further confirmed by the commit charge of -1.

On Windows 8 and later, however, the output changes, as you can see below. Note that I’ve re-used the same colors as in the Windows 7 output for clarity (and the uncolored VADs correspond to the CFG entries).

The 64-bit NTDLL is actually loaded in 64-bit address space now! And we have not one, but two teal-colored VADs, which surround it, re-creating the “no man’s land” just as on Windows 7 and earlier. This change was briefly mentioned, I believe, by Matt Miller (of skape fame) at one of Microsoft’s BlackHat presentations: it made it a bit harder to guess the location of the 64-bit NTDLL by simply adding a fixed size to the 32-bit NTDLL. In my screenshot, since this is a CFG-enabled process, the VADs don’t exactly envelop NTDLL — rather they surround the native CFG bitmap + NTDLL, but the point remains.

This change in NTDLL load behavior also had the likely intended side effect of making hooks in 64-bit NTDLL extremely hard, or outright impossible. You see, without consuming an enormous amount of space, it’s simply not possible to overwrite an x64 instruction with a call or jmp to an absolute 64-bit address efficiently. Instead, hooking engines will allocate a “trampoline” that is within the 32-bit address range of the hooked function, and use a much smaller 5 byte 32-bit relative jump, which happens to fit nicely in the “hotpatch aware” region that Microsoft binaries have (or anyone linking with /hotpadmin). The trampoline then uses the full 64-bit absolute jump instruction.

As you’ve figured out by now, if the trampoline needs to be within 2GB, but there are two large VADs blocking off all 64-bit addresses around NTDLL, this hooking technique is dead in the water. Other, more complex and error-prone techniques must (and can) be used instead.
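To make the size argument concrete, here is a minimal sketch of the two jump encodings involved. It assumes the common `E9 rel32` form for the 5-byte relative jump and a `jmp qword ptr [rip+0]` followed by an inline 8-byte target for the absolute jump; real hooking engines may pick other sequences, and the addresses used in the usage note are made up.

```python
import struct

REL32_JMP_SIZE = 5    # E9 + signed 32-bit displacement
ABS64_JMP_SIZE = 14   # FF 25 00 00 00 00 + inline 8-byte target


def encode_rel32_jmp(source: int, target: int) -> bytes:
    """Encode a 5-byte relative jump placed at `source`."""
    # The displacement is measured from the end of the jump instruction.
    disp = target - (source + REL32_JMP_SIZE)
    if not -(1 << 31) <= disp < (1 << 31):
        raise ValueError("target is outside the +/- 2 GB rel32 range")
    return b"\xE9" + struct.pack("<i", disp)


def encode_abs64_jmp(target: int) -> bytes:
    """Encode jmp qword ptr [rip+0] with the 64-bit target stored inline."""
    return b"\xFF\x25\x00\x00\x00\x00" + struct.pack("<Q", target)
```

With a hooked function at a hypothetical 0x7FFE12340000 and the nearest allocatable memory below 4 GB, encode_rel32_jmp raises ValueError: the trampoline is simply too far away, and the 14-byte form does not fit in the hotpatch region.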

Nevertheless, nothing stops Heaven’s Gate on Windows 8. There are some minor WoW64 changes which one must adapt to, and accessing or hooking 64-bit NTDLL becomes harder.

Control Flow Guard and WoW64

In Windows 10, a new exploit mitigation is introduced called Control Flow Guard, or CFG. It, too, has been rather well described in multiple sources, so I won’t go into details inside of this post. The important piece to remember about CFG is that all indirect function calls are now subject to an additional compiler-generated check, which is implemented by NTDLL: only valid function prologues (within 8 bytes of alignment) can be the target of such a call. Valid function prologues, in turn, are marked by a bit being set in a very large bitmap (bit array) structure, which describes the entire user-mode address space (all 128TB of it!). I previously posted on some interesting changes this required in the memory manager, as this bit array obviously becomes quite large (2 TB, in fact).
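The 2 TB figure follows from simple arithmetic. The sketch below assumes a simplified layout of one bitmap bit per 8 bytes of address space, which matches the numbers quoted above; the actual bitmap encoding may differ in detail, and the function names are illustrative only.

```python
TB = 1 << 40

# Assumption: one bitmap bit describes 8 bytes of address space.
BYTES_PER_BIT = 8


def bitmap_size_bytes(address_space_bytes: int) -> int:
    """Size of a bit array covering the given address space."""
    bits = address_space_bytes // BYTES_PER_BIT
    return bits // 8  # 8 bits per byte of bitmap


def bit_index(address: int) -> int:
    """Index of the bit that describes `address` in this simplified model."""
    return address // BYTES_PER_BIT
```

Under that assumption, bitmap_size_bytes(128 * TB) comes out to exactly 2 * TB, a 64:1 ratio between address space and bitmap.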

What’s not been documented too clearly in most research is that on 64-bit systems, there are in fact not one, but two CFG bitmaps: one for 32-bit code, and one for 64-bit code. The addresses of both of these bitmaps are stored in the per-process working set structure (called MMWSL). This structure is pointed to by the MMSUPPORT structure inside of EPROCESS (i.e., PsGetCurrentProcess()->Vm.VmWorkingSetList), but a unique thing about it, is that it’s stored in a region of memory called “hyperspace”, which is at a fixed address… much like the per-process page table entry array. On recent 64-bit systems, this hard-coded address is 0xFFFFF58010804000, a fact I pointed out in a previous blog post addressing the 64-bit address space of Windows 8.1 and later.

As one can see in the symbols that WinDBG can dump, the MMWSL structure contains a field:

Clearly, thus, a 64-bit Windows 10 kernel contains not one, but two CFG bitmaps. And indeed, the 32-bit NTDLL will utilize the address of the WoW64 bitmap, while the 64-bit NTDLL will utilize the Native bitmap. But why use two separate bitmaps? What separates a WoW64 bitmap from a native bitmap? One would imagine that 64-bit code is marked as executable in the native bitmap, and 32-bit code is marked as executable in the WoW64 bitmap… but that’s not quite the full story.

At verification time, indeed, it is the version of NTDLL that is being used, which determines which bitmap will be looked at. But how does the OS populate the bits?

In CFG-aware versions of Windows, the CFG bitmap is touched through two paths: MiCommitVadCfgBits, and MiCfgMarkValidEntries. These, in turn, correspond to either intrinsic CFG modifications (side-effects of allocating, protecting and/or mapping executable memory), or explicit CFG modifications (effect of calling SetProcessValidCallTargets). Both of these paths will eventually call MiSelectCfgBitMap, whose pseudo-code is shown below.

As is quite clear from the code, any private memory allocations below the 64-bit boundary will be marked only in the 32-bit bitmap, while the opposite applies to the 64-bit bitmap. In fact, this is the result of an optimization: instead of having two 2TB bit arrays for each processor execution mode, a single 2TB array is used for 64-bit native code, while a single 32MB array is used for 32-bit native code, greatly reducing address space consumption.
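The selection rule described in the paragraph above can be modeled in a few lines. This is a hypothetical model of the described behaviour only, not the kernel’s actual logic: the names CfgBitmap and select_cfg_bitmap are invented here, and the real routine may take additional conditions into account.

```python
from enum import Enum

FOUR_GB = 1 << 32  # the 32-bit / 64-bit address boundary


class CfgBitmap(Enum):
    WOW64 = "wow64"    # small bitmap consulted by the 32-bit NTDLL
    NATIVE = "native"  # large bitmap consulted by the 64-bit NTDLL


def select_cfg_bitmap(address: int, is_wow64_process: bool) -> CfgBitmap:
    """Pick the bitmap to populate, based purely on the virtual address."""
    # Assumption: only WoW64 processes have a separate 32-bit bitmap.
    if is_wow64_process and address < FOUR_GB:
        return CfgBitmap.WOW64
    return CfgBitmap.NATIVE
```

Note that nothing in this model looks at whether the code at the address is 32-bit or 64-bit; the address alone decides.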

Closing the Gate

Basing the decision of which CFG bitmap to populate on the virtual address of the executable allocation creates an obvious dichotomy: 64-bit code, if running in a 32-bit address range, will instantly trip up CFG, because the NTDLL library that is active in that environment is the 64-bit version, which will check the 64-bit bitmap, which will not have any bits set in the 0-4 GB range. Similarly, any 32-bit code must be running below the 4 GB boundary, else the 32-bit NTDLL’s CFG validation routine will trip up, as the 32-bit bitmap isn’t even large enough to account for addresses above 4 GB.
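A toy simulation makes the mismatch easy to see. Everything here is a simplified stand-in (the ToyWow64Process class, the use of Python sets instead of bit arrays, the one-bit-per-8-bytes granularity); it only mirrors the population and lookup rules described in the text.

```python
FOUR_GB = 1 << 32


class ToyWow64Process:
    """Toy model: two CFG bitmaps, populated by address, checked by NTDLL."""

    def __init__(self) -> None:
        self.wow64_bits = set()   # stands in for the 32-bit bitmap
        self.native_bits = set()  # stands in for the 64-bit bitmap

    def mark_valid_target(self, address: int) -> None:
        # Population is keyed on the address, not on the type of code.
        bits = self.wow64_bits if address < FOUR_GB else self.native_bits
        bits.add(address >> 3)

    def check_from_ntdll32(self, target: int) -> bool:
        # The 32-bit bitmap cannot describe anything at or above 4 GB.
        return target < FOUR_GB and (target >> 3) in self.wow64_bits

    def check_from_ntdll64(self, target: int) -> bool:
        # The 64-bit NTDLL only ever consults the native bitmap.
        return (target >> 3) in self.native_bits
```

Marking a hypothetical 64-bit routine at 0x10000000 as valid and then validating it from the 64-bit side fails, because the bit was set in the WoW64 bitmap; the same address passes when checked from the 32-bit side.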

A naive solution is therefore proposed: simply allocate 64-bit code above the 4GB range, and the problem goes away. There is, of course, a problem with this approach: the NoChange VADs which block the entire > 4 GB region of memory and mark it unusable, leaving 64-bit NTDLL as the only valid allocation in that address range.

In Windows 10, these two factors combined result in the inability to execute any useful 64-bit code in a 32-bit application/WoW64 process, because the two restrictions combine, creating an impossible condition. You may be tempted to dismiss the reality by stating that all the 64-bit malicious code has to do is not to have been compiled with CFG. In this case, the compiler should not be emitting calls to the validation routine. However, this misses a critical point: it’s not the process’ own executable code/shellcode which are necessarily performing the 64-bit CFG checks. It’s the 64-bit NTDLL itself, or any other additional 64-bit DLLs you may have injected through the initial 64-bit shellcode, into your own process.

Even worse, even if no other 64-bit DLLs are imported, some core system functionality, implemented by NTDLL, also validates the CFG bitmap: Exceptions, User-Mode Callbacks, and APCs. Any usage of these system mechanisms, because they always initially execute in 64-bit mode, will cause a CFG violation if the target is not in the bitmap — which it cannot possibly be. The same goes for higher level functionality like using the Thread Pool, or any other callback-based mechanism owned by NTDLL in 64-bit mode. For example, because kernel-mode injects user-mode APCs through the 64-bit NTDLL, the user-mode APC routine cannot possibly be a custom, non-DLL function: it would’ve been impossible to allocate it > 4 GB, and the APC dispatcher will validate the CFG bits for any address < 4 GB, and be unable to find it.

Perhaps the best example of these unexpected side-effects is to analyze what Heaven’s Gate-using malware often does to gain some usefulness in the hidden 64-bit context: it will look up LdrLoadDll inside of NTDLL.DLL and attempt to load additional 64-bit DLLs, such as kernel32.dll. With some coercing (as some of the articles I linked to at the beginning showed), this can be made to work. The problem, in a CFG-aware NTDLL.DLL, is that LdrpCallInitRoutines will perform a CFG bitmap check before calling the DllMain of this DLL. As the DLL will be loaded in 32-bit address space, the WoW64 CFG bitmap will be marked, and not the Native CFG bitmap, causing the 64-bit NTDLL to believe that DllMain is not a valid indirect call target, and crash the process.

Suffice it to say, although it still is possible to have a very simple 64-bit piece of code, even possibly performing some system calls, execute in the hidden 64-bit world of a WoW64 process/32-bit application, any attempts to load additional DLLs, use APCs, handle exceptions or user-mode callbacks in 64-bit mode will result in the process crashing, as a CFG violation will be tripped. For most intents and purposes, therefore, CFG has a potentially unintended side-effect: it closes down Heaven’s Gate.

Reopening the gate is left as an exercise to the reader 😉

Final Note

Astute readers may have noticed the following discrepancies, especially if following along on their own systems:

This explains the three, not two VADs in my dump: in the original CFG implementation on Windows 8.1, 64-bit code could live in the 32-bit address range, as the Native bitmap had a “Wow64Low” portion. In Windows 10, this is now gone (saving 32MB of address space) — Native code is only aware of the 64-bit address ranges.