Archive for the ‘Articles and Presentations’ Category

Windows 10 introduces an exciting new feature with potential security implications – dynamic tracing which finally enables long awaited-for features in the operating system.

At boot, the OS now calls KiInitDynamicTraceSupport, which only if kernel debugging is enabled, will call into the TraceInitSystem export provided by the ext-win-ms-ntos-trace-L-1-1-0 API Set, which is not currently shipping in the public OS schemas. This export receives a callback table with the following 4 functions:

KeSetSystemServiceCallback

KeSetTracepoint

EtwRegisterEventCallback

MmDbgCopyMemory

If the function returns successfully, KiDynamicTraceEnabled is now set.

The last routine is not terribly interesting, as it is already used by the debugger when accessing physical memory through commands such as !dd or dd /p. But the other three routines, well, Christmas came early this year.

Kernel Mode ETW Event Callbacks

EtwRegisterEventCallback is a new internal function, accessible only by the dynamic trace system, which allows associating a custom ETW event callback routine, and associated context, with any ETW Logger ID. The function validates that the callback function is valid by calling KeIsValidTraceCallbackTarget, which does two things:

Is Dynamic Tracing Enabled? (i.e.: KiDynamicTraceEnabled == 1)

Is this a valid callback (same requirements as Ps and Ob callbacks, i.e.: was the driver containing the callback linked with /INTEGRITYCHECK)

Once the check succeeds, the matching logger context structure (WMI_LOGGER_CONTEXT) is looked up, and an appropriate ETW_EVENT_CALLBACK_CONTEXT structure is allocated from the pool (tag EtwC), and inserted into the CallbackContext field of the logger context.

At this point, any time an ETW event is thrown by this logger, this kernel-mode callback is also called, introducing, for the first time, support for kernel-mode consumption of ETW events, one of the biggest asks of the security industry in the last decade. This call is done by EtwpInvokeEventCallback, which calls the registered ETW callback with the raw ETW buffer data at the correct offset where this event starts, and the size of the event in the buffer. This new callback is called from:

EtwTraceEvent and EtwTraceRaw

EtwpLogKernelEvent and EtwpWriteUserEvent

EtwpEventWriteFull

EtwpTraceMessageVa

EtwpLogSystemEventUnsafe

This essentially gives access to the callback to any and all ETW event data, including even WPP and TraceLog debug messages.

System Call Hooks

KeSetSystemServiceCallback, on the other hand, fulfills the second Christmas wish of every Windows security researcher: officially implemented system call hooks. The API allows the dynamic trace system to register a system call hook by name and pass an associated callback function and context. It introduces a new table, called the KiSystemServiceTraceCallbackTable, which copies the contents of the KiServicesTab (a new, more comprehensive system call table) into a Red-Black tree which contains an entry for each system call with its absolute location, number of arguments, and a pre and post callback (and context).

Before continuing, it’s worth talking about the format of the new KiServicesTab structure, as it introduces some valuable information for reverse engineering:

The first 32-bit value is the hash of the system call function’s name

The second 32-bit value is the argument count of the function

The last 64-bit value, is the absolute pointer to the function

The hash function, implemented in a sane language as C looks as follows:

The trick, of course, is that the function name must be passed in without its Nt prefix – countless hours having been wasted trying to debug the hash algorithm by yours truly. Let’s take a look at some debugger output:

Back to KeSetSystemServiceCallback, if a system call callback is being registered, the KiSystemServiceTraceCallbackCount variable is incremented, and the KiDynamicTraceMask has its lowest bit set (the operations are reversed in the case of a system call callback unregistration). Unregistration is done by looping while KiSystemServiceTraceCallbacksActive is set, acting as a barrier to avoid unregistration in the middle of a call. All of these operations are further done under a lock (KiSystemServiceTraceCallbackLock).

Once the callback is registered (which must also satisfy the checks done by KeIsValidTraceCallbackTarget), it will interact with the system call handler as follows: inside of KiSystemServiceCopyEnd, a check is made with KiDynamicTraceMask to verify if the lowest bit is set, if so, system call execution goes through a path where KiTrackSystemCallEntry is called, passing in all of the register-based arguments in a single stack-based structure. This uses the KiSystemServiceTraceCallbackTable Red-Black Tree to locate any matching callbacks, and if one is present, and KiDynamicTraceEnabled is set, KiSystemServiceTraceCallbacksActive is incremented, the callback is made, and then the KiSystemServiceTraceCallbacksActive is decremented.

When this function returns, the return value is captured, and the actual system call handler is called. Then, KiTrackSystemCallExit is called, passing in both the capture result from earlier, as well as the return value of the system call handler. It performs the same operations as the entry routine, but calling the exit callback instead. Note that callbacks cannot override input parameters nor the return value, at the moment.

Trace Points

KeSetTracepoint is the last of the new capabilities, and introduces an ability to register dynamic trace points, enable and disable them, and finally unregister them. The idea of a ‘trace point’ should be familiar with anyone that has used Linux-based kprobes before.

A trace point is registered by passing in an address and which is then looked up against any currently loaded kernel modules. As long as the address is not part of the INIT (which is discarded by now, or soon will be) or KVASCODE section (which is the KVA Shadow space used to mitigate Meltdown), a trace point structure is allocated in non-paged pool (with the Ftrp tag). Next, the KiTpHashTable is used to scan for existing trace points on the same address. If one is found, an error is returned, as only a single trace point is supported per function. Note that trace point callbacks are also validated by calling KeIsValidTraceCallbackTarget just like in the case of the previous callbacks.

KiTpSetupCompletion is used to finalize registration of a trace point, which first calls KiTpReadImageData based on the instruction size that was specified. An instruction parser (KiTpParseInstructionPrefix, KiTpFetchInstructionBytes) is used, followed by an emulator (KiTpEmulateInstruction, KiTpEmulateMovzx, and many more) are used to determine the instruction size that is required. Once the information is known, the original instructions are copied. For what it’s worth, KiTpReadImageData is a simple function which attaches to the input process and basically does a memcpy of the address and specified bytes.

Once registered, the KiTpRegisteredCount variable is incremented, and the trace point can now be enabled. The first time this happens, KiTpEnabledCount is incremented, and the KiDynamicTraceMask is modified, this time setting the next lowest bit. Then, KiTpWriteMemory is called, which follows a similar code path as when using the debugger to set breakpoints (attaching to the process, if any, calling MmDbgCopyMemory to probe the address, and then using MmDbgCopyMemory wrapped inside of KdEnterDebugger and KdExitDebugger to make the patch.

Disabling a trace point follows the same pattern, but in the opposite direction. Just like system call callback unregistration, a variable, this time called KiTpActiveTrapsCount, is used to avoid removing a trace point while it is still active, and all operations are done by holding a lock (KiTpStateLock).

So how are trace points actually triggered? Simple (again, no surprise to kprobe users) – an “INT 3” instruction is what ends up getting patched on top of the existing code at the target address, which will result in an eventual exception to be handled by KiDispatchException. If the status code is STATUS_BREAKPOINT, and KiDynamicTraceMask has Bit 1 set, KiTpHandleTrap is called.

This increments KiTpActiveTrapsCount to protect against racing unregistration, and looks up the address in KiTpHashTable. If there’s a match, it then makes sure that an INT 3 is actually present at the address. In case of a match, the handler checks if this is a first chance exception and if dynamic tracing is enabled. If so, the callback is executed, and its return value is used to determine if the exception should be raised or not. If the function returns FALSE, then KiTpEmulateInstruction is called to emulate the original instruction stream and resume execution. Otherwise, if dynamic tracing is not enabled, or if this is a second chance exception, KiTpWriteMemory is used to restore the original code to avoid any further traps on that address.

Conclusion

Well, there you have it – the latest 19H1 release of Windows should introduce some exciting new functionality for tracing and debugging. Realistically, it’s unlikely that this will ever be exposed for 3rd security product party use (or even to internal competing security tools), but the addition of these capabilities may one day lead to that functionality being exposed in some way (especially the ETW tracing capability). Since there is no publicly shipping host module for the Tracing API Set, it may be that this functionality will only ever internally be used by Microsoft for their own testing, but it would be great to one day see it for 3rd party debugging and tracing as well.

Finally, it’s worth noting that the KiDynamicTraceEnabled variable is protected by the PsKernelRangeList, which is PatchGuard’s internal way of monitoring specific variables and tables outside of its regular set of behaviors, so attackers that try to manipulate this behavior illicitly will likely incur its wrath. Still, since this functionality is meant to be used when a kernel debugger is attached (which disables PatchGuard), it’s certainly possible to build a custom hand-crafted driver that enables and uses this functionality for legitimate purposes inside of say, a sandbox product.

Introduction

A few months ago, as part of looking through the changes in Windows 10 Anniversary Update for the Windows Internals 7th Edition book, I noticed that the kernel began enforcing usage of the CR4[FSGSBASE] feature (introduced in Intel Ivy Bridge processors, see Section 4.5.3 in the AMD Manuals) in order to allow usage of User Mode Scheduling (UMS).

This led me to further analyze how UMS worked before this processor feature was added – something which I knew a little bit about, but not enough to write on.

What I discovered completely changed my understanding of 64-bit Long Mode semantics and challenged many assumptions I was making – pinging a few other experts, it seems they were as equally surprised as I was (even Mateusz”j00ru” Jurczyk wasn’t aware!).

Throughout this blog post, you’ll see how x64 processors, even when operating in 64-bit long mode:

Still support the usage of a Local Descriptor Table (LDT)

Still support the usage of Call Gates, using a new descriptor format

Still support descriptor-table-based (GDT/LDT) segmentation using the fs/gs segment – ignoring the new MSR-based mechanism that was intended to “replace” it

At the end of the day, we’ll show that j00ru’s and Gynvael Coldwind’s amazing paper on abusing Descriptor Tables is still relevant, even on x64 systems, on systems up to Windows 10 Anniversary Update. As such, reading that paper should be considered a prerequisite to this post.

Please, take into consideration that all these techniques no longer work on Anniversary Update systems or later, nor will they work on Intel Ivy Bridge processors or later, which is why I am presenting them now. Additionally, there is no “vulnerability” or “zero-day” presented here, so there is no cause for alarm. This is simply an interesting combination of CPU, System, and OS Internals, which on older systems, could’ve been used as a way to gain code execution in Ring 0, in the presence of an already existing vulnerability.

A brief primer on User Mode Scheduling

UMS efficiently allows user-mode processes to switch between multiple “user” threads without involving the kernel – an extension and large improvement of the older “fiber” mechanism. A number of videos on Channel 9 explain how this is done, as does the patent.

One of the key issues that arises, when trying to switch between threads without involving the kernel, is the per-thread register that’s used on x86 systems and x64 systems to point to the TEB. On x86 systems, the FS segment is used, leveraging an entry in the GDT (KGDT_R3_TEB), and on x64, the GS segment is used, leveraging the two Model Specific Registers (MSRs) that AMD implemented: MSR_GS_BASE and MSR_KERNEL_GS_SWAP.

Because UMS would now need to allow switching the base address of this per-thread register from user-mode (as involving a kernel transition would defy the whole point), two problems exist:

On x86 systems, this could be implemented through segmentation, allowing a process to have additional FS segments. But doing so in the GDT would limit the number of UMS threads available on the system (plus cause performance degradation if multiple processes use UMS), while doing so in the LDT would clash with the existing usage of the LDT in the system (such as NTVDM).

On x64 systems, modifying the base address of the GS segment requires modifying the aforementioned MSRs — which is a Ring 0 operation.

It is worth bringing up the fact that fibers never solved this problem –instead having all fibers share a single thread (and TEB). But the whole point of UMS is to provide true thread isolation. So, what can Windows do?

Well, it turns out that close reading of the AMD Manuals (Section 4.8.2) indicate the following:

“Segmentation is disabled in 64-bit mode”

“Data segments referenced by the FS and GS segment registers receive special treatment in 64-bit mode.”

“For these segments, the base address field is not ignored, and a non-zero value can be used in virtual-address calculations.

I can’t begin to count how many times I’ve heard, seen, and myself repeated the first bullet. But that FS/GS can still be used with a data segment, even in 64-bit long mode? This literally brought back memories of Unreal Mode.

Clearly, though, Microsoft was paying attention (did they request this?). As you can probably now guess, UMS leverages this particular feature (which is why it is only available on x64 versions of Windows). As a matter of fact, the kernel creates a Local Descriptor Table as soon as one UMS thread is present in the process.

This was my second surprise, as I had no idea LDTs were still something supported when executing native 64-bit code (i.e.: ‘long mode’). But they still are, and so adding in the TABLE_INDICATOR (TI) bit (0x4) in a segment will result in the processor reading the LDTR to recover the LDT base address and dereference the segment indicated by the other bits.

Let’s see how we can get our own LDT for a process.

Local Descriptor Table on x64

Unlike the x86 NtSetLdtEntries API and the ProcessLdtInformation information class, the x64 Windows kernel does not provide a mechanism for arbitrary user-mode applications to create an LDT. In fact, these APIs all return STATUS_NOT_SUPPORTED.

That being said, by calling the user-mode API EnterUmsSchedulingMode, which basically calls NtSetInformationThread with the ThreadUmsInformation class, the kernel will go through the creation of an LDT (KeInitializeProcessLdt).

This, in turn, will populate the following fields in KPROCESS:

LdtFreeSelectorHint which indicates the first free selector index in the LDT

LdtTableLength which stores the total number of LDT entries – this is hardcoded to 8192, revealing the fact that a static 64K LDT is allocated

LdtSystemDescriptor which stores the LDT entry that will be stored in the GDT

LdtBaseAddress which stores a pointer to the LDT of this process

LdtProcessLock which is a FAST_MUTEX used to synchronize changes to the LDT

Finally, a DPC is sent to all processors which loads the LDT into all the processors.

This is done by reading the KPROCESS->LdtSystemDescriptor and writing into the GDT at offset 0x60 on Windows 10, or offset 0x70 on Windows 8.1 (bonus round: we’ll see why there’s a difference a bit later).

Then, the LLDT instruction is used, and the selector is stored in the KPRCB->LdtSelector field. At this point, the process has an LDT. The next step is to fill it out.

The function now reads the address of the TEB. If the TEB happens to fall in the 32-bit portion of the address space (i.e.: than 0xFFFFFF000), it is set as the base address of a new segment in the LDT (using LdtFreeSelectorHint to choose which selector – in this case, 0x00), and the TebMappedLowVa field in KTHREAD replicates the real TEB address.

On the other hand, if the TEB address is above 4GB, Windows 8.1 and earlier will transform the private allocation holding the TEB into a shared mapping (using a prototype PTE) and re-allocate a second copy at the first available top-down address available (which would usually be 0xFFFFE000). Then, TebMappedLowVa will have this re-mapped address below 4GB.

Additionally, the VAD, which remains “private” (and this will not show up as a truly shared allocation) will be marked as NoChange, and further will have the VadFlags.Teb field set to indicate it is a special allocation. This prevents any changes to be made to this address through calls such as VirtualProtect.

Why this 4GB limitation and re-mapping? How does an LDT help here? Well, it turns out that the AMD64 manuals are pretty clear about the fact that the mov gs, XXX and pop gs instructions:

Wipe the upper 32-bit address of the GS base address shadow register

Load the lower 32-bit address of the GS base address shadow register with the contents of the descriptor table entry at the given selector

Therefore, x86-style segmentation is still fully supported when it comes to FS and GS, even when operating in long mode, and overrides the 64-bit base address stored in MSR_GS_BASE. However, because there is no 64-bit data segment descriptor table entry, only a 32-bit base address can be used, requiring this complex remapping done by the kernel.

On Windows 10, however, this functionality is not present, and instead, the kernel checks for presence of the FSGSBASE CPU feature. If the feature is present, an LDT is not created at all, and instead the fact that user-mode applications can use the WRGSBASE and RDGSBASE instructions is leveraged to avoid having to re-map a < 4GB TEB. On the other hand, if the CPU feature is not available, as long as the real TEB ends up below 4GB, an LDT will still be used.

A further, and final change, occurs in Anniversary Update, where the LDT functionality is completely removed – even if the TEB is below 4GB, FSGSBASE is enforced for UMS availability.

Lastly, during every context switch, if the KPROCESS of the newly scheduled thread contains an LDT base address that’s different than the one currently loaded in the GDT, the new LDT base address is loaded in the GDT, and the LDT selector is loaded against (hardcoded from 0x60 or 0x70 again).

Note that if the new KPROCESS does not have an LDT, the LDT entry in the GDT is not deleted – therefore the GDT will always have an LDT entry now that at least one UMS thread in a process has been created, as can be seen in this debugger output:

Call Gates on x64

Call gates are a mechanism which allows 16-bit and 32-bit legacy applications to go from a lower privilege level to a higher privilege level. Although Windows NT never used such call gates internally, a number of poorly written AV software did, a few emulators, as well as exploits, both on 9x and NT systems, because of the easy way they allowed someone with access to physical memory (or with a Write-What-Where vulnerability in virtual memory) to create a backdoor way to elevate privileges.

With the advent of Supervisor Mode Execution Prevention (SMEP), however, this technique seems to have fallen out of fashion. Additionally, on x64 systems, since Call Gates are expected to be inserted into the Global Descriptor Table (GDT), which PatchGuard is known to protect, the technique is even further degraded. On top of that, most people (myself included) assumed that AMD had simply removed this oft-unused feature completely from the x64 architecture.

Yet, interestingly, AMD did go through the trouble of re-defining a new x64 long mode call gate descriptor format, removing the legacy “parameter count”, and extending it to a 16-byte format to make room for a 64-bit offset, as shown below:

That means that if a call gate were to find itself into a descriptor table, the processor would still support the usage of a far call or far jmp in order to reference a call gate descriptor and change CS:RIP to a new location!

Exploit Technique: Finding the LDT

First, although SMEP makes a Ring 3 RIP unusable for the purposes of getting Ring 0 execution, setting the Target Offset of a 64-bit Call Gate to a stack pivot instruction, then RET’ing into a disable-SMEP gadget will allow Ring 0 code execution to continue.

Obviously, HyperGuard now prevents this behavior, but HyperGuard was only added in Anniversary Update, which disables usage of the LDT anyway.

This means that the ability to install a 64-bit Call Gate is still a viable technique for getting controlled execution with Ring 0 privileges.

That being said, if the GDT is protected by PatchGuard, then it means that inserting a call gate is not really viable – there’s a chance that it may be detected as soon as its inserted, and even an attempt to clean-up the call gate after using it might come too late. When trying to implement a stable, persistent, exploit technique, it’s best to avoid things which PatchGuard will detect.

On the other hand, now we know that x64 processors still support using an LDT, and that Windows leverages this when implementing UMS. Additionally, since arbitrary processes can have arbitrary LDTs, PatchGuard does not guard individual process’ LDT entries, unlike the GDT.

That still leaves the question of how do we find the LDT of the current process, once we’ve enabled UMS? Well, given that the LDT is a static, 64KB allocation, from non-paged pool, this does still leave us with an option. As explained a few years ago on my post about the Big Pool, such a large allocation will be easily enumerable from user-mode as long as its tag is known:

While this is a nice information leak even on Windows 10, a mitigation comes into play unfortunately in Windows 8.1: Low IL processes can no longer use the API I described, meaning that the LDT address can only be leaked (without an existing Ring 0 arbitrary read/infoleak vulnerability) at Medium IL or higher.

Given that this is a fairly large size allocation, however, it means that if a controlled 64KB allocation can be made in non-paged pool and its address leaked from Low IL, one can still guess the LDT address. Ways for doing so are left as an exercise to the reader 🙂

Alternatively, if an arbitrary read vulnerability is available to the attacker, the LDT address is easily retrievable from the KPROCESS structure by reading the LdtBaseAddress field or by computing it from the LdtSystemDescriptor field. Getting the KPROCESS is easy through a variety of undocumented APIs, although these are now also blocked on Windows 8.1 from Low IL.

Therefore, another common technique is to use a GDI or User object which has an owner such a tagTHREADINFO, which then points to ETHREAD (which then points to EPROCESS). Alternatively, one could retrieve the GDT base address from the KPCR’s GdtBase field, if a way of leaking the KPCR is available, and then read the segment base address at offset 0x60 or 0x70. The myriad ways of leaking pointers and bypassing KASLR, even from Low IL, is beyond (beneath?) the content of this post.

Exploit Technique: Building a Call Gate

The next step is to now write a call gate in one of the selectors present in the LDT. By default, if this is the initial scheduler thread, we expect to find its TEB. Indeed, on this sample Windows 8.1 VM, we can see the re-mapped TEB at 0xFFFFE000:

Converting this data segment into a call gate can be achieved by merely converting the type from 0x13 (User Data Segment, R/W, Accessed) to 0x0C (System Segment, Call Gate).

However, doing so will now create a call gate with the following CS:[RIP] => E000:00000000FFFF1820

We have thus two problems:

0xE000 is not a valid segment

0xFFFF1820 is a user-mode address, which will cause a SMEP violation on most modern systems.

The first problem is not easy to solve – while we could create thousands of UMS threads, causing 0xE000 to become a valid segment (which we’d then convert into a Ring 0 Code Segment), this would be segment 0xE004. And if one can change 0xE000, might as well avoid the trouble, and set it to its correct value – (KGDT64_R0_CODE) 0x10, from the get go.

The second problem can be fixed in a few ways.

An arbitrary write can be used to set BaseUpper, BaseHigh, LimitHigh, Flags2, and LimitLow (which make up the 64-bits of Code Offset) to the desired Ring 0 RIP that contains a stack pivot or some other interesting instruction or gadget.

Or, an arbitrary write to modify the PTE to make it Ring 0, since the PTE base address is not randomized on the Windows versions vulnerable to an LDT-based attack.

Lastly, if one is only interested in SYSTEM->Ring 0 escalation, systems prior to Windows 10 can be attacked through the AWE-based attack I described at Infiltrate 2015, which will allow the creation of an executable Ring 0 page.

It is also worth mentioning that since Windows 7 has all of non-paged pool marked as executable, and the LDT is itself a 64KB non-paged pool allocation, it is made up of entirely executable pages, so an arbitrary write could be used to set the Call Gate offset to somewhere within the LDT allocation itself.

Exploit Technique: Writing the Ring 0 Payload

Writing x64 Ring 0 payload code is a lot harder than x86.

For starters, the GS segment must be immediately set to its correct value, else a triple fault could occur. This is done through the swapgs instruction.

Next, it’s important to realize that a call gate sets the stack segment selector (SS) to 0. While x64 natively operates in this fashion, Windows expects SS to be KGDT64_R0_DATA, or 0x18, and it may be a good idea to respect that.

Additionally, note that the value to which RSP will be set to is equal to the TSS’s Rsp0, normally used for interrupts, while a typical system call would use the KPRCB’s RspBase field. These ought to be in sync, but keep in mind that a call gate does not disable interrupts automatically, unlike an interrupt gate.

A reliable exploit must take note of all these details to avoid crashing the machine.

Further, exiting from a call gate must be done with the ‘far return’ instruction. Once again, another caveat applies: some assemblers may not generate a true 64-bit far return (i.e.: lacking a rex.w prefix), which will incorrectly pop 32-bit data from the stack. Make sure that a ‘retfq’ or ‘retfl’ or ‘rex.w retf’ is generated instead.

Note that we’ve gone through some difficulty in obtaining the address of the LDT, and describing the ways in which the UMS TEB entries could be corrupted in a way to convert them to Call Gate entries, it’s useful to mention that perhaps a much easier (depending on the attack parameters and vulnerability) technique is to just overwrite the LdtSystemDescriptor field in EPROCESS (something which j00ru’s x86-based paper also pointed out).

That’s because, at the next context switch, the GDT will automatically be updated a copy of this descriptor, which could be set to a user-mode base address (due to a lack of SMAP in the OS), avoiding the need to either patch the GDT (and locating it — which is hard when Hyper-V’s NPIEP feature is enabled) or modifying the true kernel LDT (and leaking its address).

Indeed, for this to work, a single 32-bit (in fact, even less) arbitrary write is required, which must, at minimum, set the fields:

P to 1 (Making the segment present)

Type to 2 (Setting the segment as an LDT entry)

BaseMid to 1 (Setting the base to 0x10000, as an example, as addresses below this are no longer allowed)

Therefore, a write of 0x00008201, for example, is sufficient to achieve the desired result of setting this process’ LDT to 0x10000.

As soon as a context switch occurs back to the process, the GDT will have this LDT segment descriptor loaded:

But wait – isn’t setting a limit of 0 creating an empty LDT? Not to worry! In long mode, limits on LDT descriptor entries are completely ignored… unfortunately though, although this is what the AMD64 manual states, I get access violations, at least on Hyper-V x64, if the limit is not large enough to contain the segment. So your mileage my vary.

But that’s all right – we can still limit this to a simple 4-byte overwrite! The trick lies into simply going through the process of creating a real LDT in the first place, then leaking its address (as described). Following that, allocate the user-mode fake LDT at the same lower 32-bit address, keeping the upper 32-bits zeroed. Then, use the 4-byte overwrite to clear the KPROCESS’ LdtSystemDescriptor’s BaseUpper field.

Even if the kernel LDT address cannot be leaked for some reason, one can easily “guess” every possibility (knowing it will be page aligned) and spray the entire 32-bit address space. This sounds like a lot, but is really only about a million allocations.

Finally, an alternate technique is to leverage exception handling: if the wrong LDT is overwritten, the kernel won’t crash when loading the invalid LDT segment (as long as it’s canonical, the PTE isn’t checked for validity). Instead, only when the exploit attempts to use the call gate, will a GPF be generated, and only in the context of the Ring 3 application. As such, one can progressively try each possible lower 32-bit LDT address until a GPF is no longer issued. Voila: we have found the correct lower 32-bits.

As another bonus, why is it that the selector for the LDT is 0x70 on Windows 8.1 and earlier, but 0x60 on Windows 10?

The answer lies in an even lesser known fact: up until the latter, the kernel created a Ring 0 Compatibility Mode Segment at offset 0x60! This means that a sneaky attacker can set CS to 0x60 and enjoy a weird combination of 32-bit legacy code execution with Ring 0 privileges (a number of caveats apply, including what an interrupt would do when returning, and the fact that no kernel API could be used at all).

Finally, note that even once a UMS-leveraging process exists, the GDT entry is not cleared, and points to a freed pool allocation. This means that if a way to allocate 64KB of controlled non-paged pool memory is known (such as some of the ways described in my Big Pool blog post), the GDT entry could be made to point to controlled memory (such as a named pipe buffer) which will re-use the same pointer. Then, some way to make the system continue to trust this address/entry should be achieved (either by causing an LLDT of 0x60/0x70 to be issued or having an EPROCESS’s LdtSystemDescriptor field re-use this address).

This is more of an anti-forensics technique than anything, because it keeps the GDT pointing to a kernel-mode LDT, even though it’s attacker controlled.

PoC||GTFO

While I won’t be releasing sample code leveraging this attack, it could easily be added to the various PowerShell-based “Vulnerable Driver” techniques that @b33f has been creating.

Here’s a sample screenshot of the attack based on a C program, with me using the debugger to perform the 32-bit arbitrary write (vs. sending an IOCTL to a vulnerable driver).

It sits in a loop (after leaking and allocating the data that it needs) and attempts to execute the call gate every second, until the arbitrary write is performed.

Conclusion: Windows 10 Anniversary Update Mitigations

In Windows 10 Anniversary Update (“Redstone 1”), a number of changes make these exploit techniques impossible to use:

All of the LDT-related fields and code in the kernel is removed. There is now no way of having an LDT through any Windows-supported mechanism.

PatchGuard now checks the LDTR register. If it’s non-zero, it crashes the system.

MSRC and the various security teams at Microsoft deserve kudos for thinking about — and plugging — the attack vector that LDTs provided, which is certainly not a coincidence 🙂

Further, the following generic mitigations make classes of such attacks much harder to exploit:

Randomization of the PTE base make it harder to bypass SMEP by making Ring 3 memory appear as Ring 0.

Technologies such as KCFG make it even harder to exploit control over arbitrary CS:RIP.

Finally, as described earlier, even on Windows 8.1, if the FSGSBASE feature is available in KeFeatureBits, the kernel will not allow the creation of an LDT, nor will it load the LDT during a context switch. You can easily verify this by calling (Ex/Rtl)IsProcessorFeaturePresent(PF_RDWRFSGSBASE_AVAILABLE).

What am I up to?

Long-time readers of this blog are probably aware that updates have been rare in the past few years, although I do try to keep time for some interesting articles from time to time. Most of my public research lately has been done through the Infosec Conference Circuit, so if you were not already aware, you can download slides from all my talks at the following URL:

Surface Aggregator Module (SAM) Internals — a little chip on your Surface Pro/Laptop/Book that you probably didn’t know was there. If you liked my past talks on the Apple SMC, you’ll enjoy this as well — Recon in Montreal (June)

I also have a number of interesting design flaws I discovered this year in various Windows components — as these get patched (they are not Tavis-worthy wormable RCEs, not to worry), I have been mulling over a “Windows Design Flaw Garage Sale” talk similar to the famous one that Stefan Esser (i0nic) gave a few years ago about Apple/iOS — covering some past bugs (fixed and unfixed) and more recent ones.

However, this post is not about such small research updates — but rather about a much bigger piece of work that has taken up my time these last 12 months — the release of Windows Internals, 7th Edition (Part 1)!

Windows Internals, 7th Edition

Some history…

After the release of the 6th Edition of the book, which covered Windows 7, it’s fair to say that I was pretty burned out. The book incurred heavy delays due to my juggling of college, internships, and various relationships, while also requiring a massive amount of work due to the ambitious new sections, and coverage of the many, many changes that Windows 7 brought to the table (either fine-tuning many small things from Vista, or completely new kernel modules). Additionally, my co-authors also had new plans: David Solomon went on to retire and sunset his training business (David Solomon Expert Seminars), and Mark Russinovich was fully committed to his new role at Microsoft which eventually took him to Azure, where he is now the Chief Technology Officer (CTO), and kicking some major cloud/fabric butt with his extensive OS experience and security background. All of this to say — there was not much of an appetite to immediately begin writing a new book, with Windows 8 looming on the horizon (at that point still called Windows Blue).

Something else happened at that time: under leadership from Satya Nadella, Microsoft began delivering on its “Windows as a Service (WaaS)” model, furiously releasing a Windows 8.1 Update within a year of Windows 8 having shipped. Given that a single OS update had taken us years to cover, this release cycle was simply too rapid to successfully think about releasing a book in a timely fashion. I stopped thinking that a new edition of the book would ever be released, and I certainly didn’t think I’d be able to do one.

All gaps create opportunities, and two other authors decided that they could take on the 7th Edition and ship a successful update. They re-arranged the book in three parts, instead of two, with the first one focusing on Windows 8 User-Mode Metro (now UWP) Application Development, the second one on the Kernel, and the third one on Driver Development. I was not contacted or involved in these changes, and honestly, was not too happy about them. There are excellent driver programming books, just as there are application development books (even on Metro/UWP). This felt, to me, like an attempt to significantly cut down on the kernel portions of the book, and monetize on the Metro/Driver programming books, which obviously have a much wider audience.

Additionally, with Windows 8 having shipped, Part 1 was slated for that year, with Part 2 (Windows 8.1 would now be out) the year after, and finally, Part 3, a year after that (Windows 10 would now be in beta). By the time you’d get to the last part, the OS would’ve already moved two releases further — or, each part could cover that OS. Becoming a Windows 8 Metro App Development book, with Windows 8.1 Kernel Internals book, and Windows 10 Driver Development book. These were just my personal thoughts at the time — which I kept to myself, because every author needs a chance to be successful, and others may well have liked this model, and the book may have sold more copies than all previous combined – who was I to judge?

One year passed. Then another… then another. By now, given that my name was still on the cover — regardless of my lack of involvement — many people would come to me and ask me “What’s going on? Why are you taking so long? Do you need help?” on the friendly side… and of course, some not-so-friendly comments, from people that had pre-ordered on Day 1, paying anywhere between $30-90, and receiving nothing 3 years later, with an ever-delayed release date. I strongly considered putting out a statement that I had nothing to do with this book — but chose to simply ask Microsoft Press to remove my name from the cover and all marketing materials. I preferred losing my association with this Bible, rather than be responsible for its contents, and its delays.

A new hope

Around the time that I did that, however, I realized that yet –another– name had been added to the pool! It was that of Pavel Yosifovich, a Microsoft MVP whose blog I had followed a few times, and whom I had heard about doing some Windows Internals training in the past, mostly in Israel. I thought highly of Pavel — and he was an established author of previous books. Additionally, he now had a Microsoft e-mail address — suggesting that once again, the series would have a real “internal” presence, who would communicate with the developer team, read source code comments, and more — while Mark and I had only, and solely, been reverse engineering, we had always had help from David’s connections and insight into the developer team, which the new books would’ve lacked.

started writing the #Windows-Internals book 7th edition… almost done with chapter 1

So I reached out, and to my pleasure, found out that Pavel had now become the sole co-author, the previous two having completely abandoned the project with no materials to show for it. Pavel was doing a herculean task of updating the entire book to now only cover Windows 8 and 8.1, but of course Windows 10 as well, which had reached its Threshold 2 (1511) Update, with Redstone 1 (1607) currently shipping to the Windows Insider Program (WIP). While having source access helps, this is still a task that I knew a single person would struggle with — and I really wanted the book to succeed for all of those that had placed their faith in it. I had also, over the last few years, had made lots of Windows reverse engineering, as many of you know, covering large parts of new Windows 8 and later components. This meant significantly reduced research time for me — all while having an amazing co-author. It seemed obvious that I should jump into the deep abyss of Windows Internals once again.

Pavel was extremely gracious in accepting an uninvited guest to the party, allowing me to make many changes to chapters that he had already completed (I don’t know if I would’ve done the same!). This started adding delays to the book, and Redstone 1 was about to ship — we decided to update the book to cover Redstone 1 from now on, and to go back to any places we knew there were changes. As we kept writing, I came up with new ideas and changes to the book — moving some things around, adding new kernel components, expanding on experiments, and the scope continued to increase. It was clear that I was once again, going to cause delays, which deeply bothered me.

Yet, Pavel was always there to pick up the slack, go beyond the call of duty, and spend nights on researching components as well as the more mundane parts of a book (screenshots and graphics). I could not have asked for a more humble host inside the world of his book. As we were wrapping up, I realized that Redstone 2 (1703) was nearing its feature complete date (around January of this year). I made yet another potentially delaying decision to go back, once again, and to hurriedly find any places where I knew changes had been made, and to update as much of the book as I could. I saw an opportunity — to release a Windows Internals book within weeks of a Windows release, covering that Windows release. A feat which had not happened in many, many releases.

And so, here we are today, a little over a month since Windows 10 Creators Update — Redstone 2 — 1703 has shipped, with the update slowly rolling out over the month of April to hundreds of millions of users, with Build 2017 right around the corner, and with a Windows Internals book in the midst of it all, covering the very same operating system. While I apologize for the additional six months this has cost your pre-orders, I do believe it was the right call.

What’s new in the book? What’s changed?

One of the first things that Pavel had changed (other than returning the book to its usual two-part focus on the kernel and related system components) is to better organize key Windows concepts into the first part of the book, instead of having them spread out over both parts — this way, people could get what would likely be 80% of the material that is relevant to 90% of people as soon as the first part was released, instead of having to wait for both. This meant making the following changes:

Once I joined, it made sense, with this new flow, to make a few additional changes:

Processes and Jobs, now being its own chapter, became Processes, Jobs and Silos, which is the internal name for Windows Server Containers as well as Centennial/Desktop Bridge containers.

It made little sense that we were covering the User-Mode Loader (a section I first added in the 5th Edition) as a System Mechanism, instead of an integral part of the Process section (which made constant references to Part 2). I moved this section to be part of the same chapter.

Outside of these broad strokes, a full list of all the changes would obviously be too complex. I would estimate the sheer amount of new pages to be around 150 — with probably 50 other pages that have received heavy modification and/or updating. You can definitely expect coverage of the following new features:

Auto Boost [Scheduling]

Directed Switch [Scheduling]

Memory Partitions [Memory]

Priority Donation/Inheritance [Scheduling]

Security/Process Mitigations [Security]

CPU Sets [Scheduling]

Windows Containers [Processes]

Store Manager [Memory]

API Sets [Processes]

AppContainer [Security]

Token Attributes & Claims [Security]

Protected Process Light [Security / Processes]

Windows Subsystem for Linux [Architecture]

Memory Compression [Memory]

Virtual Trust Levels [Architecture]

Device Guard & Credential Guard [Security]

Processor Enclaves [Memory]

Secure Kernel Mode / Isolated User Mode [Architecture]

Pico Processes [Processes]

Power Management Framework (PoFx) [I/O Manager]

Power Availability Requests [I/O Manager]

And a lot more

Thank You!

Finally, I’d like to thank many people, inside and outside of Microsoft, that helped with some of the content, ideas, experiments, etc. Especially Andrea Allievi, who helped with some very hairy parts of the Memory Management section!

I know both Pavel and I hope you’ll enjoy this flow a bit better, and that you’ll have lots of reading to do in this new Edition. Feel free to hit me up at @aionescu as usual.

Brief Overview of WoW64

“Heaven’s Gate” refers to a technique first popularized by the infamous “Roy G. Biv” of 29afame, and later re-published in Valhalla #1. Cited and improved in various new forms, and even seen in the wild used by the Vawtrak banking malware, it centers around the fact that on a 64-bit Windows OS, seeing as how all kernel-mode components always execute in 64-bit mode, the address space, core OS structures (EPROCESS, PEB, etc…), and code segments for processes are all initially setup for 64-bit “long mode” execution, regardless of the process actually being hosted by a 32-bit executable binary.

In fact, on 64-bit Windows, the first piece of code to execute in *any* process, is always the 64-bit NTDLL, which takes care of initializing the process in user-mode (as a 64-bit process!). It’s only later that the Windows-on-Windows (WoW64) interface takes over, loads a 32-bit NTDLL, and execution begins in 32-bit mode through a far jump to a compatibility code segment. The 64-bit world is never entered again, except whenever the 32-bit code attempts to issue a system call. The 32-bit NTDLL that was loaded, instead of containing the expected SYSENTER instruction, actually contains a series of instructions to jump back into 64-bit mode, so that the system call can be issued with the SYSCALL instruction, and so that parameters can be sent using the x64 ABI, sign-extending as needed.

This process is accurately described in manysources, including in the Windows Internals books, so if you’re interested in reading more, you can do so, but I’ll spare additional details here.

Enter Heaven’s Gate

Heaven’s Gate, then, refers to subverting the fact that a 64-bit NTDLL exists (and a 64-bit heap, PEB and TEB), and manually jumping into the long-mode code segment without having to issue a system call and being subjected to the code flow that WoW64 will attempt to enforce. In other words, it gives one the ability to create “naked” 64-bit code, which will be able to run covertly, including issuing system calls, without the majority of products able to intercept and/or introspect its execution:

Microsoft’s EMET, as well as a myriad of similar tools and sandboxes, only hook/protect the 32-bit NTDLL for WoW64 processes, under the assumption that the 64-bit NTDLL can’t be reached in any other way. The mitigations can therefore be bypassed using Heaven’s Gate. The same technique has been used by the Phenom malware to bypass AV solutions.

When debugging a 32-bit application with a 64-bit debugger (such as WinDBG), you will initially see the 64-bit state (heap, stack, NTDLL, TEB, etc…). Since this state is uninteresting, as it only contains the WoW64 system call layer, manual commands and extensions must be used to investigate the 32-bit state instead — and so in order to avoid this, even Microsoft often recommends using the 32-bit WinDBG instead, which will provide a much more seamless debugging experience and show the 32-bit state of the process. Other 3rd party debuggers, which are 32-bit only, will also behave the same way. The problem, therefore, is that by using Heaven’s Gate, there IS now interesting 64-bit state, that these debuggers will miss.

Many emulation/detonation engines will, upon seeing a 32-bit executable, emulate it using x86 instructions. They will either ignore or be unable to handle x64 instructions, as they never expect them to run. In fact, this was recently shown by a blog post over at Hexacorn. Heaven’s Gate allows such x64 instructions to run, rendering the x86 code into “dummy” code for misdirection purposes.

Memory Restrictions

These and other “benefits” make Heaven’s Gate a tool of choice for malicious code. However, there always existed an interesting limitation in 32-bit applications running under WoW64: even when executing in 64-bit long-mode, addresses above the 4 GB could never be allocated (in fact, addresses above 2 GB could normally never be used for compatibility purposes, unless the image was linked with /LARGEADDRESSAWARE — the switch was originally designed to support /3GB x86 server environments, but outgrew its original intent to allow full 4 GB addresses under WoW64, a fact leveraged by many 32-bit games and browsers even today).

Using a kernel debugger and the !vad command, it’s simple to see why, such as on this Windows 7 system, where I’ve typed the command before the process has any chance of executing even a single instruction — not even NTDLL has loaded here, folks. This is an interesting view of what are the “earliest” memory structures you can find in a WoW64 process (at least on Windows 7).

Note that a giant VAD at the end, highlighted in teal, occupies the entire 64-bit portion of the address space. Let’s see what !vad has to say about it:

Seeing as how it’s configured as a “NoChange” and “OneSecured” VAD, it cannot be freed or modified in any way. This is further confirmed by the commit charge of -1.

On Windows 8 and later, however, the output changes, as you can see below. Note that I’ve re-used the same colors as in the Windows 7 output for clarity (and the uncolored VADs correspond to the CFG entries).

The 64-bit NTDLL is actually loaded in 64-bit address space now! And we have not one, but two teal-colored VADs, which surround it, re-creating the “no man’s land” just as on Windows 7 and earlier. This change was briefly mentioned, I believe, by Matt Miller (of skape fame) at one of Microsoft’s BlackHat presentations: it made it a bit harder to guess the location of the 64-bit NTDLL by simply adding a fixed size to the 32-bit NTDLL. In my screenshot, since this is a CFG-enabled process, the VADs don’t exactly envelop NTDLL — rather they surround the native CFG bitmap + NTDLL, but the point remains.

This change in NTDLL load behavior also had the likely intended side effect of making hooks in 64-bit NTDLL extremely hard, or outright impossible. You see, without consuming an enormous amount of space, it’s simply not possible to overwrite an x64 instruction with a call or jmp to an absolute 64-bit address efficiently. Instead, hooking engines will allocate a “trampoline” that is within the 32-bit address range of the hooked function, and use a much smaller 5 byte 32-bit relative jump, which happens to fit nicely in the “hotpatch aware” region that Microsoft binaries have (or anyone linking with /hotpadmin). The trampoline then uses the full 64-bit absolute jump instruction.

As you’ve figured out by now, if the trampoline needs to be within 2GB, but there are two large VADs blocking off all 64-bit addresses around NTDLL, this hooking technique is dead in the water. Other, more complex and error-prone techniques must (and can) be used instead.

Nevertheless, nothing stops Heaven’s Gate on Windows 8. There some minor WoW64 changes which one must adapt to, and accessing or hooking 64-bit NTDLL becomes harder.

Control Flow Guard and WoW64

In Windows 10, a new exploit mitigation is introduced called Control Flow Guard, or CFG. It too, has been rather well described in multiplesources, so I won’t go into details inside of this post. The important piece to remember about CFG is that all relative function calls are now subject to an additional compiler-generated check, which is implemented by NTDLL: only valid function prologues (within 8 bytes of alignment) can be the target of such a call. Valid function prologues, in turn, are marked by a bit being set in a very large bitmap (bit array) structure, which describes the entire user-mode address space (all 128TB of it!). I previously posted on some interesting changes this required in the memory manager, as this bit array obviously becomes quite large (2 TB, in fact).

What’s not been documented too clearly in most research is that on 64-bit systems, there are in fact not one, but two CFG bitmaps: one for 32-bit code, and one for 64-bit code. The addresses of both of these bitmaps is stored in the per-process working set structure (called MMWSL). This structure is pointed to by the MMSUPPORT structure inside of EPROCESS (i..e.: PsGetCurrentProcess()->Vm.VmWorkingSetList), but a unique thing about it, is that it’s stored in a region of memory called “hyperspace”, which is at a fixed address… much like the per-process page table entry array. On recent 64-bit systems, this hard-coded address is 0xFFFFF58010804000, a fact I pointed out in a previous blog post addressing the 64-bit address space of Windows 8.1 and later.

As one can see in the symbols that WinDBG can dump, the MMWSL structure contains a field:

Clearly, thus, a 64-bit Windows 10 kernel contains not one, but two CFG bitmaps. And indeed, the 32-bit NTDLL will utilize the address of the WoW64 bitmap, while the 64-bit NTDLL will utilize the Native bitmap. But why use two separate bitmaps? What separates a WoW64 bitmap from a native bitmap? One would imagine that 64-bit code is marked as executable in the native bitmap, and 32-bit code is marked as executable in the WoW64 bitmap… but that’s not quite the full story.

At verification time, indeed, it is the version of NTDLL that is being used, which determines which bitmap will be looked at. But how does the OS populate the bits?

In CFG-aware versions of Windows, the CFG bitmap is touched through two paths: MiCommitVadCfgBits, and MiCfgMarkValidEntries. These, in turn, correspond to either intrinsic CFG modifications (side-effects of allocating, protecting and/or mapping executable memory), or explicit CFG modifications (effect of calling SetValidCallTargets). Both of these paths will eventually call MiSelectCfgBitMap, whose pseudo-code is shown below.

As is quite clear from the code, any private memory allocations below the 64-bit boundary will be marked only in the 32-bit bitmap, while the opposite applies to the 64-bit bitmap. In fact, this is the result of an optimization: instead of having two 2TB bit arrays for each processor execution mode, a single 2TB array is used for 64-bit native code, while a single 32MB array is used for 32-bit native code, greatly reducing address space consumption.

Closing the Gate

Basing the decision of which CFG bitmap to populate on the virtual address of the executable allocation creates an obvious dichotomy: 64-bit code, if running in a 32-bit address range, will instantly trip up CFG, because the NTDLL library that is active in that environment is the 64-bit version, which will check the 64-bit bitmap, which will not have any bits set in the 0-4 GB range. Similarly, any 32-bit code must be running below the 4 GB boundary, else the 32-bit NTDLL’s CFG validation routine will trip up, as the 32-bit bitmap isn’t even large enough to account for addresses above 4 GB.

A naive solution is therefore proposed: simply allocate 64-bit code above the 4GB range, and the problem goes away. There is, of course, a problem with this approach: the NoChange VADs which block the entire > 4 GB region of memory and mark it unusable, leaving only 64-bit NTDLL as the only valid allocation in that address range.

In Windows 10, these two factors combined result in the inability to execute any useful 64-bit code in a 32-application/WoW64 process, because the two restrictions combine, creating an impossible condition. You may be tempted to dismiss the reality by stating that all the 64-bit malicious code has to do is not to have been compiled with CFG. In this case, the compiler should not be emitting calls to the validation routine. However, this misses a critical point: it’s not the process’ own executable code/shellcode which are necessarily performing the 64-bit CFG checks — it’s the 64-bit NTDLL itself, or any other additional 64-bit DLLs you may have injected through the initial 64-bit shellcode, into your own process.

Even worse, even if no other 64-bit DLLs are imported, some core system functionality, implemented by NTDLL, also validates the CFG bitmap: Exceptions, User-Mode Callbacks, and APCs. Any usage of these system mechanisms, because they always initially execute in 64-bit mode, will cause a CFG violation if the target is not in the bitmap — which it cannot possibly be. The same goes for higher level functionality like using the Thread Pool, or any other callback-based mechanism owned by NTDLL in 64-bit mode. For example, because kernel-mode injects user-mode APCs through the 64-bit NTDLL, the user-mode APC routine cannot possibly be a custom, non-DLL function: it would’ve been impossible to allocate it > 4 GB, and the APC dispatcher will validate the CFG bits for any address < 4 GB, and be unable to find it.

Perhaps the best example of these unexpected side-effects is to analyze what Heaven’s Gate-using malware often does to gain some usefulness in the hidden 64-bit context: it will lookup LdrLoadDll inside of NTDLL.DLL and attempt to load additional 64-bit DLLs, such as kernel32.dll. With some coercing (as some of the articles I linked to at the beginning showed), this can be made to work. The problem, in a CFG-aware NTDLL.DLL, is that LdrpCallInitRoutines will perform a CFG bitmap check before calling the DllMain of this DLL. As the DLL will be loaded in 32-bit address space, the WoW64 CFG bitmap will be marked, and not the Native CFG bitmap — causing the 64-bit NTDLL to believe that DllMain is not a valid relative call target, and crash the process.

Suffice it to say, although it still is possible to have a very simple 64-bit piece of code, even possibly performing some system calls, execute in the hidden 64-bit world of a WoW64 process/32-bit application, any attempts to load additional DLLs, use APCs, handle exceptions or user-mode callbacks in 64-bit mode will result in the process crashing, as a CFG violation will be tripped. For most intents and purposes, therefore, CFG has a potentially unintended side-effect: it closes down Heaven’s Gate.

Reopening the gate is left as an exercise to the reader 😉

Final Note

Astute readers may have noticed the following discrepancies, especially if following along on their own systems:

This explains the three, not two VADs in my dump: in the original CFG implementation on Windows 8.1, 64-bit code could live in the 32-bit address range, as the Native bitmap had a “Wow64Low” portion. In Windows 10, this is now gone (saving 32MB of address space) — Native code is only aware of the 64-bit address ranges.

The book will mark a return to the previous format of the series — unlike the last edition which covered all supported Windows NT operating systems (2000, XP, 2003), this edition will only cover the Vista and Server 2008 operating systems. This is a great change, because it means less time is explaining minute differences between the 4 different algorithms used in a lookup in each version, and more time is spent talking about what really matters — the behavior and design decisions of the OS.

I’m happy to say this new edition will have a least 250 pages of new information, not only updating various chapters with new Vista/Server 2008 changes, but also adding entirely new sections which previous editions had never touched on, such as the user mode loader in Ntdll.dll, the user mode debugging framework, and the hypervisor. It isn’t only the internals information that will benefit from the update; as a matter of fact, all references to tools, resources and books have also been updated, including up-to-date information on the latest Sysinternals tools, as well as exciting, helpful new experiments that demonstrate the behaviors explained in the text. For those of you who have read Mark’s TechNet Magazine “Inside Vista Kernel Changes” series and its Server 2008 counterpart (for those who haven’t, I strongly suggest you do!), you’ll be glad to know that the book includes all that information and expands on it as well.

The book work is still ongoing, but planned to end soon, after which it should go to print in October and be on shelves in January 2009. My book work is about to reach its one-year anniversary, and I must say that working with Mark and David has been a pleasurable learning experience, as well as a great chance to continue my reverse engineering work and hone my skills. What made ReactOS fun was being able to share my discoveries with the world as code — the book work has allowed me to share that information as text, part of the best internals book available, to a much wider audience. I’m thankful for that, and I can’t wait for everyone to have a chance to read it!

Look for the book hitting your nearest bookstore just after New Year’s. As a sneak peak, here’s a high-quality copy of the book’s cover while you wait.

This year I had the chance to present some security-related findings that I had made earlier during the year inside Win32k.sys, the Windows GUI Subsystem and Window Manager. I presented a total of four bugs, all local Denial of Service (DoS) attacks. Two of these attacks could be done on any system up to Vista SP0 (one up to Server 2008) from any account, while the other two were NULL-pointer dereferences that required administrative access to exploit (however, since they are NULL-pointer dereferences, the risk for user->kernel elevation exists in one of them).

Because the first two attacks (from any guest account) relied on highly internal behavior of the Windows NT Kernel (used in all of Microsoft’s modern operating systems today), I thought it was interesting to talk a bit about the internals themselves, and then present the actual bug, as well as some developer guidance on how to avoid hitting this bug.

I’d like to thank everyone who came to see the talk, and I really appreciated the questions and comments I received. I’ve also finally found out why Win32k functions are called “xxx” and “yyy”. Now I just need to find out about “zzz”. I’ll probably share this information in a later post, when I can talk more about Win32k internals.

As promised, you’ll find below a link to the slides. As for proof of concept code, it’s currently sitting on my home machine, so it’ll take another week. On the other hand, the slides make three of the bugs very clear in pseudo-code. I’ll re-iterate them at the bottom of the post.

One final matter — it seems I have been unable to make the bh08@alex-ionescu.com email work due to issues with my host. For the moment, please use another e-mail to contact me, such as my Google Mail account. My email address is the first letter of my first name (Alex) followed by my entire last name (Ionescu).

Bug 3:
1) Handle = CreateWindowStation(…)
2) Loop all handles in the process with the NtQuerySystemInformation API, or, just blindly loop every handle from 4 to 16 million and mark it as protected, or, use Process Explorer to see what the current Window Station handle is, and hard-code that for this run. The point is to find the handle to the current window station. Usually this will be a low number, typically under 0x50.
3) SetHandleInformation(CurrentWinstaHandle, ProtectFromClose)
4) SwitchProcessWindowStation(Handle);

In part 1 of the series, we covered some of the changes behind Vista’s new Session 0 Isolation and showcased the UI Detection Service. Now, we’ll look at the internals behind this compatibility mechanism and describe its behavior.

First of all, let’s take a look at the service itself — although its file name suggests the name “UI 0 Detection Service”, it actually registers itself with the Service Control Manager (SCM) as the “Interactive Services Detection” service. You can use the Services.msc MMC snap-in to take a look at the details behind the service, or use Sc.exe along with the service’s name (UI0Detect) to query its configuration. In both cases, you’ll notice the service is most probably “stopped” on your machine — as we’ve seen in Part 1, the service is actually started on demand.

But if the service isn’t actually running yet seems to catch windows appearing on Session 0 and automatically activate itself, what’s making it able to do this? The secret lies in the little-known Wls0wndh.dll, which was first pointed out by Scott Field from Microsoft as a really bad name for a DLL (due to its similarity to a malicious library name) during his Blackhat 2006 presentation. The version properties for the file mention “Session0 Viewer Window Hook DLL” so it’s probably safe to assume the filename stands for “WinLogon Session 0 Window Hook”. So who loads this DLL, and what is its purpose? This is where Wininit.exe comes into play.

Wininit.exe too is a component of the session 0 isolation done in Windows Vista — because session 0 is now unreachable through the standard logon process, there’s no need to bestow upon it a fully featured Winlogon.exe. On the other hand, Winlogon.exe has, over the ages, continued to accumulate more and more responsibility as a user-mode startup bootstrapper, doing tasks such as setting up the Local Security Authority Process (LSASS), the SCM, the Registry, starting the process responsible for extracting dump files from the page file, and more. These actions still need to be performed on Vista (and even more have been added), and this is now Wininit.exe’s role. By extension, Winlogon.exe now loses much of its non-logon related work, and becomes a more agile and lean logon application, instead of a poorly modularized module for all the user-mode initialization tasks required on Windows.

One of Wininit.exe’s new tasks (that is, tasks which weren’t simply grandfathered from Winlogon) is to register the Session 0 Viewer Window Hook Dll, which it does with an aptly named RegisterSession0ViewerWindowHookDll function, called as part of Wininit’s WinMain entrypoint. If the undocumented DisableS0Viewer value isn’t present in the HKLM\System\CurrentControlSet\Control\WinInit registry key, it attempts to load the aforementioned Wls0wndh.dll, then proceeds to query the address of the Session0ViewerWindowProcHook inside it. If all succeeds, it switches the thread desktop to the default desktop (session 0 has a default desktop and a Winlogon desktop, just like other sessions), and registers the routine as a window hook with the SetWindowsHookEx routine. Wininit.exe then continues with other startup tasks on the system.

I’ll assume readers are somewhat familiar with window hooks, but this DLL providesjust a standard run-of-the-mill systemwide hook routine, whose callback will be notified for each new window on the desktop. We now have a mechanism to understand how it’s possible for the UI Detection Service to automatically start itself up — clearly, the hook’s callback must be responsible! Let’s look for more clues.

Inside Session0ViewerWindowProcHook, the logic is quite simple: a check is made on whether or not this is a WM_SHOWWINDOW window message, which signals the arrival of a new window on the desktop. If it is, the DLL first checks for the $$$UI0Background window name, and bails out if this window already exists. This window name is created by the service once it’s already started up, and the point of this check is to avoid attempting to start the service twice.

The second check that’s done is how much time has passed since the last attempt to start the service — because more than a window could appear on the screen during the time the UI Detection Service is still starting up, the DLL tries to avoid sending multiple startup requests.
If it’s been less than 300 seconds, the DLL will not start the service again.

Finally, if all the checks succeed, a work item is queued using the thread pool APIs with the StartUI0DetectThreadProc callback as the thread start address. This routine has a very simple job: open a handle to the SCM, open a handle to the UI0Detect service, and call StartService to instruct the SCM to start it.

This concludes all the work performed by the DLL — there’s no other code apart from these two routines, and the DLL simply acknowledges window hook callbacks as they come through in cases where it doesn’t need to start the service. Since the service is now started, we’ll turn our attention to the actual module responsible for it — UI0Detect.exe.

Because UI0Detect.exe handles both the user (client) and service part of the process, it operates in two modes. In the first mode, it behaves as a typical Windows service, and registers itself with the SCM. In the second mode, it realizes that is has been started up on a logon session, and enters client mode. Let’s start by looking at the service functionality.

If UI0Detect.exe isn’t started with a command-line, then this means the SCM has started it at the request of the Window Hook DLL. The service will proceed to call StartServiceCtrlDispatcher with the ServiceStart routine specific to it. The service first does some validation to make sure it’s running on the correct WinSta0\Default windowstation and desktop and then notifies the SCM of success or failure.

Once it’s done talking to the SCM, it calls an internal function, SetupMainWindows, which registers the $$$UI0Background class and the Shell0 Window window. Because this is supposed to be the main “shell”
window that the user will be interacting with on the service desktop (other than the third-party or legacy service), this window is also registered as the current Shell and Taskbar window through the SetShellWindow and SetTaskmanWindow APIs. Because of this, when Explorer.exe is actually started up through the trick mentioned in Part 1, you’ll notice certain irregularities — such as the task bar disappearing at times. These are caused because Explorer hasn’t been able to properly register itself as the shell (the same APIs will fail when Explorer calls them). Once the window is created and a handler is setup (BackgroundWndProc), the new Vista “Task” Dialog is created and shown, after which the service enters a typical GetMessage/DispatchMessage window message handling loop.

Let’s now turn our attention to BackgroundWndProc, since this is where the main initialization and behavioral tasks for the service will occur. As soon as the WM_CREATE message comes through, UI0Detect will use the RegisterWindowMessage API with the special SHELLHOOK parameter, so that it can receive certain shell notification messages. It then initializes its list of top level windows, and registers for new session creation notifications from the Terminal Services service. Finally, it calls SetupSharedMemory to initialize a section it will use to communicate with the UI0Detect processes in “client” mode (more on this later), and calls EnumWindows to enumerate all the windows on the session 0 desktop.

Another message that this window handler receives is WM_DESTROY, which, as its name implies, unregisters the session notification, destroys the window lists, and quits.

This procedure also receives the WM_WTSSESSION_CHANGE messages that it registered for session notifications. The service listens either for remote or local session creates, or for disconnections. When a new session has been created, it requests the resolution of the new virtual screen, so that it knows where and how to display its own window. This functionality exists to detect when a switch to session 0 has actually been made.

Apart from these three messages, the main “worker” message for the window handler is WM_TIMER. All the events we’ve seen until now call SetTimer with a different parameter in order to generate some sort of notification. These notifications are then parsed by the window procedure, either to handle returning back to the user’s session, to measure whether or not there’s been any input in session 0 lately, as well as to handle dynamic window create and destroy operations.

Let’s see what happens during these two operations. The first time that window creation is being acted upon is during the afore-mentionned EnumWindows call, which generates the initial list of windows. The function only looks for windows without a parent, meaning top-level windows, and calls OnTopLevelWindowCreation to analyze the window. This analysis consists of querying the owner of the window, getting its PID, and then querying module information about this process. The version information is also extracted, in order to get the company name. Finally, the window is added to a list of tracked windows, and a global count variable is incremented.

This module and version information goes into the shared memory section we briefly described above, so let’s look into it deeper. It’s created by SetupSharedMemory, and actually generates two handles. The first handle is created with SECTION_MAP_WRITE access, and is saved internally for write access. The handle is then duplicated with SECTION_MAP_READ access, and this handle will be used by the client.

What’s in this structure? The type name is UI0_SHARED_INFO, and it contains several informational fields: some flags, the number of windows detected on the session 0 desktop, and, more importantly, the module, window, and version information for each of those windows. The function we just described earlier (OnTopLevelWindowCreation) uses this structure to save all the information it detects, and the OnTopLevelWindowDestroy does the opposite.

Now that we understand how the service works, what’s next? Well, one of the timer-related messages that the window procedure is responsible for will check whether or not the count of top level windows is greater than 0. If it is, then a call to CreateClients is made, along with the handle to the read-only memory mapped section. This routine will call the WinStationGetSessionIds API and enumerate through each ID, calling the CreateClient routine. Logically, this is so that each active user session can get the notification.

Finally, what happens in CreateClient is pretty straightforward: the WTSQueryUserToken and ImpersonateLoggedOnUser APIs are used to get an impersonation logon token corresponding to the user on the session ID given, and a command line for UI0Detect is built, which contains the handle of the read-only memory mapped section. Ultimately, CreateProcessAsUser is called, resulting in the UI0Detect process being created on that Session ID, running with the user’s credentials.

What happens next will depend on user interaction with the client, as the service will continue to behave according to the rules we’ve already described, detecting new windows as they’re being created, and detection destroyed windows as well, and waiting for a notification that someone has logged into session 0.

We’ll follow up on the client-mode behavior on Part 3, where we’ll also look at the UI0_SHARED_INFO structure and an access validation flaw which will allow us to spoof the client dialog.

One of the many exciting changes in Windows Vista’s service security hardening mechanisms (which have been aptly explained and documented in multipleblogs and whitepapers , so I’ll refrain from rehashing old material) is Session 0 Isolation. I’ve thought it would be useful to talk about this change and describe the behaviour and implementation of the UI Detection Service (UI0Detect), an important part of the infrastructure in terms of providing compatible behaviour with earlier versions of Windows.

As a brief refresher or introduction to Session 0 Isolation, let’s remember how services used to work on previous versions of Windows: you could run them under various accounts (the most common being System, Local Service and Network Service), and they ran in the same session the console user, which was logged-on to session 0 as well. Services were not supposed to display GUIs, but, if they really had to, they could be marked as interactive, meaning that they could display windows on the interactive window station for session 0.

Windows implemented this by allowing such services to connect to the Winsta0 Windowstation , which is the default interactive Windowstation for the current session — unlike non-interactive services, which belonged to a special “Service-0x0-xxx$” Windowstation, where xxx was a logon session identifer (you can look at the WDK header ntifs.h for a list of the built-in account identifiers (LUIDs)). You can see the existence of these windowstations by enumerating them in the object manager namespace with a tool such as Sysinternals’ WinObj.

Essentially, this meant three things: applications could either do Denial of Service attacks against named objects that the service would expect to own and create, they could feed malicious data to objects such as sections which were incorrectly secured or trusted by the service, and , for interactive services, they could also attempt shatter attacks — sending window messages with executable payloads in their buffer, exploting service bugs and causing the payload code to execute at higher privileges.

Session 0 Isolation puts an end to all that, by first having a separate session for the console user (any user session starts at 1, thus protecting named objects), and second, by disabling support for interactive services — that is, even though services may still display a UI, it won’t appear on any user’s desktop (instead, it will appear on the default desktop of the session 0 interactive windowstation).

That’s all fine and dandy for protecting the objects, but what if the service hasn’t been recompiled not to directly show a UI (but to instead use a secondary process started with CreateProcessAsUser, or to use the WTSSendMessage API) and depends on user input before continuing? Having a dialog box on the session 0 desktop without the user’s awareness would potentially have significant application compatibility issues — this is where the UI Detection Service comes into play.

If you’re like most Vista users, you’ve actually probably never seen the default desktop on session 0’s interactive windowstation in your life (or in simpler words, never “logged-on” or “switched to” session 0)! Since you can’t log on to it, and since interactive services which displayed UIs are thankfully rare, it remains a hidden mystery of Windows Vista, unless you’ve encountered such a service. So let’s follow Alice down the rabbit hole into session 0 wonderland, with some simple Service Controller (Sc.exe) commands to create our very own interactive service.

Using an elevated command prompt, create a service called RabbitHole with the following command:

Be careful to get the right spaces — there’s a space after each equal sign! You should expect to get a warning from Sc.exe, notifying you of changes in Windows Vista and later (the ones I’ve just described).

Now let’s start the service, and see what happens:

sc start RabbitHole

If all went well, Sc.exe should appear as if it’s waiting on the command to complete, and a new window should appear on your taskbar (it does not appear in the foreground). That window is a notification from the UI Detection Service, the main protagonist of this story.

Get ready to click on “Show me the Message” as soon as you can! Starting an essentialy fake service through Sc.exe will eventually annoy the Service Control Manager (SCM), causing it to kill notepad behind your back (don’t worry if this happens, just use the scstartRabbitHole command again).

You should now be in Session 0 (and probably unable to read the continuation of this blog, in which case the author hopes you’ve been able to find your way back!) As you can notice, Session 0 is a rather deserted place, due to the lack of any sort of shell or even the Theme service, creating a Windows 2000-like look that may bring back tears of joy (or agony) to the more nostalgic of users.

On the other hand, this desolate session it does contain our Notepad, which you should’ve seen disappear if you stayed long enough — that would be the SCM reaching its timeout of how long it’s willing to wait around hoping for Notepad to send a “service start” message back (which it never will).

Note that you can’t start any program you want on Session 0 — Cmd.exe and Explorer.exe are examples of programs that for one reason or another won’t accept to be loaded this way. However, if you’re quick enough, you can use an old trick common to getting around early 90ies “sandbox” security applications found in many libraries and elementary schools — use the common dialog control (from File, Open) to browse executable files (switch the file type to *.*, All Files), go to the System32 folder, right-click on Explorer.exe, and select Open. Notepad will now spawn the shell, and even if the SCM kills Notepad, it will remain active — feel free to browse around (try to be careful not to browse around in IE too much, you are running with System privileges!)

That’s it for this introduction to this series. In part 2, we’ll look at what makes this service tick, and in part 3, we’ll look at a technique for spoofing the dialog to lie to the user about which service is actually requesting input. For now, let’s delete the RabbitHole, unless you want to keep it around for impressing your colleagues:

My apologies for the long delay until this fourth part was published. I have been teaching in Seattle for the previous two weeks, and have just started to settle in Cupertino for my Apple internship, and I had very few spare moments in my hands.

In Part 3, we discussed how generic shims modify key parts of the system, usually through API hooking or undocumented flags, in order to provide compatibility with a variety of applications. We looked at shims such as the Windows 9x Heap Manager implementation in NT,Â and several re-direction and reflection APIs, as well as even some security bypassing shims. Today, we’ll take a look at how certain applications have specific shims implemented specifically just for them. We can find these with CDD easily, by noticing that the Shim name is usually a program name, as well as looking in the DLL which implements it. Finally, specific shims never have any descriptive text describing them. While looking through the Shim dump, I’ve chosen this one (arbitrarly):

Dumping Entry:

SHIMNAME="CorelSiteBuilder"
DLLFILE="AcSpecfc.DLL"

Any continued analysis on this shim must be done through reverse engineering, since we have no hint as to what this shim is attempting to do. By using IDA on the DLL specified, one can notice it is a series of C++ classes, each which represent a specific shim (there are of course other classes such as CString and the generic Shim Engine initialization classes). The prefix for the specific shims seems to be “NS”, so it was easy to locate our target of interest: NS_CorelSiteBuilder. Every shim class also has an initialization function that gets called, and is responsible for initializing the class and its hooks. This is usually called IniitalizeHooksMulti. In the disassembly of this function, pay special attention to loc_714F3691. This is where this class initializes the API hooks that make up this specific shim (other specific shims can also have other types of hooks, such as patches or COM hooks). The tagHOOKAPI structure contains the information required to patch an API, and one can clearly see that SetWindowTextA inside user32.dll is being hooked, and re-directed to NS_CorelSiteBuilder::APIHook_SetWindowTextA.

Now the actual hook can be looked at, and I’ve provided an analyzed and commented disassembly here. This is a pretty simple hook, and seems to check on whether the window handlw and window text that are beingÂ sent as argumentsÂ match the previous window handle and window text that the shim had saved durinng the last call. If they do match, it will simply return TRUE (success) without actually calling the original API, otherwise, the hook will save the window text that’s being set as the “old” window text (so that when the hook is called again, it will compare against this name now), and then perform a call to the original API (in tagAPIHOOK+0xC) with the unmodified arguments.

In other words, the whole point of this shim is to “absorb” SetWindowTextA calls to the Corel Site Builder window if the new text that’s being set matches the previous text, and simply return success. The reason on why such a shim would be necessary is left as an excercise to the reader.

In the next article, I will release the first version of the CDD utility which I’ve used when showing some of the Shims available, and document some of its uses.

Continuing over from where we left last time, today’s entry will look at how the loader interacts with the AppCompat/Shim Engine Interfaces to determine that a module requires shimming or not. Unfortunately, it seems like this process underwent several revisions inside Microsoft’s codebase, so it may be difficult to experiment on your own based on this information. I will however, present all the known implementations to me in a generic fashion, without going too much under the hood in terms of actual assembly code.

Like many Win32-specific features, the Shim Engine actually gets initialized by the parent process through kernel32.dll, and not by the actual PE Loader/Startup routines inside the NT System DLL (ntdll.dll), although it also plays an important role in the process. As CreateProcessInternalW executes, it eventually calls BasepCheckBadApp (which is actually an exported API). The first thing that immediately happens is a check on whether or not the Shim Engine is disabled, followed by a lookup inside the Application Compatibility Shim Cache (done through BaseCheckAppcompatCache).

This cache is implemented in 2 different ways depending on the OS. On pre-Windows 2003, kernel32 maintains a shared section which other instances can use for caching the information, and a lock/unlock is done each time the cache is accessed. On post 5.2 kernels, there is a new Native API, NtApphelpCacheControl which supports the following classes:

In both cases, if the cache lookup doesn’t find anything, a “long” lookup is performed. This is where the architectural differences are the largest. In Windows XP SP0, this is done by using CSRSS, and calling BaseSrvCheckApplicationCompatibility in basesrv.dll. In SP2, apphelp.dll is imported and ApphelpCheckExe is called directly. In Windows Server 2003, a connection to the LPC Port AELPort is made, and a lookup LPC message is sent. Finally, in Vista, we’re back to SP2’s method, albeit with a newer API, ApphelpCheckExeEx.

The end result, however, is that the Peb’s pShimData member is now filled out with Apphelp Information (we’ll see what happens with this later) if this is a “bad” application indeed, meaning that it needs to be shimmed. How are these checks actually made, however? Recall that one of the “constructs” or entries that an SDB can have is the Matching File entry. The checks first discover the “Executable” entry for the filename given, and if one is found, all Matching File entries are parsed. This can include the name of the application and its helper files, the publisher, vendor, company, version, file size, timestamp and even linker version and other obscure data. Several boolean operations are available which can be built on top of inclusion and exclusion rules. If the Matching File entries check out, then the pShimData is filled with an opaque Apphelp structure.

The next important part of the Shim Engine’s mingling with our application happens in LdrpInitializeProcess, which is part of the PE Loader inside ntdll.dll. Here, a check is made if pShimData is non-NULL. If this is the case, then this pointer is saved then cleared, and the Shim Engine DLL is loaded with a call to LdrpLoadShimEngine. A variety of callbacks are then setup through LdrpGetShimEngineInterface, which mostly consist of pre and post initialization, and DLL load/unload notifications.

A bit later during initialization, the pre-initialization hook is called, if the Shim Engine was previously loaded, and the old pShimData pointer is passed along to the Shim Engine, so that it may begin initialization. The routine is SE_InstallBeforeInit inside shimeng.dll, and most of the work is done by SeiGetShimData and SeiInit. The former unpacks the information from the PEB Shim Data pointer that it received. It also has a check to disable shimming for ntsd and windbg, as well as slsvc, on Vista (since this is a semi-protected process related to licensing). As for the latter, it will process all the compatibility layers, shims, flags and finally patches which are defined in the Apphelp entry for the executable.

Shims will usually consist of either internal flags that are saved inside shimeng.dll or inside the PEB (seeÂ +0x1d8 AppCompatFlags and +0x1e0 AppCompatFlagsUser), or by the IAT of the shimmed process to be hooked and redirected into one of the Ac***.dll files which contain an alternate, or hacked implementation. These contain two main exports, GetHookAPIs and NotifyShims which allow the Shim Engine to know which APIs should be hooked and to send notifications during loader events. The Shim Engine is smart and will also hook GetProcAddress to make sure that APIs are properly caught. Patches are done through a method that will be looked into more detail later.

During the next entry, we’ll take a look at an actual shimmed application in action, and future parts will cover patches/flags in more detail. It is my hope that this part was useful into giving some insight on how the hooking is performed. Many vendors of application/DLL hooking software risk of running into the Shim Engine during their testing and development process, so having a good handle on how and when everything happens is certainly helpful.