Tuesday, February 17, 2015

WinDbg is an awesome debugger, but I always missed the nice, compact and tidy view of the process memory layout that you have in OllyDbg (in View->Memory). Obviously WinDbg is capable of showing information about the virtual memory of a process (e.g. with !vad) or of the kernel (e.g. with !address), but I don't really like the output format of its commands. I wanted a fast-to-read output, thus I decided to experiment with WinDbg's interfaces to write my own extension capable of printing a convenient map of all the kernelmode virtual memory.

I chose to develop a DbgEng-style extension (see this documentation for more information about the extension styles, and how to write an extension in general) that basically provides one main command that does the job. I wrote it for 32bit Windows machines, but I am planning to extend it to 64bit platforms as well. I tested it on Windows XP 32bit with PAE enabled, and in theory it should work on other 32bit Windows versions (with or without PAE), but I have not had time to run further tests yet.

The strategy of the command is simple: it iterates over all the possible virtual addresses of 4k pages in the kernel space (that is, from 0x80000000 to 0xFFFFFFFF, for now I ignore the /3GB configuration option), it retrieves their corresponding PTEs and prints the attributes that they contain. Adjacent pages that have the same attributes are joined together and printed as a range. The output also includes some relevant symbols, e.g. it locates important regions identified by kernel variables like MmNonPagedPoolStart, MmNonPagedPoolEnd0 etc., and it associates the names of loaded drivers to the regions of memory to which they are mapped.
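The joining of adjacent pages described above can be sketched in portable C. This is only an illustration, not the extension's code: `range_t`, `coalesce_pages`, the `attr_fn` callback (standing in for "fetch the PTE of this address and extract its attributes") and the sample `demo_attrs` source are all hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u

/* One output row: a run of adjacent pages sharing the same attributes. */
typedef struct {
    uint32_t va;     /* first virtual address of the run */
    uint32_t size;   /* length of the run in bytes */
    uint32_t attrs;  /* attribute bits shared by every page of the run */
} range_t;

/* Hypothetical callback: returns the attribute bits of the page at va. */
typedef uint32_t (*attr_fn)(uint32_t va);

/* Walks `pages` consecutive pages starting at `start` and merges
   neighbours with identical attributes. Writes at most max_out
   ranges to `out` and returns how many were written. */
size_t coalesce_pages(uint32_t start, size_t pages, attr_fn get_attrs,
                      range_t *out, size_t max_out)
{
    size_t n = 0;
    for (size_t i = 0; i < pages; i++) {
        uint32_t va = start + (uint32_t)(i * PAGE_SIZE);
        uint32_t attrs = get_attrs(va);
        if (n > 0 && out[n - 1].attrs == attrs) {
            out[n - 1].size += PAGE_SIZE;      /* extend the current run */
        } else {
            if (n == max_out)
                break;                         /* output buffer is full */
            out[n].va = va;
            out[n].size = PAGE_SIZE;
            out[n].attrs = attrs;
            n++;
        }
    }
    return n;
}

/* Sample attribute source, for illustration only: the first two pages
   report attribute value 1, every later page reports 3. */
uint32_t demo_attrs(uint32_t va)
{
    return va < 0x80002000u ? 1u : 3u;
}
```

Calling `coalesce_pages(0x80000000u, 5, demo_attrs, out, 4)` with this sample source yields two ranges, one of two pages and one of three.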

The VA and Size fields identify the memory range, Attributes shows the properties of the pages it contains, and the rightmost part of the output lists the symbols contained in that range. A VA with an invalid size (identified by "--------") means that the VA is not allocated, but there is a symbol associated with it nonetheless.

It is clear from the output that VA 80400000 is the beginning of a buffer, composed of two large pages (2 MB each), that contains the modules nt and hal. The NonPagedPool is also visible at VA 81000000 (11 large pages).

If we have a look at VA f888a000, we can see that this region of memory contains the module Cdfs.sys. Interestingly, the second page at VA f888b000 is read only (probably related to the .text section), while VA f888d000 is the start of a set of pages that are not present and that are marked as transition PTEs (probably related to the .INIT or .PAGE section).

I took some macros from one of the source code templates that is available in the WinDbg SDK. I had problems when passing 64bit integers as function parameters (the extension was compiled for 32bit), therefore I used a quick and ugly macro (PARAM64) to solve the problem.

The core of the functionality is inside the print_layout function/command in exts.cpp:

This is the main loop that iterates over every possible page, performing a virtual-to-physical translation by using GetVirtualTranslationPhysicalOffsets. This function is very interesting because it returns all the entries from all the steps used to perform the translation: the physical address of the PDPT, PDE, PTE and of the page itself (the translations steps change according to the features supported by the CPU). Then, the code uses ReadPhysical to read the data contained in the PTE and extracts all the attributes from it. The rest of the function simply recognizes ranges of pages that share the same attributes and, for every one being identified, PrintRange is invoked.
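As an illustration of the attribute extraction step, here is a minimal sketch that decodes the architectural bits of a 64-bit PAE entry once its raw value has been read. The names `pte_info_t` and `decode_pte` are hypothetical. Only the hardware-defined bits of a present entry are covered; when the present bit is clear, the remaining bits are defined by the OS (that is where states such as transition are encoded) and are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the architectural bits of a 64-bit PAE PDE/PTE. */
typedef struct {
    bool present;       /* bit 0  */
    bool writable;      /* bit 1  */
    bool user;          /* bit 2: accessible from usermode */
    bool large;         /* bit 7: large page, meaningful in a PDE only */
    bool no_execute;    /* bit 63: NX */
    uint64_t phys_base; /* bits 12..51: physical base of the frame */
} pte_info_t;

pte_info_t decode_pte(uint64_t pte)
{
    pte_info_t info;
    info.present    = (pte >> 0) & 1;
    info.writable   = (pte >> 1) & 1;
    info.user       = (pte >> 2) & 1;
    info.large      = (pte >> 7) & 1;
    info.no_execute = (pte >> 63) & 1;
    info.phys_base  = pte & 0x000FFFFFFFFFF000ULL;
    return info;
}
```

For example, the raw value 0x8000000012345063 decodes as a present, writable, kernel-only, non-executable page whose frame starts at physical address 0x12345000.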

As the name suggests, PrintRange is in charge of displaying the gathered information for every range, and takes as its arguments a range's virtual address, size and attributes. In addition, it is also responsible for determining if one of the supported symbols (that are stored in the array MemSymbols) is contained within the range and if a driver module is associated to the range by using GetNearNameByOffset. In case it does, it prints them too. Of course, these two last capabilities only work if the debug symbols are loaded in WinDbg.

Note that the array of the supported symbols contains the names of internal kernel variables that identify interesting areas of memory (e.g. the paged pool, the non paged pool, etc.), and two empty entries that will be filled at run-time with the virtual address of the symbol and the data it points to. This functionality is implemented in the print_symbol function/command: you should call it before print_layout in order to display the supported symbols.

The source files dbgext.cpp and dbgexts.h contain macros and initialization code that are required by the extension, while dbgexts.def contains the definitions of the exported functions (that will become the actual commands to be invoked from WinDbg commandline).

To compile the extension, you need to add WinDbg's include and library paths, normally located in:

Also make sure you have Windows debug symbols installed and loaded. At this point, using !print_symbol initializes the supported symbols, and !print_layout produces the final output.

This is a POC, and it was a good exercise to practice with WinDbg extensions. I am planning to rework the source code to make it compatible with all versions of Windows, on both 32 and 64 bit. At the moment it is not very fast: it takes a few minutes to print the whole layout. I think it is possible to speed up the processing by avoiding the brute force loop on every page and handling the pages in a smarter way, based on the contents of the PDEs and PTEs (basically, if a PDE is invalid I can exclude a lot of memory addresses from the loop).
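The speed-up idea can be sketched as follows, assuming PAE paging where one PDE covers 2 MB (512 pages of 4 KB). This is not the extension's code: `count_pte_lookups`, the `pde_valid_fn` callback and the sample `demo_pde_valid` are hypothetical, and large-page PDEs (which need no PTE lookup at all) are not modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000ull
#define PDE_SPAN  0x200000ull   /* memory covered by one PDE under PAE */

/* Hypothetical callback: non-zero when the PDE covering va is valid. */
typedef int (*pde_valid_fn)(uint64_t va);

/* Counts how many per-page PTE lookups are needed for [start, end)
   when every region behind an invalid PDE is skipped in one step. */
size_t count_pte_lookups(uint64_t start, uint64_t end, pde_valid_fn pde_valid)
{
    size_t lookups = 0;
    uint64_t va = start;
    while (va < end) {
        if (!pde_valid(va)) {
            /* jump to the first address covered by the next PDE */
            va = (va & ~(PDE_SPAN - 1)) + PDE_SPAN;
            continue;
        }
        lookups++;          /* here the real loop would read the PTE */
        va += PAGE_SIZE;
    }
    return lookups;
}

/* Sample layout, for illustration only: a single valid PDE. */
int demo_pde_valid(uint64_t va)
{
    return va >= 0x80400000ull && va < 0x80600000ull;
}
```

With the sample layout, scanning 8 MB from 0x80000000 costs 512 lookups instead of 2048, because three of the four PDEs are skipped outright.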

Monday, February 9, 2015

Here is the second part of the solutions to the "Windows Kernel" exercises from the "Practical Reverse Engineering" book. Specifically, this post is about the first eight that you will find in the "Investigating and Extending your Knowledge" section.
It should be noted that the code proposed in my solutions is intended as working POCs and that the methodologies can be generalized/improved so that they would work independently of the Windows version etc. Finally, the ideas I used to solve the exercises are based on known mechanisms (e.g. the KeUserModeCallback method).

1)

NX is a bit set in the page tables that specifies whether a memory page can run executable code or not. If the CPU tries to execute code from a page that is not marked as executable, an exception is raised. Windows (and other OSes too) leverages this bit in order to mark heap and stack data areas as not executable. In this way, should a buffer overflow happen, an attacker will not be able to exploit it in order to jump to a shellcode on the heap or stack. This bit is supported on x64 architecture, and on x86 with PAE enabled.

Prior to the introduction of this bit, there were some software implementations that tried to provide non-executable data by using hardware segmentation (e.g. W^X and ExecShield). The x86 hardware, in fact, provides segmentation in order to define code and data segments, each with its own properties (read, write or execute). Normally, Windows (32bit) creates usermode code and data segments (CS and DS) that are as big as the whole 32bit addressable range: this means that according to the code segment properties, every possible 32bit address is executable (the division between usermode and kernelmode is done via the page tables). This leaves the opportunity for an exploit to write shellcodes in data areas and execute them. To sort out this problem without the NX bit, it is possible to make a code segment smaller, in order to leave out a range of addresses that are not part of it. Then a data segment can be created using this range of memory that is not part of the code segment. At this point, the code segment can be marked as executable, and the data segment can be marked as read/write only, ensuring that if the execution ends up in the range of addresses reserved for the data, an exception is raised.
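The effect of shrinking the code segment can be modelled with a simplified limit check. This is only a sketch: `segment_t` and `fetch_allowed` are hypothetical names, the segment base is ignored, and the limit values used in the usage note are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of a code segment descriptor. */
typedef struct {
    uint32_t limit;      /* highest valid offset inside the segment */
    bool     executable; /* type bit marking a code segment */
} segment_t;

/* An instruction fetch is allowed only when the segment is executable
   and the offset does not exceed the limit; otherwise the CPU would
   raise an exception. */
bool fetch_allowed(const segment_t *cs, uint32_t offset)
{
    return cs->executable && offset <= cs->limit;
}
```

With a flat segment (limit 0xFFFFFFFF) every address passes the check, while a segment with limit 0x5FFFFFFF rejects a fetch from 0x7FFD0000, which is the behaviour the segmentation-based schemes rely on.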

Another potential way to emulate the NX bit would be to modify the page tables for the heap and stack in order to make them invalid: every access to a page would trigger a page fault, that would be trapped by the page fault handler. The OS would have to check the kind of fault, and determine if it is a memory read, write or execute. If it is execute, then there is something wrong and the process will be terminated. In theory, this would work, but in practice it would add a very big overhead on the run time (every memory access would cause an exception!), thus it may not be feasible (the PaX Linux kernel patch uses a similar approach).

2)

The APIs that provide the functionality to manage APCs are KeInitializeApc and KeInsertQueueApc. Since they are not declared in the DDK headers it is necessary to assign their addresses to appropriate function pointers via MmGetSystemRoutineAddress in order to use them.

KeInitializeApc simply initializes a KAPC structure by storing into it all the necessary information about the APC that is going to be queued for execution, including the KTHREAD to which the APC must be queued to and the addresses of the callbacks to run.

KeInsertQueueApc, instead, does the actual work of scheduling the APC for execution in the given KAPC.Thread (of type KTHREAD). To do so, it begins by acquiring the spinlock stored in KTHREAD.ApcQueueLock, necessary for proper synchronization. Then, if KTHREAD.ApcQueueable is set to 1, the API invokes the internal function KiInsertQueueApc, which in turn verifies that KAPC.Inserted is set to 0 and, if it is, adds the APC to some memory referenced by the KTHREAD.ApcStatePointer array. In particular, this array contains two pointers to KAPC_STATE structures, where the APC queues (implemented by using LIST_ENTRYs) are actually stored. Why two? The first KAPC_STATE structure is related to the APCs whose KAPC.ApcStateIndex is OriginalApcEnvironment, while the second is related to the ones whose KAPC.ApcStateIndex is AttachedApcEnvironment. Basically, the value of KAPC.ApcStateIndex differentiates between the APCs that are running in the context of the process to which the thread belongs and the ones that are running in a thread that is attached to a different process. This is why two structures are kept. Once the correct one is determined, a further discrimination is to be made. Each structure contains an array of two LIST_ENTRY structures (named KAPC_STATE.ApcListHead), that are selected according to the value stored in KAPC.ApcMode, which is either 0 (KernelMode) or 1 (UserMode). These are the actual APC queues.

Once the APC is queued, the member KAPC.Inserted is set to 1, and then, if the APC is kernelmode, KTHREAD.KAPC_STATE.KernelApcPending is also set to 1. Furthermore, HalRequestSoftwareInterrupt may be invoked to switch to APC_LEVEL.
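The two-level queue selection can be modelled in user mode with simplified stand-ins for the kernel structures. Everything below is a sketch: the lowercase types only mirror the fields named in the text, `insert_apc` is a hypothetical helper, and the insertion always appends at the tail, whereas the real KiInsertQueueApc applies further ordering rules and runs under the queue lock.

```c
#include <stddef.h>

enum { OriginalApcEnvironment = 0, AttachedApcEnvironment = 1 };
enum { KernelMode = 0, UserMode = 1 };

typedef struct list_entry {
    struct list_entry *flink, *blink;
} list_entry;

/* Simplified stand-ins for KAPC_STATE, KTHREAD and KAPC. */
typedef struct { list_entry ApcListHead[2]; int KernelApcPending; } apc_state;
typedef struct { apc_state *ApcStatePointer[2]; } thread;
typedef struct {
    list_entry ApcListEntry;
    int ApcStateIndex;   /* selects one of the two apc_state structures */
    int ApcMode;         /* selects one of the two lists inside it */
    int Inserted;
} apc;

void init_head(list_entry *head) { head->flink = head->blink = head; }

/* Returns 1 when the APC was queued, 0 when it was already inserted. */
int insert_apc(thread *t, apc *a)
{
    if (a->Inserted)
        return 0;
    apc_state *state = t->ApcStatePointer[a->ApcStateIndex];
    list_entry *head = &state->ApcListHead[a->ApcMode];

    /* append at the tail of the selected circular list */
    a->ApcListEntry.flink = head;
    a->ApcListEntry.blink = head->blink;
    head->blink->flink = &a->ApcListEntry;
    head->blink = &a->ApcListEntry;

    a->Inserted = 1;
    if (a->ApcMode == KernelMode)
        state->KernelApcPending = 1;
    return 1;
}
```

A kernelmode APC with ApcStateIndex set to AttachedApcEnvironment therefore lands in the first list of the second structure and leaves the queues of the original environment untouched.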

The queues of APCs will eventually be walked by the KiDeliverApc API, which will call the various kernel, normal and rundown routines for each APC.

APCs offer the possibility to execute code inside a specific process' context and there are various possible use cases for them. Windows uses APCs to perform thread suspension, to schedule some completion routines, to set and get a thread's context, and more.

Usermode APCs provide a handy way to execute code in usermode from kernelmode, commonly done by rootkits since it allows the possibility to inject malicious payloads in running processes, hook their APIs etc. Examples are presented in the answer to exercise 3.

3)

Since there is no directly available API to create a process from kernel mode, I decided to leverage APCs to run malicious usermode code in a particular process. I devised three different ways to achieve this goal and, although all of them rely on APCs, their approach changes considerably.

The general strategy involves some preliminary operations to locate the target process, obtain its handle and allocate some memory in its process address space. The malicious code is then copied in this memory area (injection) and an APC is initialized in either one of these ways:

Usermode APC with the normal routine set to the allocated area, that contains the malicious code.

Kernelmode APC with the kernel routine set to hook a user-mode API. In this case, the allocated area contains the assembly code of the hook, that will be executed only once, in the context of the target process.

Kernelmode APC with the kernel routine set to overwrite an empty entry in the kernel-to-usermode callback table with the address of the allocated area, and let KeUserModeCallback call it. The allocated area contains the malicious code.

There are of course many other methods to start a process from kernelmode code. For example, a possible variant of the second method, that doesn't involve APCs, would consist in using PsSetCreateProcessNotifyRoutine in order to inject the malicious code in every process that is created and then hooking a common API to redirect its code towards the malicious code. However, here I chose to focus solely on the three above mentioned ideas.

Method 1

For the first method, I used the APCs in the most natural way: I queued a usermode APC to Explorer that simply runs a "shellcode", which in turn locates and calls the CreateProcessA API to execute Notepad.

First of all, I needed to have the usermode shellcode, thus I wrote the following usermode application:

This code purposely avoids the use of any API or CRT function in order to be relocatable. As a result, after compiling it, I was able to simply copy all the opcodes generated for the "main" function and use them as an executable buffer that gets injected into a running process.

The shellcode behaves similarly to the ones you can find in the exploits: it accesses the PEB to get the PEB_LDR_DATA and its InLoadOrderModuleList field, which is a pointer to a list of LDR_DATA_TABLE_ENTRY structures, each representing a loaded module. The code walks the list to locate kernel32.dll (the DLL name is kept in LDR_DATA_TABLE_ENTRY.FullDllName) and, once found, it retrieves its imagebase via LDR_DATA_TABLE_ENTRY.DllBase. It is then straightforward to parse the PE header of the dll in order to locate its export table, and the address of the CreateProcessA API from it. The shellcode concludes by calling such API to launch Notepad.
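Since the shellcode cannot rely on the CRT, even the module name comparison has to be written by hand. A minimal sketch of such a check, assuming the name is available as an array of UTF-16 code units holding a full path; `lower16` and `path_ends_with` are hypothetical helpers, not the functions used in the original shellcode.

```c
#include <stddef.h>
#include <stdint.h>

/* Lowercases an ASCII letter stored in a UTF-16 code unit. */
static uint16_t lower16(uint16_t c)
{
    return (c >= 'A' && c <= 'Z') ? (uint16_t)(c + 32) : c;
}

/* Returns 1 when the UTF-16 string `path` (len code units, not
   necessarily NUL-terminated) ends with the ASCII string `name`,
   ignoring case. No library function is called. */
int path_ends_with(const uint16_t *path, size_t len, const char *name)
{
    size_t n = 0;
    while (name[n] != '\0')          /* hand-written strlen */
        n++;
    if (n > len)
        return 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t a = lower16(path[len - n + i]);
        uint16_t b = lower16((uint16_t)(unsigned char)name[i]);
        if (a != b)
            return 0;
    }
    return 1;
}
```

Note that the length of a UNICODE_STRING is expressed in bytes, so a caller working on FullDllName would have to divide it by two before passing it as `len`.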

Having the shellcode sorted out, let's see the code for the kernelmode driver (note that the shellcode is encoded in the "buffer[]" array):

The driver begins by walking the ActiveProcessLinks from the EPROCESS structure in order to locate the EPROCESS corresponding to Explorer.exe (the target process). The code then retrieves the ThreadListHead from this EPROCESS, and takes note of the first ETHREAD of the list (it is not really important which one). Having done that, PsGetProcessId and PsGetThreadId are called to retrieve the CID of the target process/thread. The driver proceeds by allocating an executable area of memory inside the process via ZwOpenProcess/ZwAllocateVirtualMemory, where it then copies the shellcode bytes. To perform the copy, the driver needs to switch to the Explorer process context via KeStackAttachProcess/KeUnstackDetachProcess.
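The list walk can be illustrated in user mode with a mock structure. This is a sketch under loud assumptions: `mock_eprocess` only imitates the idea of an EPROCESS (the real layout and field offsets change between Windows builds), matching on an image name field is my assumption about how the target is recognized, and the real list also has a head that is not embedded in any EPROCESS.

```c
#include <stddef.h>
#include <string.h>

typedef struct list_entry {
    struct list_entry *flink, *blink;
} list_entry;

/* Mock of the few fields the walk needs. */
typedef struct {
    int        pid;
    list_entry ActiveProcessLinks;
    char       ImageFileName[16];
} mock_eprocess;

/* Same idea as the CONTAINING_RECORD macro: recover the structure
   from a pointer to one of its embedded fields. */
#define CONTAINER_OF(ptr, type, field) \
    ((type *)((char *)(ptr) - offsetof(type, field)))

/* Walks the circular list starting at `start` and returns the first
   entry whose image name matches, or NULL after a full lap. */
mock_eprocess *find_process(mock_eprocess *start, const char *name)
{
    list_entry *entry = &start->ActiveProcessLinks;
    do {
        mock_eprocess *p = CONTAINER_OF(entry, mock_eprocess, ActiveProcessLinks);
        if (strcmp(p->ImageFileName, name) == 0)
            return p;
        entry = entry->flink;
    } while (entry != &start->ActiveProcessLinks);
    return NULL;
}
```

The key point is that the links point to the embedded LIST_ENTRY of the next structure and not to its beginning, hence the subtraction of the field offset at every step.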

Finally, an APC is initialized by calling KeInitializeApc and passing to it the pointer to the allocated shellcode as the normal routine. This APC is then queued to the target thread belonging to Explorer via KeInsertQueueApc. To be precise, during the initialization, a kernel routine is required by the OS as well, but since we don't really need it, I specified a dummy one that simply frees the memory reserved for the KAPC structure.

At this point, whenever the target thread is scheduled for execution, the APC is going to be run and the usermode shellcode will start a new process. It goes without saying that it is important to choose a thread that is actually in an alertable state: some processes may have threads that are asleep or stuck in a wait, and if an APC is queued to them, it may never have a chance to be executed. In my case I picked the first thread of the Explorer process for convenience: I noticed that this thread awakens when you right click on the icon of a folder on the desktop, thus it is very handy because it allowed me to trigger the APC manually whenever I wanted.

Method 2

As an alternative, I decided to hijack the execution flow of a process towards my shellcode harnessing kernelmode APCs. The idea is to patch an API that gets called quite often: the patch installs a jump to the shellcode in the entry point of the API, which, in turn, executes Notepad and calls the original API.

The code of the DriverEntry is almost the same as the one from Method 1, the only difference is that this time the scheduled APC is kernelmode and not usermode. The different lines of code are the following two:

The first one specifies that this is a kernelmode APC, while the second one passes two parameters to the kernel routine. These parameters are the pointer to the usermode shellcode and the pointer to the EPROCESS related to Explorer.

The kernelmode APC is still targeting Explorer.exe like before. Similarly to the shellcode, it retrieves and walks the list of LDR_DATA_TABLE_ENTRY structures to locate the imagebase of kernel32.dll. Once found, the routine retrieves the address of the CreateProcessW API from the export table, and proceeds by patching it in order to jump to the shellcode. I chose CreateProcessW just because it is easy to trigger it on command (e.g. by running a process from explorer's GUI), but the method applies equally to any other API.

The shellcode has also been slightly modified in that I added the following bytes:

The JMP in the API entry point will actually transfer the execution to the third line of this block of instructions (the one marked with "entry"). This code begins by verifying that it is being run inside the Explorer process. It does so by comparing TEB.ClientId.UniqueProcess against a Pid hardcoded in the CMP instruction (fourth line). The CMP instruction currently has a Pid of zero (notice the four bytes following the 0x3d), but these bytes will be patched by the kernelmode APC routine with the value of the Pid of the Explorer process. After this check, the code verifies that it has not been already run by examining the line containing "myflag dq 0". These eight bytes are a quadword that simply stores 0 initially, and which is updated to 1 after the "lock cmpxchg cs:myflag, rbx" is run for the first time.
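The run-once check performed with "lock cmpxchg" has a direct counterpart in C11 atomics. The following is a sketch of the same idea in portable C rather than a translation of the shellcode; `claim_first_run` is a hypothetical name, while `myflag` mirrors the quadword mentioned above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Plays the role of the "myflag dq 0" quadword. */
static _Atomic uint64_t myflag = 0;

/* Atomically replaces 0 with 1. Only the caller that observes the
   0 -> 1 transition gets 1 back; every later caller gets 0, even if
   several threads race on the first call. */
int claim_first_run(void)
{
    uint64_t expected = 0;
    return atomic_compare_exchange_strong(&myflag, &expected, 1) ? 1 : 0;
}
```

The hooked API may be entered many times, possibly concurrently, so a plain read followed by a write would leave a window in which two threads both see 0; the compare-and-exchange closes it.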

If both checks are satisfied, the code saves some registers on the stack, and calls the original code that I have described in the previous method. When the original shellcode returns, the code restores the registers saved earlier, executes the first two instructions of CreateProcessW and jumps to the third instruction of the original API. Again, the jump in the last line is followed by zeroed bytes, which means it jumps to the next instruction, but, as we will see later, the four bytes will be patched with the correct offset that will lead the execution flow right to the third instruction of CreateProcessW.

I had to save two instructions because when patching the API entry point I am writing a long JMP, which takes 5 bytes. The first instruction is only 4 bytes long, thus the patch ends up overwriting also the first byte of the following instruction. For this reason, the first two instructions must be preserved and executed in order to restore the original execution flow.
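The 5 bytes of the patch are the opcode 0xE9 followed by a 32-bit displacement measured from the end of the JMP itself. A minimal sketch of the encoding, where `encode_jmp_rel32` is a hypothetical helper and the addresses in the usage note are made up:

```c
#include <stdint.h>

#define JMP_REL32_LEN 5

/* Encodes "jmp target" as it would be placed at address `from`.
   Returns 1 on success, or 0 when the target is too far away for a
   32-bit displacement. */
int encode_jmp_rel32(uint8_t out[JMP_REL32_LEN], uint64_t from, uint64_t target)
{
    /* the displacement is relative to the instruction after the JMP */
    int64_t disp = (int64_t)(target - (from + JMP_REL32_LEN));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return 0;

    uint32_t rel = (uint32_t)(int32_t)disp;
    out[0] = 0xE9;                          /* JMP rel32 opcode */
    out[1] = (uint8_t)(rel & 0xFF);         /* little-endian displacement */
    out[2] = (uint8_t)((rel >> 8) & 0xFF);
    out[3] = (uint8_t)((rel >> 16) & 0xFF);
    out[4] = (uint8_t)((rel >> 24) & 0xFF);
    return 1;
}
```

A jump from 0x1000 to 0x2000 encodes as E9 FB 0F 00 00, and a displacement of zero (E9 00 00 00 00) simply falls through to the next instruction, which is why a zeroed JMP in the shellcode is harmless until it gets patched.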

Note that here I hardcoded the first two instructions in the shellcode, because this is a proof-of-concept. To generalize the method it is fundamental to use a mini-disassembler to understand how many instructions are going to be overwritten during the patch (so that they can be saved in the shellcode itself). Also note that if the very first instructions are relative jumps or calls, they cannot be simply copied, but their relative offsets must be recalculated.

As anticipated earlier, this routine hooks the CreateProcessW API by overwriting its first opcodes with a JMP to the shellcode (specifically, to its offset marked with the "Entry" comment) and by patching some of its opcodes with parameters that are available only at run-time. In particular, these parameters are: the address of the third instruction of CreateProcessW and the PID of the target process.

There is still one interesting detail that we haven't discussed yet. In order to perform the hook, the KernelRoutine disables the WriteProtect flag from the CR0 register, which allows the code to write on any present memory page, even if it is marked as read only. However, this has also the side effect of disabling the copy-on-write, and we will see how this is going to be addressed.

Normally, a physical memory page of code from a system DLL is shared among all processes' virtual memory. If a process decides to patch such code (e.g. an API), the OS would detect the write attempt and would allocate a dedicated physical memory page to the patching process so that it would remain localized and would not affect the other processes. However, if the WriteProtect is disabled, the OS will not react to the write attempt and thus will not allocate a dedicated physical page for the patch. This means that the patch is effectively operating on all the running processes, but not all of them have a shellcode to jump to. Therefore, to prevent crashing them, the shellcode needs to verify that the current Pid is indeed the one of Explorer.

Note: in cases in which the kernel routine needs to modify sensitive areas of memory, some extra care is generally required. For example, it may be necessary to: disable the interrupts (possibly on all the CPUs by scheduling a DPC); use atomic operations; use proper synchronization. In my case, the driver was tested on a machine with a single CPU, therefore once the interrupts are disabled with _disable(), it is pretty safe to patch the code and disable the WriteProtect without atomic operations or synchronization.

Method 3

I tried to work on a third method, which proved to be unstable and therefore cannot be used, however I think it deserves some attention. This method tries to harness the kernelmode API KeUserModeCallback in order to run code in usermode.

The OS maintains a table of usermode callback routines, which is located in usermode and is pointed to by PEB.KernelCallbackTable. In particular, these callbacks can be called from kernelmode with the API KeUserModeCallback, which takes as input the index of the desired function within the table. Thus, by inserting a pointer to the shellcode inside this table, I can manage to call it from kernelmode and have it executed in usermode.

The code encompasses some changes. A first difference is at the end of the shellcode:

...
0xC0, 0x48, 0x81, 0xC4, 0x68, 0x01, 0x00, 0x00, 0xcd, 0x2b, 0xc3

which ends with an "int 2b" (0xcd 0x2b) and a "ret" (0xC3). We will see later why.

Another modification occurs in the DriverEntry, when the KAPC structure is initialized:

The code uses again a dummy normal routine: in fact, the "baseaddr + 0x35a" parameter refers to the last byte of the shellcode (the RET). If a normal routine is not provided, the system seems to crash.

Finally, the KernelRoutine is the one that changes significantly and does the actual job of overwriting an entry in the KernelCallbackTable:

I chose to overwrite the table entry at index 0x76 because in my system it was always zero, but it would be preferable to have a more generic approach to find an empty entry. Once the table entry is written with the pointer to the shellcode, the driver lowers the IRQL to PASSIVE_LEVEL (it will be restored later) and issues a call to KeUserModeCallback with 0x76 as index. The routine gets executed (Notepad starts successfully) and when the usermode code has finished its task, it returns back to the kernel by issuing an int 2b. Unfortunately, when the routine ends, it crashes. I made some tests and experiments, trying to figure out if it was a problem related to the stack, but I always ended up with a crash (a usermode one, not BSOD). In the end, I did not proceed in further investigating this issue, but I believe that it should be possible to make this method stable and reliable.
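A more generic search for a free entry could look like the sketch below. It is only an illustration of the idea: `find_empty_slot` is a hypothetical helper, and the number of entries to scan is an assumption that the caller must supply, because the table length is not something this sketch knows how to obtain.

```c
#include <stddef.h>

/* Returns the index of the first NULL entry in table[0..count),
   or -1 when every entry is in use. */
long find_empty_slot(void *const *table, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i] == NULL)
            return (long)i;
    }
    return -1;
}
```

A caller would use the returned index both for the write into the table and for the later KeUserModeCallback call, instead of a hardcoded value such as 0x76.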

4)

To protect a shared memory resource (allocated in nonpaged memory) in an SMP environment I would use a spinlock: the routines responsible for accessing the resource would need to acquire the spinlock in order to read or write the data. Before acquiring a spinlock, the system raises the IRQL to DISPATCH_LEVEL so that other threads cannot preempt the current one on that CPU, then it attempts to obtain the ownership of the spinlock by continuously checking its availability in a loop (that is, spinning). This mechanism ensures that only one thread from one CPU at a time is accessing the shared data and it is quite efficient, assuming that the lock is not being held for a long time.
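The spinning part of the mechanism can be sketched in user mode with C11 atomics. This only shows the acquire/release loop: there is no IRQL in user mode, so the raise to DISPATCH_LEVEL is not modelled, and a driver would use the kernel's own primitives (KeAcquireSpinLock and KeReleaseSpinLock) instead. The `spinlock_t` type and the `spin_*` functions are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag locked; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Spins until the flag is observed clear and set by this caller. */
void spin_acquire(spinlock_t *lock)
{
    while (atomic_flag_test_and_set_explicit(&lock->locked, memory_order_acquire)) {
        /* busy-wait: another CPU currently owns the lock */
    }
}

/* Single attempt: returns true when the lock was free and is now owned. */
bool spin_try_acquire(spinlock_t *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->locked, memory_order_acquire);
}

void spin_release(spinlock_t *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

The test-and-set is atomic across CPUs, which is what guarantees a single owner; the acquire and release orderings make the writes done inside the critical section visible to the next owner.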

5)

The driver installs a load image notify routine via PsSetLoadImageNotifyRoutine. This routine verifies if the name of the loaded image is bda.sys, and if it is, it patches its entry point with assembly instructions equivalent to:

return STATUS_UNSUCCESSFUL;

The load image notify routine is called after the driver is mapped in memory, but before its entry point is executed. Thus, patching the entry point with the above code will cause the driver to report a failure in loading and the OS will unload bda.sys from memory without executing any other code from it.
Finally, when the driver is unloaded, the callback to the load image notify routine is removed via PsRemoveLoadImageNotifyRoutine.
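One possible encoding of that patch is "mov eax, 0C0000001h" followed by "ret", six bytes in total. The sketch below only builds those bytes in a buffer; `write_fail_stub` is a hypothetical helper, and the plain "ret" assumes the x64 calling convention (a 32-bit stdcall entry point would have to pop its two arguments with "ret 8" instead).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATUS_UNSUCCESSFUL 0xC0000001u

/* Writes "mov eax, STATUS_UNSUCCESSFUL ; ret" at `entry`.
   Returns the number of bytes written, or 0 when fewer than six
   bytes are available. */
size_t write_fail_stub(uint8_t *entry, size_t avail)
{
    uint32_t status = STATUS_UNSUCCESSFUL;
    uint8_t stub[6];

    stub[0] = 0xB8;                             /* mov eax, imm32 */
    stub[1] = (uint8_t)(status & 0xFF);         /* little-endian immediate */
    stub[2] = (uint8_t)((status >> 8) & 0xFF);
    stub[3] = (uint8_t)((status >> 16) & 0xFF);
    stub[4] = (uint8_t)((status >> 24) & 0xFF);
    stub[5] = 0xC3;                             /* ret */

    if (avail < sizeof stub)
        return 0;
    memcpy(entry, stub, sizeof stub);
    return sizeof stub;
}
```

In a real notify routine the destination would be the mapped entry point of the image, which is normally not writable, so the write itself needs the same kind of care discussed for the API patch in exercise 3.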

6)

The driver implements a basic keylogger. It attaches its device object to the keyboard device stack and filters the IRPs going to it. In particular, the device object is created via IoCreateDevice, passing FILE_DEVICE_KEYBOARD as the DeviceType and setting its DeviceExtension to target the keyboard device stack via IoAttachDeviceToDeviceStack. The keyboard device is obtained via IoGetDeviceObjectPointer, by specifying \\Device\\KeyboardClass0 as the ObjectName. The flags of the keyboard device are actually used to set the ones of the newly created device object, as explained in the source code.
Moreover, the MajorFunction[IRP_MJ_READ] entry (in the driver object) is set to a simple pass-through function, that receives an IRP, sets a completion routine (via IoSetCompletionRoutine), copies the current stack location to the next device stack location (via IoCopyCurrentIrpStackLocationToNext) and calls its IRP_MJ_READ function (via IoCallDriver).
The completion routine processes the IRP after the keyboard driver has filled it with the information about the received keystroke. The driver simply inspects each KEYBOARD_INPUT_DATA structure from the output buffer (stored in IRP.AssociatedIrp.SystemBuffer) and retrieves the keystroke scan codes. I used standard scan codes to perform a very basic mapping of the keystrokes to the relative characters, however such translation is in general way more complicated than this implementation.
During the unloading of the driver, the device will be first detached from the keyboard one (via IoDetachDevice) and then deleted (via IoDeleteDevice).
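The basic scan code mapping performed in the completion routine can be sketched in portable C. This is an illustration and not the driver's code: `kbd_input` mirrors the layout of KEYBOARD_INPUT_DATA so that the example compiles outside the DDK, `scan_to_char` and `decode_keys` are hypothetical helpers, and the table covers only unshifted letters, digits and space from scan code set 1.

```c
#include <stddef.h>
#include <stdint.h>

#define KEY_BREAK 0x01   /* Flags bit that marks a key release */

/* Same field layout as KEYBOARD_INPUT_DATA. */
typedef struct {
    uint16_t UnitId;
    uint16_t MakeCode;
    uint16_t Flags;
    uint16_t Reserved;
    uint32_t ExtraInformation;
} kbd_input;

/* Maps a set 1 make code to an unshifted character, or 0 if unknown. */
char scan_to_char(uint16_t code)
{
    static const char digits[] = "1234567890";  /* 0x02..0x0B */
    static const char row_q[]  = "qwertyuiop";  /* 0x10..0x19 */
    static const char row_a[]  = "asdfghjkl";   /* 0x1E..0x26 */
    static const char row_z[]  = "zxcvbnm";     /* 0x2C..0x32 */

    if (code >= 0x02 && code <= 0x0B) return digits[code - 0x02];
    if (code >= 0x10 && code <= 0x19) return row_q[code - 0x10];
    if (code >= 0x1E && code <= 0x26) return row_a[code - 0x1E];
    if (code >= 0x2C && code <= 0x32) return row_z[code - 0x2C];
    if (code == 0x39) return ' ';
    return 0;
}

/* Decodes key presses (releases are skipped) into a NUL-terminated
   string of at most cap - 1 characters. Returns the string length. */
size_t decode_keys(const kbd_input *in, size_t count, char *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        if (in[i].Flags & KEY_BREAK)
            continue;
        char c = scan_to_char(in[i].MakeCode);
        if (c != 0 && n + 1 < cap)
            out[n++] = c;
    }
    if (cap > 0)
        out[n] = '\0';
    return n;
}
```

Shift and Caps Lock state, extended keys and keyboard layouts are all ignored here, which is exactly why a complete translation is much more involved than this.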

7)

The driver creates an MDL associated to a virtual address, then probes and locks it and finally maps it to a new virtual address. As an extra, I call the function MmProtectMdlSystemAddress to ensure that the RWX protection is set, but by debugging I have noticed that such protection is already in place after MmMapLockedPagesSpecifyCache (MmBuildMdlForNonPagedPool would have been more appropriate normally, but for the sake of this exercise it can be ignored). After the work is done, the MDL is released by unmapping its pages and deallocating it.

To verify that the protection is successfully changed, I made a simple test. I used the !pte debugger extension to translate the virtual address of the imagebase of monitor.sys:

The log shows that while the former lacks the executable protection, the latter does not.

As suggested by the exercise, I tested the same code using the imagebase address of win32k.sys, that is a session space address, and the system crashed with a BSOD. A quick investigation revealed the problem: the DriverEntry routine is called in the context of the System process, which is not associated to any session. Thus, the session space virtual addresses are not available and cannot be used to build MDLs.

I experimented a bit and found a simple trick to bypass this problem: if the System process is not associated to a session, the code should work if it is run from the context of a process that is associated to a session. This is a simple modification that would make the driver code work:

I used KeStackAttachProcess in order to get in the context of Explorer.exe, which is associated to the currently logged in user, but any other process run inside the same login session would have worked (the PEPROCESS is hardcoded just for this test). Debugging this code, I tested the accessibility of win32k.sys imagebase address via WinDbg before the driver attached to Explorer:

Translating the virtual address to a physical one shows an invalid PTE, and even dumping the bytes from that memory address returns no data. However, as soon as I step beyond KeStackAttachProcess the address becomes available:

kd> !pte fffff960`00060000

If I step further down with the debugger, and go after KeUnstackDetachProcess, the address becomes dead again.

8)

To figure out which function is calling the DriverEntry I have: written a dummy driver; set a breakpoint on its entry point with DbgBreakPoint(); run it under kernel debugging so that I could dump the stack.

driver!DriverEntry+0x3a
nt!IopLoadDriver+0xa07
nt!IopLoadUnloadDriver+0x55
nt!ExpWorkerThread+0x111
nt!PspSystemThreadStartup+0x5a
nt!KxStartSystemThread+0x16

The dump shows the functions that were called right before the DriverEntry. The function directly responsible for calling the entry point is IopLoadDriver, which is in turn called by IopLoadUnloadDriver. This function manages both the loading and unloading of a driver (calling the entry point or the driver unload routine respectively), and it is called by a dedicated system thread, as can be noted by the three functions ExpWorkerThread, PspSystemThreadStartup and KxStartSystemThread.