Abstract

After having published my article about process-wide API spying, I received plenty of encouraging messages - readers have generally accepted my model of hooking function calls. In this article, we will extend our model to kernel- mode spying, and hook the API calls that are made by our target device driver. We will also introduce a brand-new way of communication between the kernel-mode driver and the user-mode application - instead of using system services, we will implement our own mini-version of Asynchronous Procedure Calls. This task is not as complicated as it may seem - in fact, it is just shockingly easy. Windows flat memory model offers us plenty of exciting opportunities - the only thing we need is a sense of adventure (plus a good knowledge of assembly language, of course). All tips and tricks, described in this article, are 100% of my own design - you would not find anything more or less similar to these tricks anywhere.

I took into account the messages, saying that I should have provided the readers with the source code. Therefore, the source code for this article is available for download (please read the installation instructions in the end of the article).

Important: This article is based upon the bold assumption that you have read my previous article about process-wide API spying. If you haven't read it yet, go and do it now. This is an absolute must - otherwise, you will be unable to understand absolutely anything here, and the whole article will seem totally incoherent to you.

Introduction

First of all, let's look at complications, arising from the fact that our spying activity is going to take place in the kernel mode. To begin with, Windows is a protected system, which means user-mode applications have no access to the kernel address space. Therefore, a spying DLL, described in the previous article, cannot work in kernel mode - in order to start spying in kernel mode, we have to write not a DLL but a kernel-mode spying driver.

Our spying driver has to communicate with the user-mode controller application, and do it asynchronously. Any virtual address in the lower 2G of the process address space is meaningless for the kernel code - even in the same process. This means kernel-mode code cannot call user-mode functions, so that our driver cannot send a window message to its controller application the way user-mode spying DLL does, because SendMessage() API function resides in user-mode address space. In fact, even if our driver could call SendMessage(), doing so would be an extremely unwise move on its behalf - SendMessage() does not return until the message gets processed, and kernel-mode code cannot afford to wait patiently for the user-mode one to accomplish its task, unless we are desperate to bring the system down. Therefore, we must find some way of asynchronous communication between the spying driver and its controller application.

Furthermore, if we spy on user-mode modules, we can save the original return address, as well as other relevant information, in the thread local storage. However, our kernel-mode driver cannot call user-mode TlsGetValue() or TlsSetValue(). TLS-related system functions that can be called by kernel-mode code don't seem to exist. If you disassemble TlsAlloc(), TlsGetValue() or TlsSetValue(), you will see that neither of these functions invokes INT 2Eh, i.e. managing TLS does not involve system services. Once every user-mode thread in the system receives its own copy of CPU stack and registers, the FS register is mapped to the beginning of the section where thread-specific data is stored. This is how kernel32.dll implements thread local storage - everything is implemented purely by user-mode code, which cannot be called by our spying driver. Therefore, we have to find some other thread-safe way of saving the original return address, as well as all other relevant information.

In addition to the above, we should not forget that our spying code may run at any IRQL. Some system services may be called at any IRQL, and some are IRQL-dependent. Therefore, we have to take the current IRQL into account when we call certain system services. Furthermore, we must make sure that, if our spying code currently runs at high IRQL, it does not go anywhere close to the paged pool. If our spying code tries to access the paged memory while running at high IRQL, or calls any function that tries to do it, the “blue screen of death”, due to the page fault, is inevitable.

We should also keep in mind that kernel-mode code is interruptible. If an interrupt occurs while our spying code runs, the system will transfer control to the interrupt handler, which may call some API function that we have hooked. In other words, the system may interrupt the execution of our spying code only in order to execute exactly the same spying code. As a result, when our interrupted code continues its execution, it may find global variables and resources in the state, very different from the one they used to be in before the interrupt had occurred. We should always remember this, and disable interrupts whenever we expect global variables and resources to be immutable while our code runs - otherwise, we can get an unpleasant surprise (i.e., the "blue screen", which is the system's standard reaction to any mistake we make in kernel-mode code).

As you can see, in order to work in kernel mode, Prolog() and Epilog() need quite a few adjustments. Although ProxyProlog() and ProxyEpilog() may stay as they are - the only thing they do is saving and restoring CPU registers and flags, which is done the same way in both kernel and user modes. We will start with "theoretical foundations" - first, we will look at how our custom asynchronous message queue and TLS can be implemented, and proceed to the actual task of spying afterwards.

Asynchronous Message Queuing

Look at the code below (I hope that, after having read my previous article, you are able to recognize handcrafted indirect jump instruction):

What does the newly created thread do? Absolutely nothing - in the beginning of chunk1 array, it finds the instruction to jump to the location, address of which is stored 6 bytes above the beginning of chunk1 array. However, the only thing stored there is the address of chunk1 array itself. Therefore, the code is instructed to jump to its current position, where it finds the instruction to jump to its current position, etc., etc., etc... As a result, the thread gets into the processor-level version of infinite loop, and is unable to move anywhere - it spins at its start address.

What is going to happen if, at some point, the program executes the following lines (addr is the variable, containing a pointer to the function which, for the sake of simplicity, takes no arguments):

When the last line of the above code gets executed, the thread will break out of its infinite loop, and jump to the handcrafted code in chunk2 array. Why? Because the thread is instructed to jump to the location, address of which is stored 6 bytes above the beginning of chunk1 array. We wrote the address of chunk1 array there, and, as a result, made the thread spin. If we write some other address 6 bytes above the beginning of chunk1 array, the code flow will jump to the new location.

The code in chunk2 array (starting from byte 10, i.e. the location to which our thread will jump) instructs the thread to push the address of chunk2 array on the stack, and to jump to the target function. Therefore, after the target function returns, the code flow will jump to the beginning of chunk2 array, where it will find the instruction to jump to the location, address of which is stored 6 bytes above the beginning of chunk2 array. Once the only thing stored there is the address of chunk2 array, the thread starts spinning again.

We can repeat this sequence again and again and again with chunk3, chunk4, chunk5, etc. - the thread will break out of its infinite loop, execute the target function, enter an infinite loop, break out when the next chunk of instructions arrives, execute the target function, enter an infinite loop again, and so on and so forth. Therefore, we can schedule the target function for subsequent execution simply by writing data to the array, rather than by means of system-defined APC. We can do it whenever we wish - our thread never terminates. The only thing we need to know is the location where the thread currently spins, or, in case if it currently executes the target function, the location where it will spin after the target function returns - we have to write the address we want it to jump to 6 bytes above this location. In practical terms, it makes sense to allocate all these chunks from the same pool, and to use an offset index as a pointer into this pool. When the pool gets fully packed, we can reuse it- all we have to do is to set an index to zero, and start it over again.

Let's say the thread runs in the user-mode process X. It is understandable that the function we want it to execute must reside in the address space of user-mode process X as well, but what about the code that actually fills the array with the machine instructions and makes the thread break out of its infinite loop? Where should it reside? It can reside anywhere - in the same module, in a different module of the same process, in another user-mode process, or even in kernel- mode driver. Furthermore, this code can be concurrently executed by different threads in different processes. As long as this code has a write access to the address space of the process X, it can post an asynchronous message that schedules the target function for execution. In order to do so, it does not need any system services - what it needs are the addresses of the target function and of the pool (as they are known to the process X), the ability to write data to this pool, plus the current value of offset index. This approach is absolutely generic, and can be used whenever we need to post an asynchronous message to an application, but, for some reason, are unable to use system-defined APC and asynchronous message queuing.

Now, let's look at how we can save thread-specific data without using system-defined TLS.

Storing Thread-Specific Data

Prolog and epilog of any function that uses EBP register as a base pointer to its local variables look like the following:

It is easy to see that, as a very first step, the function must save the value of EBP register on the stack, and restore it before returning the control. This is an absolute must - otherwise, the caller will be unable to keep track of its local variables after the function returns. It is also easy to see that the function saves the value of EBP on the stack just one stack entry above the function's return address, and then copies the current value of ESP into EBP register. Therefore, the current value of EBP register always points to the location that stores the previous value of EBP.

What does it have to do with spying? Look at how our spying code flow affects the original value of EBP, i.e. the value of EBP at the time when ProxyProlog() starts execution (I hope you remember our "spying team" from the previous article):

ProxyProlog() - does not change EBP value before calling Prolog()

Prolog() - Once Prolog() is not a naked routine, the compiler will definitely generate instructions to save the value of EBP above Prolog()'s return address, and to pop this value into EBP register before Prolog() returns.

ProxyProlog() - does not change EBP value after Prolog() returns.

Actual callee - either saves the original value of EBP and restores it before returning control, or leaves EBP register intact. In any case, the original value of EBP is of no importance for the actual callee - it is only the client code that actually uses it.

ProxyEpilog() - does not change EBP value before calling Epilog()

Epilog() - Once Epilog() is not a naked routine, the compiler will definitely generate instructions to save the value of EBP above Epilog()'s return address....

Did you get it? The value, stored above Prolog()'s return address, is always going to be equal to the one, stored above Epilog()'s return address!!! Furthermore, this value in itself is of no importance until the program flow returns to the client code. This is what we can take advantage on.

Prolog() can save the original value of EBP, along with the original return address and all other relevant information, in the Storage structure, and write the pointer to this structure just above its return address. Once the same pointer to the Storage structure is going to be stored above Epilog()'s return address, and the original value of EBP is available from this structure, Epilog() can write this value above its return address, so that it will get popped into EBP register before Epilog() returns. As a result, at the time when the program flow jumps to the client code, the value of EBP is going to be exactly the same it used to be at the time when ProxyProlog() started execution.

Therefore, we can save the pointer to the Storage structure in the location, pointed to by EBP register, rather than in the thread local storage - our spying code is going to stay thread-safe anyway. Such approach is suitable for both kernel-mode and user-mode spying. It is needless to say that, in case of kernel-mode spying, the Storage structure must be allocated from non-paged pool - our spying code may run at high IRQL.

Armed with this theoretical knowledge, we can implement it in practice, and proceed to the actual task of kernel-mode spying. I am sorry - I forgot to tell you that overwriting the addresses of functions, imported by kernel-mode driver, is a bit different from modifying user-mode module's IAT.

Kernel-Mode Spying

First of all, let's look at how IAT is filled with the addresses of imported functions at load time. As a first step, the loader has to locate the module's import directory, i.e. the array of IMAGE_IMPORT_DESCRIPTOR structures, from which it can obtain the pointers to Import Name and Import Address tables. After having obtained the name of the imported function from the Import Name Table, the loader can get its address from the IMAGE_EXPORT_DIRECTORY of the module that exports the given function, and write this address to the Import Address Table, so that the program can call the imported function. It is understandable that the Import Address Table is needed as long as the program runs, but what about the Import Name Table and the array of IMAGE_IMPORT_DESCRIPTOR structures? Are they needed after the loader fills the Import Address Table with the addresses of imported functions? Not really - they are needed only at the load time. What is the point of keeping them in memory after the module gets loaded?

As long as they reside in pageable memory, this is not really a big deal - they will be normally swapped to the disk, and get loaded into RAM only if we want to access them, i.e. they are not going to take up space in RAM anyway. Once user-mode modules can be paged to the disk, the loader just cannot be bothered to discard the Import Name Table and the array of IMAGE_IMPORT_DESCRIPTOR structures from memory after the user module gets loaded - it is not worth an effort. Therefore, the pointer to the array of IMAGE_IMPORT_DESCRIPTOR structures, available from the target user module's IMAGE_OPTIONAL_HEADER, is always valid.

However, the kernel-mode driver, apart from its code that has been explicitly marked as pageable, has to be constantly loaded in RAM, which, compared to virtual memory, is scarce. Under such circumstances, keeping the Import Name Table and the array of IMAGE_IMPORT_DESCRIPTOR structures in memory is just an unreasonable waste of resources. In order to solve this problem, the linker may place the Import Name Table and the array of IMAGE_IMPORT_DESCRIPTOR structures to the driver's INIT section, along with DriverEntry() and DriverReinitialize() routines, i.e. the code that is needed only during driver's initialization. After the driver is loaded, the loader simply discards its .INIT section from memory, because it does not contain any information that may be needed after driver's initialization. As a result, after the driver is loaded, the pointer to the array of IMAGE_IMPORT_DESCRIPTOR structures, available from the target driver's IMAGE_OPTIONAL_HEADER, may point right to the middle of nowhere, and, hence, should not be accessed - in order to figure it out, I had to spend two evenings in "blue screen - reboot loop".

However, the array of IMAGE_IMPORT_DESCRIPTOR structures contains information, crucial for locating the Import Address Table. What are we going to do? We will map the target driver's file into the memory, and obtain all necessary information from the file mapping. This can be done by the user-mode controller application. Look at the code below:

As a first step, we find the address at which our target driver is loaded in memory. We do it by calling ZwQuerySystemInformation() native API function, which can return information about all currently loaded drivers. ZwQuerySystemInformation() returns this information as an array of SYSTEM_MODULE_INFORMATION structures. Base field of SYSTEM_MODULE_INFORMATION structure indicates the address at which the given driver is loaded, and ImageName field may contain either the name of the driver, or the whole path to the driver's file. If ImageName field contains the path, we extract the name of the driver from the path, and compare it to the target driver's name, which has also been extracted from the path that was received by spy() as an argument. We do it until we locate the target driver, and save the Base field of its corresponding SYSTEM_MODULE_INFORMATION structure in local variable.

Then we map the target driver's file into the memory, and locate its import directory, i.e. the array of IMAGE_IMPORT_DESCRIPTOR structures. The FirstThunk field of every IMAGE_IMPORT_DESCRIPTOR structure indicates the offset to the beginning of Import Address Table that corresponds to the given imported module, and the OriginalFirstThunk field indicates the offset to the beginning of the array of IMAGE_THUNK_DATA structures. By counting the number of IMAGE_THUNK_DATA structures that have non-zero value of the Function field, we can find the number of entries in IAT that corresponds to the given imported module. For every imported module, we get the offset to the beginning of its corresponding Import Address Table, and count the number of entries in this table. We write these values into the control array, starting from control array's 16th byte. We also save pointers to the names of imported functions in global functionbuff array.

Then we allocate a page of virtual memory, fill first 10 bytes of it with the machine codes, and create the thread which will process our asynchronous messages - it is going to spin in infinite loop until the message arrives. We also write the current value of our offset index (which is currently zero), and address of the logging function to be invoked asynchronously by the spying driver, into structbuff array. This function is too simple to be listed here - it just takes the return value of the API function and the position of its name in functionbuff array as parameters, formats the null-terminated string in the form ApiFunctionXXX - returned YYY, and writes it to the log file. The only thing worth mentioning is that this logging function definitely has to be declared with __stdcall calling convention, so that it pops its arguments off the stack.

Then we fill the first 16 bytes of the control array with the remaining relevant data. This includes the address at which the target driver is loaded, the number of functions imported by the target driver, the address of the pool, and the address of the array that stores the address of the target function and the offset index. Then we create the log file in the folder where the controller application's .exe file resides. Finally, we send the control array to the spying driver by calling DeviceIoControl() with our user-defined IOCTL_START_SPYING command.

As a first step, we map the addresses of user-mode pool, and of the array which holds the current value of offset index and the address of the target function, into the kernel address space. We save these pointers in userstructptr and userbuffptr global variables - we must make them available to Epilog(), which would need them. We also save their virtual addresses, as they are known to the user-mode code, in userstruct and userbuff global variables - Epilog() would need them when it fills the pool with the machine instructions.

Then we save the remaining data received in the control array, in global variables. We would need this data in order to unhook the functions that we are about to hook now, so we must make sure that it is available to DrvClose(), which is going to do this job. Then we allocate the arrays that are going to hold function replacement chunks and the addresses of actual functions - we already know the number of functions we need to hook, and, hence, the number of bytes to allocate. We save these pointers in replacementchunks and functionarray global variables - we must make them available to Prolog() and DrvClose().

At this point, we can proceed to the actual task of overwriting IAT entries. We don't even have to process PE-related structures - the user-mode controller application has provided us with all information we need. I hope that, after having read my previous article, you are able to understand how we do it, so I don't go into details here - the only difference is that RelocatedFunction structure stores not the actual address of imported function, but the position at which this address can be found in functionarray table.

Now, let's look at modifications that have to be applied to Prolog() and Epilog(). ProxyProlog() and ProxyEpilog() don't need any modifications, so we don't discuss them here.

As a first step, Prolog() locates the first available Storage structure in storagearray table, and sets its isfree field to 100, i.e. marks it as having been occupied. We have to synchronize this operation, so we wait on synchronization event that has been initialized in DriverEntry() (DriverEntry() initializes two synchronization events and fills the retbuff array with the instructions that call ProxyEpilog(), i.e. does pretty much the same things as DllMain() in the previous article). Once we are about to wait until our event is set to the signaled state, i.e. possibly for non-zero interval, we must make sure that we do it if and only if current IRQL is below the DISPATCH_LEVEL. We also have to make sure that this operation does not get interrupted, so we disable interrupts by clearing IF flag. Once we must re-enable them as quickly as possible, the code that locates the first available Storage structure is written in pure assembly.

Then we store all relevant information in the Storage structure. This includes the original return address, the value pointed to by EBP register, and the pointer to RelocatedFunction structure. Finally, we modify the CPU stack the way we did it in the previous article, and write the pointer to the Storage structure to the location, pointed to by EBP register, i.e., one stack entry above Prolog()'s return address.

Epilog() gets the pointer to the Storage structure from the location pointed to by EBP register, obtains the original return address and the original value of EBP from this structure, and modifies the CPU stack - it stores the original value of EBP one stack entry above its return address, and replaces the address to which ProxyEpilog() would otherwise return control, with the original return address. Then sets the isfree field of the Storage structure to 0, i.e. marks it as free. Finally, it informs the controller application that the given API function has returned, and sends it the return value of this function. The way it is being done requires a little bit more attention.

We schedule the thread, which runs in the controller application's process, for asynchronous execution of the target function, by writing the machine codes to the pool the way it was explained in introduction. First of all, we need to get the current value of offset index in order to find the address of the chunk where we are going to write data, to write the address of this chunk (as it is known to the controller application) 6 bytes above its beginning, and to update the value of offset index. We must synchronize this operation, and make sure that it does not get interrupted. Therefore, we wait until the synchronization event is set to the signaled state (certainly, if and only if current IRQL is below the DISPATCH_LEVEL), and then disable interrupts. Once we must re-enable them as quickly as possible, the code that executes the above tasks is, again, written in pure assembly.

After having accomplished the above, we must fill the remaining part of the chunk with handcrafted instructions to push the index of real callee's address in functionarray table (which is available from RelocatedFunction structure, pointer to which was saved in Storage structure), the real callee's return value, and the address of the current chunk (as it is known to the controller application), on the stack, and then to jump to the logging function. At this point, we already don't have to worry about either interrupts or context switches - the concurrent threads are already unable to overwrite our data anyway. Therefore, before proceeding to the above task, we re-enable interrupts and set the synchronization event to the signaled state, so that other threads don't need to wait until we finish filling the chunk with the machine codes. Once the remaining part of the job is not so time-critical, we can afford to do it in C, rather than assembly.

Finally, we schedule the logging function, which resides in controller's application address space, for execution, by writing the address of the current chunk's 10th byte (as it is known to the controller application) 6 bytes above the address of the previous chunk.

As a result, when the logging function starts execution, it will receive the index of real callee's address in functionarray table and its return value as parameters. Once the functionbuff table in controller application's address space stores names of functions imported by target driver, in the same order as functionarray table in kernel address space stores their addresses, the former parameter is sufficient for locating the name of the real callee. Therefore, the logging function formats the null-terminated string in the form ApiFunctionXXX - returned YYY, and writes it to the log file.

As you can see, kernel-mode spying is a not so terribly complex task, although it makes us worry about things we don't even have to know about when spying in user-mode. The funniest thing is that this model is suitable for user-mode spying as well. Therefore, the source code, provided with this article, applies to my previous article as well.

In order to run the sample application, you have to copy the spying driver into the C://WINNT/system32/drivers directory. And to create an on-demand-start service, it can be done either manually, or with the following lines:

You must manually start the spying service by typing net start spyservice line on the command prompt before running the application. Click on Start button in the Spy menu, choose the driver you want to spy on, relax for a while, and then either click on Stop button, or just close the program. After that, you can open the log file in the text editor, and examine its contents.

Warning: The spying driver has been built by Windows 2000 DDK, and tested on Windows 2000. I really don't know what happens if you run it on any other NT platform - it is your task to find it out. If it does not work, you can always rebuild the spying driver for your platform - the source code would not need any modifications for sure.

Furthermore, I would not advise you to hook Ntfs.sys with this sample. I still have to figure out why it happens, but if you try to hook Ntfs.sys, the system starts slowing down, stops responding pretty shortly, and, finally, goes off after a while. This is not an issue when hooking all other drivers. I tried to hook keyboard and mouse class drivers, i8042prt, atapi, disk, CDROM, floppy, videoport and display - it works fine everywhere. I think that all problems with Ntfs.sys arise because, once our spying code synchronizes all calls that are made by different threads in different systems and user processes, the system cannot cope with the resulting slowdown - in case of Ntfs.sys, some of these calls are made by system threads of high priority, but our spying code currently does not take priority into the account. Probably, we should also take the priority of the calling thread into the account when we synchronize calls and disable interrupts. However, this is only the suggestion - I am not yet in a position to make a definite conclusion (otherwise, instead of speaking about this problem, I would just fix it, wouldn't I?)

I would highly appreciate if you send me an e-mail with your comments and suggestions.

Conclusion

In conclusion, I must say that, although we are able to hook the API calls that are made by both user-mode modules and kernel-mode drivers, our exploration of Windows is far from being over. Would not it be interesting to hook the calls that are made by the system to the target driver? It is obvious that the system has to store the addresses of all functions exported by the driver, in some kind of service table, so that it can call the driver. Therefore, we have to locate this table somehow.

Furthermore, even our process-wide API spying is far from being complete - we can hook only the old-fashioned functional API. Would not it be interesting to spy on COM interfaces as well? In the forthcoming articles, we will try to do the above mentioned things.

About the Author

Comments and Discussions

Hello,
anybody guide me about how to access the Portable Executable (PE) graphical area. plz also tell me it is possible or not?
i want to access the PE graphical area that is going to display at screen....