Process-wide API spying - an ultimate hack

Abstract

API hooking and spying is not an uncommon practice in Windows programming. Development of system monitoring and analysis tools depends heavily upon it. Numerous articles have been written on this subject - quite a few are even available on The Code Project. To be honest, I did not find these articles all that informative - they all seem to describe the techniques that were presented by Matt Pietrek and Jeffrey Richter a decade ago. Don’t get me wrong - I don’t want to say anything about the quality of these articles. The only thing I am saying is that their authors don’t seem to be describing programming tips and tricks of their own design.

This article presents a universal model of process-wide API spying, capable of hooking all API calls in any user-mode process of our choice, i.e. our spying model is not bound to any particular API at compile time. Our implementation is limited to logging the return values of all API functions that are called by the target module. However, our model is extensible - you can add parameter logging as well. Our spying model is particularly useful for analyzing the internal workings of third-party applications when the source code is not available. In addition to the universal process-wide spying model, we also present one more way to inject a DLL into the target process.

All the programming tricks, described in this article, are 100% of my own design, although, certainly, based upon the ideas that were first expressed by Matt Pietrek.

Introduction

Process-wide API hooking relies upon the technique of modifying entries in the Import Address Table (IAT) of the target executable module. First of all, you need to understand how imported functions are invoked - at the binary level, calling an imported function is different from making an intra-modular call. When you make an intra-modular call, the compiler generates a direct call instruction (0xE8 on Intel CPU), because the offset of the function within the module, relative to the place from which it is called, is always known - even at compile time. However, if the function is imported, its address is unknown at compile time, although a guess can be made. Therefore, when you call an imported function, the compiler generates an indirect (0xFF, 0x15 on Intel CPU), rather than direct, call instruction. When you call an imported function, the compiled code looks like the following:

call dword ptr [__imp__CreateWindowExA@48]

This instruction tells the CPU to call the function whose address is stored in the __imp__CreateWindowExA@48 memory location. At load time, the loader writes the address of CreateWindowExA() to the __imp__CreateWindowExA@48 memory location, and the above instruction, when executed, invokes CreateWindowExA(). If we write the address of our user-defined function into the __imp__CreateWindowExA@48 memory location at run time, then all calls to CreateWindowExA() within the module will invoke our user-defined function instead of CreateWindowExA(). Our user-defined function can log or validate parameters, and then call CreateWindowExA() directly by its address. Process-wide API hooking is based upon this idea.
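The idea can be imitated in portable C with an ordinary function pointer standing in for the IAT entry. This is only an analogy under stated assumptions: real_api(), proxy_api() and the imp_real_api slot are hypothetical names, and no real import table is touched.

```c
#include <assert.h>

/* Stand-in for an API function that lives in another module. */
int real_api(int x) { return x + 1; }

/* Stand-in for one IAT entry: the "loader" has stored the API address here. */
int (*imp_real_api)(int) = real_api;

/* Number of calls the proxy has observed. */
int calls_seen = 0;

/* Hypothetical proxy: records the call, then invokes the real function directly. */
int proxy_api(int x)
{
    calls_seen++;
    return real_api(x);
}

/* Client code never names real_api(); it always calls through the slot. */
int client(int x) { return imp_real_api(x); }

/* Overwriting the slot redirects every later call made through it. */
void install_hook(void)
{
    assert(imp_real_api == real_api);   /* hook only once */
    imp_real_api = proxy_api;
}
```

The client code is unchanged by the hook, which is the whole point: only the contents of the slot differ before and after install_hook().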

An API spying solution normally consists of a driver DLL, which does all the actual hooking and spying, and a controller application, which injects the driver DLL into the target process. The driver DLL normally communicates with its controller application by window messages - the WM_COPYDATA message is a convenient way to pass a small amount of data from one application to another.

The addresses of all functions imported by the module are stored in the Import Address Table (IAT), every entry of which has the internal form of __imp__xxx. Once the driver DLL has been injected into the target process, it overwrites IAT entries of the target module with the addresses of user-defined proxy functions, implemented by the driver DLL. Each IAT entry replacement normally requires a separate proxy function - a proxy function must know which particular API function it replaces so that it can invoke the original callee. However, with a certain workaround, all IAT entry replacements can be serviced by a single proxy function - we will show you how this can be done. This is an ultimate hack, but such an approach makes our model absolutely universal - we can hook all API calls in any user-mode process of our choice.

Locating the Import Address Table

In order to start spying, we have to locate the Import Address Table (IAT) of the target executable module. Therefore, we need a brief introduction to Portable Executable (PE) file format, which is the file format of any executable module or DLL. MSDN CD provides a very detailed description of Portable Executable (PE) file format, so we are not going too deeply into details here - we are mostly concerned with locating the Import Address Table of the target executable module.

A PE file starts with a 64-byte DOS file header (IMAGE_DOS_HEADER structure), followed by a tiny DOS program which, in turn, is followed by the 248-byte NT file header (IMAGE_NT_HEADERS structure). The offset to the NT file header from the beginning of the file is given by the e_lfanew field of the IMAGE_DOS_HEADER structure. The first 4 bytes of the NT file header are the file signature, followed by the 20-byte IMAGE_FILE_HEADER structure, which, in turn, is followed by the 224-byte IMAGE_OPTIONAL_HEADER structure (these sizes apply to 32-bit images). The code below obtains a pointer to IMAGE_OPTIONAL_HEADER structure (hMod is a module handle):
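A minimal portable sketch of this lookup, working on a raw byte buffer rather than on the Windows structures: it assumes a 32-bit image, and the helper names rd32() and optional_header_offset() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value from a byte buffer. */
uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the offset of the optional header inside a PE image,
   or 0 if the buffer does not look like a PE file. */
size_t optional_header_offset(const uint8_t *image, size_t size)
{
    assert(image != NULL);
    if (size < 64 || image[0] != 'M' || image[1] != 'Z')
        return 0;                               /* no DOS header */
    uint32_t nt = rd32(image + 0x3C);           /* e_lfanew */
    if (nt > size || size - nt < 24)
        return 0;                               /* NT header out of range */
    if (rd32(image + nt) != 0x00004550u)        /* "PE\0\0" signature */
        return 0;
    return (size_t)nt + 4 + 20;                 /* signature + IMAGE_FILE_HEADER */
}
```

With a module that is already loaded, the same arithmetic applies with the module handle as the base address of the buffer.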

In actuality, IMAGE_OPTIONAL_HEADER is far from being optional – the information it contains is too important to be omitted. This includes the suggested base address of the module, size and base addresses of code and data, stack and heap configuration, the address of entry point, and, what we are mostly interested in, pointer to the table of directories. PE file reserves 16 so-called data directories. The most commonly seen directories are import, export, resource and relocation. We are mostly interested in import directory, which is just an array of IMAGE_IMPORT_DESCRIPTOR structures, with one structure corresponding to each imported module. The code below obtains a pointer to the first IMAGE_IMPORT_DESCRIPTOR structure in import directory:
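As a rough sketch, the directory lookup reduces to reading one 8-byte entry from the table at the end of the optional header. The offsets below are those of the 32-bit optional header (table at offset 96, entry count at offset 92); directory_rva() and the enum names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    DIR_EXPORT = 0,         /* index of the export directory */
    DIR_IMPORT = 1,         /* index of the import directory */
    DIR_COUNT_OFFSET = 92,  /* NumberOfRvaAndSizes in a 32-bit optional header */
    DIR_TABLE_OFFSET = 96,  /* first data directory entry */
    DIR_ENTRY_SIZE = 8      /* each entry: 4-byte RVA, 4-byte size */
};

/* Read a little-endian 32-bit value from a byte buffer. */
uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* opt points at a 32-bit optional header.
   Returns the RVA of the requested directory, or 0 if it is absent. */
uint32_t directory_rva(const uint8_t *opt, unsigned index)
{
    assert(opt != NULL);
    uint32_t count = rd32(opt + DIR_COUNT_OFFSET);
    if (index >= count || index >= 16)
        return 0;
    return rd32(opt + DIR_TABLE_OFFSET + index * DIR_ENTRY_SIZE);
}
```

Adding the returned RVA to the module's load address gives the pointer to the first IMAGE_IMPORT_DESCRIPTOR.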

The first field of the IMAGE_IMPORT_DESCRIPTOR structure (OriginalFirstThunk) holds an offset to the hint/name table, and its last field (FirstThunk) holds an offset to the import address table. These two tables are of the same length, with one entry corresponding to each imported function. The code below lists all names and addresses of IAT entries for all functions imported by the module:

The inner loop retrieves function names and addresses of IAT entries for the imported module from IMAGE_IMPORT_DESCRIPTOR structure that corresponds to the given module; the outer loop just proceeds to the next imported module. As you can see, Import Address Table for the imported module is nothing more than just an array of DWORDs. All we have to do in order to start spying is to fill this array with the addresses of our user-defined proxy functions. As we promised, we will show you a trick that makes it possible for all IAT entry replacements to be serviced by a single proxy function.
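The two nested loops can be sketched in portable C as follows. This is an illustration over a byte buffer that holds a 32-bit image as mapped in memory (so an RVA is simply an offset from the base); ImportEntry and list_imports() are hypothetical names, and imports by ordinal are skipped because they carry no name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *dll;       /* name of the imported module */
    const char *function;  /* name of the imported function */
    uint32_t    iat_rva;   /* RVA of the IAT entry to overwrite */
} ImportEntry;

/* Read a little-endian 32-bit value from a byte buffer. */
uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Stores up to max entries in out[] and returns the number of named
   imports found (which may exceed max). */
size_t list_imports(const uint8_t *base, uint32_t import_rva,
                    ImportEntry *out, size_t max)
{
    assert(base != NULL && out != NULL);
    size_t n = 0;
    /* Outer loop: one 20-byte descriptor per imported module;
       a descriptor with a zero Name field ends the array. */
    for (const uint8_t *d = base + import_rva; rd32(d + 12) != 0; d += 20) {
        uint32_t names = rd32(d);         /* OriginalFirstThunk */
        uint32_t iat   = rd32(d + 16);    /* FirstThunk */
        const char *dll = (const char *)(base + rd32(d + 12));
        /* Inner loop: both tables run in parallel and end with a zero entry. */
        for (uint32_t i = 0; ; i++) {
            uint32_t entry = rd32(base + names + 4 * i);
            if (entry == 0)
                break;
            if (entry & 0x80000000u)
                continue;                 /* imported by ordinal */
            if (n < max) {
                out[n].dll = dll;
                out[n].function = (const char *)(base + entry + 2); /* skip hint */
                out[n].iat_rva = iat + 4 * i;
            }
            n++;
        }
    }
    return n;
}
```

Each iat_rva, added to the module base, is the address of the DWORD that would be overwritten to install a hook.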

Implementing the spying solution

Our spying team consists of 4 members - ProxyProlog(), Prolog(), ProxyEpilog() and Epilog(). As their names suggest, ProxyProlog() and Prolog() are invoked before the actual callee takes control; ProxyEpilog() and Epilog() are invoked after the actual callee returns. ProxyProlog() and ProxyEpilog() are implemented as naked assembly routines; Prolog() and Epilog() are just regular C functions. The actual spying job is done by Prolog() and Epilog(). The only task of ProxyProlog() and ProxyEpilog() is to save and restore CPU registers and flags before and after Prolog() and Epilog() perform their tasks - if we want the target process to keep on functioning properly, the whole process of spying must leave everything intact, at least as far as the API function and its client code are concerned.

Windows uses flat memory model, which means code and data reside in the single address space, rather than in separate segments. This implies we can fill an array with the machine instructions, and call it as a function. Look at the code below:

This is a 6-byte indirect call instruction. The first 2 bytes are occupied by the call instruction itself, and 4 bytes that follow are occupied by the operand - they hold the address of the variable that contains the address of ProxyEpilog(). In this particular case, this variable comes immediately after the 6-byte instruction. When the instruction pointer hits retbuff, our handcrafted code is going to call ProxyEpilog(). Call instruction implicitly pushes the address, to which the invoked routine must return control, on the stack – this is how the function knows its return address. In our case, the pointer to the variable that contains the address of ProxyEpilog() (the address of retbuff[6]) is going to be on top of the stack when ProxyEpilog() starts execution.

When DllMain() is called with fdwReason set to DLL_PROCESS_ATTACH, we fill the retbuff array with the machine instructions (retbuff is a global BYTE array), dynamically allocate some memory, allocate a TLS index, and store a pointer to the allocated memory in the thread local storage. Every time DllMain() is called with fdwReason set to DLL_THREAD_ATTACH, it must dynamically allocate some memory and put it aside into thread local storage.

Now let’s look at how we overwrite IAT entries, after obtaining name and address of IAT entry for the given imported function:

For each IAT entry replacement, we dynamically allocate an array, the first 6 bytes of which are occupied by the indirect call instruction, and the 16 bytes that follow are processed as RelocatedFunction structure, the first member of which is set to the address of ProxyProlog() (it definitely has to be the first). The other fields are set to the address and the name of the imported function, plus the name of the DLL from which the given function is being imported. The first 2 bytes of the array are 0xFF and 0x15, and the 4 bytes that follow contain the address of the RelocatedFunction structure. We replace each IAT entry with the address of such an array - each IAT entry replacement requires a separate array.
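The byte layout of one such array can be sketched as below. Addresses are modelled as plain 32-bit values so the sketch stays portable; build_chunk() is a hypothetical helper, and the order of the three fields after the ProxyProlog() address is an assumption, since the text fixes only the first member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CALL_SIZE = 6, CHUNK_SIZE = 22 };  /* 6-byte call + 16-byte record */

/* Write a little-endian 32-bit value into a byte buffer. */
void wr32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Fills one replacement chunk. chunk_addr is the 32-bit address at which
   the chunk will live in the target process. */
void build_chunk(uint8_t chunk[CHUNK_SIZE], uint32_t chunk_addr,
                 uint32_t proxy_prolog, uint32_t api_addr,
                 uint32_t api_name, uint32_t dll_name)
{
    assert(chunk != NULL);
    chunk[0] = 0xFF;                          /* call dword ptr [imm32] */
    chunk[1] = 0x15;
    wr32(chunk + 2, chunk_addr + CALL_SIZE);  /* operand: address of the record */
    wr32(chunk + 6, proxy_prolog);            /* first member: the call target */
    wr32(chunk + 10, api_addr);               /* assumed order from here on */
    wr32(chunk + 14, api_name);
    wr32(chunk + 18, dll_name);
}
```

Because the record starts right after the 6-byte instruction, the return address pushed by the call is also the address of the record, which is how a single ProxyProlog() can tell which function it is standing in for.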

As a result, every call to the API function will, in actuality, call our handcrafted code that calls ProxyProlog(). As we said, call instruction implicitly pushes on the stack the address, to which the invoked routine must return. In our case, the pointer to RelocatedFunction structure is going to be on top of the stack, and the original return address, i.e. the address to which the API function must return control, is going to be one stack entry below at the time when ProxyProlog() starts execution. Stack entries below the original return address are going to be occupied by the API function arguments. Now let’s look at ProxyProlog() and Prolog() implementations.

ProxyProlog() saves registers and CPU flags, pushes the value of ESP at the time when ProxyProlog() started execution, and calls Prolog(). As we said, the pointer to RelocatedFunction structure is on top of the stack, and the address to which the API function must return control, is one stack entry below at the time when ProxyProlog() starts execution. As a result, Prolog() receives a pointer to the stack location where the pointer to RelocatedFunction structure can be found, as an argument. By incrementing its argument, Prolog() can find a pointer to the stack location where the original return address is stored.

Prolog() saves the pointer to RelocatedFunction structure and the original return address in the thread local storage, which is organized as a DWORD, followed by the array of Storage structures. We treat this array as a stack – DWORD just indicates the number of stack entries, i.e. is just a counter. Prolog() saves the pointer to RelocatedFunction structure and the return address in the topmost stack entry, and increments the counter. After performing the above tasks, Prolog() modifies the CPU stack – the address of the API function obtained from RelocatedFunction structure, replaces the pointer to RelocatedFunction structure, and the address of retbuff global array which is filled with the machine instructions in DllMain(), replaces the original return address on the stack.

After Prolog() returns, ProxyProlog() restores registers and CPU flags. Prolog() has modified the CPU stack in such a way that, when ProxyProlog() returns, the program flow jumps to the original callee, i.e. to the API function. When the API function returns, the program flow jumps not to the original return address, but to our handcrafted code that calls ProxyEpilog().

Implementation of ProxyEpilog() is almost identical to that of ProxyProlog(). ProxyEpilog() saves registers and CPU flags, pushes the value of ESP at the time when EAX register was on top of the stack, and calls Epilog(). As a result, Epilog() receives a pointer to the stack location where the return value of the API function can be found, as an argument. By incrementing its argument, Epilog() can find a pointer to the stack location where the address, to which ProxyEpilog() must return, is stored. Let’s look at Epilog().

Epilog() gets the pointer to RelocatedFunction structure and the original return address from the topmost Storage structure in the thread local storage, and decrements the counter. Then Epilog() modifies the CPU stack - it replaces the address to which ProxyEpilog() must return with the original return address. After performing the above tasks, Epilog() informs the controller application that the API function has returned - the name of the given function, as well as of the DLL that exports it, are available from RelocatedFunction structure, a pointer to which was saved in the thread local storage, and the pointer to the return value of the API function is Epilog()’s argument. Epilog() provides the controller application with all the above information by sending a WM_COPYDATA message to the controller window.
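The bookkeeping done by Prolog() and Epilog() can be modelled in portable C, with the CPU stack represented as a small array of words. This is a sketch of the stack-swapping logic only: the ThreadBlock type, the MAX_DEPTH capacity and the function signatures are assumptions, and the logging step (WM_COPYDATA) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_DEPTH = 256 };  /* assumed capacity of the per-thread storage */

typedef struct {
    uintptr_t   proxy;     /* address of ProxyProlog() */
    uintptr_t   api;       /* address of the real API function */
    const char *name;      /* function name */
    const char *dll;       /* exporting DLL */
} RelocatedFunction;

typedef struct {
    const RelocatedFunction *fn;
    uintptr_t ret;         /* original return address */
} Storage;

/* Per-thread block: a counter followed by an array used as a stack. */
typedef struct {
    uint32_t count;
    Storage  entries[MAX_DEPTH];
} ThreadBlock;

/* sp[0] holds the pointer to the RelocatedFunction record,
   sp[1] holds the caller's original return address. */
void prolog(ThreadBlock *tls, uintptr_t *sp, uintptr_t retbuff)
{
    assert(tls != NULL && tls->count < MAX_DEPTH);
    Storage *top = &tls->entries[tls->count++];
    top->fn  = (const RelocatedFunction *)sp[0];
    top->ret = sp[1];
    sp[0] = top->fn->api;  /* ProxyProlog() now "returns" into the API */
    sp[1] = retbuff;       /* the API returns into the handcrafted code */
}

/* sp[0] holds the saved return value of the API function,
   sp[1] holds the address to which ProxyEpilog() would return. */
const RelocatedFunction *epilog(ThreadBlock *tls, uintptr_t *sp)
{
    assert(tls != NULL && tls->count > 0);
    Storage *top = &tls->entries[--tls->count];
    sp[1] = top->ret;      /* ProxyEpilog() now returns to the original caller */
    return top->fn;        /* name and DLL are available for logging */
}
```

Keeping one ThreadBlock per thread is what makes nested and concurrent API calls safe: each thread pushes and pops only its own entries.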

After Epilog() returns, ProxyEpilog() restores registers and CPU flags. Epilog() has modified the CPU stack in such a way that, after ProxyEpilog() returns, the program flow jumps to the address, to which the API function was supposed to return control if no “espionage” was taking place. As you can see, all our “spying activity” cannot disrupt the program execution in any possible way, because it leaves CPU stack, registers and flags intact, at least as far as the API function and its client code are concerned. Our “spying team” does not care which API function to spy on - our model is absolutely universal, because our implementation is not bound to any particular API function at the compile time. Furthermore, our model is suitable for spying in multithreaded environment, because we save all necessary data in the thread local storage.

For the time being, our model is suitable only for listing all API calls and for logging the return values of API functions. If you want to add parameter logging or validation, it can easily be done - the API function arguments are just below the original return address on the CPU stack. However, you must provide our “spying team” with the argument lists of the target API functions – unfortunately, there is no way to obtain this information from the PE file. The solution to this problem lies with the enhanced communication between the controller application and the spying DLL - the controller application can always get the description of arguments of the target API function from the user, and provide the DLL with this information at run time. Apparently, RelocatedFunction structure would require one more data member, i.e. a pointer to some array that contains the description of arguments, so that Prolog() would be able to examine the arguments. We leave it for you to decide how to do it.

Warning: If your target executable module dynamically links to the C run-time library, don’t try to hook the functions that are imported from MSVCRT.dll. Instead, you should hook the API calls that the C run-time library makes, i.e. overwrite the Import Address Table of MSVCRT.dll’s module.

Therefore, we are able to hook all API calls that are made by the target executable module, i.e. outgoing calls. What about the opposite task, i.e. hooking all incoming calls to some particular DLL module (say, kernel32.dll), made by all modules that are loaded into the address space of the target process, including system DLLs?

Hooking all calls to DLL module, made by the target process

Once we know that process-wide API hooking can be achieved by modifying IAT entries of the target executable module, the answer to this question must be obvious. All we have to do is to walk through all modules that are currently loaded into the address space of the target process, and, in each loaded module, overwrite IAT entries of all functions that are imported from kernel32.dll. As a result, we will hook all calls that are made to kernel32.dll by all modules that are currently loaded into the address space of the target process.

Unfortunately, this is only a partial solution. The problem is that any modification of IAT entries in a module affects only the given module. Hence, even if we hook all calls to kernel32.dll in all currently loaded modules, any module that is subsequently loaded into the address space of the target process is not going to be affected - all calls to kernel32.dll made by such a module will remain unhooked.

In order to get a real solution, in addition to the above mentioned overwriting of IAT entries in all currently loaded modules, we must also overwrite IMAGE_EXPORT_DIRECTORY of kernel32.dll itself. If we overwrite IMAGE_EXPORT_DIRECTORY of kernel32.dll, all DLLs that are loaded into the target process in the future will link with our proxy functions, although currently loaded modules are not going to be affected. By combining the modification of IATs of all currently loaded modules with overwriting the IMAGE_EXPORT_DIRECTORY of kernel32.dll itself, we will hook all calls that are made to kernel32.dll by absolutely all (including yet-to-be-loaded) modules in the address space of the target process. Don’t confuse it with system-wide spying - apart from the target process, all other processes in the system will stay intact.

All information about the functions, exported by DLL module, can be found in IMAGE_EXPORT_DIRECTORY structure, which is accessible via IMAGE_OPTIONAL_HEADER structure. The code below obtains a pointer to IMAGE_EXPORT_DIRECTORY structure (hMod is kernel32.dll module's handle):

IMAGE_EXPORT_DIRECTORY contains the information about the addresses, names and ordinal values of all functions that are exported from the given DLL. The address table is a ULONG array that holds the addresses of all exported functions, the name table is a ULONG array that holds the addresses of function name strings, and the ordinal table is a USHORT array that holds the difference between the real ordinal and base ordinal values. Please note that the addresses of functions and names are given as Relative Virtual Addresses (RVAs). In order to get the actual memory address of the exported function or of its string name, you must add its corresponding entry in the address or name table to the address at which the given module is loaded. The code below lists all names and addresses of all functions that are exported by DLL module:
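How the three tables work together can be sketched with a lookup by name over a byte buffer that holds the image as mapped in memory. The field offsets are those of the 40-byte IMAGE_EXPORT_DIRECTORY; find_export() and the read helpers are hypothetical names, and forwarded exports are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian 16-bit value from a byte buffer. */
uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a little-endian 32-bit value from a byte buffer. */
uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the RVA of the export called name, or 0 if there is none. */
uint32_t find_export(const uint8_t *base, uint32_t export_rva, const char *name)
{
    assert(base != NULL && name != NULL);
    const uint8_t *dir = base + export_rva;
    uint32_t n_names = rd32(dir + 24);                 /* NumberOfNames */
    const uint8_t *functions = base + rd32(dir + 28);  /* AddressOfFunctions */
    const uint8_t *names     = base + rd32(dir + 32);  /* AddressOfNames */
    const uint8_t *ordinals  = base + rd32(dir + 36);  /* AddressOfNameOrdinals */
    for (uint32_t i = 0; i < n_names; i++) {
        const char *s = (const char *)(base + rd32(names + 4 * i));
        if (strcmp(s, name) == 0) {
            /* The ordinal table holds (ordinal - base), i.e. an index
               into the address table. */
            uint16_t index = rd16(ordinals + 2 * i);
            return rd32(functions + 4 * index);
        }
    }
    return 0;
}
```

The address-table slot located this way (functions + 4 * index) is the one that has to be overwritten, with an RVA rather than an absolute address, to redirect future imports.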

As you can see, for the time being everything is more or less the same as with listing the imported functions and their names. However, things become a little bit different when it comes to patching the export address table – its entries must be overwritten not with actual memory addresses of proxy functions, but with RVAs, i.e. the differences between the actual memory addresses of proxy functions and the address, at which the given module is loaded. This means that all proxy functions must be loaded at the addresses that are higher than kernel32.dll module’s base address – RVA cannot be negative. Let’s look at how it can be done:

As a first step, we allocate a chunk of virtual memory at the highest possible address. The version of kernel32.dll on my machine (it runs Windows 2000) exports 823 functions. For each function replacement, we need 6 bytes for the indirect call instruction, plus 16 bytes for RelocatedFunction structure, i.e. 22 bytes. If we round this number up to 24 bytes, we will be able to fit 170 function replacement chunks in one page of memory (4096 bytes on Intel CPU), and 16 bytes of every page will remain unused. Therefore, we will need a total of 5 pages of virtual memory. It is a good idea to align these function replacement chunks on the page boundary. Therefore, the address of every given function replacement chunk can be calculated as follows:

Hence, the RVA of every given chunk, relative to the target module’s base address, can be calculated as follows:

DWORD offset=(DWORD)writebuff-(DWORD)hMod+pos;
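The arithmetic above can be checked with a short sketch. The constants come from the text (24-byte chunks, 4096-byte pages); the function names are hypothetical, and pos is assumed to be the byte offset of a chunk within writebuff.

```c
#include <assert.h>
#include <stdint.h>

enum {
    PAGE_BYTES = 4096,                           /* page size on Intel CPU */
    CHUNK_STRIDE = 24,                           /* 22 bytes rounded up */
    CHUNKS_PER_PAGE = PAGE_BYTES / CHUNK_STRIDE  /* 170, with 16 bytes left over */
};

/* Number of pages needed for the given number of exported functions. */
uint32_t pages_needed(uint32_t functions)
{
    return (functions + CHUNKS_PER_PAGE - 1) / CHUNKS_PER_PAGE;
}

/* Byte offset of chunk number index within the allocated block;
   chunks never straddle a page boundary. */
uint32_t chunk_pos(uint32_t index)
{
    return (index / CHUNKS_PER_PAGE) * PAGE_BYTES +
           (index % CHUNKS_PER_PAGE) * CHUNK_STRIDE;
}

/* RVA of a chunk relative to the hooked module's base address. */
uint32_t chunk_rva(uint32_t writebuff, uint32_t module_base, uint32_t pos)
{
    assert(writebuff >= module_base);  /* an RVA cannot be negative */
    return writebuff - module_base + pos;
}
```

For the 823 exports mentioned above, pages_needed(823) gives 5, and chunk number 170 is the first chunk of the second page.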

The rest is pretty much the same as overwriting the IAT entry - we fill the first 6 bytes of the current chunk with the machine instructions, process the 16 bytes that follow as RelocatedFunction structure, and write the RVA to the export address table entry that corresponds to the given function. As a result, every DLL that is subsequently loaded into the target process will link with our proxy “functions”, i.e. with our handcrafted code that calls ProxyProlog(). Furthermore, any call to GetProcAddress() from any module within the target process will return the address of our proxy “function”, rather than the address of the real callee. However, if we call any function exported by kernel32.dll by its name, the call will still reach the actual function, rather than our handcrafted code (unless the call is made by a module that was loaded after we patched the export address table of kernel32.dll) - IATs of all modules that were loaded into the target process before we patched the export address table of kernel32.dll still contain the addresses of the actual functions.

WARNING: If any module in your target process dynamically links to the C run-time library, make sure that MSVCRT.dll is loaded into your target process’s address space before you overwrite kernel32.dll’s export table. If you try to load MSVCRT.dll into your target process’s address space after you have hooked kernel32.dll, it will fail to load properly. When it comes to hooking and spying, MSVCRT.dll turns out to be a hell of a library to work with - you remember that you should not hook the functions that are imported from MSVCRT.dll, i.e. this library always requires special treatment.

After having modified the export address table of kernel32.dll, we must walk through all modules that are currently loaded into the address space of the target process, and, in each loaded module, overwrite IAT entries of all functions that are imported from kernel32.dll. The code below shows how it can be done (currenthandle is a module handle of spying DLL):

We walk through all modules that are currently loaded into the address space of the target process (the fact that, starting from Windows 2000, Toolhelp32 functions are available on NT platform, simplifies our task greatly), and, in each loaded module, overwrite IAT entries of all functions that are imported from kernel32.dll. We don't even have to fill function replacement chunks - it has already been done when we overwrote the export address table of kernel32.dll. All we have to do is to overwrite IAT entries with the addresses that are returned by GetProcAddress() - after we have overwritten the export address table of kernel32.dll, GetProcAddress() returns the addresses of our function replacement chunks, rather than addresses of actual exported functions. It is understandable that all the code you have seen so far resides in our spying DLL.

Injecting the spying DLL into the target process

There is one more thing to be done - we must inject the spying DLL into the target process. The technique described by Jeffrey Richter uses the CreateRemoteThread() API function in order to achieve this goal. Unfortunately, this technique on its own is not going to work in our case. Why not? Because we save the original return address in the thread local storage. If we want the target process to keep on functioning properly, absolutely every thread in the process must dynamically allocate some memory and put it aside into thread local storage, i.e. DllMain() must be called by absolutely every thread in the process. DllMain() will be first called by the thread that loads the spying DLL into the target process, and, subsequently, by all threads that are created in the target process after the spying DLL has been loaded. However, if we use CreateRemoteThread() to inject the spying DLL, all threads that were created by the target process before we injected the spying DLL are not going to call DllMain(). Therefore, if we want the target process to keep on functioning properly, we have only 2 options:

1. We must inject the spying DLL into its primary thread, and do it before the target process creates any additional threads, i.e. at the earliest possible stage of the target process’s lifetime

2. We must make every thread that currently runs in the target process call our spying DLL's entry point

Implementing the former option is relatively easy, compared to the latter one. Therefore, we will start from the first option, and then proceed to the second one.

Injecting the spying DLL into the process that we create ourselves

First, we will inject our spying DLL into the process that we create ourselves. Let’s look at how it can be done:

As a first step, we obtain the address of entry point of the target executable module – we can get this information before even spawning the target process. Our executable file is saved on the disk in PE format, and, hence, the address of entry point is available from the IMAGE_OPTIONAL_HEADER structure - all we have to do is to add together AddressOfEntryPoint and ImageBase fields of IMAGE_OPTIONAL_HEADER structure.
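As a small sketch of this step, both fields can be read straight from the optional header bytes of the file on disk. The offsets are those of the 32-bit optional header (AddressOfEntryPoint at 16, ImageBase at 28); entry_point_address() is a hypothetical helper, and the result is only valid when the module is loaded at its preferred base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value from a byte buffer. */
uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* opt points at a 32-bit optional header read from the .exe file. */
uint32_t entry_point_address(const uint8_t *opt)
{
    assert(opt != NULL);
    uint32_t entry_rva  = rd32(opt + 16);  /* AddressOfEntryPoint */
    uint32_t image_base = rd32(opt + 28);  /* ImageBase */
    return image_base + entry_rva;
}
```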

Then we create a target process with the initially suspended primary thread from the .exe file, dynamically allocate a memory array in the target process’s address space, and fill this array with the machine instructions in the following form:

Here we simulate the call instruction by combination of push and jmp instructions. When the instruction pointer hits the first byte of this array, the program will call LoadLibraryA() with pointer_to_dllname as an argument, and then return control to the application’s entry point.
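One possible encoding of such a stub is sketched below: two push instructions followed by a relative jump. The exact instructions are an assumption, as are the helper name build_loader_stub() and the use of 32-bit values for addresses in the target process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STUB_SIZE = 15 };  /* push imm32 (5) + push imm32 (5) + jmp rel32 (5) */

/* Write a little-endian 32-bit value into a byte buffer. */
void wr32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* stub_addr is the address of the stub inside the target process. */
void build_loader_stub(uint8_t stub[STUB_SIZE], uint32_t stub_addr,
                       uint32_t dllname_addr, uint32_t entry_point,
                       uint32_t load_library)
{
    assert(stub != NULL);
    stub[0] = 0x68;                      /* push pointer_to_dllname (argument) */
    wr32(stub + 1, dllname_addr);
    stub[5] = 0x68;                      /* push entry point (return address) */
    wr32(stub + 6, entry_point);
    stub[10] = 0xE9;                     /* jmp LoadLibraryA */
    /* rel32 is measured from the end of the jmp instruction. */
    wr32(stub + 11, load_library - (stub_addr + STUB_SIZE));
}
```

Pushing the argument first and the entry point second reproduces the stack that a real call would leave, so LoadLibraryA() removes its argument and returns straight into the application's entry point.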

Finally, we change the execution context of the target process’s primary thread - we set the thread’s instruction pointer to the first byte of our array with handcrafted instructions, and then let the thread run by calling ResumeThread(). As a result, the spying DLL will be loaded by the target process’s primary thread even before the target application’s entry point is called.

Injecting the spying DLL into the running process

Now let's do a much more complicated thing, and inject our spying DLL into a process that is already running. Let’s look at how it can be done:

As a very first step, we allocate a memory array in the address space of the target process, copy the name of our spying DLL into this array, and call CreateRemoteThread() API function with the lpStartAddress and lpParameter parameters set to respectively the address of LoadLibrary() API function and the address of the array that we have allocated, i.e. inject the spying DLL into the target process the way described by Jeffrey Richter. Then we walk through all modules that are currently loaded into the address space of the target process, until we find the module handle of our spying DLL. Then we read the memory of the target process, starting from the address that corresponds to our spying DLL's module handle. At this point we are already able to find the address of our DLL's entry point in the address space of the target process - this information is available from IMAGE_OPTIONAL_HEADER.

Then we create auto-reset event in initially unsignaled state - the meaning of this step will become obvious when you see the implementation of inject(). Finally, we enumerate all threads that currently run in the target process, and make every thread in the target process call our DLL's entry point - this is implemented by inject(), to which the above mentioned event handle is one of the parameters. Let's look at inject()'s implementation:

The implementation of inject() does, basically, the same thing as our DLL-injecting code in the previous example - it fills the memory array with machine code, and changes the execution context of the target thread, i.e. makes it execute our handcrafted code that calls our DLL's entry point. However, now things become more complicated - our target thread is already running, so all our activity must leave CPU registers and flags intact, as far as the target thread is concerned. Furthermore, for safety reasons, we must synchronize our injections, i.e. proceed to the next target thread only after the current target thread's execution context has been restored. Therefore, we have to fill the array with the following instructions:

This seems to be a bit of a tough job, but, unless you are desperate to crash the target process, it has to be done. After having changed the execution context of the target thread, inject() waits until the target thread sets the synchronization event we have created, so that we cannot proceed to the next thread until the execution context of the target thread is restored. But what if the target thread is deadlocked at the time when we want it to call the entry point of our spying DLL? Then our code will get stuck - no one is going to set our synchronization event to the signaled state. This means that the above technique can be useful (with few adjustments applied) for detecting deadlocked threads in the target process - the fact that one of the worker threads in multithreaded application is deadlocked is not always obvious at the first glance.

NOTE: If we inject our spying DLL into a target process that we create ourselves, we can overwrite the addresses of our target functions right in DllMain() when it is called with the fdwReason parameter set to DLL_PROCESS_ATTACH, because our target process has only one thread at the time when our spying DLL is injected. However, if we inject our spying DLL into a target process that is already running, we can overwrite the addresses of our target functions only after absolutely every thread in the target process has called our DLL's entry point. Otherwise, there is a good chance that the function replacement code will be called by a thread that has not yet allocated its storage, which means the target process will crash when Prolog() tries to save the return address in the storage that has not yet been allocated.

This implies that the code which actually overwrites the addresses of our target functions must reside in a function that is exported by our spying DLL. Then, after the code in loadandinject() is executed, we would be able to create a thread in the target process by calling CreateRemoteThread() with the lpStartAddress parameter set to the address of this function - once the function is exported, we can always get its address in the target process from the spying DLL's export address table.

If all this seems too complicated to you, I suggest you create the target process yourself, rather than spy on a process that is already running - as you can see, the fact that the target process is already running at the time when we inject our spying DLL gives us quite a few things to worry about. To be honest, I would personally prefer, for practical purposes, to create the target process myself.

Conclusion

In conclusion, we must say that our spying model is not bound to any particular API function at compile time, i.e. it is extremely flexible, and it is suitable for spying in a multithreaded environment. If you extend it to checking the API function parameters, you can turn it into a tremendously powerful system tool. For the time being this model is suitable only for user-mode API hooking. In the next tutorial we will show you how to extend this model to kernel-mode spying - we will hook all the system calls made by the target device driver.
