C++ WinAPI Wrapper Object using thunks (x32 and x64)

Using the "thunk" technique to add the this pointer as a fifth parameter to the WndProc call on x32 and x64

Introduction

This article presents an overview of a technique known as "thunking" as a means of wrapping a WinAPI window in a C++ object. While there are various ways to implement such a wrapper, this article describes a method that intercepts the WndProc call and passes the this pointer as an additional fifth parameter. It uses one thunk and only one function call, and has code for both x32 and x64.

Background

The WinAPI was introduced before C++ and OOP became popular. Attempts such as ATL have been made to make it more class- and object-oriented. The main stumbling block is that the message handling procedure (typically known as WndProc) is not called by the application but from outside, by Windows itself. This requires the function to be global or, in the case of a C++ member function, declared static. The result is that when the application enters WndProc, it has no pointer to the particular object instance whose handler functions it should call.

This means that any C++ OOP approach must solve the problem of determining, from within a static member function, which object's method should receive the message.

Some options include:

1. Only design for a single window instance. The message processor can be global or namespace-scoped.

2. Use the extra memory provided by cbClsExtra or cbWndExtra to store a pointer to the correct object.

3. Add a property to the window which is a pointer to the object and use GetProp to retrieve it.

4. Maintain a look-up table that references the pointer to the object.

5. Use a method known as a “thunk”.

Each has upsides and downsides.

1. You are limited to a single window and the code cannot be reused. This may be fine for simple applications, but if that is all you need, the effort of encapsulating the window in an object buys you little and you might as well stick with the standard template.

2. The method is “slow”, since every message requires an extra call to fetch the pointer from the extra window memory. In addition, it reduces reusability of the code, as it hinges on these values not being overwritten or used for other purposes during the life of the window. On the other hand, it is a straightforward and easy implementation.

3. Slower than number 2 and introduces similar overhead, but you do eliminate the potential of data being overwritten (though you need to ensure the property has a unique name so it will not conflict with any other added properties).

4. Here, we run into performance and overhead issues as the look-up table grows, and this lookup needs to happen each time the message processor function is called. It does allow for the function to be a private static member.

5. This is somewhat tricky to implement, but it provides low overhead and better performance than the other methods, allows for enhanced flexibility, and suits any OOP design style.

The truth is that a good many applications really don't need anything fancy and can get away with a more conventional approach. However, if you want to build an extensible framework with low overhead, then method 5 provides the best option, and this article will present an overview of how to approach such a design.
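For comparison, option 4 can be sketched in a few lines. This is a minimal, portable illustration and not code from this article: WindowHandle, Window, registry, and StaticDispatch are hypothetical names, and a plain void* stands in for HWND so the sketch compiles without windows.h.

```cpp
#include <unordered_map>

using WindowHandle = void*;   // stand-in for HWND (assumption)

class Window {
public:
    int paints = 0;

    // Non-static handler: it needs a valid `this` to do its work.
    long HandleMessage(unsigned msg) {
        if (msg == 0x000F) {  // 0x000F is the value of WM_PAINT
            ++paints;
            return 0;
        }
        return -1;            // "not handled" in this sketch
    }
};

// Option 4: a table that maps each window handle to its object.
inline std::unordered_map<WindowHandle, Window*>& registry() {
    static std::unordered_map<WindowHandle, Window*> table;
    return table;
}

// Static entry point: it has no `this`, so it must look the object up
// on every single message before it can forward the call.
inline long StaticDispatch(WindowHandle h, unsigned msg) {
    auto it = registry().find(h);
    if (it == registry().end()) return -1;
    return it->second->HandleMessage(msg);
}
```

The lookup inside StaticDispatch is the per-message cost that the thunk approach is meant to avoid.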

Using the Code

A thunk is a small piece of executable code placed in memory at run time. Because the program generates it itself, it can carry values that are only known while the program is running. The idea is to register the thunk, rather than the message handler itself, as the window procedure. When Windows calls it, the thunk adds the object’s address to the arguments and then jumps to the static message processing member function, which can use that address to call the correct non-static member function next in the message processing chain.

First, let’s create our template for this project. We’ll need a main file that will contain the wWinMain function.

We’ll need to set up our object with all of the elements required to create the window. The first step is registering the WNDCLASSEX structure. Some of the fields in WNDCLASSEX should be changeable by the code instantiating the object, but some we want to reserve for the object to control.

One option is to define a “struct” holding the fields we allow the user to set and pass it to a function that copies them into the WNDCLASSEX structure to be registered. Another is to pass the values directly as function arguments. With a “struct”, the data could possibly be reused elsewhere, but a struct takes up memory, and if we use the values only once, that is not very efficient. Passing the values as arguments limits their scope to that one function, which is more efficient, but we would need to pass at least 20 parameters and then check each one.

Here, we will declare default values within our creation function and then declare a “struct” outside of our class where if the user wants to adjust the defaults, they can and they can manage the lifecycle of that structure. The user just declares to the function whether they will pass the struct and update default values or just go with the defaults. So, we declare the following function:
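A minimal sketch of that pattern, under stated assumptions: WindowOptions, AppWindow, and the particular fields and default values are hypothetical, chosen only to show defaults living inside the creation function while the caller owns an optional override struct. Plain types stand in for the WinAPI ones so the sketch compiles anywhere.

```cpp
#include <string>

// Hypothetical subset of the values a caller may want to override.
struct WindowOptions {
    unsigned     classStyle = 0x0003;   // CS_VREDRAW (0x0001) | CS_HREDRAW (0x0002)
    int          width      = 640;
    int          height     = 480;
    std::wstring title      = L"Window";
};

class AppWindow {
public:
    // Pass nullptr to accept the defaults, or a caller-owned struct to override them.
    bool Create(const WindowOptions* overrides = nullptr) {
        WindowOptions defaults;   // defaults live only for the duration of this call
        const WindowOptions& o = overrides ? *overrides : defaults;
        style_  = o.classStyle;
        width_  = o.width;
        height_ = o.height;
        title_  = o.title;
        // A real Create would go on to fill WNDCLASSEX, register the class,
        // and create the window with these values.
        return true;
    }

    int width()  const { return width_; }
    int height() const { return height_; }
    unsigned style() const { return style_; }

private:
    unsigned     style_  = 0;
    int          width_  = 0;
    int          height_ = 0;
    std::wstring title_;
};
```

A caller that is happy with the defaults writes `window.Create();`, while one that wants a wider window fills in a WindowOptions, sets width, and passes its address.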

Note that we are missing our declaration for wcex.lpfnWndProc. This variable will register our message processing function. Because of the setup, this function must be static and hence will not be able to call specific functions of the object to handle message processing for specific messages. A typical WNDPROC function header looks like this:
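For reference, the standard four-parameter shape is shown below. The function body is only a placeholder, and the type aliases in the #else branch are stand-ins (an assumption, not the SDK headers) so that the sketch also compiles on a machine without windows.h.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <cstdint>
// Stand-ins so the sketch compiles off Windows (not the SDK definitions).
#define CALLBACK
using HWND    = void*;
using UINT    = unsigned int;
using WPARAM  = std::uintptr_t;
using LPARAM  = std::intptr_t;
using LRESULT = std::intptr_t;
using WNDPROC = LRESULT (CALLBACK*)(HWND, UINT, WPARAM, LPARAM);
#endif

// The four-parameter signature Windows expects for lpfnWndProc.
LRESULT CALLBACK WndProc(HWND hWnd, UINT message, WPARAM wParam, LPARAM lParam) {
    (void)hWnd; (void)message; (void)wParam; (void)lParam;
    // A real handler would switch on `message` and fall back to DefWindowProc.
    return 0;
}
```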

Eventually, we will use a thunk to effectively overload the function call with a fifth parameter that holds a pointer to our object's this. Before we do that, we'll declare our WndProc function. This is just a standard WndProc function that handles the WM_PAINT and WM_DESTROY messages, just enough to get a window up.

Here, we have declared a fifth parameter that will carry our this pointer. Windows will call the procedure with only the four standard parameters, so we need to intercept the call and supply a fifth parameter that points to our class object. This is where the thunk comes in.

Again, a thunk is a bit of executable code on the heap. Instead of calling the window message procedure directly, Windows will call the thunk as if it were a function. The arguments are already in place before the thunk is called, so all the thunk needs to do is add one more argument and then jump to the originally intended function.
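A minimal sketch of such a declaration, under stated assumptions: the extra parameter is shown in the fifth position, OnMessage and paintCount are hypothetical names, the message constants are written as numbers, and the default branch returns a dummy value where real code would call DefWindowProc. The type aliases in the #else branch are stand-ins so the sketch compiles without windows.h.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <cstdint>
// Stand-ins so the sketch compiles off Windows (not the SDK definitions).
#define CALLBACK
using HWND      = void*;
using UINT      = unsigned int;
using WPARAM    = std::uintptr_t;
using LPARAM    = std::intptr_t;
using LRESULT   = std::intptr_t;
using DWORD_PTR = std::uintptr_t;
#endif

class AppWinClass {
public:
    int paintCount = 0;

    // Static entry point with the extra parameter that the thunk will supply.
    static LRESULT CALLBACK WndProc(HWND hWnd, UINT message, WPARAM wParam,
                                    LPARAM lParam, DWORD_PTR This) {
        auto* self = reinterpret_cast<AppWinClass*>(This);
        return self->OnMessage(hWnd, message, wParam, lParam);
    }

private:
    // Non-static handler reached through the recovered object pointer.
    LRESULT OnMessage(HWND, UINT message, WPARAM, LPARAM) {
        switch (message) {
        case 0x000F:             // WM_PAINT
            ++paintCount;
            return 0;
        case 0x0002:             // WM_DESTROY (real code would call PostQuitMessage)
            return 0;
        default:
            return 1;            // real code would return DefWindowProc(...)
        }
    }
};
```

Calling the static function by hand with a valid object pointer, as a test harness might, exercises the same path the thunk takes at run time.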

A couple of notes. Because of DEP (Data Execution Prevention), we must allocate some heap that is marked executable for this process. Otherwise, DEP will prevent the code from executing and throw an exception. We use HeapCreate with the HEAP_CREATE_ENABLE_EXECUTE bit set. HeapCreate will at a minimum reserve a 4k page of memory and our thunk is very small. Since we don’t want to create a new page for every new thunk instance of every object, we will declare a variable to hold the heap handle so the heap can be reused.

We initialize the static eheapaddr (executable heap address) and objInstances (our marker to count the number of instances of our object) to 0. In the constructor, we first increment objInstances. We do not want to destroy our heap until all other instances of our object are gone. Now, we check if eheapaddr has already been initialized and if not, we give it the value of the handle returned by HeapCreate. We call HeapCreate and specify that we want to enable execution of code on this heap and we want to generate exceptions if this allocation fails. We then wrap this in a try catch statement that will rethrow the exception given by HeapCreate and allow the caller of the object to figure it out.

We’ll allocate our thunk on this heap as well. To do that, we override the new operator for our thunk class so that it allocates from our heap, passing it the handle from HeapCreate. We’ll put this into a try catch statement too in case the allocation fails (because we set HEAP_GENERATE_EXCEPTIONS for HeapCreate, HeapAlloc will also generate exceptions).

We will destroy this heap when our object is deleted so we will update the destructor with the following:

Simply check whether we are the last object instance and, if so, destroy the heap and reset eheapaddr to NULL. Otherwise, decrement objInstances. Note: eheapaddr and objInstances are static, so they outlive any single object; resetting them when the last instance goes away lets a later instance create a fresh heap instead of reusing a stale handle. We do need to call the delete operator on our thunk, which ensures it frees itself from our heap.

A note here: InterlockedIncrement() and InterlockedDecrement() could be used to provide a better multi-threaded approach instead of incrementing and decrementing a plain static counter.
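The idea behind an interlocked counter can be sketched portably with std::atomic, used here as an analogue of the Win32 interlocked functions rather than as the article's code. Counted, heapCreates, and heapDestroys are hypothetical names, and the counters merely mark where the heap would be created and destroyed. An atomic count by itself does not stop another thread from constructing an object while the last one is tearing the heap down, so a real implementation would still need a lock around those two steps.

```cpp
#include <atomic>

class Counted {
public:
    Counted() {
        // fetch_add returns the previous value: 0 means we are the first instance.
        if (instances.fetch_add(1) == 0) {
            ++heapCreates;    // the executable heap would be created here
        }
    }

    ~Counted() {
        // fetch_sub returns the previous value: 1 means we are the last instance.
        if (instances.fetch_sub(1) == 1) {
            ++heapDestroys;   // the heap would be destroyed and its handle reset here
        }
    }

    Counted(const Counted&) = delete;
    Counted& operator=(const Counted&) = delete;

    static std::atomic<long> instances;
    static std::atomic<int>  heapCreates;
    static std::atomic<int>  heapDestroys;
};

std::atomic<long> Counted::instances{0};
std::atomic<int>  Counted::heapCreates{0};
std::atomic<int>  Counted::heapDestroys{0};
```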

Now we can declare our thunk class. Because x32 and x64 differ in how they handle the stack and function calls, we need to wrap the declaration in #if defined statements. We use _M_IX86 for 32-bit apps and _M_AMD64 for 64-bit apps.

The idea is we create a structure and place variables in a specific order at the top. When we make a call to this “function”, we are instead calling into the top of the memory of the structure and will begin to execute the code stored in the variables at the top.

We use the #pragma pack(push,#) declaration to align the bytes correctly for execution, otherwise the compiler may pad the variables (and does so anyway with the x64 set).

For x32, we require 7 variables. We then assign them the hexadecimal equivalent of our x86 assembly code. The assembly looks like the following:

push dword ptr [esp] ;push return address
mov dword ptr [esp+0x4], pThis ;esp+0x4 is the location of the first element in the function header
;and pThis is the value of the pointer to our object’s "this"
jmp WndProc ;where WndProc is the message processing function

Because we do not know the value of pThis or WndProc before the program runs, we need to collect these at runtime. So we create a function in the structure to initialize these variables and we will pass both the location of the message processing function and pThis.

We also need to flush the instruction cache to ensure our new code is visible and the processor will not try to execute stale instructions. If the flush succeeds (FlushInstructionCache returns a nonzero value), we return true; otherwise we return false and let the program know we had a problem.

A few notes on what is going on in our 32-bit code. Following the calling convention, we need to preserve the stack frame of the calling function (remember, it is calling a function it thinks takes four parameters). On entry, the caller's return address is at the top of the stack, which is where esp points. So we dereference esp and push that value (push [esp]), which decrements esp and adds a new "layer" holding a second copy of the return address, making room for our extra parameter. Next, we move our object pointer to esp+4 (overwriting the original copy of the return address), where it becomes the first parameter of our function call (conceptually, we shifted the existing parameters one position to the right). In Init, m_mov is given the hexadecimal equivalent of mov dword ptr [esp+0x4]. We then assign the value of pThis to m_this to complete the mov instruction. m_jmp gets the hexadecimal equivalent of the jmp opcode. Finally, because this form of jmp takes an offset relative to the instruction that follows it, we calculate the distance from the end of the thunk to the message processing function and assign it to m_relproc (the relative position of our procedure).
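The byte layout described above can be checked with a small, host-independent sketch. It is not the article's winThunk: Thunk32 and its member names are hypothetical, byte arrays are used for the opcodes so that no byte swapping is needed, and the addresses are passed in as plain 32-bit integers so that the layout can be verified on any machine. In a real 32-bit build, they would come from the thunk's own address, the static WndProc, and this.

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct Thunk32 {
    std::uint8_t  pushRet[3];  // FF 34 24     push dword ptr [esp]
    std::uint8_t  movOp[4];    // C7 44 24 04  mov dword ptr [esp+4], imm32
    std::uint32_t pThis;       // imm32: the object pointer
    std::uint8_t  jmpOp;       // E9           jmp rel32
    std::uint32_t relProc;     // rel32: target minus address of the next instruction

    // thunkAddr: where this struct lives; proc: the handler; self: the object.
    void Init(std::uint32_t thunkAddr, std::uint32_t proc, std::uint32_t self) {
        pushRet[0] = 0xFF; pushRet[1] = 0x34; pushRet[2] = 0x24;
        movOp[0] = 0xC7; movOp[1] = 0x44; movOp[2] = 0x24; movOp[3] = 0x04;
        pThis = self;
        jmpOp = 0xE9;
        // The jmp is the last instruction, so "next instruction" is the end of the thunk.
        const std::uint32_t end = thunkAddr + static_cast<std::uint32_t>(sizeof(Thunk32));
        relProc = proc - end;  // unsigned wrap-around yields the correct negative offsets too
    }
};
#pragma pack(pop)

static_assert(sizeof(Thunk32) == 16, "the three instructions occupy exactly 16 bytes");
```

With a thunk at 0x1000 and a handler at 0x2000, the stored offset is 0x2000 - 0x1010 = 0x0FF0.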

We also need to override new and delete for our struct to properly allocate the object on our executable heap.
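The mechanics of class-specific new and delete can be sketched without the WinAPI. In this illustration, ToyHeap is a hypothetical stand-in for the handle returned by HeapCreate (it simply wraps malloc and counts live blocks), and Thunk is a placeholder type. With the real API, the two marked calls would presumably be HeapAlloc and HeapFree on the executable heap.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Stand-in for the executable heap (assumption: real code would use a HANDLE).
struct ToyHeap {
    int live = 0;

    void* allocate(std::size_t n) {
        void* p = std::malloc(n);      // real code: allocate from the executable heap
        if (!p) throw std::bad_alloc();
        ++live;
        return p;
    }

    void release(void* p) {
        if (p) {
            std::free(p);              // real code: free back to the executable heap
            --live;
        }
    }
};

struct Thunk {
    unsigned char code[16] = {};

    static ToyHeap* heap;  // remembered so plain `delete` can find the heap again

    // Used by `new (someHeap) Thunk`.
    static void* operator new(std::size_t n, ToyHeap& h) {
        heap = &h;
        return h.allocate(n);
    }

    // Matching placement delete: called only if the constructor throws.
    static void operator delete(void* p, ToyHeap& h) { h.release(p); }

    // Used by an ordinary `delete ptr`.
    static void operator delete(void* p) {
        if (heap) heap->release(p);
    }
};

ToyHeap* Thunk::heap = nullptr;
```

Usage looks like `Thunk* t = new (heap) Thunk;` followed later by `delete t;`, so callers never touch the allocator directly.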

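Also note that Intel processors use the "little endian" format, so when a multi-byte instruction is stored through an integer variable, the bytes of the constant must be written in reverse order (the low-order byte of the value is placed first in memory) [this applies to x64 as well].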
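A small sketch makes the byte order concrete. It assumes a little-endian host such as x86 or x64, and memoryBytes is a hypothetical helper: to have the two bytes FF 34 appear in that order in memory through a 16-bit variable, the constant has to be written as 0x34FF.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Returns the bytes of a 16-bit value in the order they sit in memory.
inline std::array<std::uint8_t, 2> memoryBytes(std::uint16_t v) {
    std::array<std::uint8_t, 2> out{};
    std::memcpy(out.data(), &v, sizeof v);
    return out;
}
```

On such a host, memoryBytes(0x34FF) yields FF followed by 34, which is the start of the push dword ptr [esp] encoding.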

The x64 version follows the same principles, but we need to account for some differences in how x64 handles the stack and to compensate for some alignment issues. The Windows x64 ABI uses the following scheme for passing arguments to functions (note that it doesn't push or pop them; it is similar to a fastcall). The first parameter is moved to rcx. The second parameter is moved to rdx. The third parameter is moved to r8. The fourth parameter is moved to r9. Any further parameters are placed on the stack, but there is a trick. The ABI reserves space on the stack for storing the first four parameters (referred to as shadow space). Hence, there are four 8-byte slots reserved at the top of the stack. The return address sits at the very top, at [rsp], with the shadow space directly above it in memory. So the fifth parameter is placed on the stack at position rsp+0x28.

For non-static member function calls, the first parameters are passed as follows: this goes in rcx, then rdx (1st param), r8 (2nd param), r9 (3rd param), rsp+0x28 (4th param), and rsp+0x30 (5th param). For static functions, the 1st param goes in rcx, then rdx (2nd param), r8 (3rd param), r9 (4th param), and rsp+0x28 (5th param). Our WndProc is static, so we need to place our value at rsp+0x28.

We encounter a problem in that one of the instructions (mov [rsp+0x28], rax) is a 5-byte instruction, and the compiler tries to align every member on a 1, 2, 4, 8, or 16 byte boundary. So we need to do some manual alignment, which requires adding a no-operation (nop) instruction [0x90]. Otherwise, the same principles apply. Note that because the addresses for pThis and proc are 64-bit values, we need to use the movabs instruction, which loads a 64-bit immediate into rax.
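The offsets quoted above follow from simple arithmetic: at entry to the callee, [rsp] holds the 8-byte return address, and every parameter, including the four that travel in registers, owns the next 8-byte slot above it. A short worked check, with stackSlotOffset as a hypothetical helper name:

```cpp
#include <cstddef>

// Offset from rsp, at entry to the callee, of the stack slot that belongs to
// the n-th argument (1-based) under the Windows x64 convention.
// Slot 0 is the return address, so argument n sits at 8 * n.
constexpr std::size_t stackSlotOffset(std::size_t n) { return 8 * n; }

static_assert(stackSlotOffset(1) == 0x08, "shadow slot for rcx");
static_assert(stackSlotOffset(4) == 0x20, "shadow slot for r9");
static_assert(stackSlotOffset(5) == 0x28, "first argument that really lives on the stack");
static_assert(stackSlotOffset(6) == 0x30, "second stack argument");
```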
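The instruction sequence can again be laid out and checked on any host. This sketch is not the article's winThunk: Thunk64 and its member names are hypothetical, and because the opcodes are stored as byte arrays under #pragma pack(1), the compiler inserts no padding, so this version omits the nop that the text describes.

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct Thunk64 {
    std::uint8_t  movabs1[2];  // 48 B8           movabs rax, imm64
    std::uint64_t pThis;       // imm64: the object pointer
    std::uint8_t  store[5];    // 48 89 44 24 28  mov [rsp+0x28], rax
    std::uint8_t  movabs2[2];  // 48 B8           movabs rax, imm64
    std::uint64_t proc;        // imm64: address of the five-parameter WndProc
    std::uint8_t  jmpRax[2];   // FF E0           jmp rax

    void Init(std::uint64_t procAddr, std::uint64_t self) {
        movabs1[0] = 0x48; movabs1[1] = 0xB8;
        pThis = self;
        store[0] = 0x48; store[1] = 0x89; store[2] = 0x44;
        store[3] = 0x24; store[4] = 0x28;
        movabs2[0] = 0x48; movabs2[1] = 0xB8;
        proc = procAddr;
        jmpRax[0] = 0xFF; jmpRax[1] = 0xE0;
    }
};
#pragma pack(pop)

static_assert(sizeof(Thunk64) == 27, "10 + 5 + 10 + 2 bytes of instructions");
```

Unlike the 32-bit version, no relative offset is computed, since jmp rax takes an absolute address.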

We now have our message handler and our thunk. We can now assign the value to lpfnWndProc.

Caution: we use two different parameter orders, one for 32-bit and one for 64-bit. In our 32-bit code, our pointer is the first parameter. In our 64-bit code, it is the fifth parameter. We need to account for this by wrapping our code in some preprocessor directives.

Some thunks may dynamically allocate their memory, so we use the GetThunkAddress function, which simply returns the thunk's own this pointer. We cast the result to WNDPROC, as that is what our window class is expecting.

Now we register our WNDCLASSEX structure. We’ll declare a public variable classatom to hold the return of RegisterClassEx for future use if wanted. And we call RegisterClassEx.

Now we call CreateWindowEx and pass along the variables. If the WS_VISIBLE bit was set, we do not need to call ShowWindow, so we check for that. We call UpdateWindow and then enter our message loop. And we are done.

*One additional note. I use DWORD_PTR This in my WndProc declaration. In my opinion, this better demonstrates the principle. However, to avoid a useless conversion, declare it as AppWinClass* This.

History

Version 1.1

Corrected the lack of setting objInstances back to zero on the last instance of an object. Added two notes.

Version 1.5

Changed the 32-bit convention for the thunk to properly preserve the stack, per pVerer's suggestion in the comments. This precipitated a need to wrap the WndProc function in some #if defined directives, as the 32-bit and 64-bit code now call this function differently.

Version 1.6

Made some minor modifications on the thunk's delete operator

Version 1.8

Updated the x64 ABI section to discuss the convention more accurately and clearly. There was a mistake in its description. Also updated the x64 thunk code and made it simpler by using the movabs instruction, which works with 64-bit immediates.

Version 1.8.1

Updated the thunk portions with a minor edit to support VS 2017: changed the Microsoft typedefs USHORT, ULONG, etc. to the proper C++ types (i.e., USHORT became unsigned short); otherwise the code would not execute properly after being compiled by VS 2017 RC.

About the Author

I am a Solution Architect for IU Health architecting software solutions for the enterprise. Prior I had been employed with eBay as a Project Manager and Lead (Enterprise Architect) which focused on managing the scope, timeline, resource, and business expectations of third party merchants and their first-time onboarding to the eBay marketplace. I also acted as an Enterprise Architect leading the merchant in the design, reliability, and scaling of their infrastructure (both hardware and software). Prior I worked for Adaptive Computing as a Sales Engineer. I was responsible for working with customers and helping with the sales life cycle and providing input to Program Management. Prior I was employed as a High Performance Computing Analyst. I was responsible for administering and maintaining a high performance / distributed computing environment that is used by the bank for financial software and analysis. Previous, I had been employed at ARINC, where as a Principal Engineer I worked in the Systems Integration and Test group. The division I worked in represented the service end of the airline communication network. Prior to this position I worked for the government in several contractual roles which allowed me to lead a small team in consulting and evaluating research initiatives and their funding related to discovering and negating threats from weapons of mass destruction. Other contracts included helping the Navy in a Signals Intelligence role as the lead hardware and systems architect/engineer; a Senior Operations Research Analyst assessing force realignment and restructuring for the joint Explosive Ordnance Disposal community with a project assessing the joint force structure of the EOD elements, both manning, infrastructure, and overall support and command chain structures. I also spent three years involved with the issue of Improvised Explosive Devices and for the Naval EOD Technology Division where I acted as the lead engineer for exploitation.

I have encountered an error while playing with Visual Studio 2017 RC.
For some reason, version 1.8 of your code triggers an exception when the program calls the winThunk WndProc. It crashes in the CreateWindow function, but you get the same behavior by using any other function that calls WndProc (CallWndProc for example). Version 1.6 works with no problem.
I encountered this problem with both the x86 and x64 versions. Tested on Win7 x64 and Win10 x64.
Any ideas?

I haven't tested yet in VS2017, but I would suspect the packing of the assembly instructions is off. The compiler tries to align the variables along particular byte boundaries and sometimes adds some "padding" to achieve this. In the latest version, I have an instruction that takes 3 bytes to adjust the stack. Usually the compiler tries to align everything on even byte boundaries, so I find an extra set of "00" around the intended code. I had to play around extensively with the variable types to get the proper alignment. The "pragma pack" is supposed to help with this, but I didn't find it did that much. I'm certain there is some type of compiler setting that would alleviate this, but I haven't been able to find one yet. The older code had a different set of instructions that were aligned slightly differently. After reviewing the code further, I found a few efficiency gains, so I changed it up.

I'll try to set up a VS2017 instance, but it's fairly simple to debug. Put a breakpoint right after the winThunk is first initialized (the best place is at if (!varStruct) in the "Create" function). Open the "Disassembly" window and enter pProc in the address field.

This will take you right to the location of the assembly. Just verify if it matches the intended assembly (which I listed in the comments).

I'll verify myself as soon as I can get a copy of VS2017 up and running.

But given the thunks live somewhere in a heap dedicated to them, I am concerned about an increased pressure on CPU cache, as it makes data locality worse, right? Therefore it would be great to see some real numbers showing whether and how much faster on a modern CPUs this approach is in comparison to other methods like e.g. WNDCLASS::cbWndExtra.

It would be an interesting project to benchmark - I just might take that task on.

I would suspect that using cbWndExtra would be a better approach if you weren't writing a library or reusable code, i.e., if you were the only user of the code and/or could ensure cbWndExtra wouldn't be used for something else or overwritten.

ATL takes the approach of using a thunk that first calls a special ATL WndProc function and overwrites the hWnd parameter. Before calling the actual WndProc, it uses a linked list to fill in all the details, including the hWnd. That seemed overly complicated compared to this approach, where we break a paradigm (the fifth parameter in the WndProc) but achieve (to my view at least) simplicity and efficiency. We also preserve cbWndExtra to be used for other purposes. So this approach should beat out ATL at least.

So I instrumented the code and ran performance measurements comparing cbWndExtra against this thunk technique. Measuring calls to both on my particular machine, the thunking methodology comes out ahead on average. This is mostly because you have to make an extra function call to get the WndExtra value, whereas in the other method the pointer is already on the stack. The data heap does not seem to have any effect. I tested against painting to the window and running several mathematical algorithms, and while OS preemption does add to the overhead on the timing, thunking always came out ahead, though the margin ranged from as little as 10ns to a few hundred depending on the complexity of the code.

The creation of the op code in your winThunk::Init function needs to be corrected.

On the step where you "m_mov2 = 0x04" this code fails because m_mov2 is defined as a DWORD and the result puts a 0x0004 in your thunk code. The result is that the m_this pointer is not copied correctly because it has an extra 0x00 from the DWORD.

You need to change the definition of the m_mov2 to a CHAR. In this way when it moves the instruction into place it only moves the 0x04 and your op code is correct.

Thanks for the report. You caught my error. That should indeed be declared as a type BYTE and not a DWORD. It looks like I made the error when transcribing from my running code. I've updated the code to reflect the correct declaration.

You are correct here. I made the assumption that the user would be familiar with the nature of the resource.h file and provide their own. The code is to provide an example of a "thunk" that could be used in a wrapper class and was not intended to be a "fully functioning" example, but maybe that's a false assumption to make. I should probably add some notation around this fact.

By calling instruction that modifies stack beyond passed parameters you destroy local state of the caller!
That could crash your application in the future.
Caller pushes only 4 parameters to the stack and then CPU pushes the return address. So the last parameter could be referenced as [esp+0x10]. After (esp+0x14) the stack contains caller's own frame. And you corrupt it by invoking "mov dword ptr [esp+0x14],value".
To do it correctly, you must pop the return address, push your additional parameter (which will be the first in that case) and then push the return address back. Finally, call your custom 5-arity WndProc with the signature (DWORD_PTR pThis, HWND hWnd, UINT message, WPARAM wParam, LPARAM lParam). Here pThis is the first.
pop eax
push value
push eax
jmp ...
But it modifies eax. So doing without eax:
push [esp] ; push return address
mov dword ptr [esp+4],value ; set our new parameter by replacing old return address
jmp ...

You may read about thunk technique in the book (chapter 10.CWindowImpl. The Window Procedure):
ATL Internals. Working with ATL 8. C.Tavares,K.Fertitta,B.Rector,C.Sells.2ed,2006

ATL does not use 5th parameter. It simply replaces the first HWND parameter with custom object address. HWND is previously cached in that object.

You're right about needing to preserve the stack pointer. As I read the information regarding MS _stdcall handling of the stack what you describe shouldn't be an issue but my knowledge of these things is getting rusty, so why take the risk, especially when you present a more elegant solution. I'll update the article to reflect the change.

Thank you for the suggestion.

I reviewed the ATL code and it's a mess. Multiple class inheritances and it uses a linked list to associate the hWnd parameter with the correct object and then a call to SetWindowLongPtr to establish the correct message handler. And it uses the thunk on top of this.

For me it shouldn't be so complicated. I feel that this is a better simpler approach.