Basics of Windows shellcode writing

Table of contents

Introduction

This tutorial is for x86 32bit shellcode. Windows shellcode is a lot harder to write than the shellcode for Linux and you’ll see why. First we need a basic understanding of the Windows architecture, which is shown below. Take a good look at it. Everything above the dividing line is in User mode and everything below is in Kernel mode.

Unlike Linux, in Windows, applications can’t directly accesss system calls. Instead they use functions from the Windows API (WinAPI), which internally call functions from the Native API (NtAPI), which in turn use system calls. The Native API functions are undocumented, implemented in ntdll.dll and also, as can be seen from the picture above, the lowest level of abstraction for User mode code.

The documented functions from the Windows API are stored in kernel32.dll, advapi32.dll, gdi32.dll and others. The base services (like working with file systems, processes, devices, etc.) are provided by kernel32.dll.

So to write shellcode for Windows, we’ll need to use functions from WinAPI or NtAPI. But how do we do that?

ntdll.dll and kernel32.dll are so important that they are imported by every process.

I also wrote a little assembly program that does nothing and it has 3 loaded DLLs:

Notice the base addresses of the DLLs. They are the same across processes, because they are loaded only once in memory and then referenced with pointer/handle by another process if it needs them. This is done to preserve memory. But those addresses will differ across machines and across reboots.

This means that the shellcode must find where in memory the DLL we’re looking for is located. Then the shellcode must find the address of the exported function, that we’re going to use.

The shellcode I’m going to write is going to be simple and its only function will be to execute calc.exe. To accomplish this I’ll make use of the WinExec function, which has only two arguments and is exported by kernel32.dll.

Find the DLL base address

Thread Environment Block (TEB) is a structure which is unique for every thread, resides in memory and holds information about the thread. The address of TEB is held in the FS segment register.

One of the fields of TEB is a pointer to Process Environment Block (PEB) structure, which holds information about the process. The pointer to PEB is 0x30 bytes after the start of TEB.

0x0C bytes from the start, the PEB contains a pointer to PEB_LDR_DATA structure, which provides information about the loaded DLLs. It has pointers to three doubly linked lists, two of which are particularly interesting for our purposes. One of the lists is InInitializationOrderModuleList which holds the DLLs in order of their initialization, and the other is InMemoryOrderModuleList which holds the DLLs in the order they appear in memory. A pointer to the latter is stored at 0x14 bytes from the start of PEB_LDR_DATA structure. The base address of the DLL is stored 0x10 bytes below its list entry connection.

In the pre-Vista Windows versions the first two DLLs in InInitializationOrderModuleList were ntdll.dll and kernel32.dll, but for Vista and onwards the second DLL is changed to kernelbase.dll.

The second and the third DLLs in InMemoryOrderModuleList are ntdll.dll and kernel32.dll. This is valid for all Windows versions (at the time of writing) and is the preferred method, because it’s more portable.

So to find the address of kernel32.dll we must traverse several in-memory structures. The steps to do so are:

Get address of PEB with fs:0x30

Get address of PEB_LDR_DATA (offset 0x0C)

Get address of the first list entry in the InMemoryOrderModuleList (offset 0x14)

Get address of the second (ntdll.dll) list entry in the InMemoryOrderModuleList (offset 0x00)

Get address of the third (kernel32.dll) list entry in the InMemoryOrderModuleList (offset 0x00)

Get the base address of kernel32.dll (offset 0x10)

The assembly to do this is:

movebx,fs:0x30; Get pointer to PEBmovebx,[ebx+0x0C]; Get pointer to PEB_LDR_DATAmovebx,[ebx+0x14]; Get pointer to first entry in InMemoryOrderModuleListmovebx,[ebx]; Get pointer to second (ntdll.dll) entry in InMemoryOrderModuleListmovebx,[ebx]; Get pointer to third (kernel32.dll) entry in InMemoryOrderModuleListmovebx,[ebx+0x10]; Get kernel32.dll base address

They say a picture is worth a thousand words, so I made one to illustrate the process. Open it in a new tab, zoom and take a good look.

If a picture is worth a thousand words, then an animation is worth (Number_of_frames * 1000) words.

When learning about Windows shellcode (and assembly in general), WinREPL is really useful to see the result after every assembly instruction.

Find the function address

Now that we have the base address of kernel32.dll, it’s time to find the address of the WinExec function. To do this we need to traverse several headers of the DLL. You should get familiar with the format of a PE executable file. Play around with PEView and check out some great illustrations of file formats.

Relative Virtual Address (RVA) is an address relative to the base address of the PE executable, when its loaded in memory (RVAs are not equal to the file offsets when the executable is on disk!).

In the PE format, at a constant RVA of 0x3C bytes is stored the RVA of the PE signature which is equal to 0x5045.0x78 bytes after the PE signature is the RVA for the Export Table.0x14 bytes from the start of the Export Table is stored the number of functions that the DLL exports.
0x1C bytes from the start of the Export Table is stored the RVA of the Address Table, which holds the function addresses.0x20 bytes from the start of the Export Table is stored the RVA of the Name Pointer Table, which holds pointers to the names (strings) of the functions.0x24 bytes from the start of the Export Table is stored the RVA of the Ordinal Table, which holds the position of the function in the Address Table.

Write the shellcode

Now that you’re familiar with the basic principles of a Windows shellcode it’s time to write it. It’s not much different than the code snippets I already showed, just have to glue them together, but with minor differences to avoid null bytes. I used flat assembler to test my code.

The instruction “mov ebx, fs:0x30” contains three null bytes. A way to avoid this is to write it as:

When I started learning about shellcode writing, one of the things that got me confused is that in the disassembled output the jump instructions use absolute addresses (for example look at address 401070: “je0x40107c”), which got me thinking how is this working at all? The addresses will be different across processes and across systems and the shellcode will jump to some arbitrary code at a hardcoded address. Thats definitely not portable! As it turns out, though, the disassembled output uses absolute addresses for convenience, in reality the instructions use relative addresses.

Look again at the instruction at address 401070 (“je0x40107c”), the opcodes are “74 0a”, where 74 is the opcode for je and 0a is the operand (it’s not an address!). The EIP register will point to the next instruction at address 401072, add to it the operand of the jump 401072 + 0a = 40107c, which is the address showed by the disassembler. So there’s the proof that the instructions use relative addressing and the shellcode will be portable.

To run it successfully in Visual Studio, you’ll have to compile it with some protections disabled:
Security Check: Disabled (/GS-)
Data Execution Prevention (DEP): No

Proof that it works :)

Edit 0x00:

One of the commenters, Nathu, told me about a bug in my shellcode. If you run it on an OS other than Windows 10 you’ll notice that it’s not working. This is a good opportunity to challenge yourself and try to fix it on your own by debugging the shellcode and google what may cause such behaviour. It’s an interesting issue :)

In case you can’t fix it (or don’t want to), you can find the correct shellcode and the reason for the bug below…

EXPLANATION:
Depending on the compiler options, programs may align the stack to 2, 4 or more byte boundaries (should by power of 2). Also some functions might expect the stack to be aligned in a certain way.

The alignment is done for optimisation reasons and you can read a good explanation about it here: Stack Alignment.

If you tried to debug the shellcode, you’ve probably noticed that the problem was with the WinExec function which returned “ERROR_NOACCESS” error code, although it should have access to calc.exe!

If you read this msdn article, you’ll see the following:
“Visual C++ generally aligns data on natural boundaries based on the target processor and the size of the data, up to 4-byte boundaries on 32-bit processors, and 8-byte boundaries on 64-bit processors”. I assume the same alignment settings were used for building the system DLLs.

Because we’re executing code for 32bit architecture, the WinExec function probably expects the stack to be aligned up to 4-byte boundary. This means that a 2-byte variable will be saved at an address that’s multiple of 2, and a 4-byte variable will be saved at an address that’s multiple of 4. For example take two variables - 2 byte and 4 byte in size. If the 2 byte variable is at an address 0x0004 then the 4 byte variable will be placed at address 0x0008. This means there are 2 bytes padding after the 2 byte variable. This is also the reason why sometimes the allocated memory on stack for local variables is larger than necessary.

The part shown below (where ‘WinExec’ string is pushed on the stack) messes up the alignment, which causes WinExec to fail.

; push the function name on the stackxoresi,esipushesi; null terminationpush63hpushw6578h; THIS PUSH MESSED THE ALIGNMENTpush456e6957hmov[ebp-4],esp; var4 = "WinExec\x00"

To fix it change that part of the assembly to:

; push the function name on the stackxoresi,esi; null terminationpushesipush636578h; NOW THE STACK SHOULD BE ALLIGNED PROPERLYpush456e6957hmov[ebp-4],esp; var4 = "WinExec\x00"

The reason it works on Windows 10 is probably because WinExec no longer requires the stack to be aligned.

Below you can see the stack alignment issue illustrated:

With the fix the stack is aligned to 4 bytes:

Edit 0x01:

Although it works when it’s used in a compiled binary, the previous change produces a null byte, which is a problem when used to exploit a buffer overflow. The null byte is caused by the instruction “push 636578h” which assembles to “68 78 65 63 00”.

Resources

For the pictures of the TEB, PEB, etc structures I consulted several resources, because the official documentation at MSDN is either non existent, incomplete or just plain wrong. Mainly I used ntinternals, but I got confused by some other resources I found before that. I’ll list even the wrong resources, that way if you stumble on them, you won’t get confused (like I did).

[0x05] The information for the TEB, PEB, PEB_LDR_DATA and LDR_MODULE I took from here (they are actually the same as the ones used in resource 0x04, but it’s always good to fact check :) ).https://undocumented.ntinternals.net/