Wednesday, May 30, 2012

The Internet is full of programmers' forums, and those forums are full of questions about the CreateRemoteThread Windows API function not working on Windows 7 (when trying to inject a DLL). The luckier posters get redirected to the MSDN page dedicated to this API, which says: "Terminal Services isolates each terminal session by design. Therefore, CreateRemoteThread fails if the target process is in a different session than the calling process." The usual advice that follows boils down to this: start the process from your injector as suspended, inject your DLL and then resume the process' main thread. This works... Most of the time... But sometimes you really need to inject your code into a running process. Isn't there a way to do that? Well, there is. As a matter of fact, it is so easy that I decided not to attach my source code to this article (mainly because I am too lazy to make it look readable :) ). It appears that I am not the only lazy one here :), so I have uploaded the source code.

Let me start as usual, with a note for nerds in order to avoid meaningless comments and stupid discussions.

The code provided within the article is for example purposes only. Error checks have been omitted on purpose. Yes, there may be another, probably even better, way of doing this. No, manual DLL mapping is not better unless you have plenty of time and nothing to do with it.

All others, let's get to business :)

Opening the Victim Process

This is the easiest part. At this stage you will see whether you are able to inject your code or not (a system process, for example, may refuse you). Nothing unusual here - you simply invoke the good old OpenProcess API

HANDLE WINAPI OpenProcess(
    DWORD dwDesiredAccess, /* in our case PROCESS_ALL_ACCESS */
    BOOL bInheritHandle,   /* no need, so FALSE */
    DWORD dwProcessId      /* self explanatory enough */
);

which opens the process specified by dwProcessId and returns a handle to that process, or NULL if you lack sufficient rights to access it.

Reading the Shellcode

What you usually see in shellcode examples on the Internet is an unsigned char array of hexadecimal values somewhere in the C code. This keeps the number of files down, but is not really comfortable to deal with. I decided to store the shellcode in a separate binary file, produced with FASM (Flat Assembler):

use32

    ; offset of the LoadLibraryA address within the shellcode
    dd func

    ; save all registers
    push eax ebx ecx edx ebp edi esi

    ; get your EIP
    call next
next:
    pop eax
    mov ebx, eax

    ; get the address of the DLL name
    mov eax, string - next
    ; do this to avoid possible negative values (due to sign extend)
    movzx eax, al
    add eax, ebx

    ; pass it to the LoadLibraryA API
    push eax

    ; get the address of the LoadLibraryA function
    mov eax, func - next
    movzx eax, al
    add eax, ebx
    mov eax, [eax]

    ; call LoadLibraryA
    call eax

    ; restore registers
    pop esi edi ebp edx ecx ebx eax

    ; return
    ret

func dd 0x12345678 ; placeholder for the address

string:

Compiling this code with FASM.EXE produces a raw binary file, where all offsets are 0-based. Some parts of the code above may require additional explanation (for example, why it does not end with ExitThread()). I am aware of this and will provide the explanation a little bit later.

For now, allocate an unsigned char buffer for your shellcode. Make this buffer large enough to contain the shellcode and the name of the DLL with its terminating zero (my assumption is that you passed that name as a command line parameter to your injector).

Once you have read the shellcode into that buffer, append the name of the DLL (which may be a full path to the DLL) to the end of the shellcode with, for example, the memcpy() function. Half done with it. Now we still have to "tell" the shellcode where the LoadLibraryA API function is located in memory. Fortunately, the load address randomization in Windows is far from being perfect (addresses of loaded modules may vary between subsequent reboots, but are the same for all processes). This means that, just as in the usual DLL injection, we obtain the address of this API in our own process by calling good old GetProcAddress(GetModuleHandleA("kernel32.dll"), "LoadLibraryA") and save it to the "func" variable of the shellcode. Because our shellcode may vary in size from time to time (that depends on the needs), we saved the offset of that variable in the first four bytes of the shellcode, which eliminates the need to hardcode the offset: simply read the offset from the first four bytes of the buffer and write the address at that offset.
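The buffer preparation described above can be sketched in portable C. This is only a minimal sketch under a few assumptions: build_payload and its parameters are hypothetical names (they do not come from the attached source), the LoadLibraryA address is passed in as a plain 32-bit value instead of being obtained with GetProcAddress, and the host is little-endian like the x86 target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'ed copy of the shellcode with the
 * LoadLibraryA address patched in and the DLL name appended (caller frees).
 * Returns NULL on malformed shellcode or allocation failure. */
unsigned char *build_payload(const unsigned char *code, size_t code_len,
                             uint32_t load_library_addr,
                             const char *dll_name, size_t *out_len)
{
    uint32_t func_offset;
    size_t name_len;
    unsigned char *buf;

    assert(code != NULL && dll_name != NULL && out_len != NULL);
    name_len = strlen(dll_name) + 1;          /* keep the terminating zero */

    if (code_len < sizeof(func_offset))
        return NULL;
    /* The first four bytes ("dd func") hold the offset of the placeholder */
    memcpy(&func_offset, code, sizeof(func_offset));
    if (func_offset > code_len - sizeof(load_library_addr))
        return NULL;

    buf = malloc(code_len + name_len);
    if (buf == NULL)
        return NULL;

    memcpy(buf, code, code_len);
    /* Overwrite the 0x12345678 placeholder with the real address */
    memcpy(buf + func_offset, &load_library_addr, sizeof(load_library_addr));
    /* The "string:" label is the very end of the shellcode, so the DLL
     * name goes right after the last byte of the code */
    memcpy(buf + code_len, dll_name, name_len);

    *out_len = code_len + name_len;
    return buf;
}
```

The size stored in out_len is the value that would later be passed to WriteProcessMemory as the size of the shellcode together with the appended string.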

As the title of this paragraph suggests - we are not going to use the CreateRemoteThread(). In fact, we are not going to create any thread in the victim process (well, the injected DLL may, but the shellcode won't).

Code Injection

Surely, we need to move our shellcode into the victim process' address space in order to load our library. We are doing it in the same manner as we would copy the name of the DLL in the regular DLL injection procedure:

Allocate memory in the remote process with

LPVOID WINAPI VirtualAllocEx(
    HANDLE hProcess,        /* the handle we obtained with OpenProcess */
    LPVOID lpAddress,       /* preferred address; may be NULL */
    SIZE_T dwSize,          /* size of the allocation in bytes */
    DWORD flAllocationType, /* MEM_COMMIT */
    DWORD flProtect         /* PAGE_EXECUTE_READWRITE */
);

This function returns the address of the allocation in the address space of the victim process or NULL if it fails.

Copy the shellcode into the buffer we've just allocated in the address space of the victim process:

BOOL WINAPI WriteProcessMemory(
    HANDLE hProcess,                /* same handle as above */
    LPVOID lpBaseAddress,           /* address of the allocation */
    LPCVOID lpBuffer,               /* address of the local buffer with the shellcode */
    SIZE_T nSize,                   /* size of the shellcode together with the appended NULL-terminated string */
    SIZE_T *lpNumberOfBytesWritten  /* if this is zero - check your code */
);

If the return value of this function is non zero, we have successfully copied our shellcode into the victim process' address space. It may also be a good idea to check the value returned in lpNumberOfBytesWritten.

Make It Run

So, we have copied our shellcode. The only thing left is to make it run, but we cannot use the CreateRemoteThread() API... The solution is a bit more complicated.

First of all, we have to suspend all threads of the victim process. In general, suspending only one thread is enough, but, as we cannot know for sure what is going on there, we should suspend them all. There is no specific API that would provide us with the list of threads of a specified process. Instead, we have to create a snapshot with CreateToolhelp32Snapshot, which provides us with the list of all currently running threads of all processes running in the system:

HANDLE WINAPI CreateToolhelp32Snapshot(
    DWORD dwFlags,       /* TH32CS_SNAPTHREAD = 0x00000004 */
    DWORD th32ProcessID  /* in this case may be 0 */
);

This function returns the handle to the snapshot, which contains information on all present threads. Once we have this, we "iterate through the list" with Thread32First and Thread32Next API functions:

BOOL WINAPI Thread32First(
    HANDLE hSnapshot,     /* the handle to the snapshot */
    LPTHREADENTRY32 lpte  /* pointer to the THREADENTRY32 structure */
);

The Thread32Next has the same prototype as Thread32First.

typedef struct tagTHREADENTRY32 {
    DWORD dwSize;             /* size of this struct; you have to initialize this field before use */
    DWORD cntUsage;
    DWORD th32ThreadID;       /* use this value to open thread for suspension */
    DWORD th32OwnerProcessID; /* compare this value against the PID of the victim
                                 to filter out threads of other processes */
    LONG tpBasePri;
    LONG tpDeltaPri;
    DWORD dwFlags;
} THREADENTRY32, *PTHREADENTRY32;

For each THREADENTRY32 with matching th32OwnerProcessID, open it with OpenThread() and suspend with SuspendThread:

HANDLE WINAPI OpenThread(
    DWORD dwDesiredAccess, /* THREAD_ALL_ACCESS */
    BOOL bInheritHandle,   /* FALSE */
    DWORD dwThreadId       /* th32ThreadID field of THREADENTRY32 structure */
);

and

DWORD WINAPI SuspendThread(
    HANDLE hThread /* Obtained by OpenThread() */
);

Don't forget to CloseHandle(openedThread) :)

Take the first thread once it is opened and suspended (actually, any thread that belongs to the victim process will do) and get its CONTEXT (see "Community Additions" here) using the GetThreadContext API. Remember to set the ContextFlags field of the structure (for example, to CONTEXT_FULL) before the call, as it tells the function which parts of the context to retrieve:

BOOL WINAPI GetThreadContext(
    HANDLE hThread,      /* handle to the thread */
    LPCONTEXT lpContext  /* pointer to the CONTEXT structure */
);

Now that all the threads of the victim process are suspended, we may do our job. The idea is to redirect the execution flow of this thread to our shellcode, but in such a way that the shellcode returns to where the suspended thread currently is. This is also why the shellcode ends with a plain ret rather than ExitThread(): it runs on a borrowed thread, which has to carry on with its own work afterwards. This is not a problem at all, as we have the CONTEXT of the thread. The following code does that just fine:

/* "push" current EIP of the thread onto its stack, so that the ret instruction in the shellcode returns the execution flow to this address (which is somewhere in WaitForSingleObject for suspended threads) */

ctx.Esp -= sizeof(unsigned int);
WriteProcessMemory(victimProcessHandle,
                   (LPVOID)ctx.Esp,
                   (LPCVOID)&ctx.Eip,
                   sizeof(unsigned int),
                   &bytesWritten);

/* Set the EIP to our injected shellcode; do not forget to skip the first four bytes */
ctx.Eip = (DWORD)remoteAddress + sizeof(unsigned int);
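To see why the final ret of the shellcode lands back where the thread was interrupted, here is a small portable model of the trick. It is only a sketch: FakeContext, redirect_thread and simulate_ret are hypothetical names, and a local byte array stands in for the victim's stack memory instead of real CONTEXT and WriteProcessMemory calls.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two CONTEXT fields we care about */
typedef struct {
    uint32_t esp;
    uint32_t eip;
} FakeContext;

/* mem models the victim's stack; mem_base is the "remote" address of mem[0] */
void redirect_thread(FakeContext *ctx, unsigned char *mem, uint32_t mem_base,
                     uint32_t shellcode_addr)
{
    assert(ctx != NULL && mem != NULL);
    assert(ctx->esp >= mem_base + sizeof(uint32_t));

    /* "push" the current EIP: make room, then store the return address */
    ctx->esp -= sizeof(uint32_t);
    memcpy(mem + (ctx->esp - mem_base), &ctx->eip, sizeof(uint32_t));

    /* continue at the shellcode, skipping its four-byte header */
    ctx->eip = shellcode_addr + sizeof(uint32_t);
}

/* What the ret at the end of the shellcode does: pop the saved EIP */
void simulate_ret(FakeContext *ctx, const unsigned char *mem, uint32_t mem_base)
{
    assert(ctx != NULL && mem != NULL);
    memcpy(&ctx->eip, mem + (ctx->esp - mem_base), sizeof(uint32_t));
    ctx->esp += sizeof(uint32_t);
}
```

Since the shellcode restores every register it saved, the stack pointer at its ret equals the value set by the redirection, so the thread ends up with exactly the Esp and Eip it had before it was hijacked.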

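Almost there. Apply the modified context to the thread with SetThreadContext (it takes the same two parameters as GetThreadContext), otherwise the new Esp and Eip values never reach the thread. After that, all we have to do is resume the previously suspended threads with ResumeThread in the same manner (iterating with Thread32First and Thread32Next over the same snapshot handle).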

Don't forget to close the victim process' handle with CloseHandle() ;)

Shellcode

After all this, the execution flow in the selected thread of the victim process reaches our shellcode, whose source code should be pretty clear by now. It simply calls the LoadLibraryA() API function with the name/path of the DLL we want to inject.

One important note - it is a bad practice to do anything "serious" inside the DllMain() function. My suggestion is to create a new thread in DllMain() and do all the work there, so that DllMain() may return safely.

Wednesday, May 23, 2012

Virtual machines and Software Frameworks are an integral part of our digital life. There are complex VMs and simple Software Frameworks. These two articles (Simple Virtual Machine and Simple Runtime Framework by Example) show how easy it may be to implement one yourself. I did my best to describe the way VM code may interact with native code and the Operating System; however, the backwards interaction is still left unexplained. This article is going to fix this omission.

As usual - note for nerds:

The source code given in this article is for example purposes only. I know that this framework is far from being perfect, therefore, this article is not a howto or tutorial - just an explanation of principle. Error checks are omitted on purpose. You want to implement a real framework - do it yourself, including error checks.

By saying VM's code I do not refer to the implementation of the virtual machine, but to the pseudo code that runs inside it.

Architecture Overview

Needless to say, the ability to pass events/signals to code executed by the virtual machine implies a more complex VM architecture. While all previous examples were based on a single function responsible for the execution, adding events means not only adding another function - we will also have to introduce threads to our implementation.

At least two threads are needed:

Actual VM - this thread is responsible for the execution of the VM's executable code and events queue dispatch (processor);

Event Listener - this thread is responsible for collection of relevant events from the Operating System and adding them to the VM's event queue (listener).

Fig.1 VM Architecture with Event Listener

You may see that the Core() function in the attached source code creates an additional thread.

Event Listener

This thread collects events from the Operating System (mouse move, key up/down, etc.) and adds a new entry to the list of EVENT structures.

typedef struct _EVENT
{
    struct _EVENT* next_event; // Pointer to the next event in the queue
    int code;                  // Code of the event
    unsigned int data;         // Either unsigned int data or the address of the buffer
                               // containing information to be passed to the handler
} EVENT;

The code for the listener is quite simple:

while(WAIT_TIMEOUT == WaitForSingleObject(processor_thread, 1))
{
    // Check for events from the OS
    if(event_present)
    {
        EnterCriticalSection(&cs);
        event = (EVENT*)malloc(sizeof(EVENT));
        event->code = whatever_code_is_needed;
        event->data = whatever_data_is_relevant;
        // Terminate the new entry before linking it into the queue
        event->next_event = NULL;
        add_event(event_list, event);
        LeaveCriticalSection(&cs);
    }
}

The code is self explanatory enough. First of all it checks for available events (this part is omitted and replaced by a comment). If there is a new event to pass to the VM, it adds it to the queue. While in this example, event collection is implemented as a loop, in real life, you may do it in a form of callbacks and use the loop above just to wait for the processor thread to exit.
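The queue itself can be as simple as a singly linked list with append-at-tail and take-from-head operations. The following is a minimal sketch and not the code from the attached sources: enqueue_event and dequeue_event are hypothetical names with their own signatures, and locking is left to the caller, who is assumed to hold the critical section as in the loop above.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct _EVENT {
    struct _EVENT *next_event; /* next event in the queue */
    int code;                  /* code of the event */
    unsigned int data;         /* payload or address of a buffer */
} EVENT;

/* Appends a new event to the tail of the queue.
 * Returns 0 on success, -1 if the allocation fails. */
int enqueue_event(EVENT **head, int code, unsigned int data)
{
    EVENT *event;
    EVENT **link;

    assert(head != NULL);
    event = malloc(sizeof *event);
    if (event == NULL)
        return -1;
    event->next_event = NULL;
    event->code = code;
    event->data = data;

    /* Walk to the last "next" pointer and hook the new entry there */
    for (link = head; *link != NULL; link = &(*link)->next_event)
        ;
    *link = event;
    return 0;
}

/* Detaches and returns the oldest event, or NULL if the queue is empty.
 * The caller owns the returned entry and must free() it. */
EVENT *dequeue_event(EVENT **head)
{
    EVENT *event;

    assert(head != NULL);
    event = *head;
    if (event != NULL) {
        *head = event->next_event;
        event->next_event = NULL;
    }
    return event;
}
```

Appending at the tail keeps events in arrival order, so the processor thread handles them first in, first out.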

Processor

Obviously, the "processor" thread is going to be a bit more complicated than in the previous article (Simple Runtime Framework by Example), as in addition to running the run_opcode(CPU**) function, it has to check for pending events and pass the control flow to the associated handler in the VM code.

typedef struct _EVENT_HANDLER
{
    struct _EVENT_HANDLER* next_handler; // Pointer to the next handler
    int event_code;                      // Code of the event
    unsigned int handler_base;           // Address of the handler in the VM's code
} EVENT_HANDLER;

DWORD WINAPI RunningThread(void* param)
{
    CPU* cpu = (CPU*)param;
    EVENT* event;
    EVENT_HANDLER* handler;

    do {
        EnterCriticalSection(&cs);
        if(NULL != events)
        {
            // Detach the oldest pending event from the queue
            event = events;
            events = events->next_event;

            // Look for a handler registered for this event code
            handler = handlers;
            while(NULL != handler && event->code != handler->event_code)
                handler = handler->next_handler;

            // An event nobody registered for is simply dropped
            if(NULL != handler)
            {
                // Save current context by pushing VM registers to VM's stack
                cpu->regs[REG_A] = (unsigned int)event->code;
                cpu->regs[REG_B] = event->data;
                cpu->regs[REG_IP] = handler->handler_base;
            }
            free(event);
        }
        LeaveCriticalSection(&cs);
    } while(0 != run_opcode(&cpu));

    return cpu->regs[REG_A];
}

We are almost done. Our framework already knows how to pass events to the correct handler in the VM's code. Two more things remain to be covered - registering a handler and returning from a handler.

Returning from Handler

Because an Event Handler is not a regular routine, we cannot return from it using the regular RET instruction. Instead, let's introduce another instruction - IRET. As an event actually interrupts the execution flow of the program, IRET (interrupt return) is exactly what we need. The source code that handles this instruction is so simple that there is no need to give it here in the text of the article. All it does is restore the context of the VM's code by popping the registers previously pushed on the stack.
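For readers who prefer to see it anyway, the save and restore pair might look roughly like the sketch below. It rests on several assumptions that are not taken from the attached sources: the register order in the enum, a stack that grows downwards with SP used as an index into it, and the hypothetical function names enter_handler and do_iret.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed register layout; the real indices may differ */
enum { REG_A, REG_B, REG_C, REG_D, REG_SP, REG_IP, REG_COUNT };

typedef struct {
    unsigned int regs[REG_COUNT];
    unsigned int *stack;
} CPU;

static void vm_push(CPU *cpu, unsigned int value)
{
    assert(cpu->regs[REG_SP] > 0);          /* no room left on the VM stack */
    cpu->stack[--cpu->regs[REG_SP]] = value;
}

static unsigned int vm_pop(CPU *cpu)
{
    return cpu->stack[cpu->regs[REG_SP]++];
}

/* Save the interrupted context, then enter the handler with the event
 * code in A and the event data in B */
void enter_handler(CPU *cpu, int event_code, unsigned int data,
                   unsigned int handler_base)
{
    int i;

    assert(cpu != NULL && cpu->stack != NULL);
    vm_push(cpu, cpu->regs[REG_IP]);
    for (i = REG_A; i <= REG_D; i++)
        vm_push(cpu, cpu->regs[i]);

    cpu->regs[REG_A] = (unsigned int)event_code;
    cpu->regs[REG_B] = data;
    cpu->regs[REG_IP] = handler_base;
}

/* IRET: pop the registers in reverse order of the pushes */
void do_iret(CPU *cpu)
{
    int i;

    assert(cpu != NULL && cpu->stack != NULL);
    for (i = REG_D; i >= REG_A; i--)
        cpu->regs[i] = vm_pop(cpu);
    cpu->regs[REG_IP] = vm_pop(cpu);
}
```

The only rule that matters is symmetry: whatever order the registers are pushed in on entry, IRET has to pop them in exactly the reverse order.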

Registering an Event Handler

The last thing left is to "teach" the program written in pseudo assembly to register a handler for a given event type. In order to do this, we need to add one simple system call - SYS_ADD_LISTENER. This system call accepts two parameters:

Code of the event to handle;

Address of the handler function.

loadi A, 0              ; Code of the event
loadi B, handler        ; Address of the handler subroutine
_int sys_add_listener   ; Register the handler
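On the native side, the system call only has to record the pair (event code, handler address) in the list that the processor thread searches. Below is a minimal sketch of that bookkeeping, assuming one handler per event code; add_listener and find_handler are hypothetical names, not the ones used in the attached sources.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct _EVENT_HANDLER {
    struct _EVENT_HANDLER *next_handler; /* next handler in the list */
    int event_code;                      /* code of the event */
    unsigned int handler_base;           /* address of the handler in VM code */
} EVENT_HANDLER;

/* Returns the handler registered for event_code, or NULL if there is none */
EVENT_HANDLER *find_handler(EVENT_HANDLER *handlers, int event_code)
{
    while (handlers != NULL && handlers->event_code != event_code)
        handlers = handlers->next_handler;
    return handlers;
}

/* What SYS_ADD_LISTENER might do with the values of registers A and B.
 * Registering the same event code again replaces the previous handler.
 * Returns 0 on success, -1 if the allocation fails. */
int add_listener(EVENT_HANDLER **handlers, int event_code,
                 unsigned int handler_base)
{
    EVENT_HANDLER *handler;

    assert(handlers != NULL);
    handler = find_handler(*handlers, event_code);
    if (handler == NULL) {
        handler = malloc(sizeof *handler);
        if (handler == NULL)
            return -1;
        handler->event_code = event_code;
        handler->next_handler = *handlers;   /* prepend to the list */
        *handlers = handler;
    }
    handler->handler_base = handler_base;
    return 0;
}
```

Because the processor thread walks the same list, a real implementation would do the registration under the same critical section that protects the event queue.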

Example Code

The example code attached to this article is the implementation of all of the above. It does the following:

Saturday, May 19, 2012

These days we are simply surrounded by different software frameworks. Just to name a few: Java, .Net and, actually, many more. Have you ever wondered how those work or have you ever wanted or needed to implement one? In this article, I will cover a simple or even trivial runtime framework.

As usual - note for nerds:

The source code given in this article is for example purposes only. I know that this framework is far from being perfect, therefore, this article is not a howto or tutorial - just an explanation of principle. Error checks are omitted on purpose. You want to implement a real framework - do it yourself, including error checks.

Now, let's get to business.

Software Framework

Wikipedia gives the following definition of the term "Software Framework" - "A software framework is a universal, reusable software platform used to develop applications, products and solutions. Software Frameworks include support programs, compilers, code libraries, an application programming interface (API) and tool sets that bring together all the different components to enable development of a project or solution". As you can see, a software framework is quite a complex thing. However, let's simplify it and see how it basically works.

Figure 1. Software Framework

The diagram on the left may give you a good understanding of what a Software Framework is and what role it performs. Simply put, it is a shim between the user application and the Operating System. There are at least two types of Software Frameworks:

Application Programming Interface (API) - if we take a look at the Windows API, we may see that it is a framework as well. However, it may be bypassed or, at least, a programmer may choose to decrease the interaction with it by, for example, using functions from ntdll.dll instead of those provided by kernel32.dll, or even "talk" to the Windows kernel directly through interrupts (highly not recommended, but sometimes unavoidable).

.Net like framework - total isolation of user code from the operating system. Such frameworks are mostly virtual machines totally isolating the user application from the operating system and hardware. However, such a framework has to provide the application with all the services available in the Operating System. This is the type of framework we are going to build in this article.

Virtual Machine

The basics of building a simple virtual machine are covered in this article, so I will only give a brief explanation here. Our VM in this example will consist of the following components:

Virtual CPU

A structure that represents a CPU - basically, it has 6 registers and a pointer to the stack:

typedef struct {
    unsigned int regs[6];
    unsigned int* stack;
} CPU;

The 6 registers are the general purpose A, B, C and D (A is also used to store the system call return value and C is used as a counter for the LOOP instruction), plus the STACK POINTER (SP) and the INSTRUCTION POINTER (IP).

Instruction Interpreter

A function or a set of functions responsible for interpretation of the pseudo assembly (or call it intermediate assembly language) designed for this virtual machine (in this case 14 instructions).

System Call Handler

This component provides the means for the user application to interact with the Operating System (in this case 2 system calls: sys_write and sys_exit).

Core Function

The name of the function speaks for itself. This is the first function of the framework implementation which gains control. In this particular case, it does not have too many things to do - initialization of the virtual CPU and execution of the command interpreter, until the user application exits (signals the framework to terminate the execution).
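To make the division of labour between the core function and the interpreter more tangible, here is a deliberately tiny sketch. It is not the framework from the attached sources: the four opcodes, their encoding (one unsigned int per opcode or operand), the register order and the simplified run_opcode signature are all assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed register layout and a hypothetical four-instruction set */
enum { REG_A, REG_B, REG_C, REG_D, REG_SP, REG_IP, REG_COUNT };
enum { OP_EXIT, OP_LOADI, OP_ADD, OP_LOOP };

typedef struct {
    unsigned int regs[REG_COUNT];
    unsigned int *stack;
} CPU;

/* Executes one instruction; returns 0 when the program asks to stop */
int run_opcode(CPU *cpu, const unsigned int *code)
{
    unsigned int ip = cpu->regs[REG_IP];

    switch (code[ip]) {
    case OP_LOADI:                      /* loadi reg, immediate */
        assert(code[ip + 1] < REG_COUNT);
        cpu->regs[code[ip + 1]] = code[ip + 2];
        cpu->regs[REG_IP] = ip + 3;
        return 1;
    case OP_ADD:                        /* add dst, src */
        assert(code[ip + 1] < REG_COUNT && code[ip + 2] < REG_COUNT);
        cpu->regs[code[ip + 1]] += cpu->regs[code[ip + 2]];
        cpu->regs[REG_IP] = ip + 3;
        return 1;
    case OP_LOOP:                       /* loop target: C--, jump while C != 0 */
        cpu->regs[REG_C]--;
        cpu->regs[REG_IP] = (cpu->regs[REG_C] != 0) ? code[ip + 1] : ip + 2;
        return 1;
    case OP_EXIT:
    default:                            /* unknown opcodes stop the VM too */
        return 0;
    }
}

/* Core function: set up the virtual CPU and run until the program exits.
 * The value left in register A is returned as the exit code. */
unsigned int core(const unsigned int *code)
{
    CPU cpu = { {0}, NULL };

    assert(code != NULL);
    while (run_opcode(&cpu, code))
        ;
    return cpu.regs[REG_A];
}
```

A program such as "loadi A, 0; loadi B, 5; loadi C, 3; add A, B; loop to the add; exit" runs the addition three times under this encoding and leaves 15 in register A.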

Implementation

It is a common practice to implement a framework as a DLL (dynamic link library), for example, mscoree.dll - the core of the .Net framework. I do not see any reason to reinvent the wheel, therefore, this framework will be implemented as a DLL as well.

All is fine, you may say, but how should we pass the compiled pseudo assembly code to the framework? Well, I bet, most of you know how to do that. In case you don't - no worries, just keep reading.

In case of .Net framework (at least as far as I know), the loader identifies a file as a .Net executable, reads in the meta header, and initializes the mscoree.dll appropriately. We will not go through all those complications and will use a regular PE file:

The code above produces a tiny executable which invokes framework's core() function. Pseudo assembly code simply prints two messages (the first one is decoded prior to being printed). Full sources are attached to this article (see the very first line).

The good thing is that you do not have to start the interpreter and load this executable (or specify it as a command line parameter) - you may simply run this executable, Windows loader will bind it with the framework.dll automatically. The bad thing is that you would, most probably, have to write your own compiler, because writing assembly is fun, dealing with pseudo assembly is fun as well, BUT, only when done for fun. It is not as pleasant when dealing with production code.

Possible uses

Unless you are trying to create a framework that would outdo existing software frameworks, you may use such an approach to increase the protection of your applications by, for example, virtualizing cryptography algorithms or any other part of your program which is not critical in terms of execution speed, but represents sensitive intellectual property.

Thursday, May 17, 2012

One of the aspects of software anti RE (reverse engineering) protection is the need to protect sensitive data (for example, decryption or license keys). It is quite a common practice to store such data in encrypted form and to use it by passing it to a certain routine for decryption. I am not going to say that this is not a good idea, but the problem with such an approach is that vendors (in most cases) rely only on the complexity of the encryption algorithm, which is not as protective as it is thought to be and too often is placed in a single function (which, potentially, may be ripped and used with malicious intent).

I have already covered the basics of executable code obfuscation in this article. Now it's time to take a look at how data may be hidden (this approach may be used with executable code as well) by, for example, putting it on the stack and using several separate functions to reconstruct the original data.

The idea of hiding data in uninitialized variables (which I am going to talk about here) is not new at all, but is still rarely used, if at all.

Note for nerds:

This is not a tutorial, neither a howto. This is a basic explanation of the concept (no, this is not my invention and yes, there are other ways). The supplied code may be not perfect. It may contain bugs and is given here as an example only.

Needle and the Haystack

While the needle is the data we want to hide, the haystack is our whole program. You may hide data anywhere - data section, code section, etc. You may even spread parts of the data throughout the program. In this particular example, the data pretends to be a part of the key computation algorithm. We will reconstruct the data on the stack (this is thread safe, as every thread has its own stack anyway).

As this is (and I will reiterate this) just an example, our program is quite short:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DATA_SIZE 16

// Implemented in assembler, described below
unsigned int CalcKey(unsigned int seed);
unsigned int Mutate(unsigned int dummy);
unsigned char* GetPtr(void);
void Decode(void);

int main(int argc, char** argv)
{
    unsigned int key;
    char* str;
    char* res = (char*)malloc(sizeof(char) * DATA_SIZE);

    // Calculate pseudo key
    key = CalcKey(0x12345678);

    // Mutate the key (get the actual key)
    key = Mutate(0);

    // Get the pointer to the data
    str = (char*)GetPtr();

    // Decode the data
    Decode();

    // Copy the data to a buffer
    memcpy(res, str, sizeof(char) * DATA_SIZE);

    // Print the data (which is actually a string)
    puts(res);

    free(res);
    return 0;
}

As you may see, there is a set of functions used to construct the hidden data (functions are written in assembler):

unsigned int CalcKey(unsigned int seed);

Uses "seed" to start preparing the decryption key. The value returned by such a function should be used somewhere, for any kind of "decryption" operation, just in order to lead the attacker astray. You may say that sooner or later this move would be disclosed and the attacker would get back to this point and revise it, and you would be right. However, given that a "real life" implementation should be more complicated than the following code, it would take a while until the real purpose is discovered. Even more than that, it would still scare away some "hackers".

The following code is the implementation of the CalcKey function used in this example:

The highlighted constants, which may seem to be a part of the pseudo key calculation are in fact the data. As you can see, we put it on stack and "forget" there. It is important to mention, that you have to be careful if you decide to use stack for this purpose, and make sure the data is not being overwritten by subsequent calls to other functions. In order to make sure this does not happen, the suggestion is to put the actual data further into the stack (e.g. at [ebp - 0x100] instead of [ebp - 0x14] or even further).

I would say it again - make use of the pseudo key somewhere.

unsigned int Mutate(unsigned int dummy);

"dummy" is a dummy parameter and my personal suggestion is to do some manipulations with it. This function may seem to be the one that produces different keys derived from the pseudo key computed by CalcKey depending on the "dummy" parameter. Well, it does. But those keys are not used in this example. What it actually does is mutate the half generated key, which is still present on the stack where we left it in the CalcKey function (if it is not - check your code), and finalize the key generation process.

Once this function returns, we have a ready to use key somewhere in the stack space. A small note to satisfy nerds (as others should know this by default) - you should not call these functions one after the other in real life.

unsigned char* GetPtr(void);

This is the simplest (meaning shortest) function. All it does is return a pointer to the location of the data inside the stack area.

get_ptr:
    push ebp
    mov ebp, esp
    sub esp, 0x14
    mov eax, esp
    leave
    ret

In case of this example, the GetPtr() function returns the pointer itself, however, you may make it return any value that allows you to form a real pointer to the real data. Another recommendation is to call this function before the data gets decrypted so that it may be considered a pointer to immediate.

void Decode(void);

Finally, the end of this "complicated" procedure - decoding the actual data with the actual key.

leave ret
Upon return from this function, the pointer obtained with GetPtr() would point to the decrypted data which is still on stack. Suggestion is - move it from there and overwrite that stack area with whatever you want.

Compiling and running the attached code would print the famous "Hello, World!" string to the terminal.
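For those who want to play with the sequence of calls without an assembler at hand, here is a portable C model of the scheme. It is a sketch under loud assumptions and not the attached code: relying on leftovers in uninitialized stack memory is undefined behavior in C, so an explicit Scratch structure stands in for the stack area, the single-byte XOR "cipher" and the decoy arithmetic are made up for the illustration, and all function names are hypothetical lowercase counterparts of the real ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DATA_SIZE 16

/* Stands in for the stack area where the real code "forgets" its data */
typedef struct {
    unsigned char data[DATA_SIZE];
    unsigned char key;
} Scratch;

/* Looks like a key computation, but its constants are the encoded data */
unsigned int calc_key(Scratch *s, unsigned int seed)
{
    static const unsigned char parts[DATA_SIZE] = {
        'I', 'd', 'm', 'm', 'n', '-', '!', 'V',
        'n', 's', 'm', 'e', ' ', 1, 1, 1
    };
    unsigned int pseudo = seed;
    size_t i;

    assert(s != NULL);
    memcpy(s->data, parts, DATA_SIZE);      /* leave the data behind */
    for (i = 0; i < DATA_SIZE; i++)
        pseudo = pseudo * 31u + parts[i];   /* decoy arithmetic */
    s->key = 0x10;                          /* half generated key */
    return pseudo;
}

/* Finalizes the key that calc_key left behind */
unsigned int mutate(Scratch *s, unsigned int dummy)
{
    assert(s != NULL);
    s->key ^= 0x11;                         /* 0x10 ^ 0x11 = 0x01 */
    return dummy ^ s->key;
}

unsigned char *get_ptr(Scratch *s)
{
    assert(s != NULL);
    return s->data;
}

/* Decodes the data in place with the finalized key */
void decode(Scratch *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < DATA_SIZE; i++)
        s->data[i] ^= s->key;
}
```

Calling calc_key, mutate, get_ptr and decode in that order on the same Scratch object leaves a zero-terminated string in the buffer, which mirrors the flow of main() above.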

Hope I managed to explain the idea and that you may find this article interesting.