Pthreads on Microsoft Windows

An extremely common API used for developing parallel programs is the Posix Threads API (pthreads). The API contains many synchronization primitives that allow threaded code to be efficiently written. Unfortunately, Microsoft Windows does not support this interface as-is. Thus if one wishes to port over an application, quite a bit of work may need to be done.

Fortunately, a pthreads library for Windows has been written, thus simplifying the porting effort. However, the Windows API has progressed significantly with regard to threading since that library was written. Many new functions have been exported that simplify the creation of a pthread library on Windows. In fact, nearly all of the synchronization primitives now exist, and using them may only require the creation of a few simple macros.

Thus it seems to be time to explore the creation of a new pthreads library for Microsoft Windows. To make its use as simple as possible, we will require its entire implementation to be confined to a single .h header file, so that no explicit library needs to be linked into an application or DLL. This trick can be done if all its global variables are implicitly defined to be zero, and all functions are static.

Finally, we note that whilst many of the synchronization primitives required by the pthreads API are exported by Microsoft Windows, not all are. Thus we will need to explore some of the undocumented internals of Windows. This is obviously dangerous, as Microsoft may change these undocumented features at any time in the future. However, as an educational exercise, we can ignore this unpalatable fact and see exactly how much we can get away with.

Critical Sections for Mutexes

The first part we shall implement is the set of functions for pthread_mutex_t. This can be done by using a CRITICAL_SECTION object and a typedef. This may not be the most efficient mutex, but it is extremely portable on Microsoft Windows. It allows you to use the resulting pthread API on any critical section, even those defined in other libraries. Since the pthread API extends the Windows one, this is rather nice.

Most of the mutex functions are simple wrappers around their Windows counterparts.
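As a rough illustration of what such wrappers can look like, here is a minimal sketch built only on the documented critical section calls. It is not the library's actual code: attributes are ignored, error handling is minimal, and the whole fragment is guarded so that it is inert on anything other than Windows.

```c
#ifdef _WIN32   /* Windows-only sketch; compiles to nothing elsewhere */
#include <assert.h>
#include <errno.h>
#include <windows.h>

typedef CRITICAL_SECTION pthread_mutex_t;

static int pthread_mutex_init(pthread_mutex_t *m, const void *attr)
{
    assert(m != NULL);
    (void) attr;                    /* attributes are ignored in this sketch */
    InitializeCriticalSection(m);
    return 0;
}

static int pthread_mutex_lock(pthread_mutex_t *m)
{
    EnterCriticalSection(m);
    return 0;
}

static int pthread_mutex_trylock(pthread_mutex_t *m)
{
    /* TryEnterCriticalSection() returns nonzero when the lock was taken */
    return TryEnterCriticalSection(m) ? 0 : EBUSY;
}

static int pthread_mutex_unlock(pthread_mutex_t *m)
{
    LeaveCriticalSection(m);
    return 0;
}

static int pthread_mutex_destroy(pthread_mutex_t *m)
{
    DeleteCriticalSection(m);
    return 0;
}
#endif /* _WIN32 */
```

Note that a critical section is recursive, so a trylock from the thread that already owns it succeeds rather than returning EBUSY.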

The pthreads API has an initialization macro that has no correspondence to anything in the Windows API. By investigating the internal definition of the critical section type, one may work out how to initialize one without calling InitializeCriticalSection(). The trick here is that InitializeCriticalSection() is not allowed to fail. It tries to allocate a critical section debug object, but if no memory is available, it sets the debug pointer to a specific value. (One would expect that value to be NULL, but it is actually (void *)-1 for some reason.) Thus we can use this special value for that pointer, and the critical section code will work.

The other important part of the critical section type to initialize is the number of waiters. This controls whether or not the mutex is locked. Fortunately, this part of the critical section is unlikely to change. Apparently, many programs already test critical sections to see if they are locked using this value, so Microsoft felt that it was necessary to keep it set at -1 for an unlocked critical section, even when they changed the underlying algorithm to be more scalable. The final parts of the critical section object are unimportant, and can be set to zero for their defaults. This yields an initialization macro:

#define PTHREAD_MUTEX_INITIALIZER {(void*)-1,-1,0,0,0,0}

The next part of the pthread_mutex_t API is the pthread_mutex_timedlock() function. This also has no counterpart exported by Windows. A second problem is that the Posix function expects a struct timespec object. Unfortunately, no such type is defined in the Windows headers, so we'll need to define it, and some helper functions that use it.
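The helpers mostly need to turn an absolute Posix deadline into the relative millisecond timeout that Windows wait functions expect. The following is a minimal sketch of that arithmetic. The names demo_timespec, demo_timespec_to_ms() and demo_rel_timeout_ms() are hypothetical, chosen so the fragment cannot clash with a toolchain that already defines struct timespec, and the caller is assumed to supply the current time in milliseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct timespec; renamed so it cannot clash with a
   toolchain that already defines the real one. */
struct demo_timespec {
    int64_t tv_sec;    /* whole seconds since the epoch */
    long    tv_nsec;   /* nanoseconds, 0 .. 999999999 */
};

/* Convert an absolute time to milliseconds since the epoch. */
static uint64_t demo_timespec_to_ms(const struct demo_timespec *ts)
{
    assert(ts->tv_sec >= 0 && ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);
    return (uint64_t) ts->tv_sec * 1000u + (uint64_t) ts->tv_nsec / 1000000u;
}

/* Milliseconds left until the deadline, given the current time in ms.
   Returns 0 if the deadline has passed, and stays below 0xFFFFFFFF,
   which Windows wait functions treat as INFINITE. */
static uint32_t demo_rel_timeout_ms(const struct demo_timespec *deadline,
                                    uint64_t now_ms)
{
    uint64_t end_ms = demo_timespec_to_ms(deadline);
    uint64_t left;

    if (end_ms <= now_ms) return 0;
    left = end_ms - now_ms;
    return left > 0xFFFFFFFEu ? 0xFFFFFFFEu : (uint32_t) left;
}
```

A timed lock can then retry or wait for at most demo_rel_timeout_ms() milliseconds and report ETIMEDOUT once the result reaches zero.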

This almost completes the mutex API. However, to be fully complete we need to support the pthread_mutexattr_t functions. These affect the type of mutex created by pthread_mutex_init(). However, since Microsoft Windows only supports one type of critical section, we will ignore most of this functionality. Instead, these simple wrapper functions will just record the state so that the get functions return whatever was last set.
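One way to "just record the state" is to make the attribute object a plain integer and store the requested mutex type in its low bits. The sketch below assumes that layout. The demo_ names and the numeric values of the type constants are illustrative only and are not taken from the library.

```c
#include <assert.h>
#include <errno.h>

typedef unsigned demo_mutexattr_t;   /* hypothetical: all state in one word */

enum {
    DEMO_MUTEX_NORMAL     = 0,
    DEMO_MUTEX_ERRORCHECK = 1,
    DEMO_MUTEX_RECURSIVE  = 2,
    DEMO_MUTEX_TYPE_MASK  = 3        /* low two bits hold the type */
};

static int demo_mutexattr_init(demo_mutexattr_t *a)
{
    assert(a);
    *a = DEMO_MUTEX_NORMAL;
    return 0;
}

/* Remember the requested type; nothing else acts on it. */
static int demo_mutexattr_settype(demo_mutexattr_t *a, int type)
{
    if (type < 0 || type > DEMO_MUTEX_RECURSIVE) return EINVAL;
    *a = (*a & ~(unsigned) DEMO_MUTEX_TYPE_MASK) | (unsigned) type;
    return 0;
}

/* Report whatever was last stored. */
static int demo_mutexattr_gettype(const demo_mutexattr_t *a, int *type)
{
    *type = (int) (*a & DEMO_MUTEX_TYPE_MASK);
    return 0;
}
```

The other attribute pairs (process-shared, protocol and so on) could be packed into further bits of the same word in the same way.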

Slim Read Write Locks for rwlocks

Since the earlier pthreads implementation on Windows, Microsoft has added Slim Read Write locks (SRWlocks). These allow a simple implementation of much of the pthread_rwlock_t API. However, again the Microsoft API is a subset of the Posix API, so we will again need to explore the undocumented internals to construct the missing functionality.

Using wrappers for the functions that already exist, we can implement the corresponding lock functions directly.

These wrappers also make the explicit calls to pthread_testcancel() mandated by the Posix API. Next, we need to work out an initialization macro, together with a way of implementing the trylock and unlock functionality. The problem here is that the Posix API requires a single unlock function, whereas the Microsoft Windows API has separate unlock functions for read and write locks. We'll need to understand the internal state of the lock to tell whether it is held as a read or a write lock, and hence which unlock function to call.

By looking at the Microsoft documentation, we notice that the implementation of a SRWLock is a single pointer sized object. The InitializeSRWLock() function simply sets this pointer to zero. Thus, an initialization macro may be written as:

#define PTHREAD_RWLOCK_INITIALIZER {0}

Next, we can construct simple programs to see how the state of this pointer-sized object changes as we read and write lock and unlock it. The first thing that is noticeable is that a SRWLock that is owned exclusively has the value 1, and that if multiple shared owners have taken the lock, then it has value 1+16n, where n is the number of shared owners. If there is contention, then the lock has the value of a pointer with the low bit set.

Thus, if the low bit is set, then the lock is owned by someone. If any other of the bottom three bits are set, then the lock has some internal state. Using this, we can construct the trylock functions, which Microsoft did not export when this was written. (Windows 7 later added TryAcquireSRWLockShared() and TryAcquireSRWLockExclusive().)
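To make the bit tests concrete, here is a sketch that models the observed lock word as a plain atomic integer and applies the rules above with compare-and-swap. It is a portable model for illustration only: it does not touch a real SRWLOCK, the demo_ names are hypothetical, and it treats bits 1 to 3 as internal state on the assumption that the share count starts at bit 4 (as the 1+16n values suggest).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Model of the observed lock word; not the real Windows type. */
typedef _Atomic uintptr_t demo_rwlock_t;

static int demo_rwlock_trywrlock(demo_rwlock_t *l)
{
    uintptr_t expected = 0;

    assert(l);
    /* Only an unowned lock (0) can become exclusively owned (1). */
    return atomic_compare_exchange_strong(l, &expected, (uintptr_t) 1)
               ? 0 : EBUSY;
}

static int demo_rwlock_tryrdlock(demo_rwlock_t *l)
{
    uintptr_t state = atomic_load(l);
    uintptr_t next;

    if (state == 0) {
        /* Unowned -> one shared owner, i.e. 1 + 16. */
        return atomic_compare_exchange_strong(l, &state, (uintptr_t) 0x11)
                   ? 0 : EBUSY;
    }

    if (state == 1) return EBUSY;   /* a single writer owns the lock */
    if (state & 14) return EBUSY;   /* bits 1..3: waiters or other state */

    /* Only shared owners: try to add one more reader. */
    next = state + 16;
    return atomic_compare_exchange_strong(l, &state, next) ? 0 : EBUSY;
}
```

If the compare-and-swap loses a race, the sketch simply reports EBUSY rather than retrying, which is acceptable behaviour for a trylock.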

Next we need to implement pthread_rwlock_unlock(). This is a little tricky. Unfortunately, it doesn't seem that there is an easy way to determine if the lock is owned by a reader or a writer. We have some known special cases. If the lock value is equal to 1, then it is owned by a single writer, so we can use the write unlock. If the lock value is equal to 1 + 16n with n >= 1, then we must be a shared reader trying to unlock it. If the lock is contended, and there is a list of threads waiting for it, then we need to know who to wake up.

Unfortunately, since the internals of a SRWLock are undocumented we are a little stuck. By exploring the implementation in asm we can see what is going on. Basically, the unlock routines need to check if they are the simple cases described above. If they aren't, and there are waiters, then more complex code is called to wake the required waiter. Fortunately, it seems that the code to wake the next waiter is extremely similar between the shared and exclusive cases. The shared (reader) case appears to be more generic, so we will use that. The resulting function seems to pass testing. However, this is a little bit of a hack. Unlocking a contended exclusive lock with the shared unlock function may not work in the future. We will ignore that for now.
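The decision described above boils down to a single comparison on the lock word. The following sketch shows only that decision, as a pure function over the observed value, so it can be read and run without Windows. The demo_ names are hypothetical. On Windows the two outcomes would map to ReleaseSRWLockExclusive() and ReleaseSRWLockShared() respectively.

```c
#include <assert.h>
#include <stdint.h>

enum demo_unlock_kind {
    DEMO_UNLOCK_EXCLUSIVE,   /* would call ReleaseSRWLockExclusive() */
    DEMO_UNLOCK_SHARED       /* would call ReleaseSRWLockShared() */
};

static enum demo_unlock_kind demo_choose_unlock(uintptr_t state)
{
    /* The caller holds the lock, so the low "owned" bit must be set. */
    assert(state & 1);

    /* Exactly 1: a single uncontended writer. */
    if (state == 1) return DEMO_UNLOCK_EXCLUSIVE;

    /* 1 + 16n readers, or a contended lock whose word is a pointer:
       fall back to the more generic shared release (the hack above). */
    return DEMO_UNLOCK_SHARED;
}
```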

Finally, we need to implement the timedlock functionality. Again, Microsoft Windows doesn't have any timeout functions we can use. Even worse, the underlying wait is in an undocumented function NtWaitForKeyedEvent(), so we probably can't do the trick we did with critical sections. Instead, we will implement a busy wait. This isn't optimal, but until Microsoft documents its new interfaces it is free to alter them at will, so depending on them is extremely risky.
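A busy-wait timed lock is just a trylock retried until a deadline passes. The sketch below shows that loop in a generic form, taking the trylock and the clock as function pointers so it can be exercised anywhere. All of the demo_ names are hypothetical, and the two small fakes at the end exist only to drive the loop in a test.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef int (*demo_try_fn)(void *lock);   /* returns 0 or EBUSY */
typedef uint64_t (*demo_clock_fn)(void);  /* current time in ms */

/* Retry try_lock() until it succeeds or the deadline is reached. */
static int demo_timedlock_spin(void *lock, demo_try_fn try_lock,
                               demo_clock_fn now_ms, uint64_t deadline_ms)
{
    assert(try_lock && now_ms);

    for (;;) {
        if (try_lock(lock) == 0) return 0;
        if (now_ms() >= deadline_ms) return ETIMEDOUT;
        /* A real implementation would yield the processor here. */
    }
}

/* Fakes for experimentation: a clock that advances by 1 ms per query,
   and a "lock" that fails a given number of times before succeeding. */
static uint64_t demo_fake_time;

static uint64_t demo_fake_now(void)
{
    return demo_fake_time++;
}

static int demo_fake_try(void *p)
{
    int *failures_left = p;

    if (*failures_left > 0) {
        (*failures_left)--;
        return EBUSY;
    }
    return 0;
}
```

With the fakes, a lock that fails three times is acquired well before a 100 ms deadline, while one that keeps failing returns ETIMEDOUT once the fake clock passes the deadline.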

The only thing left for a complete implementation of the rwlock API is the set of attributes for pthread_rwlock_init(). Again, since Microsoft only has one type of read/write lock, these don't need to do anything, and can be implemented as simple wrapper functions.

Condition Variables

The next part of the synchronization API we shall implement is condition variables. Previously, implementing them was rather difficult, with issues of correctness and fairness arising. Fortunately, Microsoft has now implemented a condition variable API, and so the Posix API can now be implemented as a set of simple wrapper functions. A quick check reveals that condition variables can be safely zero-initialized, allowing a nice initialization macro.
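A minimal sketch of such wrappers, using only the documented condition variable calls, might look as follows. It is not the library's code: attributes are ignored, the cancellation point that Posix requires in the wait is left as a comment, and the fragment is guarded so that it is inert on anything other than Windows.

```c
#ifdef _WIN32   /* Windows-only sketch; compiles to nothing elsewhere */
#include <assert.h>
#include <errno.h>
#include <windows.h>

typedef CRITICAL_SECTION   pthread_mutex_t;
typedef CONDITION_VARIABLE pthread_cond_t;

/* Zero-initialization, as described in the text */
#define PTHREAD_COND_INITIALIZER {0}

static int pthread_cond_init(pthread_cond_t *c, const void *attr)
{
    assert(c != NULL);
    (void) attr;                        /* attributes are ignored here */
    InitializeConditionVariable(c);
    return 0;
}

static int pthread_cond_signal(pthread_cond_t *c)
{
    WakeConditionVariable(c);
    return 0;
}

static int pthread_cond_broadcast(pthread_cond_t *c)
{
    WakeAllConditionVariable(c);
    return 0;
}

static int pthread_cond_wait(pthread_cond_t *c, pthread_mutex_t *m)
{
    /* pthread_testcancel() would be called here in the full library */
    return SleepConditionVariableCS(c, m, INFINITE) ? 0 : EINVAL;
}

static int pthread_cond_destroy(pthread_cond_t *c)
{
    (void) c;                           /* nothing to release */
    return 0;
}
#endif /* _WIN32 */
```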

Barriers

Using the condition variables and mutexes, we can now create the barrier API. The barrier counts the number of incoming and outgoing waiters. By using a flag bit, we can work out whether we need to let threads into or out of the barrier. Note that the code is designed for simplicity. By using a wait-tree, greater performance can be obtained at the cost of complexity.
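The counting scheme with a flag bit can be sketched as below. To keep the example runnable anywhere, it is written against the host's own pthread mutex and condition variable. On Windows these would be the critical section and condition variable wrappers instead. The demo_ names, the flag position and the serial-thread value of 1 are assumptions made for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define DEMO_BARRIER_SERIAL_THREAD 1
#define DEMO_BARRIER_FLAG (1 << 30)   /* set while threads are leaving */

typedef struct {
    int count;               /* threads needed to release the barrier */
    int total;               /* arrivals so far, plus the flag bit */
    pthread_mutex_t m;
    pthread_cond_t cv;
} demo_barrier_t;

static int demo_barrier_init(demo_barrier_t *b, int count)
{
    if (count <= 0) return EINVAL;
    b->count = count;
    b->total = 0;
    pthread_mutex_init(&b->m, NULL);
    pthread_cond_init(&b->cv, NULL);
    return 0;
}

static int demo_barrier_wait(demo_barrier_t *b)
{
    int serial = 0;

    assert(b->count > 0);
    pthread_mutex_lock(&b->m);

    /* The previous round is still draining: wait until all have left. */
    while (b->total > DEMO_BARRIER_FLAG)
        pthread_cond_wait(&b->cv, &b->m);

    /* The first thread of a new round clears the flag. */
    if (b->total == DEMO_BARRIER_FLAG) b->total = 0;

    b->total++;

    if (b->total == b->count) {
        /* Last arrival: set the flag and remove ourselves from the count. */
        b->total += DEMO_BARRIER_FLAG - 1;
        pthread_cond_broadcast(&b->cv);
        serial = DEMO_BARRIER_SERIAL_THREAD;
    } else {
        /* Wait for the last arrival to set the flag. */
        while (b->total < DEMO_BARRIER_FLAG)
            pthread_cond_wait(&b->cv, &b->m);

        b->total--;

        /* The last thread out wakes threads waiting for the next round. */
        if (b->total == DEMO_BARRIER_FLAG)
            pthread_cond_broadcast(&b->cv);
    }

    pthread_mutex_unlock(&b->m);
    return serial;
}
```

Exactly one thread per round gets the serial-thread return value, and a thread that races ahead into the next round blocks until the previous round has fully drained.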

Spinlocks

There are many ways to implement spinlocks. We will choose the simplest way because it doesn't suffer slowdowns when the number of threads is higher than the number of processors. Since the whole point of spinlocks is speed, it doesn't matter that Microsoft doesn't export a spinlock API. A simple exchange-based algorithm is all that is needed.
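An exchange-based spinlock swaps a "held" value into the lock word and checks what was there before. The sketch below uses C11 atomics rather than the Windows interlocked intrinsics so that it can be compiled anywhere. The demo_ names are hypothetical and this is an illustration of the technique, not the library's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

typedef atomic_int demo_spinlock_t;   /* 0 = free, 1 = held */

static int demo_spin_init(demo_spinlock_t *l)
{
    atomic_init(l, 0);
    return 0;
}

static int demo_spin_trylock(demo_spinlock_t *l)
{
    /* If the old value was already 1, someone else holds the lock. */
    return atomic_exchange_explicit(l, 1, memory_order_acquire) ? EBUSY : 0;
}

static int demo_spin_lock(demo_spinlock_t *l)
{
    /* Swap in 1 until the value we replaced was 0. */
    while (atomic_exchange_explicit(l, 1, memory_order_acquire)) {
        /* Wait on plain loads so that spinning threads do not keep
           writing to the cache line. */
        while (atomic_load_explicit(l, memory_order_relaxed)) { }
    }
    return 0;
}

static int demo_spin_unlock(demo_spinlock_t *l)
{
    assert(atomic_load(l) != 0);   /* must currently be held */
    atomic_store_explicit(l, 0, memory_order_release);
    return 0;
}
```

On Windows the exchange would typically be an interlocked exchange intrinsic, with a pause hint inside the inner loop.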

Implementing pthread_once() takes some extra complexity, because cancellation may happen during the function passed to pthread_once(). The Posix specification states that if the function is cancelled, then the pthread_once_t variable needs to return to the "uninitialized" state. To do this, we use the magic pthread_cleanup_push() and pthread_cleanup_pop() macros to record a cleanup function on a cleanup list.

Cancelling

The next big part of the pthreads API to implement is the set of functions related to cancellation. Posix describes two types of cancellation. The first is synchronous, and happens through explicit calls to pthread_testcancel(), and through other library functions explicitly listed as being cancellation points. We can add these cancellation points by using macros.
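The macro trick relies on the fact that a function-like macro is not expanded again inside its own body, so the wrapper can share the name of the function it wraps. The sketch below demonstrates the pattern with stand-ins: demo_testcancel() plays the role of pthread_testcancel() and demo_blocking_read() plays the role of a blocking library call. Both names are hypothetical.

```c
#include <assert.h>

/* Stand-in for pthread_testcancel(); counts how often it is reached. */
static int demo_testcancel_calls;

static void demo_testcancel(void)
{
    demo_testcancel_calls++;
}

/* Stand-in for a library call that may block. */
static int demo_blocking_read(int fd)
{
    assert(fd >= 0);
    return fd;
}

/* Turn the call into a cancellation point. The inner use of the name is
   not re-expanded, so it reaches the real function defined above. */
#define demo_blocking_read(fd) (demo_testcancel(), demo_blocking_read(fd))
```

Any code compiled after the macro definition that calls demo_blocking_read() now checks for cancellation first, without the call sites changing.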

A complete list of the cancellation points defined by Posix is in the pthread header on the downloads page. You may also use extra macro overrides to wrap Windows API functions that may block.

The second class of cancellation required for Posix support is asynchronous cancellation. This should happen "immediately" without a wait until the next cancellation point. The problem is that this is typically implemented in unix operating systems via signals. Unfortunately, Microsoft Windows doesn't support signals in quite the same way. (There is rudimentary support in the CRT, but it is not comprehensive enough for us.) Thus we require another way of triggering a cancellation in another thread. One way that will work most of the time is to modify the thread descriptor block. By changing the instruction pointer to point to our cancellation handler, we can cause the cancelled thread to alter its flow of execution. The only problem is that this will not unblock a thread blocked in the kernel. We will ignore this problem for now.

The implementation uses the _pthread_cancelling variable to fast-path the common case where no cancellation is happening. (Making every thread check its thread-local cancellation flag all the time is relatively slow.)

One possible way to fix the case where a thread is blocked in the kernel is to use a kernel driver. This choice is used by the old Windows pthread library. Another possibility which may work is to use a Doppelganger thread. The algorithm goes as follows: Suspend the cancelled thread. The current thread can then save its stack pointer and thread segment register somewhere in the thread-local space of the suspended thread. This thread can then impersonate the suspended thread by stealing its stack and thread segment selector. By running the standard cancellation routines in the Doppelganger thread, the required cleanup functions can be run in the correct context. Finally, once that is done, the Doppelganger can restore its state to what it was previously and then call the low-level TerminateThread() function to kill the suspended thread (even if it was in kernel mode).

So why isn't the above implemented? The problem is that the thread-stealing requires low-level assembly to work. Unfortunately, Microsoft has decided that all 64-bit code should use compiler intrinsics instead of inline assembly. The problem here is that Microsoft hasn't thought of everything, and the particular instructions required are not exported as compiler intrinsics. The correct way to implement this on 64-bit would be to have a separate assembly file to compile alongside the C code. This, however, does not fit with our goal of having a single .h file for the implementation of the library. (Another problem is that it is difficult to portably hook the cleanup of the CRT to prevent memory leaks, but this may perhaps be fixed through the use of a new thread as the Doppelganger.)

Thread Creation and Destruction

pthread_create() unfortunately has a slightly different interface than _beginthreadex(). This deficiency may be fixed by using a wrapper function that in turn will call the thread main function. We can hide the extra information inside pthread_t. Similarly, we can store the return value inside pthread_t so that pthread_join() will work correctly. Together with a few extra details to complete the library, this determines what the pthread_t definition needs to contain.

Most of the thread creation code is fairly trivial. The only non-obvious thing is the use of a jump buffer to transfer control in pthread_exit(). This is done so that C++ destructors may be called as the stack is unwound. A simple call to terminate the thread may not clean up exactly what we want otherwise. Another subtlety is pthread_self(). This function will transparently convert a thread not created by pthreads into one with the extra information required for the pthreads API. Thus if you don't want to use pthread_create(), and instead use the Microsoft Windows thread creation API, you can. The only downside is that pthread_self() has no documented failure mode, so if it can't allocate memory it calls abort(), as there is nothing else it can do.

Thread Specific Data

The last significant part of the pthreads API is that for thread specific data. This can be implemented in two different ways in Windows. The simplest is to use __declspec(thread). Unfortunately, that technique doesn't work correctly for DLLs. Thus we are forced to use the other interface based on TlsAlloc(). This is the reason why the implementation uses the variable _pthread_tls. By using a resizable array within the thread-specific data we can store the information required by the Posix specification.

The data keys can be implemented as a global resizable array protected by read-write locks.

Unimplemented Functions

Finally, there remain some functions that exist in the Posix Threading API, but do not correspond well to the Microsoft Windows API. The first of these is pthread_atfork(). It is designed to make sure the fork() system call works as intended in multithreaded programs, allowing the child process to know that it is able to use all its required data due to it being in a consistent state. However, Windows doesn't have the fork() function, so this routine can be a simple stub.

Similarly, Windows lacks good support for signals. Thus pthread_kill() and pthread_sigmask() can also be implemented as stubs, since porting a signal-using application over will require much extra work anyway.

A full implementation of this library is on the downloads page. It is licensed under the BSD license. However, be aware that it does use undocumented Windows internals for a few of the synchronization primitives. These may be changed by Microsoft in the future, so they probably should not be relied on for anything important.

Comments

AYF said...

Giving my thanks for a great alternative to pthreads-win32 under a liberal license.

sfuerst said...

The get and set concurrency functions had an extra underscore in their name. This has now been fixed in the article, and in the downloadable header file.

Borislav Trifonov said...

The asynchronous cancellation trick is nice, but how do we guarantee C++ destructors are still called for the thread's automatic objects?

said...

Great licencing, for a very useful library! Now it does not compile with vc10 for a C++ project.

sfuerst said...

There was a bug in implementation of pthread_barrier_destroy(). It should wait until all threads have left the barrier before returning. This has been fixed.

thomas said...

I would like to try this in place of the pthread-win32 library, which seems very slow in comparison to pthreads running on other OS.

But including the winpthreads.h in a vs2008 c++ file produces lots of errors.

tjm said...

Since keys are reused in a circular fashion, and nothing is done in pthread_key_create() or pthread_key_delete() to clear the TLS slots of existing threads, doesn't this implementation violate the rule that: "Upon key creation, the value NULL shall be associated with the new key in all active threads"?

jsestrad said...

Thank you for this! It helps me get out of a slump of confusion! However, I am getting errors in Windows VC++ 2010 as well. Will this code not work because Microsoft changed some things? Here are the errors:

but instead of Linux pthread usage, you MUST initialise the mutex before locking it, this can be solve with a bigger management...

Frediano Ziglio said...

About PTHREAD_MUTEX_INITIALIZER: if you use it and you try to lock the mutex in another thread, the program crashes.

sfuerst said...

thomas: The casting rules for C++ are different than C. You are trying to compile as C++, and this causes the header to fail. If you replace with reinterpret_cast<>(), it should work. (However, that would then break the C implementation.)

tjm: It is difficult to efficiently fix this, other than leaking old keys. I suppose there could be a list maintained of active threads, whose descriptors could be scanned under some lock... Try using the updated version of the library in mingw64. They may have fixed this.

jsestrad: Yeah, Microsoft broke their headers. They stopped defining the _InterlockedCompareExchangePointer() intrinsic, even though their documentation still mentions it. Try using the version without a leading underscore, or manually using the compare-exchange of the correct size for your pointers.

Frediano: The PTHREAD_MUTEX_INITIALIZER has been tested to work with Windows Vista. Earlier versions won't work. Later ones may not either. Microsoft can change their implementation of the critical section at any time.

Frediano Ziglio said...

Yes, I forgot to mention that PTHREAD_MUTEX_INITIALIZER didn't work using Windows XP 32. It worked correctly with Windows 7 (both 32 and 64 bit).
Also, you can have performance problems because one of the numbers should be a spin count, which is initialized differently for SMP (probably higher than 0).
It would be fantastic to have a PTHREAD_MUTEX_INITIALIZER even on Windows, to have a way to initialize mutexes statically. Personally I use a different approach, using an extended struct that includes a CRITICAL_SECTION and uses MCS locks to get the initial initialization automatically.
Personally I'm quite worried about portability. If a program stops working just because you updated your system, that is not good.

Alexey Melnichuk said...

pthread_kill(t, 0) could be used to detect whether a thread is still running.
Maybe it should be implemented as WaitForSingleObject() with a zero timeout?