Context switching in User Mode can be faster because it does not require going back and forth between user mode and kernel mode.

User Mode Scheduling introduces cooperative multitasking, which can lead to performance improvements. Threads can now assume that they won't be preempted by other threads (from the same UMS group) unless they block in the kernel (e.g. while waiting for an IO operation to complete). This assumption can be used, for example, to build more efficient locks.

Having a customized scheduler gives more control over thread execution. For example, we can write a round-robin scheduler that selects threads one by one in a fair manner (this could be useful in applications with real-time constraints), or a scheduler that reacts to application-specific events – events that the Kernel Mode Scheduler knows nothing about.

It all sounds good enough to check how User Mode Scheduling works in reality. And that's exactly what I have done.

The Rules of the Game
A detailed description of the User Mode Scheduling API can be found on MSDN. To briefly sum up how to use this feature:

The application creates Worker Threads that will perform the actual work. Worker Threads can be preempted only when they block in kernel mode – for example when a page fault occurs or when they wait for an IO operation to complete. They can also give up their processor time by calling UmsThreadYield.

The application creates User Mode Scheduler Threads for the cores on which Worker Threads will run. User Mode Schedulers are notified every time a Worker Thread blocks in the kernel or yields its execution. When control passes to the scheduler, it is responsible for selecting the next thread to run.

Code executed by the scheduler is pretty limited. During notifications it cannot touch any locks acquired by worker threads, because that would result in a deadlock (thread A waits to be scheduled while the scheduler waits for a lock held by thread A – see the deadlock condition?). This means the scheduler cannot call any function that acquires locks (like global heap allocation) unless you know for sure that worker threads won't acquire the same locks.

The notification about blocking in the kernel seems to be a double-edged sword.
The good side of notifications is that they give more control and allow more sophisticated scheduling strategies. For example, we can create a pool of threads assigned to one logical processor. When the currently executing thread blocks, the scheduler can pass execution to the next thread assigned to the same processor to improve throughput. In such a scenario processor cache pollution is minimized, because one task is done entirely on one core, which can lead to better performance.
The bad side of notifications is that a transition has to be made from kernel to user mode, which of course takes time. With the original system scheduler this step is not needed, because scheduling takes place in kernel mode. So notifications can actually decrease performance.
Whether overall performance decreases or increases depends on the nature of the tasks. If tasks block a lot in the kernel, we can expect performance to drop.

Basic UMS
To test the performance of User Mode Scheduling I have written a benchmark that compares the traditional system scheduler with a round-robin User Mode Scheduler. An embarrassingly parallel job is performed: computation of a mathematical function over a range of values. Each chunk of the range is assigned to a new thread that performs the computation. You can read the code here:

The results of the benchmark are surprising – no matter how many worker threads are used, the System Scheduler always has slightly better performance. Process Monitor reveals the reason – when using our Basic User Mode Scheduler, not all cores are utilized uniformly. This is because each Scheduler Thread dequeues some number of ready threads from the global queue. The number of dequeued threads is not always the same, which means the work is not split uniformly across cores, and this lowers throughput. To fix it, we would have to complicate the program and maintain a global queue of ready threads or implement work-stealing strategies. Ain't no free lunch this time.

Conclusions
It seems that the folks at Microsoft also realized that User Mode Scheduling is not the best way to achieve greater performance in simple cases. UMS was used as the default scheduling policy in ConcRT (the MS library for code parallelization). Well, this is no longer the case – the default scheduling policy was changed to the Win32 System Scheduler. A developer of ConcRT – Don McCrady – explains:

UMS support in Concrt is being retired because although it’s super fast at context-switching, well-structured parallel programs shouldn’t do a lot of context switching. In other words, in most real PPL scenarios, the UMS scheduler wasn’t any faster than the regular Win32 thread scheduler.

Although User Mode Scheduling does not improve performance in common cases, it can still be useful when tighter control over thread execution is needed (e.g. SQL Server uses it to achieve better scalability). It is also a good replacement for Fibers – an old Windows mechanism that provides cooperative multitasking. Fibers have several limitations, like the lack of support for multiple cores and constraints on the functions that can be called during execution (when a function blocks, all fibers in the thread block). None of these limitations apply to User Mode Scheduling.

Writing multithreaded programs is a challenging task, both in programming and in debugging. Let's see what traps await the lone rider entering the Parallel Universe.

More is less
Psychological research shows that people become unhappy when overloaded with too many possibilities. Well, if this is true, then analyzing a concurrent program could cause a serious nervous breakdown in unprepared minds. The number of possible executions grows exponentially because of thread scheduler non-determinism. Let's assume that we operate in a sequentially consistent model (e.g. execution on a single core) and that we have only two threads, one executing m instructions and the other n:

The number of possible interleaved executions is equal to the number of permutations with repetition:

    (m + n)! / (m! · n!)

Assuming that m = n, this becomes:

    (2n)! / (n!)² ≈ 4ⁿ / √(π·n)

which means exponential growth of the number of possibilities with the number of instructions.

Even with only 2 threads and 3 instructions per thread, there are 20 possible cases to analyze (enumerate them yourself if you have too much time). This is enough to make your life miserable, but real-world programs are much more complicated than this. And it implies that your mind is physically incapable of analyzing so many possibilities.

To ensure correctness we have to use locks, which also reduces the number of possible execution scenarios (because several instructions are now atomic and can be viewed as a single instruction). But locking also introduces new opportunities to break your program (like deadlocks or priority inversions) and to experience performance drops.

Your friend becomes your enemy
In single-threaded programming the compiler guarantees that, even after optimizations, your program will have the same effect as before. This is no longer the case in multithreaded programming. The compiler is not able to analyze all instruction interleavings when multiple threads are present, yet it still optimizes the program as if it were a single-threaded application. It means that now you are fighting the compiler to suppress certain optimizations.

The simplest optimization that can break things is dead code elimination. Consider the following code:

The new thread waits for the flag to become true and then exits. After that, the main thread terminates. But with optimization -O3 the program goes into infinite execution. What happened? The disassembly of thread_func is the following:
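The original disassembly listing is missing here; gcc at -O3 typically reduces the whole function to a self-jump along these lines (illustrative, not the article's exact output):

```asm
thread_func:
.L2:
        jmp     .L2        # while (!flag) became an unconditional loop
```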

It's exactly an infinite loop. The compiler assumed the program is single-threaded, so flag = true can never happen during thread_func's execution. Therefore the code can be transformed into an endless loop (notice that even the call to pthread_exit has been wiped out as dead code).

Another optimization that can be deadly is reordering of variable stores and loads. The compiler may reorganize reads and writes of non-dependent variables to squeeze out more performance. The assumption about single-threaded execution still holds, so the compiler won't bother about dependencies between multiple threads. If such dependencies exist, this will probably result in erroneous execution.

In C/C++ both optimizations can be found relatively easily by looking at disassembly listings and fought back with proper use of the volatile keyword (e.g. in the previous program, declaring flag as volatile bool fixes it). But if you think you have defeated the final boss, you are unfortunately wrong…
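For the flag example the fix is a one-line change; as an aside, modern C++ code would reach for std::atomic instead, which also constrains the hardware reordering discussed next:

```cpp
#include <atomic>

volatile bool flag_v = false;     // enough to stop dead code elimination:
                                  // the compiler must re-load it each time
std::atomic<bool> flag_a{false};  // the portable C++11 fix: also orders
                                  // surrounding memory operations
```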

The Grey Eminence of all optimizations
The processor. It has the last word on when and in what order variables will be stored in memory. The rules are similar to the compiler's – if the program executes on a single core, optimizations performed by the CPU will not affect the results of execution (except performance). But when the program executes on multiple cores and each core can reorder instructions in its pipeline, you can watch things break.

To fight back against processor reordering, you first have to know when reordering may take place. The rules of memory operation ordering are written in the processor specification. Intel has a relatively strong consistency model – which means there are not many cases in which it can reorder memory operations. The specification says:

Reads may be reordered with older writes to different locations but not with older writes to the same location.

This can break some algorithms, like Dekker's or Peterson's mutual exclusion, while other programming constructs, like the Double-Checked Locking pattern, may still work correctly. Other processors, like the Alpha, are not so forgiving, and almost any memory operations can be reordered (which means the Double-Checked Locking pattern won't work without explicit memory barriers).

Once you know where reordering can cause problems, you can suppress it with a proper memory barrier instruction. In contrast with compiler optimizations, you are not able to see whether a processor runtime optimization took place – you can only see its effects. And fighting an invisible opponent can be pretty damn hard.

In the next parts of Adventures in the Parallel Universe we will explore some of the concepts presented here more deeply, along with new material.