Context switching in User Mode can be faster because it does not require going back and forth between user mode and kernel mode.

User Mode Scheduling introduces cooperative multitasking, which can lead to performance improvements. Threads can now assume that they won't be preempted by other threads (from the same UMS group) unless they block in the kernel (e.g. while waiting for an IO operation to complete). This assumption can be used, for example, to build more efficient locks.

Having a customized scheduler gives more control over thread execution. For example, we can write a round-robin scheduler that selects threads one by one in a fair manner (this could be useful in applications with real-time constraints), or a scheduler that reacts to application-specific events – events that the Kernel Mode Scheduler knows nothing about.

It all sounds good enough to check how User Mode Scheduling works in practice. And that's exactly what I have done.

The Rules of the Game
A detailed description of the User Mode Scheduling API can be found on MSDN. To briefly sum up how to use this feature:

Application creates Worker Threads that will perform the actual work. Worker threads can be preempted only when they block in kernel mode – for example when a page fault occurs or when they are waiting for an IO operation to complete. They can also give up their processor time voluntarily by calling UmsThreadYield.

Application creates User Mode Scheduler Threads for the cores on which Worker Threads will run. User Mode Schedulers are notified every time a Worker Thread blocks in the kernel or yields its execution. When control is passed to the scheduler, it is responsible for selecting the next thread to run.
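The flow described above can be summarized in pseudocode. EnterUmsSchedulingMode, DequeueUmsCompletionListItems, ExecuteUmsThread and UmsThreadYield are real Win32 UMS entry points; the surrounding structure is a simplified sketch, not complete code:

```
// One scheduler thread per logical core:
EnterUmsSchedulingMode(&startupInfo)   // turns this thread into a UMS scheduler;
                                       // startupInfo.SchedulerProc = SchedulerEntry

SchedulerEntry(reason):                // invoked on startup, on blocking, on yield
    if a worker blocked in the kernel or called UmsThreadYield:
        DequeueUmsCompletionListItems(completionList, ...)  // collect ready workers
        next = pick a ready worker     // the scheduling policy lives here
        ExecuteUmsThread(next)         // does not return while the worker runs
```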

Code executed by the scheduler is pretty limited. During notifications it cannot touch any locks acquired by worker threads, because doing so would result in a deadlock (thread A waits to be scheduled while the scheduler waits for a lock acquired by thread A – see the deadlock condition?). This means the scheduler cannot call any function that acquires locks (such as global heap allocation) unless you know for sure that worker threads won't acquire the same locks.

The notification about blocking in the kernel seems to be a double-edged sword.
The good side of the notification is that it gives more control and makes it possible to employ more sophisticated scheduling strategies. For example, we can create a pool of threads assigned to one logical processor. When the currently executing thread blocks, the scheduler can pass execution to the next thread assigned to the same processor to improve throughput. In such a scenario processor cache pollution is minimized, because one task is done entirely on one core, which can lead to better performance.
The bad side of the notification is that a transition has to be made from kernel to user mode, which of course takes time. With the original system scheduler this step is not needed, because scheduling takes place in kernel mode. So notifications can actually decrease performance.
Whether overall performance increases or decreases depends on the nature of the tasks. If tasks block a lot in the kernel, we can expect performance to drop.

Basic UMS
To test the performance of User Mode Scheduling I have written a benchmark that compares the traditional system scheduler with a round-robin User Mode Scheduler. It performs an embarrassingly parallel job: computing a mathematical function over a range of values. Each chunk of the range is assigned to a new thread which performs the computation. You can read the code here:

The results of the benchmark are surprising – no matter how many worker threads are used, the System Scheduler always performs slightly better. Process Monitor reveals the reason: when using our Basic User Mode Scheduler, not all cores are utilized uniformly. Each Scheduler Thread dequeues some number of ready threads from the global queue, and that number is not always the same, which means the work is not split uniformly across cores. This lowers throughput. To fix it, we would have to complicate the program and maintain a global queue of ready threads, or implement work-stealing strategies. Ain't no free lunch this time.

Conclusions
It seems that the folks at Microsoft also realized that User Mode Scheduling is not the best way to achieve greater performance in simple cases. UMS used to be the default scheduling policy in ConcRT (an MS library for code parallelization). This is no longer the case – the default scheduling policy was changed to the Win32 System Scheduler. A developer of ConcRT, Don McCrady, explains:

UMS support in Concrt is being retired because although it’s super fast at context-switching, well-structured parallel programs shouldn’t do a lot of context switching. In other words, in most real PPL scenarios, the UMS scheduler wasn’t any faster than the regular Win32 thread scheduler.

Although User Mode Scheduling does not improve performance in common cases, it can still be useful when there is a need for tighter control over thread execution (e.g. SQL Server uses it to achieve better scalability). It is also a good replacement for Fibers – an old Windows mechanism that provides cooperative multitasking. Fibers have several limitations, such as lack of support for multiple cores and constraints on the functions that can be called during execution (when a function blocks, all fibers in the thread block). None of these limitations apply to User Mode Scheduling.