Scheduler stories: The joy of fiber mode

Probably the funniest thing I had ever seen on stage was a two-hander called “Frank ‘n Stein”. It’s a telling of the classic Frankenstein story, with the physical comedy of two actors having to rotate continuously between a large number of roles, including a whole crowd chasing the monster. This was all made possible by them never leaving the stage, but instead changing characters in front of the audience, using only rudimentary props to help differentiate the characters.

If this is the only thing you remember about fiber mode scheduling, it should see you through.

Let me have just a little bit of peril

The title of this post is of course an allusion to Ken Henderson’s classic article The perils of fiber mode, where he hammers home the point that fiber scheduling, a.k.a. lightweight pooling, appears seductive until you realise what you have to give up to use it.

We’ll get to the juicy detail in a moment, but as a reminder, the perils of fibers lie in their promiscuity: many fibers may share one thread, its kernel structures and its thread-local storage. This is no problem for code that was written with fibers in mind, including all of SQLOS, but unfortunately there are bodies of code for which this isn’t true.

I am taking this flirtation with peril as a learning opportunity though. See, in fiber mode, the mechanics of scheduling are made visible by necessity, so getting to grips with fibers is good preparation for thread-mode SQLOS scheduling.

The context of a fiber

Fibers carry around a comparatively small amount of private state, inheriting much from their underlying threads. However, a thread’s state is split between kernel-mode and user-mode components, and when a thread does a costume change to become one of a number of fibers, the change only involves the user-mode side of the divide. You can do more complicated changes backstage, out of sight of the audience, but in fiber mode all your changes have to be onstage.

The weightiest part of a thread’s context is its user-mode stack, consisting of rolled-up layers of as-yet-unfinished function calls made since the thread’s birth and (pretty much) every local variable used in each of these functions. Once we progress into fiber mode, each of the potentially many fibers on a thread gets its own user-mode stack, and context switching includes changing stacks while remaining in user mode. This isn’t as complicated as it might sound: all it requires is an update to the stack pointer (just another CPU register), of course after saving the old stack pointer as part of the outgoing fiber’s context.

What remains of context? Some more CPU registers, the thread impersonation handle, floating-point control registers and the address we should return to when switching back to this fiber. In a thread switch Windows does all this stuff for us, but a fiber switch saves and restores a bare baseline of context; anything beyond that is up to us to do if needed.

Actually doing it

Once there are multiple fibers set up on a thread, switching between them requires a call to the Windows function SwitchToFiber() with the pointer to the fiber data of the incoming fiber. While the name is deceptively similar to SwitchToThread(), these are entirely different creatures. In the latter case we yield control to the Windows scheduler, which decides which thread to switch to, but with SwitchToFiber() we are completely in control of what runs next. Of course, Windows may preemptively stop any thread when it needs to borrow the CPU for running other threads, but in fiber mode, once that thread returns to the stage it will be sporting the costume of our fiber, continuing where it left off. Windows never makes decisions about swapping fibers on and off a thread, but instead only schedules threads.

In fact, if we never call SwitchToFiber(), we may as well have remained in thread mode, with one important caveat. When yielding a thread to the Windows scheduler or calling any wait function with a finite timeout, we can count on the call returning, i.e. the wait ending. However, a fiber can be completely starved by user code that cruelly refuses to switch to it. In other words, when using fibers you are given the freedom to shoot yourself in the foot.
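To make the "we are completely in control" point concrete, here is a toy model in Python. It does not use the Windows fiber API at all: generators stand in for fibers, and a `yield` naming the next fiber stands in for a SwitchToFiber() call. The names `run`, `ping`, `pong` and `wallflower` are invented for this sketch.

```python
from typing import Dict, Generator, List, Optional

# A toy "fiber": a generator that runs until it names the fiber to switch to.
Fiber = Generator[Optional[str], None, None]


def run(fibers: Dict[str, Fiber], start: str) -> None:
    """Drive all fibers on the calling thread; nothing here is preemptive."""
    current: Optional[str] = start
    while current is not None:
        try:
            # Resume the chosen fiber exactly where it last left off.
            current = next(fibers[current])
        except StopIteration:
            current = None  # the running fiber finished, so the show is over


def ping(log: List[str]) -> Fiber:
    for i in range(2):
        log.append(f"ping {i}")
        yield "pong"  # stand-in for SwitchToFiber(pong)


def pong(log: List[str]) -> Fiber:
    for i in range(2):
        log.append(f"pong {i}")
        yield "ping"  # stand-in for SwitchToFiber(ping)


def wallflower(log: List[str]) -> Fiber:
    log.append("wallflower ran")  # never reached: nobody switches to us
    yield None


def demo() -> List[str]:
    log: List[str] = []
    fibers = {"ping": ping(log), "pong": pong(log), "wallflower": wallflower(log)}
    run(fibers, "ping")
    return log


if __name__ == "__main__":
    print(demo())
```

The wallflower fiber is perfectly runnable, yet it never gets a line, because neither of the other two ever names it. That is the starvation caveat in miniature: no scheduler will step in on its behalf.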

Nuts and bolts

Keep in mind that the below description starts using the word “Worker” which I’ve avoided so far. A Worker (which predictably is an instance of the Worker class) could represent either a thread or a fiber, and most code couldn’t care less which one it is, but we’re now getting to a point where the difference matters, and I’ll only be describing the fiber case for the moment.

SQLOS context switches start with a call to SOS_Scheduler::SwitchContext(), typically made by SOS_Scheduler::SuspendNonPreemptive() as the indirect result of a blocking synchronisation method or a voluntary yield. SwitchContext() starts by invoking the housekeeping tasks like IO processing, timer expiration checks, and seeing if there is any runnable worker to hand over to.

Assuming that a suitable worker is identified – which could even be the current one if it came here due to quantum expiry but remains the only runnable candidate – control then moves to the Switch() method, which takes pointers to the outgoing and incoming workers as parameters.

Within Switch(), the very first action is to check if outgoing and incoming workers are the same. If so, we have an instant resume scenario, and the bare minimum is done to set up the worker to continue where it left off. This is irrespective of worker type – even though some quantum bookkeeping needs doing, there is no context switching, and very little cost for the courtesy of an attempted SOS_SCHEDULER_YIELD.
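The control flow described so far can be sketched as a toy in Python. To be clear, this is not SQLOS code: `ToyScheduler`, its members and the event strings are all made up for illustration, and the only things taken from the text are the shape of the flow (pick a runnable worker, then either resume instantly or do a real switch) and the assumption that a suitable worker is found.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Worker:
    name: str


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for a scheduler; none of this mirrors real SQLOS members."""
    current: Worker
    runnable: Deque[Worker] = field(default_factory=deque)
    events: List[str] = field(default_factory=list)

    def switch_context(self, still_runnable: bool) -> Worker:
        outgoing = self.current
        if still_runnable:
            # A voluntary yield: the outgoing worker rejoins the back of the queue.
            self.runnable.append(outgoing)
        # Housekeeping (IO processing, timer checks) would go here.
        # Assumption, as in the text: a runnable worker exists at this point.
        incoming = self.runnable.popleft()
        self.switch(outgoing, incoming)
        return incoming

    def switch(self, outgoing: Worker, incoming: Worker) -> None:
        if incoming is outgoing:
            # Same worker on both sides: no context switch, just carry on.
            self.events.append(f"instant resume: {incoming.name}")
            return
        self.events.append(f"fiber switch: {outgoing.name} -> {incoming.name}")
        self.current = incoming


def demo() -> List[str]:
    a, b = Worker("A"), Worker("B")
    sched = ToyScheduler(current=a)
    sched.switch_context(still_runnable=True)  # A yields, but is the only candidate
    sched.runnable.append(b)                   # B becomes runnable
    sched.switch_context(still_runnable=True)  # A yields again, B is ahead of it
    return sched.events


if __name__ == "__main__":
    print(demo())
```

The first yield costs next to nothing because the worker finds itself at the head of the queue; only the second one reaches the branch where a real fiber switch would happen.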

If we’re not doing an instant resume, the dreaded context switch is upon us. Fortunately this is very simple for a fiber switch, and is encapsulated in a call to SOS_Scheduler::SwitchToFiberWorker(). For the moment, I am ignoring suspend and resume housekeeping done in the TaskTransition() method – we’ll get to that on another occasion. Here are pseudocode highlights of SwitchToFiberWorker(), taking OW as the outgoing worker and IW as the incoming worker:

While comparatively simple to follow, note that everything other than that final SwitchToFiber() call represents framework support added by SQLOS on top of the bare Windows switching implementation. It is also worth reflecting on things like the setting of the ambient SystemThread’s hToken member as something distinct from actually setting the current thread’s impersonation token (SetThreadToken()).

This is the nature of building abstractions: the CPU is a raw force that just Does Stuff, irrespective of whether you view and control its workings through the abstraction of a thread. At a higher level of abstraction, the thread is that raw force, upon which we build the abstraction of fibers, viewed through the object-coloured lenses of the Worker class. When you’re inside the abstraction, you’re just twiddling member variables and it feels like so much Lego construction – surely this can’t be the real thing? But then you find yourself outside and on top of it, e.g. within SQL Server storage engine code, and suddenly things like Workers and SystemThreads are forces of nature themselves.

Enter SQLOS stage left

The funny thing is that this ability to switch between tasks at will is exactly what we want from the SQLOS user mode scheduler. One task runs up to the point where it knows it can’t go any further, it calls into the context switching code that chooses the next runnable task, and control is transferred to the fiber running that task. The handover is straightforward and it’s clean. One nearly gets to the point of asking “Who needs a stinkin’ operating system?”

Broadly speaking, the implementation of cooperative scheduling in fiber mode flattens a cumbersome hierarchy. Each scheduler has an associated CPU and keeps running just one thread, switching between fibers as the need arises, and occasionally being interrupted by being preemptively scheduled off the CPU and then back on again. The flattened mental model thus looks like this: A scheduler is a thread, which more-or-less stays on a CPU forever, and flips between runnable fiber workers, with each worker running a task. In other words, we’re back at the point where we can simply say “the CPU runs task A, then task B”.

So on the one hand we have the apparent complexity of having multiple fibers on a thread, but on the other we have the simplicity of just one main thread per scheduler. It’s all rather equivalent in the end.

Bonus material: Thread ID as ownership signature

You may recall that a SQLOS spinlock (both traditional and ReaderWriterSpinlock flavours) has two possible states. When it is unacquired, the lower 32 bits contain the value zero, but once acquired exclusively, they contain the Windows thread ID of the owning worker. Fibers put us in an interesting situation: since all fibers on a scheduler may share the same thread, and hence the same thread ID, how do we know which one is the owner?

The simple answer is that it doesn’t matter. However, if you really cared in a debug or memory dump scenario, it would be the fiber worker that currently owns the scheduler, and since an active thread is bound to a scheduler, the thread ID implies the scheduler. One of the reasons it doesn’t matter is that spinlocks aren’t recursive, and an attempt to reacquire a spinlock by a worker that already owns it (whether a thread or fiber worker) is doomed to self-deadlock anyway. So a fiber that already owns the spinlock would never need to care about the question.

A more important reason is that SQLOS-compliant code must never indulge in cooperative context switching while holding a spinlock. It is in the nature of spinlocks that they must only be held for the briefest possible time, and going to sleep while holding one is a mortal sin. Now we can’t completely stop a hardware interrupt or the Windows scheduler from preempting our spinlock-owning thread, but until this thread wakes up, no other work is going to get done by this SQLOS Scheduler. In this light, it is clearly an academic issue to worry about the thread ID being ambiguous.
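For the curious, the ownership signature can be modelled in a few lines of Python. This is a toy, not the SQLOS spinlock: `ToySpinlock` is a hypothetical class, an ordinary lock stands in for the atomic compare-and-swap, `threading.get_ident()` stands in for the Windows thread ID, and `try_acquire()` gives up immediately where a real spinlock would keep spinning.

```python
import threading
from typing import Tuple


class ToySpinlock:
    """Lock word is 0 when free, otherwise the thread ID of the owner."""

    def __init__(self) -> None:
        self._word = 0
        # Stand-in for an atomic compare-and-swap on the lock word.
        self._cas_guard = threading.Lock()

    def try_acquire(self) -> bool:
        me = threading.get_ident()  # documented to be a nonzero integer
        with self._cas_guard:
            if self._word != 0:
                # Already owned, possibly by "us": there is no recursion check.
                return False
            self._word = me
            return True

    def release(self) -> None:
        with self._cas_guard:
            self._word = 0

    @property
    def owner(self) -> int:
        return self._word


def demo() -> Tuple[bool, bool, bool, int]:
    lock = ToySpinlock()
    first = lock.try_acquire()   # "fiber A" takes the lock
    stamped_with_our_thread = lock.owner == threading.get_ident()
    second = lock.try_acquire()  # "fiber B" on the same thread tries too
    lock.release()
    return first, second, stamped_with_our_thread, lock.owner


if __name__ == "__main__":
    print(demo())
```

Both attempts come from the same thread, so the lock word cannot tell fiber A from fiber B, and it does not need to: the second attempt fails regardless of who is asking. In a real spinlock that failed attempt would spin forever, which is the self-deadlock mentioned above.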

Further reading

Linchi Shea has done some benchmarking of the potential performance advantages of fiber mode.