Unsung SQLOS: the SystemThread

SystemThread, a class within sqldk.dll, can be considered to be at the root of SQLOS’s scheduling capabilities. While it doesn’t expose much obviously exciting functionality, it encapsulates a lot of the state that is necessary to give a thread a sense of self in SQLOS, serving as the beacon for any code to find its way to an associated SQLOS scheduler etc. I won’t go into much of the SQLOS object hierarchy here, but suffice it to say that everything becomes derivable by knowing one’s SystemThread. As such, this class jumps the gap between a Windows thread and the object-oriented SQLOS.

The simplest way to conceptualise a thread in SQL Server is to think of a spid or connection busy executing a simple query, old skool sysprocesses style. It’s not hip, but it’s close enough to the truth to be useful. This conflates a few things that are separate entities, but it is a good starting point for teasing things apart as needed:

We have a specification for how the query is to be prepared and served up, i.e. an executable recipe. This is the task, which is an instance of the SOS_Task class.

The task has a worker bound to it; this is an abstraction for something that can do the work, and predictably is an instance of the Worker class. Key point is that it’s largely just a bag of state related to the task.

This is starting to feel like a government subcontracting exercise. We’re now at another level of abstraction and outsourcing, namely the SystemThread, which isn’t in fact a thread at all, although it maps to one.

Finally, at the bottom of the food chain is an honest-to-goodness operating system thread doing the actual work! And it does this without the benefit of being properly object-oriented or having a notion of what SQLOS is.

Wait, I spoke too soon. Isn’t a thread just an abstraction for a CPU with a nice haircut and a set of medium-term goals?

Of the items listed above, there is only one pairing where they mate for life: the coupling of a SystemThread with an OS thread. We find serial monogamy between a SOS_Task and a Worker – the worker will normally outlive the task, and is then eligible to remarry after spending some time in the dating pool – but a long-term glut of widowed or otherwise unattached Workers will lead to some being culled. Under normal circumstances, there is long-term fidelity between a Worker and a SystemThread, which in turn cleaves unto an OS thread till death do them part. However, fiber mode takes us into promiscuous territory where one SystemThread can quickly service several Workers in round robin fashion, and that explains why we don’t talk about fiber mode in polite company.

Okay, that was weird. What is a SystemThread really?

Think of an OS thread as the muscle in the family. It will run anything you request of it until it blocks, suspends, is explicitly killed, or naturally dies by exiting its top-level thread function. However, it is naked and uncouth, so SQLOS dresses it in a SystemThread instance: a #sqlkilt which conveniently comes with a sporran for keeping a few valuables in. Here then are the members of the SystemThread class as found in SQL Server 2014, which add up to a storage requirement of 312 bytes:

Eighteen pointer-sized slots of LocalStorage. This is the sporran, which is really just TLS available to the task du jour, keeping it away from the temptation of asking the underlying OS for TLS slots.

A ListEntry that links idle SystemThreads into a dispatcher list managed by the SystemThreadDispatcher living within a SchedulerManager, which in turn is associated with an SOS_Node. This list, along with other state within the SystemThreadDispatcher, is protected by an SOS_SYSTHREAD_DISPATCHER spinlock.

Another list entry that places it in a list directly belonging to an SOS_Node. This structure is more grown-up than a simple ListEntry, since it includes a reference count that allows us to track when it is safe to dispose of the SystemThread instance. This also contains a pointer to the parent SOS_Node, allowing traversal from list entry to list head, something which isn’t possible with a bare ListEntry.

A CPU id

The CPU affinity mask

A spinlock going by the name SOS_SYSTHREAD. This protects instance members where a single “update transaction” spans multiple member updates.

The address of the last SystemThread that signalled it

A pointer to the associated Worker

A status ID

A pointer to itself. While reminiscent of the Thread Information Block’s Self pointer, this actually serves to support a simple consistency check, making it easy to recognise when a purported SystemThread reference is an obvious dud.

The OS thread ID

A pointer to fiber data (only applicable in fiber mode)

An OS handle to the thread. By ensuring that there is at least this one outstanding handle, we remind Windows that it can’t consider destroying the thread, even if it has exited its entry function. And for the few cases where actual OS thread manipulation is required (which requires a handle), relevant code can get the cached handle here instead of having to jump through any hoops.

An optional handle to an impersonation token.

A handle to an event object associated with the thread. This event object is the key to user-mode scheduling (see the Ken Henderson UMS article listed in Further Reading for background).

A pointer to the associated SOS_Scheduler.

A set of bit flags.
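To make the self-pointer idea a bit more tangible, here is a minimal C++ sketch. Everything in it is hypothetical: the name SketchSystemThread, the field names, their types and their order are illustrative guesses rather than the real 312-byte layout. Only the eighteen pointer-sized slots and the notion of a self pointer used as a sanity check come from the member list above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily abridged stand-in for SystemThread. Field names, types
// and ordering are illustrative guesses, not the real 312-byte layout.
struct SketchSystemThread {
    void*               localStorage[18]; // the "sporran": pointer-sized slots
    void*               worker;           // associated Worker (opaque here)
    void*               scheduler;        // associated SOS_Scheduler (opaque here)
    std::uint32_t       osThreadId;
    std::uint32_t       status;
    SketchSystemThread* self;             // points back at this very instance

    // Fill in the identity fields and plant the self pointer.
    void Init(std::uint32_t threadId) {
        osThreadId = threadId;
        status = 0;
        self = this;
    }
};

// The consistency check: a genuine instance points at itself, so a null
// pointer, zeroed memory or a stray bitwise copy is recognised as a dud.
inline bool LooksLikeSystemThread(const SketchSystemThread* candidate) {
    return candidate != nullptr && candidate->self == candidate;
}

// Example consumer that refuses to trust an unchecked reference.
inline std::uint32_t ThreadIdOf(const SketchSystemThread* candidate) {
    assert(LooksLikeSystemThread(candidate));
    return candidate->osThreadId;
}
```

The check is cheap (one load and one compare), which is presumably the appeal: it won’t catch every corrupt reference, but it catches the obvious ones.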

If you thought I was describing the contents of sys.dm_os_threads here, you’re very close to the mark. That DMV is essentially a list of SystemThread objects, and since there is a 1:1 relationship between an OS thread and a SystemThread, the DMV can join these, yielding a list of “things which represent threads”, displaying a combination of state which lives in the SystemThread object and state which belongs to the actual thread.

DLL callbacks and the reproductive habits of SQLOS threads

This is where things get magical. As I’ve mentioned, there is a 1:1 relationship between SystemThreads and threads, and we know that the SystemThread can expose its associated thread by handle or ID. So how does the relationship work in the other direction, and what does SQLOS do to stop thread management being a burden on the SQL Server developers who are clients of SQLOS functionality?

Thread creation is one of the process lifecycle events for which a DLL callback can be registered. What this means is that a developer gets to write something akin to an OnThreadCreate event handler for any DLL(s) within a process. The handler executes within the context of the newly created thread (i.e. as that thread) once all the thread structures have been created and the thread is viable for execution, but before the entry point supplied to the thread creation function is called. In other words, whatever function is supplied for a thread creation callback will be the first user code run by every newborn thread, irrespective of what the thread was created to do. If you want to apply the idea of the Finnish baby box to Windows programming, giving every thread a good start in life, this is where you do it.

The thread creation callback for sqldk is the static method SystemThread::DllMainCallback, which defers the task of packaging up that baby box to SystemThread::MakeMiniSOSThread. In order to successfully construct a SystemThread instance, MakeMiniSOSThread asks the MiniSOSThreadResourcesMgr for a MiniSOSThreadResources object, which is itself a packaged set of three prerequisites:

An event object, which is an actual kernel object; this is key to the interaction that user-mode scheduling will have with the Windows scheduler in thread mode.

A Worker instance. I presume this is built on the premise that a SystemThread without an associated Worker is useless (true for both thread and fiber mode), and there is no point deferring its creation to a later point where an allocation might fail.

An SOS_Task instance. A similar argument likely applies here; although workers and tasks can live separate lives, it may be that this cached task (of which each worker can have one) exists as a low-resource fallback. To be honest though, I’m not sure about the semantics of cached tasks yet.

With these three allocated objects in hand, there’s no further risk of resource allocation failure, and the actual SystemThread object is now constructed using the MakeSystemThread factory method. This is a case of finding the 312 bytes of memory needed for the instance and initialising it, including binding it to its event object. To finish off, MakeMiniSOSThread does some basic Worker and SOS_Task state initialisation, binds the lot of them to a default scheduler and publishes the thread_attached XEvent. All of this happens before the first proper function call at the top of the thread stack, i.e. the construction activity is not exposed in stack dumps.

If you only take one factoid away from all of this, make it this one: The SystemThread is stored in thread-local storage at a slot location which is globally known. Let it sink in, because I really think this is one of the most fundamental concepts in SQLOS. No matter what code is running, what class it lives in, which DLL, or which area of SQLOS or the application (SQL Server) built on top of it; it is always possible to go to this well-known TLS slot to ask that all-important question: Who am I?

This then is the state of affairs that every single thread running under SQLOS is born as, whether a planned child or not. The thread hasn’t properly started running by having its entry function called with the supplied parameter, and for the threads we’re interested in, some of the above will get modified within the entry function. All the same, post SQLOS boot, each thread in the process will always be able to relate itself to a SystemThread object and hence to a worker, a task, a scheduler, a node and a memory object.
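The “Who am I?” pattern can be sketched in portable C++, with the caveat that this is an analogy and not how sqldk does it. The sketch uses the language’s thread_local storage as a stand-in for the well-known TLS slot, and an explicit ScopedAttach object as a stand-in for the thread creation callback. All names here (SketchThreadIdentity, CurrentSlot, WhoAmI, ScopedAttach, SchedulerOfCaller) are hypothetical.

```cpp
#include <cassert>
#include <thread>

// Hypothetical per-thread identity; the real SystemThread carries far more.
struct SketchThreadIdentity {
    int schedulerId;
    int workerId;
};

// Stand-in for the globally known TLS slot: one location, a separate value
// per thread.
inline SketchThreadIdentity*& CurrentSlot() {
    thread_local SketchThreadIdentity* slot = nullptr;
    return slot;
}

// Callable from any code on any thread, with no parameters threaded through.
inline SketchThreadIdentity* WhoAmI() { return CurrentSlot(); }

// Stand-in for the thread creation callback: publishes the identity before
// the thread's real work starts, and withdraws it afterwards.
class ScopedAttach {
public:
    explicit ScopedAttach(SketchThreadIdentity* id) {
        assert(id != nullptr);
        CurrentSlot() = id;
    }
    ~ScopedAttach() { CurrentSlot() = nullptr; }
    ScopedAttach(const ScopedAttach&) = delete;
    ScopedAttach& operator=(const ScopedAttach&) = delete;
};

// "Deep" code that knows nothing about its caller, yet finds its scheduler.
inline int SchedulerOfCaller() {
    SketchThreadIdentity* me = WhoAmI();
    return me ? me->schedulerId : -1;
}

// Two threads ask the same question and get different answers.
inline bool DemoTwoThreads() {
    SketchThreadIdentity a{1, 10};
    SketchThreadIdentity b{2, 20};
    int seenA = 0;
    int seenB = 0;
    std::thread ta([&] { ScopedAttach attach(&a); seenA = SchedulerOfCaller(); });
    std::thread tb([&] { ScopedAttach attach(&b); seenB = SchedulerOfCaller(); });
    ta.join();
    tb.join();
    return seenA == 1 && seenB == 2;
}
```

The point of the design is that SchedulerOfCaller takes no arguments at all: identity is ambient, so nothing has to be passed down the call stack.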

SystemThread allocation and TLS

None of the fancy SQLOS memory management we pretend to understand, but secretly find unbelievably confusing, has yet entered the picture at this point. This is because we’re at the basement level of SQLOS, dealing with code that doesn’t want to rely on that grand edifice. Thus memory allocation for things like the MiniSOSThreadResources object is done using the bare Windows function HeapAlloc from the default process heap, admittedly sugared a bit by being delegated to the MiniSOSThreadResourcesMgr which caches and dispenses these objects.

However, we have one fascinating exception. Recall that the creation of a thread includes allocating various structures in kernel and user memory space, including the Thread Environment Block, and that the first 512 bytes (64 slots) of thread-local storage is a contiguous block within the TEB. Now the polite textbook way of allocating a 312-byte SystemThread would be to allocate memory, construct the object and then save a pointer to it in TLS. I’m fairly certain I’ve seen folks bristling at the idea of using more slots than strictly necessary, given what a precious process-global resource TLS is. Well, bristle away.

In SQL Server 2014, the allocation of TLS space is done in static initialisation of the SystemThread class. This can go two ways:

We greedily try to reserve 312 bytes’ worth of contiguous slots, i.e. 39 slots. If this fails, there is always the second option, which only requires one slot (and which doesn’t have to be one of the first 64 either). But if it succeeds, instead of storing a pointer to the allocated SystemThread in one slot, we just build each SystemThread right there inside the TLS array within the Thread Environment Block, spanning 39 slots. SystemThread storage has now essentially been preallocated for every thread that will ever be created in the process.

The textbook way: we reserve a slot here at static initialisation, knowing that every SystemThread instance will require a 312-byte memory allocation to live in, and this memory will have to be found every time a thread is created. To smooth things a bit, we preallocate and cache such chunks using the dedicated SystemThreadPool class which then owns the HeapAlloc problem. When an allocation is required within the thread creation callback, it is dispensed from this cache.
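The “preallocate and cache” idea behind the textbook option can be illustrated with a small free-list pool. This is a sketch under assumptions, not a reconstruction of SystemThreadPool: the class name SketchChunkPool, its methods, the use of malloc in place of HeapAlloc, and the mutex are all hypothetical choices made to keep the example portable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Hypothetical cache of fixed-size chunks, loosely in the spirit of a pool
// that "owns the heap allocation problem". malloc stands in for HeapAlloc.
class SketchChunkPool {
public:
    SketchChunkPool(std::size_t chunkSize, std::size_t preallocate)
        : chunkSize_(chunkSize) {
        for (std::size_t i = 0; i < preallocate; ++i) {
            if (void* chunk = std::malloc(chunkSize_)) {
                free_.push_back(chunk);
            }
        }
    }
    ~SketchChunkPool() {
        for (void* chunk : free_) std::free(chunk);
    }
    SketchChunkPool(const SketchChunkPool&) = delete;
    SketchChunkPool& operator=(const SketchChunkPool&) = delete;

    // Dispense from the cache when possible; otherwise fall back to the heap.
    void* Acquire() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (!free_.empty()) {
            void* chunk = free_.back();
            free_.pop_back();
            return chunk;
        }
        return std::malloc(chunkSize_);
    }

    // Return a chunk to the cache for reuse by a later thread.
    void Release(void* chunk) {
        assert(chunk != nullptr);
        std::lock_guard<std::mutex> guard(mutex_);
        free_.push_back(chunk);
    }

    std::size_t Cached() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return free_.size();
    }

private:
    std::size_t chunkSize_;
    mutable std::mutex mutex_;
    std::vector<void*> free_;
};
```

With a pool like this, the cost of finding 312 bytes at thread creation is usually a pop from a list rather than a trip to the heap, which is the smoothing effect described above.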

The first option, when possible, has a potential process-wide performance advantage, apart from the convenience of using memory already allocated for the thread. Knowing nothing other than the thread-specific TEB address (which can be retrieved quickly from GS:[30h]) and the process-constant TLS offset from here (retrieved from a global variable and then added to the TEB address) it takes only three assembly language instructions to retrieve the address of the ambient SystemThread, aka “Who am I?”. Because it is so short, it is perfectly suited to inlining rather than being called as a function:

mov rcx, GS:[30h]
mov rdx, qword ptr [globalTLSoffset]
add rcx, rdx

This takes only a handful of bytes, and because it is pure linear code, it can’t participate in branch misprediction. And although this isn’t the kind of code you would find in tight inner loops where those kinds of things matter, the pattern clearly holds a geeky charm. Admittedly, the need to cover both this and the “textbook” case (calling the Windows function TlsGetValue) each time we reach the “Who am I?” point means that the full expression is a bit longer and tarnished by requiring conditional evaluation and branching.
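To see the shape of that “full expression”, here is a portable C++ sketch of the two-way lookup. It models the TEB’s 64 inline slots as a plain byte array and does not touch the real TEB, GS register or TlsGetValue. The names FakeTeb, TlsPlan and WhoAmI, and the idea of recording the choice in a small plan struct, are hypothetical; only the 312-byte size, the 8-byte slots and the 64-slot block come from the discussion above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kSystemThreadBytes = 312;
constexpr std::size_t kSlotBytes = 8;    // one pointer-sized TLS slot on x64
constexpr std::size_t kInlineSlots = 64; // the contiguous block in the TEB
constexpr std::size_t kSlotsNeeded =
    (kSystemThreadBytes + kSlotBytes - 1) / kSlotBytes;
static_assert(kSlotsNeeded == 39, "312 bytes span 39 eight-byte slots");

// Model of the 512-byte inline TLS array; a stand-in for the real TEB.
struct FakeTeb {
    alignas(8) unsigned char tlsSlots[kInlineSlots * kSlotBytes];
};

// Decided once per process: build in place, or keep a pointer in one slot.
struct TlsPlan {
    bool        inPlace;
    std::size_t firstSlot;
};

// The branchy "Who am I?": either base-plus-offset arithmetic, or a load of
// the pointer stored in the slot (the job TlsGetValue would do).
inline void* WhoAmI(FakeTeb& teb, const TlsPlan& plan) {
    assert(plan.firstSlot < kInlineSlots);
    unsigned char* slotAddress = teb.tlsSlots + plan.firstSlot * kSlotBytes;
    if (plan.inPlace) {
        assert(plan.firstSlot + kSlotsNeeded <= kInlineSlots);
        return slotAddress; // the object lives right here in the slots
    }
    void* stored = nullptr;
    std::memcpy(&stored, slotAddress, sizeof stored);
    return stored;          // the object lives elsewhere on the heap
}
```

The in-place branch is the analogue of the three-instruction sequence: no memory is read beyond the base address and the offset. The other branch needs an extra load, and the branch itself is the tarnish mentioned above.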

While this little point isn’t material to a discussion of the SystemThread as a class, it does bring to light something I find very interesting. From my observations, SQL Server 2016 consistently plays by the book and only uses TlsGetValue instead of acting on its knowledge of Windows implementation details like 2014 does. With the benefit of hindsight, it is rather as if someone had made a high-level decision that SQLOS developers should stop encoding detailed assumptions about the operating system underlying SQLOS…

On a related note, consider the static function SystemThread::GetCurrentId, which returns the OS thread ID of the currently running thread, which of course indirectly also means the ambient SystemThread. In SQL Server 2014 this returns our old TEB friend GS:[48h], and is inlined whenever it occurs in sqldk. I have touched on this before in my spinlock post, but we now have more context to appreciate just how cheap the acquisition of an uncontended spinlock can be. To achieve this:

Get current thread ID as potential ownership signature
Atomically write this into the spinlock if the current value is 0, i.e. not acquired

the shortest valid sequence of assembly instructions, assuming we have the spinlock address in r8, would be something like:

Again, this code fragment is a very straight arrow. Without that inlining and insider TEB knowledge, getting the thread id involves multiple levels of indirection and branching, even when the instruction that is ultimately run is still the same GS:[48h] retrieval. So sqldk has always had an insider edge over the other SQL Server DLLs like sqlmin, since sqlmin couldn’t use that inlining and had to call SystemThread::GetCurrentId instead of the shortcut, and this additionally used a virtual function pointer lookup (standard thing when calling between DLLs). However, in 2016, GetCurrentId has also had to give up its insider advantage, and now calls the Windows function GetCurrentThreadId, which of course does the lookup in the TEB, although not without more indirection of its own. This means just that little extra hurdle for spinlock acquisition, both in sqldk and elsewhere. Does it matter? One would hope that the decision between clean abstraction and optimal performance didn’t mean losing out measurably on the performance side, but it is a great illustration of the kind of tiny compromises that likely involve multiple people with “architect” in their job titles shouting at each other. If you’ve ever read Showstopper you’ll have a sense of the historical roots here, although the idea of potentially abstracting away from the NT kernel altogether is a significant new twist.
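For readers who prefer a higher-level view, the two-step acquisition described earlier (take the current thread’s ID as an ownership signature, then atomically write it if the lock word is 0) can be sketched in portable C++. This is not SQLOS code: SketchSpinlock and CurrentThreadSignature are hypothetical names, and the address of a thread-local variable is used as the signature purely as an assumption, because it is cheap, nonzero and unique among live threads. On x64, compilers typically turn the compare-exchange into a single lock cmpxchg instruction.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a thread ID: the address of a thread-local
// variable is nonzero and distinct for every live thread.
inline std::uintptr_t CurrentThreadSignature() {
    thread_local char anchor = 0;
    return reinterpret_cast<std::uintptr_t>(&anchor);
}

// Minimal owner-stamped spinlock word: 0 means free, anything else is the
// signature of the owning thread.
struct SketchSpinlock {
    std::atomic<std::uintptr_t> owner{0};

    // Uncontended fast path: one atomic compare-and-swap, no loop.
    bool TryAcquire() {
        std::uintptr_t expected = 0;
        return owner.compare_exchange_strong(
            expected, CurrentThreadSignature(),
            std::memory_order_acquire, std::memory_order_relaxed);
    }

    // Only the owner may release; the stamp makes that checkable.
    void Release() {
        assert(owner.load(std::memory_order_relaxed) == CurrentThreadSignature());
        owner.store(0, std::memory_order_release);
    }
};
```

A real spinlock would wrap TryAcquire in a spin-and-back-off loop for the contended case; the sketch only covers the uncontended path, where the cost of obtaining the signature is a visible fraction of the whole operation.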

Final thoughts and next steps

Here then is my candidate definition of what the SystemThread class sets out to be:

Add state (e.g. references to an SOS_Scheduler and an associated event object) to an underlying OS thread.

Provide methods that encapsulate interaction with that underlying thread, meaning that other SQLOS code is insulated from the thread except as a tame abstraction.

Embed itself within the OS thread by being stored in thread-local storage, whether entirely or as a reference to a separately allocated chunk of memory.

Supply an entry point which helps to build the abstraction layers that take us away from thinking about OS threads and elevate our mental model to the level of SQLOS schedulers invoking tasks.

I covered a reasonable fraction of the first three points in this post, and will dive into the last one in a future post, which is also my excuse for not talking about who creates threads, and what happens within that initial thread function when one starts running properly. If I left you feeling that there is too much detail to get a grip on, or perhaps that the above was superficial, welcome to my world! The deeper you dive, the more respect you gain for the unsung work of others who build the stuff you uncover.

Further reading

I’d like to call out three older resources, because IT folks are quick to assume that older means irrelevant:

Ken Henderson’s UMS chapter from The Guru’s Guide to SQL Server Architecture and Internals. The book is an interesting reference for contextualising SQL Server as a Windows application, and there are still valuable nuggets in there for the patient reader. The UMS chapter is here and is a fantastic reference, as well as a reminder of how much the UMS is one of SQLOS’s crown jewels. Scheduling algorithms have moved on, and modern versions of SQL Server refactored the implementation, made it NUMA-aware, and wrapped it into this thing we call SQLOS, but the basic mechanics of what Ken describes are still valid.