Introduction

This article describes, as its title suggests, a way to provide the so-called single-threaded concurrency for complex programs with many components. First, we'll talk about concurrency and distinguish different types of it. In particular, we'll see that concurrency in not a synonym of multi-threading, and in many cases (actually - most of the cases), single-threaded concurrency is the way to go.

The approach that I want to describe in this article is something I've implemented several years ago, and since then, I've written dozens of applications using it, some of which are pretty complex. And, it proved to be extremely handsome and flexible.

I had always felt that there's a need for an article on this topic, and actually, I should have posted it a long time ago. Up until today, I have never seen a correctly designed complex application with concurrency. This may sound too loud, but unfortunately, this is the sad truth. I've worked in several companies, and everywhere, I have see tiny monsters awfully-overloaded with threads, without too much thought about synchronization, deadlock prevention and etc.

Concurrency

First, let's decouple concurrency and multithreading without digging too much into the theory.

Some operations take significant time to complete, and during this time, we want to be able to do something else; that's where concurrency is needed. Let's look at the two main reasons for operations to be time-consuming:

Operations that depend on network / hardware / user interaction / etc. They need time because they need to wait for something to happen.

For operations of the first kind, multithreading is preferred. This is because on MP (multi-processor) systems, several threads may actually be executed simultaneously. This way, higher overall performance is achieved.

However for operations of the second kind, multithreading is useless: creating several threads to wait for some conditions to occur (such as a network event or user interaction) won't help those conditions to occur faster. In fact, a single thread may wait for any of the specified conditions to occur and do what is necessary for any of them.

This is actually the most common case: 99% of all applications do nothing 99% of the time (at least they should). Creating multiple threads for such an application just to put all of them into a suspended state isn't good for the overall system performance (every thread context has a significant cost for the system, context switches, etc.). In addition, multi-threaded applications require some skills to write correctly, without all kinds of deadlocks, synchronization problems, or race conditions. So, single-threaded approach is preferred here.

The single-threaded approach, however, is a challenge to implement for complex applications with different components. When all those components work in different threads - there's no problem, every component does whatever it wants in its own thread, and the whole thing "works" parallel (as said before, actually it doesn't work at all most of the time). However, if we want those components to work in a single thread - this means that a particular component is not allowed to call a waiting function (such as WaitForXXXX), because by doing that, it disables other components. Instead, it should work as a state machine - respond to some event and return control immediately. And, if it does call a waiting function - it should wait for all the events of all the components in this thread. Moreover, it should pass control to the component whose event was signaled.

In conclusion, let's say these are where multi-threading is preferred:

Significant CPU time is needed.

Need to call an awkward synchronous (blocking) API.

Too complex to convert the components into state machines. This theoretically can be achieved by using fibers, but we won't discuss it here.

In all other cases, single-threading should be preferred.

The purpose of this article is to show a ThreadCtx library that implements the above design, so that multiple components may work in a single thread without direct interaction. We'll see that it's a very lightweight library, and it imposes minimum restrictions. The most important is that it's not a runtime, which steals control from you. You actually have the control over the program execution all the time. The only thing that this library restricts is blocking a component from responding to its events, and all the components are guaranteed to receive their notifications and not block each other. For this to work correctly, all the components must do all the scheduling and waiting via ThreadCtx, and never call waiting functions directly. The rest is up to the ThreadCtx library.

Wait? For what?

Before we discuss the API and the implementation of ThreadCtx, let's discuss some technical details about waiting conditions and functions. There're plenty of Win32 functions that may turn a thread into a suspended state (a.k.a. waiting) until some criteria is met. First of all, we'll classify those functions and look at what they actually wait for.

Timer (Sleep, also many other waiting functions accept timeout)

Win32 waitable handles (WaitForSingleObject)

Win32 window message (GetMessage, WaitMessage, etc.)

APC completion (SleepEx)

I/O completion port (GetQueuedCompletionStatus)

The last type (I/O completion port) is an advanced waitable object. It's designed especially to be used by multiple threads, and we won't discuss it here.

So, from the OS's point of view, a thread may wait for one (or more) of the above four types of conditions. Luckily, there're Win32 API functions that combine different types of conditions to wait for, so that when we need to wait for many different types of conditions - we may always use the appropriate API function:

So, theoretically, there's no problem to wait for any of those events, and several objects that need to wait for those events may be served in a single thread. However:

There's a limitation on the handles count for WaitForMultipleObjects (and similar), the MAXIMUM_WAIT_OBJECTS, which is currently defined as 64. But, this can be worked around: either ThreadCtx may create internal threads to wait for more, or WaitForMultipleObjects may be called with timeout=0 multiple times for all the waitable handles, and then if all the objects are not signaled - call Sleep(0) or SwitchToThread(). So, for the design of the ThreadCtx API, this limitation is irrelevant. It should guarantee the proper scheduling for arbitrary count of waitable handles, whereas the consumer of ThreadCtx should keep in mind that the overall performance may decrease if there're too many waitable handles; and if many synchronization objects are needed, another mechanism should be preferred. For instance, applications with many asynchronous I/O objects (such as sockets) should rather use the APC mechanism (which doesn't have this limitation) to track I/O completion.

The waiting function may receive only one timeout (whereas several objects need to be scheduled with different timers). There're API functions that enable multiple timers (such as SetTimer, CreateWaitableTimer and etc.) but they impose other restrictions. So, ThreadCtx will be responsible to manage timers internally, and ensure that the waiting functions are called with the appropriate (closest) timeout.

ThreadCtx design

As we said, objects will not call waiting functions directly; instead, they'll use the ThreadCtx API to do that. The per-thread state of ThreadCtxwill be stored in the thread TLS. For those who're not familiar with TLS: it's a "thread local storage", something like global variables; however, they're unique for every thread. TLS is a very lightweight and fast mechanism; from the performance point of view, accessing a TLS variable is comparable to accessing a regular variable.

The main API element of ThreadCtx is a Schedule. It's a base class that represents any waitable condition, and for each of the four condition types (timer, Win32 handle, Win32 message, APC), there's an appropriate derived class with the appropriate extra methods.

Signaled property. When set, this means that the waitable condition is satisfied, yet not processed (more on this later).

OnSchedule virtual function. This function is invoked when ThreadCtx detects that the waitable condition is satisfied. If overridden, it may do the in-place processing of the event. If it sets the Signaled property (like the default implementation does), this means that in-place processing was not enough, and the control should return to the appropriate component (more on this later).

Wait function. This is a blocking function (and this is the only blocking function) that waits for the condition this Schedule represents. If this condition is satisfied, the function returns true. However, this function actually keeps in mind the waitable conditions of all other Schedules, and if any of them is satisfied before, and its in-place processing was not enough (Signaled is set), the Wait function returns false.

Now, a more detailed explanation. The main function is Wait. Actually, this is the only place where ThreadCtx gets involved. This function does the following:

Checks if there're signaled schedules. If so, go to step (5).

Look at all the schedules instantiated in the current thread, and perform the waiting for any of the conditions requested (via one of the Win32 API waiting functions discussed above).

After the waiting function returns, identify the satisfied condition, and process the appropriate schedule(s) (call the appropriate OnSchedule).

Go to step (1).

If the schedule on which Wait was called is in the signaled state, reset it (set Signaled to false) and return true. Otherwise, return false.

This may sound complicated so far, but in fact, there's nothing complex here. The idea is that Wait attempts to wait for the condition of the called schedule. If it returns true, the condition is met, and the caller may go ahead. But, if Wait returns false, it means that the condition was not satisfied, however it's impossible to wait any more, because another condition is satisfied and not handled so far.

The OnSchedule may be overridden. There, when it's called, you may do something without setting the Signaled property. This would mean that you handled the condition and allowed the caller of the Wait of some schedule (no matter which) to continue waiting for that condition. Such a behavior enables to do something "while waiting" for something else (a.k.a. background processing).

As we've said, there're four types of conditions. That's how they're defined:

ScheduleHandle is a Win32 waitable handle condition. When Handle is set to a non-NULL value - it's attached to a scheduling mechanism, so that Wait is aware of it.

ScheduleTimer is a single-shot timer with the standard precision of the Win32 waiting function (not a high-precision multimedia timer). When _Timeout is set to some value other than INFINITE, it's attached to a scheduling mechanism, so that Wait is aware of it.

NeedApc represents the wish to wait for a queued APC completion. Whenever this object is instantiated in the thread, the ThreadCtx is aware of that, hence the consequently called Wait function will select the APC-aware Win32 waiting function. Note: The NeedApc is not derived from Schedule. This is because the processing of the APC event is done automatically by the OS (the APC routine is called).

NeedMsg - same as with NeedApc. It represents the wish to wait for a Windows message in the messages queue, and this tells the Wait function to select the messages-aware Win32 waiting function. Whenever messages are detected in the messages queue, they're automatically dispatched (by the DispatchMessage), and the actual message processing is done by the appropriate window.

Below, we'll look at a couple of examples to make things clear.

ThreadCtx::ScheduleHandle abortSched;
abortSched._Handle = handleThreadStop; // this will be set
// if we should abort (by another thread for instance)
// Say, we have timeout for the I/O:
ThreadCtx::ScheduleTimer timeoutSched;
timeoutSched._Timeout = 5000;
CallIo();
// Perform some network I/O
void CallIo()
{
// Create a socket, associate it with the event
// handle (WSAEventSelect), and issue an operation
// ...
// Now wait for its completion
ThreadCtx::ScheduleHandle socketIoSched;
socketIoSched._Handle = handleSocketEvent;
// this will be set once the I/O is completed
if (!socketIoSched.Wait())
return; // abort/timeout
// Ok, the condition is satisfied (I/O completed).
// Continue as planned...
}

In this example, CallIo performs some blocking operation, which may be aborted by different conditions: either timeout expiration or the abort event signaled. The most important is that CallIo doesn't have to be aware of all the "aborting" conditions; it just attempts to do what it wants and the abortion happens automatically.

In this example, we initialize several objects, and then wait for the termination handle to become signaled. During this period, all the objects will process their events.

Something to remind

This design reminds the Windows messages system in some way. That is, you may create several windows and "run the message loop" which is a loop of GetMessage and DispatchMessage (with optional TranslateMessage, idle processing, and other things), and all the windows will get their messages and their window procedure will process them. Such windows may process some events asynchronously, without direct interaction between each other. Plus, windows could do some timed jobs, using per-window timers. In addition, there're some blocking calls, such as MessageBox, DialogBoxParam, GetOpenFileName and etc. They're wrapped in MFC by something like AfxMessageBox, DoModal etc. Those are all blocking calls; however, they run the message loop as well, and hence all the windows created by the thread process their messages as well.

This is similar to our design, though pretty awkward. Creating a window object is equivalent (in some way) to creating a Schedule in our design; the role of the window procedure now plays the virtual OnSchedule function. The major difference is that when you create a window, you receive all the messages in the world, most of which you just pass to DefWindowProc, whereas our schedule gets only one specific notification.

Calling one of the "blocking" functions (such as DialogBoxParam) is equivalent to the Wait function of our schedule. Exactly as in Windows API calling DialogBoxParam doesn't prevent other windows to receive their messages, in our API, calling Wait on one Schedule doesn't disable all the others.

Note: our Wait may be aborted (if another Schedule gets scheduled and its Signaled property is set), and this is designed like that in Windows too! In Windows, when you call GetMessage, it returns 0 if the encountered message is WM_QUIT, and by convention, the appropriate message loop should stop. You call the PostQuitMessage function, which automatically "aborts" the blocking calls. Means, the DialogBoxParam or MessageBox usually returns the result passed to EndDialog; however, if the WM_QUIT message is encountered, they return -1 to indicate the "aborted blocking operation".

In the times of Windows 3.11, things worked that way. In fact, anyone could block others from execution either by doing some long-duration things (such as heavy calculations) or call the GetMessage for a specific window only (actually, I have no idea why this was allowed), but that was known to be bad and not welcome of course, and everyone was aware of that. Things got screwed when Win32 arrived with all the threads, synchronization objects, APC queues, and many different waiting functions that wait for specified things only. In fact, the wait-for-anything function (MsgWaitForMultipleObjectsEx) was added in Windows 98 only.

Our design is not totally new. It just creates a convention that had to arrive with Win32. Plus, unlike the awkward messages subsystem, we have very clear definitions about how things behave. Plus, instead of one heavy window object that should be responsible for everything, we have lightweight primitives, each responsible for one specific thing.

ThreadCtx implementation

The implementation is very lightweight. Its extra cost is negligible in comparison with a call to any waiting function (which involves a kernel-mode transaction).

The ThreadCtx object must be instantiated in every thread that may access it (where Schedules are used). For instance, it can be created on the stack "at the beginning" of the thread. In the constructor, it saves its pointer in the TLS, and every function involved on a Schedule accesses it.

Note: optionally, this can be changed. The ThreadCtx object may be created in every thread on-demand. (That is, a first call to a Schedule automatically creates the ThreadCtx for the given thread). This method demands the per-thread uninitialization as well.

The ThreadCtx consists of the following:

List of pointers to NScheduleHandle objects. Every ScheduleHandle object automatically inserts itself into this list when its Handle is set to a non-NULL value, and removes itself from this list when it's set to NULL (obviously, it ensures removal from this list in d'tor).

Binary tree of ScheduleTimer objects. When the _Timeout of the ScheduleTimer object is set to something other than INFINITE - it inserts itself into that tree, whereas the tree key is the "wakeup" time of this timer (GetTickCount() + _Timeout). By this, before invocation of any waiting function, the ThreadCtx can identify the closest timer in logarithmic time (which is fast, of course). It correctly handles the ticks counter wraparound, as well as detects already-late timers.

A counter of instantiated NeedMsg objects, and a counter of instantiated NeedApc objects. Those objects increment/decrement those counters in c'tor/d'tor. By this, before invocation of a waiting function, ThreadCtx knows if the waiting should be messages/APC-aware (the appropriate counter is non-zero).

A counter of schedules which are in Signaled state. Upon state change, every schedule automatically increments/decrements this counter. So ThreadCtx always knows if there're signaled schedules before calling a waiting function.

When you call Schedule::Wait, it accesses the ThreadCtx and invokes a waiting function in a loop and schedules the appropriate Schedule object(s), until one of them sets the Signaled property. Every invocation of a waiting function goes in the following order:

Find the earliest timer (in the binary tree). If this timer is already late, remove it from the tree, schedule it, and return. If it's not late, calculate its timeout. And, if the tree is empty, the timeout is INFINITE.

Check the counters of instantiated NeedMsg objects and NeedApc objects. Based on this, select the appropriate waiting function.

Prepare all the Win32 waitable handles in an array that can be used as a parameter to the selected waiting function. If the count of objects don't exceed the MAXIMUM_WAIT_OBJECTS limitation, call the waiting function with all the objects. Otherwise, call it several times (every time, put different handles in the array) with timeout=0, and if the waiting function returns on timeout, call Sleep(0).

Now, analyze the return value of the waiting function. If it returns on either handle or timeout, schedule the appropriate Schedule. If it returns the Win32 message, do the PeekMessage/DispatchMessage loop.

Optionally, the ThreadCtx may be modified to perform additional steps for APC completion and Windows messages. For instance, we may want to have a global 'pre-translator' for window messages (or the per-thread messages handler).

Mixing with UI

Theoretically, this can be done. Means, the whole application may be done single-threaded, including the UI. What's needed for this to work correctly is to avoid calling blocking functions directly, such as GetMessage and WaitMessage. In particular, the program message loop should be rewritten:

Note: In the last case, we've used the base Schedule class, not its derivative. This schedule does not have an intrinsic (built-in) waitable condition, and it may only become signaled explicitly. Its main use is to execute "message loops" which may be interrupted explicitly.

So theoretically, all including the UI may be done in a single thread. However, in big projects with heavy UI, it may be a challenge to guarantee that all the blocking functions are executed in terms of ThreadCtx and never directly. Big UI projects are usually written using some runtime (such as MFC) which conceals the message loop completely. Theoretically, you may override the message loop (in MFC, it's possible to override CWinThread::Run, CWinThread::PumpMessage), but still it's risky, G*d knows how many such places there are. And, it's very easy to screw things: one call to MessageBox is enough to turn off the whole mechanism until the user closes the message box.

Another limitation is that it becomes impossible to use standard UI components that may do their message loop. This concerns all the standard Windows dialogs, such as GetOpenFileName, ChooseFont, ChooseColor, various ActiveX controls, and "console" functions such as scanf, getch (a.k.a. easy-win).

Theoretically, there're ways to "intercept" those functions calls (such as J. Richter's method for replacing import section addresses in an executable module) and do what's needed in a ThreadCtx-aware way. But all those methods are tricky, and not guaranteed to work always.

In addition, projects with "rich" UI tend to be heavy. That is, UI that in some circumstances actually consumes significant CPU time to be processed. So:

No problem to mix a "simple" (and well-controlled) UI with other stuff.

May be complex for "rich" UI, especially when used with some runtime. Better not to mix.

Writing complex objects

So far, we've used schedules as-is, or inherited them to override the OnSchedule to perform in-place processing of something. But, all our schedules are primitives - each only processes one specific event. What if we need to process various events of different kinds in a dependent way? There're some patterns to do so:

This may look complicated; however, there's nothing complex. The magic macro NESTED_SCHEDULE declares a class that is derived from a schedule type you pass as a parameter (third parameter to the macro), overrides OnSchedule, and there it invokes the function passed (fourth parameter) of the outer object. Another magic is how the declared class has the pointer to the outer object. This is due to the NESTED_OBJ macro, which gives the inner object the pointer to the outer object, based on the offsets at compile time.

Additional notes

First, it is worth to say that there're no limitations with re-entrance. Means, you may call Wait inside the OnSchedule which was executed in the context of another Wait (that might be executed inside another OnSchedule and so on). But, you should be careful with re-entrance, since program flow may become complicated there. Also, take into account that if one Wait is executed inside another Wait, and the Schedule of the outer Wait is Signaled before the inner one, the inner Wait will stop waiting and returns false. So, re-entrance is a complex matter, and if used, should be controlled very well.

The correct pattern should be state-machine. That is, most objects should not call Wait at all. They should do whatever is needed in OnSchedule and return immediately. One object per thread can be implemented in an "algorithmic" way, but not more.

Also worth to note that the whole mechanism is exception-proof. Means, when an exception is raised, it passes freely all the Wait/OnSchedule scopes and propagates directly to the catching block. And, if there're schedules in the way declared on stack, they're destroyed (like all C++ objects) and removed from the scheduling mechanism (in d'tors). So there's no problem.

All schedules may be allocated on either stack or heap, this doesn't matter. Of course, using stack memory is very much faster, and since schedules are very light-weight objects, it's better to allocate them on stack.

Application may be multi-threaded, and every thread may use the ThreadCtx library (it maintains per-thread state). However, all the operations on a particular schedule must be done in the same thread! It's very immoral and absolutely impossible to work with the same Schedule object in several threads.

In order to use ThreadCtx in DLLs - it must be modified. It uses TLS - hence it requires the global TLS index. Currently, this is achieved by decorating the per-thread pointer to the ThreadCtx by the __declspec(thread) semantics. But this won't work correctly with DLLs. Instead, the access of TLS must be explicit. That is, a global TLS index must be explicitly allocated (TlsAlloc). It must be shared with the DLL.

Conclusions

Well, this whole article may seem complicated on the first read, especially the implementation part. But in fact, it's not so complex. Just look at the implementation code: it's just about 350 lines of code!

One more thing: if you really intent to look at the code level - I recommend reading the following articles:

Share

About the Author

My name is Vladislav Gelfer, I was born in Kiev (former Soviet Union), since 1993 I live in Israel.In programming I'm interested mostly in low-level, OOP design, DSP and multimedia.Besides of the programming I like physics, math, digital photography.