So I'm happy with how my WorkMgr works for pure CPU work items. It has one worker thread per core, the Worklets can be dependent on other Worklets, and
it has a dispatcher to farm out Worklets using lock-free queues and all that.

(ASIDE : there is one major problem that ryg describes well, which is that it is possible for worker threads that are doing
work to get swapped out for a very long time, while workers on another core that could have CPU time can't find anything to do.
This is basically a fundamental issue with not being in full control of the OS, and is related to the "deficiency of Windows'
multi-processor scheduler" noted above. BTW this problem is much worse if you lock your threads to cores; because of that I advise
that on Windows you should *never* lock your threads to cores. You can use affinity to set the preferred core, but don't use the
exclusive mask. Anyway, this is an interesting topic that I may come back to in the future, but it's off topic so let's ignore
it for now).

So the funny issues start arising when your work items have dependencies on external non-CPU work. For concreteness I'm going to call this "IO"
(File, Network, whatever), but it's just anything that takes an unknown amount of time and doesn't use the CPU.

Let's consider a simple concrete example. You wish to do some CPU work (let's call it A), then fire an IO and wait on it, then do some more CPU
work B. In pseudocode form :

WorkletLinear
{
    A();
    h = IO();
    Wait(h);
    B();
}

Now obviously you can just give this to the dispatcher and it would work, but while your worklet is waiting on the IO
it would be blocking that whole worker thread.

Currently in my system the way you fix this is to split the task. You make two Worklets, the first does work A and fires the IO, the second
does work B and is dependent on the first and the IO. Concretely :
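Sketched in the same pseudocode (the dispatcher calls here are invented, but this is the shape of it) :

WorkletSplit1
{
    A();
    h = IO();
    w2 = NewWorklet( WorkletSplit2 );
    SetDependency( w2, this, h );  // w2 can only run after Worklet1 and the IO are done
    Dispatch( w2 );
}
WorkletSplit2
{
    B();
}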

so Worklet1 finishes and the worker thread can then do other work if there is anything available. If not, the worker thread goes to sleep
waiting for one of the dependencies to be done.

This way works fine, it's what I've been using for the past year or so, but as I was writing some example code it occurred to me that it's
just a real pain in the ass to write code this way. It's not too bad here, but if you have a bunch of IO's, like do cpu work, IO, do cpu work,
more IO, etc. you have to make a whole chain of functions and get the dependencies right and so on.
It's just like writing code for IO completion callbacks, which is a real nightmare way to write IO code.

The thing that struck me is that basically what I've done here is create one of the "ghetto coroutine" systems. A coroutine is a function call
that can yield, or a manually-scheduled thread if you like. This split-up Worklet method could be written as a state machine :
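e.g. something like this (a sketch; the requeue interface is invented) :

WorkletStateMachine
{
    switch ( state )
    {
    case 0:
        A();
        h = IO();
        state = 1;
        Requeue( this, dependent on h );  // come back here when the IO is done
        return;
    case 1:
        B();
        return;
    }
}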

In this form it's obviously the state machine form of a coroutine. What we really want is to yield after the IO and
then be able to resume back at that point when some condition is met. Any time you see a state machine, you should prefer a *true* coroutine.
For example, game AI written as a state machine is an absolute nightmare to work with. Game AI written as simple linear coroutines is
very nice :

WalkTo( box )
obj = Open( box )
PickUp( obj )

with implicit coroutine Yields taking place in each command that takes some time. In this way you can write linear code, and when some
of your actions take undetermined long amounts of time, the code just yields until that's done. (in real game AI you also have to handle
interruptions and such things).
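In C the classic way to fake this is the switch-on-saved-state hack (Duff's-device style). A minimal sketch — the crStart/crWait/crFinish names follow the usage below, but the exact macro definitions here are my own guess :

```c
/* stackless coroutine via switch; state must persist across calls
   (e.g. a member of the Worklet) - locals do NOT survive a crWait */
#define crStart(state)  switch (state) { case 0:
#define crWait(state)   do { state = __LINE__; return 0; \
                             case __LINE__: ; } while (0)
#define crFinish        } return 1

typedef struct { int state; int didA, didB; } Worklet;

/* returns 0 = yielded (waiting on the IO), 1 = finished */
int RunWorklet(Worklet * w)
{
    crStart(w->state);

    w->didA = 1;        /* A(); then fire the IO */

    crWait(w->state);   /* yield; dispatcher re-runs us when the IO completes */

    w->didB = 1;        /* B(); */

    crFinish;
}
```

The dispatcher just calls RunWorklet again when the IO completes, and execution resumes right after the crWait.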

The switch-on-state hack gives us linear-looking code that actually gets swapped out and back in. Unfortunately, it's not
practical, because this C-coroutine hack doesn't preserve local variables, creates weird scopes all over,
and just isn't usable for anything but super simple code. (the switch method gives you stackless
coroutines; obviously Worklet can be a class and you could use member variables). Implementing a true (stackful) coroutine system
doesn't really seem practical for cross-platform (it would be reasonably easy to do for any one platform, you
just have to record the stack in crStart and copy it out in crWait,
but it's just too much of a low-level hacky mess that would require intimate knowledge of the quirks of
each platform and compiler). (you can do coroutines on Windows with fibers; I'm not sure that would be a viable
solution, because I've always heard "fibers are bad mmkay").

The next obvious option is a thread pool. We go ahead and let the work item do IO and put the worker thread to sleep,
but when it does that we also fire up a new worker thread so that something can run. Of course to avoid creating new
threads all the time you have a pool of possible worker threads that are just sitting asleep until you need them.
So you do something like :

WorkletThreadPool
{
    A();
    h = IO();
    ThreadPoolWait(h);
    B();
}
ThreadPoolWait(h)
{
    number of non-waiting workers --;
    CheckThreadPool();
    Wait(h);
    number of non-waiting workers ++;
    CheckThreadPool();
}
CheckThreadPool()
{
    if ( number of non-waiting workers < desired number of workers &&
         there is any work to do )
    {
        start a new worker from the pool
    }
    if ( number of non-waiting workers > desired number of workers )
    {
        put this worker to sleep in the pool
    }
}
// CheckThreadPool also has to be called any time a work item is added to the queue
// CheckThreadPool also has to be called any time a work item is added to the queue

or something like that. The desired number of workers would typically be the number of cores. You have to be very careful of the details
of this to avoid races, though races here aren't the worst thing in the world; they just mean you momentarily have slightly more or
fewer worker threads running than ideal.
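A sketch of just that bookkeeping in C with pthreads — the Pool struct and function names are invented, and the part where an over-target spare parks itself again is left out :

```c
#include <pthread.h>

/* invented sketch: count non-waiting workers, wake a parked spare
   when a worker is about to block on IO and we drop below target */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  spare_wakeup; /* parked spare workers sleep here */
    int running;                  /* number of non-waiting workers */
    int desired;                  /* target, typically = core count */
    int parked;                   /* spares asleep in the pool */
} Pool;

/* call just before a worker blocks on IO */
void pool_enter_wait(Pool * p)
{
    pthread_mutex_lock(&p->lock);
    p->running--;
    if (p->running < p->desired && p->parked > 0)
        pthread_cond_signal(&p->spare_wakeup); /* start a spare */
    pthread_mutex_unlock(&p->lock);
}

/* call when the IO completes and the worker runs again */
void pool_leave_wait(Pool * p)
{
    pthread_mutex_lock(&p->lock);
    p->running++;
    /* now possibly above target; an idle worker should park itself
       the next time it finds no work to do (not shown) */
    pthread_mutex_unlock(&p->lock);
}
```

The mutex around the counter is the simple version; the races the post mentions show up if you try to make this lock-free.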

This is a reasonably elegant solution, and on Windows is probably a good one. On the consoles I'm concerned about
the memory use overhead and other costs associated with having a bunch of threads in a pool.

Of course if you were Windows-only, you should just use the built-in thread pool system. It's been in Windows forever
in the form of IO Completion Port handling. New in Vista is a much simpler, more elegant thread pool that basically just
does exactly what you want a thread pool to do, and it's managed by the kernel so it's fast and robust and all that.
For example, with the custom system you have to be careful to use ThreadPoolWait() instead of the normal OS Wait(), and
you don't get nice behavior when something else puts you to sleep (like locking a mutex or whatever); the kernel-managed
pool can see those sleeps too and compensate.

6 comments:

I had a brief experience using windows fibers to work around limitations of the CUDA asynchronous API.

Each CUDA stream defines a sequence of asynchronous operations that need to be completed in order, so that in theory multiple streams could run in parallel. Ideally you would like to express the operations of each stream sequentially:
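Something like this (a sketch with real CUDA runtime calls, abbreviated) :

cudaMemcpyAsync( d_in, h_in, n, cudaMemcpyHostToDevice, stream );
kernel<<< grid, block, 0, stream >>>( d_out, d_in );
cudaMemcpyAsync( h_out, d_out, n, cudaMemcpyDeviceToHost, stream );

for each stream, and have the streams overlap. But to actually get overlap you have to interleave the calls from different streams by hand.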

So, you have to manually schedule your code, which is completely fragile, since the duration of the kernels and the copies is not necessarily known.

Using fibers I could define each stream sequentially in the run method of the fiber. Each asynchronous call would yield to a custom scheduler that decided what fiber to run next.

It was trivial to get it to work. The only problem was that it was confusing to debug: you would be stepping through the code in one stream, and as you stepped over an asynchronous call it would jump to a completely different fiber. But other than that it seemed fairly robust.

This was just a toy example, so maybe in practice there are other problems.

I wonder if you could use macros and the C-like 'variables must be declared at the beginning' rule to keep a counter of 'local variable stack space needed'; then in a declaration-end macro, malloc and fix up all the pointers to be relative to the malloc location... that'd effectively be like member variables, but mostly automated.

Yeah, copying works too. That's how Mono Continuations does it (disappointingly, it isn't very reliable with their garbage collector or soft debugger -- a great feature to 'embrace and extend' MS.NET, but they really need to make it _work reliably_...).

I was thinking you could avoid copying by having the macro resolve to (essentially)

int& a = *(int*)(state->data + 0);
int& b = etc...

with each macro adding the sizeof() of the variable being declared to some local 'offset' int, to generate these lines.

Coroutines/Fibers are what you want in this situation. Stack switching is the only 100% robust-by-default means of getting this done, um, elegantly (hate how programmers use that word). By elegantly, I mean it works with pointers pointing to variables on the stack without any fiddling. Stack copying just isn't robust, with or without pointer fixups. Macros for doing offset calculations is a code smell IMO.