11-30-11 - Basic sketch of Worker Thread system with dependencies

You have a bunch of worker threads and work items. Work items can be dependent on other work items, or
on external timed events (such as IO).

I've had some trouble with this for a while; I think I finally have a scheme that really works.

There are two queues :

RTR = ready to run : no dependencies, or dependencies are done
NR = not ready ; dependencies still pending

Each queue has an associated semaphore to count the number of items in it.

The basic work popping that each worker does is something like :

// get all available work without considering sleeping -
while ( try_wait( RTR_sem ) )
{
    pop RTR_queue and do work
}
// (optionally spin a few times here and check RTR_sem)
// I may have to sleep -
wait( RTR_sem OR NR_sem ); // (*1)
if ( wakeup was from RTR_sem )
{
    pop RTR_queue and do work
}
else
{
    NRI (not ready item) = pop NR_queue
    deps = get dependencies that NRI needs to wait on
    wait( deps OR RTR_sem ); // (*3)
    if ( wakeup was from RTR_sem )
    {
        push NRI back on NR_queue and post NR_sem // (*4)
        pop RTR_queue and do work
    }
    else
    {
        // wakeup was because deps are now done
        // NRI should be able to run now, so do it
        (*2)
    }
}

*1 : the key primitive here is the ability to do a WFMO OR wait, and to know which one of the items signalled you.
On Windows this is very easy, it's just WaitForMultipleObjects, which returns the guy who woke you. On other platforms
it's trickier and probably involves rolling some of your own mechanisms.

Note that I'm assuming the semaphore Wait() will dec the semaphore at the time you get to run, and the OR wait
on multiple semaphores will only dec one of them.
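On platforms without WaitForMultipleObjects, one way to get the OR wait is to let the two counts share a single mutex and condition variable, so a waiter blocks until either count is nonzero and learns which one it decremented. This is a minimal sketch of that idea, not the author's implementation; the DualSemaphore name and wait_either interface are mine:

```cpp
#include <condition_variable>
#include <mutex>

// Two counting semaphores sharing one waiter list, so a thread can
// block on "either has a count" and find out which one signalled it.
struct DualSemaphore {
    std::mutex m;
    std::condition_variable cv;
    int count[2] = {0, 0};   // count[0] = RTR_sem, count[1] = NR_sem

    void post(int which) {
        { std::lock_guard<std::mutex> lk(m); ++count[which]; }
        cv.notify_one();
    }

    // Blocks until either count is nonzero; decrements exactly one of
    // them and returns its index, like WaitForMultipleObjects with
    // bWaitAll = FALSE returning the signalled handle.
    int wait_either() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return count[0] > 0 || count[1] > 0; });
        int which = (count[0] > 0) ? 0 : 1;  // prefer RTR when both are up
        --count[which];
        return which;
    }
};
```

Note the dec happens inside the wait, under the shared lock, so exactly one count is consumed per wakeup, matching the semantics assumed above.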

*2 : in practice you may get spurious wakeups or it may be hard to wait on all the dependencies, so you would loop
and recheck the deps and possibly wait on them again.

How this differs from my previous system :

My previous system was more of a traditional "work stealing" scheme where each worker had its own queue and
would try to just push & pop work from its own queue. This was lower overhead in the fast path (it avoids
having a single shared semaphore that they have to contend on, for example), but it had a lot of problems.

Getting workers to go to sleep & wake up correctly in a work stealing scheme is a real mess. It's very hard
to tell when you have no work to do, or when you have enough work that you need to wake a new worker, because
you don't atomically maintain a work count (eg. a semaphore). You could fix this by making an atomic
pair { work items, workers awake } and CAS that pair to maintain it, but that's just a messy way of being a semaphore.

The other problem was what happens when you have dependent work. You want a worker to go to sleep on the
dependency, so that it yields CPU time, but wakes up when it can run. I had that, but then you have the
problem that if somebody else pushes work that can immediately run, you want to interrupt that wait on
the dependency and let the worker do the ready work. The semaphore OR wait fixes this nicely.

If you're writing a fractal renderer or some such nonsense then maybe you want to make lots of little work items
and have minimal overhead. But that's a very special purpose rare case. Most of the time it's much more
important that you do the right work when possible. My guiding principles are :

1. If there is no work that can be done now, workers should go to sleep (yield CPU)
2. If there is work that can be done now, workers should wake up
3. You should not wake up a worker and have it go back to sleep immediately
4. You should not have work available to do but the workers sleeping

Even in the "fractal renderer" sort of case, where you have tons of non-dependent work items, the only
penalty here is one extra semaphore dec per item, and that's just a CAS (or a fetch_add) assuming you use something like
"fastsemaphore" to fast-path the case of being not near zero count.
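The "fastsemaphore" idea can be sketched as an atomic count in front of a slow kernel-style wait (emulated here with a mutex and condition variable). The class and member names are mine; this is the general technique, not the author's exact code:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Atomic-count fast path: post and wait only touch the slow path
// (the condvar) when the count crosses zero, i.e. when someone is
// actually parked.
class FastSemaphore {
    std::atomic<int> count;   // >= 0 : available signals ; < 0 : -waiters
    std::mutex m;
    std::condition_variable cv;
    int wakeups = 0;          // pending slow-path signals
public:
    explicit FastSemaphore(int initial = 0) : count(initial) {}

    void post() {
        // when no one is waiting, this fetch_add is the whole cost
        if (count.fetch_add(1, std::memory_order_release) < 0) {
            std::lock_guard<std::mutex> lk(m);
            ++wakeups;
            cv.notify_one();  // a thread is parked; wake it
        }
    }

    bool try_wait() {
        int c = count.load(std::memory_order_relaxed);
        while (c > 0) {
            // fast path : just a CAS
            if (count.compare_exchange_weak(c, c - 1,
                                            std::memory_order_acquire))
                return true;
        }
        return false;
    }

    void wait() {
        if (count.fetch_sub(1, std::memory_order_acquire) > 0)
            return;           // fast path : count was positive
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return wakeups > 0; });
        --wakeups;
    }
};
```

The per-item cost in the loaded case is exactly the one CAS (or fetch_sub) mentioned above; the mutex and condvar are only reached when the count goes negative.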

There is one remaining issue, which is when there is no ready-to-run work, and the workers are asleep on
the first semaphore (they have no work items). Then you push a work item with dependencies. What will
happen in the code sketch above is that the worker will wake up, pop the not ready item, then go back to
sleep on the dependency. This violates article 3 of the resolution ("You should not wake up a worker and have it go back to sleep immediately").

Basically from *1 to *3 in the code is a very short path that wakes from one wait and goes into another wait;
that's always bad.

But this can be fixed. What you need is "wait morphing". When you push a not-ready work item and you go into
the semaphore code that is incrementing the NR_sem , and you see that you will be waking a thread - before you
wake it up, you take it out of the NR_sem wait list, and put it into the NRI's dependency wait list. (you leave
it waiting on RTR_sem).

That is, you just leave the thread asleep, you don't signal it to wake it up, it stays waiting on the same handle,
but you move the handle from NR_sem to the dependency. You can implement this a few ways. I believe it could be
done with Linux's newer futex operations (eg. requeue) which provide wait morphing. You would have to build your semaphore
and your dependency waiting on futex, which is easy to do, then wait morph to transfer the wait. Alternatively
if you build them on "waitset" you simply need to move an item from one waitset to the other. This can be done
easily if your waitset uses a mutex to protect its internals, you simply lock both mutexes and move the
waitable handle with both held.
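The waitset version of the morph can be sketched like this; the Waitset/Waiter types and the morph_one helper are hypothetical names of mine, just illustrating "lock both mutexes and move the handle":

```cpp
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>

// Each blocked thread owns a Waiter handle; a Waitset is just a
// mutex-protected list of those handles.
struct Waiter {
    std::mutex m;
    std::condition_variable cv;
    bool signaled = false;
    void park() {   // thread sleeps here until its handle is signaled
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return signaled; });
    }
};

struct Waitset {
    std::mutex m;
    std::deque<std::shared_ptr<Waiter>> waiters;

    std::shared_ptr<Waiter> prepare_wait() {
        auto w = std::make_shared<Waiter>();
        std::lock_guard<std::mutex> lk(m);
        waiters.push_back(w);
        return w;   // caller then calls w->park()
    }

    void notify_one() {
        std::shared_ptr<Waiter> w;
        {
            std::lock_guard<std::mutex> lk(m);
            if (waiters.empty()) return;
            w = waiters.front();
            waiters.pop_front();
        }
        { std::lock_guard<std::mutex> lk(w->m); w->signaled = true; }
        w->cv.notify_one();
    }
};

// The wait morph: move one sleeping waiter from one waitset to another
// with both internals locked. The moved thread is never woken by this.
bool morph_one(Waitset& from, Waitset& to) {
    std::scoped_lock lk(from.m, to.m);
    if (from.waiters.empty()) return false;
    to.waiters.push_back(from.waiters.front());
    from.waiters.pop_front();
    return true;
}
```

So pushing a not-ready item would call morph_one(NR_sem_waitset, dep_waitset) instead of notifying: the worker stays asleep on the same handle, and the dependency's completion is what finally signals it.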

The net result with wait morphing is very nice. Say for example all your workers are asleep. You create a work
item that is dependent on an IO and push it. None of the workers get woken up, but one of them is changed from
waiting on work available to waiting on the dependency. When the IO completes it wakes that worker and he runs.
If somebody pushed a ton of work in the mean time, all the workers would be woken and they would do that work,
and the dependent work would be pushed back on the NR queue and set aside while they did RTR work.

ADDENDUM : at the spot marked (*4) :

push NRI back on NR_queue and post NR_sem // (*4)
pop RTR_queue and do work

In real code you need do something a bit more complex here. What you do is something like :

if ( NRI is ready ) // double check
{
    RTR_sem.post() // we woke from RTR_sem , put it back
    do NRI work
}
else
{
    push NRI onto NR_lifo and post NR_sem
    pop RTR_queue and do work
}

we've introduced a new queue, the NR_lifo, which is a LIFO (eg. a stack). Now whenever you get an NR_sem post, you do :

// NR_sem just returned from wait so I know an NR item is available :
NRI = NR_lifo.pop()
if ( NRI == NULL )
    NRI = NR_queue.pop()

the item must be in one or the other and we prefer to take from the LIFO first. Basically the LIFO is a holding area for
items that were popped off the FIFO and were not yet ready, so we want to keep trying to run those before we go back to
the FIFO. You can use a single semaphore to indicate that there is an item in either queue.
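The two-container pop can be sketched like this (single-threaded for clarity; in real code both containers sit behind the lock or lock-free structure that NR_sem guards, and the names NotReadyPool/WorkItem are mine):

```cpp
#include <deque>
#include <optional>
#include <vector>

struct WorkItem { int id; };

// Holding structure for not-ready items: fresh pushes go on the FIFO,
// items that failed a readiness check get parked on the LIFO, and the
// LIFO is always drained first so parked items are retried soonest.
struct NotReadyPool {
    std::deque<WorkItem>  fifo;  // NR_queue : freshly pushed NR items
    std::vector<WorkItem> lifo;  // NR_lifo  : popped but still not ready

    void push_new(WorkItem it)   { fifo.push_back(it); }
    void push_retry(WorkItem it) { lifo.push_back(it); }

    // Called after NR_sem says an item is available somewhere :
    std::optional<WorkItem> pop() {
        if (!lifo.empty()) {
            WorkItem it = lifo.back();
            lifo.pop_back();
            return it;
        }
        if (!fifo.empty()) {
            WorkItem it = fifo.front();
            fifo.pop_front();
            return it;
        }
        return std::nullopt;
    }
};
```

One NR_sem post per push into either container keeps the invariant that a returned wait always finds an item in one of the two.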

4 comments:

Text does not show how nr-task can end in rtr-queue so it might be obvious. The question is: Can one particular nr-task get stuck if it will regularly go through pop from nr-queue front and immediately push to its back?

"Text does not show how nr-task can end in rtr-queue so it might be obvious."

It doesn't, it stays in NR queue until its deps are done, at which point it will get executed at spot (*2)

"The question is: Can one particular nr-task get stuck if it will regularly go through pop from nr-queue front and immediately push to its back?"

Sort of; it shouldn't get stuck, eventually its deps will be done and it will be run and not re-pushed. You are correct however that as written I was popping off the front of a FIFO and putting on the back, which is not ideal.