12/03/2011

12-03-11 - Worker Thread system with reverse dependencies

That system is okay; I believe it works, but it has two disadvantages : 1. it requires some non-standard synchronization
primitives such as OR waits, and 2. it can fail to do work as soon as possible; that is, there can be moments when work
could be done but the worker that could do it is asleep. It's one of our design goals to never let that happen, so let's
see why it does :

The problem is basically the NR (not ready) queue. When we have no RTR (ready to run) work, we pop one item from the NR
queue and wait on its dependencies. But there could be other items later in the NR queue which become ready sooner.
If the items in the NR queue become ready to run in order, this doesn't occur, but if they can become ready in a different
order, we miss out on chances to do work.

Anyhoo, both of these problems go away and everything becomes much simpler if we reformulate our system in terms of
"forward dependencies" instead of "backward dependencies".

Normal "dependencies" are backwards; that is, A depends on B and C, which were created earlier in time. The opposite
direction link I will call "permits" (is there a standard term for this?). That is, B and C permit A. A needs 2 permissions
before it can run.

I propose that it is conceptually easier to set up work in terms of "dependencies", so the client still formulates work items
with dependencies, but when they are submitted to be run, they are converted into "permissions". That is, A --> {B,C} is changed
into B --> {A} and C --> {A}.
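To make the conversion concrete, here is a minimal C++ sketch. All of the names (WorkItem, submit, enqueue_rtr) are my own for illustration, not the actual system, and I use a std::vector for the permits list for brevity :

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical work item; "permits" holds the forward links.
struct WorkItem {
    int permit_count = 0;            // negative while not ready
    std::vector<WorkItem*> permits;  // items this one permits when it finishes
};

std::vector<WorkItem*> RTR_queue;    // stand-in for the real ready-to-run queue

void enqueue_rtr(WorkItem* item) { RTR_queue.push_back(item); }

// The client formulates A --> {B,C} as dependencies; submission stores the
// reverse links instead : B --> {A} and C --> {A}.
void submit(WorkItem* item, WorkItem** deps, size_t ndeps) {
    item->permit_count = -(int)ndeps;
    if (ndeps == 0) { enqueue_rtr(item); return; }
    for (size_t i = 0; i < ndeps; ++i)
        deps[i]->permits.push_back(item);
    // (a real implementation must also cope with deps that have already
    //  finished by the time we get here)
}
```

So submitting A with dependencies {B,C} leaves A held only by the permit links hanging off B and C, with a permit count of -2.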

The main difference is that there is no longer any "not ready" queue at all. NR items are not held in any global list; they are
only pointed to by their dependencies. Some dependency further back in the tree must be ready to run, and it acts as the root
that points through the various NR items via permission links.

Without further ado, let's look at the implementation.

The worker thread becomes much simpler :

worker :
wait( RTR_sem );
pop RTR_queue and do work

that's it! Massively simpler. All the work is now in the permissions maintenance, so let's look at that :
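The pseudocode above can be fleshed out into a runnable C++ sketch. The names are mine, the semaphore is a minimal condition-variable one standing in for RTR_sem, and an empty work function serves as a shutdown sentinel (the real system presumably shuts down differently) :

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Minimal counting semaphore (C++11) standing in for RTR_sem.
struct Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int count = 0;
    void post() { { std::lock_guard<std::mutex> g(m); ++count; } cv.notify_one(); }
    void wait() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&]{ return count > 0; });
        --count;
    }
};

Semaphore RTR_sem;
std::mutex RTR_mutex;
std::deque<std::function<void()>> RTR_queue;  // empty function = shutdown

void push_rtr(std::function<void()> fn) {
    { std::lock_guard<std::mutex> g(RTR_mutex); RTR_queue.push_back(std::move(fn)); }
    RTR_sem.post();
}

void worker() {
    for (;;) {
        RTR_sem.wait();                        // wait( RTR_sem )
        std::function<void()> fn;
        { std::lock_guard<std::mutex> g(RTR_mutex);
          fn = std::move(RTR_queue.front()); RTR_queue.pop_front(); }
        if (!fn) return;                       // sentinel : stop this worker
        fn();                                  // pop RTR_queue and do work
    }
}
```

The worker only ever blocks on the one semaphore; there is no waiting on dependencies anywhere in the worker loop.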

How do we maintain permissions? Each item which is NR (not ready) has a (negative) count of the # of permissions needed before it can run.
Whenever an item finishes, it walks its permission list and incs the permit count on each target item. When a count reaches zero, all
permissions are done and that item can now run.
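The completion walk might look like this in C++ (again my own naming; the count is atomic so that several finishing dependencies can race, and the plain vector standing in for the RTR queue would of course need to be the real locked or lock-free queue) :

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical NR item : permit_count starts at -(# of dependencies).
struct Item {
    std::atomic<int> permit_count;
    std::vector<Item*> permits;   // items this one permits
    explicit Item(int ndeps) : permit_count(-ndeps) {}
};

std::vector<Item*> ready;         // stand-in for the RTR queue

// Called when `finished` completes : walk its permission list and
// inc each target's count; whoever reaches zero enqueues the target.
void on_finished(Item* finished) {
    for (Item* target : finished->permits) {
        if (target->permit_count.fetch_add(1) == -1)
            ready.push_back(target);  // count just hit zero : all permits granted
    }
}
```

The fetch_add returning -1 is the "I made it zero" test, so exactly one finishing dependency enqueues the item even when several finish concurrently.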

A work item now has to have a list of permissions. In my old system I had just a fixed size array for dependencies; I found that [3]
was always enough; it's simply the nature of work that you rarely need lots of dependencies (and in the very rare cases that you do
need more than 3, you can create a dummy item which only marks itself complete when many others are done).
But this is not true for permissions: there can be many on one item.

For example, a common case is you do a big IO, and then spawn lots of work on that buffer. You might have 32 work items which depend on
the IO. This only needs [1] when expressed as dependencies, but [32] when expressed as permissions. So a fixed size array is out and
we will use a linked list.

The result is that when you submit not-ready items, they go into the permits lists of their dependencies; then as those
dependencies get done, the items' permit counts increment up toward zero, and when a count hits zero the item goes into the
RTR queue and gets picked up by a worker.

The behavior is entirely the same as the previous system except that workers who are asleep because they have no RTR work can wake up
when any NR item becomes RTR, not just when the single one they popped becomes RTR.

One annoyance with this scheme is you need to lock the item to maintain the permits_list ; that's not really a big deal (I use an
indexed lock system similar to Nt Keyed Events, I don't actually put a lock object on each item), but I think it's possible to
maintain that list correctly and simply lock-free, so maybe we'll revisit that.

ADDENDUM : hmm, not easy to do lock free. Actually maintaining the list is not hard, and even doing it while avoiding
races against the permitted count is not hard; the problem is that the list
is in the work item and items can be deleted at any time, so you either need to hold a lock on the item to prevent deletion, or
you need something like RCU or SMR.

We also do exactly this (we have a runtime library that our compiler generated C code calls into).

One way of avoiding the lock on the permits list, is to maintain a lock-free list of dependence objects in each task. When a task retires (finishes), it sets a finished flag in its record, goes through this list, does an atomic XCHG on the dependence item to set a flag, and if the XCHG succeeds decrements the dependence count of the dependent task.

When a task adds a dependence record to the task it depends on, it adds the record to the lock free list, does an MFENCE, checks if the task with the list has finished, and if so tries an atomic XCHG on the dependence record (and if it succeeds doesn't count the dependence in its local count).
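The claim race described in the two paragraphs above might be sketched like this (my own naming; this is my reading of the protocol, not the commenter's actual code; seq_cst atomics stand in for the explicit MFENCE, and the lock-free list itself is omitted) :

```cpp
#include <atomic>
#include <cassert>

// One record per dependence edge; claimed exactly once, by whichever
// side (retiring producer or registering consumer) gets there second.
struct DepRecord {
    std::atomic<int> claimed{0};
};

struct Producer {
    std::atomic<bool> finished{false};
    // (plus a lock-free list of DepRecord*, omitted here)
};

// Producer retires : publish the finished flag, then for each record in
// its list, race for the claim. Returns true if the producer should
// decrement the dependent's count for this record.
bool retire_claim(Producer& p, DepRecord& rec) {
    p.finished.store(true);              // seq_cst : visible before the XCHG
    return rec.claimed.exchange(1) == 0;
}

// Consumer registers a dependence : after linking rec into the producer's
// list, check the finished flag; if set, race for the claim. Returns true
// if the consumer should NOT count this dependence in its local count.
bool add_claim(Producer& p, DepRecord& rec) {
    if (p.finished.load())               // seq_cst load, after the link
        return rec.claimed.exchange(1) == 0;
    return false;                        // producer will claim it on retire
}
```

The invariant is that for any interleaving, either the consumer counts the edge and the producer later decrements it, or the consumer never counts it at all; the single XCHG winner decides which.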

One additional trick we do is to use a large bias in the dependence count of the task we are setting up, keep track of the dependences the task has, and then decrement the bias minus the actual count from its count. It avoids a bunch of more expensive atomic decrements...
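As I understand the bias trick, it goes something like the following sketch (my own naming; note this commenter's system counts down to zero, the opposite convention from the negative counts above). The bias keeps the count from reaching zero while setup is still in progress, even if some dependences retire concurrently, and setup itself needs no atomics until the final fixup :

```cpp
#include <atomic>
#include <cassert>

// Bias trick : initialize the count to a large bias, track the dependences
// with plain arithmetic while setting the task up, then fix the count with
// a single atomic subtraction at the end.
const int BIAS = 1 << 20;

struct Task {
    std::atomic<int> dep_count{BIAS};
};

// Finish setup : the task actually had `actual_deps` dependences, so
// subtract (BIAS - actual_deps) in one atomic op instead of doing one
// atomic increment per dependence. Returns the remaining count.
int finish_setup(Task& t, int actual_deps) {
    int fix = BIAS - actual_deps;
    return t.dep_count.fetch_sub(fix) - fix;
}
```

Any producer that retires during setup just decrements the (still heavily biased) count, so nothing fires early; the one fetch_sub at the end reconciles everything.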

My problem is that objects can get deleted, so I basically have to hold a lock on them to prevent deletion. I started to make the "permits list" lockfree and realized it was pointless because I need a lock anyway to ensure lifetime.

I've got a nice way to make the fast path (no dependencies) lock free, so I'm happy with that.

BTW it isn't clear to me that making it lock free is a performance win. The real places we suffered (under Linux), as the amount of work per task became small, were (1) the memory allocator, as you generate a whole bunch of task records, and (2) cache line contention on the work queue as tasks become ready (in practice one thread makes most tasks ready in our system).

We bought into lock-free reference counting for freeing objects to avoid the deleting issue you describe. And we don't actually free these types of objects, but simply add them to a queue to be recycled (see system allocator performance issues)...

Multiplexing the queue is the standard solution to reduce contention in the high-load case. In the low-load scenario (queues nearly empty) you have some amount of contention that you cannot avoid, but in the high-load scenario (# items >> # workers) you can make contention as small as possible (one cache line handoff is inevitable).
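A minimal sketch of what queue multiplexing means here (my own construction, not either commenter's code) : K sub-queues, each with its own lock, pushers spread round-robin, poppers scan from a rotating start point so they don't all hammer the same sub-queue :

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <mutex>

// Multiplexed queue : K sub-queues, each with its own lock, so pushers
// and poppers spread contention instead of all hitting one cache line.
const int K = 8;
std::mutex locks[K];
std::deque<int> queues[K];
std::atomic<unsigned> push_cursor{0}, pop_cursor{0};

void mq_push(int item) {
    unsigned i = push_cursor.fetch_add(1) % K;   // round-robin spread
    std::lock_guard<std::mutex> g(locks[i]);
    queues[i].push_back(item);
}

// Returns true and fills `out` if any sub-queue had an item.
bool mq_try_pop(int& out) {
    unsigned start = pop_cursor.fetch_add(1) % K;
    for (int k = 0; k < K; ++k) {                // scan all sub-queues
        unsigned i = (start + k) % K;
        std::lock_guard<std::mutex> g(locks[i]);
        if (!queues[i].empty()) {
            out = queues[i].front();
            queues[i].pop_front();
            return true;
        }
    }
    return false;
}
```

Under high load a popper almost always succeeds on its first sub-queue, so the K locks see roughly 1/K of the traffic each; the full scan only happens when the queues are nearly empty, which is exactly the low-load case where contention is unavoidable anyway.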

"We bought into lock-free reference counting for freeing objects"

I haven't wanted to impose things like that, I want the system to have a minimum of requirements to use correctly, but I may regret this later. I also considered building a RW lock into my handle system but decided against it for the moment.

The problem we have is that we don't control the structure of the tasks (we're compiling arbitrary code to use our library). Every core actually has its own work queue. But if one core is producing most of the work at a high enough rate (and a bunch of worker cores are completing the tasks fast enough), the cache line contention from the work stealers can be problematic... This was the problem we hit when we had a single core spinning off a new task every 700 clock cycles on an 8-core Nehalem or every 1600 clock cycles on a 24-core Opteron. We've thought about trying to batch items to amortize the cache line miss on the insert operation, but ensuring timely delivery of tasks would become a pain...

Sounds really good. Have you considered implementing this with fibers instead of threads? The first problem I see is you pretty much have to roll your own synchronization primitives. To my knowledge, under Windows, stock mutexes and semaphores will put the whole thread to sleep, which isn't what you want.