to thread or not to thread -- performance and design issues

This is a discussion on to thread or not to thread -- performance and design issues within the C Programming forums, part of the General Programming Boards category; I wrote a hash table of bayer trees to use for fast key/value pair look-ups. I've been testing it on ...

to thread or not to thread -- performance and design issues

I wrote a hash table of bayer trees to use for fast key/value pair look-ups. I've been testing it on various sized sets of data against C++ std::map and against a plain bayer tree, and I seem to have done a decent job -- the plain bayer tree is slightly faster than map on small sets (there's a sweet spot where it's about 30% faster) but ~20% slower on the biggest sets (12 million items, which eats almost all my memory -- map uses the most). The double hashed table of b-trees is faster still than the plain bayer on small sets and about matches map on the biggest sets.

AFAIK, there is no way to thread single tree operations -- you cannot be rearranging the same tree simultaneously. But with the hash table, all the trees are independent, so in my original implementation I figured I'd fire a thread for every operation. This turned out to be real stupid -- the overhead for starting a new thread thousands and millions of times is just too much -- so I moved to a genuine producer-consumer* set up, whereby when the table is initialized, a small fixed number of threads are started, then they wait on conditions to pick up data to add. Each tree has a lock, but generally two "consecutive" adds will not end up with the same hash value, so they can mostly be done concurrently.

But the threaded producer-consumer version is still twice as slow as the unthreaded version.

Now here's the crux: my assumption was that the heftiest work would be in the actual tree ops.** All the other functions are quite short and sweet -- no mallocs or math, no string ops, just a few assignments and some passing of pointers. The only difference between the threaded and unthreaded version is the use of a few locks and conditions. So I have to assume this is where the time is being spent: the actual "add" function sets a shared pointer to the data, then calls pthread_cond_signal to wake a thread to add it to the table, then pthread_cond_wait. The thread then assigns it's own pointer to the data and calls cond_signal to release the add() function (initially I used a mem pooled linked list queue for that, then realized it was pointless -- if the threads, collectively, cannot keep up to the main process, the main process might as well wait, and the list/mempool handling is still more functions). This means that for each add we have the several pthread function calls to deal with passing the data (no way out of that). AFAICT, these are taking as much (or more) time as the actual tree ops, which negates any advantage concurrency would bestow there.

I'm a little shagged at this point, and don't really mind giving up on the threaded version -- it was interesting, a learning experience, and I did get it to work. But I'd like to get some second opinions on my diagnosis first. It seems to me this an issue with which people must have experience: obviously, a "dining philosophers" type scenario where multiple threads/processes just increment a counter is for sure going to be slower than just a single process incrementing a counter, because incrementing a counter all by itself takes no time at all whereas all the condition and lock handling necessary for the threaded model will, relatively. But I was hoping the tree adds -- which can involve rebalancing, etc -- would not fall into that category. I guess I'm wrong? At what point would a task be significant enough to make threading this way a performance bonus? There's a couple of other assumptions I've made here that could be incorrect (eg. "if the threads, collectively, cannot keep up to the main process, the main process might as well wait").

Worst case is I'm totally wrong and my thread design sucks (but of course it doesn't seem that way to me ).

* Nb: up to now, other than little exercises, the only thing I've used threads for is a few concurrent process, eg, a GUI controlling something in real time, and not to speed up any kind of processing.

** I've run gprof on it, and it does appear that 90% of the time is spent there, but to be honest I am very dubious of gprof's evaluation, since with the prior threaded model it completely failed to spot the problem -- I presume I'd have to compile pthreads itself appropriately in order to get accurate profiling here.

I'd say the problem is probably in how your tests are using the data structure. I imagine you have a loop that fills up the tree and then a loop that performs lookups etc right? Or maybe worse, you add an item and then immediately look it up?
What you've designed could have a performance improvement in a very specific usage scanerio, for example when adds are performed just occasionally and must return to the caller absolutely as fast as possible, and they aren't looked up straight away. Your test code just probably isn't equivalent to that scenario.

Can you use the more commonly known term "B-Tree" rather than "bayer tree" please.

>> so I moved to a genuine producer-consumer* set up, ... [threads] wait on conditions to pick up data to add.
So you have a container (a hash-table of B-trees) - and there are dedicated threads for adding data to this container?

It seems like find/insert/remove operations would be "too fast" to warrant an asynchronous interface. So if I were to encounter this in the wild, my first thoughts would be: "What problems were they trying to solve with this design?"

In this case it seems you were just looking for some speedup. I would want to know what's being measured as the baseline - for example, "the time it takes for a single thread to insert 1 million items". Or the goal of a design may be to improve multi-threaded accesses, and not necessarily single-threaded use.

To formally discuss speedup we need now what exactly is being measured, and how.

- add N items to the structure (these are read in from a randomized file)
- delete N/2 items from a separate randomized list (which is also a look-up, obviously)
- produce a sorted (in order traversal) list of all remaining items

The last test was mostly to insure that the structure was working properly (that the things that were supposed to be deleted were deleted, that the things that weren't weren't, and that what remains can be properly traversed to produce a sorted list) so I have not actually timed this against std::map yet.

Also, I haven't written a delete for the hash table version yet (so the sort test was just a sort test of all items -- each tree maintains a first/next pointer kept sorted in a master linked list to facilitate this) but that's very simple since the delete for the b-tree has been tested. So the only timing test I've done with the threaded hash table is the add test.

Originally Posted by iMalc

I imagine you have a loop that fills up the tree and then a loop that performs lookups etc right?

Yeah, exactly, except the look-up was also a delete. WRT to a plain look-up, I was probably not going to thread that because it would mean passing the result back from the thread, meaning the main process would have to wait anyway. Whereas with add/delete, nothing need be returned. Also, add and delete* can lead to rearrangements of the nodes, which this is what I figured was the most expensive event, gprof bore than out (30% of the total time for all adds was spent re-arranging; presumably the later adds, when the trees are bigger, are more significant this way).

Originally Posted by Codeplug

It seems like find/insert/remove operations would be "too fast" to warrant an asynchronous interface. So if I were to encounter this in the wild, my first thoughts would be: "What problems were they trying to solve with this design?"

That's how it seems to me, I am just surprised that the thread operations are that serious -- this is the first time I've done something like this (so that is mostly the "problem" I'm solving ). Like I said, I would not think that threading count++ would be anything but a slow down. But traversing and rearranging a b-tree of say several hundred items (if there are 2 000 000 in total and 10 000 double hashed trees -- I did distribution tests on the b-tree and it was pretty well balanced inc. depth**, but I haven't done anything on the hash table yet), possibly having to split and promote a whole series of nodes, would be much more complicated that a couple of pthread_cond_signal calls and some pointer assignments. Interestingly, setting the table up so the trees were way bigger (2 000 000 items in 1000 trees) did not make that much difference. This is true in the unthreaded version too. B-trees are pretty efficient things I guess. Anyway, for reference, this is all that's involved with the thread handling:

The reason for the BHTDONE catch was to deal with pthread_signal_cond potentially releasing more than one thread (an irritating pthread feature)-- then one gets the tasklock, does the task, and releases the lock, such that another would then get the lock and perform the same task again. "If (!threadup)" rarely applies, so I guess the total number of pthread calls per add would normally be:

So I dunno: how complex could a lock-up/down and signal call be? I would have thought maybe a simple assignment or two -- is that completely naive?

Another observation: I have a dual core processor. Using two threads was optimal. Using even just 4 threads lead to a noticable (25%) slow down, 8 threads 30%+. I would not have thought this would make so much difference either...the baseline here is still 200% of the single process unthreaded version.

I suppose a big factor here would be the size of the key because this would affects the strcmps. With the small tests I used stuff like a dictionary, but with the larger tests I've just been using 5 char permutations. I should try 10 and 20 and 100 char strings probably.

* the lookup for the delete would use the b-tree functions directly, whereas the look-up for the hash table proper would be a separate wrapper for the b-tree look-up.

** something that surprised me a lot there was how low the "optimal" m values turned out to be -- eg, even with 12 million items in a single tree, an m value over 20 proved not worth while, and for a million items anything beyond 6-8 was a slow down.

Take the time for the threads with no real b-tree ops at all. Time for the threaded version on 2M items without an actual add to the table: ~16 seconds. Time for the unthreaded version for the exact same thing, but really adding entries to the table: ~ 9 seconds.

So wow, I am extremely surprised by just how hefty pthread ops are! Live and learn...

>> ... a simple test would be to try this:
It's not always as simple as it seems, because it may also change: the amount of contention on the lock - the number of context switches incurred - L1/L2 cache hit/miss ratio - etc.

To get all the benefit of an asynchronous add, the adder shouldn't have to wait for anything. The adder could just: alloc data, lock, publish data, signal, unlock. So the add interface just adds data to a thread-safe container. With a linked-list as the container, all you need is a "not-empty" condition variable whose predicate is "container::not_empty()". The lock is only held long enough to remove or add data from the shared container. The idea is that by the time the client adds the 2M objects, the worker thread will still have a back-log of objects to insert. For this to be beneficial, the average time to add to the shared container with synchronization needs to be faster than the average time to add to the hashed B-trees without synchronization.

A main goal to keep in mind is to minimize lock contention. There is "contention" if a thread tries to acquire a lock that is already acquired. In general, it's a lot faster to acquire an uncontested lock than it is to acquire one which is already locked - because that implies a call to the OS to put the thread to sleep while waiting on the lock. (This is true for process-local pthread_mutex_t's and CRITICAL_SECTION's.)

For example the table of trees could be logically split in half, each controlled by its own worker thread with its own shared container. The adder would pre-compute the hash and enqueue the data to the correct worker thread. Assuming an even distribution across the table, the contention for the two locks should be less than the contention for a single lock. You probably don't want more worker threads than you have cores to run them on - because then they would be sharing time with the adder thread, which we're trying to speed up.

Another way to reduce contention is to have the worker "consume" the entire container, and not just dequeue the next object while the lock is acquired. So you would: acquired lock, swap the pointer to the container with a pointer to an empty container, unlock. So with a single lock acquisition, the worker has a whole list of work to do. This keeps the worker busy a little longer, which allows the adder to refill the now empty list without contention.

In the end an asynchronous add interface may not be worthwhile. But it's a fun skill-building exercise

A main goal to keep in mind is to minimize lock contention. There is "contention" if a thread tries to acquire a lock that is already acquired. In general, it's a lot faster to acquire an uncontested lock than it is to acquire one which is already locked - because that implies a call to the OS to put the thread to sleep while waiting on the lock. (This is true for process-local pthread_mutex_t's and CRITICAL_SECTION's.)

Alright. I thought using as few locks as possible (eg, I have three conditions but only one lock) would mean fewer function calls (since otherwise I'd be releasing and or contending for several locks, altho that is not necessary, since one lock can be made to work with all three conditions -- but it may add to the contention).

For example the table of trees could be logically split in half, each controlled by its own worker thread with its own shared container. The adder would pre-compute the hash and enqueue the data to the correct worker thread. Assuming an even distribution across the table, the contention for the two locks should be less than the contention for a single lock.

The problem with that (giving each thread a specific domain) is that with two threads, 50% of the add calls will be for the same domain again -- ie, half the time there won't be parallel tasks, unless:

With a linked-list as the container, all you need is a "not-empty" condition variable whose predicate is "container::not_empty()". The lock is only held long enough to remove or add data from the shared container. The idea is that by the time the client adds the 2M objects, the worker thread will still have a back-log of objects to insert. For this to be beneficial, the average time to add to the shared container with synchronization needs to be faster than the average time to add to the hashed B-trees without synchronization.

I did use a LL at first, as I said eventually it seemed there was no point in the main process getting ahead of all threads -- it just needs to stay ahead of one of them, and then there will always be something for the next one to do. The LL handling adds more overhead, so I went to just a single task pointer. However, I could eliminate the multiple release problem with cond_signal that way.

In reality it did not get ahead anyway. The task is two small -- which maybe isn't surprising, it's literally the equivalent of a single map insert. I suppose they could be batched but that would be a fairly specialized sort of container which wasn't what I was going for. Maybe "batch-able" could be tried (some special batch_add function for use in loops, followed by a batch_execute*).

You probably don't want more worker threads than you have cores to run them on - because then they would be sharing time with the adder thread, which we're trying to speed up.

I would not have thought that 8 or 16 threads would be slower than 2 -- any time they use is still task time? -- but you are right, more than two gets slower.

In the end an asynchronous add interface may not be worthwhile. But it's a fun skill-building exercise

That's what I'm feeling.

* I like that idea the best: then each thread can be handed a pointer to a list that is spilt into domains, and the hand off only occurs once for the whole batch. Or say every 10-20 add calls while in batch mode, so that it is not all subsequent.

>> eventually it seemed there was no point in the main process getting ahead of all threads
If the goal is adder speed up, let it get as far ahead as possible.

>> 50% of the add calls will be for the same domain again -- ie, half the time there won't be parallel tasks
The goal of two worker threads isn't to achieve parallelism (though when they do run on their own cores, you can get parallelism between the workers). The goal was to turn the 1 lock into 2, cutting lock-contention in half and possibly making the adder faster.

>> However, I could eliminate the multiple release problem with cond_signal that way.
Condition variables should have an associated predicate condition, which is looped-on while waiting for that predicate to become true. Of your 3 conditions, only one is using a predicate wait-loop - "bht->readycnd" has a predicate loop of "bht->todo.type == BHTADD".

If you use something like a circular buffer (what you have now is a circular buffer of 1), then you only need two conditions: "try to produce" and "try to consume", using the predicate conditions "buffer full" and "buffer empty":

Notice that all calls to wait() are in a predicate loop - so that any "spurious wakeups" will just re-check the predicate and go back to waiting.

If you use a linked list, the adder doesn't have to wait and is only limited by memory - but at a cost.

A circular buffer doesn't need to allocate on each addition. Also, sequential access to an array has better cache locality. Ideally, the buffer size would be large enough so that the adder never has to wait on a full buffer.

Condition variables should have an associated predicate condition, which is looped-on while waiting for that predicate to become true.

That's interesting. I may still take another stab at this, even if only to take advantage of the opportunity to experiment.

If you use a linked list, the adder doesn't have to wait and is only limited by memory - but at a cost.

A circular buffer doesn't need to allocate on each addition. Also, sequential access to an array has better cache locality. Ideally, the buffer size would be large enough so that the adder never has to wait on a full buffer.

What I actually did with the linked list before was allocate a pool of half a dozen structs. Structs can be removed from the list, meaning it was of variable length (and initially zero), but they all came from the pool. If the allocation function could find a struct in the pool marked DONE (meaning it was detached from the list), it returned that one, otherwise it would realloc and return a new one. This way the pool could adjust itself to some size if the default was too small (rather than using a wait). Of course, that means using the lock in the allocation. In fact, like I said, the realloc very rarely took place, but that was when I started to think the whole list was unnecessary and even detrimental (since if the threads did fall behind, in that system the list pool would just keep realloc'ing).

MK; about two years ago I sunk a weekend into a fun mental project: making a lock-less yet thread-safe producer-consumer model. It is as you surmised not the time spent IN the threads but switching between them. Depending on stack size and other environmental issues, switching can be expensive. I may still finish this sometime if the need arises; I got quite a speed boost out of it over a rather standard model. The trick (such as it was) involved using trylock on mutexes that guarded the common resource but if a given thread did not get the lock, it always had a safe block of data to work on and it would periodically check back to see if the resource was free. If not, it went back to the secondary job. That by itself was only part of it; it would also after any given mutex check (once it had released the mutex if it got it) would sleep for a second, then get timings of how long the 1-second sleep actually took. This gave me a rough idea on system load. Then I would also measure how long an average unit of work took to process, then divide the time between actual locks (so I could access the main work queue) which basically told me that under the current load, I could expect to get the main resource every so many milliseconds and if that was actually say once a second and and I could do ten units of work in that second (100ms per unit), the next time I got an actual lock I would get that many (plus one) units of work from the queue. In short if it knew it would only get a lock say 1/4 of the times it tried, it would collect enough work when it DID get a lock to keep it busy the rest of the time in case it could not. I could also tell this system (effectively) only use say 50 per cent of the total CPU so if it was set that way and some other unrelated process started up and used 25 pct, this system would throttle itself back and only use 25 per cent itself.

It was a fun project, really made you think and I was surprised how much work I could get out of the system only using a fraction of the CPU....

OK it wasn't totally lockless but it used more like an advisory lock as opposed to the more adversarial method usually employed (gimme the lock and I OWN it as long as I want it). This was more like "is the resource busy? if so, I can do something else and come back later, else if no one was using it, lock it and use it. Actually the lock was more like a semaphore but who is arguing semantics...