My name is Bartosz Adamczewski and this blog is all about programming on the NET platform and beyond. I'm focusing on two languages C# and F#, but also I'm planning to throw some SQL to the mix, and maybe Java + Android platform, and C++ as cherry on top.

Pages

Monday, August 13, 2012

Lock Free Work Stealing Queue

Lock free data structures are very handy for concurrent programs especially that the number of cores increases. The benefits of such data structures is that they never acquire a lock, instead they do a form of a spin lock, where a local copy of the data must be stored and exchanged or updated by an processors atomic operation (like "compare and swap" or "fetch and add") only when the global value and the local copy match, otherwise we repeat the whole procedure. Almost every lock free data structure follows this pattern, so you may ask where's the benefit in that? Well the chances are that the local data race will never issue a spin and everything will end in just one loop cycle, thus making the process very fast, instead of always acquiring and holding the lock, where other threads would just queue up.

These types of data structures are very hard to implement correctly as we need to think about other threads, where they are and what they are doing, and to be even more hard there types of structures can suffer from the ABA problem where a memory location will be read twice but between reads some other thread might introduced sideffects by changing the values in that location (resusing location) and thus corrupting the entire state, but the switched out thread will just hold the memory address thus thinking that the data is still the same. Luckily for us in .NET the ABA problem does not exist as the memory location will only get reused when there are no references to it.

Work Stealing

So where does work stealing come into play here? Well a work stealing queue is simply a queue where the main thread can dequeue from the head but the stealing threads can dequeue elements from the tail, in a standard locking queue this helps alot as the stealing thread does not participate in the lock when dequeueing if the head and tail are far away from each other, if they are close then a lock can be put for extra safety. So almost by design this queue is supposed to be almost lock free. But what it gives us when our queue will not use any locks at all and will just do a form of a spin lock ? Well there will be almost the same benefits that we will not participate head dequeing making it a lot less likely that our operation will last more then one loop cycle. There is however one difference as we may slow down the enqueing operation as we will be racing with the enqueing threads, but the enqueue thread will not slow our own stealing operation.

So lets go into code, again most of explanations of how it's working is done through comments in the code, as I just feel that's the better way :)

///<returns>the value indicating that there are still values to be peeked.</returns>

publicbool TryPeek(out T value)

{

Node currentNode = head.next;

if (currentNode == null)

{

value = default(T);

returnfalse;

}

else

{

value = currentNode.val;

returntrue;

}

}

///<summary>

/// Gets the enumerator.

///</summary>

///<returns>enumerator.</returns>

publicIEnumerator<T> GetEnumerator()

{

Node currenNode = head.next;

Node localTail = GetTail();

while (currenNode != null)

{

yieldreturn currenNode.val;

if (currenNode == localTail)

break;

currenNode = currenNode.next;

}

}

///<summary>

/// Gets the enumerator.

///</summary>

///<returns>enumerator.</returns>

IEnumeratorIEnumerable.GetEnumerator()

{

return ((IEnumerable<T>)this).GetEnumerator();

}

}

Performance

Lets do some perfomance tests for the solution, please note that those tests were conducted on a four physical core I7 system. The tests will only do a basic enqueue/dequeue test as it's impossible to campare performances of work stealing so that should be tested individually for performance depending on the use case of such structure. The test code simply inserts and removes from the queue from multiple threads at the same time, the initial count will give us the corretness indicator as it needs to be the same as when we started our program.

Conclusion

Now if those results are somewhat strange to you, let me explain. Performance benefits will only be visible on a high count core/processor system as only then processors will need to be in sync when waiting on a lock acquired by a thread on another processor, so for a low core systems you will be better off with standard locking (work stealing) queue.

Update On Performance

As correctly pointed out by Sławomir in the comments section, the performance test is actually wrong for a couple of reasons. The first being that ThreadPool uses work stealing algorithms itself so it will run on NumOfCores * 4 threads instead of the provided number, this is especially true if the tests were performed on .NET 4.0, besides the work stealing algorithm will obscure the results as threads will steal queued tasks. The correct solution to the problem is to use plain simple threads, so the code test code is changed to reflect that:

I extended the test procedure and also tested scenarios with high enq low deq ratio, and the other way around as it is rare for the most systems to have the same number of enq and deq at any given point in time, usually it will vary. I provided a table with results for given number of items and threads for given test scenario along with a graph showing the performance.
Let's look at the performance of both solutions when we have the same amount of dequeue and enqueue threads:

For 1k and 10k items:

Threads

10

20

40

80

100

200

400

800

1000

LockFree 1000

142

210

337

743

999

1885

3304

3500

3594

Locking 1000

96

135

483

1037

1043

1628

2600

2638

2760

LockFree 10000

254

499

828

1907

2327

4922

9940

18637

23549

Locking
10000

257

481

890

2386

2888

5005

9505

17900

22900

As we can see the performance of the lock free solution on my machine (4 physical cores) is almost on par with the locking queue, and there is a pattern here where the lock free solution outperforms the locking code (by a hair) where the number of threads is between 40 and 100. For better results of the lock free code we would need to redo this test on multi core machine (8 to 10 physical cores), but unfortunately I have no access to such machine (sic!) :(.
Now let's test the performance where we have varying number of enqueue and dequeue threads.

For 10k items and varying number of dequeue threads:

Threads

10d 10e

20d 10e

40d 10e

80d 10e

100d 10e

200d 10e

400d 10e

800d 10e

1000d 10e

Locking
10000

234

419

532

1210

1414

2364

4876

10257

13067

LockFree 10000

190

393

524

958

1181

2391

4528

9141

10878

For 10k items and varying number of enqueue threads:

Threads

10d 10e

10d 20e

10d 40e

10d 80e

10d 100e

10d 200e

10d 400e

10d 800e

10d 1000e

Locking
10000

201

429

797

1369

1408

2703

5063

9916

12917

LockFree 10000

168

416

587

1184

1300

2426

4867

10645

13267

Now we do see that the lock free queue generally outperforms the locking version of the queue, but again on low core machine the gain is very small.

Update On Conclusion

So should we bother with lock free solution ? If we do have a very loaded system that runs on multiple cores then yes! Besides this implementation is a work stealing queue that has a primary role to boost performance on loaded system by letting the low loaded threads steal work from other queues, however the primary question here would be should we go for the implementation of lock based work stealing queue where or should we go lock free all the way? Well in lock based work stealing we would participate in a enqueue lock upon stealing and in the lock free version we would still participate in the same pseudo lock, but when the enqueue ratio is small at the moment there is a big chance that we will not spin and in result the whole steal operation will be faster then acquiring the lock and wait on on it (especially when the number of cores goes up). Another thing is that very short locks in .NET are implemented internally as spin locks thus they do not switch to the Operating System so the performance of such short locks are very good (I will try to shed more info on that topic in the future posts). So the answer to the question is go lock free if the production system uses a high core machine and go lock based if you cannot spare cores.

2 comments:

Interesting article, however I'm just curious, do you know how many threads were created during your tests on thread pool?According to MSDN it may vary: http://msdn.microsoft.com/en-us/library/system.threading.threadpool.aspxMaybe you could check it and put into article?

About Me

I'm a software consultant working day to day on financial systems. I specialize in high frequency concurrent systems that incorporate fragmentation free analysis and software design.
For speaking engagements or other opportunities contact me at: bartoszadamczewski(AT)gmail.com
http://about.me/badamczewski