In this final instalment of the synchronisation series, we will look at fully scalable solutions to the problem first stated in Part 1 – adding monitoring that is scalable and minimally intrusive.

Thus far, we have seen how there is an upper limit on how fast you can access cache lines shared between multiple cores. We have tried different synchronisation primitives to get the best possible scale.

Throughput this series, Henk van der Valk has generously lent me his 4 socket machine and been my trusted lab manager and reviewer. Without his help, this blog series would not have been possible.

And now, as is tradition, we are going to show you how to make this thing scale.

Summary so Far

Before we move to the solution, let us just summarise what we have learned so far:

Even without locks, a single (Intel X7560 @ 2.27Ghz) core is limited to around 90K operations/ms

Working backwards from the clock speed, we can conclude that 20-30 CPU cycles are need to do one call to Add()

Inside the L2/L3 caches, we see a nearly constant throughput for all synchronisation primitives

Contention from threads outside the CPU socket degrades the throughput dramatically, until it eventually reaches a steady state (around 16-20 Logical Cores on the 4 socket box)

Sharing cache lines between different copies of the OperationData struct introduces false sharing effects. To eliminate these effects, we have to pad the 32B wide struct containing the data

There is a trade-off between the scalability of a lock and the cost to acquire it. For example, the TTAS spin lock is cheaper to acquire than the .NET SpinLock – but it scales poorly

Partitioned Data Structure Implementation

Based on what we have learned so far, we can now design the optimal solution. There are several challenges to overcome.

First, we need to make sure that the OperationData struct lives on its own cache line. This can be done with the implementation we have already seen:

C#

1

2

3

4

5

6

7

8

9

10

[StructLayout(LayoutKind.Sequential)]

privatestructOperationData

{

privatefixedbytePad0[64];

publiclongTimeMs;

publiclongCount;

publiclongMaxTimeMs;

publiclongMinTimeMs;

privatefixedbytePad1[64];

}

Depending on the synchronisation method we choose, we may add additional fields (example: the IsHeld field in the TTAS spin lock from the previous post)

Second, we have to avoid sharing cache lines between cores when we call Add(). This can be achieved by maintaining one copy of the OperationData array per core. Instead of using a one dimensional array of OperationData, we can make it two dimensional, with the second dimension being the CPU whose L1 cache we want to hold the structure.

C#

1

2

3

4

privateconstintMAX_CPU=64;

privateOperationData[,]_opsData

=newOperationData[MAX_CPU,OPSLIST_LENGTH];

The drawback of this solution is that we need to visit every core’s copy when we want to report on the data (or when we call Clear()). However, reporting is a rare event compared with the Add().

Third, we now need to make sure that a thread updating the OperationData always picks the one that is most likely to be on the core it runs on. At this point, we run into trouble with the .NET framework. It turns out that it is not possible for a thread to ask which core it currently runs on. The framework wont expose this information to you, it only available via C++/Win32 (see: http://henkvandervalk.com/setting-processor-affinity-in-windows2008r2)

At this point, we have to fall back on our knowledge of Win32 and Windows. Each thread in Windows has a preferred core that it runs on. The Windows scheduler will generally prefer to wake this thread up on the same core that it last ran on (this is done to preserve L2 caches).

It is possible for a thread to ask which core it current runs on by calling GetCurrentProcessorNumber(). This call is relatively cheap and .NET (unlike other “Our one size coffee fits all, but we will pretend you can have it your way” managed languages) has a very nice way to communicate directly with the underlying operating system. We can simply call out to Windows like so:

C#

1

2

[DllImport("kernel32.dll")]

staticexternintGetCurrentProcessorNumber();

We can now express Add() using this template:

C#

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

[ThreadStatic]

privatestaticboolknowsCore;

[ThreadStatic]

privatestaticintcpuIndex;

publicvoidAdd(intoperation,inttimeMs)

{

if(knowsCore){}

else

{

cpuIndex=GetCurrentProcessorNumber();

knowsCore=true;

}

fixed(OperationData*s=&_opsData[cpuIndex,operation])

{

/* Syncronisation goes here */

}

}

By storing the the CPU ID we run on in a ThreadStatic (cpuIndex) the thread can remember which part of the array it needs to speak with.

At this point you may ask: Why not call GetCurrentProcessorNumber() every time? Well, it turns out that while this call is cheap, it is not THAT cheap. We measured a significant drop in throughput when we tried the first implementation.

What we do from here in Add solely depends on which synchronisation primitive we decide to use.

Benchmark: Partitioned Implementations

Using the knowledge we have acquired so far, we can now create several different implementations of the partitioned data structure – each using their own synchronisation primitives.

Let us see how they do in test:

Now THAT, ladies and gentlemen, is scalability. Even the worst partitioned, thread safe implementation tracks the unsafe un-partitioned version at low scale and massively outperforms (by about a factor 20) the unsafe version at high scale.

At the highest scale the low overhead, partitioned TTAS spin lock outperforms the unsafe un-partitioned version by a factor 85.

It is also clear that when the contention on a lock is low – a light weight alternative to the .NET SpinLock (like the TTAS spin) is very good – and so, interestingly, is the .NET lock() operation.

A Word of Warning

Getting scalability right requires a keen eye for detail – bordering on OCD. Henk and I had to make a lot of small adjustment to the data structure to hit the relatively clean scale curves you see above. Constant benchmarking and analysing of results is the key to success – nothing beats empirical data on a big machine.

Going over the mistakes I made while adjusting the code would require an entirely new blog entry. Yet, it is probably worth showing one example to illustrate the point. In one of our partitioned data structure tests, I had forgotten to add padding to the struct. This resulted in false sharing rearing its ugly head as can be seen below:

The moment false sharing occurs the scalability becomes jagged. The small things matter!

Conclusion

In this blog series, we have shown you how to build scalable data structures and quantify the overhead of synchronisation primitives.

We have seen that the only viable solution for high scalability is to use partitioned data structures.

When data structures are designed for low contention, the choice of synchronisation is determined by two factors:

Do you want to block when you cannot access the critical region. If so use lock()

Do you want to spin when you cannot enter the critical region? In this case, rolling your own TTAS spin lock might be worth it.

In a fully asynchronous programming model, the .NET lock() statement will likely be the best choice for nearly all implementation. There are very few cases where SpinLock is preferable – unless of course, you have no way to avoid contention.

At the end, we managed to build a highly scalable class that works well even on a lot of cores. This means that we can feel confident about adding this instrumentation generously to our application server code – at least until we manage to drive a throughput of more than 800K operations/ms.

Thank you for this excellent serie of articles! It perfectly illustrates the hidden danger of multithreaded implementations in an understandable way!

As far as the unsafe partitioned implementation goes: it is possible to for an unpinned thread to switch cores (and the cpuIndex becoming out-of-sync with the actual value) or for multiple threads to initialize the cpuIndex on the same core (for example when there are more threads than cores). In these situations the cpuIndex is not unique and object sharing/data corruption is possible.

If an unsafe implementation is desired (since it seems to outperform TTAS by roughly 30% in your benchmark), I can imagine that each thread could be allocated their own individual [ThreadStatic] OperationData array by sacrificing speed of the one-time initialization. The initialization would allocate a new OperationData array, register it a global list and set the [ThreadStatic] reference to the allocated array. The global list will serve as a way to aggregate performance information when it is queried and a list to walk when Clear() is called.
If threads are regularly created and destroyed, the global list could be extended by registering the System.Threading.Thread to which the array belongs. A worker thread could periodically walk the global list and cleanup all Threads on which IsAlive == false by aggregating their values to a separate ‘history’ entry and removing the entry from the global list.
Would such an implementation work?

Thanks for this great series. It really hit the sweet spot for me in terms of continuing to get a grip on those pesky small details that often tend to get glossed over or presented in fluffy “in-flight magazine” marketing speak.

I was planning to clean up the test harness before releasing (it’s a bit dirty :-), but if you want it now, just shoot me an email.

As I understand the ThreadLocal, it won’t work. You cannot be guaranteed that the thread who wrote a value (and its data) is still around when you query it. If you had a fixed pool of threads, it would be fine though. I would expect it to scale similarly to the unsafe version.