In the previous blog post we saw how the lock() statement in .NET scales very poorly when there is a contention on a data structure. It was clear that a performance logging framework that relies on an array with a lock on each member to store data will not scale.

Today, we will try to quantify just how much performance we should expect to get from the data structure if we somehow solve locking. We will also see how the underlying hardware primitives bubble up through the .NET framework and break the pretty object oriented abstraction you might be used to.

Because we have already proven that ConcurrentDictionary adds to much overhead, we will focus on arrays as the backing store for the data structure in all future implementations.

Implementation: Unsafe Array

We saw how the lock() statement itself is very expensive and dominates the CPU consumption of the test. What might happen if we decided to make the data structure complete unsafe and just get rid of locks?

Let us try this implementation:

C#

1

2

3

4

5

6

7

8

9

10

11

12

privateOperationData[]_opsData=newOperationData[OPSLIST_LENGTH];

publicvoidAdd(intstatistics,inttimeMs)

{

OperationDatas=_opsData[statistics];

s.TimeMs+=timeMs;

s.Count++;

s.MaxTimeMs=Math.Max(timeMs,s.MaxTimeMs);

s.MinTimeMs=Math.Min(timeMs,s.MinTimeMs);

Thread.MemoryBarrier()

}

Since there is no thread coordination here, all bets are off on the results we can expect. Just for the experiment, let us see how fast this will go.

Note: The Thread.MemoryBarrier() call is necessary to make sure that the update of the data is visible to other threads and that the compiler wont remove or reorder instructions. The MSDN documentation states that this method is only needed on IA-64, but this is wrong. Instead of explaining this, I refer you to this excellent blog: http://www.albahari.com/threading/part4.aspx

Benchmark: Unsafe Array

Below is the benchmark of the unsafe array compared with the previous run using a lock based array:

This might surprise you. The unsafe version of the implementation, while significantly faster at low concurrency, still scales poorly at high concurrency.

Observations

What is going on here? Even without locks, the data structure does not scale.

A code profile of where we are spending the time shows something most curious:

Notice that the first access of the class (s.timeMs += TimeMs) takes almost twice as long as the rest of the modifications, though we would not expect adding an integers to be a “long” operation (certainly shorter than taking the max of two values).

Another noteworthy phenomena is that the first 2 CPU threads seems to scale well, followed by steep drop in scale to an almost flat line and then another drop at 16 CPU threads. If you squint your eyes, you can see the same effect in the locking run.

Welcome to the world of hardware! Because all threads (each running on their own core) need to update the OperationData class, the CPU root complex have to coordinate the writes to the memory so each thread sees a consistent picture of the values we change when they read it next (preserving strong memory semantics). CPU cores exchange information about memory state using the MESI protocol. This coordination is not free, as cache lines become dirty (by writing to memory from code) information has to flow between the CPU cores.

In the case of an Intel CPU, the communication between two hyper threads (in L1 cache) on the same core is relatively cheap – which explains that the first two CPU threads have good scale. As we move out to other cores on the same socket, we have to touch L2 and L3 caches, which causes the first big drop in scale. Finally, as we leave the CPU socket, the coordination becomes very expensive – dropping us to our final, bottlenecked workload.

Below illustrates the jumps we make in scale and between CPU cores:

THIS is why even the unsafe variant of the data structure doesn’t scale.

The MESI protocol does not operate on individual integers, it operates on a full cache line. In the case of the x86/x64 architecture, a full cache line is 64B and anything that changes inside that cache line will have to coordinate with the other cores.

Implementation: Padded, Unsafe Array

Consider again the data structure that holds the measurements:

C#

1

2

3

4

5

6

7

privateclassOperationData

{

publiclongTimeMs;

publiclongCount;

publiclongMaxTimeMs;

publiclongMinTimeMs;

}

This class takes up 8B/long x 4 longs = 32B. That’s only half a cache line. What might happen if two of these end up next to each other in memory?

Recall that the test harness calls the Add function with 4 different operations, which means there are 4 instances of the OperationData class in play. Consider a memory layout that looks like this:

Since coordination of dirty data is done on a cache line (64B) granularity in the CPU, a layout like this would effectively serialise access, at the CPU level, to the three OperationData objects – even though they are entirely independent data structures.

Is it possible to validate if this is happening? Of course: you can simply break into the code and take the address of the first long in each class. Like so: &(_opsData[0].TimeMs).

Comparing the byte offsets of each long in the four instances of OperationData, we can see that the above is indeed happening! This phenomena is known as “False Sharing”.

What can we do about it? Well, one way is to use a struct instead of a class to store the data. Structs in .NET have the advantage of giving the user full control over the memory layout of the data. For example, we could do this:

C#

1

2

3

4

5

6

7

8

9

10

[StructLayout(LayoutKind.Sequential)]

unsafeprivatestructOperationData

{

privatefixedbytePad1[64];

publiclongTimeMs;

publiclongCount;

publiclongMaxTimeMs;

publiclongMinTimeMs;

privatefixedbytePad2[64];

}

Note that byte array at the beginning and end of the struct. These will prevent the struct from sharing a cache line with any other copies because the struct is now larger than a cache line. Because structs are value and not reference types, we have to be a bit more careful with the Add method, it has to take out a pointer to the array of structs:

C#

1

2

3

4

5

6

7

8

9

10

11

12

unsafepublicvoidAdd(intstatistics,inttimeMs)

{

fixed(OperationData*s=&_opsData[statistics])

{

s->TimeMs+=s->TimeMs+timeMs;

s->Count++;

s->MaxTimeMs=Math.Max(timeMs,s->MaxTimeMs);

s->MinTimeMs=Math.Min(timeMs,s->MinTimeMs);

Thread.MemoryBarrier();

}

}

The results are rather impressive for such a minor change:

Faster in every single case, by about 40%.

Conclusions so Far

In this blog post, we have seen that even when we remove locking from the equation, the data structure is still not scalable. This experiment has provided us some upper boundary of how fast this structure will go with the current implementation patterns (around 90K operations/ms).

We saw how even very small changes to the cache layout of data structure can lead to a significant scale boost. Its also clear that going beyond a single CPU socket on the Xeon has a large negative impact on the scalability of the code. What does this mean in practical terms? If you ran this data structure on a large machine with a great many cores, it would run SLOWER than on your laptop. This is the kind of physical reality that a managed language will try to hide from you with all of its fancy, object oriented blah blah. Such abstractions leak horribly when concurrency goes up.

Without knowledge of the hardware, we would have had a real “WTF moment” seeing this data. Understanding scenarios like these, which are more common than you might hope, is what distinguishes a great programmer from a good one. This is why I believe that teaching languages like JAVA and C# in university is outright harmful until a student has a good grasp of C/Assembler and machine architectures.

One more idea: Align the struct on 64 byte boundaries (can StructLayout do this? If not, overallocate). That way you touch only one cache line per operation. That should give a constant-factor speedup (and do nothing for scaling).

Indeed! Aligning on 64B does work beautifully and gives the same performance (on fewer cache lines). Unfortunately, the StructLayout attribute does not give you full control over this and you have to over allocate to hack it.