NUMA performance improvement for ThreadRipper in Sandra SP2

What is NUMA?

Modern CPUs have had a built-in memory controller for many years now, starting with the K8/Opteron, in order to achieve higher bandwidth and lower latency. As a result, in SMP systems each CPU has its own memory controller and its own system memory that it can access at high speed, while to access other memory it must send requests to the other CPUs. NUMA (Non-Uniform Memory Access) is a way of describing such systems, allowing the operating system and applications to allocate memory on the node they are running on for best performance.
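On Linux, for instance, the "allocate on the node you are running on" principle falls out of the kernel's first-touch policy: a physical page is placed on the node of the CPU that first writes it. A minimal sketch of the idea (the helper names are our own; `os.sched_setaffinity` is Linux-only, and a real application would pin to all CPUs of one node rather than a single CPU):

```python
import os

PAGE = 4096  # typical x86 page size

def touch_pages(buf, page=PAGE):
    """Write one byte per page so physical pages are committed now,
    on the NUMA node of the CPU currently running this thread
    (first-touch policy). Returns the number of pages touched."""
    pages = (len(buf) + page - 1) // page
    for i in range(pages):
        buf[i * page] = 0
    return pages

def alloc_local(size, cpus=None):
    """Pin to the given CPUs (ideally the CPUs of one NUMA node)
    before allocating, then touch every page while still pinned
    so the pages land on that node's memory."""
    if cpus:
        os.sched_setaffinity(0, cpus)  # Linux-only
    buf = bytearray(size)
    touch_pages(buf)
    return buf

buf = alloc_local(1 << 20)  # 1 MiB whose pages are placed on the local node
```

Explicit placement APIs (libnuma's `numa_alloc_onnode` on Linux, `VirtualAllocExNuma` on Windows) achieve the same thing without relying on affinity.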

As ThreadRipper is really two (2) Ryzen dies connected internally through Infinity Fabric, it is effectively a 2-CPU SMP system and thus a 2-node NUMA system.

While it is possible to configure it in UMA (Uniform Memory Access) mode, where all memory appears unified and interleaved between the nodes, for best performance the NUMA mode is recommended when the operating system and applications support it.
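Whether the operating system sees one interleaved UMA node or multiple NUMA nodes can be checked from software; on Linux, for example, the online nodes are listed in `/sys/devices/system/node/online` as a range string such as "0-1". A small sketch, assuming that Linux convention (the function names are our own):

```python
def parse_node_list(s):
    """Parse a Linux node/cpu list string such as "0", "0-1" or "0,2-3"
    into a sorted list of node ids."""
    nodes = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.extend(range(int(lo), int(hi) + 1))
        else:
            nodes.append(int(part))
    return sorted(nodes)

def online_nodes(path="/sys/devices/system/node/online"):
    """Return the online NUMA node ids, or [0] if the file is absent
    (non-Linux or single-node system)."""
    try:
        with open(path) as f:
            return parse_node_list(f.read())
    except OSError:
        return [0]

# A 2-node ThreadRipper in NUMA mode reports "0-1";
# in UMA mode only a single node is visible.
```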

While Sandra has always supported NUMA in the standard benchmarks, some of the newer benchmarks had not been updated with NUMA support, especially since multi-core systems have pretty much killed off SMP systems on the desktop, leaving only expensive servers to carry SMP/NUMA support.

Note that all the NUMA improvements here also apply to competitors' NUMA systems (e.g. Intel), so this is not just for ThreadRipper; EPYC systems are likely to show a far higher improvement too.

In this article we test NUMA performance; please see our other articles for the rest of the ThreadRipper benchmarks.

In this compute-heavy algorithm, which uses FP64 to mantissa-extend to FP128, we see no improvement.

Fractals are compute-intensive with few memory accesses (mainly to store results), thus we see at most a 3% improvement with NUMA support, with the rest insignificant. However, this is a simple 2-node system; bigger 4/8-node systems would likely show bigger gains.

Crypto AES256 (GB/s): 27.1 with NUMA support [+139%] vs. 11.3 without

Hardware-accelerated AES is memory-bandwidth bound, thus NUMA support matters: even on this 2-node system we see an over-2x improvement of 139%!
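The percentage figures follow directly from the two scores; a quick sketch of the arithmetic (values taken from the AES results above):

```python
def improvement_pct(with_numa, without_numa):
    """Percentage improvement of the NUMA-aware score over the baseline."""
    return (with_numa / without_numa - 1.0) * 100.0

# AES256: 27.1 GB/s with NUMA support vs. 11.3 GB/s without,
# i.e. a ratio of about 2.4x, or roughly a 139% improvement.
pct = improvement_pct(27.1, 11.3)
```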

Crypto AES128 (GB/s): 27.4 with NUMA support [+142%] vs. 11.3 without

Similar to the above, we see a massive 142% improvement just by allocating memory on the right NUMA node.

Crypto SHA2-256 (GB/s): 32.3 with NUMA support [+50%] vs. 21.4 without

SHA is also hardware-accelerated but operates on a single input buffer (with only a small output hash buffer), and here our improvement drops to 50%, still very significant.

Crypto SHA1 (GB/s): 34.2 with NUMA support [+56%] vs. 21.8 without

Similar to the above, we see an even larger 56% improvement from NUMA support.

Finally, in streaming algorithms we see just how much NUMA support matters: even on this 2-node system we see an over-2x improvement of 140% when working with two buffers (in/out). When using a single buffer the improvement drops to 50%, but it is still very significant. TR needs NUMA support to shine.

SGEMM (GFLOPS) float/FP32: 395 with NUMA support [+114%] vs. 184 without

As with crypto, GEMM benefits greatly from NUMA support, with an impressive 114% improvement from allocating the three buffers on the right NUMA nodes.

DGEMM (GFLOPS) double/FP64: 183 with NUMA support [+131%] vs. 79 without

Changing to FP64 brings an even bigger 131% improvement.

SFFT (GFLOPS) float/FP32: 11.6 with NUMA support [+86%] vs. 6.25 without

FFT also shows big gains from NUMA support, with an 86% improvement just by allocating the buffers (2 + 1 constant) on the right nodes.

DFFT (GFLOPS) double/FP64: 10.6 with NUMA support [+112%] vs. 5 without

With FP64 the improvement again increases, here to 112%.

SNBODY (GFLOPS) float/FP32: 479 with NUMA support [=] vs. 483 without

Strangely, N-Body shows no appreciable benefit from NUMA support.

DNBODY (GFLOPS) double/FP64: 189 with NUMA support [=] vs. 191 without

With an FP64 workload nothing much changes.

As with crypto, buffer-heavy algorithms (GEMM, FFT) benefit greatly from NUMA support, with performance roughly doubling (86-131%) when the buffers are allocated on the right NUMA nodes, while compute-bound N-Body sees no change; in effect, TR needs NUMA in order to perform better than a standard Ryzen!
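For multi-buffer kernels like GEMM, the NUMA-aware pattern is to split each buffer by rows so that every node owns, and first-touches, the slice its own threads will process. A sketch of just the row partitioning (the helper name is our own, not Sandra's):

```python
def split_rows(n_rows, n_nodes):
    """Split n_rows into n_nodes contiguous (start, end) slices,
    giving any remainder rows to the first nodes. Threads running on
    each node would then allocate and first-touch their own slice so
    its pages land in that node's local memory."""
    base, rem = divmod(n_rows, n_nodes)
    out, start = [], 0
    for node in range(n_nodes):
        end = start + base + (1 if node < rem else 0)
        out.append((start, end))
        start = end
    return out

# On a 2-node ThreadRipper, a 4096-row matrix splits as:
# split_rows(4096, 2) -> [(0, 2048), (2048, 4096)]
```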

Blur (3×3) Filter (MPix/s): 2090 with NUMA support [+71%] vs. 1220 without

The least compute brings the highest benefit from NUMA support: here it is 71%.

Sharpen (5×5) Filter (MPix/s): 886 with NUMA support [=] vs. 890 without

The same algorithm with more compute reduces the improvement to nothing.

Motion-Blur (7×7) Filter (MPix/s): 494 with NUMA support [=] vs. 495 without

Again the same algorithm, with even more compute: again, no benefit.

Edge Detection (2*5×5) Sobel Filter (MPix/s): 720 with NUMA support [=] vs. 719 without

Using two buffers does not seem to bring any benefit either.

Noise Removal (5×5) Median Filter (MPix/s): 116 with NUMA support [=] vs. 117 without

A different algorithm, but again with more compute, means no benefit either.

Oil Painting Quantise Filter (MPix/s): 40.3 with NUMA support [=] vs. 40.7 without

Using the new scatter/gather in AVX2 does not help matters even with NUMA support.

Diffusion Randomise (XorShift) Filter (MPix/s): 1880 with NUMA support [+90%] vs. 982 without

Here a 64-bit integer workload with many gathers but little compute brings a 90% improvement.

Marbling Perlin Noise 2D Filter (MPix/s): 397 with NUMA support [=] vs. 396 without

Heavy compute brings down the improvement to nothing.

As with the other SIMD tests, low-compute algorithms see 70-90% improvement from NUMA support, while heavy-compute algorithms bring the improvement down to zero. It all depends on whether the overhead of accessing other nodes can be masked by compute; in effect, TR seems to perform pretty well.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is clear that ThreadRipper, just like any other SMP system today, needs NUMA support in applications to shine: we see over 2x improvement in bandwidth-heavy algorithms. However, in compute-heavy algorithms TR is able to mask the overhead pretty well, with NUMA bringing almost no improvement. For software without NUMA support, the UMA mode should be employed.

Let’s remember we are only testing a 2-node system here; a 4+-node system is likely to show higher improvements, and with EPYC systems starting at 1 socket / 4 nodes, we can potentially have common 4-socket, 16-node systems that absolutely need NUMA for best performance. We look forward to testing such a system as soon as possible 😉