Here's a (strongly NUMA-centric) performance comparison of the
three NUMA kernels: the 'balancenuma-v10' tree from Mel, the
AutoNUMA-v28 kernel from Andrea and the unified NUMA-v3 tree
that Peter and I are working on.

The goal of these measurements is to specifically quantify the
NUMA optimization qualities of each of the three NUMA-optimizing
kernels.

There are lots of numbers in this mail and a lot of material to
read - sorry about that! :-/

I used the latest available kernel versions everywhere;
furthermore, the AutoNUMA-v28 tree has been patched with Hugh
Dickins's THP-migration support patch, to make it a fair
apples-to-apples comparison.

I have used the 'perf bench numa' tool to do the measurements,
which can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/bench

# to build it, install numactl-dev[el] and do "cd tools/perf; make -j install"

To get the raw numbers I ran "perf bench numa mem -a" multiple
times on each kernel, on a 32-way, 64 GB RAM, 4-node Opteron
test-system. Each kernel used the same base .config, copied from
a Fedora RPM kernel, with the NUMA-balancing options enabled.

( Note that the testcases are tailored to my test-system: on
a smaller system you'd want to run slightly smaller testcases;
on a larger system you'd want to run a couple of larger
testcases as well. )
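
Individual testcase sizes can also be set explicitly - for
example, something along the lines of the command below should
run a single 4-thread process on roughly 512 MB of memory (the
-p/-t/-P option names are quoted from the tool's usage output,
so double-check them against the perf version you are using):

  perf bench numa mem -p 1 -t 4 -P 512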

'NUMA convergence' latency is the number of seconds a workload
takes to reach a 'perfectly NUMA balanced' state. This is
measured on the CPU placement side: once CPU placement has
converged, memory typically follows within a couple of seconds.

Because convergence is not guaranteed, a 100-second latency
time-out is used in the benchmark. A 100-second result in the
table means that that particular NUMA kernel did not manage to
converge that workload unit test within 100 seconds.
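
Convergence can also be watched directly while a test runs: if
I got the option names right, -c/--show_convergence prints
convergence details and -m/--measure_convergence measures the
convergence latency - see the tool's usage output for the exact
spelling:

  perf bench numa mem -m -c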

The NxM notation denotes the process/thread relationship: a 1x4
test is 1 process with 4 threads that share a workload - a 4x6
test is 4 processes with 6 threads in each process, with the
processes isolated from each other but the threads within each
process working on the same working set.
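
In 'perf bench numa mem' terms the NxM tests map to the -p
(processes) and -t (threads per process) options, i.e. a 4x6
test is started along the lines of:

  perf bench numa mem -p 4 -t 6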

As expected, mainline only manages to converge workloads where
each worker process is isolated and the default
spread-to-all-nodes scheduling policy creates an ideal layout,
regardless of task ordering.

[ Note that the mainline kernel got a 'lucky strike' convergence
in the 4x6 workload: it's always possible for the workload
to accidentally converge. On a repeat test this did not occur,
but I did not erase the outlier because luck is a valid and
existing phenomenon. ]

The 'balancenuma' kernel does not converge any of the workloads
where worker threads or processes relate to each other.

AutoNUMA does pretty well, but it did not manage to converge in
4 testcases of shared, under-loaded workloads.

The other set of numbers I've collected are workload bandwidth
measurements, run over 20 seconds. Using 20 seconds gives a
healthy mix of pre-convergence and post-convergence bandwidth,
giving the (non-trivial) expense of convergence and memory
migration a weight in the result as well. So these are not
'ideal' results with long runtimes where migration cost gets
averaged out.
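
The 20 seconds runtime itself is just a parameter of the
individual test runs - per the tool's usage output there's an
-s/--nr_secs option - so longer, steady-state oriented runs can
be done as well, along the lines of:

  perf bench numa mem -p 2 -t 3 -P 1024 -s 60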

[ The notation of the workloads is similar to the latency
measurements: for example "2x3" means 2 processes, 3 threads
per process. See the 'perf bench' tool for details. ]

The 'numa02' and 'numa01-THREAD' tests are AutoNUMA-benchmark
work-alike workloads, with a shorter runtime for numa01.

The first column shows mainline kernel bandwidth in GB/sec; the
following 3 columns show pairs of GB/sec bandwidth and
percentage results, where the percentage shows the speed
difference relative to the mainline kernel.
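
( To make the percentage column unambiguous, with made-up
numbers: if mainline sustains 20.0 GB/sec on a workload and a
NUMA kernel sustains 25.0 GB/sec on the same workload, then that
kernel's pair of columns reads 25.0 GB/sec and +25%, i.e. the
bandwidth difference relative to the mainline result. )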

Noise is 1-2% in these tests with these durations, so the good
news is that none of the NUMA kernels regresses on these
workloads against the mainline kernel. Perhaps balancenuma's
"2x1-bw-process" and "3x1-bw-process" results might be worth a
closer look.

No kernel shows particular vulnerability to the NOTHP tests that
were mixed into the test stream.

As can be expected from the convergence latency results, the
'balancenuma' tree does well with workloads where there's no
relationship between threads - but even there it's outperformed
by the AutoNUMA kernel, and outperformed by an even larger
margin by the NUMA-v3 kernel. Workloads like the 4x JVM SPECjbb
on the other hand pose a challenge to the balancenuma kernel:
both the AutoNUMA and the NUMA-v3 kernels are several times
faster in those tests.

The AutoNUMA kernel does well in most workloads - its weaknesses
are system-wide shared workloads like 2x16-bw-thread and
1x32-bw-thread, where it falls back to mainline performance.

The NUMA-v3 kernel outperforms every other NUMA kernel.

Here's a direct comparison between the two fastest kernels, the
AutoNUMA and the NUMA-v3 kernels:

A third, somewhat obscure category of measurements deals with
the 'execution spread' between threads. Workloads that have to
wait for the result of every thread before they can declare a
result are directly limited by this spread.

The 'spread' is measured by the percentage difference between
the slowest and fastest thread's execution time in a workload:

The results are pretty good because the runs were relatively
short, with a 20-second runtime.
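
( As an illustration of the spread metric, with made-up numbers
and taking the fastest thread's runtime as the base: if the
fastest thread of a workload finishes its work in 4.0 seconds
and the slowest needs 4.2 seconds, the spread is
(4.2 - 4.0)/4.0 = 5%. )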

Both mainline and balancenuma have trouble with the spread of
shared workloads - possibly signalling memory allocation
asymmetries. Longer - 60 seconds or more - runs of the key
workloads would certainly be informative there.

NOTHP (4K ptes) increases the spread and non-determinism of
every NUMA kernel.

The AutoNUMA and NUMA-v3 kernels have the lowest spread,
signalling stable NUMA convergence in most scenarios.

Finally, below is the (long!) dump of all the raw data, in case
someone wants to double-check my results. The perf/bench tool
can be used to reproduce the measurements on other systems.