
Since cc-NUMA architectures have become ubiquitous in the x86 server world, it is very important to optimize memory and thread or process placement, especially for Shared-Memory parallelization. I have been pretty successful in optimizing several of our user codes for cc-NUMA architectures by introducing manual binding strategies. I like the cpuinfo tool that comes with Intel MPI 3.x a lot; it can be used to query how all the cores are related (e.g. which cores share a cache). Based on that output I used to figure out my strategies for every architecture that we have in our center or that I have access to elsewhere. However, during the last couple of days I observed some benchmark results that did not make much sense to me, and today I stumbled upon the cause, which was something I just did not expect. I will tell you in a second, but my statement is: manual binding can be a bad thing. One can achieve a nice speedup by doing it right, but even experts can easily be fooled. It is therefore high time to get access to a standardized interface to communicate with the threading runtime and the OS!

We have dual-socket Intel Nehalem-EP systems from two different vendors: Sun and HP. The Sun systems are intended for HPC and are equipped with Xeon X5570 (2.93 GHz) CPUs; the HP systems are intended for infrastructure services and are equipped with Xeon E5540 (2.53 GHz) CPUs. Anyhow, I got hold of both, put some jobs on the boxes and was really disappointed by the speedup measurements on the HP system. In investigating the reason for that I found out that the numbering of the logical cores differs between the two systems. Oh dear: two dual-socket systems with Intel Nehalem-EP processors, and in one system the cores 0 and 1 are on the same socket, while in the other system they are on different sockets. Let's take a look at the output of cpuinfo on the Sun system:

[Table: cpuinfo output, including its "Cache sharing" section, not preserved]

Let's take a closer look at this table. Wherever you find the identification ‘processor’, this refers to the logical core as visible to the operating system. A ‘package’ is a socket, and we have two ‘(hyper-)threads’ per core. On the Sun system, the logical cores 0 and 1 are located on the same socket, and the cores 0 to 7 refer to eight full cores on two sockets, making use of all caches. On the HP system, the logical cores 0 and 1 are located on two different sockets, and the cores 0 to 7 refer to four hyper-threaded cores on two sockets, making use of only half the caches. I am not saying one of the two numbering schemes is better. But if you use one machine to determine what is best for your application, put this into a start-up script and then switch machines in between your measurements, you will be surprised (and not in a good way).
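If cpuinfo is not at hand, the same relationships can be read from /proc/cpuinfo on Linux. The following Python sketch is an illustration only, not the Intel tool. It assumes an x86 Linux system whose /proc/cpuinfo records carry the 'processor', 'physical id' (socket) and 'core id' fields, and all function names are made up for this example.

```python
from collections import defaultdict


def parse_cpuinfo(text):
    """Map each logical CPU number to its (socket, core) pair.

    Expects text in the format of /proc/cpuinfo on x86 Linux: records
    separated by blank lines, each with 'key : value' lines.
    """
    topology = {}
    for record in text.strip().split("\n\n"):
        fields = {}
        for line in record.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            # Assumption: missing socket/core fields are treated as 0.
            topology[int(fields["processor"])] = (
                int(fields.get("physical id", 0)),
                int(fields.get("core id", 0)),
            )
    return topology


def load_topology(path="/proc/cpuinfo"):
    """Read and parse the topology of the machine this runs on."""
    with open(path) as handle:
        return parse_cpuinfo(handle.read())


def hyperthread_siblings(topology):
    """Group the logical CPUs that share one physical core."""
    groups = defaultdict(list)
    for cpu, socket_and_core in sorted(topology.items()):
        groups[socket_and_core].append(cpu)
    return dict(groups)


def same_socket(topology, cpu_a, cpu_b):
    """True if both logical CPUs are located on the same socket."""
    return topology[cpu_a][0] == topology[cpu_b][0]
```

Calling `same_socket(load_topology(), 0, 1)` on each machine would be enough to reveal the difference between the two numbering schemes described above.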

How is the core numbering determined? Well, the short answer is “not by the OS, but by the BIOS”; the honest answer is “I don’t know exactly”. The BIOS has a lot of influence. For example, section 5.2.17 of the Advanced Configuration and Power Interface Specification (ACPI: http://www.acpi.info/DOWNLOADS/ACPIspec40.pdf) describes a System Locality Distance Information Table (SLIT) that lists the distance between hardware resources on different NUMA nodes. In theory the OS kernel can make use of that table, and it either does so or fills in constant values (i.e. 10 = local, 20 = remote) in case the table is empty. But the ACPI specification does not specify how the table is generated; that is up to the BIOS implementation itself, and probably up to BIOS settings. The important take-away is that (i) BIOS settings influence the core numbering scheme, (ii) obviously BIOS settings are not the same across vendors, (iii) the numbering can change over time anyhow and other OSes (e.g. Windows) do it differently -> (iv) do not rely on the numbering scheme being static.
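To see which distances the kernel actually ended up with, one can look at the per-node `distance` files that Linux exposes in sysfs. This is a minimal sketch, assuming a Linux kernel with NUMA support and the usual /sys/devices/system/node/nodeN/distance layout; the function names are hypothetical.

```python
import glob
import os
import re


def parse_distance_row(line):
    """Turn one sysfs distance line such as '10 20' into a list of ints."""
    return [int(value) for value in line.split()]


def read_numa_distances(base="/sys/devices/system/node"):
    """Return {node: [distance to node 0, distance to node 1, ...]}.

    Returns an empty dict if no node directories are found, for example
    on a kernel without NUMA support or on a non-Linux system.
    """
    distances = {}
    for path in glob.glob(os.path.join(base, "node[0-9]*", "distance")):
        match = re.search(r"node(\d+)$", os.path.dirname(path))
        if match is None:
            continue  # skip anything that is not a plain nodeN directory
        with open(path) as handle:
            distances[int(match.group(1))] = parse_distance_row(handle.read())
    return distances


def print_distances(distances):
    """Print the distance matrix, one row per NUMA node."""
    for node in sorted(distances):
        row = " ".join(f"{value:3d}" for value in distances[node])
        print(f"node {node}: {row}")
```

If every row only contains the values 10 and 20, that matches the constant fallback mentioned above, although a BIOS may of course report exactly these values itself.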

What should you do instead? We do not have a standardized way to influence the thread / process binding. Tools such as numactl (Linux) or start /affinity (Windows) accept core ids as arguments, which is far from optimal. The same holds for explicit API calls to do the binding. Instead, the Intel compiler is following a good path: The environment variable KMP_AFFINITY can be used to define an explicit thread-to-core mapping, but it also accepts two strategies: scatter and compact. The idea of scatter is to bind the threads as far apart as possible (to use all the caches and to have all the memory bandwidth available); the idea of compact is to bind the threads as close together as possible (to profit from shared caches). Running a program with two threads using the scatter strategy on the Sun system results in binding thread 0 to the core set {0,8} and thread 1 to the core set {4,12} (-> two sockets). The same experiment on the HP systems results in binding thread 0 to the core set {1,9} and thread 1 to the core set {0,8} (-> two sockets, again). This abstracts from the hardware / system details and allows the user, who might not be an HPC expert, to concentrate on optimizing the application by choosing from just two strategies, still getting “portable performance” on Intel CPUs. A portable thread binding interface is under discussion for OpenMP 3.1 (see my previous blog post), and I am strongly in favor of allowing the user to choose from strategies. The one shortcoming of Intel’s current implementation occurs when you have multiple levels of Shared-Memory parallelization in one application and want to mix strategies, which might make sense. But this could easily be overcome. Let’s see what the future might bring; for now I just fixed my scripts to include a sanity check that the core numbering is indeed as expected…
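To illustrate why strategies are more portable than core ids, here is a simplified Python model of the two ideas. It is not Intel's algorithm: it places one thread per physical core and ignores hyper-thread-level placement. The two numbering schemes below are hypothetical examples of a "blocked" and an "interleaved" layout for a dual-socket, quad-core, hyper-threaded machine; they are not the actual cpuinfo output of our systems.

```python
from collections import defaultdict

# Hypothetical numberings: logical CPU -> (socket, core on that socket).
# BLOCKED: CPUs 0-3 on socket 0, CPUs 4-7 on socket 1, 8-15 are siblings.
BLOCKED = {cpu: ((cpu % 8) // 4, cpu % 4) for cpu in range(16)}
# INTERLEAVED: even CPUs on socket 0, odd CPUs on socket 1.
INTERLEAVED = {cpu: (cpu % 2, (cpu % 8) // 2) for cpu in range(16)}


def cores_by_socket(topology):
    """Group logical CPUs as {socket: [[CPUs of core A], [CPUs of core B], ...]}."""
    cores = defaultdict(list)
    for cpu, socket_and_core in sorted(topology.items()):
        cores[socket_and_core].append(cpu)
    sockets = defaultdict(list)
    for (socket, _core), cpus in sorted(cores.items()):
        sockets[socket].append(cpus)
    return dict(sockets)


def place(topology, num_threads, strategy):
    """Return one set of logical CPUs (one physical core) per thread.

    'compact' fills one socket before moving on to the next one;
    'scatter' cycles round-robin over the sockets.
    """
    sockets = cores_by_socket(topology)
    if strategy == "compact":
        order = [core for socket in sorted(sockets) for core in sockets[socket]]
    elif strategy == "scatter":
        order = []
        depth = max(len(cores) for cores in sockets.values())
        for index in range(depth):
            for socket in sorted(sockets):
                if index < len(sockets[socket]):
                    order.append(sockets[socket][index])
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if num_threads > len(order):
        raise ValueError("more threads than physical cores")
    return [set(core) for core in order[:num_threads]]


def check_numbering(topology, expected):
    """Sanity check for a start-up script: fail loudly on a surprise."""
    if topology != expected:
        raise RuntimeError("core numbering differs from the expected layout")
```

In this model, two threads with the scatter strategy land on two different sockets under both numberings, although the chosen core ids differ. A start-up script that hard-codes ids could call something like `check_numbering` first and refuse to run if the layout is not the one it was tuned for.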