NUMA: Theory and Practice

Non-Uniform Memory Access (NUMA) used to be one of those concepts that only people who built massive multiprocessing machines cared about. Things have changed.

With today’s hottest processors easily passing the 3GHz mark, and 64-bit chips like Intel’s Itanium 2 and AMD’s x86-based Opteron becoming readily available, the need for faster, more powerful memory management has grown. Indeed, as Symmetric Multiprocessing (SMP), clustering, and distributed computing become commonplace, that need has become critical.

Why? The answer lies in an almost 40-year-old computer engineering rule of thumb called Amdahl’s balanced system law: “A system needs a bit of I/O per second per instruction per second.” That works out to about 8 million instructions per second (MIPS) for every megabyte per second (MBps) of data throughput.

Now, MIPS haven’t been regarded as the be-all and end-all of CPU benchmarking for years, but they’re still a useful approximation of overall performance. What matters when using MIPS to justify NUMA is that they give interesting insights into the combined performance of memory and CPU. For example, BiT-Technologies found that a 2.4GHz Pentium 4 (Northwood) runs marginally faster (168.73 MIPS) with Double Data Rate-Synchronous DRAM (DDR-SDRAM) memory than the same processor with Rambus memory (166.83 MIPS).
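To make the arithmetic behind Amdahl’s rule concrete, here is a minimal sketch of the conversion. The function names are made up for illustration, and the figures passed in are arbitrary examples rather than measurements.

```python
BITS_PER_BYTE = 8


def balanced_io_mbps(mips):
    """Return the I/O rate in MB/s that the rule pairs with `mips`."""
    # One bit of I/O per instruction: `mips` million bits per second,
    # divided by 8 bits per byte to get megabytes per second.
    return mips / BITS_PER_BYTE


def balanced_mips(io_mbps):
    """Return the MIPS rating that `io_mbps` MB/s of I/O can keep fed."""
    return io_mbps * BITS_PER_BYTE


print(balanced_io_mbps(8))   # 1.0
print(balanced_mips(100))    # 800
```

Read the other way around, a processor rated at several thousand MIPS wants hundreds of megabytes per second of data, which is why memory becomes the bottleneck.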

There are two morals to this performance story. The first is that even a single fast, but already commonplace, 32-bit processor is starting to push the limits of standard memory performance. The second is that even the differences between conventional memory types play a role in overall system performance. So it should come as no surprise that NUMA support is now in server operating systems like Microsoft’s Windows Server 2003 and in the Linux 2.6 kernel.

How NUMA Works

Not all memory is created equal. Generally speaking, the closer memory is to a processor in terms of access time, the faster the system’s overall I/O will go, which improves all aspects of system performance.

The example we all probably know best is the use of cache to store frequently accessed data from a hard drive. Since even slow memory is much faster to access than the speediest hard drive, the end result is a much faster system.

The same trick works for processors and memory. Small amounts of fast RAM, either on the chip itself (L1 or primary cache) or immediately next to the CPU (L2 or secondary cache), speed up main memory performance in the same way that hard drive cache speeds up disk performance.

NUMA takes cache’s basic concepts of memory locality and expands on them so that multi-processor systems can make effective use of not just their local memory but also of memory that is on different buses or, for that matter, is only connected to the processor over a fast network.

Specifically, NUMA has traditionally been a hardware memory architecture in which every processor, or group of processors, can access other CPUs’ memory. That does not mean they lack access to their own local memory; they have it. But by sharing memory, they become much more capable of performing efficient parallel processing and of dividing massively data-intensive tasks into manageable sizes.

Does this sound familiar? It should. SMP and clustering both try to do the same things for processing over a system bus, a backplane, or a fast network connection. What NUMA does is add memory management to both of those technologies so they can gain better memory access and faster overall performance.

For instance, in SMP, memory is usually shared over an interconnect bus. As the number of processors increases, so does the bus traffic, and eventually throughput starts to decline.

NUMA machines, however, use multiple buses to handle memory, making it much harder to slow a system down by throwing too much data at it. At the same time, NUMA provides a linear memory address space that enables processors to directly address all memory. Distributed memory, a technique with similar aims, has to contend more with data replication overhead.

Within most NUMA systems, as in an IBM NUMA-Q box, there is an additional area of fast memory called an L3 cache. All of the processors on a given bus first access memory from this cache and then look to the bus’s main memory resources for data.

A NUMA machine’s main memory is its Uniform Memory Access (UMA) region. Together, a set of local processors and their UMA region are called a node. In a node, the processors share a common memory space or ‘local’ memory. For example, an SMP system’s shared memory would make up a node. This memory, of course, provides the fastest non-cache memory access for the node’s CPUs. Multiple nodes are then combined to form a NUMA machine. Memory transfers between nodes are handled by routers. Memory that can only be accessed via a router is called ‘remote’ memory.
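The node vocabulary above can be sketched as a small data model. Everything in it is an assumption made for illustration: the two-node layout, the CPU numbering, the 4 GiB sizes and the helper names are invented, not taken from any real machine.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class Node:
    node_id: int
    cpus: frozenset   # ids of the CPUs on this node
    mem_start: int    # first address of the node's local memory
    mem_end: int      # one past its last address


# Hypothetical machine: two nodes, four CPUs and 4 GiB each,
# mapped back to back into one linear address space.
NODES = (
    Node(0, frozenset({0, 1, 2, 3}), 0 * GIB, 4 * GIB),
    Node(1, frozenset({4, 5, 6, 7}), 4 * GIB, 8 * GIB),
)


def node_of_cpu(cpu):
    """Find the node a CPU belongs to (StopIteration if unknown)."""
    return next(n for n in NODES if cpu in n.cpus)


def node_of_address(addr):
    """Find the node whose local memory holds `addr`."""
    return next(n for n in NODES if n.mem_start <= addr < n.mem_end)


def classify(cpu, addr):
    """Return 'local' if `addr` lives on the CPU's own node, else 'remote'."""
    same = node_of_cpu(cpu).node_id == node_of_address(addr).node_id
    return "local" if same else "remote"


print(classify(0, 1 * GIB))  # local
print(classify(0, 5 * GIB))  # remote
```

Every CPU can name every address, which is the linear address space at work. What changes from one access to the next is only whether a router sits in the path.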

Needless to say, the name of the game in NUMA is to make those memory transfers as fast as possible so that a node’s processors don’t end up working with out-of-date data. There’s a whole set of problems associated with this, and they’re addressed by a variety of techniques deriving from cache coherency theory.

Programming for NUMA

One of the simplest ways to avoid coherency problems is also the key to programming NUMA machines effectively: maximize references to local memory on the node while minimizing references to remote memory. After all, access to remote memory may take three to five times as long as access to local memory, even in a well-designed NUMA system. By programming in this way, and letting the underlying hardware architecture and operating system deal with NUMA’s memory coherence issues, developers can create effective programs.
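A back-of-the-envelope sketch shows why locality matters so much. The three-to-five-times ratio comes from the text above. The 100 ns local access time and the function name are assumptions chosen only to make the numbers easy to read.

```python
def mean_access_ns(remote_fraction, local_ns=100.0, remote_ratio=3.0):
    """Average access time when `remote_fraction` of references are remote.

    `local_ns` is an assumed local access time; `remote_ratio` is how many
    times slower a remote access is (three to five in the text).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    remote_ns = local_ns * remote_ratio
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Compare a few mixes of local and remote references at both ends
# of the three-to-five-times range.
for fraction in (0.0, 0.25, 0.5):
    for ratio in (3.0, 5.0):
        print(fraction, ratio, mean_access_ns(fraction, remote_ratio=ratio))
```

Under these assumptions, sending a quarter of all references to remote memory raises the average access time by half at a ratio of three and doubles it at a ratio of five.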

That’s easier said than done. To do it today, most developers use tools like Multipath’s Fast Matrix Solver. These typically work by binding I/O threads to specific processors and nodes and by enabling you to explicitly place elements into local node memory.
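The tools themselves are not shown here, but the binding half of the idea can be sketched with the operating system’s CPU affinity call. This assumes Linux, where Python’s standard library exposes `os.sched_setaffinity`. The helper name `pin_to_cpus` is made up for illustration. The sketch covers only CPU binding; explicit memory placement is a separate step that it does not attempt.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to `cpus`; return the resulting CPU set.

    Returns None on platforms without an affinity call (in Python's
    standard library it is available on Linux, not everywhere).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, set(cpus))   # pid 0 means "the caller"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)     # CPUs we may run on now
        first = min(allowed)
        print("pinned to", pin_to_cpus({first}))
        pin_to_cpus(allowed)                  # restore the original mask
```

Keeping a worker on the CPUs of one node is what gives the rule of favoring local memory a chance to hold, since the data it touches no longer follows it from node to node.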

A NUMA architecture or operating system tries to allocate memory efficiently for you, but its memory management works best for general cases and not for your specific application. As a NUMA system’s overall load increases, its memory management overhead takes up more time, resulting in slower overall performance.

Of course, NUMA doesn’t work magic. Just because your application ‘sees’ what appears to be an extraordinarily large memory space doesn’t mean that it knows how to use it properly. Your application must be written to make it NUMA aware. For example, if you’re old enough to recall writing applications using overlays to deal with the high likelihood of having to dip into virtual memory, those techniques might come in handy for dealing with remote memory.

To help you deal with remote memory, NUMA uses the concept of ‘distance’ between components such as CPUs and local and remote memory. In Linux circles, the metric usually used to describe ‘distance’ is hops. Less accurately, the terms bandwidth and latency are also bandied about. How the term is used varies, but generally speaking the lower the number of hops, the closer a component is to your local processor.
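As an illustration of how a hop metric might be used, the sketch below ranks nodes from nearest to farthest. The four-node ring and its hop counts are hypothetical, as is the function name. A real system would report its own topology.

```python
# Hypothetical hop counts between four nodes wired in a ring:
# HOPS[a][b] is the number of router hops from node a to node b.
HOPS = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]


def nodes_by_distance(home, hops=HOPS):
    """List node ids from nearest to farthest as seen from `home`."""
    # Ties are broken by node id so the order is deterministic.
    return sorted(range(len(hops)), key=lambda n: (hops[home][n], n))


print(nodes_by_distance(0))  # [0, 1, 3, 2]
print(nodes_by_distance(2))  # [2, 1, 3, 0]
```

An allocator that cannot satisfy a request from the home node could walk this list in order, taking memory from the closest node that still has room.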

Sounds like a lot of trouble, doesn’t it? So why use NUMA? The first reason for many of us is that NUMA can make scaling SMP and tightly-coupled cluster applications much easier. The second is that NUMA is ideal for the highly parallel computations that supercomputers thrive on, such as weather modeling.

It’s the first reason, combined with the enormous data hunger of today’s SMP boxes, that has brought NUMA out of a few specialized niches and into Server 2003 and Linux 2.6. While long-term deployment of NUMA-aware applications is probably still a few years away, NUMA should emerge as the memory management technique of choice for high-end server application developers as tools come out that make its memory management easier.
