Registered memory imbalances

In prior blog posts, I’ve talked about the implications of registered memory for both MPI applications and implementations.

Here’s another fun implication, discovered within the last few months by Nathan Hjelm and Samuel Gutierrez at Los Alamos National Laboratory: registered memory imbalances.

As an interesting side note: as far as we can tell, no other MPI implementation attempts either to balance registered memory between MPI processes or to handle the performance implications of grossly imbalanced registered memory consumption.

Let’s review a few key points before defining what registered memory imbalances are.

Recall that registering memory means two things:

Pinning virtual memory in place to a specific physical memory location

Notifying one or more entities of this virtual-to-physical memory mapping (e.g., notifying an OS-bypass capable NIC, such as an InfiniBand HCA)
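The “pinning” half of registration can be illustrated with POSIX mlock(), which locks a virtual address range into physical RAM. (The notification half is vendor-specific; on InfiniBand it is typically done via ibv_reg_mr() in libibverbs, which is not shown here.) A minimal sketch:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Allocate a buffer and pin it in physical memory with mlock().
 * This is only the first half of "registering" memory; an OS-bypass
 * NIC would additionally be told the virtual-to-physical mapping
 * (e.g., via ibv_reg_mr() on InfiniBand). */
void *pin_buffer(size_t len)
{
    void *buf = malloc(len);
    if (buf == NULL) {
        return NULL;
    }
    memset(buf, 0, len);          /* touch the pages so they are resident */
    if (mlock(buf, len) != 0) {   /* pin; fails if RLIMIT_MEMLOCK is exceeded */
        free(buf);
        return NULL;
    }
    return buf;
}

/* Unpin and free the buffer. */
void unpin_buffer(void *buf, size_t len)
{
    munlock(buf, len);
    free(buf);
}
```

Note that mlock() is exactly why registered memory is a limited resource: each pinned page is taken away from what the OS can page out or give to someone else.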

In any given system, there is a limit on how much memory can be registered. There are multiple sources of this limit (e.g., hardware resources on the NIC, tunable parameters for the NIC, operating system limits, amount of physical RAM, etc.), but they all resolve down to one thing: you can only register a fixed, finite amount of memory at a time.

Regardless, it’s a finite resource. And it’s shared among all MPI processes running on the same machine.
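On Linux, one of the OS-level limits mentioned above is the per-process locked-memory limit, which you can inspect with getrlimit(RLIMIT_MEMLOCK). To be clear, this is only one of several caps; the NIC hardware and its driver impose their own, which are not visible through this interface:

```c
#include <sys/resource.h>

/* Fetch this process's locked-memory limit, one of several caps on
 * how much memory can be registered.  Returns 0 on success and fills
 * in *rl; rl->rlim_cur may be RLIM_INFINITY (common in HPC setups,
 * where admins raise the limit so the NIC limits dominate instead). */
int get_memlock_limit(struct rlimit *rl)
{
    return getrlimit(RLIMIT_MEMLOCK, rl);
}
```

The important point is that whatever the effective limit turns out to be, it is a single per-host pool that all local MPI processes draw from.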

As a consequence, Nathan and Samuel discovered that a small number of MPI processes can consume a large portion of the available registered memory, thereby starving other MPI processes on the same machine. Indeed, in their experiments, MPI processes that were close to the HCA (NUMA-wise, that is) had a much greater chance of consuming inordinately more registered memory than their peers.

In hindsight, this is a fairly obvious race condition and consequence of a shared resource. But there are no good tools to detect such a situation (Nathan only found it by adding instrumentation deep within the bowels of Open MPI), which is one reason we assume no one has discovered this specific issue before.

This registered memory imbalance between MPI processes can actually cause serious performance degradation. For example, MPI processes can be forced to avoid RDMA-based protocols and fall back to send/receive protocols (which tend to be less efficient on InfiniBand hardware).

In the upcoming Open MPI 1.6.1 release, we will fix a bug related to registered memory imbalances, but not try to address the consequent performance issues that occur. In the 1.7 series, we have a few ideas about how to ensure that MPI processes don’t (permanently) consume inordinately more registered memory than their peers.

It’s a tricky problem, because you periodically want individual MPI processes to be able to use “too much” registered memory (i.e., more than their “fair share” compared to their peers running on the same machine) to be able to absorb bursty MPI traffic. But then that “burst” of registered memory must be returned in order to enforce long-term stability of registered memory consumption.
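One way to think about this long-term-fairness vs. short-term-burst tension is a simple per-process accounting scheme: each process has a fair share, may temporarily exceed it up to a burst cap, but must eventually deregister the excess. To be clear, this is purely an illustrative sketch under invented names and policy, not Open MPI’s actual mechanism:

```c
#include <stddef.h>

/* Hypothetical per-process accounting for registered memory.
 * "fair_share" is the long-term allowance; "burst_cap" is how far
 * the process may temporarily exceed that allowance. */
typedef struct {
    size_t fair_share;  /* long-term allowance (bytes)    */
    size_t burst_cap;   /* temporary hard ceiling (bytes) */
    size_t in_use;      /* currently registered (bytes)   */
} reg_account;

/* Try to account for registering "len" more bytes.
 * Returns 1 on success, 0 if it would exceed the burst cap. */
int reg_acquire(reg_account *a, size_t len)
{
    if (a->in_use + len > a->burst_cap) {
        return 0;                 /* even bursting won't cover this */
    }
    a->in_use += len;
    return 1;
}

/* Account for deregistering "len" bytes. */
void reg_release(reg_account *a, size_t len)
{
    a->in_use = (len > a->in_use) ? 0 : a->in_use - len;
}

/* After a burst: how many bytes must be deregistered to get back
 * under the long-term fair share? */
size_t reg_excess(const reg_account *a)
{
    return (a->in_use > a->fair_share) ? a->in_use - a->fair_share : 0;
}
```

The hard part in a real implementation is not this arithmetic, of course; it is deciding *when* to force the give-back (and from whom) without stalling in-flight communication.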


MSMPI has some tunable parameters to control how large the memory registration cache can grow. It's not perfect, though, and can still leave large amounts of memory registered because we currently only flush the cache when we register new buffers, rather than also flushing at the end of an RDMA transfer. Our default registration cache size limit is half of the per-core physical memory per process.
Windows provides an API, CreateMemoryResourceNotification, which returns a handle that is signalled when the requested condition is met. The notification object can be queried (non-blocking), or can be passed to any of the OS wait routines for a blocking notification, and can be used to detect low physical memory conditions.
Another routine that can be handy is the GlobalMemoryStatusEx function, which returns total and available physical memory, as well as a 'memory load' value that indicates the percentage of physical memory in use.
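For reference, the two Win32 calls mentioned above can be combined roughly like this (an untested sketch with error handling trimmed; it merely queries the low-memory state and the overall memory statistics):

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Handle that becomes signaled when physical memory runs low. */
    HANDLE low_mem = CreateMemoryResourceNotification(
        LowMemoryResourceNotification);

    /* Non-blocking query of the notification state; the handle could
     * instead be passed to WaitForSingleObject() for a blocking wait. */
    BOOL is_low = FALSE;
    QueryMemoryResourceNotification(low_mem, &is_low);

    /* Overall physical memory statistics, including the
     * "memory load" percentage. */
    MEMORYSTATUSEX ms;
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatusEx(&ms);

    printf("low memory: %d, load: %lu%%, avail phys: %llu bytes\n",
           (int) is_low, (unsigned long) ms.dwMemoryLoad,
           (unsigned long long) ms.ullAvailPhys);

    CloseHandle(low_mem);
    return 0;
}
```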

Open MPI has a less-fine-grained approach for registered memory: you can cap the amount of registered memory in a given process (to a fixed number of bytes). It defaults to unlimited, however, which is one reason you can get into these registered memory imbalances.
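For the curious, the cap is set via an MCA parameter on the mpirun command line. The parameter name below is from the 1.6-era mpool_rdma registration cache and may differ in other releases, so treat it as an example rather than gospel:

```shell
# Cap registered memory per process at 1 GB (value in bytes; 0 = unlimited).
# Verify the parameter name for your release with:
#   ompi_info --param mpool all
mpirun --mca mpool_rdma_rcache_size_limit 1073741824 -np 16 ./my_mpi_app
```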
We definitely need some work in this area; we'll be looking at that during the v1.7 series.
