MCAPI plays better in the embedded space than MPI (that’s what MCAPI was designed for, after all). Simply put: MPI is too feature-rich (read: big) for embedded environments, reflecting the different design goals of MCAPI vs. MPI.

MCAPI + MPI might be a useful combination. The article cites a few examples of using MCAPI to wrap MPI messages. Indeed, I agree that MCAPI seems like it may be a useful transport in some environments.

One thing that puzzled me about the article, however, is that it states that MPI is terrible at moving messages around within a single server.

Huh. That’s news to me…

To be clear: there has actually been quite a bit of research into making MPI highly efficient on multicore systems. Open MPI’s shared memory transport, for example, has evolved through multiple generations of research (and is just about to see another update). MPICH2’s Nemesis shared memory transport has been the subject of many academic papers.

Both provide excellent performance for moving MPI messages between processes on the same server.

Is MCAPI that much more efficient at moving messages between memory spaces than optimized MPI implementations? We MPI implementors always have more optimization work to do, but I’m unaware (perhaps ignorant?) of what MCAPI could be doing that is fundamentally different from what typical MPI implementations do.

That being said, I might well be misunderstanding the authors’ intent. The examples at the end of the article seem to imply that they may be referring to using MCAPI to communicate between server cores and accelerators. I can certainly see a few cases where using MCAPI as an MPI transport might be useful:

Exchanging messages with an MPI process on an accelerator. The idea of running MPI on an accelerator hasn’t panned out well yet in research circles, but people are still thinking about the problems involved. For the moment, using plain MCAPI to send to an accelerator might be better.

Exchanging messages with an MPI process on a different virtual machine on the same physical server. I don’t know if the hypervisor would allow that, though. Hmm. (You may scoff at the idea of running even one HPC-oriented virtual machine on a server, let alone several, but core counts keep rising…)

Exchanging messages with an MPI process on an FPGA or other specialized hardware computational resource (e.g., connected via PCI, QPI, HT, or some other “fast” connection network).

And who knows? Maybe MCAPI implementations are faster than MPI shared memory transports. Perhaps it would be useful to have a performance shootout between traditional MPI shared memory implementations and MPI-over-MCAPI.

9 Comments.

Nobody has in-kernel userspace-to-userspace memory copy working again yet, do they? Without it you have to use shared-memory copy-in/copy-out buffers, which halves the bandwidth.
At Quadrics we had two features here. First, we could remap the whole of the BSS and the heap allocators to shared memory, so you could just memcpy() to and from a remote address space. Second, we had a modified kernel ptrace API that you could use to get the kernel to do a direct userspace-to-userspace copy into a remote process's address space.
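
To make the copy-in/copy-out cost in the comment above concrete, here is a minimal sketch, not taken from any MPI implementation. It assumes a single process standing in for both the sender and the receiver, with two mappings of one shared segment, and it skips all synchronization. The names `copy_in_copy_out` and `direct_copy` and the `chunk` parameter are made up for illustration. The point is only the byte count: going through a shared bounce buffer touches every byte twice, while a direct copy touches it once.

```python
from multiprocessing import shared_memory


def copy_in_copy_out(payload: bytes, chunk: int = 4096):
    """Move payload through a fixed-size shared bounce buffer.

    Returns (received_bytes, total_bytes_copied).
    Hypothetical helper: one process plays both sender and receiver.
    """
    sender = shared_memory.SharedMemory(create=True, size=chunk)
    # Second mapping of the same segment, as a separate process would have.
    receiver = shared_memory.SharedMemory(name=sender.name)
    out = bytearray(len(payload))
    copied = 0
    try:
        for off in range(0, len(payload), chunk):
            n = min(chunk, len(payload) - off)
            # Copy 1: sender's private memory -> shared buffer.
            sender.buf[:n] = payload[off:off + n]
            # Copy 2: shared buffer -> receiver's private memory.
            out[off:off + n] = receiver.buf[:n]
            copied += 2 * n
    finally:
        receiver.close()
        sender.close()
        sender.unlink()
    return bytes(out), copied


def direct_copy(payload: bytes):
    """Single copy, as a kernel-assisted or remapped-heap path would do."""
    out = bytearray(len(payload))
    out[:] = payload  # the only copy
    return bytes(out), len(payload)


if __name__ == "__main__":
    data = bytes(range(256)) * 40
    _, two_step = copy_in_copy_out(data, chunk=1024)
    _, one_step = direct_copy(data)
    print(two_step, one_step)
```

Real shared memory transports pipeline the two copies and add flow control, so the actual bandwidth penalty depends on the implementation; this only shows where the extra copy comes from.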

Ashley,
You might look at http://www.ipdps.org/ipdps2010/ipdps2010-slides/CAC/slides_cac_Mor10OptMPICom.pdf and related work on Nemesis in MPICH2 by INRIA and Argonne.
See also XPMEM (http://code.google.com/p/xpmem/), which is developed by some folks associated with Open MPI.
On Blue Gene/P, MPI can exploit the static TLB map to directly access memory in other processes with no overhead, but this exists because of the unique properties of CNK, e.g. the bijective mapping of virtual and physical addresses.

Microsoft Windows has had support for user-space to user-space copy for over a decade (since Windows 2000), allowing processes to move data in either direction via the ReadProcessMemory and WriteProcessMemory APIs.
Microsoft MPI takes advantage of this, though I don't know if any other MPI libraries on Windows do (they really should).

I find it interesting that the whole premise of the article seems to be that MPI applications fundamentally use the dynamic process model. I'm not sure a lot of MPI applications could handle a "user's inadvertent unplugging of a physical network cable". In my experience, MPI apps expect the cluster they run on to be effectively static for the duration of the job, else they fail.
The article seems to boil down to "because MPI can run across the WAN, it is slow", without acknowledging that the MPI libraries know and take advantage of the fastest interconnect available between any given processes.
