The Troubling State of MPI

The Message Passing Interface, or MPI
(not to be confused with the Max Planck Institutes), is in
the peculiar situation of being one of the most widely used
technologies in HPC and supercomputing, despite having been
declared dead for decades. Lately, however, my nose has been
picking up some smells which are troubling me. And others, too.

tl;dr: MPI is still standing strong, but its inflexibility is betraying its age. Contenders are flexing their muscles.

To make one thing clear: I'm not jumping on the "MPI is
dead" bandwagon here. My point of view is that MPI will continue
to be the standard interface for moving data on HPC machines, at least for a
very long time. MPI has outlived so many other technologies, it has
been here before InfiniBand, before the multi-core revolution, before
the ubiquity of accelerators... and it just won't die. Instead, it continues to
adapt:

Thanks to the plugin-based architecture of most implementations,
newer interconnects are easily supported. Often vendors will even
provide a custom implementation of the ibverbs library (even if the
new interconnect is not related to InfiniBand -- Cray did this for
Gemini IIRC), thereby removing the need to adapt MPI altogether.

Anyway. Thanks to a sequence of lucky coincidences, I had the pleasure to
attend the 2014 Mardi
Gras Conference earlier this year. It was organized by CCT at LSU,
Baton Rouge, LA. Pavan Balaji from Argonne was giving a talk there on
MPI for exascale. Pavan is deeply involved in MPI in general, but also
in the MPICH project -- just the guy to ask about the guts of MPI.

Supported Thread-Levels

My first question was related to why so many MPI implementations
are struggling to support MPI_THREAD_MULTIPLE well. Having multiple
threads in one MPI process can still be valuable, despite CMA and
such, as these may harness the CPU's shared caches -- especially
useful for memory bound codes.

As it turned out, the strategy most vendors chose to implement in
this case was to wrap all MPI calls in one big lock. The more threads
hammer MPI, the more this lock will hurt you -- contention is bad. Worse,
the lock adds overhead even if just one thread ends up calling MPI.
There are smarter ways of going about this: don't use one giant lock,
but multiple, each for a smaller scope. And we know lock-free
algorithms, especially useful for managing queues of transfers. And
much more. Given that multi-core CPUs aren't exactly new, I'm
surprised that this is still not resolved.

Interestingly, this also explains why most MPI variants have good support for
MPI_THREAD_SERIALIZED and MPI_THREAD_FUNNELED. The currently
accepted workaround for users is to funnel all MPI calls through one single
thread. The application developer needs to take care of not
overburdening this single thread. Oh, and of course the user who's
submitting the job also needs to take care of starting sufficient
numbers of processes per node, if nodes have lots of cores. And the
sysadmin needs to provide presets for facilitating different allocation
and pinning schemes. It's a pain.

Beyond the 2 GB Barrier

If you dig around the MPICH and Open MPI user mailing lists, you
will encounter multiple posts complaining about not being able to
send/receive more than 2 GB en bloc. 2 GB is not much, considering
that our machines
at last year's Student Cluster Competition came with 128 GB of RAM
per node. One source of these errors is genuine bugs in the MPI
implementation (e.g. using an int to track a message size instead of
a long). These are easily fixed.

The other source is apparently harder to address: the MPI standard
mandates that the number of items to be sent is specified by a plain
C int. And that locks you down to 2^31 elements. If you're
sending chars, you've just lost. It's not much better to be limited to
16 GB when sending doubles, though. Yes, you'll rarely send so many
doubles for now. But it creates nasty glitches in user code, which are
hard to hunt down.

Now, the textbook solution would be to optionally replace the int
counts with size_t. But this would require all MPI functions to be
specified twice in the standard: once for int, once for size_t. A
huge, but rather simple and mechanical change to the standard. I don't
know why, but according to Pavan, this solution was instantly rejected
by the MPI Standard Committee.
Another solution would be to use packed datatypes, e.g. instead of
sending 2^35 doubles, I could send them in 2^25 batches of 2^10
doubles each. So convenient! Not.

Finally, one could imagine creating a meta-MPI-implementation
which is not part of the standard, but provides 64-bit enabled
variants of the MPI API. Internally this meta-implementation would
route all calls through the original 32-bit API and make sure buffer
sizes etc. are set correctly. Sounds like a huge PITA, and a giant
waste of time? Well, according to Pavan, work on this is already in
progress. The name of the project escapes me though.

Update: the project is
called BigMPI.
At the time of writing the last commit was on 2013.09.24.

Update^2: Jeff Hammond, the
author of BigMPI, got back to me to let me know that the project was
rather meant as a tutorial to show users how to skirt this current
limitation of MPI. He did however hint at the possibility of developing a
fully fledged wrapper library at a later point in time.

C++ Bindings Removed from MPI-3

...with the rationale being that the original bindings were
basically a fig leaf on top of the C API, and no one was
using them anyway. So, no one uses bindings that don't offer any
benefit? Color me impressed. Today folks are flocking to Boost.MPI,
and rightfully so. Boost.MPI brings many features sorely missing in
vanilla MPI (e.g. support for STL types). If you ask me: Boost.MPI is
what the C++ bindings of MPI should have been. This goes to show that
it is possible to bridge from MPI to C++ well.

Asynchronous vs. Non-blocking Communication

A lot of users think of MPI_Isend/recv and friends as asynchronous
counterparts of MPI_Send/Recv etc. Implementers generally call them
non-blocking, and for a good reason: often MPI will make
progress (e.g. actually send the data) only if you're blocking in a
call to MPI_Wait, or similar. The reason for this is simple: even if
the interconnect supports RDMA and bus mastering, MPI still needs to
provide it with new addresses, move memory to pinned pages and so on
and so on. It's complicated. Still, asynchronous communication can be
hugely advantageous, especially in strong scaling setups. So, how to
achieve asynchronous progress?

Regularly ping MPI, e.g. via MPI_Test(). The frequency of these
calls needs to be carefully chosen though. Too few calls, and MPI
won't have enough cycles to make good progress. Too many calls and
you'll incur overhead. You might need to determine optimum parameters
not just for every new machine, but even for every problem size.

A pace maker: some architectures, e.g. IBM's Blue Gene/Q, come
with a core dedicated to pacing MPI. That's nice. But what if you're st(r)uck
with a machine that doesn't?

Victim threads, e.g. in MPICH's Nemesis engine: an elegant solution, which
comes at a price. Considering that your company just spent a gigantic
sum on procuring a new machine, people from accounting might not be
super happy if you told them that you're wasting 10% of the cores on
just waiting for communication.

Pavan's opinion regarding these issues was: it's not MPI's task to
make writing any parallel program easy. It's about making writing
trivial programs easy and writing hugely complex programs feasible. He
said it was his opinion that no end user (i.e. domain scientist, e.g.
a physicist writing a new simulation code) should ever touch MPI.
They should use computational libraries, which are easier to use and
will deliver crucial performance optimizations. A surprising, and
interesting point of view. One I can sympathize with. After all,
that's why I'm working on LibGeoDecomp.

This raises the question, though, of whether there is a way to express
parallelism in a generic, yet user-friendly way.

Feature Regressions in Open MPI

From time to time RFCs pop up on the Open MPI devel list. These are
used to discuss potentially disruptive changes to the code base with
the larger developer community. Usually they're concerned with adding
new features, but sometimes they also deal with cleaning up code or
removing outdated, unmaintained code. That's fine. Open MPI is an
active research project, and as members join and leave the project,
portions of the code that are not actively used may become orphaned.

A while ago one of my colleagues, Adrian
Knoth, added IPv6 support to Open MPI. Sounds like a trivial
change, right? After all, IPv6 is like IPv4, just with longer
addresses, right? Well, no. Today IPv6 support is disabled
by default, as it has been broken for five years and no one is
maintaining it.

Recently IBM has committed to opening their Power chips to
collaborators. Simultaneously, Open MPI developers are discussing
whether support for heterogeneous runs should
be removed. MPI's hugely complicated type system is usually
motivated by stating that if MPI could understand the structure of the
data being sent, then it could translate between different
architectures. If it can't do this anyway, why bother with defining
MPI datatypes?

The Gist of It

None of the issues I've presented are catastrophic. Cleaning up
code and removing rotting passages is part of a healthy software
engineering process. And yet it stinks:

MPI is not becoming easier to use, but harder. The voodoo dance an ordinary user has to complete to max out e.g. perfectly ordinary two-socket, 16-core nodes is inane:

polling MPI for asynchronous progress,

using a custom locking regime to funnel MPI calls into one thread,

packing data into arbitrary chunks to skirt the limitations of 32-bit ints.

Previously usable and useful features are being removed, sometimes confusing, sometimes even alienating users.

Trivial changes (e.g. the use of size_t) seem next to impossible to implement.

Combined, these smells indicate that the MPI user
experience is getting worse. Historically, this is not unusual for
software. Projects may experience difficult times when they're
undergoing major reconstructions (think KDE 4.0) -- but that is not to
say that they can't recover (I'm perfectly happy with my current KDE
4.12). Other reasons for degradations could be a growing alienation
between developers and users (think GNOME), or even developers no
longer being able to keep up the pace (think Amarok).

As I said, I don't think MPI is going away anytime soon. All I'm
trying to say is that future developers might not be too sad if they
didn't have to touch MPI themselves. We can already see a trend
towards generic computational libraries (e.g. Physis or LibGeoDecomp)
and domain-specific problem solving environments (e.g. Cactus,
OpenFOAM, Gromacs). And if these libraries were to switch all of a
sudden to an MPI replacement...?

A New Hope: HPX

For me, a glimpse into such a better future is HPX. I won't go
into detail on how High Performance ParalleX (HPX) works, as
this would be beyond the scope of this post. But basically it is a
parallel runtime for C++ which gives you (amongst other things) a virtual global address
space, even on distributed memory machines. It provides you with
dozens of ways to express parallelism in your code naturally, but you
are still free to write very MPI-esque code with it. And even if you
abuse it in such a way, it will not expose any of the weaknesses
discussed above. HPX owes much of its prowess to state-of-the-art software
engineering practices, one of them being the use of a modern
programming language: C++ instead of C. I won't open the
object-orientation vs. procedural programming can of
worms now. All I'm saying is that modern C++ has some elegant ways of
managing complexity.

I wouldn't label HPX an MPI killer, as it's really solving a
different problem: it's actually trying to do more. HPX is not merely
a message passing interface, it goes beyond. Yet, as a pure C++
library, it is of limited use for Fortran users. Also, it's still not
as mature as MPI. But the results I've seen so far, both
performance-wise and from a usability point of view, are extremely
promising.