Unsung heroes: MPI runtime environments

Most people immediately think of short-message latency, or perhaps large-message bandwidth, when thinking about MPI.

But have you ever thought about what your MPI implementation has to do before your application even calls MPI_INIT?

Hint: it’s pretty crazy complex, from an engineering perspective.

Think of it this way: operating systems natively provide a runtime system for individual processes. You can launch, monitor, and terminate a process with that OS’s native tools. But now think about extending all of those operating system services to gang-support N processes exactly the same way one process is managed. And don’t forget that those N processes will be spread across M servers / operating system instances.

Parallel runtime environments have been a topic of much research over the past 20 years. Tremendous advances have been made, largely driven by the needs of the MPI and broader HPC communities.

When I think of MPI runtime environments, I typically think of a spectrum:

On one end of the spectrum, there are environments that provide almost no help to an MPI implementation — they provide basic “launch this process on that server” kind of functionality. ssh is a good example in this category.
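To make the "almost no help" end concrete, here's a hypothetical sketch (the hostnames and application name are invented) of all that an ssh-style launcher really gives you: one remote command line per process. Everything else, such as I/O forwarding, signal propagation, and monitoring, is left to the MPI implementation to build itself.

```python
def build_launch_cmds(hosts, app_argv):
    """One ssh invocation per MPI process; hostnames here are made up."""
    return [["ssh", host] + app_argv for host in hosts]

# "Launch this process on that server" -- and nothing more.
cmds = build_launch_cmds(["node01", "node02"], ["./my_mpi_app"])
```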

On the other end of the spectrum are environments that were created specifically to launch, initialize, and manage large-scale parallel applications. These environments do everything behind the scenes for the MPI implementation; the bootstrapping functionality in MPI_INIT can be quite simple.

Put differently, there are many services that an MPI job requires at runtime. Some entity has to provide these services — either a native runtime system, or the MPI implementation itself (or a mixture of both).

Here are a few examples of such services:

Identification of the servers / processors where the MPI processes will run

Launch of the individual MPI processes (which are usually individual operating system processes, but may be individual threads, instead)

Allocation and distribution of network addresses in use by each of the individual MPI processes

Standard input, output, and error gathering and redirection

Distributed signal handling (e.g., if a user hits control-C, propagate it to all the individual MPI processes)

Monitoring each of the individual MPI processes for both successful and unsuccessful termination (and then deciding what to do in each case)
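The address-distribution item deserves a concrete picture: modern runtimes provide something like a key-value exchange (PMI is one real-world example of this idea) where each process publishes its network "business card" and later looks up its peers'. Here's a toy, in-memory stand-in; the addresses, ports, and key names are all invented for illustration.

```python
class ToyKVS:
    """In-memory stand-in for a runtime-provided key-value exchange."""

    def __init__(self):
        self._store = {}

    def put(self, rank, addr):
        # Each process publishes its own network address under its rank.
        self._store[f"addr-{rank}"] = addr

    def fence(self):
        # A real runtime synchronizes all processes here and distributes
        # the published data scalably; this toy has nothing to do.
        pass

    def get(self, rank):
        # After the fence, peers can look up each other's addresses.
        return self._store[f"addr-{rank}"]

kvs = ToyKVS()
for rank, port in enumerate([5000, 5001, 5002]):
    kvs.put(rank, (f"10.0.0.{rank + 1}", port))   # made-up addresses
kvs.fence()
```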

That is a lot of work to do.
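Several of the services above (launch, stdout gathering, signal propagation, termination monitoring) can be sketched in miniature on a single node. This is emphatically not how a real runtime is built; it's just a toy stand-in that uses local subprocesses in place of remote MPI processes.

```python
import signal
import subprocess
import sys

def run_job(argv_per_process):
    """Launch N processes, forward control-C to all of them, gather
    their stdout, and check each one's exit status."""
    procs = [subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
             for argv in argv_per_process]

    def forward_sigint(signum, frame):
        for p in procs:                      # distributed signal handling
            p.send_signal(signal.SIGINT)

    old_handler = signal.signal(signal.SIGINT, forward_sigint)
    try:
        outputs = [p.communicate()[0] for p in procs]   # stdout gathering
        codes = [p.returncode for p in procs]           # termination monitoring
    finally:
        signal.signal(signal.SIGINT, old_handler)
    return outputs, codes

# Stand-in "MPI processes": each one just prints its rank.
outs, codes = run_job(
    [[sys.executable, "-c", f"print({rank})"] for rank in range(4)])
```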

Oh, and by the way, these tasks need to be done scalably and efficiently (this is where the bulk of the last few decades of research has been spent). There are many practical engineering issues that are just really hard to solve at extreme scale.

For example, it’d be easy to have a central controller and have each MPI process report in (this was a common model for MPI implementations in the 1990s). But you can easily visualize how that doesn’t scale beyond a few hundred MPI processes: you’ll start to run out of network resources, you’ll cause lots of network congestion (including contending with the application’s MPI traffic), etc.

So use tree-based network communications, and distribute the service decisions among multiple places in the computational fabric. Easy, right?
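For a flavor of what "tree-based" means here: arrange the per-node daemons in a k-ary tree so that each one talks to at most k+1 peers regardless of job size. A small sketch (the fan-out value of 2 is an arbitrary choice for illustration):

```python
import math

def kary_parent(rank, k=2):
    """Parent of `rank` in a k-ary tree rooted at rank 0 (None for the root)."""
    return None if rank == 0 else (rank - 1) // k

def kary_children(rank, nprocs, k=2):
    """Children of `rank`: each daemon contacts at most k children."""
    first = k * rank + 1
    return [c for c in range(first, first + k) if c < nprocs]

# With k=2, even a 20,000-process job is only ~15 levels deep, so launch
# commands and "everyone is up" notifications take O(log N) hops instead
# of N point-to-point connections to one central controller.
depth = math.ceil(math.log2(20000))
```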

Errr… no.

Parallel runtime researchers are still investigating the practical complexities of just how to do these kinds of things. What service decisions can be distributed? How do they efficiently coordinate without sucking up huge amounts of network bandwidth?

And so on.

Fun fact: a sizable amount of the research into how to get to exascale involves figuring out how to scale the runtime system.

Just look at what is needed today: users are regularly running MPI jobs with (tens of) thousands of MPI processes. Who wants an MPI runtime that takes 30 minutes to launch a 20,000-process job? A user will (rightfully) view that as 30 minutes of wasted CPU time on 20,000 cores.

Indeed, each of the items in the list above is worthy of its own dissertation; they’re all individually complex.

So just think about that the next time you run your MPI application: there’s a whole behind-the-scenes support infrastructure in place just to get your application to the point where it can invoke MPI_INIT.

I strongly disagree with the statement that a 30-minute job launch time is unacceptable. This ignores the benefits of doing detailed diagnostics during job launch that dramatically reduce the unexpected job failure rate and improve performance reproducibility. Launching a job on 100K processes of Blue Gene/P took around 15 minutes but it was absolutely worth it compared to similarly sized machines that booted much faster but which had much lower reliability and reproducibility. The time the user loses for booting jobs is more than made up for by the reduction in faults by virtue of detecting hardware issues before MPI_INIT is called.

I think you're comparing apples and oranges.
Sure, having the ability to have a slower, fully instrumented launch is a good thing. But most of the time, there isn't a failure during launch, so why pay the penalty?
I think a reasonable launch speed with the ability to report "common" errors is Good Enough, combined with a slower launch mode that can report detailed errors for those who want/need more information.
Put it this way -- if you give the user the following choice, "You can have a fast launch at scale that has less-detailed errors vs. a slower launch with more detailed errors", in the common case (i.e., day-to-day runs), they'll choose the faster launch every time.

A big additional challenge in this space is making the whole assemblage fault tolerant, so that it can potentially keep running if some of those M servers, running a bunch of the N processes, cease functioning.
