MPI progress

Can you describe the difference between the current situation and true background progression? Does the lack of background progression mean having to occasionally explicitly relinquish control to MPI in order to let one-sided operations proceed? Once true background progression is in place, would it involve extra threads and context switching, or use some other mechanism?

A great question. He asked it in the context of the new MPI-3 one-sided stuff, but it’s generally applicable for other MPI operations, too (even MPI_SEND / MPI_RECV).

Progressing of long-running MPI operations, such as a non-blocking send of a long message, is a difficult thing. There’s what the MPI standard says, and then there’s what (typically) happens in reality.

The MPI-2.2 document says the following in section 3.5 (my emphasis added):

If a pair of matching send and receives have been initiated on two processes, then at least one of these two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is satisfied by another message, and completes; the receive operation will complete, unless the message sent is consumed by another matching receive that was posted at the same destination process.

There is no mention of how the message passing progress occurs — it just says that it must occur.

Specifically, it doesn’t say that an MPI implementation may rely on calls to MPI functions to trip the internal progress engine. Most MPI implementations use this strategy as a cheap/easy way to keep progress occurring, but the above text makes it pretty clear that an MPI implementation is not allowed to rely solely on such mechanisms for ultimate completion.
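To make the "trip the progress engine from inside MPI calls" strategy concrete, here is a toy model in Python. It is only a sketch: `ToyRequest`, `ToyProgressEngine`, and `compute_with_polling` are hypothetical names invented for illustration, not part of any MPI API, and the chunk size is arbitrary. The point is that the pending send moves only when the application calls `test()`, which loosely mimics what happens with `MPI_Test` in implementations that rely on polling.

```python
from collections import deque


class ToyRequest:
    """Stand-in for an MPI_Request: tracks how many bytes remain to be sent."""

    def __init__(self, nbytes):
        self.remaining = nbytes

    @property
    def complete(self):
        return self.remaining == 0


class ToyProgressEngine:
    """Hypothetical engine that moves data only when the application calls in."""

    def __init__(self, chunk=4096):
        self.chunk = chunk
        self.pending = deque()

    def isend(self, nbytes):
        # Like a non-blocking send: queue the work and return immediately.
        req = ToyRequest(nbytes)
        self.pending.append(req)
        return req

    def progress(self):
        # Advance the oldest pending request by at most one chunk.
        if self.pending:
            req = self.pending[0]
            req.remaining -= min(self.chunk, req.remaining)
            if req.complete:
                self.pending.popleft()

    def test(self, req):
        # Like MPI_Test: every call trips the progress engine as a side effect.
        self.progress()
        return req.complete


def compute_with_polling(engine, req, work_items):
    """Interleave application work with polling; return the number of polls made."""
    polls = 0
    for _ in range(work_items):
        sum(range(1000))  # pretend to compute
        polls += 1
        if engine.test(req):
            break
    return polls


engine = ToyProgressEngine(chunk=4096)
req = engine.isend(16384)  # four chunks of work
stalled = req.remaining    # nothing has moved yet: nobody has polled
polls = compute_with_polling(engine, req, work_items=100)
print(stalled, polls, req.complete)
```

If the application never called `test()`, the request would sit at 16384 bytes remaining forever, which is exactly the behavior the progress rule is meant to rule out.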

However, there is the letter of the law, and then there is what happens in practice.

Many MPI implementations do not have fully asynchronous progress for all cases, largely because a progress thread running in the background can (severely) hurt latency and/or resource consumption. For example, an asynchronous thread requires locks in the critical message passing code paths, can thrash caches, incurs context switching costs, consumes more resources, etc. The effects of progress threads, particularly for short messages, are… complex, at best.

However, progress threads have been shown to be an acceptable way to get some types of asynchronous progress for large messages, particularly when most of the work is handled by hardware — not software. When dealing with large messages, the added costs of context switching (etc.) don’t matter as much.
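Here is a rough sketch of the progress-thread approach, again as a toy model in Python rather than anything resembling a real MPI implementation. `ThreadedToyEngine` and all of its methods are hypothetical, and the chunk size and polling interval are arbitrary assumptions. Note the lock that now sits on the send path: every message, short or long, pays for it, which is the latency concern described above.

```python
import threading
import time


class ThreadedToyEngine:
    """Hypothetical engine whose background thread advances pending transfers."""

    def __init__(self, chunk=4096, poll_interval=0.001):
        self.chunk = chunk
        self.poll_interval = poll_interval
        self.lock = threading.Lock()  # now required on every send path
        self.pending = []             # entries are [remaining_bytes, done_event]
        self.stop = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def isend(self, nbytes):
        done = threading.Event()
        with self.lock:  # the cost that short messages also pay
            self.pending.append([nbytes, done])
        return done

    def _run(self):
        while not self.stop.is_set():
            with self.lock:
                if self.pending:
                    entry = self.pending[0]
                    entry[0] -= min(self.chunk, entry[0])
                    if entry[0] == 0:
                        self.pending.pop(0)
                        entry[1].set()  # mark the request complete
            # Yield the CPU; this toy simply sleeps between chunks.
            time.sleep(self.poll_interval)

    def shutdown(self):
        self.stop.set()
        self.thread.join()


engine = ThreadedToyEngine()
done = engine.isend(64 * 1024)
# The application never calls back into the engine; it only waits at the end.
finished = done.wait(timeout=5.0)
engine.shutdown()
print(finished)
```

For a large transfer, the per-chunk lock and wakeup costs are small next to the transfer itself. For a tiny message, they can easily dominate.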

Looking at it in one way: MPI implementations have weaseled out of doing the hard work of true asynchronous progression. But looking at it another way, MPI implementations have stayed out of that (very) difficult feature because it typically adds to short message latency — which is one of the first metrics that anyone looks at in an MPI implementation.

If your MPI implementation has high short message latency, no one will care if you have true asynchronous progress. That’s a cynical statement, but it’s true (some MPI implementations went out of business because it’s true!).

All that being said, modern networking and co-processor hardware can help in many cases. For example, an MPI implementation can (sometimes) hand off a long message to capable NICs and let the hardware handle the entire transmission (and/or receipt). Once the network action is complete, the hardware can notify the software so that the MPI layer can mark the corresponding MPI_Request as complete.

Hence, most MPI implementations try to co-opt hardware to assist as much as possible (e.g., for offloading message passing operations) and provide the cheapest way possible to honor the MPI progress rule.

But that’s usually still not enough. I know of (at least) one MPI implementation that raised SIGALRM at least once a day to asynchronously trip its progression engine.

Terrible?

Yes.

But it honored the MPI progress rule. And that MPI implementation was still able to have low latency because it wasn’t encumbered by locks for progress threads, etc.
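For the curious, the timer-signal trick can be sketched with Python's standard `signal` module (Unix only, and the handler must be installed from the main thread). This is a toy under stated assumptions, not how that implementation actually did it: the `pending` dictionary, `on_alarm`, and `run_with_timer_progress` are hypothetical names, and the 10 ms interval is chosen only so the demo finishes quickly.

```python
import signal
import time

CHUNK = 4096
pending = {"remaining": 8 * CHUNK}  # one toy transfer, in bytes


def on_alarm(signum, frame):
    # Signal handler: advance the toy transfer by one chunk per timer tick.
    pending["remaining"] -= min(CHUNK, pending["remaining"])


def run_with_timer_progress(interval=0.01, deadline=5.0):
    """Do 'application work' while a repeating SIGALRM drives progress."""
    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    start = time.monotonic()
    try:
        # The loop never calls a progress function; only the handler does.
        while pending["remaining"] > 0 and time.monotonic() - start < deadline:
            sum(range(10_000))  # pretend to compute
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)     # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)  # restore the prior handler
    return pending["remaining"] == 0


completed = run_with_timer_progress()
print(completed)
```

There is no lock on the send path here, which is the appeal. The price is that the handler interrupts whatever the application happens to be doing, so a real implementation would have to be very careful about what it touches inside the handler.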

But let’s tie this back to the original question — why was Geoffrey asking about asynchronous progress in terms of MPI-3 one-sided stuff?

Because the MPI-3 one-sided working group people tell me that the new MPI-3.0 one-sided functionality will pretty much force the issue of asynchronous progress on all MPI implementations. I honestly don’t know the details (the new MPI-3.0 one-sided chapter scares me!), but I believe them.

Meaning: all of us MPI implementers are going to have to figure out how to do true asynchronous progress — including that of short messages — without adding latency. Yowzers.

Unfortunately, I don't. :-(
There have been a bunch of academic/research papers about new ways to do things in MPI implementations over the past several years, but I'm unaware of anyone writing anything approaching a comprehensive "here's all the complicated / intricate / difficult / nifty things that an MPI implementation needs to worry about" kind of document (book or otherwise).
I wrote a chapter in the Architecture of Open Source Applications book (volume 2) about the overall architecture of Open MPI, but it kinda skims the topic of MPI itself and focuses more on the backbone infrastructure of Open MPI.
So -- I don't have a good answer for you. :-( Sorry!

The MPI-3 one-sided stuff is really nothing that brings you to the next level of performance or scalability. For exascale-level computing, most people will go with PGAS-like APIs like this one (http://www.gaspi.de/projekt.html).

My $0.02 is that it's too early to say definitively what most people will be doing for exascale.
1. Keep in mind that the total group of people performing exascale computations is going to be a pretty small, elite group for quite a while. Petascale is getting somewhat pedestrian these days, but still -- your grandma ain't doing it yet.
2. Many technologies -- and corresponding software APIs -- need to be adapted/changed/re-invented for exascale. I admit to not following exascale progress closely, but I think the jury is still out for exactly what technologies will and will not scale up that high.
