
This was originally posted on the mpi-comments mailing list, but as @AndrewGaspar noted in #86, this repository is much more active.

Problem

The topology creation functions (MPI_GRAPH_CREATE, MPI_DIST_GRAPH_CREATE, and MPI_DIST_GRAPH_CREATE_ADJACENT) define a virtual topology, described as a communication graph. Edges are defined between sources and destinations, both of which are ranks of processes in the input communicator [0].
In some instances, such as a communication graph where most nodes have the same number of neighbors (for instance a binary tree topology), it may be useful to have a "dummy" source or destination. This is provided for in MPI as MPI_PROC_NULL [1]. Then a caller can have boundary nodes (such as a root or leaves in a tree) communicate with these "dummy" neighbors, simplifying topology code logic.
This is analogous to what is done in non-periodic Cartesian topologies [2]; the MPI 3.1 documentation on MPI_CART_SHIFT [3] states:

Depending on the periodicity of the Cartesian group in the specified coordinate direction, MPI_CART_SHIFT provides the identifiers for a circular or an end-off shift. In the case of an end-off shift, the value MPI_PROC_NULL may be returned in rank_source or rank_dest, indicating that the source or the destination for the shift is out of range.

For a Cartesian topology, created with MPI_Cart_create, the sequence of neighbors in the send and receive buffers at each process is defined by order of the dimensions, first the neighbor in the negative direction and then in the positive direction with displacement 1. The numbers of sources and destinations in the communication routines are 2*ndims with ndims defined in MPI_Cart_create. If a neighbor does not exist, i.e., at the border of a Cartesian topology in the case of a non-periodic virtual grid dimension (i.e., periods[ . . . ]==false), then this neighbor is defined to be MPI_PROC_NULL.

If a neighbor in any of the functions is MPI_PROC_NULL, then the neighborhood collective communication behaves like a point-to-point communication with MPI_PROC_NULL in this direction. That is, the buffer is still part of the sequence of neighbors but it is neither communicated nor updated.
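The end-off shift behavior quoted above can be sketched in plain C without any MPI dependency. This is an illustrative toy, not real MPI: the `PROC_NULL` sentinel value and the function name are assumptions (the actual value of MPI_PROC_NULL is implementation-defined).

```c
#include <assert.h>

#define PROC_NULL (-1)  /* stand-in sentinel; the real MPI_PROC_NULL value is implementation-defined */

/* Simulate MPI_CART_SHIFT with disp = 1 along one dimension of size n:
 * a periodic dimension gives a circular shift, a non-periodic one gives
 * an end-off shift that yields PROC_NULL past either border. */
static void cart_shift_1d(int rank, int n, int periodic,
                          int *rank_source, int *rank_dest)
{
    if (periodic) {
        *rank_source = (rank - 1 + n) % n;  /* circular shift wraps around */
        *rank_dest   = (rank + 1) % n;
    } else {
        *rank_source = (rank > 0)     ? rank - 1 : PROC_NULL;  /* end-off shift */
        *rank_dest   = (rank < n - 1) ? rank + 1 : PROC_NULL;
    }
}
```

Border ranks of a non-periodic dimension thus get PROC_NULL neighbors, which is exactly the behavior the proposal asks to permit in graph topologies as well.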

Proposal

Explicitly allow MPI_PROC_NULL as a neighbor in graph topologies. Communication with such a neighbor in the neighborhood communication collectives (MPI_(I)NEIGHBOR_ALLGATHER(,V), MPI_(I)NEIGHBOR_ALLTOALL(,V,W)) is defined identically to point-to-point communications with MPI_PROC_NULL [1]:

A communication with process MPI_PROC_NULL has no effect. A send to MPI_PROC_NULL succeeds and returns as soon as possible. A receive from MPI_PROC_NULL succeeds and returns as soon as possible with no modifications to the receive buffer.

This is, as stated in [4], indeed already in the standard, and merely needs a clarification that MPI_PROC_NULL is explicitly permitted as a neighbor in graph topologies.

Changes to the Text

On pages 295, 297, and 299, clarify that MPI_PROC_NULL may be the neighbor of a process. Sample wording includes:

Processes with MPI_PROC_NULL neighbors are allowed.
MPI_PROC_NULL may be a neighbor of a process.

Impact on Implementations

Implementations will be required to permit MPI_PROC_NULL as a valid neighbor in graph topologies, if they do not already. In most cases, this should be fairly simple: see [6].

Impact on Users

None for current users; all existing code will remain valid. This may open up new possibilities and allow for some simplification of existing code.


A clear example is a virtual binary tree topology. Nodes have between 1 and 3 neighbors (a parent and up to two children); most nodes are inner nodes and have all three, some are leaves and have only one, and a few (the root and some inner nodes in a non-full tree) have two.

It then becomes difficult to tell what is in neighbors[0]; is this a parent, or a left child? This differs depending on whether or not you are the root. The same is true for neighbors[1]; this could be a left child, a right child, or a bogus value.

Suppose I later want to have every node send something to their left child; to do so, I must either store the left child separately from the neighbors array, or store what index in the neighbors array contains the left child, or recalculate either of those on demand—each time!

While this isn't difficult for a binary tree, it is easy to see how in a more general topology that is mostly homogeneous this may become a burden.

Now neighbors[0] is always the parent; neighbors[1] is always the left child; and neighbors[2] is always the right child. These may be nonexistent (MPI_PROC_NULL), but sending and receiving from them is never incorrect. In short, this helps simplify operations near the boundaries—exactly what MPI_PROC_NULL is meant to be used for!
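A minimal sketch of that fixed-slot layout in plain C (no MPI needed; the `PROC_NULL` sentinel value and the helper name are illustrative assumptions, since MPI_PROC_NULL's actual value is implementation-defined):

```c
#include <assert.h>

#define PROC_NULL (-1)  /* illustrative sentinel; MPI_PROC_NULL's value is implementation-defined */

/* For a binary tree laid out on ranks 0..size-1 (children of rank i at
 * 2i+1 and 2i+2), fill a fixed-order neighbor array:
 *   neighbors[0] = parent, neighbors[1] = left child, neighbors[2] = right child.
 * Missing neighbors become PROC_NULL, so every rank has exactly 3 slots
 * and each slot always means the same thing. */
static void tree_neighbors(int rank, int size, int neighbors[3])
{
    neighbors[0] = (rank == 0)           ? PROC_NULL : (rank - 1) / 2;
    neighbors[1] = (2 * rank + 1 < size) ? 2 * rank + 1 : PROC_NULL;
    neighbors[2] = (2 * rank + 2 < size) ? 2 * rank + 2 : PROC_NULL;
}
```

With this layout, "send to the left child" is always "send to neighbors[1]" on every rank, with no per-rank bookkeeping.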

In many instances, it is convenient to specify a "dummy" source or destination for communication. This simplifies the code that is needed for dealing with boundaries, for example, in the case of a non-circular shift done with calls to send-receive. (MPI 3.1, Section 3.11, Page 80)

A further point is that currently, a distributed graph topology cannot emulate a cartesian topology! I believe that this is not the intention of the MPI specification; a graph topology is intended to be the most general, and it is only because cartesian topologies are so common that the cartesian topology functions are provided (so as to simplify the process).


Given that MPI_PROC_NULL is supposed to be able to be used wherever a rank is required as a source or destination, and coupled with the fact that the neighborhood collectives explicitly support MPI_PROC_NULL neighbors, rejecting a topology in which MPI_PROC_NULL is specified as the rank of a neighbor would appear to be erroneous—at least to me.

What is your reasoning for the validity of Open MPI's behavior? Keep in mind that the neighbors at the borders of non-periodic dimensions of Cartesian topologies are explicitly defined to be MPI_PROC_NULL by the standard.

It is interesting to note that, as far as I can tell, the distributed graph topology constructors are the only place in the standard where ranks are specified to be 'non-negative integers'; all other functions merely specify them as 'integers'. Even the older non-distributed graph topology constructor specifies the flattened edge representation of neighbors as 'integers'.


Exactly, the sources and destinations are non-negative. This requirement implicitly forbids MPI_PROC_NULL (which by definition MUST be negative to avoid conflicting with another valid rank). As such, Open MPI's behavior is not erroneous. To me it is also logical.

This behavior may or may not have been the original intent and there may be need for an errata (at which point we will change the behavior of Open MPI). This will certainly be discussed at the MPI Forum meeting in June.


As far as I can tell, there is no requirement that an MPI implementation support INT_MAX processes or that MPI_PROC_NULL be negative. It is usually implemented as a negative integer (-2 in Open MPI, -1 in MPICH), but this is not required—it could conceivably be implemented as INT_MAX in an implementation that supported fewer than INT_MAX processes.

According to @hjelmn's reasoning, that would make whether or not MPI_PROC_NULL is a valid neighbor an artifact of the implementation, which is probably not the intent.


FWIW, my interpretation is based on there being no material benefit to allowing null. Allowing null saves some minimal bookkeeping for apps, but I can see the limitation allowing for some optimization in the library. There is a cost to allowing null.


Distributed graph topologies were added in MPI-2.2. The author for Process Topologies for MPI-2.2 was Torsten Hoefler. He was also the editor and organizer for that chapter for MPI-3.0 and MPI-3.1 and is still the chair for the upcoming MPI-4.0.

Certainly, every feature has a cost associated with it—hence why determining what the correct behavior ought to be and ensuring that implementations function according to said behavior is important.


Also, we went through a similar case for RMA - some implementations rejected MPI_PROC_NULL in communication calls such as MPI_Put. We viewed this as an errata and eventually added a specific statement to this effect; see § 11.3:

MPI_PROC_NULL is a valid target rank in all MPI RMA communication calls.

This was a clarification because of this text from section 3.11:

The special value MPI_PROC_NULL can be used instead of a rank wherever a source or a destination argument is required in a call.

This is in the context of “communication” but applies to all routines.
Yes, there is an overhead to checking for this. That is far more serious for the communication calls, particularly the RMA ones, so it is clear that the Forum has already decided that the extra overhead of checking for MPI_PROC_NULL is not a factor.
Bill
William Gropp
Director and Chief Scientist, NCSA
Thomas M. Siebel Chair in Computer Science
University of Illinois Urbana-Champaign

The most correct statement would be "valid rank value": the value should be between 0 and the size of the group in the object that the rank is relative to, or be MPI_PROC_NULL.
I do think this could be considered a ticket 0 change, but it would be fairly broad.
Bill

I think there is a reasonable argument for choosing to make this an errata in order to increase its visibility. I have always assumed, through osmosis, that MPI_PROC_NULL is not permitted for (dist) graph topologies. I agree that there is no strong indication in the text of the MPI Standard stating that restriction, but there is a clear general statement to the contrary. This may be a common misconception and it may behove the MPI Forum to publicise this change much more loudly than is typically done for a ticket 0 change.

@wgropp so, for dist graph, the text for the sources and destinations arguments would read "array of valid rank values" instead of "array of non-negative integers"? And, for graph, the text for the edges argument would read "array of valid rank values describing graph edges (see below)"?

I think an explicit sentence akin to the RMA errata should be added for each case as well.
For dist graph, how about adding (at line 34 on page 297):

MPI_PROC_NULL is a valid rank value for \mpiarg{sources} or for \mpiarg{destinations}.

For graph, how about adding (at line 43 on page 294):

MPI_PROC_NULL is a valid rank value for \mpiarg{edges}.


I agree with this. While the standard document is not a user manual, clarity is important, and these changes significantly increase the clarity without adding much text.
Bill

communication with null neighbours completes normally, like with point-to-point to/from null processes or off the edges of a cartesian topology [the easy tee shot]

Graph and dist graph enquiry functions will include null neighbours just like non-null neighbours [the edge of the fairway]

buffers/counts/datatypes/displacements/etc for null neighbours must be included for neighbourhood collectives - but not for other collectives [definitely into the rough now]

Alternatively, the graph and dist graph enquiry functions should remove all MPI_PROC_NULL values and give back only non-null neighbours (like a cartesian topology). That means getting back from a query function something different than what went into the creation function. In @omor1's example (issue #87 (comment), comment 13th Mar, binary tree), the user must keep their neighbours input array, perform MPI_GROUP_TRANSLATE_RANKS to get ranks in the new communicator (including MPI_PROC_NULL->MPI_PROC_NULL mappings), and avoid usage of the topology query functions. However, "the sequence of neighbours is defined as the sequence returned by " (MPI-3.1, section 7.6, page 314) [looks like we've found a water hazard]

None of this is trivial, so I agree that some additional explanation is required here.


If a neighbor in any of the functions is MPI_PROC_NULL, then the neighborhood collective communication behaves like a point-to-point communication with MPI_PROC_NULL in this direction. That is, the buffer is still part of the sequence of neighbors but it is neither communicated nor updated.

The neighborhood collectives are defined in terms of point-to-point communications with all neighbors (see e.g. § 7.6.2). As per § 7.6 (the quote above), this includes the null neighbors. Thus the inquiry functions must return null neighbors. Cartesian topologies don't have a direct analogue to MPI_DIST_GRAPH_NEIGHBORS_COUNT or MPI_DIST_GRAPH_NEIGHBORS. The closest is MPI_CART_SHIFT with input disp=1, which does indeed return null neighbors for the borders of non-periodic dimensions.

Since neighborhood collectives 'communicate' with null neighbors, buffers/counts/datatypes/displacements/etc must be included for them. I don't think that the 'normal' collectives should communicate with null neighbors though. How are normal collectives handled for cartesian topologies with non-periodic dimensions? As far as I know, all the non-neighborhood collectives only communicate with real processes, and never more than once (whereas the neighborhood collectives can communicate with null processes and with the same processes multiple times, in the case of an edge with multiplicity greater than 1). MPI_PROC_NULL 'always' belongs to every group (even MPI_GROUP_EMPTY), as per the definition of MPI_GROUP_TRANSLATE_RANKS, which states that the translation of MPI_PROC_NULL is always MPI_PROC_NULL—yet the null process is never included in 'normal' collective communications.
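The "part of the sequence of neighbors but neither communicated nor updated" rule can be mimicked with a toy stand-in in plain C (not real MPI; the `PROC_NULL` sentinel and the function are invented for illustration). The receive buffer holds one block per in-neighbor in neighbor order, and blocks belonging to null neighbors are simply left untouched:

```c
#include <assert.h>

#define PROC_NULL (-1)  /* illustrative sentinel for MPI_PROC_NULL */

/* Toy model of the receive side of a neighbor gather: recvbuf[i] is the
 * slot for neighbors[i]. contributions[r] stands in for the data that a
 * real rank r would send. Slots for PROC_NULL neighbors stay in the
 * sequence but are neither communicated nor updated. */
static void neighbor_gather(const int *neighbors, int n,
                            const int *contributions, int *recvbuf)
{
    for (int i = 0; i < n; i++)
        if (neighbors[i] != PROC_NULL)
            recvbuf[i] = contributions[neighbors[i]];
        /* else: the block keeps whatever value the caller put there */
}
```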

Some additional text clarifying interactions between communicators with virtual process topologies that have null neighbors or multiply-defined edges and non-neighborhood collective communications may be useful, but I think this is fairly unambiguous. (Side note: is it valid to pass MPI_PROC_NULL as the root in broadcast, gather, scatter, reduce, etc. operations for intracommunicators? This would appear to be ambiguous—it isn't explicitly excluded, and as per § 3.11 may be allowed. However, unlike with null neighbors, I fail to see the point of it, other than ensuring the generality of the operation.)


@omor1 I agree completely - my points in the list (I hope) follow the already-unambiguous text in §7.6 (without referencing it directly; thanks for finding the quotes) but show that if we start explaining (again) then the resulting text must be long and complex. Normal collectives are defined in terms of n point-to-point operations, where n is implied to be the result of MPI_COMM_SIZE, i.e. the number of real MPI processes in the communicator (e.g. MPI_GATHER, MPI-3.1, §5.5, p150, line 11, "as if the n processes in the group"). The "alternative" tries to make neighbourhood collectives like normal collectives but is, IMHO, untenable.

@jdinan Perhaps a cross-reference to §7.6 is sufficient? Such as "Section 7.6 describes how including MPI_PROC_NULL affects neighborhood collective operations."

@hjelmn This discussion about including MPI_PROC_NULL nodes (and duplicate nodes?) seems to indicate that the number of nodes in a distributed graph topology can be arbitrarily larger than the number of MPI processes in the communicator, which has implications for #89 (or, at least, for any future proposal to add MPI_DIST_GRAPH_MAP).

Given (MPI-3.1, §7.5.3, p294, line 35-36, re: MPI_GRAPH_CREATE):

The call is erroneous if it specifies a graph that is larger than the group size of the input communicator.

and (MPI-3.1, §7.5.3, p295, line 45-46, re: MPI_GRAPH_CREATE):

For a graph structure the number of nodes is equal to the number of processes in the group. Therefore, the number of nodes does not have to be stored explicitly.

it would seem that including MPI_PROC_NULL as a node in a non-distributed graph implies that one of the real MPI processes must be omitted as a node in that non-distributed graph.

The definitions for the distributed graph functions don't (I think) have this restriction.


This sounds ok. A couple of quick comments on the text: "valid" should be unnecessary; we would certainly like every argument to be valid. The new text specifying the set of valid ranks should also capture the requirement that ranks must be members of the group of the parent/old communicator, in addition to MPI_PROC_NULL. I don't see this in the text (sorry if I missed it; I only had time for a quick skim). Elsewhere in the spec, set notation is used for this, but that is perhaps extra credit.


@dholmes-epcc-ed-ac-uk @hjelmn counting MPI_PROC_NULL as a process in a communicator doesn't make sense. It isn't a 'real' process, and so poses a problem for collectives (MPI-3.1, §5.2.1, lines 31-32):

All processes in the group identified by the intracommunicator must call the collective routine.

MPI_PROC_NULL can't call anything, so it can't be a process in the group identified by an intracommunicator (whether it has a virtual process topology or otherwise).

Also note that while non-distributed graphs explicitly state the number of nodes in the resulting graph in the constructor, the distributed graph constructors do not (MPI-3.1, §7.5.4, p297 line 36-37 and p299 line 34-35):

The number of processes in comm_dist_graph is identical to the number of processes in comm_old.


The forum decided that this should be handled as a full ticket and should be considered after a further investigation of the implementation impact (having to check for NULL could lead to problems in optimized implementations). Either way, the forum decided that it should be clarified whether MPI_PROC_NULL is allowed or not.


I’m sorry that I wasn’t able to be present at the discussion.
The performance impact question doesn't make sense in the context of decisions that have been made by MPI to allow MPI_PROC_NULL in both point-to-point and RMA communication. Those are more latency sensitive and the test applies on every call. An important contributor to MPI's success is consistency in the underlying model (this is not perfect, but it is close). Saying that MPI_PROC_NULL can be used in some performance-critical places and not in others is not consistent with the MPI design.
Bill

@rlgraham32 will be able to more accurately represent his argument, but I think the main points were something like this:

Optimized implementations of neighborhood collectives (whether software or some future hardware) would not only have to check for MPI_PROC_NULL for the sake of not sending messages, but they would also have to shuffle buffers around to make sure the input and output buffers go to the correct places. Doing all of this extra shuffling is the concern as it may have a much greater performance impact than a single branch.

If an implementation is purely relying on point-to-point communication to implement neighborhood collectives (as all implementations we know of right now do), this has very little performance impact as you say because we only have to check for MPI_PROC_NULL, which we already do during all of those communication calls.


After thinking it over, I withdraw my concern. Since proc null is not in the range of the communicator's ranks, and the implementation can store a version of the graph without the proc nulls, an implementation can access only the data that needs to be sent/received without any special logic for handling the proc null case.


Is that true? If you remove the MPI_PROC_NULLs from your list of neighbors, but the user's input to the collective still includes buffers (or perhaps NULL pointers) for those MPI_PROC_NULL MPI processes, don't you need to do some adjustment before transmitting or receiving buffers?

For instance, if these are your neighbors:

0 | 1, 2
1 | NULL, 0
2 | 0, 3
3 | NULL, 2

Your buffers on ranks 1 and 3 will include extra entries that you'll need to ignore.
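One sketch of the bookkeeping this implies, in plain C (the names and the `PROC_NULL` sentinel are invented for illustration): an implementation that strips nulls from its internal neighbor list must still remember each surviving neighbor's original slot, so that blocks in the user's buffer line up with the original neighbor order.

```c
#include <assert.h>

#define PROC_NULL (-1)  /* illustrative sentinel for MPI_PROC_NULL */

/* Build a compacted neighbor list. For each surviving (non-null) neighbor,
 * record its rank and its ORIGINAL position in the user's neighbor sequence,
 * which is the block offset to use in the user's send/receive buffer.
 * Returns the number of real neighbors. */
static int compact_neighbors(const int *neighbors, int n,
                             int *real_ranks, int *block_offsets)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (neighbors[i] != PROC_NULL) {
            real_ranks[m]    = neighbors[i];
            block_offsets[m] = i;  /* keep the original slot position */
            m++;
        }
    }
    return m;
}
```

For rank 1 in the table above (neighbors NULL, 0), the compacted list has one entry: rank 0 at block offset 1, so its buffer's first block is skipped.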


I looked at the neighborhood alltoall and alltoallv, and a single pointer is used for all source buffers and another for all destination buffers. Data size and offsets, respectively, are used to get the address within the buffer, so there is really no portable way, at least for the alltoall, to specify an address for proc null. I may have missed something ...


the implementation can store a version of the graph without the proc nulls

Do you mean in addition to the version with them? They must be recorded somewhere, for the purpose of retrieving all neighbors with MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS (MPI-3.1, §7.5.5, p. 309, lines 37–41):

The number of edges into and out of the process returned by MPI_DIST_GRAPH_NEIGHBORS_COUNT are the total number of such edges given in the call to MPI_DIST_GRAPH_CREATE_ADJACENT or MPI_DIST_GRAPH_CREATE (potentially by processes other than the calling process in the case of MPI_DIST_GRAPH_CREATE).

For this proposal, the input and/or output buffers could have blocks that are not modified by the neighborhood collectives, since the corresponding neighbor is MPI_PROC_NULL. These neighbors can't just be ignored; that bypasses the whole point of the proposal. In any case, this particular behavior is already mandated by the standard for cartesian topologies with non-periodic dimensions.

Did I misunderstand what you meant @rlgraham32? My impression of what you said was that an implementation could accept MPI_PROC_NULL neighbors, but then just ignore them.


The "version" referenced in my comment just means an implementation copy of the graph: implementation-private data.

I think you are right with your comment on proc null. I was assuming that destinations were ordered such that the proc nulls in the graph would be first or last in the list, as in the Cartesian shift example, which need not be the case.

So one does need to check each neighbor to see whether the data really needs to be sent or not, whether via an MPI send routine or some other logic, when MPI semantics are not relied on for the internal collective implementation logic.

As for Wesley's concern (and mine) about conditional logic, we can replace the conditional logic by adding another vector, internal to the implementation, holding offsets: another memory reference.


I expect null neighbors to be uncommon (seeing as this hasn't been brought up as an issue before now), so optimizing for the more common case where neighbors actually exist is probably better. In that case, the CPU branch predictors (which should be able to predict the branches with high accuracy, since no neighbor is MPI_PROC_NULL) might get better performance with conditional logic than having to dereference memory—but that's something that would require testing & optimization.