Problem

Sending more than 2Gi elements in MPI is a pain.

The general strategy for implementing large-count operations is to use datatypes. In some cases, this is straightforward, but it appears to be a very poor solution in the case of v-collectives and reductions. In order to use the datatype solution for v-collectives, one has to map (counts[],type) to (newcounts[],newtypes[]), which then requires the w-collective, since only it takes a vector of types. For reductions, one has to unwind the datatype inside of a user-defined reduction. None of the solutions available outside of MPI work for nonblocking collectives, due to the allocation of temporary vector arguments. If it is possible with generalized requests, it is onerous.

A more subtle issue is the large-displacement problem, which exists even when all of the counts are less than INT_MAX, because of the limitations of the offset vector. If the sum of counts[i] up to any i < comm_size exceeds INT_MAX, then displs[i] will overflow. This means that one cannot use any of the v-collectives for relatively small data sets, e.g. 3 billion floats, which is only 12 GB per process. This is likely to be limiting when implementing 3D FFTs, matrix transposes, and I/O aggregation, all of which are likely to use v-collectives. Neighborhood collectives fixed the large-displacement problem, but if users want to use those as a drop-in replacement, they have to create a new communicator.

The displacement issue is exacerbated in the large-count case because all the displacements are interpreted in bytes rather than the extent of the datatype, so there is no way to index beyond 2GB of data, irrespective of the datatype and the counts.

Using the w-collective for large-count v-collectives has these issues:

Calling the w-collectives requires the allocation and assignment of O(nproc) vectors, which is tedious but certainly not a memory issue if one is in the large-count regime.

One cannot deallocate the argument vectors until the operation completes, which means that one cannot implement the nonblocking case, since there is no opportunity to deallocate the temporary vectors in the wait call (any solution involving generalized requests is almost certainly untenable for most users).

Because MPI_ALLTOALLW takes displacements of type int and interprets these irrespective of the extent of the datatype (see page 173 of MPI-3), it is hard to index more than 2GB of data *using any datatype*. There is a workaround using datatypes with the offset encoded internally (e.g. via MPI_Type_create_struct), but it is far from user-friendly.
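For illustration, here is roughly what the offset-encoding workaround looks like. This is a hedged sketch (a fragment, not a complete program; the variable names and the 3 GiB offset are ours): the large byte offset is absorbed into the derived type so that the int displacement passed to MPI_Alltoallw can stay zero.

```c
/* Sketch: hide a >2GB byte offset inside the datatype so that the int
   displacement handed to MPI_Alltoallw can be 0 for this rank. */
MPI_Aint offset = (MPI_Aint)3 * 1024 * 1024 * 1024; /* 3 GiB into the buffer */
int blocklen = 1048576;                             /* elements at that offset */
MPI_Datatype base = MPI_FLOAT, shifted;
MPI_Type_create_struct(1, &blocklen, &offset, &base, &shifted);
MPI_Type_commit(&shifted);
/* ...pass sdispls[i] = 0 and sendtypes[i] = shifted to MPI_Alltoallw... */
MPI_Type_free(&shifted);
```

Every rank that needs a distinct offset needs a distinct datatype, which is exactly why this is not user-friendly.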

In the absence of proper support in the MPI standard, the most reasonable implementation of large-count v-collectives uses point-to-point, which means that users must make relatively nontrivial changes to their code to support large counts, or they have to use something like BigMPI, which already implements these functions (vcollectives_x.c). An RMA-based implementation is also possible, but users are unlikely to accept this suggestion.

One can also map the v-collectives to MPI_Neighbor_alltoallw, but in a far-from-efficient manner, and this is not particularly useful for the nonblocking case because MPI_Dist_graph_create_adjacent is blocking.

Proposal

The straightforward, user-friendly solution to this problem is to add new functions that use MPI_Count and MPI_Aint for counts and displacements, respectively.

We are not proposing to add new functions for everything, just the standard collectives (neighborhood collectives will be proposed later as a separate ticket).

Adding _x versions of the v-collectives and w-collectives that have counts of type MPI_Count and displacement vectors of type MPI_Aint[] is the most direct solution and prevents users from having to allocate and set O(nproc) vectors in the course of mapping to the most general collective available (e.g. MPI_NEIGHBOR_ALLTOALLW).

We add reductions (reduce, allreduce, reduce_scatter, reduce_scatter_block, scan, exscan) as well, with the limitation that user-defined reductions are not supported because these would require a new version of MPI_User_function, MPI_Op_create, and MPI_Op_free, which is error-prone. For user-defined reductions, it is feasible to use user-defined datatypes without an obvious loss of efficiency. Furthermore, there are other issues (mpi-forum/mpi-forum-historic#339) with user-defined reductions that should be addressed if this change is made.

Alternative solution

Another solution would be to add large-count support to derived datatypes, e.g. MPI_Type_contiguous_x, but this is not user-friendly. We should not ask users to start using derived datatypes to broadcast a contiguous array of 2.2 billion elements, for example.
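To make the point concrete, here is roughly what users must write today just to broadcast such an array. This is a hedged sketch in the spirit of BigMPI's MPI_Type_contiguous_x (a fragment, not a complete program, and the chunking scheme shown is ours):

```c
/* Sketch (assumes an initialized MPI program): split a large count into
   INT_MAX-sized chunks plus a remainder, then glue them into one derived
   type so MPI_Bcast can be called with count = 1. */
MPI_Count count = 2200000000;        /* ~2.2 billion elements > INT_MAX */
MPI_Count nchunks = count / INT_MAX;
MPI_Count remainder = count % INT_MAX;

MPI_Datatype chunks, tail, bigtype;
MPI_Type_vector((int)nchunks, INT_MAX, INT_MAX, MPI_FLOAT, &chunks);
MPI_Type_contiguous((int)remainder, MPI_FLOAT, &tail);

int blocklens[2] = {1, 1};
MPI_Aint displs[2] = {0, nchunks * (MPI_Aint)INT_MAX * (MPI_Aint)sizeof(float)};
MPI_Datatype types[2] = {chunks, tail};
MPI_Type_create_struct(2, blocklens, displs, types, &bigtype);
MPI_Type_commit(&bigtype);

MPI_Bcast(buffer, 1, bigtype, 0, MPI_COMM_WORLD);
MPI_Type_free(&bigtype); MPI_Type_free(&chunks); MPI_Type_free(&tail);
```

Roughly a dozen lines of datatype bookkeeping to replace what should be a one-line call; with an _x function, the user would simply pass the MPI_Count directly.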

Changes to the Text

Impact on implementations

BigMPI implements large-count variants of most of the proposed functions, sometimes in more than one way. For example, large-count blocking collectives were implemented using point-to-point, neighbor_alltoallw, and one-sided. Nonblocking collectives are a problem, which is one of the big motivations for this ticket.

The implementation inside of MPI libraries is straightforward, assuming they convert message sizes to bytes internally and support, e.g., one billion 4-byte types correctly.

The BigMPI project thoroughly evaluated the Forum's contention that datatypes were sufficient to address the large-count issue and found that this solution is unlikely to satisfy the majority of users, due to a number of performance and usability issues.

This comment has been minimized.

We are going to read this in Barcelona: just this base ticket, not all of its relatives that were spawned on June 14 (#97, #98, #99, #100). We will bring those forward later. Tickets #98, #99, and #100 are all important and no more controversial than this ticket (#80), while #97 remains highly controversial. Also, all the WITH_INFO tickets await resolution of ticket #80 and the other Big MPI tickets before proceeding.

This comment has been minimized.

Noting that the topology chapter is covered neither by this version of the ticket nor by the proposed reading material. A separate ticket will be made for that so this one can proceed. If there is an objection at the reading of this ticket that it does not address the topology chapter, we will point to the second ticket.

This comment has been minimized.

Rolf notes that MPI_Alltoallw is inconsistent in its definition: it has byte displacements, yet they are defined as int, not MPI_Aint. Therefore, the new API must account for this inconsistency and should handle it via MPI_Aint for displacements; that is in fact what is currently proposed in the pull request as written.

For the byte displacements, they can always be used as relative displacements from the beginning of a buffer, or as absolute displacements (relative to MPI_BOTTOM). Thus, they must always be MPI_Aint. Additionally, the difference of two addresses should always be calculated with MPI_Aint_diff(), not with the arithmetic minus (-) operator; likewise, MPI_Aint_add() should be used for the sum of an absolute address and a relative displacement.
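As a fragment (assuming an initialized MPI program with a buffer array; MPI_Aint_add and MPI_Aint_diff exist since MPI-3.1), the rule looks like this:

```c
/* Correct MPI-3.1 address arithmetic: never apply + or - directly to
   absolute addresses held in MPI_Aint variables. */
float buffer[2048];
MPI_Aint base, elem, disp, addr;
MPI_Get_address(&buffer[0], &base);     /* absolute address of buffer start */
MPI_Get_address(&buffer[1000], &elem);  /* absolute address of one element */
disp = MPI_Aint_diff(elem, base);       /* relative byte displacement */
addr = MPI_Aint_add(base, disp);        /* reconstruct the absolute address */
```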

Therefore:

MPI_Count is a fine replacement for int everywhere that a count appears [no controversy]

Where there are index displacements, we should replace int with MPI_Count, because this reflects the difference of such indices (not MPI_Aint)

Where there are byte displacements, we should keep MPI_Aint where it is already specified, and repair any APIs that previously got this wrong in the _x version. For example, we noted that MPI_Alltoallw is such a case.

It is necessary that the size of the integer representing MPI_Count >= the size of the integer representing MPI_Aint; this rule is already in the standard (see p. 17 of the MPI-3.1 standard, Section 2.5.8 Counts, lines 15-19). In MPI-3.0, we already have MPI_GET_EXTENT_X, which uses MPI_Count, so MPI_Count is not new.

What we are recommending is to change the text of this proposal as follows: We will not put MPI_Aint on all displacements. We will put MPI_Aint on displacements involving bytes; we will put MPI_Count on displacements that are of index type.
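Applied to two representative operations, the rule above would yield bindings like the following. These signatures are illustrative guesses, not text from the proposal:

```c
/* Sketches, not standard MPI. Gatherv displacements are indices (multiples
   of the extent), so they become MPI_Count; Alltoallw displacements are
   bytes, so they become MPI_Aint (also repairing the int byte displacements
   in the original MPI_Alltoallw). */
int MPI_Gatherv_x(const void *sendbuf, MPI_Count sendcount, MPI_Datatype sendtype,
                  void *recvbuf, const MPI_Count recvcounts[],
                  const MPI_Count displs[], MPI_Datatype recvtype,
                  int root, MPI_Comm comm);

int MPI_Alltoallw_x(const void *sendbuf, const MPI_Count sendcounts[],
                    const MPI_Aint sdispls[], const MPI_Datatype sendtypes[],
                    void *recvbuf, const MPI_Count recvcounts[],
                    const MPI_Aint rdispls[], const MPI_Datatype recvtypes[],
                    MPI_Comm comm);
```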

tonyskjellum changed the title from "Big MPI---large-count and displacement support" to "Big MPI---large-count and displacement support--collective chapter" on Sep 19, 2018.

This comment has been minimized.

Rolf's note above, that MPI_Alltoallw's byte displacements are inconsistently defined as int rather than MPI_Aint and that the new API should handle this via MPI_Aint, is also a significant point in the BigMPI paper, which people should read to understand the proposals Tony is reading.

This comment has been minimized.

And I’ll note that choice will require the hardware to double the sustained memory bandwidth for codes that have significant integer data, just to maintain the same performance - and that includes a lot of HPC applications. If you don’t care about performance, ILP64 has a lot going for it...
Bill
William Gropp
Director and Chief Scientist, NCSA
Thomas M. Siebel Chair in Computer Science
University of Illinois Urbana-Champaign

This comment has been minimized.

Latest text (all mechanical changes and corrections compared to the September 5 document), plus clarifications of the use of MPI_Count vs. MPI_Aint as discussed above. This version properly complies with the MPI standard's use of MPI_Count and MPI_Aint in the _X functions.

This comment has been minimized.

The key outcome of the reading is the plan for a holistic look across the entire API; a voting strategy followed by a final vote on the entire API addition was discussed and accepted as Forum-compliant (by acclamation / without objection).

There were no specific objections to the API as presented currently in this ticket, ticket #105.

It was pointed out that we still have more tickets to write and implement besides those already open for "Big MPI". We have to look at the entire standard end-to-end.

The current goal is to read "all" Big MPI tickets in December.

This comment has been minimized.

Note also the creation of issue #107 and, in particular, the consequential question of whether we should actually replace MPI_COUNT with size_t in all C bindings and replace MPI_AINT with ptrdiff_t in all C bindings (with similar appropriate changes towards using language-specified types for the Fortran bindings). See #107 (comment).

Assertion: using the naturally-sized types specified in the C language would achieve the goal of all the Big MPI issues for the C bindings. The short-term consequence (huge one-off churn affecting most APIs) is identical.

Question: are there similar appropriate types specified in the Fortran language?

Observation: the datatype naming rule proposed in issue #74 (if accepted) will permit the addition of MPI datatypes for size_t and ptrdiff_t (plus Fortran equivalents, if any) without further changes to the MPI Standard.

Corollary: issues #107 and #109 become moot.
Corollary: MPI_AINT_ADD and MPI_AINT_DIFF become superfluous.

@jdinan had a good reason to keep the MPI-namespaced types but I have completely forgotten it. @jdinan: please could you comment?

Do we want MPI to continue to move in the direction of a DSL for communication or return to its roots of a library for communication?

Note, IMHO, the concept of this/these proposal(s) is essential (cope with big machines); only the presentation style in the API is being debated. If we cannot find a technical reason to choose between language-specified and MPI-defined types, then we need the Architecture Review Board to reconvene and expurgate via a fiat.


This comment has been minimized.

@dholmes-epcc-ed-ac-uk Please remember that if we widen int to size_t or the like, we will break every single use of count arrays, as are used in the vector collectives.

As far as I can tell, this didn't happen with POSIX when those APIs switched from int to size_t because, while this changed the ABI, POSIX doesn't have any APIs that take vectors of counts. Rather, in e.g. writev, the length is inside the iovec struct, so any code that uses this function must allocate the array using an expression involving sizeof(struct iovec), which promotes safely when compiled on a 64-bit system.

This comment has been minimized.

What I mean is that switching from int to size_t, signed to unsigned, is more troublesome than switching from int to ptrdiff_t (signed to signed). The latter has the advantage of Fortran compatibility (assuming that Fortran has no unsigned integer types).

This comment has been minimized.

Having the vector arguments be typed with MPI_COUNT or MPI_AINT does not help with ABI portability with respect to using size_t or ptrdiff_t instead. Both sets of types are of a fixed length on a particular machine but could be different between machines. If I write code that assumes the size of any of these it will break when that size changes.

For the avoidance of doubt, I say above that the consequences to the API of using size_t are identical to using MPI_COUNT because the proposal is to churn the API in exactly the same manner. Specifically, if it is decided that we will have two symbols, the existing function signature and one with "_X" appended, then the "_X" variant will have the new type(s), whichever set of types that ends up being. Users can continue to compile against the existing symbols with their existing code and variable declarations. If and only if they wish to switch do they have to verify that they are using suitably sized variables and arrays.

If the MPI Forum decides to fork MPI (seriously discussed as an option at the Sept 2018 meeting, straw poll 16,2,0 in favour), then MPI-4.0 may change the types in the existing API function definitions without changing their symbol names, which breaks backward compatibility. This option imposes a burden on the MPI Forum and on MPI library writers to continue support for a line of MPI-3.x releases that contain existing MPI-3.1 interfaces plus minor fixes and updates cherry-picked from the MPI-4 fork.

This comment has been minimized.

@dholmes-epcc-ed-ac-uk Sorry, I misread your comment and thought you were suggesting replacing int with a wider type, as opposed to replacing MPI_Count with an ISO/POSIX-standard one.

If we are going to fork the standard, I suggest that we use MPI_Count and MPI_Aint everywhere, but prescribe how these are typedef-d. That way, we can preserve a universal API definition while supporting both ABIs. This is not unlike what I've proposed in #13 for MPI_Socket.

This comment has been minimized.

@jeffhammond I like that. So, we are suggesting that part of the C binding as defined in MPI-4k (pronounced MPI-fork) should be:

typedef size_t MPI_COUNT;
typedef ptrdiff_t MPI_AINT;

That allows humans and compilers alike to see the equivalence and use whichever they are more comfortable with.

The Fortran binding can do whatever seems appropriate for that language (probably these will remain "opaque" types).

Issue #107 becomes moot. Issue #109 does not; in fact, it should be expanded to include F2C and C2F conversion functions or a promise of automatic representation conversion during heterogeneous MPI communication.

This comment has been minimized.

@dholmes-epcc-ed-ac-uk We need to stop talking about forks. Python's forking was/is a disaster for users and maintainers of dependent projects. MPI-4 needs to be one standard with two well-defined ABIs.

This comment has been minimized.

Agree. The word "fork" has connotations of splitting and becoming two entirely different things. Even though I'm not there at the meeting, I get the sense that that's not what the Forum is talking about here.

This comment has been minimized.

Dan et al., correct me if I am wrong, but I interpreted the 16-2 straw poll to allow breaking backward compatibility as implying this kind of thinking, not a wholesale change:

* New proposals, if backward incompatible, do not arrive DOA (or get struck down immediately) simply for the lack of backward compatibility.

Tony

This comment has been minimized.

To clarify for those not present at the meeting, the discussion prior to the 16-2 in favour straw-poll covered a number of possible API changes related to how we should express the Big MPI adjustments (and others). There was a general (and strong) feeling that creating "_X" versions in MPI-4 only to be faced later with the necessity of creating "_Y" versions in future for some other API change was a really bad idea.

The straw-poll itself immediately followed a suggestion that MPI-4 should define two APIs, possibly to be expressed via two header files in C (and, I guess, two modules in Fortran), for example, "mpi3.h" and "mpi4.h".

The straw poll question was carefully worded to extract maximum support, something like "given the dislike for the _X mess, could you countenance supporting a proposal that breaks backwards compatibility, for example, in this way?" with the other option being "I will never support anything that is not backwards compatible under any circumstances".

Despite heavily biasing the question, I was not expecting the strength of support for such a radical idea.

Perhaps, "fork" is the wrong word. However, Python was mentioned as a cautionary tale during the discussion and before the straw-poll.

Others present can correct me, if I am mis-remembering or over-editorialising.