On Tue, 22 Mar 2005, Jeff Squyres wrote:
> Instead, I would like to propose an alternative to an MPI ABI.
This is related to one of my pet "thought projects" for several years.
We have largely completed developing an alternate cluster programming
model that is better suited to most applications.
The current MPI design is the wrong model for clusters.
It represents a static view of the cooperating machines.
The MPI "cluster size", MPI_Comm_size(MPI_COMM_WORLD, ...), is
static for all time. There is no way to take advantage of new
machines, or reduce the number of machines that the application
depends on.
MPI has a model of initialize-compute-terminate.
There is no explicit support for checkpointing, executing as a
service, or running "forever".
MPI has no concept of failures.
There is no mechanism to report that a rank has failed, and no way
to recover from that failure. MPI applications may reissue
unfinished work to other ranks and complete the actual computation,
but they cannot terminate cleanly with nodes missing.
MPI's strength is collective, mathematically oriented operations,
not communication. I understand that even the name "Message
Passing.." indicates that stream communication isn't the focus, but
many applications expect and work well with a sockets-based model.
Cross-architecture jobs are theoretically supported, but very
difficult to implement. The capability adds complexity without
benefit.
Communicators besides MPI_COMM_WORLD are rarely used. The capability
adds complexity with little benefit.
Some of the implications of using a dynamic model are:
- We need information and scheduling interfaces
- We need true dynamic sizing, with process creation and termination
primitives
- We need status signaling external to application messages.
There need to be new information interfaces which
- report usable nodes (which nodes are up, ready, and will permit us
to start processes)
- report the capability of those nodes (speed, total memory)
- report the availability of those nodes (current load, available
memory)
Each of these information types is different and may be provided
by a different library and subsystem. We created 'beostat', a status
and statistics library, to provide most of this information.
There needs to be an explicit scheduler or mapper interface.
We use 'beomap', which can utilize an external scheduler or
internally create a list of usable compute nodes.
An application should be able to run as a single process if it
decides that multiple processes are not useful.
An application should be able to use only a subset of provided
processors if they will not be useful (e.g. an application that uses
a regular grid might choose to use only 16 of 23 provided nodes).
The unused nodes should be truly unused: if they crash or are
otherwise unexpectedly removed they shouldn't affect correct,
error-free completion. Ideally processes should never be started on
those nodes.
There need to be new process creation primitives.
We already have a well-tested model for this: Unix process
management. The only additional element is the ability to specify
remote nodes when creating processes. Monitoring and handling
terminating processes does not need extensions. We use BProc, but
the concept of remote_fork() existed long before BProc. A standard
set of calls should not have the same names, but should use exactly
the Unix semantics so that the library only needs to "wrap" the
actual system calls.
There need to be asynchronous signaling methods.
We already have this as well: Unix signals. Their (still modest)
complexity represents many years of experience and demonstrates that
getting the semantics of asynchronous event handling right is
difficult.
> 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on
> pay-per-view (read: no need for time-consuming, potentially fruitless
> attempts to get MPI implementors to agree on anything)
Hmmmm, does this show up on the sports page?
Donald Becker
Scyld Software
Annapolis MD 21403 410-990-9993