GASNet pami-conduit documentation
Paul H. Hargrove <PHHargrove@lbl.gov>
User Information:
-----------------
This is an implementation of the GASNet CORE and EXTENDED
API using the IBM PAMI communication protocol.
Where this conduit runs:
-----------------------
pami-conduit implements GASNet over IBM's Parallel Active Messaging Interface
(PAMI) which is available on the IBM Blue Gene/Q, IBM PERCS/POWER 775 systems,
and InfiniBand-connected clusters running IBM's Parallel Environment (PE)
software. There is a good collection of PAMI-related links available from
https://github.com/jeffhammond/pami-examples/blob/master/README
PAMI is the recommended GASNet conduit on the Blue Gene/Q, and POWER 775
(a.k.a. PERCS) systems, and has been developed and tested on both. On
InfiniBand-connected clusters, ibv-conduit is likely to provide superior
performance.
There are no known minimum required versions of PAMI or related software
Optional compile-time settings:
------------------------------
* The following compile-time settings from extended-ref
(see the extended-ref README)
+ GASNETI_THREADINFO_OPT - optimize thread discovery using hidden local variable
+ GASNETI_LAZY_BEGINFUNCTION - postpone thread discovery to first use
+ GASNETE_SCATTER_EOPS_ACROSS_CACHELINES(1/0) - scatter newly allocated eops
across cache lines to reduce false sharing
Recognized environment variables:
---------------------------------
* All the standard GASNet environment variables (see top-level README)
* GASNET_BARRIER - barrier algorithm selection
In addition to the algorithms in the top-level README, there are two
PAMI-specific values supported:
PAMIDISSEM - like AMDISSEM, but implemented using PAMI-level AMs.
PAMIALLREDUCE - barrier matching is implemented in terms of a
PAMI-level ALLREDUCE collective operation.
Currently PAMIDISSEM is the default on all PAMI platforms.
* GASNET_USE_PAMI_COLL - enable use of native-PAMI collectives
Not all collectives are supported for all input conditions, but when
support is available this setting controls if it will be used.
[NOTE: currently only blocking collectives are implemented over PAMI]
Additionally, the following allow finer-grained control over which
collective operations use PAMI_Collective() when GASNET_USE_PAMI_COLL
is enabled:
GASNET_USE_PAMI_BROADCAST - gasnet_coll_broadcast functions
GASNET_USE_PAMI_EXCHANGE - gasnet_coll_exchange functions
GASNET_USE_PAMI_GATHER - gasnet_coll_gather functions
GASNET_USE_PAMI_GATHERALL - gasnet_coll_gather_all functions
GASNET_USE_PAMI_SCATTER - gasnet_coll_scatter functions
Default value for variables in this family is YES
* GASNET_NETWORKDEPTH - depth of AM Request queues (default 1024)
This integer parameter sets the limit on the number of outstanding
Active Message Requests, where outstanding is defined in terms of
local completion of the network send.
Too-small values may reduce performance of AM-intensive applications.
Too-large values may result in excessive buffering requirements in
AM-intensive applications which can both reduce performance and can
result in excessive memory use.
Applications not sending "floods" of AMs will be be insensitve to
the value of this parameter.
* GASNET_AMPOLL_MAX - limit on work done in AMPoll (default 16)
This integer parameter sets the maxumum number of PAMI operations
to be retired by a call to gasnet_AMPoll().
Known problems:
---------------
* See the GASNet Bugzilla server for details on known bugs:
http://gasnet-bugs.lbl.gov/
Future work:
------------
The following are planned work items for pami-conduit:
* Use dynamic registration (firehose) when local addres is out-of-segment?
Initial benchmarks seem to show PAMI getting RDMA speeds for xfers of
sufficient size even when using PAMI_Put/Get, suggesting that some
dynamic registration is already used internally.
However, the gap between Put and Rput bandwidth between 2KB and 64KB as
measured with 1 proc-per-node on PERCS shows that there is currently a
*possibility* that dynamic registration (firehose) could be beneficial.
* Register bounce buffers used for AM headers and payloads and apply
the appropriate "use_rdma" hints.
* Use multiple PAMI contexts/endpoints. At a minimum it would be desirable
to separate the AM and RDMA for independent progress. Use of multiple
endpoints when using pthreads is also worth some implementation effort.
A separate context used for the exit coordination would prevent deadlock
when exiting from an AM handler.
* Explore use of PAMI's "remote_async_progress" hint.
* Explore use of bounce buffers to avoid blocking for local completion
of non-blocking/non-bulk Puts.
* Explore use of conduit-level flow control for AMs, though it is not yet
certain that this is needed as it was with dcmf-conduit.
* Improve exit handling to raise SIGQUIT for non-collective exits.
* For sufficiently small payloads, AMRequestLong could use a bounce buffer
to avoid stalling for local completion.
* Explore use of PAMI_Send_immediate() for small enough Medium and/or Long AMs.
==============================================================================
Design Overview:
----------------
* Core API:
+ GASNet's AMs are implemented in terms of PAMI's AMs, and execute
handlers directly from the PAMI callbacks.
- Short AMs use PAMI_Send_immediate() due to their length.
- Medium and Long AMs use PAMI_Send().
- Medium AMs copy their payloads to bounce buffers to avoid
stalling for local completion.
- Long Request AMs block for local completion.
- LongAsync Request AMs do NOT block for local completion.
- Long Replies AMs copy their payloads to bounce buffers, like
a Medium AM, since our running of AM handlers from PAMI's
callbacks precludes blocking for local completion.
+ The current default barrier is a PAMI-specifc implementation of the
dissemination barrier in terms of PAMI_Send_immediate().
+ GASNet's exit handling is done using an PAMI "all reduce" operation
(w/ a timeout) to determine the MAX() of the exit codes, and whether
the exit is collective. For non-collective exits, the conduit is
currently calling exit(1) and using the fact that IBM's software
will take care of aborting the job. However this does NOT get the
desired behavior of raising SIGQUIT on the non-exiting nodes.
+ PSHM is supported through the default mechanisms.
* Extended API:
+ GASNET_SEGMENT_FAST and GASNET_SEGMENT_LARGE are identical:
- The segment is allocated using mmap() via the default mechanisms.
- The segment is pinned/registered as a single PAMI memory region.
+ All Extended API operations are performed using PAMI_Rput and _Rget
when both addresses fall in the GASNet segment, and PAMI_Put and
_Get otherwise. As a result, GASNET_SEGMENT_EVERYTHING "just works".
+ The blocking operations block for remote completion, of course.
+ The non-blocking NON-BULK Put operations will stall for the required
local completion, but don't need to stall for remote completion.
There is not currently any use of bounce buffers for these Puts.