Collective communication operations often involve massive data
movement over the entire network. A poor implementation of these
operations can therefore limit the scalability of an application to
a large number of processors. In this paper we study the collective
multicast operation. The extreme case of collective multicast is
all-to-all multicast (MPI_Allgather), in which each processor
multicasts a message to all the other processors. We present
optimizations and
performance studies for all-to-all multicast. These optimizations
need to be different for small and large messages. For small
messages, the major issue is minimization of software overhead.
This can be achieved by message combining. For large messages it is
network contention that can be reduced by intelligent message
sequencing. Modern NIC's have a communication co-processor that
performs message management through zero-copy remote DMA
operations. We present an asynchronous, non-blocking collective
multicast framework that allows the processor to perform other
computation
while the collective operation is in progress. We also present
performance comparisons of the various algorithms implemented by
our framework on several relevant applications and benchmarks.