I have two nodes connected by InfiniBand, and I need to perform an allgather on 128 MB of data in total.

I split the 128 MB of data into eight pieces. In each iteration I perform the computation for one piece and then start an MPI_Iallgatherv() on it, hoping that the MPI_Iallgatherv() from the previous iteration will overlap with the computation of the current iteration. A final MPI_Wait() is called at the end of the last iteration.

However, the total communication time (including the final wait time) is similar to that of the traditional blocking MPI_Allgatherv(), and is even slightly higher.