Sorry for my late reply.
And thank you all for your answers and comments.

Oleg,

Same question as Aurélien. You mentioned that you have implemented some
piggyback mechanisms in Open MPI.
Are these mechanisms available?
Would it be possible to use them?

Regards.

Thomas Ropars
Aurélien Bouteiller wrote:
> Oleg,
>
> Is there an implementation of your techniques in Open MPI? Can we put
> our greedy nasty paws on it?
>
> Thanks for the link, Josh.
>
> Aurelien
>
> On 5 Feb 08, at 08:39, Josh Hursey wrote:
>
>
>> Oleg,
>>
>> Interesting work. You mentioned late in your email that you believe
>> that adding support for piggybacking to the MPI standard would be the
>> best solution. As you may know, the MPI Forum has reconvened and there
>> is a working group for Fault Tolerance. This working group is
>> discussing a piggybacking interface proposal for the standard, amongst
>> other things. If you are interested in contributing to this
>> conversation you can find the mailing list here:
>> http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft
>>
>> Best,
>> Josh
>>
>> On Feb 5, 2008, at 4:58 AM, Oleg Morajko wrote:
>>
>>
>>> Hi,
>>>
>>> I've been working on the MPI piggyback technique as part of my PhD
>>> work.
>>>
>>> Although MPI does not provide native support, there are several
>>> different solutions for transmitting piggyback data over every MPI
>>> communication. You may find a brief overview in papers [1, 2]. These
>>> include copying the original message and the extra data to a bigger
>>> buffer, sending an additional message, or changing the sendtype to a
>>> dynamically created wrapper datatype that contains a pointer to the
>>> original data and the piggyback data. I have tried all of these
>>> mechanisms and they work, but considering the overhead, no single
>>> technique outperforms the others in all scenarios. Jeff Squyres had
>>> interesting comments on this subject before (on this mailing list).
>>>
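To make the first of these mechanisms concrete, here is a minimal sketch
of the buffer-copy variant in plain C, with no MPI calls. The names
pb_info, pb_pack and pb_unpack and the two fields of the piggyback record
are made up for illustration; the thread does not say what the piggyback
data contains or how it is laid out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical piggyback record; its fields are an assumption. */
typedef struct {
    int32_t  origin_rank;
    uint64_t event_id;
} pb_info;

/* Sender side: copy the payload and the piggyback record into one larger
 * buffer. Returns a malloc'ed buffer of payload_len + sizeof(pb_info)
 * bytes (the caller frees it), or NULL if the allocation fails. */
unsigned char *pb_pack(const void *payload, size_t payload_len,
                       const pb_info *pb, size_t *out_len)
{
    assert(payload != NULL && pb != NULL && out_len != NULL);
    size_t total = payload_len + sizeof *pb;
    unsigned char *buf = malloc(total);
    if (buf == NULL)
        return NULL;
    memcpy(buf, payload, payload_len);          /* original message   */
    memcpy(buf + payload_len, pb, sizeof *pb);  /* piggyback trailer  */
    *out_len = total;
    return buf;
}

/* Receiver side: split the combined buffer back into payload and
 * piggyback record. Returns 0 on success, -1 if the buffer is too short
 * or the payload does not fit into payload_cap bytes. */
int pb_unpack(const unsigned char *buf, size_t buf_len,
              void *payload, size_t payload_cap, pb_info *pb)
{
    assert(buf != NULL && payload != NULL && pb != NULL);
    if (buf_len < sizeof *pb)
        return -1;
    size_t payload_len = buf_len - sizeof *pb;
    if (payload_len > payload_cap)
        return -1;
    memcpy(payload, buf, payload_len);
    memcpy(pb, buf + payload_len, sizeof *pb);
    return 0;
}
```

The combined buffer would then go out in a single send of out_len bytes
(MPI_BYTE is one option on a homogeneous cluster). The two extra copies
are the price of this variant, which is presumably why it loses out for
large messages.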
>>> Finally, after some benchmarking, I have implemented a hybrid
>>> technique that combines existing mechanisms. For small point-to-point
>>> messages, datatype wrapping seems to be the least intrusive, at least
>>> considering the Open MPI implementation of derived datatypes. For
>>> large point-to-point messages, experiments confirmed that sending an
>>> additional message is much cheaper than wrapping (and besides, the
>>> intrusion is small as we are already sending a large message).
>>> Moreover, the implementation may interleave the original send with an
>>> asynchronous send of the piggyback data. This optimization partially
>>> hides the latency of the additional send and lowers overall intrusion.
>>> The same criteria can be applied to collective operations, except
>>> barrier and reduce operations. As the former does not transmit any
>>> data and the latter transforms the data, the only solution is to send
>>> additional messages.
>>>
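The selection rule described in the paragraph above can be sketched as a
small decision function. This is only my reading of the description, not
Oleg's code: the enum names, pb_choose and especially the 64 KiB cutoff
are assumptions. The thread gives no threshold, so the value would have
to come from benchmarking on the target system.

```c
#include <assert.h>
#include <stddef.h>

/* The two mechanisms the hybrid scheme switches between. */
typedef enum { PB_WRAP_DATATYPE, PB_EXTRA_MESSAGE } pb_mechanism;

/* Coarse classification of the intercepted MPI call (hypothetical). */
typedef enum {
    OP_POINT_TO_POINT,
    OP_COLLECTIVE_DATA,  /* collectives that move data unchanged */
    OP_BARRIER,
    OP_REDUCE
} pb_op_kind;

/* Hypothetical cutoff between "small" and "large" messages; to be tuned. */
#define PB_LARGE_MSG_BYTES ((size_t)64 * 1024)

pb_mechanism pb_choose(pb_op_kind kind, size_t msg_bytes)
{
    assert(kind >= OP_POINT_TO_POINT && kind <= OP_REDUCE);

    /* A barrier carries no data and a reduce transforms it, so the
     * piggyback data can only travel in an additional message. */
    if (kind == OP_BARRIER || kind == OP_REDUCE)
        return PB_EXTRA_MESSAGE;

    /* Otherwise: wrap small messages, send a separate message for
     * large ones. */
    return msg_bytes >= PB_LARGE_MSG_BYTES ? PB_EXTRA_MESSAGE
                                           : PB_WRAP_DATATYPE;
}
```

A PMPI wrapper would call something like pb_choose() once per
intercepted operation and then take either the wrapping path or the
additional-message path.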
>>> There is a penalty, of course. Especially for collective operations
>>> with very small messages, the intrusion may reach 15%, and that's a
>>> lot. It then decreases to 0.1% for bigger messages, but it's still
>>> there. I don't know what your requirements/expectations are for that
>>> issue. The only work that reported lower overheads is [3], but they
>>> added native piggyback support by changing the underlying MPI
>>> implementation.
>>>
>>> I think the best possible option is to add piggyback support to MPI
>>> as part of the standard. A growing number of runtime tools use this
>>> functionality for multiple reasons, and certainly PMPI itself is not
>>> enough.
>>>
>>> References of interest:
>>>
>>> [1] Shende, S., Malony, A., Morris, A., Wolf, F. "Performance
>>> Profiling Overhead Compensation for MPI Programs". 12th EuroPVM-MPI
>>> Conference, LNCS, vol. 3666, pp. 359-367, 2005. They review various
>>> techniques and come up with datatype wrapping.
>>>
>>> [2] Schulz, M. "Extracting Critical Path Graphs from MPI
>>> Applications". Cluster Computing 2005, IEEE International, pp. 1-10,
>>> September 2005. They use datatype wrapping.
>>>
>>> [3] Vetter, J. "Dynamic Statistical Profiling of Communication
>>> Activity in Distributed Applications". They add support for
>>> piggyback at the MPI implementation level and report very low
>>> overheads (no surprise).
>>>
>>> Regards,
>>> Oleg Morajko
>>>
>>>
>>> On Feb 1, 2008 5:08 PM, Aurélien Bouteiller <bouteill_at_[hidden]>
>>> wrote:
>>>
>>>
>>>> I don't know of any work in that direction for now. Indeed, we plan
>>>> to eventually integrate at least causal message logging in the pml-v,
>>>> which also includes piggybacking. Therefore we are open to
>>>> collaboration with you on this matter. Please let us know :)
>>>>
>>>> Aurelien
>>>>
>>>>
>>>>
>>>> On 1 Feb 08, at 09:51, Thomas Ropars wrote:
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm currently working on optimistic message logging and I would
>>>>> like to implement an optimistic message logging protocol in Open
>>>>> MPI. Optimistic message logging protocols piggyback information
>>>>> about dependencies between processes on the application messages to
>>>>> be able to find a consistent global state after a failure. That's
>>>>> why I'm interested in the problem of piggybacking information on
>>>>> MPI messages.
>>>>>
>>>>> Is there any work on this problem at the moment?
>>>>> Has anyone already implemented mechanisms in Open MPI to piggyback
>>>>> data on MPI messages?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Thomas
>>>>>
>>>>> Oleg Morajko wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm developing a causality chain tracking library and need a
>>>>>> mechanism to attach extra data to every MPI message, a so-called
>>>>>> piggyback mechanism.
>>>>>>
>>>>>> As far as I know, there are a few solutions to this problem, of
>>>>>> which the two fundamental ones are the following:
>>>>>>
>>>>>> * Dynamic datatype wrapping - if a user calls MPI_Send with, say,
>>>>>>   1024 doubles, the wrapped send implementation dynamically
>>>>>>   creates a derived datatype that is a structure composed of a
>>>>>>   pointer to the 1024 doubles and the extra fields to be
>>>>>>   piggybacked. The datatype is constructed with absolute addresses
>>>>>>   to avoid copying the original buffer. The receiver side creates
>>>>>>   the equivalent datatype to receive the original data and the
>>>>>>   extra data. The performance of this solution depends on how well
>>>>>>   derived datatypes are handled, but it seems to be lightweight.
>>>>>>
>>>>>> * Sending extra data in a separate message - it seems this can
>>>>>>   have much more significant overhead.
>>>>>>
>>>>>> Do you know any other portable solution?
>>>>>>
>>>>>> I have implemented the first solution for P2P operations and it
>>>>>> works pretty well. However, there are problems with collective
>>>>>> operations. There are 2 classes of collective calls that are
>>>>>> problematic:
>>>>>>
>>>>>> 1. Single-receiver calls, like MPI_Gather. The sender tasks in a
>>>>>>    gather can be handled in the same way as a normal send: a data
>>>>>>    item is wrapped and the extra data is piggybacked with the
>>>>>>    message. The problem is at the receiver side, where the root
>>>>>>    gathers N data items that must be received in an array big
>>>>>>    enough to hold all items, strided by the datatype extent.
>>>>>>
>>>>>>    In particular, it seems impossible to construct a datatype that
>>>>>>    contains a data item and the extra data (i.e. a structure type
>>>>>>    with absolute addresses) AND make an array of these datatypes
>>>>>>    separated by a fixed extent. For example: the data item to
>>>>>>    receive from every process is a vector of 1024 doubles. The
>>>>>>    extra data is a single integer. The user provides a receive
>>>>>>    buffer with room for N * 1024 doubles. The library allocates an
>>>>>>    array of N integers to receive the piggybacked data. How can a
>>>>>>    datatype be constructed that can be used to receive this data
>>>>>>    in MPI_Gather?
>>>>>>
>>>>>> 2. MPI_Reduce calls. There is no problem with datatypes, as the
>>>>>>    receiver gets a single data item and not an array as in the
>>>>>>    previous case. The problem is the reduction operator itself
>>>>>>    (MPI_Op), because these operators do not work with wrapped
>>>>>>    datatypes. So I can create a new operator that recognizes the
>>>>>>    wrapped datatype, extracts the original data (skipping the
>>>>>>    extra data) and performs the original reduction. The question
>>>>>>    is how to invoke the original reduction on an existing
>>>>>>    datatype. I have found that Open MPI internally calls
>>>>>>    ompi_op_reduce(op, inbuf, rbuf, count, dtype), which solves the
>>>>>>    problem. However, this makes the code MPI-implementation
>>>>>>    dependent. Any ideas on more portable options?
>>>>>>
>>>>>>
>>>>>> Thank you in advance for any comment.
>>>>>>
>>>>>> --Oleg
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> --
>>>> Dr. Aurélien Bouteiller
>>>> Sr. Research Associate - Innovative Computing Laboratory
>>>> Suite 350, 1122 Volunteer Boulevard
>>>> Knoxville, TN 37996
>>>> 865 974 6321
>>>>