DETAILS: Based on all the feedback (thanks to everyone who looked at it), I have whittled down what I hope to accomplish with this RFC. There were suggestions to better modularize the CUDA registration code. Since the registration code is a performance feature, it has been dropped from this RFC and I will investigate it separately. This significantly reduces the changes being proposed here. With this RFC, all the changes are isolated in the datatype and convertor code. As mentioned before, the changes mostly boil down to replacing memcpy with cuMemcpy when moving data to or from a CUDA device buffer.
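
To make the memcpy substitution concrete, here is a rough sketch of the idea. It is not the actual patch. Every name starting with sketch_ or SKETCH_ is hypothetical, and the device copy is a stub that falls back to memcpy so the example builds without a GPU. A real build would call the CUDA driver API's cuMemcpy at that point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: set when the user buffer lives in CUDA device memory. */
#define SKETCH_CONVERTOR_CUDA 0x1u

/* Hypothetical copy-function type with the same shape as memcpy. */
typedef void *(*sketch_copy_fn)(void *dst, const void *src, size_t len);

/* Hypothetical, pared-down convertor: only the fields this sketch needs. */
typedef struct {
    unsigned int   flags;
    sketch_copy_fn copy;
} sketch_convertor_t;

/* Counts calls to the device path so the dispatch can be observed. */
static int sketch_device_copies = 0;

/* Host path: plain memcpy. */
static void *sketch_host_copy(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}

/* Stand-in for the device path. A CUDA build would call cuMemcpy here;
 * this stub uses memcpy so the sketch runs on any machine. */
static void *sketch_device_copy(void *dst, const void *src, size_t len)
{
    sketch_device_copies++;
    return memcpy(dst, src, len);
}

/* Pick the copy routine once, when the convertor is prepared. */
static void sketch_convertor_prepare(sketch_convertor_t *conv,
                                     int is_device_buffer)
{
    assert(conv != NULL);
    conv->flags = is_device_buffer ? SKETCH_CONVERTOR_CUDA : 0u;
    conv->copy  = is_device_buffer ? sketch_device_copy : sketch_host_copy;
}

/* Pack path: callers never name memcpy directly, only conv->copy. */
static void sketch_pack(const sketch_convertor_t *conv,
                        void *dst, const void *src, size_t len)
{
    assert(conv != NULL && conv->copy != NULL);
    conv->copy(dst, src, len);
}
```

Choosing the routine at prepare time keeps the pack and unpack loops free of per-copy branching, which is one reason the change can stay confined to the datatype and convertor code.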

Per suggestions, the choice to disable the large memory RDMA is now made on a per-message basis. This is done by adding a flag to the convertor that tells the BTLs an intermediate buffer is needed when dealing with device memory.
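
As a rough illustration of the per-message decision, the sketch below shows how a BTL-side check on a convertor flag might look. The flag name, the struct, the protocol enum, and the size threshold are all hypothetical and are not taken from the actual code. The rule that small host messages also use the copy path is an assumption made only to give the example a complete decision function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag: the message must be staged through an
 * intermediate host buffer because the user buffer is device memory. */
#define SKETCH_CONVERTOR_NEEDS_STAGING 0x2u

/* Hypothetical, pared-down convertor carrying only the flags word. */
typedef struct {
    unsigned int flags;
} sketch_msg_convertor_t;

/* Hypothetical protocol choices a BTL could make for one message. */
typedef enum {
    SKETCH_PROTO_RDMA,        /* direct transfer from the user buffer   */
    SKETCH_PROTO_COPY_IN_OUT  /* pack into an intermediate buffer first */
} sketch_protocol_t;

/* Set or clear the staging flag when the buffer type is known. */
static void sketch_convertor_set_buffer(sketch_msg_convertor_t *conv,
                                        int is_device_buffer)
{
    assert(conv != NULL);
    if (is_device_buffer) {
        conv->flags |= SKETCH_CONVERTOR_NEEDS_STAGING;
    } else {
        conv->flags &= ~SKETCH_CONVERTOR_NEEDS_STAGING;
    }
}

/* Per-message decision: the flag overrides the size-based choice,
 * so only messages that touch device memory lose the RDMA path. */
static sketch_protocol_t sketch_choose_protocol(
    const sketch_msg_convertor_t *conv, size_t msg_len, size_t rdma_threshold)
{
    assert(conv != NULL);
    if (conv->flags & SKETCH_CONVERTOR_NEEDS_STAGING) {
        return SKETCH_PROTO_COPY_IN_OUT;
    }
    return (msg_len >= rdma_threshold) ? SKETCH_PROTO_RDMA
                                       : SKETCH_PROTO_COPY_IN_OUT;
}
```

Because the flag travels with the convertor, host-memory messages in the same job are unaffected and keep whatever protocol they would otherwise use.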

As before, this code would be enabled via a configure option. A mostly complete version is viewable on Bitbucket, although I know the configure code is sorely lacking.
