This worked. Did this magic line disable the use of per-peer queue pairs? I have seen a previous post by Jeff that explains what this line does generally, but I didn't study the post in detail, so if you could provide a little explanation I would appreciate it.

This problem can be caused by a variety of things, but I suspect our default queue pair (QP) parameters aren't helping the situation :-).

What happens when you add the following to your mpirun command?

-mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12
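
For readers unfamiliar with the format: as far as I understand the openib BTL documentation, the value is a colon-separated list of receive queues, and each queue is a comma-separated record whose first field is the queue type (P for per-peer, S for shared receive queue, X for XRC), followed by the buffer size in bytes and the number of buffers. The small Python sketch below only takes the string above apart to show that it describes three shared queues and no per-peer queue. The function name and the dictionary keys are made up for illustration and are not part of Open MPI.

```python
def parse_receive_queues(spec):
    """Split a btl_openib_receive_queues value into one dict per queue.

    Assumed layout (from the openib BTL docs as I understand them):
    <type>,<buffer size in bytes>,<number of buffers>[,optional fields...]
    """
    kinds = {"P": "per-peer", "S": "shared", "X": "XRC"}
    queues = []
    for entry in spec.split(":"):
        fields = entry.split(",")
        queues.append({
            "kind": kinds[fields[0]],
            "buffer_bytes": int(fields[1]),
            "num_buffers": int(fields[2]),
            # Any further fields are optional tuning values; kept as-is.
            "extra": [int(f) for f in fields[3:]],
        })
    return queues


# The value suggested in this message.
SPEC = "S,4096,128:S,12288,128:S,65536,12"

queues = parse_receive_queues(SPEC)
for q in queues:
    print(q["kind"], q["buffer_bytes"], q["num_buffers"])

# True only if at least one queue in the list is a per-peer ("P") queue.
uses_per_peer = any(q["kind"] == "per-peer" for q in queues)
```

Read this way, the setting replaces the default queue list with shared receive queues only, so no per-peer queue pairs are requested.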

OMPI Developers:

Maybe we should consider disabling the use of per-peer queue pairs by default. Do they buy us anything? For what it is worth, we have stopped using them on all of our large systems here at LANL.

Thanks,

Samuel K. Gutierrez
Los Alamos National Laboratory

On Sep 12, 2011, at 9:23 AM, Blosch, Edwin L wrote:

I am getting the error message below and I don't know what it means or how to fix it. It only happens when I run on a large number of processes, e.g. 960. Things work fine on 480 processes, and I don't think the application has a bug. Any help is appreciated...