The subject heading is a little misleading, because this is in response
to only part of that original message. I tried the first two suggestions
below (disabling eager RDMA and using the tcp btl), but to no avail. In all
cases I am running over 20 12-core nodes through SGE. In the first case,
I get the errors:

--------------------------------------------------------------------------
[compute-6-1.local:22658] 2 more processes have sent help message
help-odls-default.txt / odls-default:could-not-kill
[compute-6-1.local:22658] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
--------------------------------------------------------------------------
***

The first error is at the same place as before
([btl_openib_component.c:3492:handle_wc]) and the message is only
slightly different (LP -> HP).

For the second suggestion, using tcp btl, I got a whole load of these:

There are 1826 "Connection timed out" errors at an earlier spot in the
code than in the case above. I checked iptables and there is no reason
the connection would have been refused. Is it possible I'm out of file
descriptors (because sockets count as files)? `ulimit -n` yields 1024.
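
To sanity-check the descriptor question, here is a minimal Python sketch
(not anything from the Open MPI tools; the function name `fd_headroom` and
the `reserved` allowance are hypothetical choices of mine). It compares the
per-process soft limit that `ulimit -n` reports with a worst-case count of
one socket per peer. It assumes each rank opens at most one TCP socket to
every other rank, which may undercount if several interfaces are in use:

```python
import resource


def fd_headroom(peers, reserved=16):
    """Compare the soft open-file limit with a worst-case socket count.

    peers:    number of other ranks this process might connect to.
    reserved: assumed allowance for stdio, libraries, and other open
              files (a guess, not a measured value).
    Returns (soft_limit, needed, fits).
    """
    # RLIMIT_NOFILE is a per-process limit; the soft value is what
    # `ulimit -n` prints in the shell.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = peers + reserved
    fits = soft == resource.RLIM_INFINITY or needed <= soft
    return soft, needed, fits


if __name__ == "__main__":
    total_ranks = 20 * 12  # 20 nodes with 12 cores each, as in this run
    soft, needed, fits = fd_headroom(total_ranks - 1)
    print(f"soft limit: {soft}, worst-case need: {needed}, fits: {fits}")
```

Under that assumption the worst case is roughly 255 descriptors per
process, and since the limit applies per process rather than per node, the
12 ranks sharing a node do not add up against it. So a limit of 1024 would
probably not be exhausted, unless each rank opens many more sockets than
one per peer.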