I'm investigating some very large performance variation and have reduced
the issue to a very simple MPI_Allreduce benchmark. The variability
does not occur for serial jobs, but it does occur within single nodes.
I'm not at all convinced that this is an Open MPI-specific issue (in
fact, the same variance is observed with MVAPICH2, which is an available,
but not "recommended", implementation on that cluster), but perhaps
someone here can suggest steps to track down the issue.

The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), connected
with QDR InfiniBand. The benchmark loops over

with nlocal=10000 (80 KiB messages) 10000 times, so it normally runs in
a few seconds. Open MPI 1.4.1 was compiled with gcc-4.3.3, and this
code was built with mpicc -O2. All submissions were 8-process; timing
and host results are presented below in chronological order. The jobs
were run with 2-minute time limits (to get through the queue easily);
jobs are marked "killed" if they exceeded this limit. Jobs were
usually submitted in batches of 4. The scheduler is LSF-7.0.

The HOST field indicates the node that was actually used: a6* nodes are
of the type described above; a2* nodes are much older (2-socket Opteron
2220, dual core, 2.8 GHz) and use a Quadrics network, and the timings are
very reliable on these older nodes. When the issue first came up, I was
inclined to blame memory bandwidth contention with other jobs, but the
variance is still visible when our job occupies exactly a full node, is
present regardless of affinity settings, and events that don't require
communication are well-balanced in both small and large runs.

I then suspected possible contention between transport layers; ompi_info
gives

so the timings below cover many combinations of restricting these
components. Unfortunately, the variance is large for all combinations,
but I find it notable that -mca btl self,openib is reliably much slower
than self,tcp.

Note that some nodes appear in multiple runs, yet there is no strict
relationship in which some nodes are consistently "fast": for instance,
a6200 is very slow (6x and more) in the first set, then normal in the
subsequent test.
Nevertheless, when the same node appears in temporally nearby tests,
there seems to be a correlation (though there is certainly not enough
data here to establish that with confidence).

As a final observation, I think the performance in all cases is
unreasonably low, since the same test on a 2-socket Opteron 2356 (quad
core, 2.3 GHz) unrelated to the cluster always takes between 9.75 and
10.0 seconds, i.e. 30% faster than the fastest observations on the
cluster nodes, despite their faster cores and memory.