> A non-intrusive test you could try is to replace your MPI (mpich) with a
> lower-latency one. Scali or MPI/Gamma are just to name two. These can lower
> your latency down to 15 us or so.
gamma is highly hardware dependent. does scali really provide a latency
improvement independent of hardware?
> If this drastically ups your efficiency you know where your bottleneck is.
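indeed. but another alternative is to find a _SLOWER_ MPI implementation:
if adding latency hurts your runtime noticeably, latency is your bottleneck.
in fact, I wonder if there's a handy place in, say, mpich, to put a simple
usleep() for this purpose. or perhaps just enable tracing, which adds
overhead to every call.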
usleep as a tool for performance characterization!
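
for what it's worth, here is a rough sketch of how the numbers from such an
experiment might be read. it assumes you run the same job several times, each
with a different delay added per message (say, via the usleep() above), and
that runtime grows roughly linearly with that delay. a least-squares line then
gives a slope that approximates how many messages sit on the critical path,
and extrapolating to a negative delay guesses at what a lower-latency MPI
would buy. the names (latency_fit, fit_latency_sensitivity, predict_runtime)
are made up for illustration, and the linear model is an assumption, not
something mpich provides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of fitting runtime(d) = base + slope * d,
 * where d is the extra delay added to every message, in seconds. */
struct latency_fit {
    double base_seconds;       /* estimated runtime with no added delay */
    double msgs_on_crit_path;  /* slope: seconds of runtime per second of
                                  added per-message delay, i.e. roughly the
                                  number of serialized messages */
};

/* Ordinary least-squares fit of runtime against injected delay.
 * Assumes at least two runs with at least two distinct delays. */
struct latency_fit fit_latency_sensitivity(const double *delay_s,
                                           const double *runtime_s,
                                           size_t n)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    double denom;
    struct latency_fit fit;
    size_t i;

    assert(n >= 2);
    for (i = 0; i < n; i++) {
        sx  += delay_s[i];
        sy  += runtime_s[i];
        sxx += delay_s[i] * delay_s[i];
        sxy += delay_s[i] * runtime_s[i];
    }
    denom = (double)n * sxx - sx * sx;
    assert(denom != 0.0);  /* all delays identical: nothing to fit */

    fit.msgs_on_crit_path = ((double)n * sxy - sx * sy) / denom;
    fit.base_seconds = (sy - fit.msgs_on_crit_path * sx) / (double)n;
    return fit;
}

/* Predicted runtime if per-message latency changes by delta_s seconds
 * (negative means a faster interconnect or MPI). Only as good as the
 * linear assumption above. */
double predict_runtime(const struct latency_fit *fit, double delta_s)
{
    return fit->base_seconds + fit->msgs_on_crit_path * delta_s;
}
```

if the slope comes out near zero, the job barely notices latency and a faster
MPI probably won't help much. a large slope suggests the opposite.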