I started today reading e-mail quickly and out of order. So, I'm going
back to an earlier message now, but still with the new Subject heading,
which better reflects where you are in your progress. I'm extracting
some questions from this thread, from bottom/old to top/new:

1) What tools to use? As others have commented, the issue may have
less to do with the kernel or inefficient data movement and more to do
with what's going on in your application -- that is, some processes are
reaching synchronization points earlier than others. For
application-level tools, there is an OMPI FAQ category at
http://www.open-mpi.org/faq/?category=perftools and specifically a list
of some tools you could "download for free". Anyhow, sounds like
you're getting traction with Sun Studio -- and personally I think
that's a good choice. :^)

2) Spread processes on multiple nodes or colocate on one node? It
depends, but I agree with Ralph that it's not surprising if running on
a single node is faster, presumably due to faster communication
performance. (But, I'd temper his statement that it would *always* be
so.)

3) I agree with David and Ralph that you're probably spending more
time waiting on messages than actually moving data. If you use Sun
Studio with Sun ClusterTools (now Oracle Message Passing Toolkit),
it'll also break apart "MPI wait" time from "MPI work" time (time spent
moving data) and you could see this rather clearly. But, any of the
MPI tracing tools will be able to show you other indications of this
problem. Most graphically, you might look at an MPI timeline. E.g.,
you might see one or more processes stuck in a long MPI_Barrier call,
while other processes linger in computation before the last process
enters the barrier and all processes are released from it. Or, you
might see one process stuck in a long MPI_Recv, released when the
message arrives, as indicated by a message line joining the sending
and receiving processes.
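
To make the "wait" versus "work" split in point 3 concrete, here is a
toy Python sketch. It is not profiler output and does not use MPI at
all; the per-rank compute times are made-up numbers, and
barrier_wait_times is a hypothetical helper. It only shows how one
late process turns into barrier wait time on all the others.

```python
# Made-up compute time (seconds) each rank spends before reaching a barrier.
compute_times = {0: 1.0, 1: 1.2, 2: 4.0, 3: 1.1}

def barrier_wait_times(times):
    """Return, per rank, the time spent waiting inside the barrier.

    Every rank is released when the slowest rank arrives, so a rank's
    wait is the slowest arrival time minus its own arrival time.
    """
    release = max(times.values())
    return {rank: release - t for rank, t in times.items()}

waits = barrier_wait_times(compute_times)
laggard = max(compute_times, key=compute_times.get)

for rank in sorted(waits):
    print(f"rank {rank}: compute {compute_times[rank]:.1f}s, "
          f"barrier wait {waits[rank]:.1f}s")
print("laggard rank:", laggard)
```

In a timeline view, the ranks with large wait values are the ones
that appear "stuck" in MPI_Barrier, while the laggard has no wait at
all and is the process worth investigating.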

4) I could imagine the distributed memory version of LS-DYNA running
faster than the shared-memory version since it has better data locality.

5) The I/O portion might be very, very significant. Maybe processes
are waiting on other processes that are not computing but doing a lot
of I/O. You say runs are faster when all processes are on the same
node versus when they're on different nodes. One experiment would be
to compare all processes running on node A versus running all processes
on node B versus distributing processes among both A and B. This would
be a way of discriminating between "all on one node is faster than
distributed" versus "one node is faster than another". Alternatively,
if you look at the timeline and see one process stuck in a long MPI call,
look at the process it's waiting on. Does the call stack of the
"laggard" suggest what that laggard is doing? Computation or I/O? You
might need to know a little about LS-DYNA to tell.
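
The node A / node B / split experiment in point 5 boils down to a
small decision rule, sketched here in Python. The function name, the
10% tolerance, and the example times are all assumptions for
illustration; you would plug in your own measured wall-clock times
and pick a threshold that matches your run-to-run noise.

```python
def interpret(time_a, time_b, time_split, tolerance=0.10):
    """Classify three wall-clock times (seconds) for the same job.

    time_a:     all processes on node A
    time_b:     all processes on node B
    time_split: processes distributed over A and B
    tolerance:  relative difference treated as noise (assumed value)
    """
    def similar(x, y):
        return abs(x - y) <= tolerance * max(x, y)

    if not similar(time_a, time_b):
        slower = "A" if time_a > time_b else "B"
        return f"node {slower} is slower on its own; suspect that node"
    slowest_single = max(time_a, time_b)
    if time_split > slowest_single and not similar(time_split, slowest_single):
        return ("both nodes similar alone, split run slower; "
                "suspect inter-node communication")
    return "no significant difference among the three runs"

# Hypothetical timings, for illustration only.
print(interpret(100.0, 102.0, 180.0))
print(interpret(100.0, 150.0, 160.0))
```

This only separates the two hypotheses. It does not say whether the
inter-node cost comes from the network itself or from I/O on the
mounted drive, which is where the call stack of the laggard helps.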

6) Functions with "_" at the end are probably Fortran names, which
should make sense for LS-DYNA (which I think is basically Fortran).

7) Regarding two sets of functions for "everything." I think this is
no surprise. There are "inclusive" and "exclusive" times. If you look
at "inclusive" times (time spent in a function and all its callees),
the time for a function will be almost exactly equal to the time spent
in a wrapper that calls that function. I think this is what you're
seeing. If you were to look instead at "exclusive" times (time spent
in a function, excluding time spent in its callees), the difference
between a "real" function and a wrapper that calls it would be very
clear. If the numbers are percentages, then it's clear they are
"inclusive" since they add up to well over 100%.
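
Here is a small Python sketch of the inclusive/exclusive distinction
in point 7. The call tree and its timings are invented (I do not know
LS-DYNA's real call structure; solver_ is a hypothetical name), but
the names my_recv_ and mpi_recv mirror the wrapper/callee pair you
described.

```python
# Invented call tree: "self" is exclusive time in seconds,
# "callees" lists the functions called from this one.
profile = {
    "main":     {"self": 1.0, "callees": ["solver_"]},
    "solver_":  {"self": 5.0, "callees": ["my_recv_"]},
    "my_recv_": {"self": 0.1, "callees": ["mpi_recv"]},
    "mpi_recv": {"self": 3.9, "callees": []},
}

def inclusive(name):
    """Time spent in a function plus everything it calls."""
    node = profile[name]
    return node["self"] + sum(inclusive(c) for c in node["callees"])

total = inclusive("main")
inclusive_pct = {f: 100.0 * inclusive(f) / total for f in profile}
exclusive_pct = {f: 100.0 * profile[f]["self"] / total for f in profile}

for f in profile:
    print(f"{f:10s} inclusive {inclusive_pct[f]:6.1f}%  "
          f"exclusive {exclusive_pct[f]:6.1f}%")
print(f"sum of inclusive: {sum(inclusive_pct.values()):.1f}%")
print(f"sum of exclusive: {sum(exclusive_pct.values()):.1f}%")
```

The wrapper and the function it calls have nearly the same inclusive
percentage, the inclusive column sums to well over 100%, and only the
exclusive column shows that almost all of the time is in mpi_recv
itself rather than in the wrapper.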

8) Regarding your earlier problems using Studio: no, you shouldn't
need to build OMPI specially for Studio use. (I'm interested in
hearing more about the problems you encountered and how you resolved
them. You can send to me off line since I suspect most of the mail
list would not be interested.)

I hope at least some of those comments are helpful. Good luck.

Robert Walters wrote:

I think I forgot to mention earlier that the application
I am using is pre-compiled. It is a finite element software called
LS-DYNA. It is not open source and I likely cannot obtain the code it
uses for MPP. This version I am using was specifically compiled, by the
parent company, for OpenMPI 1.4.2 MPP operations.

I recently installed the Sun Studio 12.1 to attempt to analyze the
situation. It seems to work partially. It will record various processes
individually, which is cryptic. The function it fails on, though, is
the MPI Tracing. It errors that "no MPI tracing data file in
experiment, MPI Timeline and MPI Charts will not be available".
Sometime during the analysis (about 10,000 iterations later),
VT_MAX_FLUSHES complains that there are too many I/O flushes and it's
not happy. I've increased this number in the environment variable and
killed the analysis before it had a chance to error but still no MPI
Trace data is recorded. Not sure if you guys have heard of that
happening or know any way to fix it...Did OpenMPI need to be
configured/built for Sun Studio use?

I also noticed that from the data I do get back, there are two sets of
functions for everything. There is mpi_recv and then my_recv_, both
with the same % utilization time. The mpi one comes from your program's
library and the my_recv_ one comes from my program. Is that typical or
should the program I'm using be saying mpi_recv only? This data may be
enough to help me see what's wrong so I will pass it along. Keep in
mind this is percent time of total run time and not percent of MPI
communication. I attached the information in a picture rather than me
attempting to format a nice table in this nasty e-mail application.
I blacked out items that are related to LS-DYNA but afterward I just
realized that I think every function with an _ at the end represents a
command issuing from LS-DYNA.

These are my big spenders. The processes I did not include are in the
bottom 4%. The processes that would be above these were the LS-DYNA
applications at 100%. Like I mentioned earlier, there are two instances
of every MPI command, and they carry the same percent usage. It's
curious that this version, built for OpenMPI, uses different functions.

Just for a little more background info, OpenMPI is being launched from
a local hard drive on each machine, but the LS-DYNA job files, and
related data output files, are on a mounted drive on that machine,
where the mounted drive is located on a different machine also in the
cluster. We were thinking that might be an issue but it isn't writing
enough data for me to think that would significantly decrease MPP
performance.

I would like to make one last mention. That is that OpenMPI running 8
cores on a single node, with all the communication, works flawlessly.
It works much faster than the Shared Memory Parallel (SMP) version of
LS-DYNA that we currently use, scaled to 8 cores. LS-DYNA seems to
be approximately 25% faster (don't quote me on that) when using the
OpenMPI installation than when using the standard SMP, which is
awesome. My point being that OpenMPI seems to be working fine, even
with the screwy mounted drive. This leads me to continue to point at
the network.

Anyhow, let me know if anything seems weird on the OpenMPI
communication subroutines. I don't have any numbers to lean on from
experience.

I'm afraid that having 2 cores on a single
machine will always outperform having 1 core on each machine if any
communication is involved.

The most likely thing that is happening is that OMPI
is polling waiting for messages to arrive. You might look closer at
your code to try and optimize it better so that number-crunching can
get more attention.

On Jul 13, 2010, at 12:22 PM, Robert Walters wrote:

Following up. The sysadmin opened ports for machine to
machine communication and OpenMPI is running successfully with no
errors in connectivity_c, hello_c, or ring_c. Since then, I have started
to implement the MPP software (finite element analysis) that we have, and
upon running a simple, 1 core on machine1, 1 core on machine2, job, I
notice it is considerably slower than a 2 core job on a single machine.

A quick look at top shows me kernel usage is almost twice what cpu
usage is! In a 16-core test job (8 cores per node, so 2 nodes total),
OpenMPI was consuming ~65% of the cpu for kernel related items rather
than number-crunching related items...Granted, we are running on GigE,
but this is a finite element code we are running with no heavy data
transfer within it. I'm looking into benchmarking tools, but my
sysadmin is not very open to installing third-party software. Do you
have any suggestions for what I can use that would be "big name" or
guaranteed safe tools I can use to figure out what's causing the hold
up with all the kernel usage? I'm pretty sure it's network traffic but I
have no way of telling (as far as I know because I'm not a Linux whiz)
with the standard tools in RHEL.