It sounds like there is a race happening in the shutdown of the
processes. I wonder if the app is shutting down in a way that mpirun
does not quite like.

I have not tested the C/R functionality in the 1.4 series in a long
time. Can you give it a try with the 1.5 series, and see if there is
any variation? You might also try the trunk, but I have not tested it
recently enough to know if things are still working correctly or not
(have others?).

On Fri, Sep 23, 2011 at 3:08 PM, Dave Schulz <dschulz_at_[hidden]> wrote:
> Hi Everyone.
>
> I've been trying to figure out an issue with ompi-checkpoint/blcr. The
> symptoms seem to be related to what filesystem the
> snapc_base_global_snapshot_dir is located on.
>
> I wrote a simple MPI program where rank 0 sends to 1, 1 to 2, etc., then the
> highest rank sends to 0. Then it waits 1 sec and repeats.
>
> I'm using openmpi-1.4.3 and when I run "ompi-checkpoint --term
> <pidofmpirun>" on the shared filesystems, ompi-checkpoint returns a
> checkpoint reference and the worker processes go away, but the mpirun remains
> and is stuck (it dies right away if I run kill on it -- so it's responding
> to SIGTERM). If I attach an strace to the mpirun, I get the following from
> strace forever:
>
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
> {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6,
> 1000) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
> {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6,
> 1000) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
> {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6,
> 1000) = 0 (Timeout)
>
> I'm running with:
> mpirun -machinefile machines -am ft-enable-cr ./mpiloop
> The "machines" file simply has the local hostname listed a few times. I've
> tried 2 and 8. I can try up to 24 as this node is a pretty big one if it's
> deemed useful. Also, there's 256 GB of RAM, and it's a 4-socket, 6-core
> Opteron if that helps.
>
> I initially installed this on a test system with only local hard disks and
> standard NFS on CentOS 5.6, where everything worked as expected. When I
> moved over to the production system things started breaking. The filesystem
> is the major software difference. The shared filesystems are Ibrix and that
> is where the above symptoms started to appear.
>
> I haven't even moved on to multi-node MPI runs as I can't even get this to
> work for any number of processes on the local machine unless I set the
> checkpoint directory to /tmp, which is on a local XFS hard disk. If I put the
> checkpoints on any shared directory, things fail.
>
> I've tried a number of *_verbose MCA parameters and none of them seem to
> issue any messages at the point of checkpoint; only when I give up and send
> kill `pidof mpirun` are there any further messages.
>
> openmpi is compiled with:
> ./configure --prefix=/global/software/openmpi-blcr
> --with-blcr=/global/software/blcr
> --with-blcr-libdir=/global/software/blcr/lib/ --with-ft=cr
> --enable-ft-thread --enable-mpi-threads --with-openib --with-tm
>
> and blcr only has a prefix to put it in /global/software/blcr otherwise it's
> vanilla. Both are compiled with the default gcc.
>
> One final note: occasionally it does succeed and terminate, but it
> seems completely random.
>
> What I'm wondering is has anyone else seen symptoms like this -- especially
> where the mpirun doesn't quit after a checkpoint with --term but the worker
> processes do?
>
> Also, is there some sort of somewhat unusual filesystem semantic that our
> shared filesystem may not support that ompi/ompi-checkpoint is needing?
>
> Thanks for any insights you may have.
>
> -Dave