Also included is an example of integrating SGE with the Condor
checkpointing library in standalone mode.

One purpose of the checkpointing interface can be to copy the files
from a local (checkpointing) directory on a node to a shared space like
/home/checkpoint (the $SGE_CKPT_DIR [in the examples I even created a
subdirectory with the $JOB_ID therein]). Later on, the files can be
copied to the (possibly different) nodes again (either in a queue
prolog or the job script) when the job restarts.
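
For illustration, a minimal sketch of such a copy step could look like
the script below. The script name, the local directory layout, and the
use of $TMPDIR are assumptions for the example; $SGE_CKPT_DIR and
$JOB_ID are the variables mentioned above.

    #!/bin/sh
    # ckpt_copy.sh -- sketch: copy local checkpoint files to shared space
    SHARED="$SGE_CKPT_DIR/$JOB_ID"   # e.g. /home/checkpoint/<jobid>
    LOCAL="$TMPDIR/ckpt"             # per-node checkpoint dir (assumed)
    mkdir -p "$SHARED"
    cp -p "$LOCAL"/* "$SHARED"/

The reverse copy (shared space back to the node-local directory) would
then run in a queue prolog or at the top of the job script before the
restart.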

-- Reuti

On 14.12.2009, at 18:25, Sergio Díaz wrote:

Hi Reuti,

Yes, I sent a job with SGE and I checkpointed the mpirun process by
hand, logging into the MPI master node. Then I killed the job with
qdel, and after that I did the ompi-restart.
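
For reference, that by-hand sequence was roughly the following (the
PID, job id, and snapshot name are illustrative):

    # on the MPI master node, while the job is running:
    ompi-checkpoint <pid_of_mpirun>
    # remove the job from SGE:
    qdel <sge_job_id>
    # later, restart from the saved global snapshot:
    ompi-restart ompi_global_snapshot_<pid>.ckpt
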
I will try to integrate it with SGE by creating a ckpt environment,
but I think it could be a bit difficult because:
1 - when I take a checkpoint, I can't specify a directory with a
name like checkpoint_jobid
2 - I can't specify the scratch directory, and I have to use
/tmp instead of SGE's scratch directory.
3 - I tried to restart the snapshot, and it only works if I
use the same machinefile. That is, if the job ran on c3-13 and
c3-14, I have to restart the job using a machinefile with those two
nodes.

I got a successful checkpoint with a fresh installation, without using
the trunk. I can't understand why it is working now when before I
couldn't do a successful restart... Maybe there was something wrong in
the Open MPI installation, and the metadata was created in a wrong way.
I will test it more, and I will also test the trunk.

I will try to apply the trunk, but I think that I broke my Open MPI
installation doing "something", and I don't know what :-( . I was
modifying the MCA parameters...
When I submit a job, the orted daemon spawned on the SLAVE host is
launched in a loop until it exhausts all the reserved memory.
It is very strange, so I will compile it again, reproduce the bug, and
then test the trunk.

You were right. The main problem was /tmp. SGE uses a scratch
directory in which jobs keep their temporary files. Setting TMPDIR to
/tmp, checkpointing works!
However, when I try to restart it... I get the following error (see
ERROR1). With the -v option, I get these lines (see ERROR2).
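
For context, the TMPDIR workaround in the SGE job script looks
something like this (the application name is a placeholder; -am
ft-enable-cr is the usual way to enable the C/R support):

    #!/bin/sh
    # point the Open MPI session directory at /tmp, where the
    # C/R command-line tools expect to find it
    export TMPDIR=/tmp
    mpirun -np $NSLOTS -am ft-enable-cr ./my_app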

It is concerning that ompi-restart is segfaulting when it errors out.
The error message is being generated between the launch of the
opal-restart starter command and when we try to exec(cr_restart).
Usually the failure is related to a corruption of the metadata stored
in the checkpoint.

Can you send me the file below:
ompi_global_snapshot_28454.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data

Can you try the trunk to see if the problem goes away? The development
trunk and the v1.5 series have a bunch of improvements to the C/R
functionality that were never brought over to the v1.3/v1.4 series.

I was trying to use ssh instead of rsh, but it was impossible. By
default it should use ssh and, only if it finds a problem, fall back
to rsh. It seems that ssh doesn't work, because it always uses rsh.
If I change this MCA parameter, it still uses rsh.
If I set the OMPI_MCA_plm_rsh_disable_qrsh variable to 1, it tries to
use ssh and doesn't work. I get --> "bash: orted: command not found"
and the MPI process dies.
The command it tries to execute is the following, and I haven't yet
found the reason why it doesn't find orted, because I set /etc/bashrc
so that the right PATH is always set, and my application has the right
PATH (see ERROR4).
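
For reference, the two ways to set that parameter look like this (the
application name and installation prefix are placeholders):

    # environment-variable form of the MCA parameter:
    export OMPI_MCA_plm_rsh_disable_qrsh=1
    # or on the mpirun command line:
    mpirun --mca plm_rsh_disable_qrsh 1 -np $NSLOTS ./my_app

    # if orted is not found on the remote node (non-interactive ssh
    # shells often skip the usual profile files), passing the
    # installation prefix may help:
    mpirun --prefix /opt/openmpi --mca plm_rsh_disable_qrsh 1 ./my_app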

This seems like an SGE-specific issue, so it is a bit out of my
domain. Maybe others have suggestions here.

-- Josh

Many thanks!,
Sergio

P.S. Sorry about these long emails. I'm just trying to show you useful
information to identify my problems.

There is no such directory in /tmp on the node. However, if the
application is run without SGE, the directory is created.

This may be the core of the problem. ompi-ps and other command line
tools (e.g., ompi-checkpoint) look for the Open MPI session directory
in /tmp in order to find the connection information to connect to the
mpirun process (internally called the HNP or Head Node Process).

Can you change the location of the temporary directory in SGE? The
temporary directory is usually set via an environment variable (e.g.,
TMPDIR or TMP). So removing the environment variable or setting it to
/tmp might help.
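
A minimal sketch of what to try in the job script; the last form is an
assumption and relies on the orte_tmpdir_base MCA parameter:

    # either fall back to the default /tmp:
    unset TMPDIR
    # or set it explicitly:
    export TMPDIR=/tmp
    # or (assumption) point Open MPI itself at a fixed base directory:
    export OMPI_MCA_orte_tmpdir_base=/tmp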

but if I do ompi-ps -j
MPIRUN_PID, it seems to hang, and I interrupt it. Does it take a long
time?

It should not take a long time. It is just querying the mpirun process
for state information.

What does the -j option of the
ompi-ps command mean? It isn't related to a batch system (like SGE,
Condor...), is it?

The '-j' option allows the user to specify the Open MPI jobid. This is
completely different from the jobid provided by the batch system. In
general, users should not need to specify the -j option. It is useful
when you have multiple Open MPI jobs and want a summary of just one of
them.
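
For illustration (the jobid value is made up):

    # summary of every Open MPI job visible from this node:
    ompi-ps
    # summary of just the Open MPI job with jobid 12345:
    ompi-ps -j 12345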

Thanks for the ticket. I will
follow it.

Talking with Alan, I realized that only a few transport protocols are
supported, and maybe that is the problem. Currently, SGE is using qrsh
to spawn the MPI processes. I can change this protocol and use ssh.
So I'm going to test it this afternoon, and I will report the results
to you.

Try 'ssh' and see if that helps. I suspect the problem is with the
session directory location though.

Regards,
Sergio

Josh Hursey wrote:

On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:

Hello,

I have achieved the checkpointing of an easy program without SGE. Now
I'm trying to do the Open MPI + SGE integration, but I have some
problems...
When I try to checkpoint the mpirun PID, I get an error similar to the
one I get when the PID doesn't exist. See the example below.

I do not have any experience with the SGE environment, so I suspect
that there may be something 'special' about the environment that is
tripping up the ompi-checkpoint tool.

First of all, what version of Open MPI are you using?

Some things to check (sketched as commands below):
- Does 'ompi-ps' work when your application is running?
- Is there a /tmp/openmpi-sessions-* directory on the node where
mpirun is currently running? This directory contains information on how
to connect to the mpirun process from an external tool; if it's missing,
then this could be the cause of the problem.
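
A quick way to run both checks from the node where mpirun lives:

    # does ompi-ps answer while the application is running?
    ompi-ps
    # is the session directory present?
    ls -d /tmp/openmpi-sessions-*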

Any ideas?
Does somebody have a script to do it automatically with SGE? For
example, I have one that takes a checkpoint every X seconds with BLCR
for non-MPI jobs. It is launched by SGE if you have configured the
queue and the ckpt environment.

I do not know of any integration of the Open MPI checkpointing work
with SGE at the moment.
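
Still, a minimal sketch of such a periodic driver for the MPI case
could look like this (the script name and default interval are made
up; ompi-checkpoint is the tool whose help output appears below):

    #!/bin/sh
    # periodic_ckpt.sh <mpirun_pid> [interval_seconds]
    PID=$1
    INTERVAL=${2:-600}
    # checkpoint the given mpirun every INTERVAL seconds while it lives
    while kill -0 "$PID" 2>/dev/null; do
        sleep "$INTERVAL"
        ompi-checkpoint "$PID"
    done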

-am <arg0>                  Aggregate MCA parameter set file list
-gmca|--gmca <arg0> <arg1>
                            Pass global MCA parameters that are applicable to
                            all contexts (arg0 is the parameter name; arg1 is
                            the parameter value)
-h|--help                   This help message
--hnp-jobid <arg0>          This should be the jobid of the HNP whose
                            applications you wish to checkpoint.
--hnp-pid <arg0>            This should be the pid of the mpirun whose
                            applications you wish to checkpoint.
-mca|--mca <arg0> <arg1>
                            Pass context-specific MCA parameters; they are
                            considered global if --gmca is not used and only
                            one context is specified (arg0 is the parameter
                            name; arg1 is the parameter value)
-s|--status                 Display status messages describing the
                            progression of the checkpoint
--term                      Terminate the application after checkpoint
-v|--verbose                Be Verbose
-w|--nowait                 Do not wait for the application to finish
                            checkpointing before returning