NAME

OPAL_CRS - Open PAL MCA Checkpoint/Restart Service (CRS): Overview of
Open PAL's CRS framework, and selected modules. Open MPI 1.4.3.

DESCRIPTION

Open PAL can involuntarily checkpoint and restart sequential programs.
Doing so requires that Open PAL was compiled with thread support and
that the back-end checkpointing systems are available at run-time.
Phases of Checkpoint / Restart
Open PAL defines three phases for checkpoint / restart support in a
process:
Checkpoint
When the checkpoint request arrives, the process is notified of
the request before the checkpoint is taken.
Continue
After a checkpoint has successfully completed, the process that
was checkpointed is notified that it can continue execution.
Restart
After a checkpoint has successfully completed, a new / restarted
process is notified of its successful restart.
The Continue and Restart phases are identical except for the process in
which they are invoked. The Continue phase is invoked in the same
process in which the Checkpoint phase was invoked. The Restart phase is
invoked only in newly restarted processes.

GENERAL PROCESS REQUIREMENTS

For a process to use the Open PAL CRS components, it must adhere to a
few programmatic requirements.
First, the program must call OPAL_INIT early in its execution. This
should only be called once, and it is not possible to checkpoint the
process without it first having called this function.
The program must call OPAL_FINALIZE before termination. This does a
significant amount of cleanup. If it is not called, remnants are very
likely to be left in the filesystem.
You must use the Open PAL tools to checkpoint and restart a process.
Using the back-end checkpointer's own checkpoint and restart tools
leads to undefined behavior. To checkpoint a process, use
opal_checkpoint (opal_checkpoint(1)). To restart a process, use
opal_restart (opal_restart(1)).

AVAILABLE COMPONENTS

Open PAL ships with two checkpointing CRS components, self and blcr,
plus a none component that selects no checkpointer.
The following MCA parameters apply to all components:
crs_base_verbose
Set the verbosity level for all components. Default is 0, or silent
except on error.
crs_base_snapshot_dir
The directory to store the checkpoint snapshots. Default is /tmp.
self CRS Component
The self component invokes user-defined functions to save and restore
checkpoints. It is simply a mechanism for user-defined functions to be
invoked at Open PAL's Checkpoint, Continue, and Restart phases. Hence,
the only data saved during the checkpoint is what the user's checkpoint
function writes. No library state is saved at all.
As such, the model for the self component differs slightly from that of
other components. Specifically, the Restart function is not invoked in
the process image of the process that was checkpointed. The Restart
phase is invoked during OPAL_INIT of the new instance of the
application (i.e., it starts over from main()).
The self component has the following MCA parameters:
crs_self_prefix
Specify a string prefix for the names of the checkpoint, continue,
and restart functions that Open PAL will invoke during the
respective phases. For example, specifying "-mca crs_self_prefix
foo" means that Open PAL expects to find three functions at
run-time:
int foo_checkpoint()
int foo_continue()
int foo_restart()
By default, the prefix is set to "opal_crs_self_user".
crs_self_priority
Set the self component's default priority.
crs_self_verbose
Set the verbosity level. Default is 0, or silent except on error.
crs_self_do_restart
This is mostly used internally. A general user should never need to
set this value. It is set to non-0 when the new process should
invoke the restart callback in OPAL_INIT. Default is 0, or normal
execution.
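The following is a minimal sketch of the three callbacks the self
component would look for with "-mca crs_self_prefix foo". The function
names and the argument-less int signatures follow the listing under
crs_self_prefix. Everything else is an assumption made for
illustration: the app_counter variable, the foo_app_state.txt file
name, and the convention that returning 0 signals success are not taken
from Open PAL. Only the state written by foo_checkpoint() is
recoverable, matching the note that no library state is saved.

```c
#include <stdio.h>

/* Hypothetical application state; not part of Open PAL. */
static long app_counter = 0;

/* Hypothetical file that holds the user-level checkpoint data. */
static const char *app_state_file = "foo_app_state.txt";

/* Checkpoint phase: write out everything the application needs to resume.
 * Assumption: a return value of 0 indicates success. */
int foo_checkpoint(void)
{
    FILE *fp = fopen(app_state_file, "w");
    if (fp == NULL) {
        return -1;
    }
    int ok = fprintf(fp, "%ld\n", app_counter) > 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Continue phase: runs in the process that was checkpointed, whose
 * in-memory state is still intact, so there is nothing to reload. */
int foo_continue(void)
{
    return 0;
}

/* Restart phase: runs in a new process that started over from main(),
 * so the saved state has to be read back in. */
int foo_restart(void)
{
    FILE *fp = fopen(app_state_file, "r");
    if (fp == NULL) {
        return -1;
    }
    int ok = fscanf(fp, "%ld", &app_counter) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

Because the Restart phase runs in a fresh process, foo_restart() cannot
rely on any variable set before the checkpoint; it rebuilds app_counter
purely from the file that foo_checkpoint() wrote.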
blcr CRS Component
The Berkeley Lab Checkpoint/Restart (BLCR) single-process checkpointer
is a software system developed at Lawrence Berkeley National Laboratory.
See the project website for more details:
http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
The blcr component has the following MCA parameters:
crs_blcr_priority
Set the blcr component's default priority.
crs_blcr_verbose
Set the verbosity level. Default is 0, or silent except on error.
none CRS Component
The none component simply selects no CRS component. All of the CRS
function calls return immediately with OPAL_SUCCESS.
This component is the last component to be selected by default. This
means that if another component is available, and the none component
was not explicitly requested, then Open PAL will attempt to activate all
of the available components before falling back to this component.