Introduction

ORCHECK stands for "ORCHEstrated CHECKpointing". It is motivated by the observation that I/O contention
is a dominant factor limiting the performance of parallel checkpointing.
The main idea of ORCHECK is to orchestrate concurrent checkpoints in an optimized and controllable way to minimize
I/O contention.
The targeted platform for ORCHECK is
large-scale parallel computing systems with multi-core architectures and a parallel file system such as PVFS2.

From the perspective of the parallel file system (PFS), ORCHECK uses vertical checkpointing to rearrange the
data layout of the checkpoint files, reducing the number of files serviced by each I/O server and, with it, the I/O contention.

From the perspective of the checkpointing middleware, ORCHECK leverages a staged checkpointing marshaling technique to serialize the
concurrent checkpoints on each compute node, further improving checkpointing performance.

A prototype of ORCHECK is implemented at the system-level under Open MPI over the PVFS2 file system.

Features

* Compatible with Open MPI version 1.4 and PVFS2 version 2.8.2.

* Easy installation and configuration via the patch file.

* Component independence: vertical checkpointing and staged checkpointing marshaling can be set up individually.

Patched Packages

* ORCHECK is developed as add-ons to Open MPI and PVFS2 and follows the licenses that apply to them: the Open MPI code base is licensed under the new BSD license, and PVFS2 is released under the GPL/LGPL.

Acknowledgments

This research was supported in part by the National Science Foundation under grants CCF-0621435, CCF-0937877, CNS-0834514, CNS-0751200, and CCF-0702737, and by the DOE SciDAC-2 program (DE-FC02-06ER41442).
The authors would like to thank Joshua Hursey of the Open MPI group at Indiana University and Samuel Lang of the PVFS2 group at Argonne National Laboratory for their valuable assistance with the implementation of checkpointing orchestration.
We also thank Dr. Ioan Raicu of the Illinois Institute of Technology and the MCS division at Argonne National Laboratory for their support in running large-scale simulations on the SiCortex computing system.