FeTOL

FeTOL (Towards Fault Tolerant Massively Parallel Computations on Peta-scale Platforms) is a project funded by the German Federal Ministry of Education and Research (BMBF) that investigates fault tolerance for applications on future HPC systems.

Duration: 36 months, starting June 2011

Objective

For massively parallel computations beyond the Teraflop scale, the combined probability of local hardware and network failures is known to reach a level that substantially decreases the productivity of HPC systems, because submitted jobs fail even at moderate runtimes. The same holds for sub-Teraflop applications with extreme runtimes, such as molecular dynamics (MD) applications. Software frameworks are therefore needed that make HPC applications more resilient to partial failures of the underlying hardware resources and so avoid a complete restart of a massively parallel application run.
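
The scaling argument can be made concrete with a simple model. The following is only a sketch under assumptions that are not part of the project description: node failures are independent and exponentially distributed, every node has the same mean time between failures M, and a job on N nodes with runtime T fails as soon as any one node fails. The symbols N, M and T are introduced here for illustration.

```latex
P_{\text{ok}}(T) = \left(e^{-T/M}\right)^{N} = e^{-NT/M},
\qquad
M_{\text{system}} = \frac{M}{N}
```

With purely illustrative numbers (N = 10,000 nodes, M = 5 years, roughly 43,800 hours, and T = 24 hours), the exponent is NT/M ≈ 5.48, so the probability that the job finishes without a failure is about 0.004, and the mean time between failures of the whole system is about 4.4 hours. Under this model the survival probability drops exponentially with both node count and runtime, which is why long-running jobs on fewer nodes are affected as well.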

The failure of a single process in an MPI job leads to an unrecoverable error condition and the abort of the whole job. FeTOL therefore proposes to break large MPI jobs down into a set of smaller MPI jobs, so-called fibers, which are interconnected by BOND, a framework similar in functionality to MPI. If a node crashes, the local MPI fiber crashes too, but the remaining fibers survive the fault. BOND then re-assigns resources to the failed fiber and restarts it from an adequate checkpoint. This is much cheaper and more resource-efficient than losing the whole job.
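
The recovery scheme described above can be illustrated with a small simulation. This is a toy model, not the BOND API: the names `Fiber` and `run_fibers`, the fixed checkpoint interval, and the injected crash events are all assumptions made for illustration, and communication between fibers is ignored entirely. It only shows that a crash rolls back the affected fiber to its last checkpoint while the other fibers keep their progress.

```python
from dataclasses import dataclass


@dataclass
class Fiber:
    """One small MPI job in the toy model (hypothetical, not a BOND type)."""
    fiber_id: int
    step: int = 0        # current progress of this fiber
    checkpoint: int = 0  # last step for which a checkpoint exists
    restarts: int = 0    # how often this fiber was restarted
    executed: int = 0    # steps executed in total, including redone work


def run_fibers(num_fibers, total_steps, checkpoint_every, crashes):
    """Run every fiber to completion.

    `crashes` is a set of (fiber_id, step) pairs. Each pair fires once,
    when that fiber reaches that step, and simulates a node crash.
    """
    pending = set(crashes)
    fibers = [Fiber(i) for i in range(num_fibers)]
    for fiber in fibers:
        while fiber.step < total_steps:
            event = (fiber.fiber_id, fiber.step)
            if event in pending:
                pending.discard(event)
                # Only this fiber is rolled back; the others are untouched.
                fiber.step = fiber.checkpoint
                fiber.restarts += 1
                continue
            fiber.step += 1
            fiber.executed += 1
            if fiber.step % checkpoint_every == 0:
                fiber.checkpoint = fiber.step
    return fibers


if __name__ == "__main__":
    # Four fibers, 100 steps each, a checkpoint every 10 steps,
    # and one simulated node crash in fiber 2 at step 57.
    for f in run_fibers(4, 100, 10, {(2, 57)}):
        print(f.fiber_id, f.executed, f.restarts)
```

In this example only fiber 2 repeats work (the 7 steps since its checkpoint at step 50). A monolithic job without checkpoints would have to repeat the work of all four fibers from the beginning.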

The main contributions of HLRS are:

improving the resilience and robustness of the InfiniBand network layer in MPI so that it can survive transient network errors

implementing a high-level, persistent storage mechanism that allows an application to store essential data for use when restarting after a failure
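
To give an idea of what the second contribution means from an application's point of view, the following is a minimal sketch of a persistent store for restart data. It is not the mechanism developed in FeTOL, whose interface is not described here. The function names `save_checkpoint` and `load_checkpoint`, the JSON file format, and the use of a local directory are all assumptions. The sketch writes to a temporary file first and then renames it, so that a crash during writing does not destroy the previous checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(directory, name, data):
    """Atomically store JSON-serializable `data` as <directory>/<name>.json."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, name + ".json")
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the data to stable storage
        # The rename replaces the old checkpoint in a single step.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def load_checkpoint(directory, name, default=None):
    """Return the stored data, or `default` if no checkpoint exists yet."""
    path = os.path.join(directory, name + ".json")
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

An application would call `save_checkpoint` at regular intervals with the essential state it needs, and call `load_checkpoint` once at startup to decide whether to resume or to start from scratch.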