ABSTRACT: In today's high performance computing practice, fail-stop
failures are often tolerated by checkpointing. While checkpointing
is a very general technique that applies to many types of systems
and a wide range of applications, it often introduces considerable
overhead, especially when computations reach petascale and beyond.
In this poster,
we design algorithm-based recovery techniques for selected
linear algebra operations to tolerate fail-stop failures
without checkpointing. Because no periodic checkpoint
is necessary during the computation
and no rollback is necessary during recovery,
the proposed algorithm-based recovery scheme is highly scalable
and has good potential to scale to extreme-scale computing
and beyond. Experimental results demonstrate that the proposed
fault tolerance technique introduces much less overhead than checkpointing
on Kraken, currently the world's fourth-fastest supercomputer.
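
To illustrate the general idea of recovering from a fail-stop failure without checkpoints or rollback, the following is a minimal sketch of checksum-based recovery for matrix multiplication. It is an assumption made for illustration, not necessarily the encoding used in this poster: the function names, the example matrices, and the choice of a single checksum row are all hypothetical. Each row stands in for the data held by one process.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_checksum_row(a):
    """Append one row holding the column-wise sum of all rows of a."""
    return a + [[sum(col) for col in zip(*a)]]


def recover_row(c_encoded, lost):
    """Rebuild row `lost` from the checksum row and the surviving rows."""
    checksum = c_encoded[-1]
    survivors = [row for i, row in enumerate(c_encoded[:-1]) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]


# Hypothetical example data; each row stands in for one process's block.
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]

# Multiplying the encoded A by B yields C with its own checksum row,
# so the redundancy is maintained by the computation itself.
C_encoded = matmul(add_checksum_row(A), B)
expected = matmul(A, B)

lost = 1                      # simulate a fail-stop failure of one process
C_encoded[lost] = None        # that process's data is gone
C_encoded[lost] = recover_row(C_encoded, lost)
recovered = C_encoded[:-1]    # drop the checksum row to get C back
```

The lost row is rebuilt from surviving data alone, with no saved state and no recomputation of earlier steps. The sketch uses integers so the recovery is exact; with floating-point data the recovered values would carry rounding error.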