Acknowledgements:
European Cooperation in Science and Technology (COST).
This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds
of the EU (Project TIN2013-42148-P and the predoctoral grant of Nuria Losada, ref. BES-2014-068066), and
by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).

Abstract:

Future exascale systems are predicted to comprise millions of cores. This is a great opportunity for HPC
applications; however, it is also a hazard for the completion of their execution. Even if each computation node
fails only once per century, a machine with 100,000 nodes will encounter a failure roughly every 9 hours. Thus,
HPC applications need to use fault tolerance techniques to ensure that they successfully finish their execution.
This PhD thesis focuses on fault tolerance solutions for generic parallel applications, and more specifically on
checkpointing solutions. We have extended CPPC, an MPI application-level portable checkpointing tool developed in
our research group, to work with OpenMP applications and hybrid MPI-OpenMP applications. Currently, we
are working on transparently obtaining resilient MPI applications, that is, applications that are able to recover
from failures without stopping their execution.
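
The failure-interval figure quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only and is not part of the thesis: it assumes that nodes fail independently at a constant rate, so that the mean time between failures (MTBF) of the whole machine is the node MTBF divided by the number of nodes. The function name `system_mtbf_hours` is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def system_mtbf_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """Return the mean time between failures of the whole machine, in hours.

    Assumption: nodes fail independently at a constant rate, so the
    per-node failure rates add up and the system MTBF is the node MTBF
    divided by the node count.
    """
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes


if __name__ == "__main__":
    # One failure per node per century, 100,000 nodes.
    print(f"{system_mtbf_hours(100, 100_000):.2f} hours")  # prints "8.76 hours"
```

Under these assumptions the machine sees a failure about every 8.76 hours, which the abstract rounds to 9 hours.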