Resilience from Soft Faults in Large-scale Scientific Simulations

Research areas

Temporary Supervisor

Associate Professor Peter Strazdins

Description

As scientific simulations move towards very large scale, 'soft errors' become a limiting issue. These arise from random bit flips in memory cells, on data paths in the memory hierarchy, and within the CPU. On very large-scale systems, the sheer number of components raises the fault rate to the point where the expected time between faults falls well within the execution time of a simulation. Soft errors are silent (they generally do not cause exceptions), yet the corruption they introduce propagates through the data fields as the simulation evolves. On smaller systems, soft errors can still be problematic when the system is run at minimal power (i.e. to save energy), is built from very cheap but less reliable components, or operates under harsh conditions. Detecting and recovering from such errors in large-scale computations is therefore a pressing problem.

This project will explore general solutions to this problem using new mathematical techniques which are naturally fault-tolerant. A number of scalable parallel applications will be studied, and machine learning techniques could be applied to detect and define the areas in the data fields where errors have been introduced by soft faults. By taking advantage of redundancy, these areas can then be 'repaired'. Further details on the approach can be made available upon request.

This is a joint research project between the Research School of Computer Science and the Mathematical Sciences Institute (MSI) at ANU. Collaborators at the MSI include Professors Markus Hegland and Stephen Roberts.
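To make the notion of a silent soft error concrete, the following is a minimal Python sketch, not the project's method. It flips one bit of a double-precision value, shows that a simple averaging stencil spreads the corruption without raising any exception, and then uses the redundancy in a smooth field to locate and repair the damaged cell. All function names, the toy three-point stencil, the median-based detector and the threshold are assumptions chosen purely for illustration.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit (0 = least significant) of its IEEE 754 double inverted."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def smooth_step(u):
    """One three-point averaging sweep with fixed end values (a toy stencil)."""
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def suspect_cells(u, threshold):
    """Interior indices whose value is far from the median of its three-point window."""
    return [
        i
        for i in range(1, len(u) - 1)
        if abs(u[i] - sorted(u[i - 1:i + 2])[1]) > threshold
    ]


def repair(u, cells):
    """Replace each suspect cell by the average of its two neighbours."""
    fixed = list(u)
    for i in cells:
        fixed[i] = 0.5 * (fixed[i - 1] + fixed[i + 1])
    return fixed


# A smooth (linear) data field, and a copy with one simulated soft fault.
clean = [i / 10 for i in range(11)]
faulty = list(clean)
faulty[5] = flip_bit(faulty[5], 62)  # flip the top exponent bit of 0.5

# The fault is silent: two sweeps run without any exception, but the
# corruption spreads one cell in each direction per sweep.
spread = smooth_step(smooth_step(faulty))
reference = smooth_step(smooth_step(clean))
contaminated = [i for i in range(len(clean)) if abs(spread[i] - reference[i]) > 1.0]

# Detection and repair applied before the fault has spread.
found = suspect_cells(faulty, threshold=1.0)
repaired = repair(faulty, found)
print("contaminated after two sweeps:", contaminated)
print("cells flagged before sweeping:", found)
```

The median is used here rather than the neighbour average so that the cells adjacent to the fault are not flagged as well. A flip in a low-order mantissa bit would change the value only marginally and would pass this check unnoticed, which is one reason the effect of soft faults needs to be quantified per application.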

Goals

to study and quantify the effect of soft faults across a range of applications.

to determine methods of detecting and defining areas in the data which have been corrupted by soft faults, and to determine to what extent these are application-specific.

to develop highly scalable parallel algorithms for the detection of, and recovery from, soft faults.

Requirements

An understanding of parallel computing concepts and programming, and some experience with applied mathematics methods.