Fault tolerance will be a fundamental imperative in the next decade
as machines containing hundreds of thousands of cores are
installed at various locations. In this context, the traditional
checkpoint/restart model does not seem to be a suitable option,
since a failure in a single processor forces all the processors to
roll back to their latest checkpoint.
In-memory message logging is an alternative that avoids this global
restoration process and instead replays the messages to the failed
processor. However, there is a large memory overhead associated
with message logging because each message must be logged so it can
be played back if a failure occurs. In this paper, we introduce a
technique that reduces the memory demand of message logging by
grouping processors into teams. Each team acts as a failure
unit: if one team member fails, all the other members in that team
roll back to their latest checkpoint and start the recovery
process. This eliminates the need to log message contents within
teams. The memory savings produced by this approach depend on
the characteristics of the application, in particular the number of
messages sent per computation unit and the size of those messages. We present
promising results for multiple benchmarks. As an example, for the
NPB-CG benchmark running class D on 512 cores, our technique reduces
the memory overhead of message logging by 62%.
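
The team-based idea described above can be illustrated with a small sketch. It is not the implementation evaluated in the paper: the names `Message`, `team_of`, `SenderLog`, and `demo` are hypothetical, teams are assumed to be formed from consecutive ranks, and keeping a (source, destination) record for intra-team messages is an assumption made only for illustration. The sketch shows the single rule stated in the text, namely that message contents are logged only when sender and receiver belong to different teams.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    src: int        # rank of the sending processor
    dst: int        # rank of the receiving processor
    payload: bytes  # message contents


def team_of(rank: int, team_size: int) -> int:
    # Assumed mapping (not from the paper): consecutive ranks form a team.
    return rank // team_size


@dataclass
class SenderLog:
    team_size: int
    entries: list = field(default_factory=list)

    def record(self, msg: Message) -> None:
        same_team = (team_of(msg.src, self.team_size)
                     == team_of(msg.dst, self.team_size))
        # Contents are stored only for messages that cross a team boundary.
        # If a team member fails, the whole team rolls back, so intra-team
        # contents are not needed from the log.
        payload = None if same_team else msg.payload
        self.entries.append((msg.src, msg.dst, payload))

    def logged_bytes(self) -> int:
        # Total size of the message contents held in the log.
        return sum(len(p) for _, _, p in self.entries if p is not None)


def demo() -> tuple:
    # Four processors exchanging 100-byte messages in a ring.
    msgs = [Message(r, (r + 1) % 4, b"x" * 100) for r in range(4)]
    plain = SenderLog(team_size=1)  # every processor is its own team
    teams = SenderLog(team_size=2)  # teams {0, 1} and {2, 3}
    for m in msgs:
        plain.record(m)
        teams.record(m)
    return plain.logged_bytes(), teams.logged_bytes()
```

In this toy ring, two of the four messages stay inside a team, so the contents held in the log drop from 400 to 200 bytes. The actual savings depend on the communication pattern of the application, as noted above.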