The Design and Implementation of a Multi-level Content-Addressable Checkpoint File System

Year:

2012

Type of Publication:

Article

Authors:

Kulkarni, Abhishek

Manzanares, Adam

Ionkov, Latchesar

Lang, Michael

Lumsdaine, Andrew

Note:

IEEE International Conference on High Performance Computing (HiPC), Pune, India, December 18-21, 2012

Abstract:

Long-running HPC applications guard against node failures by writing checkpoints to parallel file systems. Writing these checkpoints with petascale-class machines has proven difficult, and the increased concurrency demands of exascale computing will exacerbate this problem. To meet checkpointing demands and sustain application-perceived throughput at exascale, multi-tiered hierarchical storage architectures involving solid-state burst buffers are being considered. In this paper, we describe the design and implementation of cento, a multilevel, content-addressable checkpoint file system for large-scale HPC systems. cento achieves in-flight checkpoint data reduction across all compute nodes through compression and elimination of duplicate blocks over a series of checkpoints. Through a detailed analysis of checkpoint dumps, we assess the benefits of data reduction for scientific applications that are representative of production workloads. We observe up to 40% data reduction within a limited sample of representative workloads. Finally, experiments on existing systems show a decrease in checkpoint commit latencies of 5 to 20%, reducing the load on the parallel file system.
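
The data-reduction idea in the abstract (compressing blocks and storing each distinct block only once across a series of checkpoints) can be illustrated with a minimal sketch. This is not cento's implementation: the fixed 4 KiB block size, the SHA-256 addressing, the zlib compression, and the names `BlockStore`, `put_checkpoint`, and `get_checkpoint` are all assumptions chosen for illustration.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size; the abstract does not specify one


class BlockStore:
    """Toy content-addressable store (hypothetical): blocks are keyed by SHA-256."""

    def __init__(self):
        self.blocks = {}        # digest -> compressed block bytes
        self.logical_bytes = 0  # total bytes the application asked to write

    def put_checkpoint(self, data: bytes) -> list:
        """Store a checkpoint; return its recipe, an ordered list of block digests."""
        recipe = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # A block already present (from this or an earlier checkpoint)
            # is not stored again; only its digest is recorded.
            if digest not in self.blocks:
                self.blocks[digest] = zlib.compress(block)
            recipe.append(digest)
        self.logical_bytes += len(data)
        return recipe

    def get_checkpoint(self, recipe) -> bytes:
        """Rebuild a checkpoint by decompressing its blocks in recipe order."""
        return b"".join(zlib.decompress(self.blocks[d]) for d in recipe)

    def stored_bytes(self) -> int:
        """Bytes actually held after deduplication and compression."""
        return sum(len(b) for b in self.blocks.values())
```

In this sketch, two successive checkpoints that differ in a single block share every other block in the store, so the second checkpoint adds only one new compressed block. How much a real workload benefits depends on how much of its state changes between checkpoints.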