Memory Errors in Modern Systems: The Good, The Bad, and the Ugly

Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems

Abstract:

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems.

In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern, and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today’s systems.