Bad page offlining

A common class of memory errors is a single "stuck bit" in a DIMM.
The bit stays stuck in a specific state and cannot be rewritten anymore.
Other bits in the same DIMM or on the same channel are not affected.

With ECC DIMMs this error can be corrected, so it is not immediately a fatal problem.
But if another nearby bit gets corrupted for some reason, this could develop into
an uncorrected 2-bit error. In addition, the stuck bit will generate a continuous
stream of corrected error reports each time the memory scrubber scrubs it. Handling these
reports takes time and may swamp error thresholds intended for other purposes,
yet the reports do not convey any new information.

The best strategy is to simply stop using the bit. The only entity that
has reasonably fine-grained control over that is the operating system. It manages
memory in pages (typically 4K in size), and it is possible to offline the page
containing the stuck bit.
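
As a minimal sketch of what "the page containing the stuck bit" means, the
following maps a physical address to the start of its page and to its page
frame number. It assumes the 4K page size mentioned above; the function names
and the example address are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4K pages, as in the text above


def page_base(phys_addr, page_size=PAGE_SIZE):
    """Return the start address of the page that contains phys_addr."""
    # Clearing the low bits rounds the address down to a page boundary.
    return phys_addr & ~(page_size - 1)


def page_frame_number(phys_addr, page_size=PAGE_SIZE):
    """Return the index of the page that contains phys_addr."""
    return phys_addr // page_size


addr = 0x12345ABC  # hypothetical address reported for a corrected error
print(hex(page_base(addr)))          # 0x12345000
print(hex(page_frame_number(addr)))  # 0x12345
```

Every address inside the same 4K range maps to the same page, so offlining
that one page is enough to stop using the stuck bit.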

When running in daemon mode, mcelog keeps track of corrected memory errors per 4K page
and maintains an error counter for each page.
This is controlled using the [page]
section in mcelog.conf.
Page tracking is enabled by default (if the CPU supports it),
with a page being offlined when a threshold of 10 errors per 24 hours is crossed.
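
The per-page accounting described above can be illustrated with a small
sketch. This is not mcelog's actual code, and mcelog's own accounting
algorithm may differ; it is a simplified sliding-window counter that assumes
the threshold counts as crossed once 10 errors fall within any 24 hour window.
The class and method names are hypothetical.

```python
from collections import defaultdict, deque

PAGE_SIZE = 4096      # assumption: 4K pages
THRESHOLD = 10        # errors, matching the default described above
WINDOW = 24 * 3600    # 24 hours, in seconds


class PageErrorTracker:
    """Count corrected errors per page within a sliding time window."""

    def __init__(self, threshold=THRESHOLD, window=WINDOW):
        self.threshold = threshold
        self.window = window
        # Maps page start address -> timestamps of recent errors.
        self.events = defaultdict(deque)

    def record(self, phys_addr, now):
        """Record one corrected error at time `now` (seconds).

        Returns True when the page has reached the threshold and
        would be a candidate for offlining.
        """
        page = phys_addr & ~(PAGE_SIZE - 1)
        timestamps = self.events[page]
        timestamps.append(now)
        # Forget errors that are older than the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        return len(timestamps) >= self.threshold


tracker = PageErrorTracker()
# Ten errors in the same page, one hour apart (hypothetical addresses).
for hour in range(10):
    crossed = tracker.record(0x5000 + hour, hour * 3600)
print(crossed)  # True
```

Errors that arrive more slowly than the configured rate age out of the window
and never trigger offlining in this sketch.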

Linux starting with 2.6.33 (and some 2.6.32 kernels with backports) has
a page soft-offlining capability. That is, the contents of the page are copied
somewhere else (or dropped if not needed), and the original page is removed
from normal operating system memory management and not used anymore.
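
For illustration, the kernel exposes soft-offlining to user space through the
sysfs file /sys/devices/system/memory/soft_offline_page, which accepts a
physical address. The sketch below assumes that interface is available (it
needs root and a kernel built with memory failure handling); the function name
and the example address are hypothetical, and this is not a description of how
mcelog itself is implemented.

```python
SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"


def soft_offline_page(phys_addr, path=SOFT_OFFLINE_PATH):
    """Ask the kernel to soft-offline the page containing phys_addr.

    The address is written in hexadecimal. The write fails with an
    OSError if the page cannot be offlined or the caller lacks
    permission, so callers should be prepared to handle that.
    """
    with open(path, "w") as f:
        f.write(f"{phys_addr:#x}\n")


# Example (commented out because it changes system state and needs root):
# soft_offline_page(0x12345000)
```

The path is a parameter so the function can be pointed at an ordinary file
when experimenting without touching real memory.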

The capability is called soft-offlining because it never
kills or otherwise affects any application, in contrast to the "hard-offlining"
that is done when an uncorrected recoverable data error happens.

One caveat is that offlining doesn't work for all pages, only for pages
in specific states.
However, in common workloads the majority of memory can be soft-offlined.

Hardware

Bad page offlining works on CPUs that provide a physical address on corrected
memory machine check errors. These are generally CPUs with an integrated memory
controller and ECC memory support. On Intel Xeon 75xx, 65xx, and E7 (Westmere) series CPUs,
a special driver has to be loaded for this and the BIOS has to enable the
"firmware first" functionality.