This article aims to help users implement services to actively monitor, log, and report hardware errors. A machine check exception (MCE) is an error generated by the CPU when the CPU detects that a hardware error or failure has occurred.

==Introduction==

−

Machine check exceptions (MCEs) can occur for a variety of reasons, including out-of-spec voltages from the power supply, cosmic radiation flipping bits in memory DIMMs, and other miscellaneous faults such as faulty software triggering hardware errors.

+

Machine check exceptions (MCEs) can occur for a variety of reasons, including out-of-spec voltages from the power supply, cosmic radiation flipping bits in memory DIMMs or the CPU, and other miscellaneous faults such as faulty software triggering hardware errors.

==Installing mcelog==

−

The [http://www.mcelog.org/ mcelog] daemon written by Andi Kleen is one of the methods in which one can handle MCEs. The {{Package Official|mcelog}} daemon can be found in the {{Codeline|[community]}} repository and can be installed with [[pacman]].

+

The [http://www.mcelog.org/ mcelog] daemon written by Andi Kleen is one of the tools one can use to gather MCE information.

−

pacman -S mcelog

+

+

[[pacman|Install]] the {{Pkg|mcelog}} package from the [[Official Repositories|official repositories]].

==Configuring mcelog==

−

mcelog's configuration file is supposed to be located at {{Filename|/etc/mcelog.conf}}, but as of 2011-09-29, that file is not created after running {{Codeline|pacman -S mcelog}}.

+

mcelog's configuration file is located at {{ic|/etc/mcelog/mcelog.conf}}.

Finally, use the {{ic|/etc/rc.d/mcelog}} script to start mcelog at boot via {{ic|/etc/[[rc.conf]]}}.

−

* Make sure the files are owned by root:root.

+

{{Note|If running mcelog via the {{ic|rc.d}} command or via {{ic|/etc/[[rc.conf]]}}, it is unnecessary to set {{ic|daemon <nowiki>=</nowiki> yes}} in {{ic|/etc/mcelog/mcelog.conf}} because {{ic|/etc/rc.d/mcelog}} starts mcelog in daemon mode by default.}}

It is recommended by upstream to always run mcelog as a daemon, so edit {{Filename|/etc/mcelog.conf}} and set {{Codeline|daemon <nowiki>=</nowiki> yes}}.

−

−

Finally, {{Codeline|mcelog}} needs to be added to the {{Codeline|DAEMONS}} array in {{Filename|/etc/rc.conf}}.

===Additional configuration options===

−

The following options are probably recommended:

+

The following option is probably recommended:

syslog = yes
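To confirm that an option such as {{ic|syslog}} is actually active (set and not commented out), the configuration file can be inspected from a script. The following is only a minimal sketch: {{ic|conf_value}} is a hypothetical helper and not part of mcelog, it assumes the simple {{ic|key <nowiki>=</nowiki> value}} layout shown above, and it reads only the global options that appear before the first {{ic|[section]}} header.

```shell
#!/bin/sh
# conf_value FILE KEY
# Hypothetical helper (not shipped with mcelog): print the value of a global
# "key = value" option, ignoring commented-out lines. Prints nothing if the
# key is not set before the first [section] header.
conf_value() {
    awk -v key="$2" '
        /^[ \t]*\[/ { exit }            # stop at the first [section] header
        /^[ \t]*#/  { next }            # skip comment lines
        index($0, "=") == 0 { next }    # skip lines without an assignment
        {
            name = substr($0, 1, index($0, "=") - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (name == key) {
                value = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                print value
                exit
            }
        }
    ' "$1"
}

# Example use; the path is the one given in this article.
conf=/etc/mcelog/mcelog.conf
if [ -r "$conf" ]; then
    echo "syslog is set to: $(conf_value "$conf" syslog)"
fi
```

An empty result means the option is unset or commented out, in which case mcelog falls back to whatever default upstream defines for it.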

−

syslog-error = yes

−

socket-path = /var/run/mcelog-client

−

−

==Example /etc/mcelog.conf==

−

As of 2011-09-29, the {{Package Official|mcelog}} package from {{Codeline|[community]}} does not generate a default/example configuration file at {{Filename|/etc/mcelog.conf}}. The example configuration file from upstream (as of 2011-09-29) can be found below for reference:

# Append log output to logfile instead of stdout. Only when no syslog logging is active

−

#logfile = filename

−

−

# Use SMBIOS information to decode DIMMs (needs root)

−

# This function is not recommended to use right now and generally not needed

−

# The exception is memdb prepopulation, which is configured separately below.

−

#dmi = no

−

−

# when in daemon mode run as this user after set up

−

# note that the triggers will run as this user too

−

# setting this to non root will mean that triggers cannot take some corrective

−

# action, like offlining objects

−

#run-credentials-user = root

−

# group to run as daemon with

−

# default to the group of the run-credentials-user

−

#run-credentials-group = nobody

−

−

[server]

−

# user allowed to access client socket.

−

# when set to * match any

−

# root is always allowed to access

−

# default: root only

−

client-user = root

−

# group allowed to access mcelog

−

# when no group is configured any group matches (but still user checking)

−

# when set to * match any

−

#client-group = root

−

# path to the unix socket for client<->server communication

−

# when no socket-path is configured the server will not start

−

#socket-path = /var/run/mcelog-client

−

# when mcelog starts it checks if a server is already running. timeout

−

# for this check.

−

#initial-ping-timeout = 2

−

#

−

[dimm]

−

# Is the in memory DIMM error tracking enabled?

−

# Only works on systems with integrated memory controller and

−

# which are supported

−

# Only takes effect in daemon mode

−

dimm-tracking-enabled = yes

−

# Use DMI information from the BIOS to prepopulate DIMM database

−

# Note this might not work with all BIOS and requires mcelog to run as root.

−

# Alternative is to let mcelog create DIMM objects on demand.

−

dmi-prepopulate = yes

−

#

−

# execute these triggers when the rate of corrected or uncorrected

−

# errors per DIMM exceeds the threshold

−

# Note when the hardware does not report DIMMs this might also

−

# be per channel

−

# The default of 10/24h is reasonable for server quality

−

# DDR3 DIMMs as of 2009/10

−

#uc-error-trigger = dimm-error-trigger

−

uc-error-threshold = 1 / 24h

−

#ce-error-trigger = dimm-error-trigger

−

ce-error-threshold = 10 / 24h

−

−

[socket]

−

# Memory error accounting per socket

−

socket-tracking-enabled = yes

−

# Threshold and trigger for uncorrected memory errors on a socket

−

# mem-uc-error-trigger = socket-memory-error-trigger

−

mem-uc-error-threshold = 100 / 24h

−

# Threshold and trigger for corrected memory errors on a socket

−

mem-ce-error-trigger = socket-memory-error-trigger

−

mem-ce-error-threshold = 100 / 24h

−

# Log socket error threshold explicitly?

−

mem-ce-error-log = yes

−

−

−

[cache]

−

# Processing of cache error thresholds reported by Intel CPUs

−

cache-threshold-trigger = cache-error-trigger

−

# Should cache threshold events be logged explicitly?

−

cache-threshold-log = yes

−

−

[page]

−

# Memory error accounting per 4K memory page

−

# Threshold for the correct memory errors trigger script

−

memory-ce-threshold = 10 / 24h

−

# Trigger script for corrected errors

−

# memory-ce-trigger = page-error-trigger

−

# Should page threshold events be logged explicitly?

−

memory-ce-log = yes

−

# specify the internal action in mcelog to exceeding a page error threshold

−

# this is done in addition to executing the trigger script if available

−

# off no action

−

# account only account errors

−

# soft try to soft-offline page without killing any processes

−

# This requires an up-to-date kernel. Might not be successful.

−

# hard try to hard-offline page by killing processes

−

# Requires an up-to-date kernel. Might not be successful.

−

# soft-then-hard First try to soft offline, then try hard offlining

−

#memory-ce-action = off|account|soft|hard|soft-then-hard

−

memory-ce-action = soft

−

−

[trigger]

−

# Maximum number of running triggers

−

children-max = 2

−

# execute triggers in this directory

−

directory = /etc/mcelog

−

</nowiki>}}

−

−

==Example /etc/logrotate.d/mcelog==

−

As of 2011-09-29, the {{Package Official|mcelog}} package from {{Codeline|[community]}} does not generate a default logrotate file at {{Filename|/etc/logrotate.d/mcelog}} or at {{Filename|/etc/logrotate.d/mcelog.logrotate}}. The example logrotate file from upstream (as of 2011-09-29) can be found below for reference:

Revision as of 15:20, 7 August 2013

This article aims to help users implement services to actively monitor, log, and report hardware errors. A machine check exception (MCE) is an error generated by the CPU when the CPU detects that a hardware error or failure has occurred.

Introduction

Machine check exceptions (MCEs) can occur for a variety of reasons, including out-of-spec voltages from the power supply, cosmic radiation flipping bits in memory DIMMs or the CPU, and other miscellaneous faults such as faulty software triggering hardware errors.

Installing mcelog

The mcelog daemon written by Andi Kleen is one of the tools one can use to gather MCE information.