Machine-check exception

This article aims to help users implement services to actively monitor, log, and report hardware errors. A machine check exception (MCE) is an error generated by the CPU when the CPU detects that a hardware error or failure has occurred.

Machine check exceptions can occur for a variety of reasons, including out-of-spec voltages from the power supply, cosmic radiation flipping bits in memory DIMMs or the CPU, and other miscellaneous faults, such as faulty software triggering hardware errors.

Installation

Install the rasdaemon package from the AUR. rasdaemon, written by Mauro Carvalho Chehab, is one of the tools for gathering MCE information.

Previously, this task was performed by the mcelog package. However, mcelog has been deprecated, and Arch kernels are no longer compiled with the configuration option it requires, CONFIG_X86_MCELOG_LEGACY (FS#55657).

Configuration

Two systemd services need to be started and enabled:

ras-mc-ctl.service registers DIMM labels (from /etc/ras/dimm_labels.d/) with the EDAC drivers. On consumer-grade motherboards it usually logs a "No dimm labels for <motherboard model>" error and does nothing else.

rasdaemon.service runs as a daemon and logs RAS events to the systemd journal.
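As a minimal sketch, both units can be started and enabled in one step with systemctl. The function name enable_ras_services is a hypothetical helper introduced here for use in a setup script; it is not part of rasdaemon. It must be run as root.

```shell
#!/bin/sh
# enable_ras_services is a hypothetical helper name, not part of rasdaemon.
# "enable --now" enables each unit at boot and also starts it immediately.
enable_ras_services() {
    systemctl enable --now ras-mc-ctl.service rasdaemon.service
}
```

After sourcing the snippet, call enable_ras_services from a root shell. The "No dimm labels" message from ras-mc-ctl.service described above may still appear in the journal on consumer-grade boards.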

See ras-mc-ctl(8) and rasdaemon(1) for more information.

Usage

You can use ras-mc-ctl --error-count to quickly glance at the recorded errors. Errors are logged to the journal as well as to an SQLite database.
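The following sketch combines the error counter with a look at what rasdaemon.service has written to the journal during the current boot. The function name show_ras_errors is a hypothetical helper, and restricting the journal query to the current boot is an assumption made for brevity.

```shell
#!/bin/sh
# show_ras_errors is a hypothetical helper name, not part of rasdaemon.
show_ras_errors() {
    # Print the error counts reported by ras-mc-ctl.
    ras-mc-ctl --error-count
    # Show rasdaemon's journal entries for the current boot without a pager.
    journalctl --unit=rasdaemon.service --boot --no-pager
}
```

Drop --boot to include entries from earlier boots, provided the journal is stored persistently.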

See also

Hardware documentation

This article is issued from Archlinux. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.