
Machine-check exception

A machine-check exception (MCE), also known as #MC in Intel architecture, is an abort-class exception (vector 18) generated by the CPU when it detects an internal hardware error, such as faults in memory, caches, buses, or TLBs, that hardware cannot fully correct on its own.^[1] This mechanism, part of the broader Machine-Check Architecture (MCA), enables operating systems to log, diagnose, and respond to these errors, ranging from recoverable issues to those requiring system shutdown.^[1] Introduced in the Pentium processor and enhanced in subsequent x86 families like P6, Pentium 4, Xeon, and Atom, MCE relies on Model-Specific Registers (MSRs) to report detailed error information.^[1]

The MCA organizes error reporting into "banks," each corresponding to a hardware subsystem (e.g., one for the data cache, another for the bus unit). The number of banks varies by model and is given by the Count field of the IA32_MCG_CAP MSR, an 8-bit field that allows up to 255 banks in theory, though real processors implement far fewer.^[1] IA32_MCG_CAP also advertises other global capabilities, while per-bank MSRs such as IA32_MCi_STATUS capture specifics like error validity, whether the error is uncorrected (UC bit set), and whether it corrupted processor context (PCC bit set).^[1]

Enabling MCE requires setting the MCE bit in CR4; once active, uncorrectable errors trigger the exception, which may arrive asynchronously with respect to instruction execution. If a further machine check occurs while the handler is still running (that is, while the MCIP flag is set), the processor enters the shutdown state.^[1] In virtualized environments, such as those using Intel VT-x, MCEs can cause VM exits to allow hypervisor intervention.^[1]

Errors fall into two primary categories: corrected errors, which hardware fixes automatically (e.g., single-bit ECC memory corrections) and are often reported via a Corrected Machine-Check Interrupt (CMCI) for logging without halting execution; and uncorrected errors, which trigger #MC and are further classified as recoverable (software may continue after mitigation) or fatal (demanding immediate shutdown).^[1]

In Linux, the kernel's MCE handler, rewritten for x86-64 in version 2.5, logs these errors for decoding by tools like mcelog. The kernel polls for corrected errors every 5 minutes by default, and a tolerance level configurable via sysfs controls the response to uncorrected errors (e.g., panicking or signaling the affected user process). Uncorrected errors in kernel mode typically panic the system, while those in user space may kill the affected process with SIGBUS, enhancing reliability in server and high-availability environments.^[3] Overall, MCE provides essential diagnostics for hardware faults, supporting proactive maintenance in data centers and embedded systems.^[3]

Fundamentals

Definition and Purpose

A machine-check exception (MCE) is an exception generated by the CPU in response to detecting an uncorrectable hardware error, such as a memory fault, bus error, or cache inconsistency, alerting the operating system to the issue.^[4] These exceptions are part of the broader Machine Check Architecture (MCA) in x86 processors, which provides a standardized framework using model-specific registers (MSRs), such as IA32_MCG_CAP and IA32_MCi_STATUS, to detect, log, and report both correctable and uncorrectable errors.^[4] The primary purpose of MCEs is to enable software to respond appropriately to severe hardware faults, facilitating actions like error logging, graceful system degradation, recovery attempts, or controlled shutdown to avert data corruption and maintain overall system integrity.^[4] This contrasts with correctable errors, which hardware resolves transparently without software intervention, ensuring minimal disruption for transient issues while escalating uncorrectable ones for explicit handling.^[4]

MCEs can be synchronous, directly tied to the execution of a specific instruction that provokes the error, or asynchronous, arising from external or independent hardware failures unrelated to the current program flow.^[4] In Intel x86 processors, MCA serves as the established standard for detailed error reporting. The processor core delivers an MCE directly as exception vector 18; the local Advanced Programmable Interrupt Controller (APIC) is involved only in delivering the separate corrected machine-check interrupt (CMCI).^[4] For instance, the Machine Check In Progress (MCIP) flag in the IA32_MCG_STATUS MSR (bit 2) is set to indicate an active MCE, and software clears it once handling is complete; a second machine check arriving while MCIP is still set shuts the processor down rather than re-entering the handler.^[4]
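The role of the MCIP flag can be illustrated with a minimal sketch that decodes an IA32_MCG_STATUS value. The function names are hypothetical; the RIPV and EIPV positions (bits 0 and 1) come from the Intel SDM, while bit 2 for MCIP is stated above.

```python
# Bit positions in IA32_MCG_STATUS (MCIP per the text; RIPV/EIPV per the Intel SDM).
MCG_STATUS_RIPV = 1 << 0  # restart instruction pointer is valid
MCG_STATUS_EIPV = 1 << 1  # error instruction pointer is valid
MCG_STATUS_MCIP = 1 << 2  # machine check in progress


def decode_mcg_status(value: int) -> dict:
    """Split a raw IA32_MCG_STATUS value into its three low flag bits."""
    return {
        "ripv": bool(value & MCG_STATUS_RIPV),
        "eipv": bool(value & MCG_STATUS_EIPV),
        "mcip": bool(value & MCG_STATUS_MCIP),
    }


def clear_mcip(value: int) -> int:
    """Return the value a handler would write back once the event is handled."""
    return value & ~MCG_STATUS_MCIP
```

A real handler would read and write the MSR with privileged instructions; this sketch only models the bit manipulation on a plain integer.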

Historical Development

The concept of machine-check exceptions originated in mainframe computing with the IBM System/360 architecture, announced in 1964 and detailed in its principles of operation by 1967. This system introduced machine check interruptions as a high-priority mechanism to detect hardware malfunctions, such as data parity errors in main storage or channel control issues, enabling diagnostic scans and recovery actions while preserving system state for fault analysis.^[5]

In the x86 architecture, the machine-check exception was first introduced with the Intel Pentium processor in 1993, which provided basic reporting for internal errors such as parity failures in the on-chip caches.^[6] The bank-based Machine Check Architecture (MCA) followed with the P6 family (Pentium Pro) in 1995, adding per-bank status, address, and miscellaneous registers that improved logging of uncorrectable errors and laid the groundwork for more robust hardware diagnostics. AMD adopted MCA support in its K7 (Athlon) processor family starting in 1999, aligning with x86 standards through CPUID feature bit 14 to enable machine check reporting in compatible systems.^[7] The x86 MCA specification evolved from 1996 onward through Intel's architecture documentation, standardizing error registers and exception handling across processors. A significant milestone came in 2008 with Intel's Nehalem microarchitecture, which introduced scalable MCA banks (multiple sets of registers for detailed error logging per CPU core) to handle the growing volume of potential faults in multi-core designs.
By the 2020s, extensions appeared in other architectures: ARMv8-A incorporated machine check exceptions as part of its Reliability, Availability, and Serviceability (RAS) features for uncorrectable hardware errors, while RISC-V developed error reporting mechanisms, including machine check-like synchronous exceptions for hardware faults, through ongoing ISA extensions.^[8] These developments were driven by the escalating complexity of multi-core processors and the reliability demands of data centers, where undetected hardware errors could cascade into system-wide failures.^[3]

Hardware Detection

Mechanisms in Modern CPUs

In modern x86 CPUs, hardware errors are detected through integrated monitoring mechanisms within components such as error-correcting code (ECC) memory controllers, cache parity checkers, translation lookaside buffers (TLBs), and interconnect buses. These monitors continuously validate data integrity during operations like reads, writes, and transmissions, flagging anomalies such as single-bit flips or multi-bit corruptions via dedicated signals or internal status bits. Upon detection, the CPU core captures the error details and, if the error is uncorrectable, vectors to a machine-check exception (#MC) handler at vector 18, provided the CR4.MCE bit is enabled in the control register. This process ensures that severe errors interrupt normal execution promptly, while preserving the processor state for analysis.^[1]

Error information is logged in Machine Check Architecture (MCA) banks, which are sets of model-specific registers (MSRs) dedicated to storing comprehensive error records. Each bank includes registers like IA32_MCi_STATUS (for error validity, type, and codes), IA32_MCi_ADDR (for the address involved), and IA32_MCi_MISC (for additional syndrome or request data), enabling precise identification of the affected hardware unit. Modern Intel CPUs support up to 28 MCA banks per logical processor, with the exact count reported by the IA32_MCG_CAP MSR; banks are allocated to specific subsystems, such as one for the L3 cache or another for the memory controller. When a bank already holds a valid record, architectural overwrite rules decide whether a new error replaces it (uncorrected errors take precedence over corrected ones), and the overflow flag records that information was lost. This register-based logging occurs atomically to prevent data loss during concurrent errors.^[1]

Handling differs between uncorrected errors, which may arrive asynchronously (independently of the current instruction, e.g., via external bus events), and corrected errors (automatically repaired, e.g., by ECC single-error correction). The CR4.MCE bit (bit 6) enables #MC exceptions globally; when set, uncorrectable errors trigger an immediate #MC, potentially leading to an asynchronous enclave exit (AEX) in Intel SGX environments, while corrected errors are logged in MCA banks without an exception. If a corrected-error threshold is exceeded, a corrected machine-check interrupt (CMCI) may be signaled through the local APIC's LVT CMCI entry, using a vector chosen by system software. If CR4.MCE is cleared, an uncorrectable error instead causes the processor to enter shutdown, possibly accompanied by signals like BINIT# or MCERR#, bypassing the exception pathway. Corrected errors, even when CR4.MCE is enabled, prioritize non-disruptive logging to maintain system performance.^[1]

The detection-to-handler flow proceeds as follows: hardware monitors identify an error and assert internal flags; the error is captured in the MCA bank belonging to the detecting unit, updating status bits to mark validity and recovery status; if the error warrants interruption (uncorrectable and CR4.MCE enabled), the CPU delivers the #MC exception, pushing the current state onto the stack and transferring control to the vector 18 handler for OS trapping. This sequence minimizes latency while ensuring that error context, such as model-specific error codes (MSCOD) and corrected/uncorrected indicators, is preserved for subsequent decoding.^[1]
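The register layout and the delivery decision described above can be sketched as follows. This is a simplified model, not a handler: the MSR addresses (0x179 for IA32_MCG_CAP, 0x400 + 4·i for bank i) and the MCG_CMCI_P bit position are taken from the Intel SDM rather than from the text, the function names are hypothetical, and per-bank enable bits in IA32_MCi_CTL are ignored.

```python
CR4_MCE = 1 << 6          # CR4.MCE, bit 6
IA32_MCG_CAP = 0x179      # global capabilities MSR (address per the Intel SDM)
IA32_MC0_CTL = 0x400      # first per-bank MSR; each bank occupies four MSRs
MCG_CMCI_P = 1 << 10      # CMCI supported (bit position per the Intel SDM)


def bank_count(mcg_cap: int) -> int:
    """Count field: bits 7:0 of IA32_MCG_CAP."""
    return mcg_cap & 0xFF


def bank_msrs(bank: int) -> dict:
    """MSR addresses of the CTL/STATUS/ADDR/MISC registers for one bank."""
    base = IA32_MC0_CTL + 4 * bank
    return {"ctl": base, "status": base + 1, "addr": base + 2, "misc": base + 3}


def delivery(uncorrected: bool, cr4: int) -> str:
    """Simplified outcome of a logged error, following the flow in the text."""
    if not uncorrected:
        return "logged"        # corrected: recorded, possibly CMCI later
    if cr4 & CR4_MCE:
        return "#MC"           # exception vector 18
    return "shutdown"          # CR4.MCE clear: no exception pathway
```

For example, `bank_msrs(4)["status"]` gives 0x411, the address of IA32_MC4_STATUS.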

Supported Architectures

Machine-check exceptions are supported across several major processor architectures, each implementing mechanisms tailored to their design principles for detecting and reporting hardware errors such as cache failures, bus errors, and memory faults. In the x86 family, Intel introduced the Machine Check Architecture (MCA) with the Pentium Pro processor in 1995, providing a framework for error reporting through dedicated model-specific registers (MSRs).^[3] Subsequent Intel processors, starting from the Pentium II and extending to modern Core and Xeon series, expanded MCA with up to 28 error-reporting banks per core, each containing status, address, and miscellaneous registers to capture error details like syndrome and location. AMD implemented compatible MCA support beginning with the K7 (Athlon) processor family in 1999, aligning with x86 standards but later introducing variations such as the additional IPID (IP Block Identifier) register in scalable MCA systems for enhanced error source identification in modern Ryzen and EPYC processors. These bank structures enable detailed logging of uncorrectable and correctable errors, with Intel emphasizing status/address/misc triads and AMD adding IPID for modular IP block tracing.

ARM architecture incorporated Reliability, Availability, and Serviceability (RAS) extensions starting with ARMv8.2 in 2016, formalized as a mandatory feature by 2018 implementations, which include error record registers (ERRSELR, ERXSTATUS, ERXMISC, and ERXADDR) analogous to x86 MCA banks for capturing asynchronous errors.
These registers support an implementation-defined number of error records per processor, facilitating firmware or OS-level error analysis similar to MCA, with features for corrected error counting and injection testing.^[9] IBM's Power architecture has long supported machine-check interrupts through hardware error facilities, enabling detection of processor, memory, and I/O faults via dedicated interrupt vectors and recovery units, with modern POWER9 and POWER10 processors supporting machine-check interrupts with dynamic reconfiguration. Similarly, z/Architecture in IBM Z mainframes provides machine-check interrupts for comprehensive error reporting, with the z15 processor (introduced in 2019) enhancing firmware-first handling through the Integrated Firmware Processor (IFP) and Licensed Internal Code (LIC), allowing initial error interception and recovery at the firmware level before escalating to the OS, reducing outage risks.^[10] RISC-V introduced RAS extensions in its ratified privileged architecture specification (version 1.12) in 2021, defining standard mechanisms for error injection, reporting, and interrupt handling via memory-mapped registers and the RAS Event interrupt, supporting scalable error records for cores and accelerators without fixed bank counts, emphasizing modularity for custom implementations.

Architecture	Introduction Year	Key Features
x86 (Intel)	1995 (Pentium Pro)	Up to 28 MCA banks; status/address/misc registers; scalable error logging.
x86 (AMD)	1999 (K7)	MCA banks with IPID for IP block identification; compatible with Intel but enhanced for modular dies.
ARM	2016 (ARMv8.2)	Implementation-defined number of error record registers (ERRSELR et al.); asynchronous error reporting; corrected error support.
IBM Power	1991 (PowerPC)	Machine-check interrupts; dynamic recovery units.
IBM z/Architecture	2000 (z900)	Machine-check interrupts; firmware-first via IFP/LIC (enhanced in z15, 2019).^[10]
RISC-V	2021 (Priv. v1.12)	Memory-mapped RAS records; error injection; RAS Event interrupt; modular banks.

Error Classification

Uncorrectable vs. Correctable Errors

Machine-check exceptions (MCEs) in modern processors distinguish between uncorrectable and correctable errors to enable appropriate hardware responses, with uncorrectable errors generally triggering an immediate MCE while correctable ones are handled transparently unless thresholds are exceeded.^[1]^[11]

Uncorrectable errors, indicated by the UC bit set to 1 in the MCi_STATUS register, cannot be automatically fixed by hardware and thus invoke a machine-check exception (#MC, vector 18) when the MCE feature is enabled via CR4.MCE=1.^[1]^[11] These errors often lead to system panic or require recovery mechanisms if the error is deemed recoverable, such as in cases of multi-bit ECC failures in memory or TLB parity errors that corrupt processor state.^[1] For instance, in Intel's architecture, the PCC bit (bit 57 of IA32_MCi_STATUS) further classifies uncorrected errors as fatal (PCC=1, indicating processor context corruption and necessitating a reset) or potentially recoverable (PCC=0), while AMD similarly uses the UC bit to flag errors like uncorrectable ECC or severe TLB faults that may still allow limited recovery.^[1]^[11] Uncorrected errors are thus subdivided into fatal (context corrupted) and uncorrected recoverable (UCR) types. UCR errors are in turn classified as software recoverable action required (SRAR), software recoverable action optional (SRAO), or uncorrected no action required (UCNA), enabling potential software mitigation if the processor state permits.^[1]

In contrast, correctable errors, marked by UC=0 in MCi_STATUS, are detected and rectified by hardware without interrupting normal execution, such as through single-bit error correction using SECDED (Single Error Correction, Double Error Detection) in memory modules.^[1]^[11] These errors are logged in the relevant MCi_STATUS register for later analysis but do not trigger an MCE; when a configurable threshold is surpassed, they may instead be signaled via a Corrected Machine Check Interrupt (CMCI).^[1] In both Intel and AMD architectures, corrected error occurrences are monitored using threshold registers such as MCi_CTL2 or MCi_THRESHOLD to track frequency and prevent silent degradation.^[1]^[11]

The impact of these errors varies by scope, with processor-internal issues (e.g., cache or TLB errors) often confined to a single core or thread, allowing potential isolation, whereas system-wide errors (e.g., interconnect or multi-channel memory failures) can affect broader operations and reduce recovery possibilities.^[1]^[11] Recovery for uncorrectable errors depends on their classification and hardware support, such as restarting instruction execution if the RIPV bit in MCG_STATUS indicates a valid restart IP, though fatal cases typically mandate full system intervention.^[1] Correctable errors, by design, support seamless recovery through hardware correction, maintaining system availability unless patterns suggest impending failure.^[11]
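The classification above can be expressed as a small decision function. This is a sketch assuming the Intel SDM encoding, in which the S (signaled) and AR (action required) flags occupy bits 56 and 55 of IA32_MCi_STATUS; those two positions and the function name are not given in the text and are labeled here as assumptions.

```python
MCI_STATUS_VAL = 1 << 63  # record is valid
MCI_STATUS_UC = 1 << 61   # uncorrected error
MCI_STATUS_PCC = 1 << 57  # processor context corrupt
MCI_STATUS_S = 1 << 56    # signaled via #MC (assumed position, per Intel SDM)
MCI_STATUS_AR = 1 << 55   # recovery action required (assumed position, per Intel SDM)


def classify(status: int, recovery_supported: bool = True) -> str:
    """Map an IA32_MCi_STATUS value to a severity class.

    recovery_supported models whether the CPU implements software error
    recovery; without it, S and AR carry no architectural meaning.
    """
    if not status & MCI_STATUS_VAL:
        return "no error"
    if not status & MCI_STATUS_UC:
        return "corrected"
    if status & MCI_STATUS_PCC:
        return "fatal"
    if not recovery_supported:
        return "uncorrected"
    if not status & MCI_STATUS_S:
        return "UCNA"  # this sketch ignores AR when S is clear
    return "SRAR" if status & MCI_STATUS_AR else "SRAO"
```

An operating system would typically panic on "fatal", attempt page offlining or process termination on "SRAR" and "SRAO", and simply log "UCNA" and "corrected".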

Specific Error Categories

Machine-check exceptions encompass a range of hardware-detected faults, categorized by the subsystem affected, such as memory, processor internals, interconnects, and I/O peripherals. These categories are defined through architectural error reporting mechanisms in processors from Intel and AMD, enabling identification of the error's origin without delving into root triggers.^[1]^[11]

Memory-related errors involve faults in the memory subsystem, including ECC failures where error-correcting codes detect and sometimes correct bit flips in data storage or transmission. DRAM scrub errors occur during periodic memory scans that identify and log correctable issues, such as single-bit errors in dynamic random-access memory modules. Unbuffered DIMM faults manifest as access anomalies in non-registered memory configurations, often reported when data integrity checks fail during read or write operations. On Intel processors, memory controller errors use the compound MCACOD form 0000 0000 1MMM CCCC (0x0080 through 0x00FF), where MMM identifies the operation and CCCC the channel; for instance, 0x00C0 denotes a memory scrubbing error on channel 0. In AMD systems, syndrome values like DramEccErr (ErrorCodeExt 0x0) denote DRAM ECC discrepancies.^[1]^[11]

Processor-internal errors arise within the CPU core, encompassing microcode errors from parity issues in firmware storage, execution unit parity failures in arithmetic or logic pipelines, and floating-point unit exceptions due to internal data corruption. These are typically logged when hardware detects inconsistencies in core operations, such as invalid instruction decoding or register file errors. Representative Intel MCACOD examples include 0x0002 for a microcode ROM parity error, 0x0005 for an internal parity error, and 0x0400 for an internal timer error, with 0x0001 reserved for unclassified errors; AMD equivalents feature ErrorCodeExt values like 0x7 for IntErrTyp1 in execution units or hardware assertions in the core.^[1]^[11]

Interconnect and bus errors pertain to communication failures across processor links, including PCIe link failures where transaction layer protocol violations or physical layer discrepancies disrupt data flow between the CPU and expansion devices. In multi-socket systems, QPI or UPI protocol errors signal issues like invalid packets or timeout conditions on inter-socket fabrics. Intel reports these with the compound MCACOD form 0000 1PPT RRRR IILL (bit 11 set), which encodes participation, timeout, request type, memory-or-I/O, and hierarchy level; AMD reports these via MCA_STATUS_CS with ErrorCodeExt 0x2 for GMI/xGMI link errors or timeout syndromes in the format 0000 1XXT RRRR XXLL.^[1]^[11]

I/O and peripheral errors cover anomalies in external device interactions, such as storage controller timeouts during command execution on SATA or NVMe interfaces, and GPU coherency issues where cache line inconsistencies arise in integrated or discrete graphics processing. These are flagged when the processor's I/O hub detects unrecoverable responses or data mismatches. Examples include Intel MCACOD 0x0E0B for an I/O error; in AMD architectures, MCA_STATUS_PB uses ErrorCodeExt for parameter block faults, while FTI_DAT_STAT (0x3) indicates I/O status discrepancies.^[1]^[11]

Error enumeration relies on standardized codes like Intel's MCACOD, where a value such as 0x0090 denotes a memory read error on channel 0, providing a compact identifier for logging and analysis. MCACOD identifies the subsystem and operation only; whether the error was corrected is conveyed separately by the UC flag. These categories therefore span both correctable and uncorrectable errors, with reporting mechanisms ensuring precise subsystem attribution.^[1]^[11]
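As a rough illustration of how an Intel MCACOD value maps to a subsystem category, the sketch below matches the simple and compound encodings published in the Intel SDM. The function and table names are hypothetical, only the memory-controller sub-fields are decoded in detail, and model-specific or newer encodings are not covered.

```python
# Simple (exact-match) architectural codes, per the Intel SDM.
SIMPLE_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
    0x0006: "SMM handler code access violation",
    0x0400: "internal timer error",
    0x0E0B: "I/O error",
}

# MMM field of the memory-controller form 0000 0000 1MMM CCCC.
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mcacod(code: int) -> str:
    """Return a short category string for a 16-bit MCACOD value."""
    code &= 0xFFFF
    if code in SIMPLE_CODES:
        return SIMPLE_CODES[code]
    if (code & 0xFC00) == 0x0400:          # 0000 01xx xxxx xxxx
        return "internal unclassified"
    code &= ~0x1000                        # drop the F (filtering) bit
    if (code & 0xFFFC) == 0x000C:          # 0000 0000 0000 11LL
        return "generic cache hierarchy error"
    if (code & 0xFFF0) == 0x0010:          # 0000 0000 0001 TTLL
        return "TLB error"
    if (code & 0xFF80) == 0x0080:          # 0000 0000 1MMM CCCC
        op = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        ch = code & 0xF
        channel = "unspecified channel" if ch == 0xF else f"channel {ch}"
        return f"memory controller: {op}, {channel}"
    if (code & 0xFF00) == 0x0100:          # 0000 0001 RRRR TTLL
        return "cache hierarchy error"
    if (code & 0xF800) == 0x0800:          # 0000 1PPT RRRR IILL
        return "bus and interconnect error"
    return "unknown"
```

Exact-match codes are checked first because a value such as 0x0E0B would otherwise fall into the bus and interconnect pattern.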

Software Response

Operating System Reporting

When a machine-check exception occurs, the operating system's kernel traps the interrupt and invokes a dedicated handler to process the error. In x86 architectures, for instance, this typically involves the #MC exception routing to the machine-check (MCE) handler in the kernel, which captures hardware-reported details from model-specific registers (MSRs).^[12] The handler parses data from Machine Check Architecture (MCA) banks—specialized hardware registers that store error information for specific components like caches or memory controllers—identifying the error source, type, and affected resources.^[13] The kernel then logs the parsed error data to an internal ring buffer or equivalent structure for efficient, low-overhead storage, enabling subsequent analysis without immediate system disruption.^[12] Reporting occurs through standardized formats, such as kernel console messages (e.g., "Machine check events logged") and syslog entries that include timestamps, severity levels, and decoded error summaries.^[13] In Windows, the Hardware Error Architecture (WHEA) employs similar logging via Event Tracing for Windows (ETW), recording errors in the system event log with structured packets detailing the hardware fault.^[14] Users gain visibility into these reports through accessible tools: in Linux, commands like dmesg display kernel ring buffer contents, while in Windows, the Event Viewer provides a graphical interface to ETW logs.^[12]^[14] For fatal, uncorrectable errors that threaten system integrity, the kernel may trigger a panic—resulting in a crash dump or Blue Screen of Death (BSOD) in Windows—to halt execution and preserve diagnostic data.^[13]^[14] Across operating systems, reporting prioritizes non-blocking mechanisms for correctable or recoverable errors, allowing the system to continue operation while queuing logs for later review, thus balancing reliability with availability.^[12]^[15]
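The log-then-continue behavior described above can be modeled with a bounded buffer. This is an illustrative sketch only: `MceRecord` and `MceLog` are hypothetical names, no real kernel uses this code, and dropping the oldest record when full is a design choice of the sketch rather than a documented kernel policy.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class MceRecord:
    cpu: int
    bank: int
    status: int
    fatal: bool


class MceLog:
    """Fixed-capacity error log that never blocks the caller."""

    def __init__(self, capacity: int = 32) -> None:
        self._records = deque(maxlen=capacity)
        self.dropped = 0  # number of records overwritten before being read

    def log(self, record: MceRecord) -> str:
        """Store the record and return the action the handler would take."""
        if len(self._records) == self._records.maxlen:
            self.dropped += 1  # deque discards the oldest entry on append
        self._records.append(record)
        return "panic" if record.fatal else "continue"

    def drain(self) -> list:
        """Hand all pending records to a user-space reader and empty the log."""
        pending = list(self._records)
        self._records.clear()
        return pending
```

A user-space daemon would call something like `drain()` periodically, which mirrors how queued records are reviewed later without stalling the handler.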

Platform-Specific Implementations

In Linux, machine-check exceptions are managed via the Machine Check Architecture (MCA) framework in the kernel, which detects and logs hardware errors such as correctable and uncorrectable events from CPU registers. The mcelog daemon parses these logs for detailed analysis and preventive maintenance, though it has been largely superseded by rasdaemon in modern distributions for collecting and decoding events across platforms. The EDAC (Error Detection and Correction) kernel module specifically handles memory controller errors, exposing statistics through the sysfs interface at /sys/devices/system/edac/ for monitoring ECC failures. The kernel boot parameter mce=off disables machine-check handling entirely, preventing error reporting but risking undetected hardware faults.^[16]^[17]

Microsoft Windows implements machine-check exception reporting through the Windows Hardware Error Architecture (WHEA), a unified framework that captures processor-detected errors and routes them to the operating system for processing. WHEA logs these events, including details like error source and type, directly to the Windows Event Viewer under the System log, enabling administrators to diagnose issues without external tools. For uncorrectable machine-check exceptions, WHEA triggers a bug check with stop code 0x124 (WHEA_UNCORRECTABLE_ERROR), resulting in a blue screen of death to halt the system and prevent further corruption.^[18]^[19]

IBM z/OS employs a dedicated machine-check handler (MCH) within the Licensed Internal Code (LIC) to intercept and process hardware interrupts from the central processor, generating records for failures in storage, keys, timers, or CPU components. These MCH records are stored in the system logrec data set for immediate analysis, with severe uncorrectable errors often leading to an IPL (Initial Program Load) abort to isolate the fault and reinitialize the system. Auditing of machine-check events is facilitated through System Management Facility (SMF) type 15 records, which capture statistical data on error occurrences for long-term reliability tracking and reporting.^[20]^[21]^[22]

FreeBSD handles machine-check exceptions through its kernel's exception and interrupt handling mechanisms on supported x86 platforms, logging error details to the kernel message buffer for diagnostics. The mcelog utility from ports can decode these logs into human-readable format. Configuration options, such as enabling detailed reporting or adjusting thresholds, are available via sysctl variables under the hw.mca namespace, allowing runtime tuning without recompilation.^[23]^[24]

Operating System	Key Components	Default Actions
Linux	MCA framework, mcelog/rasdaemon, EDAC module	Log errors and continue operation for correctable events; panic or halt for uncorrectable unless configured otherwise^[16]
Windows	WHEA framework, Event Viewer integration	Log to Event Viewer; blue screen (BSOD) and halt for uncorrectable errors^[19]
IBM z/OS	MCH in LIC, logrec data set, SMF type 15 records	Log to logrec; IPL abort for severe errors; auditing via SMF^[20]
FreeBSD	Kernel exception handling, mcelog utility, sysctl hw.mca	Log to kernel messages; continue for correctable, panic for uncorrectable^[23]
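For the Linux EDAC interface mentioned above, a monitoring script might read the per-controller counters as in the following sketch. It assumes the common sysfs layout in which each memory controller appears as an `mc<N>` directory under `/sys/devices/system/edac/mc` with `ce_count` and `ue_count` files; the function name is hypothetical, and the layout should be checked against the running kernel.

```python
from pathlib import Path

# Assumed location of EDAC memory-controller entries in sysfs.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_MC_ROOT) -> dict:
    """Return {controller: {"ce": corrected, "ue": uncorrected}} for each mcN."""
    counts = {}
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc_dir / "ce_count").read_text().strip())
            ue = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc_dir.name] = {"ce": ce, "ue": ue}
    return counts
```

On a system without EDAC drivers loaded the function simply returns an empty dictionary.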

Analysis Methods

Decoding Machine-Check Data

Machine-check exceptions generate raw data stored in hardware registers, which must be decoded to identify the error's nature, location, and severity. This process is essential for diagnosing hardware faults in x86 architectures supporting the Machine-Check Architecture (MCA). Decoding involves reading and interpreting specific model-specific registers (MSRs) that capture error details upon detection of uncorrectable or correctable errors.^[1]

The primary data sources for machine-check information are the MCA bank registers, typically including the status, address, and miscellaneous registers for each bank associated with hardware units such as caches, memory controllers, or interconnects. The IA32_MCi_STATUS register logs core error status, containing the MCA error code (MCACOD) in bits 15:0, model-specific error codes in bits 31:16, and flags like the valid bit (VAL, bit 63) to indicate populated data. The IA32_MCi_ADDR register holds the physical, linear, or segment-offset address of the fault if the address valid (ADDRV) flag (bit 58 in IA32_MCi_STATUS) is set. The IA32_MCi_MISC register provides supplementary details, such as error correction code (ECC) syndromes for memory errors or PCI Express requestor IDs for input/output MCA (IOMCA) errors, validated by the miscellaneous valid (MISCV) flag (bit 59 in IA32_MCi_STATUS). Status flags in IA32_MCi_STATUS, including uncorrected (UC, bit 61), overflow (OVER, bit 62), and processor context corrupt (PCC, bit 57), help pinpoint the fault's characteristics and impact.^[1]

Interpreting this data follows a structured sequence to ensure accuracy. First, verify the valid bit (e.g., MCi_STATUS.VAL=1) to confirm the bank contains relevant error information; invalid banks are skipped. Next, decode the MCACOD field to classify the error type, using architecture-defined simple codes (e.g., 0001H for unclassified errors) or compound codes (e.g., formats like 1MMMCCCC for memory errors or 1PPTRRRRIILL for bus/QuickPath Interconnect errors) as specified in MCA tables. If ADDRV is set, read the fault address from IA32_MCi_ADDR and interpret it according to the address mode indicated in IA32_MCi_MISC (bits 8:6), which matters for contexts like uncorrected recoverable errors. Finally, examine status flags and miscellaneous data for precise fault localization, such as ECC syndromes (bits 63:32 in IA32_MCi_MISC) that identify affected memory bits.^[1]

Decoding faces challenges, particularly in high-error environments where bank overflow occurs, signaled by the OVER flag when a new error arrives before the previous record has been cleared, potentially losing information on multiple faults. In scenarios with frequent corrected errors, the corrected error count (bits 52:38 in IA32_MCi_STATUS) mitigates this when the MCA capabilities register (IA32_MCG_CAP) reports MCG_CMCI_P. Additionally, decoding responsibilities may split between firmware and the operating system. When IA32_MCG_CAP reports enhanced MCA support (MCG_EMC_P), firmware can process errors first and update the status registers, marking such records with the firmware-updated indicator (bit 37 of IA32_MCi_STATUS), while the OS performs final interpretation and clearing; this requires careful coordination to avoid conflicts. For instance, in System Management Mode (SMM), firmware handlers may intercept and process machine checks before OS involvement.^[1]

The output of decoding typically produces human-readable summaries that aggregate the interpreted data, detailing the error location (e.g., specific cache level or memory channel from MCACOD and addresses), type (e.g., "corrected single-bit ECC error" from syndrome bits), and recovery status (e.g., based on UC or PCC flags indicating if the error was contained or led to context corruption). These summaries often include hexadecimal register values alongside descriptive text for troubleshooting, such as "MCACOD=0x00C0: corrected memory scrubbing error on channel 0 at physical address 0xABCDEF" with associated miscellaneous context like DIMM identifiers.^[1]
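The first decoding steps can be sketched as a field extractor for IA32_MCi_STATUS, using only the bit positions listed above. The `McStatus` type and function name are hypothetical, and a real decoder would go on to interpret MCACOD and the ADDR and MISC registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class McStatus:
    valid: bool            # VAL, bit 63
    overflow: bool         # OVER, bit 62
    uncorrected: bool      # UC, bit 61
    miscv: bool            # MISCV, bit 59
    addrv: bool            # ADDRV, bit 58
    pcc: bool              # PCC, bit 57
    corrected_count: int   # bits 52:38
    mscod: int             # model-specific error code, bits 31:16
    mcacod: int            # architectural error code, bits 15:0


def _bit(value: int, n: int) -> bool:
    return bool((value >> n) & 1)


def decode_mci_status(value: int) -> McStatus:
    """Extract the architectural fields from a raw 64-bit status value."""
    return McStatus(
        valid=_bit(value, 63),
        overflow=_bit(value, 62),
        uncorrected=_bit(value, 61),
        miscv=_bit(value, 59),
        addrv=_bit(value, 58),
        pcc=_bit(value, 57),
        corrected_count=(value >> 38) & 0x7FFF,  # 15-bit field
        mscod=(value >> 16) & 0xFFFF,
        mcacod=value & 0xFFFF,
    )
```

A decoder would skip any bank whose `valid` field is false and consult IA32_MCi_ADDR only when `addrv` is true.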

Diagnostic Tools

Diagnostic tools for machine-check exceptions primarily consist of software utilities that parse kernel-generated logs, decode error data, and generate human-readable reports to aid in identifying hardware faults. In Linux environments, open-source tools such as rasdaemon serve as the primary daemon for monitoring and logging Reliability, Availability, and Serviceability (RAS) events, including machine-check exceptions from CPU banks and memory controllers.^[25] Rasdaemon captures events via kernel tracepoints, stores them in a database, and supports querying through utilities like ras-mc-ctl, which can display summaries of uncorrectable and correctable errors across components such as memory and PCIe devices.^[26] It has largely replaced the older mcelog tool, which previously handled similar parsing of /dev/mcelog data but lacks support for modern kernel features and architectures like recent AMD processors.^[27] Another complementary open-source utility is EDAC-utils, which focuses on memory diagnostics by interfacing with the kernel's Error Detection and Correction (EDAC) subsystem to report ECC errors, including correctable single-bit flips and uncorrectable multi-bit failures in DIMMs.^[28] Vendor-specific tools extend these capabilities for proprietary architectures. Intel's Processor Diagnostic Tool performs stress tests and feature verification on Intel CPUs, helping to isolate processor-related issues that trigger machine-check exceptions, such as cache or execution unit failures, by simulating loads and checking for anomalies.^[29] For AMD systems, rasdaemon provides native support for logging processor errors via AMD's machine-check architecture, though additional profiling can be done with AMD uProf, which uses Instruction-Based Sampling (IBS) to analyze performance hotspots that may correlate with error-prone hardware states.^[30] Practical usage of these tools often involves command-line interfaces for real-time monitoring and reporting. 
For instance, invoking ras-mc-ctl --summary on a Linux system outputs a concise overview of error counts by category, such as memory correctables or PCIe AER events, allowing administrators to quickly assess system health without delving into raw logs.^[31] Similarly, edac-util -v displays verbose status of loaded EDAC drivers and detected memory controllers, highlighting any accumulated errors. These tools can integrate with broader monitoring frameworks, such as Nagios, by scripting periodic executions of their output commands to trigger alerts on error thresholds, enabling proactive hardware surveillance in enterprise setups.^[32] Despite their utility, these diagnostic tools have inherent limitations tied to underlying system support. Rasdaemon and its predecessors like mcelog require specific kernel configurations, such as enabled MCE recovery and EDAC drivers, which may not be active by default on all distributions or hardware platforms, potentially missing events if not properly set up.^[27] Additionally, coverage for corrected errors can be incomplete in certain scenarios, as transient single-bit flips might not always trigger full logging or decoding due to hardware filtering or kernel optimizations prioritizing uncorrectable faults, leading to underreporting of subtle degradation patterns.^[33] These tools provide practical interfaces to the decoding process outlined in machine-check data analysis but rely on accurate kernel-level capture for reliability.
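A Nagios-style check built on such counters might look like the sketch below. It assumes the corrected and uncorrected error counts have already been obtained by some other means; the function name and the default threshold of 10 corrected errors are illustrative choices, while the return codes 0, 1, and 2 follow the usual Nagios plugin convention for OK, WARNING, and CRITICAL.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_memory_errors(ce_count: int, ue_count: int, ce_warn: int = 10):
    """Return (exit_code, message) for the given error counters.

    Any uncorrected error is critical; corrected errors only warn once they
    reach ce_warn, an arbitrary threshold chosen for this sketch.
    """
    if ue_count > 0:
        return CRITICAL, f"CRITICAL: {ue_count} uncorrected error(s)"
    if ce_count >= ce_warn:
        return WARNING, f"WARNING: {ce_count} corrected error(s)"
    return OK, f"OK: {ce_count} corrected error(s)"
```

A wrapper script would print the message and exit with the returned code so that the monitoring system can raise the matching alert.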

Mitigation and Prevention

Common Causes

Machine-check exceptions are frequently triggered by hardware failures within the processor or associated components. Memory errors are among the most prevalent causes: DRAM cells can be defective, or high-energy particles produced by cosmic rays can strike memory cells and flip bits (soft errors), and when the damage exceeds what ECC can correct, the memory controller reports uncorrectable data corruption.^[34] Overheating CPUs can also induce internal processor errors, such as execution unit failures or cache parity issues, as elevated temperatures degrade silicon integrity and cause transient faults during computation.^[34] Power supply instability, including voltage fluctuations or insufficient current delivery, often results in bus errors or external signaling via pins like MCERR#, disrupting data transfers and prompting the processor to assert a machine-check exception.^[3]

Environmental factors exacerbate the risk of these exceptions, particularly in data center settings. Cosmic-ray radiation is more intense at high altitudes, where thinner atmospheric shielding allows more secondary neutrons to reach the equipment, increasing soft error rates in memory subsystems by factors of 2 to 10 compared to sea level. In dense server racks, electromagnetic interference (EMI) from adjacent power supplies, cabling, or cooling fans can induce noise on buses or interconnects, leading to parity errors or protocol violations that manifest as external hardware faults.^[35]

Design-related issues in systems pushed beyond specifications also contribute significantly to machine-check triggers. Marginal silicon quality in overclocked processors may cause microarchitectural faults or illegal vector errors under elevated clock speeds, as the hardware operates outside its validated thermal and voltage envelopes.^[34] In aging multi-socket boards, wear on interconnects like Intel's QuickPath Interconnect (QPI) or Ultra Path Interconnect (UPI) can lead to transaction failures or checksum mismatches, generating bus or cache hierarchy errors over time due to signal degradation.^[34]

Statistics underscore the dominance of memory-related causes, with RAM errors accounting for the majority of machine-check events in production systems, stemming from both transient soft errors and permanent defects. Earlier estimates put DRAM soft error rates at roughly 1,000 to 5,000 FIT (failures in time, or errors per billion device-hours) per Mbit, whereas a large-scale field study reported much higher aggregate rates of 25,000 to 70,000 FIT per Mbit across all error types.^[36] These rates highlight why memory subsystems are a primary focus for error logging in machine-check architecture.^[3]
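To make the FIT figures above concrete, the short Python sketch below converts a rate in FIT per Mbit into an expected number of errors per year for a memory module of a given capacity. It is a back-of-the-envelope illustration: the function name is hypothetical, one Mbit is taken as 2^20 bits, and the module is assumed to be powered on for all 8,760 hours of the year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming continuous operation
MBIT_PER_GIB = 1024 * 8    # 1 GiB = 2**30 bytes = 8,192 Mbit (Mbit = 2**20 bits)


def expected_errors_per_year(fit_per_mbit, capacity_gib):
    """Expected errors per year for a module of capacity_gib at the given FIT/Mbit.

    FIT counts failures per 10**9 device-hours, so the rate is scaled by the
    number of Mbit in the module and by the hours in one year.
    """
    mbits = capacity_gib * MBIT_PER_GIB
    return fit_per_mbit * mbits * HOURS_PER_YEAR / 1e9


# 25,000 FIT per Mbit on a 1 GiB module works out to about 1,794 errors per year.
print(round(expected_errors_per_year(25_000, 1), 3))
```

The result is a mean over a whole population of modules and does not describe any individual DIMM, so it is better read as a fleet-level planning number than as a per-machine prediction.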

Strategies for Error Handling

Recovery techniques for machine-check exceptions (MCEs) focus on isolating faulty components so that the system can keep running where possible. For memory errors reported via MCE, Linux implements page retirement, also known as bad page offlining, in which affected memory pages are marked as unusable and removed from the active memory pool to prevent further data corruption. User-space tools like rasdaemon process MCE reports and trigger offlining when errors are detected. For processor faults, Linux supports CPU hot-unplug: the CPU hotplug mechanism migrates workloads away from a faulty core or socket and takes it offline without a full system shutdown. Rasdaemon scripts can invoke CPU offline operations in response to uncorrectable MCEs on specific cores.^[37]^[38]^[39]

Prevention strategies emphasize hardware and firmware measures that reduce the likelihood of MCEs. Error-Correcting Code (ECC) memory stores redundant check bits alongside the data, which lets it detect and correct single-bit errors in real time and significantly lowers the rate of uncorrectable errors that trigger MCEs. Regular memory scrubbing, performed by hardware controllers, periodically reads and rewrites memory contents to correct latent single-bit errors before they accumulate into uncorrectable multi-bit faults. Redundancy techniques such as RAID arrays for storage and clustering for compute nodes provide failover, preserving service availability even if an MCE takes down a node. Firmware updates address CPU errata that may spuriously generate MCEs, with vendors like Intel and AMD releasing microcode patches to mitigate known hardware defects.^[40]^[41]^[42]

Monitoring approaches enable proactive intervention by tracking MCE patterns. Threshold-based alerts, configured in tools like rasdaemon, notify administrators when error rates exceed predefined limits, such as multiple corrected errors on the same memory bank, and can trigger actions like logging or component isolation. In cloud environments, machine learning models analyze historical sensor data to predict hardware failures, enabling preemptive maintenance in large-scale RAS (Reliability, Availability, and Serviceability) setups.^[25]

Best practices include configuring BIOS/UEFI settings to enable MCE reporting and logging so that the kernel receives full hardware error details, as seen in the processor error reporting options on Dell and HPE systems. Tools like mce-inject simulate MCEs in a controlled manner, making it possible to validate recovery mechanisms without risking production data.^[43]^[44]
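The combination of threshold-based monitoring and page retirement described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not rasdaemon's actual implementation: the class name, the default threshold, the time window, and the 4 KiB page size are all hypothetical choices. The one real interface it relies on is the Linux sysfs file /sys/devices/system/memory/soft_offline_page, which accepts a physical address and asks the kernel to migrate and retire the containing page (this requires root privileges and a kernel built with memory-failure support).

```python
import time
from collections import defaultdict, deque
from pathlib import Path

PAGE_SIZE = 4096  # assumed page size for the example
SOFT_OFFLINE = Path("/sys/devices/system/memory/soft_offline_page")


class PageErrorTracker:
    """Retire a page once it logs `threshold` corrected errors within `window_s` seconds."""

    def __init__(self, threshold=3, window_s=3600.0, offline_path=SOFT_OFFLINE):
        self.threshold = threshold
        self.window_s = window_s
        self.offline_path = Path(offline_path)
        self._events = defaultdict(deque)  # page base address -> error timestamps
        self.offlined = set()

    def record(self, phys_addr, now=None):
        """Record one corrected error; return True if this call offlined the page."""
        now = time.time() if now is None else now
        page = phys_addr & ~(PAGE_SIZE - 1)  # round down to the page boundary
        if page in self.offlined:
            return False
        events = self._events[page]
        events.append(now)
        # Drop timestamps that have aged out of the sliding window.
        while events and now - events[0] > self.window_s:
            events.popleft()
        if len(events) >= self.threshold:
            self._soft_offline(page)
            return True
        return False

    def _soft_offline(self, page):
        # Writing the physical address asks the kernel to soft-offline the page.
        self.offline_path.write_text(f"{page:#x}\n")
        self.offlined.add(page)
        del self._events[page]
```

In use, a daemon would call record() with the physical address taken from each decoded corrected-error event. Passing a temporary file as offline_path allows the policy to be exercised without touching real memory, which complements injection-based testing of the kernel's own recovery paths.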

References

[1]
[PDF] Intel® 64 and IA-32 Architectures Software Developer's Manual
... Machine-Check Exception (#MC) ... Definition ... Intel® Xeon® Processor 7400 Model ...
[2]
machinecheck
When a machine check exception occurs for a non corrected machine check the kernel can take different actions. Since machine check exceptions can happen any ...
[3]
[PDF] Machine check handling on Linux - Andi Kleen
This architecture is implemented by modern x86 CPUs from Intel and AMD. It consists of a standard exception (interrupt 18) for machine checks and some ...
[4]
[PDF] Volume 3 (3A, 3B, 3C & 3D): System Programming Guide - Intel
NOTE: The Intel 64 and IA-32 Architectures Software Developer's Manual consists of four volumes: Basic Architecture, Order Number 253665; Instruction Set ...
[5]
[PDF] Systems Reference Library IBM System/360 Principles of Operation
This manual describes the IBM System/360's structure, arithmetic, logical, branching, status switching, input/output, and interruption operations.
[6]
[PDF] Intel Architecture Software Developer's Manual
Machine-Check Architecture. Describes the machine-check architecture, which was introduced into the IA with the Pentium® processor. Chapter 14 ...
[7]
[PDF] AMD Processor Recognition - kib.kiev.ua
Execute the CPUID instruction and check whether an illegal instruction exception occurs. If an exception occurs, the processor does not have CPUID support. □.
[8]
HW-SW Interface Design and Implementation for Error Logging and ...
Sep 27, 2023 · HW-SW Interface Design and Implementation for Error Logging and Reporting for RAS in RISC-V Architectures ... “Machine Check Handling on Linux.” ...
[9]
https://documentation-service.arm.com/static/5f8db40c4966cd7c95fd58e3
[10]
[PDF] IBM z15 (8562) Technical Guide
For the most up-to-date information regarding this product, consult the product documentation or subsequent updates of this book. Page 5. © Copyright IBM Corp.
[11]
[PDF] Intel® 64 and IA-32 Architectures Software Developer's Manual
... machine-check exception mechanism found in the Pentium 4, Intel Xeon, Intel Atom, and P6 family processors. ... machine check capability is also given ...
[12]
[PDF] AMD64 Architecture Programmer's Manual, Volume 2
... 9.2. Determining Machine-Check Architecture Support ... 9.3. Machine Check Architecture MSRs ...
[13]
[PDF] Open-Source Register Reference For AMD Family 17h Processors ...
Jul 3, 2018 · ... MCA bank is visible to exactly one thread in a system, and that error notifications are directed to that thread. Hardware also makes MCA ...
[14]
Understanding Hardware Error Handling in Linux: MCA Explained
May 30, 2025 · ... machine-check exception (#MC). Understanding Uncorrected errors. The #MC is an abort-class exception, meaning once it's encountered there's ...
[15]
Windows Hardware Error Architecture Definitions - Microsoft Learn
Mar 26, 2025 · WHEA uses ETW to notify subscribers about the hardware error events and to record hardware error events in the system event log.
[16]
Whea overview - Windows drivers | Microsoft Learn
Jan 23, 2023 · The WHEA_ERROR_PACKET_V1 structure describes the hardware error data that is passed to the operating system by a low-level hardware error ...
[17]
Reliability, Availability and Serviceability — The Linux Kernel documentation
[18]
https://learn.microsoft.com/en-us/windows-hardware/drivers/whea/introduction-to-the-windows-hardware-error-architecture
[19]
Introduction to the Windows Hardware Error Architecture (WHEA)
Mar 26, 2025 · Provides mechanisms to help recover from hardware errors to avoid causing a bug check when a hardware error is nonfatal. Supports user-mode ...
[20]
Bug Check 0x124 WHEA_UNCORRECTABLE_ERROR
Jan 3, 2023 · The WHEA_UNCORRECTABLE_ERROR bug check has a value of 0x00000124 and indicates that a fatal hardware error has occurred. This bug check uses ...
[21]
https://www.ibm.com/docs/en/zos/2.5.0?topic=wsc-02e
[22]
02E - IBM
Analyze the messages provided by the operator to determine the cause of the error. Look at any I/O, machine check, missing interrupt handler, or disabled ...
[23]
https://www.freshports.org/sysutils/mcelog/
[24]
https://forums.freebsd.org/threads/hw-mca-enabled-0.50365/
[25]
https://github.com/mchehab/rasdaemon
[26]
mchehab/rasdaemon - GitHub
EDAC is a Linux kernel subsystem with handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures.
[27]
ras-mc-ctl(8) — rasdaemon — Debian testing
Apr 25, 2025 · --summary: Presents a summary of the logged errors. --errors: Shows the errors stored at the error database. --error-count: Shows the corrected ...
[28]
Which of mcelog and rasdaemon should I use for monitoring of ...
Sep 8, 2025 · Issue. What of mcelog and rasdaemon should I use for monitoring of hardware, for which usecases? How to verify that rasdaemon is working?
[29]
edac-util(1): EDAC error reporting utility - Linux man page - Die.net
Displays the current status of EDAC drivers. edac-util will report whether it detects that EDAC drivers are loaded, and the number of memory controllers (MCs) ...
[30]
Intel® Processor Diagnostic Tool
The diagnostic tool checks for brand identification, verifies the processor operating frequency, tests specific processor features, and performs a stress test ...
[31]
[PDF] uProf User Guide | AMD
In this profile, AMD uProf uses the IBS supported by the AMD x64-based processor to diagnose the performance issues in hot spots. It collects data on how ...
[32]
https://serverfault.com/questions/643542/how-do-i-get-notified-of-ecc-errors-in-linux
[33]
Chapter 32. Using Advanced Error Reporting | 8
The command displays a summary of the logged errors (the --summary option) or displays the errors stored in the error database (the --errors option). Additional ...
[34]
How do I get notified of ECC errors in Linux? - Server Fault
Nov 11, 2014 · Running mcelog is generally a good idea. Depends on the system, but uncorrectable/correctable ECC errors are likely reported as machine check exception (MCE), ...
[35]
How Twitter uses rasdaemon for hardware reliability - Blog
Jan 6, 2023 · MCE (Machine Check Exception) events across a variety of platform types. This replaces mcelog for collecting/exposing hardware failures ...
[36]
[PDF] Intel® 64 and IA-32 Architectures Software Developer's Manual
This is Volume 3A, Part 1 of the Intel 64 and IA-32 manual, which is a system programming guide. The manual has nine volumes.
[37]
Data Center Electromagnetic Interference and Tier Standards
Low-frequency EMI in data centers is usually caused by power supplies. This type of EMI damages associated hardware. It can corrupt the data in servers and can ...
[38]
[PDF] DRAM Errors in the Wild: A Large-Scale Field Study
Jun 19, 2009 · For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion.
[39]
mcelog -- further reading
A classic study on the benefits of automatic bad page offlining: "Assessment of the Effect of Memory Page Retirement on Systems RAS against Hardware Faults", ...
[40]
[PDF] mcelog: memory error handling in user space - Andi Kleen
The paper describes memory error handling in user space on Linux systems using the mcelog daemon. It describes features like bad page offlining or cache error ...
[41]
ECC Technical Details - MemTest86
The purpose of On-die ECC is to protect the integrity of data stored in ... Machine Check Architecture (MCA) is an x86-specific mechanism for CPUs to ...
[42]
Memory scrubbing - Wikipedia
Memory scrubbing consists of reading from each computer memory location, correcting bit errors (if any) with an error-correcting code (ECC), and writing the ...
[43]
PowerEdge: CPU Machine Check Errors | Dell US
Jul 25, 2025 · This article provides information about CPU Machine Check errors and common causes and proper handling when errors are seen.
[44]
https://github.com/andikleen/mce-inject
[45]
For Developers - mcelog
X86 CPUs report errors detected by the CPU as machine check events (MCEs). These can be data corruption detected in the CPU caches, in main memory by an ...
[46]
AMD Milan Processors reboot with Machine Check Exceptions - Dell
May 9, 2025 · Enter the F2 BIOS. · Under System BIOS Settings > Processor Settings, enable the setting "AMD IC Config Disable IT Bypass" · Save the changes and ...
[47]
andikleen/mce-inject: Linux machine check injection tool - GitHub
Linux machine check injection tool. Contribute to andikleen/mce-inject development by creating an account on GitHub.