Machine-check exception
A machine-check exception (MCE), written #MC in Intel documentation, is an abort-class exception (vector 18) that the CPU raises when it detects an internal hardware error, such as a fault in memory, caches, buses, or TLBs, that hardware cannot fully correct on its own.[1] The mechanism is part of the broader Machine-Check Architecture (MCA), which lets operating systems log, diagnose, and respond to these errors, from recoverable faults to those requiring system shutdown.[1] Introduced in the Pentium processor and enhanced in subsequent x86 families such as P6, Pentium 4, Xeon, and Atom, MCE relies on Model-Specific Registers (MSRs) to report detailed error information.[1]

The MCA organizes error reporting into "banks," each corresponding to a hardware subsystem (for example, one for the data cache and another for the bus unit). The number of banks varies by model and is given by the Count field of the IA32_MCG_CAP MSR; the field allows up to 255 banks in theory, though real processors implement far fewer.[1] Global MSRs such as IA32_MCG_CAP report the bank count and other capabilities, while per-bank MSRs such as IA32_MCi_STATUS record specifics: whether the logged error is valid, whether it is uncorrected (UC bit set), and whether the processor context may be corrupt (PCC bit set).[1] Enabling MCE requires setting the MCE bit in CR4. Once it is set, uncorrectable errors raise the exception asynchronously during instruction execution; if a second machine check arrives while the first is still being handled, the processor enters shutdown.[1] In virtualized environments, such as those using Intel VT-x, MCEs can cause VM exits so that the hypervisor can intervene.[1]

Errors fall into two primary categories. Corrected errors are fixed automatically by hardware (for example, single-bit ECC memory corrections) and are often reported through a Corrected Machine-Check Interrupt (CMCI) so they can be logged without halting execution. Uncorrected errors trigger #MC and are further classified as recoverable (software may continue after mitigation) or fatal (demanding immediate shutdown).[1]

In Linux, the kernel's MCE handler, rewritten for x86-64 in version 2.5, logs these events for decoding by tools like mcelog. It polls for corrected errors every 5 minutes by default, and tolerance levels (for example, panic on uncorrected errors or signal user processes) are configurable via sysfs. Uncorrected errors in kernel mode typically panic the system, while those in user space may kill the affected process with SIGBUS, which improves reliability in server and high-availability environments.[3] Overall, MCE provides essential diagnostics for hardware faults and supports proactive maintenance in data centers and embedded systems.[3]
Fundamentals
Definition and Purpose
A machine-check exception (MCE) is an exception generated by the CPU when it detects an uncorrectable hardware error, such as a memory fault, bus error, or cache inconsistency, alerting the operating system to the problem.[4] These exceptions are part of the broader Machine Check Architecture (MCA) in x86 processors, which provides a standardized framework of model-specific registers (MSRs), such as IA32_MCG_CAP and IA32_MCi_STATUS, to detect, log, and report both correctable and uncorrectable errors.[4]

The primary purpose of MCEs is to let software respond appropriately to severe hardware faults through error logging, graceful system degradation, recovery attempts, or controlled shutdown, averting data corruption and maintaining overall system integrity.[4] Correctable errors, by contrast, are resolved transparently by hardware without software intervention, so transient issues cause minimal disruption while uncorrectable ones are escalated for explicit handling.[4] MCEs can be synchronous, tied directly to the execution of the specific instruction that provoked the error, or asynchronous, arising from external or independent hardware failures unrelated to the current program flow.[4]

In Intel x86 processors, MCA is the established standard for detailed error reporting, and MCEs are delivered as exception vector 18; the local Advanced Programmable Interrupt Controller (APIC) is involved only in delivering the separate corrected machine-check interrupt.[4] For instance, the Machine Check In Progress (MCIP) flag in the IA32_MCG_STATUS MSR (bit 2) is set when an MCE is raised. Software clears it once handling is complete, and a further machine check that arrives while it is still set forces the processor into shutdown instead of re-entering the handler.[4]
Historical Development
The concept of machine-check exceptions originated in mainframe computing with the IBM System/360 architecture, announced in 1964 and detailed in its principles of operation by 1967. This system introduced machine check interruptions as a high-priority mechanism to detect hardware malfunctions, such as data parity errors in main storage or channel control issues, enabling diagnostic scans and recovery actions while preserving system state for fault analysis.[5]

In the x86 architecture, the machine-check exception first appeared in the Intel Pentium processor in 1993, which provided basic reporting of internal errors such as parity failures through a pair of model-specific registers.[6] The P6 family, beginning with the Pentium Pro in 1995, generalized this into the Machine Check Architecture (MCA), with banked registers for logging uncorrectable errors, laying the groundwork for more robust hardware diagnostics. AMD adopted MCA support in its K7 (Athlon) processor family starting in 1999, aligning with x86 standards through CPUID feature bit 14 (the MCA flag) to enable machine check reporting in compatible systems.[7]

The x86 MCA specification evolved from 1996 onward through Intel's architecture documentation, standardizing error registers and exception handling across processors. A significant milestone came in 2008 with Intel's Nehalem microarchitecture, which expanded the set of MCA banks (multiple sets of registers for detailed error logging) to handle the growing volume of potential faults in multi-core designs.
By the 2020s, extensions appeared in other architectures: ARMv8-A incorporated machine check exceptions as part of its Reliability, Availability, and Serviceability (RAS) features for uncorrectable hardware errors, while RISC-V developed error reporting mechanisms, including machine check-like synchronous exceptions for hardware faults, through ongoing ISA extensions.[8] These developments were driven by the escalating complexity of multi-core processors and the reliability demands of data centers, where undetected hardware errors could cascade into system-wide failures.[3]
Hardware Detection
Mechanisms in Modern CPUs
In modern x86 CPUs, hardware errors are detected by monitoring logic built into components such as error-correcting code (ECC) memory controllers, cache parity checkers, translation lookaside buffers (TLBs), and interconnect buses. These monitors continuously validate data integrity during reads, writes, and transmissions, flagging anomalies such as single-bit flips or multi-bit corruptions through dedicated signals or internal status bits. When an error is detected, the processor captures its details and, if the error is uncorrectable, vectors to the machine-check exception (#MC) handler at vector 18, provided the CR4.MCE bit is set. Severe errors therefore interrupt normal execution promptly while the processor state is preserved for analysis.[1]

Error information is logged in Machine Check Architecture (MCA) banks, which are sets of model-specific registers (MSRs) dedicated to storing error records. Each bank includes registers such as IA32_MCi_STATUS (error validity, type, and codes), IA32_MCi_ADDR (the physical address involved), and IA32_MCi_MISC (additional syndrome or request data), which together identify the affected hardware unit. Modern Intel CPUs support up to 28 MCA banks per logical processor, with the exact count reported by the IA32_MCG_CAP MSR. Banks are allocated to specific subsystems, such as one for the L3 cache and another for the memory controller. If a new error arrives while a bank still holds a valid record, hardware overwrite rules decide which record is kept, and the bank's overflow (OVER) flag notes that an error was lost.[1]

Error handling also differs depending on whether the error is asynchronous (occurring independently of the current instruction, for example through an external bus event) or corrected (automatically repaired, for example by ECC single-error correction).
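The status fields described above can be illustrated with a small decoder. This is a minimal sketch, not a kernel implementation: the bit positions follow the architectural IA32_MCi_STATUS layout documented by Intel, while the `BankStatus` type, the `decode_mci_status` helper, and the sample value are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BankStatus:
    valid: bool        # VAL, bit 63: the register holds a logged error
    overflow: bool     # OVER, bit 62: an error arrived while one was already logged
    uncorrected: bool  # UC, bit 61: hardware could not correct the error
    enabled: bool      # EN, bit 60: reporting was enabled for this error
    miscv: bool        # MISCV, bit 59: IA32_MCi_MISC holds valid data
    addrv: bool        # ADDRV, bit 58: IA32_MCi_ADDR holds valid data
    pcc: bool          # PCC, bit 57: processor context corrupt
    mscod: int         # bits 31:16, model-specific error code
    mcacod: int        # bits 15:0, architectural error code


def _bit(value: int, pos: int) -> bool:
    """Return True when bit `pos` of `value` is set."""
    return bool((value >> pos) & 1)


def decode_mci_status(raw: int) -> BankStatus:
    """Split a raw 64-bit IA32_MCi_STATUS value into named fields."""
    return BankStatus(
        valid=_bit(raw, 63),
        overflow=_bit(raw, 62),
        uncorrected=_bit(raw, 61),
        enabled=_bit(raw, 60),
        miscv=_bit(raw, 59),
        addrv=_bit(raw, 58),
        pcc=_bit(raw, 57),
        mscod=(raw >> 16) & 0xFFFF,
        mcacod=raw & 0xFFFF,
    )


# Hypothetical sample value, constructed by hand for illustration only:
# VAL, UC, EN and ADDRV set, with MCACOD 0x0090.
sample = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 58) | 0x0090
status = decode_mci_status(sample)
print(status)
```

A handler would normally check `valid` first and treat every other field as meaningless when it is clear; `addrv` and `miscv` play the same gatekeeping role for the address and miscellaneous registers.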
The CR4.MCE bit (bit 6) enables #MC exceptions globally. When it is set, uncorrectable asynchronous errors trigger an immediate #MC, which can lead to an asynchronous enclave exit (AEX) in Intel SGX environments. Corrected errors are logged in MCA banks without raising an exception; once a configured threshold is reached, a corrected machine-check interrupt (CMCI) may be signaled through the local APIC, using whatever vector system software has programmed for it instead of a fixed one. If CR4.MCE is clear, an uncorrectable error instead causes the processor to shut down, possibly accompanied by signals such as BINIT# or MCERR#, bypassing the exception pathway. Corrected errors, even when CR4.MCE is set, favor non-disruptive logging to maintain system performance.[1]

The detection-to-handler flow proceeds as follows: hardware monitors identify an error and assert internal flags; the error is captured in the MCA bank that belongs to the detecting unit, with status bits updated to mark validity and recovery status; if the error warrants interruption (uncorrectable and CR4.MCE set), the CPU delivers the #MC exception, pushing the current state onto the stack and transferring control to the vector 18 handler so the OS can trap it. This sequence minimizes latency while preserving error context, such as model-specific error codes (MSCOD) and corrected/uncorrected indicators, for subsequent decoding of categories such as cache or memory errors.[1]
Supported Architectures
Machine-check exceptions are supported across several major processor architectures, each implementing mechanisms tailored to its design principles for detecting and reporting hardware errors such as cache failures, bus errors, and memory faults.

In the x86 family, Intel introduced the Machine Check Architecture (MCA) with the Pentium Pro processor in 1995, providing a framework for error reporting through dedicated model-specific registers (MSRs).[3] Subsequent Intel processors, starting from the Pentium II and extending to modern Core and Xeon series, expanded MCA with up to 28 error-reporting banks per core, each containing status, address, and miscellaneous registers to capture error details like syndrome and location. AMD implemented compatible MCA support beginning with the K7 (Athlon) family in 1999, aligning with x86 standards but later introducing variations such as the additional IPID (IP Block Identifier) register in scalable MCA systems for enhanced error source identification in modern Ryzen and EPYC processors. These bank structures enable detailed logging of uncorrectable and correctable errors, with Intel emphasizing status/address/misc triads and AMD adding IPID for modular IP block tracing.

ARM architecture incorporated Reliability, Availability, and Serviceability (RAS) extensions starting with ARMv8.2 in 2016, formalized as a mandatory feature by 2018 implementations, which include error record registers (ERRSELR, ERXSTATUS, ERXMISC, and ERXADDR) analogous to x86 MCA banks for capturing asynchronous errors.
These registers support an implementation-defined number of error records per processor, facilitating firmware or OS-level error analysis similar to MCA, with features for corrected error counting and injection testing.[9]

IBM's Power architecture has long supported machine-check interrupts through hardware error facilities, enabling detection of processor, memory, and I/O faults via dedicated interrupt vectors and recovery units; modern POWER9 and POWER10 processors combine these interrupts with dynamic reconfiguration. Similarly, z/Architecture in IBM Z mainframes provides machine-check interrupts for comprehensive error reporting, with the z15 processor (introduced in 2019) enhancing firmware-first handling through the Integrated Firmware Processor (IFP) and Licensed Internal Code (LIC), allowing initial error interception and recovery at the firmware level before escalating to the OS, reducing outage risks.[10]

RISC-V introduced RAS extensions in its ratified privileged architecture specification (version 1.12) in 2021, defining standard mechanisms for error injection, reporting, and interrupt handling via memory-mapped registers and the RAS Event interrupt, supporting scalable error records for cores and accelerators without fixed bank counts, emphasizing modularity for custom implementations.

| Architecture | Introduction Year | Key Features |
|---|---|---|
| x86 (Intel) | 1995 (Pentium Pro) | Up to 28 MCA banks; status/address/misc registers; scalable error logging. |
| x86 (AMD) | 1999 (K7/Athlon) | MCA banks with IPID for IP block identification; compatible with Intel but enhanced for modular dies. |
| ARM | 2016 (ARMv8.2) | Implementation-defined number of error record registers (ERRSELR et al.); asynchronous error reporting; corrected error support. |
| IBM Power | 1991 (PowerPC) | Machine-check interrupts; dynamic recovery units. |
| IBM z/Architecture | 2000 (z900) | Machine-check interrupts; firmware-first via IFP/LIC (enhanced in z15, 2019).[10] |
| RISC-V | 2021 (Priv. v1.12) | Memory-mapped RAS records; error injection; RAS Event interrupt; modular banks. |
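
For the x86 rows, the bank layout can be made concrete with a short sketch. It assumes the legacy architectural MSR numbering documented by Intel, in which the global registers start at 0x179 and each bank occupies four consecutive MSRs starting at 0x400; the helper names and the sample IA32_MCG_CAP value are hypothetical, and other architectures (or AMD's scalable MCA register space) are laid out differently.

```python
from typing import Dict

IA32_MCG_CAP = 0x179     # global capabilities (bank count in bits 7:0)
IA32_MCG_STATUS = 0x17A  # global status (RIPV, EIPV, MCIP)
IA32_MC0_CTL = 0x400     # first register of bank 0


def bank_count(mcg_cap: int) -> int:
    """Extract the Count field (bits 7:0) from a raw IA32_MCG_CAP value."""
    return mcg_cap & 0xFF


def bank_msrs(bank: int) -> Dict[str, int]:
    """Return the MSR numbers of the four legacy registers of one bank."""
    if bank < 0:
        raise ValueError("bank index must be non-negative")
    base = IA32_MC0_CTL + 4 * bank
    return {"CTL": base, "STATUS": base + 1, "ADDR": base + 2, "MISC": base + 3}


# Hypothetical capability value for illustration; real values come from
# reading the MSR on the machine in question.
sample_cap = 0x0F000C14
for index in range(bank_count(sample_cap)):
    regs = bank_msrs(index)
    print(f"bank {index:2d}: STATUS=0x{regs['STATUS']:X} ADDR=0x{regs['ADDR']:X}")
```

Software that walks the banks typically reads the count once, then visits each bank's STATUS register and consults ADDR and MISC only when the corresponding valid bits are set.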
Error Classification
Uncorrectable vs. Correctable Errors
Machine-check exceptions (MCEs) in modern processors distinguish between uncorrectable and correctable errors to enable appropriate hardware responses, with uncorrectable errors generally triggering an immediate MCE while correctable ones are handled transparently unless thresholds are exceeded.[1][11]

Uncorrectable errors, indicated by the UC bit set to 1 in the MCi_STATUS register, cannot be automatically fixed by hardware and thus invoke a machine-check exception (#MC, vector 18) when the MCE feature is enabled via CR4.MCE=1.[1][11] These errors often lead to a system panic, or require recovery mechanisms if the error is deemed recoverable, as with multi-bit ECC failures in memory or TLB parity errors that corrupt processor state.[1] For instance, in Intel's architecture, the PCC bit in IA32_MCi_STATUS (bit 57) further classifies uncorrected errors as fatal (PCC=1, indicating processor context corruption and necessitating a reset) or recoverable (PCC=0), while AMD similarly uses the UC bit to flag errors like uncorrectable ECC or severe TLB faults that may still allow limited recovery.[1][11] Uncorrected errors are thus subdivided into fatal (context corrupted) and uncorrected recoverable (UCR) types; the latter include software recoverable action required (SRAR) and software recoverable action optional (SRAO) errors, enabling software mitigation if the processor state permits.[1]

In contrast, correctable errors, marked by UC=0 in MCi_STATUS, are detected and rectified by hardware without interrupting normal execution, for example through single-bit error correction using SECDED (Single Error Correction, Double Error Detection) in memory modules.[1][11] These errors are logged in the relevant MCi_STATUS register for later analysis but do not trigger an MCE; when a configurable threshold is surpassed, they may instead be signaled through a Corrected Machine Check Interrupt (CMCI).[1] In both Intel and AMD architectures, corrected error occurrences are monitored using threshold registers such as MCi_CTL2 or MCi_THRESHOLD to track frequency and prevent silent degradation.[1][11]

The impact of these errors varies by scope: processor-internal issues (for example, cache or TLB errors) are often confined to a single core or thread, allowing potential isolation, whereas system-wide errors (for example, interconnect or multi-channel memory failures) can affect broader operations and reduce recovery possibilities.[1][11] Recovery from uncorrectable errors depends on their classification and hardware support, such as restarting instruction execution if the RIPV bit in MCG_STATUS indicates a valid restart IP, though fatal cases typically mandate full system intervention.[1] Correctable errors, by design, support seamless recovery through hardware correction, maintaining system availability unless patterns suggest impending failure.[11]
Specific Error Categories
Machine-check exceptions encompass a range of hardware-detected faults, categorized by the subsystem affected, such as memory, processor internals, interconnects, and I/O peripherals. These categories are defined through architectural error reporting mechanisms in processors from Intel and AMD, enabling identification of the error's origin without delving into root triggers.[1][11]

Memory-related errors involve faults in the memory subsystem, including ECC failures where error-correcting codes detect and sometimes correct bit flips in data storage or transmission. DRAM scrub errors occur during periodic memory scans that identify and log correctable issues, such as single-bit errors in dynamic random-access memory modules. Unbuffered DIMM faults manifest as access anomalies in non-registered memory configurations, often reported when data integrity checks fail during read or write operations. For instance, Intel encodes memory controller errors in MCACOD with the bit pattern 0000 0000 1MMM CCCC, where MMM gives the operation and CCCC the channel, so 0x0090 indicates a memory read error on channel 0 and 0x00C0 a memory scrubbing error; in AMD systems, syndrome values like DramEccErr (ErrorCodeExt 0x0) denote DRAM ECC discrepancies.[1][11]

Processor-internal errors arise within the CPU core, encompassing microcode errors from parity issues in firmware storage, execution unit parity failures in arithmetic or logic pipelines, and floating-point unit exceptions due to internal data corruption. These are typically logged when hardware detects inconsistencies in core operations, such as invalid instruction decoding or register file errors.
Representative Intel MCACOD examples include 0x0002 for a microcode ROM parity error and 0x0005 for an internal parity error, with 0x0001 reserved for unclassified errors; AMD equivalents feature ErrorCodeExt values like 0x7 for IntErrTyp1 in execution units or hardware assertions in the core.[1][11]

Interconnect and bus errors pertain to communication failures across processor links, including PCIe link failures where transaction layer protocol violations or physical layer discrepancies disrupt data flow between the CPU and expansion devices. In multi-socket systems, QPI or UPI protocol errors signal issues like invalid packets or timeout conditions on inter-core fabrics. Intel encodes bus and interconnect faults in MCACOD with the compound pattern 0000 1PPT RRRR IILL, whose fields give the participation, timeout, request type, memory or I/O space, and hierarchy level of the failing transaction; AMD reports these via MCA_STATUS_CS with ErrorCodeExt 0x2 for GMI/xGMI link errors or timeout syndromes in the format 0000 1XXT RRRR XXLL.[1][11]

I/O and peripheral errors cover anomalies in external device interactions, such as storage controller timeouts during command execution on SATA or NVMe interfaces, and GPU coherency issues where cache line inconsistencies arise in integrated or discrete graphics processing. These are flagged when the processor's I/O hub detects unrecoverable responses or data mismatches. Examples include Intel MCACOD 0x0E0B for an I/O error and 0x0400 for an internal timer error; in AMD architectures, MCA_STATUS_PB uses ErrorCodeExt for parameter block faults, while FTI_DAT_STAT (0x3) indicates I/O status discrepancies.[1][11]

Error enumeration relies on standardized codes like Intel's MCACOD, where a value such as 0x00A0 denotes a memory write error on channel 0, providing a compact identifier for logging and analysis; whether the error was corrected is recorded separately in the UC bit, not in the code itself. These categories span both correctable and uncorrectable errors, with reporting mechanisms ensuring precise subsystem attribution.[1][11]
Software Response
Operating System Reporting
When a machine-check exception occurs, the operating system's kernel traps it and invokes a dedicated handler to process the error. In x86 architectures, for instance, the #MC exception is routed to the kernel's machine-check (MCE) handler, which captures hardware-reported details from model-specific registers (MSRs).[12] The handler parses data from Machine Check Architecture (MCA) banks, the hardware registers that store error information for specific components like caches or memory controllers, and identifies the error source, type, and affected resources.[13] The kernel then logs the parsed error data to an internal ring buffer or equivalent structure for efficient, low-overhead storage, enabling subsequent analysis without immediate system disruption.[12]

Reporting occurs through standardized formats, such as kernel console messages (e.g., "Machine check events logged") and syslog entries that include timestamps, severity levels, and decoded error summaries.[13] In Windows, the Hardware Error Architecture (WHEA) employs similar logging via Event Tracing for Windows (ETW), recording errors in the system event log with structured packets detailing the hardware fault.[14] Users gain visibility into these reports through accessible tools: in Linux, commands like dmesg display kernel ring buffer contents, while in Windows, the Event Viewer provides a graphical interface to ETW logs.[12][14] For fatal, uncorrectable errors that threaten system integrity, the kernel may trigger a panic, which produces a crash dump or, in Windows, a Blue Screen of Death (BSOD), to halt execution and preserve diagnostic data.[13][14]
Across operating systems, reporting prioritizes non-blocking mechanisms for correctable or recoverable errors, allowing the system to continue operation while queuing logs for later review, thus balancing reliability with availability.[12][15]
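The triage described above (log and continue for benign events, halt for fatal ones) can be sketched as a small classifier. This is an illustrative sketch only: it assumes the Intel IA32_MCi_STATUS bit positions for VAL, UC, PCC, S, and AR, and it assumes software error recovery is supported (the MCG_SER_P capability), since the S and AR bits are only defined in that case. The `Severity` names and the `suggested_action` policy are hypothetical and do not reproduce any particular kernel's rules.

```python
from enum import Enum
from typing import Optional

# Bit positions in IA32_MCi_STATUS.
VAL, UC, PCC, S, AR = 63, 61, 57, 56, 55


class Severity(Enum):
    CORRECTED = "corrected"
    UCNA = "uncorrected, no action required"
    SRAO = "software recoverable, action optional"
    SRAR = "software recoverable, action required"
    FATAL = "fatal"


def _bit(value: int, pos: int) -> bool:
    """Return True when bit `pos` of `value` is set."""
    return bool((value >> pos) & 1)


def classify(mci_status: int) -> Optional[Severity]:
    """Map a raw bank status value to a severity, or None if nothing is logged."""
    if not _bit(mci_status, VAL):
        return None
    if not _bit(mci_status, UC):
        return Severity.CORRECTED
    if _bit(mci_status, PCC):
        return Severity.FATAL
    signaled, action_required = _bit(mci_status, S), _bit(mci_status, AR)
    if not signaled and not action_required:
        return Severity.UCNA
    if signaled and not action_required:
        return Severity.SRAO
    if signaled and action_required:
        return Severity.SRAR
    return Severity.FATAL  # undefined combination: treat conservatively


def suggested_action(mci_status: int) -> str:
    """Hypothetical policy: what a handler might do for each severity."""
    severity = classify(mci_status)
    if severity is None:
        return "ignore"
    if severity in (Severity.CORRECTED, Severity.UCNA, Severity.SRAO):
        return "log and continue"
    if severity is Severity.SRAR:
        return "recover or terminate affected task"
    return "panic"


# Hand-built example: a valid, uncorrected error with corrupt context.
print(suggested_action((1 << VAL) | (1 << UC) | (1 << PCC)))
```

Keeping classification separate from policy mirrors the split described in the text: the hardware bits fix the severity, while the response to each severity is an operating-system choice.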
Platform-Specific Implementations
In Linux, machine-check exceptions are managed via the Machine Check Architecture (MCA) framework in the kernel, which detects and logs hardware errors such as correctable and uncorrectable events from CPU registers. The mcelog daemon parses these logs for detailed analysis and preventive maintenance, though it has been largely superseded by rasdaemon in modern distributions for collecting and decoding events across platforms. The EDAC (Error Detection and Correction) kernel module specifically handles memory controller errors, exposing statistics through the sysfs interface at /sys/devices/system/edac/ for monitoring ECC failures. Kernel boot parameters like mce=off disable machine-check handling entirely, preventing error reporting but risking undetected hardware faults.[16][17]
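As an illustration of the EDAC sysfs interface, the sketch below collects the corrected and uncorrected error counters that the kernel exposes per memory controller. It assumes the usual layout of `mc/mcN/ce_count` and `mc/mcN/ue_count` beneath the EDAC directory; the function name and the returned dictionary shape are hypothetical, and on machines without an EDAC driver loaded the function simply returns an empty result.

```python
from pathlib import Path
from typing import Dict


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> Dict[str, Dict[str, int]]:
    """Collect ce_count/ue_count for every memory controller under `root`."""
    base = Path(root)
    counts: Dict[str, Dict[str, int]] = {}
    if not base.is_dir():
        return counts  # EDAC not available on this system
    for controller in sorted(base.glob("mc[0-9]*")):
        entry: Dict[str, int] = {}
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[controller.name] = entry
    return counts


for controller, entry in read_edac_counts().items():
    print(controller, entry)
```

A monitoring job could call this periodically and alert when `ce_count` grows quickly or `ue_count` becomes non-zero; passing a different `root` makes the function easy to exercise against a test directory.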
Microsoft Windows implements machine-check exception reporting through the Windows Hardware Error Architecture (WHEA), a unified framework that captures processor-detected errors and routes them to the operating system for processing. WHEA logs these events, including details like error source and type, directly to the Windows Event Viewer under the System log, enabling administrators to diagnose issues without external tools. For uncorrectable machine-check exceptions, WHEA triggers a bug check with stop code 0x124 (WHEA_UNCORRECTABLE_ERROR), resulting in a blue screen of death to halt the system and prevent further corruption.[18][19]
IBM z/OS employs a dedicated machine-check handler (MCH) within the Licensed Internal Code (LIC) to intercept and process hardware interrupts from the central processor, generating records for failures in storage, keys, timers, or CPU components. These MCH records are stored in the system logrec data set for immediate analysis, with severe uncorrectable errors often leading to an IPL (Initial Program Load) abort to isolate the fault and reinitialize the system. Auditing of machine-check events is facilitated through System Management Facility (SMF) type 15 records, which capture statistical data on error occurrences for long-term reliability tracking and reporting.[20][21][22]
FreeBSD handles machine-check exceptions through its kernel's exception and interrupt handling mechanisms on supported x86 platforms, logging error details to the kernel message buffer for diagnostics. The mcelog utility from ports can decode these logs into human-readable format. Configuration options, such as enabling detailed reporting or adjusting thresholds, are available via sysctl variables under the hw.mca namespace, allowing runtime tuning without recompilation.[23][24]
| Operating System | Key Components | Default Actions |
|---|---|---|
| Linux | MCA framework, mcelog/rasdaemon, EDAC module | Log errors and continue operation for correctable events; panic or halt for uncorrectable unless configured otherwise[16] |
| Windows | WHEA framework, Event Viewer integration | Log to Event Viewer; blue screen (BSOD) and halt for uncorrectable errors[19] |
| IBM z/OS | MCH in LIC, logrec data set, SMF type 15 records | Log to logrec; IPL abort for severe errors; auditing via SMF[20] |
| FreeBSD | Kernel exception handling, mcelog utility, sysctl hw.mca | Log to kernel messages; continue for correctable, panic for uncorrectable[23] |
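
Decoding tools turn the raw architectural error code into a readable category. The sketch below shows the general idea for the Intel MCACOD layout (simple codes plus the compound bit patterns for TLB, cache, memory controller, and bus errors); it is an assumption-laden illustration, not the implementation used by mcelog, rasdaemon, or any other tool, and the function name and output strings are hypothetical.

```python
SIMPLE_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
    0x0006: "SMM handler code access violation",
    0x0400: "internal timer error",
    0x0E0B: "I/O error",
}

# MMM field of memory controller errors (bits 6:4).
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mcacod(code: int) -> str:
    """Return a readable category for a 16-bit MCACOD value."""
    code &= 0xFFFF
    if code in SIMPLE_CODES:
        return SIMPLE_CODES[code]
    if code & 0xFC00 == 0x0400:          # 0000 01xx xxxx xxxx
        return "internal unclassified"
    body = code & ~0x1000                # bit 12 is the filtering flag in compound codes
    if body & 0xF800 == 0x0800:          # 0000 1PPT RRRR IILL
        return "bus and interconnect error"
    if body & 0xFF00 == 0x0100:          # 0000 0001 RRRR TTLL
        return "cache hierarchy error"
    if body & 0xFF80 in (0x0080, 0x0280):  # 0000 0000 1MMM CCCC / 0000 0010 1MMM CCCC
        kind = "memory controller" if body & 0xFF80 == 0x0080 else "extended memory"
        operation = MEMORY_OPS.get((body >> 4) & 0x7, "reserved")
        channel = body & 0xF
        where = "channel not specified" if channel == 0xF else f"channel {channel}"
        return f"{kind} error: {operation}, {where}"
    if body & 0xFFF0 == 0x0010:          # 0000 0000 0001 TTLL
        return "TLB error"
    if body & 0xFFFC == 0x000C:          # 0000 0000 0000 11LL
        return "generic cache hierarchy error"
    return "unrecognized code"


for value in (0x0005, 0x0090, 0x00C1, 0x0135, 0x0E0B):
    print(f"0x{value:04X}: {decode_mcacod(value)}")
```

The simple codes are checked first because some of them (0x0E0B, for example) would otherwise match a compound pattern; real decoders go further and expand the remaining sub-fields and the model-specific code.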