
Machine-check exception

A machine-check exception (MCE), also known as #MC in Intel architecture, is an abort-class exception (vector 18) generated by the CPU when it detects a hardware error, such as a fault in memory, caches, buses, or TLBs, that hardware cannot fully correct on its own. This mechanism, part of the broader Machine-Check Architecture (MCA), enables operating systems to log, diagnose, and respond to these errors, which range from recoverable issues to those requiring system shutdown. Introduced in the Pentium processor and enhanced in subsequent x86 families such as P6, Pentium 4, Xeon, and Atom, MCE relies on model-specific registers (MSRs) to report detailed error information.

The MCA organizes error reporting into "banks," each corresponding to a hardware subsystem (e.g., one for the data cache, another for the bus unit). The number of banks varies by model and is given by the Count field of the IA32_MCG_CAP MSR; the 8-bit field allows up to 255 banks in theory, though implementations provide far fewer. Global MSRs such as IA32_MCG_CAP report the bank count and other capabilities, while per-bank MSRs such as IA32_MCi_STATUS record whether the entry is valid, whether the error is uncorrected (UC bit set), and whether the processor context is corrupted (PCC bit set). Enabling MCE requires setting the MCE bit in CR4; once it is set, uncorrected errors raise the exception, and the processor enters shutdown if another machine check arrives before the handler has finished with the first. In virtualized environments, such as those using Intel VT-x, MCEs can cause VM exits to allow hypervisor intervention.

Errors fall into two primary categories. Corrected errors are fixed automatically by hardware (e.g., single-bit corrections) and are often reported via a Corrected Machine-Check Interrupt (CMCI) for logging without halting execution. Uncorrected errors trigger #MC and are further classified as recoverable (software may continue after handling the error) or fatal (demanding immediate shutdown). In Linux, the kernel's MCE handler, rewritten in version 2.5, logs these events for decoding by tools such as mcelog; it polls for corrected errors every 5 minutes by default and offers configurable tolerance levels (e.g., panicking on uncorrected errors or sending a signal to the affected processes). Uncorrected errors in kernel mode typically panic the system, while those in user space may kill the affected process with SIGBUS, which improves reliability in high-availability environments. Overall, MCE provides essential diagnostics for hardware faults, supporting proactive maintenance in data centers and embedded systems.
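As an illustration of how the bank count and capability flags are read out of IA32_MCG_CAP, the following minimal sketch splits a raw register value into named fields. The function name and the sample value are hypothetical, and the bit positions for the CTL, CMCI, and SER flags are taken from Intel's published register layout as an assumption that should be checked against the current manual.

```python
def decode_mcg_cap(value: int) -> dict:
    """Split a raw IA32_MCG_CAP value into a few named fields (illustrative)."""
    return {
        "bank_count": value & 0xFF,             # Count field, bits 7:0
        "mcg_ctl_p": bool((value >> 8) & 1),    # IA32_MCG_CTL present (assumed bit 8)
        "mcg_cmci_p": bool((value >> 10) & 1),  # CMCI supported (assumed bit 10)
        "mcg_ser_p": bool((value >> 24) & 1),   # software error recovery (assumed bit 24)
    }


if __name__ == "__main__":
    # 0x0F000C14 is a made-up sample value, not taken from real hardware.
    print(decode_mcg_cap(0x0F000C14))
```

Obtaining the raw value requires privileged MSR access, which is deliberately left out of the sketch.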

Fundamentals

Definition and Purpose

A machine-check exception (MCE) is an exception generated by the CPU in response to detecting an uncorrectable hardware error, such as a hardware fault or a data inconsistency, alerting the operating system to the issue. These exceptions are part of the broader Machine Check Architecture (MCA) in x86 processors, which provides a standardized framework using model-specific registers (MSRs), such as IA32_MCG_CAP and IA32_MCi_STATUS, to detect, log, and report both correctable and uncorrectable errors.

The primary purpose of MCEs is to enable software to respond appropriately to severe hardware faults through error logging, graceful system degradation, recovery attempts, or a controlled shutdown that protects overall system integrity. This contrasts with correctable errors, which hardware resolves transparently without software intervention, so that transient issues cause minimal disruption while uncorrectable ones are escalated for explicit handling.

MCEs can be synchronous, directly tied to the execution of a specific instruction that provokes the error, or asynchronous, arising from external or independent hardware failures unrelated to the current program flow. In x86 processors, MCA serves as the established standard for detailed error reporting, with MCEs delivered as exception vector 18, while corrected-error interrupts are routed through the local Advanced Programmable Interrupt Controller (APIC). The Machine Check In Progress (MCIP) flag in the IA32_MCG_STATUS MSR (bit 2) is set to indicate an active MCE; it guards against recursive exceptions until software clears it once the error has been handled.
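The role of the IA32_MCG_STATUS flags can be made concrete with a small sketch. MCIP at bit 2 comes from the text above; the RIPV (bit 0) and EIPV (bit 1) positions are assumptions based on Intel's documented layout, and the function name is hypothetical.

```python
RIPV = 1 << 0  # restart instruction pointer valid (assumed bit 0)
EIPV = 1 << 1  # error instruction pointer valid (assumed bit 1)
MCIP = 1 << 2  # machine check in progress (bit 2, as described above)


def decode_mcg_status(value: int) -> dict:
    """Return the three low-order IA32_MCG_STATUS flags as booleans."""
    return {
        "ripv": bool(value & RIPV),
        "eipv": bool(value & EIPV),
        "mcip": bool(value & MCIP),
    }


if __name__ == "__main__":
    # Sample value with RIPV and MCIP set: a handler is running and
    # execution could in principle be restarted at the saved pointer.
    print(decode_mcg_status(0b101))
```

A handler would typically consult these flags first and clear MCIP only after all banks have been processed.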

Historical Development

The concept of machine-check exceptions originated in mainframe computing with the IBM System/360 architecture, announced in 1964 and detailed in its principles of operation by 1967. This system introduced machine check interruptions as a high-priority mechanism to detect hardware malfunctions, such as data parity errors in main storage or channel control issues, enabling diagnostic scans and recovery actions while preserving system state for fault analysis.

In the x86 architecture, machine-check exceptions were first formalized in the Pentium processor in 1993, which provided basic reporting for internal errors such as multi-bit failures in the on-chip cache; the Machine Check Architecture (MCA) itself followed with the P6 family. Enhancements in that era focused on improved error logging for uncorrectable multi-bit errors, laying the groundwork for more robust hardware diagnostics. AMD adopted MCA support in its K7 (Athlon) processor family starting in 1999, aligning with x86 conventions through CPUID feature bit 14, which advertises machine check reporting to compatible systems. The x86 MCA specification evolved from 1996 onward through Intel's architecture documentation, standardizing error registers and exception handling across processors. A significant milestone came in 2008 with Intel's Nehalem microarchitecture, which introduced scalable MCA banks (multiple sets of registers for detailed error logging per CPU core) to handle the growing volume of potential faults in multi-core designs.

By the 2020s, extensions appeared in other architectures: ARMv8-A incorporated machine check exceptions as part of its Reliability, Availability, and Serviceability (RAS) features for uncorrectable hardware errors, while RISC-V developed error reporting mechanisms, including machine check-like synchronous exceptions for hardware faults, through ongoing ISA extensions. These developments were driven by the escalating complexity of multi-core processors and the reliability demands of data centers, where undetected hardware errors could cascade into system-wide failures.

Hardware Detection

Mechanisms in Modern CPUs

In modern x86 CPUs, hardware errors are detected through monitoring mechanisms integrated into components such as memory controllers, parity checkers, translation lookaside buffers (TLBs), and interconnect buses. These monitors continuously validate data integrity during operations such as reads, writes, and transmissions, flagging anomalies such as single-bit flips or multi-bit corruptions via dedicated signals or internal status bits. Upon detection, the CPU core captures the error details and, if the error is uncorrectable, vectors to a machine-check exception (#MC) handler at vector 18, provided the MCE bit is set in control register CR4. This process ensures that severe errors interrupt normal execution promptly while preserving the processor state for analysis.

Error information is logged in Machine Check Architecture (MCA) banks, which are sets of model-specific registers (MSRs) dedicated to storing error records. Each bank includes registers such as IA32_MCi_STATUS (error validity, type, and codes), IA32_MCi_ADDR (the address involved), and IA32_MCi_MISC (additional error data), enabling precise identification of the affected hardware unit. Modern CPUs support up to 28 MCA banks per logical processor, with the exact count reported by the IA32_MCG_CAP MSR. Banks are allocated to specific subsystems, such as one for the L3 cache or another for the memory controller, and a new error replaces a record already held in its bank only under the architecture's overwrite rules (for example, an uncorrected error may replace a corrected one). This register-based logging occurs atomically to prevent data loss during concurrent errors. Handling also depends on whether the error is asynchronous (occurring independently of the current instruction, e.g., via external bus events) and on whether it was corrected (automatically repaired, e.g., by single-error correction).

The CR4.MCE bit (bit 6) enables #MC exceptions globally. When it is set, uncorrectable asynchronous errors trigger an immediate #MC, potentially leading to an asynchronous enclave exit (AEX) in SGX environments, while corrected errors are logged in banks without an exception unless a threshold is exceeded, in which case a corrected machine-check interrupt (CMCI) may be signaled through the local APIC using a vector programmed by software. If CR4.MCE is cleared, uncorrectable errors instead cause a shutdown, possibly accompanied by signals such as BINIT# or MCERR#, bypassing the exception pathway. Corrected errors, even when CR4.MCE is enabled, are logged without disruption to maintain system performance.

The detection-to-handler flow proceeds as follows:

1. Hardware monitors identify an error and assert internal flags.
2. The error is captured in the bank belonging to the detecting unit, and status bits are updated to mark validity and recovery status.
3. If the error warrants interruption (it is uncorrectable and CR4.MCE is enabled), the CPU delivers the #MC exception, pushing the current execution state onto the stack and transferring control to the vector 18 handler so the operating system can trap it.

This sequence minimizes latency while ensuring that error context, such as model-specific error codes (MSCOD) and corrected/uncorrected indicators, is preserved for subsequent decoding and classification.
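The decision logic described above can be summarized in a short sketch. This is a simplified model rather than a description of any particular processor: the function name, its parameters, and the returned labels are hypothetical, and the shutdown on a nested machine check (MCIP already set) is included as an assumption drawn from Intel's documented behavior.

```python
def hardware_response(uncorrected: bool, cr4_mce: bool,
                      mcip_set: bool = False,
                      threshold_reached: bool = False) -> str:
    """Model the processor's reaction to a newly logged error (simplified)."""
    if uncorrected:
        # Without CR4.MCE, or with a machine check already in progress,
        # the exception cannot be delivered and the processor shuts down.
        if not cr4_mce or mcip_set:
            return "shutdown"
        return "raise #MC (vector 18)"
    # Corrected errors are only logged, unless the bank's threshold is hit.
    return "log and signal CMCI" if threshold_reached else "log only"


if __name__ == "__main__":
    print(hardware_response(uncorrected=True, cr4_mce=True))
    print(hardware_response(uncorrected=False, cr4_mce=True, threshold_reached=True))
```

Real hardware also considers per-bank enable bits and the error's signaling class, which this model leaves out.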

Supported Architectures

Machine-check exceptions are supported across several major processor architectures, each implementing mechanisms tailored to its design principles for detecting and reporting hardware errors such as cache failures, bus errors, and memory faults.

In the x86 family, Intel introduced the Machine Check Architecture (MCA) with the Pentium Pro processor in 1995, providing a standard framework for error reporting through dedicated model-specific registers (MSRs). Subsequent processor generations expanded MCA to as many as 28 error-reporting banks per core, each containing status, address, and miscellaneous registers that capture details such as error type and location. AMD implemented machine-check support beginning with the K6 processor in 1997, aligning with x86 conventions but later introducing variations such as the additional IPID (IP Block Identifier) register in Scalable MCA systems, which improves error source identification in its modern processors. These bank structures enable detailed logging of uncorrectable and correctable errors, with Intel emphasizing status/address/misc triads and AMD adding IPID for modular IP block tracing.

The ARM architecture incorporated Reliability, Availability, and Serviceability (RAS) extensions starting with ARMv8.2 in 2016, formalized as a mandatory feature by 2018 implementations. The extensions include error record registers (ERRSELR, ERXSTATUS, ERXMISC, and ERXADDR), analogous to x86 banks, for capturing asynchronous errors. These registers support an implementation-defined number of error records per processor, facilitating firmware or OS-level error analysis similar to MCA, with features for corrected error counting and injection testing.

IBM's Power architecture has long supported machine-check interrupts through hardware error facilities, enabling detection of processor, memory, and I/O faults via dedicated interrupt vectors and recovery units; modern Power processors combine machine-check interrupts with dynamic reconfiguration. Similarly, z/Architecture in IBM mainframes provides machine-check interrupts for comprehensive error reporting. The z15 processor (introduced in 2019) enhanced firmware-first handling through the Integrated Firmware Processor (IFP) and Licensed Internal Code (LIC), allowing errors to be intercepted and recovered at the firmware level before escalating to the OS, which reduces outage risks.

RISC-V introduced RAS extensions in its ratified privileged architecture specification (version 1.12) in 2021, defining standard mechanisms for error injection, reporting, and handling via memory-mapped registers and RAS events. The design supports scalable error records for cores and accelerators without fixed bank counts, emphasizing modularity for custom implementations.
| Architecture | Introduction Year | Key Features |
| --- | --- | --- |
| x86 (Intel) | 1995 (Pentium Pro) | Up to 28 MCA banks; status/address/misc registers; scalable logging. |
| x86 (AMD) | 1997 (K6) | MCA banks with IPID for IP block identification; compatible with Intel but enhanced for modular dies. |
| ARM | 2016 (ARMv8.2) | Implementation-defined number of error record registers (ERRSELR et al.); asynchronous reporting; corrected-error support. |
| IBM Power | 1991 (PowerPC) | Machine-check interrupts; dynamic recovery units. |
| IBM z/Architecture | 2000 (z900) | Machine-check interrupts; firmware-first via IFP/LIC (enhanced in z15, 2019). |
| RISC-V | 2021 (Priv. v1.12) | Memory-mapped RAS records; error injection; RAS events; modular banks. |

Error Classification

Uncorrectable vs. Correctable Errors

Machine-check exceptions (MCEs) in modern processors distinguish between uncorrectable and correctable errors so that hardware can respond appropriately: uncorrectable errors generally trigger an immediate MCE, while correctable ones are handled transparently unless thresholds are exceeded.

Uncorrectable errors, indicated by the UC bit set to 1 in the MCi_STATUS register, cannot be fixed automatically by hardware and therefore raise a machine-check exception (#MC, vector 18) when the MCE feature is enabled via CR4.MCE=1. These errors often lead to a system halt, or require recovery mechanisms if the error is deemed recoverable, as with multi-bit memory failures or TLB errors. In Intel's architecture, the PCC bit in MCi_STATUS (bit 57) further classifies uncorrected errors as fatal (PCC=1, indicating processor context corruption and necessitating a reset) or potentially recoverable (PCC=0); AMD similarly uses the UC bit to flag errors, such as uncorrectable memory errors or severe TLB faults, that may still allow limited recovery. Uncorrected errors are thus subdivided into fatal (context corrupted) and uncorrected recoverable (UCR) types. The UCR class includes software recoverable action required (SRAR) and software recoverable action optional (SRAO) errors, which software may be able to mitigate if the error type permits.

In contrast, correctable errors, marked by UC=0 in MCi_STATUS, are detected and rectified by hardware without interrupting normal execution, for example through single-bit error correction using SECDED (Single Error Correction, Double Error Detection) in memory modules. These errors are logged in the relevant MCi_STATUS register for later analysis but do not trigger an MCE; once a configurable threshold is surpassed, they may instead be escalated via a Corrected Machine Check Interrupt (CMCI). In both Intel and AMD architectures, corrected error occurrences are monitored using threshold registers such as MCi_CTL2 or MCi_THRESHOLD to track frequency and prevent silent degradation.

The impact of these errors varies by scope. Processor-internal issues (e.g., cache or TLB errors) are often confined to a single core or thread, allowing potential isolation, whereas system-wide errors (e.g., interconnect or multi-channel memory failures) can affect broader operations and reduce recovery possibilities. Recovery from an uncorrectable error depends on its type and scope: execution may be restarted if the RIPV bit in MCG_STATUS indicates a valid restart IP, though fatal cases typically mandate full system intervention. Correctable errors, by design, allow seamless recovery through correction, maintaining system availability unless patterns suggest impending failure.
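The classification above lends itself to a small decision function. The sketch below is an assumption-laden illustration: the UC (61) and PCC (57) positions follow the register description elsewhere in this article, while the VAL (63), S (56), and AR (55) positions and the UCNA ("uncorrected, no action required") label come from Intel's documented software-error-recovery scheme and should be verified against the current manual. The function name is hypothetical.

```python
VAL = 1 << 63  # record valid
UC = 1 << 61   # uncorrected error
PCC = 1 << 57  # processor context corrupt
S = 1 << 56    # signaled via #MC (assumed position)
AR = 1 << 55   # recovery action required (assumed position)


def classify_severity(status: int, ser_supported: bool = True) -> str:
    """Map an IA32_MCi_STATUS value to a coarse severity class."""
    if not status & VAL:
        return "no error"
    if not status & UC:
        return "corrected"
    # Context corruption, or no recovery support at all, is treated as fatal.
    if status & PCC or not ser_supported:
        return "fatal"
    if not status & S:
        return "UCNA"
    return "SRAR" if status & AR else "SRAO"


if __name__ == "__main__":
    print(classify_severity(VAL | UC | S | AR))
```

Production handlers apply additional checks (for example the EN bit and the restart flags in MCG_STATUS) before acting on such a classification.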

Specific Error Categories

Machine-check exceptions encompass a range of hardware-detected faults, categorized by the subsystem affected: memory, processor internals, interconnects, and I/O peripherals. These categories are defined through architectural error reporting mechanisms in processors from Intel and AMD, which identify the error's origin without establishing its root cause.

Memory-related errors involve faults in the memory subsystem, including ECC failures where error-correcting codes detect and sometimes correct bit flips in storage or transmission. DRAM scrub errors occur during periodic memory scans that identify and log correctable issues, such as single-bit errors in memory modules. Unbuffered DIMM faults manifest as access anomalies in non-registered configurations, often reported when integrity checks fail during read or write operations. For instance, Intel's MCACOD encodes memory controller errors in the compound form 0000 0000 1MMM CCCC, where MMM identifies the request type and CCCC the channel, so that values 0x00C0 through 0x00CF report memory scrubbing errors such as a corrected patrol scrub error; in AMD systems, syndrome values like DramEccErr (ErrorCodeExt 0x0) denote ECC discrepancies.

Processor-internal errors arise within the CPU core, encompassing parity errors in internal storage, execution unit failures in arithmetic or logic pipelines, and exceptions raised by internal consistency checks. These are typically logged when hardware detects inconsistencies in core operations, such as invalid instruction decoding. Representative MCACOD examples include 0x0001 for unclassified errors and 0x0005 for internal parity errors; AMD equivalents feature ErrorCodeExt values like 0x7 for IntErrTyp1 in execution units or hardware assertions in the core.

Interconnect and bus errors pertain to communication failures across links, including PCIe link failures where transaction layer protocol violations or data discrepancies disrupt data flow between the CPU and expansion devices. In multi-socket systems, QPI or UPI errors signal issues such as invalid packets or timeout conditions on inter-socket fabrics. Intel reports bus and interconnect faults with the compound MCACOD form 0000 1PPT RRRR IILL, which encodes participation, timeout, request type, memory-or-I/O, and level fields; AMD reports these via MCA_STATUS_CS with ErrorCodeExt 0x2 for GMI/xGMI link errors or timeout syndromes in the format 0000 1XXT RRRR XXLL.

I/O and peripheral errors cover anomalies in external device interactions, such as storage controller timeouts during command execution on interfaces like NVMe, and GPU coherency issues where cache line inconsistencies arise in integrated or discrete graphics. These are flagged when the processor's I/O hub detects unrecoverable responses or data mismatches. Examples include MCACOD 0x0400 for internal timer errors and 0x0E0B for I/O errors; in AMD architectures, MCA_STATUS_PB uses ErrorCodeExt for parameter block faults, while FTI_DAT_STAT (0x3) indicates I/O status discrepancies.

Error enumeration relies on standardized codes like Intel's MCACOD, where a value such as 0x0090 denotes a memory read error on channel 0 (whether the error was corrected is indicated separately by the UC bit), providing a compact identifier for logging and analysis across supported architectures. These categories span both correctable and uncorrectable errors, with reporting mechanisms ensuring precise subsystem attribution.
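To show how a subsystem category can be derived from an Intel MCACOD value, the sketch below checks the simple codes first and then the compound bit patterns. The code tables are assumptions reproduced from Intel's published encoding (simple codes, the 1MMM CCCC memory controller form, the 1PPT RRRR IILL bus form, and bit 12 as a filtering flag) and may not match every processor model; the function name and returned strings are hypothetical.

```python
SIMPLE_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
    0x0400: "internal timer error",
    0x0E0B: "I/O error",
}
MEMORY_REQUEST = {0: "generic", 1: "read", 2: "write",
                  3: "address/command", 4: "scrubbing"}


def classify_mcacod(code: int) -> str:
    """Return a coarse category for a 16-bit MCACOD value (illustrative)."""
    code &= 0xEFFF  # ignore bit 12, assumed to be the filtering (F) flag
    if code in SIMPLE_CODES:
        return SIMPLE_CODES[code]
    if code & 0x0800:
        return "bus/interconnect error"
    if code & 0x0400:
        return "internal unclassified error"
    if code & 0x0100:
        return "cache hierarchy error"
    if code & 0x0080:
        request = MEMORY_REQUEST.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        where = "unspecified channel" if channel == 0xF else f"channel {channel}"
        return f"memory controller {request} error, {where}"
    if code & 0x0010:
        return "TLB error"
    if code & 0x000C == 0x000C:
        return "generic cache hierarchy error"
    return "unknown"


if __name__ == "__main__":
    for sample in (0x0001, 0x009F, 0x00C0, 0x0E0B):
        print(hex(sample), "->", classify_mcacod(sample))
```

Checking the simple codes before the compound patterns matters because a value such as 0x0E0B would otherwise match the bus/interconnect pattern.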

Software Response

Operating System Reporting

When a machine-check exception occurs, the operating system's kernel traps the exception and invokes a dedicated handler to process the error. In x86 architectures, for instance, the #MC exception is routed to the machine-check exception (MCE) handler in the kernel, which captures hardware-reported details from model-specific registers (MSRs). The handler parses data from Machine Check Architecture (MCA) banks, the hardware registers that store error information for specific components such as caches or memory controllers, and identifies the error source, type, and affected resources.

The kernel then logs the parsed error data to an internal ring buffer or equivalent structure for efficient, low-overhead storage, enabling subsequent analysis without immediate system disruption. Reporting occurs through standardized formats, such as kernel console messages (e.g., "Machine check events logged") and syslog entries that include timestamps, severity levels, and decoded error summaries. In Windows, the Windows Hardware Error Architecture (WHEA) performs similar logging via Event Tracing for Windows (ETW), recording errors in the system event log with structured packets detailing the hardware fault.

Users gain visibility into these reports through accessible tools: in Linux, commands such as dmesg display the kernel ring buffer contents, while in Windows, the Event Viewer provides a graphical interface to ETW logs. For fatal, uncorrectable errors that threaten system integrity, the kernel may trigger a panic, which results in a crash dump (or a Blue Screen of Death (BSOD) in Windows), to halt execution and preserve diagnostic data. Across operating systems, reporting prioritizes non-blocking mechanisms for correctable or recoverable errors, allowing the system to continue operation while queuing logs for later review, thus balancing reliability with availability.

Platform-Specific Implementations

In Linux, machine-check exceptions are managed via the Machine Check Architecture (MCA) framework in the kernel, which detects and logs errors such as correctable and uncorrectable events from CPU registers. The mcelog daemon parses these logs for detailed decoding and preventive action, though it has been largely superseded by rasdaemon in modern distributions for collecting and decoding events across platforms. The Error Detection and Correction (EDAC) kernel module specifically handles memory errors, exposing statistics through the sysfs interface at /sys/devices/system/edac/ for monitoring failures. Kernel boot parameters such as mce=off disable machine-check handling entirely, preventing error reporting but risking undetected faults.

Microsoft Windows implements machine-check exception reporting through the Windows Hardware Error Architecture (WHEA), a unified framework that captures processor-detected errors and routes them to the operating system for processing. WHEA logs these events, including details such as error source and type, directly to the Windows Event Log under the System log, enabling administrators to diagnose issues without external tools. For uncorrectable machine-check exceptions, WHEA triggers a bug check with stop code 0x124 (WHEA_UNCORRECTABLE_ERROR), resulting in a blue screen that halts the system and prevents further corruption.

IBM z/OS employs a dedicated machine-check handler (MCH) within the Licensed Internal Code (LIC) to intercept and process hardware interrupts from the central processor, generating records for failures in storage, keys, timers, or CPU components. These MCH records are stored in the logrec data set for immediate analysis, with severe uncorrectable errors often leading to an IPL (Initial Program Load) abort to isolate the fault and reinitialize the system. Auditing of machine-check events is facilitated through System Management Facility (SMF) type 15 records, which capture statistical data on error occurrences for long-term reliability tracking and reporting.

FreeBSD handles machine-check exceptions through its kernel's exception and interrupt handling mechanisms on supported x86 platforms, logging error details to the kernel message buffer for diagnostics. The mcelog utility from ports can decode these logs into human-readable format. Configuration options, such as enabling detailed reporting or adjusting thresholds, are available via sysctl variables under the hw.mca namespace, allowing runtime tuning without recompilation.
| Operating System | Key Components | Default Actions |
| --- | --- | --- |
| Linux | MCA framework, mcelog/rasdaemon, EDAC module | Log errors and continue operation for correctable events; panic or halt for uncorrectable unless configured otherwise |
| Windows | WHEA framework, Event Viewer integration | Log to Event Viewer; blue screen (BSOD) and halt for uncorrectable errors |
| IBM z/OS | MCH in LIC, logrec data set, SMF type 15 records | Log to logrec; IPL abort for severe errors; auditing via SMF |
| FreeBSD | Kernel exception handling, mcelog utility, sysctl hw.mca | Log to kernel messages; continue for correctable, panic for uncorrectable |
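As a concrete example of the Linux EDAC interface mentioned above, the following sketch collects per-controller error counters from sysfs. It assumes the conventional layout in which each memory controller appears as a directory mcN under /sys/devices/system/edac/mc containing ce_count and ue_count files; that layout, like the function name, is an assumption here, and the function returns an empty result on systems without a loaded EDAC driver.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Collect corrected/uncorrected counters for each EDAC memory controller."""
    base = Path(root)
    if not base.is_dir():
        return {}  # EDAC not available or driver not loaded
    counts = {}
    for controller in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((controller / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[controller.name] = entry
    return counts


if __name__ == "__main__":
    print(read_edac_counts())
```

Passing the root directory as a parameter keeps the function usable against a copied or simulated sysfs tree.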

Analysis Methods

Decoding Machine-Check Data

Machine-check exceptions generate raw data stored in hardware registers, which must be decoded to identify the error's nature, location, and severity. This process is essential for diagnosing hardware faults in x86 architectures supporting the Machine-Check Architecture (MCA). Decoding involves reading and interpreting specific model-specific registers (MSRs) that capture error details upon detection of uncorrectable or correctable errors.

The primary data sources for machine-check information are the MCA bank registers, typically the status, address, and miscellaneous registers for each bank associated with hardware units such as caches, memory controllers, or interconnects. The IA32_MCi_STATUS register logs core error status, containing the MCA error code (MCACOD) in bits 15:0, model-specific error codes in bits 31:16, and flags such as the valid bit (VAL, bit 63) that indicates populated data. The IA32_MCi_ADDR register holds the physical, linear, or segment-offset address of the fault if the address valid (ADDRV) flag (bit 58 in IA32_MCi_STATUS) is set. The IA32_MCi_MISC register provides supplementary details, such as error correction code (ECC) syndromes for memory errors or PCI Express requestor IDs for input/output MCA (IOMCA) errors, validated by the miscellaneous valid (MISCV) flag (bit 59 in IA32_MCi_STATUS). Status flags in IA32_MCi_STATUS, including uncorrected (UC, bit 61), overflow (OVER, bit 62), and processor context corrupt (PCC, bit 57), help pinpoint the fault's characteristics and impact.

Interpreting this data follows a structured sequence:

1. Verify the valid bit (MCi_STATUS.VAL=1) to confirm that the bank contains relevant error information; invalid banks are skipped.
2. Decode the MCACOD field to classify the error type, using architecture-defined simple codes (e.g., 0001H for unclassified errors) or compound codes (e.g., formats like 1MMMCCCC for memory controller errors or 1PPTRRRRIILL for bus/QuickPath Interconnect errors) as specified in MCA tables.
3. If ADDRV is set, map the address from IA32_MCi_ADDR to a logical representation, considering the address mode indicated in IA32_MCi_MISC (bits 8:6) for contexts such as uncorrected recoverable errors.
4. Examine the syndrome bits and miscellaneous data for precise fault localization, such as ECC syndromes (bits 63:32 in IA32_MCi_MISC) that identify the affected bits.

Decoding faces challenges, particularly in high-error environments where bank overflow occurs. Overflow is signaled by the OVER flag when a new error overwrites prior data, potentially losing information on multiple faults. In scenarios with frequent corrected errors, the corrected error count (bits 52:38 in IA32_MCi_STATUS) mitigates this loss if the capabilities register (IA32_MCG_CAP) reports MCG_CMCI_P. Additionally, decoding responsibilities may be split between firmware and the operating system: firmware can initially log errors signaled via corrected machine-check interrupts (CMCI) and modify status registers if the enhanced MCA capability (MCG_EMC_P, bit 25 of IA32_MCG_CAP) permits, while the OS performs final interpretation and clearing, which requires careful coordination to avoid conflicts. For instance, handlers running in System Management Mode (SMM) may intercept and process machine checks before the OS becomes involved.

The output of decoding is typically a human-readable summary that aggregates the interpreted data, detailing the error location (e.g., a specific cache level or memory channel derived from MCACOD and the address), the type (e.g., "corrected single-bit error" from the status bits), and the recovery status (e.g., flags indicating whether the error was contained or led to context corruption). These summaries often include register values alongside descriptive text, such as "MCACOD=0x00C0: corrected memory scrubbing error at 0xABCDEF", with associated miscellaneous context such as identifiers.
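The register layout described above translates directly into bit operations. The sketch below extracts the fields named in this section from a raw IA32_MCi_STATUS value; the function name and the sample value are hypothetical, and the EN flag at bit 60 is an assumption taken from Intel's documented layout rather than from the text.

```python
def decode_mci_status(status: int) -> dict:
    """Extract the architectural fields of a raw IA32_MCi_STATUS value."""
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return {
        "val": bit(63),      # record valid
        "over": bit(62),     # an earlier record was overwritten
        "uc": bit(61),       # uncorrected error
        "en": bit(60),       # error reporting enabled (assumed position)
        "miscv": bit(59),    # IA32_MCi_MISC holds valid data
        "addrv": bit(58),    # IA32_MCi_ADDR holds a valid address
        "pcc": bit(57),      # processor context corrupt
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mscod": (status >> 16) & 0xFFFF,            # model-specific code
        "mcacod": status & 0xFFFF,                   # architectural code
    }


if __name__ == "__main__":
    # Made-up sample: valid corrected record with address, count of 3.
    sample = (1 << 63) | (1 << 59) | (1 << 58) | (3 << 38) | (0x0012 << 16) | 0x00C0
    fields = decode_mci_status(sample)
    if fields["val"]:
        print(fields)
```

The corrected-error count is only meaningful when the processor advertises CMCI support, so a complete decoder would consult IA32_MCG_CAP before reporting it.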

Diagnostic Tools

Diagnostic tools for machine-check exceptions primarily consist of software utilities that parse kernel-generated logs, decode error data, and generate human-readable reports to aid in identifying hardware faults.

In Linux environments, open-source tools such as rasdaemon serve as the primary daemon for monitoring and logging Reliability, Availability, and Serviceability (RAS) events, including machine-check exceptions from CPU banks and memory controllers. Rasdaemon captures events via kernel tracepoints, stores them in a database, and supports querying through utilities such as ras-mc-ctl, which can display summaries of uncorrectable and correctable errors across components such as memory and PCIe devices. It has largely replaced the older mcelog tool, which previously handled similar parsing of /dev/mcelog but lacks support for modern kernel features and architectures such as recent AMD processors. A complementary open-source utility is EDAC-utils, which focuses on memory diagnostics by interfacing with the kernel's Error Detection and Correction (EDAC) subsystem to report errors, including correctable single-bit flips and uncorrectable multi-bit failures in DIMMs.

Vendor-specific tools extend these capabilities for proprietary architectures. Intel's Processor Diagnostic Tool performs stress tests and feature verification on CPUs, helping to isolate processor-related issues that trigger machine-check exceptions by simulating loads and checking for anomalies. For AMD systems, rasdaemon provides native support for logging processor errors via AMD's machine-check architecture, though additional profiling can be done with AMD uProf, which uses Instruction-Based Sampling (IBS) to analyze performance hotspots that may correlate with error-prone hardware states.

Practical usage of these tools often involves command-line interfaces for monitoring and troubleshooting. For instance, invoking ras-mc-ctl --summary on a Linux system outputs a concise overview of error counts by category, such as memory correctables or PCIe AER events, allowing administrators to quickly assess system health without examining raw logs. Similarly, edac-util -v displays verbose status of loaded EDAC drivers and detected memory controllers, highlighting any accumulated errors. These tools can be integrated with broader monitoring frameworks by scripting periodic runs and raising alerts when error thresholds are crossed, enabling proactive surveillance.

Despite their utility, these diagnostic tools have inherent limitations tied to underlying system support. Rasdaemon and its predecessors such as mcelog require specific kernel configurations, such as enabled MCE recovery and EDAC drivers, which may not be active by default on all distributions or platforms; events can be missed if these are not set up properly. Additionally, coverage of corrected errors can be incomplete in certain scenarios, as transient single-bit flips might not always trigger full logging or decoding because of filtering or optimizations that prioritize uncorrectable faults, leading to underreporting of subtle degradation patterns. These tools provide practical interfaces to the decoding process outlined in machine-check data analysis but rely on accurate kernel-level capture for reliability.
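The scripted alerting described above usually amounts to comparing two successive snapshots of error counters. The sketch below illustrates that comparison in isolation; the function name and the category names in the example are hypothetical, and how the snapshots are obtained (for instance by parsing tool output) is deliberately left out because output formats vary between tool versions.

```python
def counters_exceeding(previous: dict, current: dict, threshold: int) -> dict:
    """Return categories whose count grew by at least `threshold` since the last snapshot."""
    alerts = {}
    for name, now in current.items():
        before = previous.get(name, 0)
        # A smaller value than before suggests the counter was reset
        # (e.g., after a reboot), so the new value is taken as the increase.
        delta = now - before if now >= before else now
        if delta >= threshold:
            alerts[name] = delta
    return alerts


if __name__ == "__main__":
    last = {"memory_ce": 10, "pcie_aer": 2}
    latest = {"memory_ce": 25, "pcie_aer": 2, "memory_ue": 1}
    print(counters_exceeding(last, latest, threshold=5))
```

A monitoring job would store each snapshot between runs and forward the returned dictionary to whatever alerting channel is in use.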

Mitigation and Prevention

Common Causes

Machine-check exceptions are frequently triggered by hardware failures within the or associated components. Faulty , particularly due to cosmic ray-induced soft errors, represents one of the most prevalent causes, where high-energy particles from collide with memory cells, flipping bits and leading to uncorrectable detected by the . Overheating CPUs can also induce internal errors, such as failures or parity issues, as elevated temperatures degrade integrity and cause transient faults during computation. Power supply instability, including voltage fluctuations or insufficient current delivery, often results in bus errors or external signaling via pins like MCERR#, disrupting data transfers and prompting the to assert a machine-check exception. Environmental factors exacerbate the risk of these exceptions, particularly in settings. Radiation from cosmic rays is more intense at high altitudes, where thinner atmospheric shielding allows more secondary neutrons to reach ground level, increasing rates in memory subsystems by factors of 2-10 compared to . In dense server racks, (EMI) from adjacent power supplies, cabling, or cooling fans can induce noise on buses or interconnects, leading to errors or violations that manifest as external faults. Design-related issues in systems pushed beyond specifications contribute significantly to machine-check triggers. Marginal quality in overclocked processors may cause microarchitectural faults or illegal vector errors under elevated clock speeds, as the hardware operates outside validated thermal and voltage envelopes. In aging multi-socket boards, wear on interconnects like Intel's QuickPath Interconnect (QPI) or Ultra Path Interconnect (UPI) can lead to transaction failures or mismatches, generating bus or errors over time due to signal degradation. 
Statistics underscore the dominance of memory-related causes, with memory errors accounting for the majority of machine-check events in production systems, often stemming from soft errors rather than permanent defects. Typical cosmic ray-induced soft error rates in DDR4 range from 1,000–5,000 FIT (failures in time, or errors per billion device-hours) per Mbit, though large-scale field studies report considerably higher aggregate rates of 25,000 to 70,000 FIT per Mbit across all error types due to scaling and environmental variability. These rates highlight why memory subsystems are a primary focus for logging in machine-check handling.
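To make the FIT unit concrete, the short Python calculation below converts a per-Mbit FIT rate into an expected number of errors per year for a memory module. The 16 GiB capacity and the function name are illustrative assumptions, and the result is only the arithmetic consequence of the quoted rate, not a measured value.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def expected_errors_per_year(fit_per_mbit, capacity_gib):
    """Expected errors per year for a module of the given capacity.

    One FIT is one error per 10^9 device-hours, so:
    errors = (FIT per Mbit) * (capacity in Mbit) * hours / 1e9
    """
    mbits = capacity_gib * 1024 * 8  # GiB -> MiB -> binary megabits
    return fit_per_mbit * mbits * HOURS_PER_YEAR / 1e9


# Lower bound quoted above (1,000 FIT per Mbit) for a hypothetical 16 GiB module.
print(round(expected_errors_per_year(1000, 16), 1))  # 1148.2
```

Even the lower bound implies on the order of a thousand bit errors per module per year under this simple model, which illustrates why almost all of them must be corrected silently by ECC for a system to remain usable.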

Strategies for Error Handling

Recovery techniques for machine-check exceptions (MCEs) focus on isolating faulty components to maintain system operation where possible. For memory errors detected via MCE, Linux implements page retirement, also known as bad page offlining, in which affected memory pages are marked as unusable and removed from the pool of usable memory to prevent further errors. This is handled by user-space tools like rasdaemon, which process MCE reports and trigger offlining scripts upon error detection. For processor faults, Linux supports CPU hot-unplug, which allows a faulty core or socket to be taken offline dynamically without a full system shutdown, using the CPU hotplug mechanism to migrate workloads away and isolate the error. This approach is triggered by rasdaemon scripts that invoke CPU offline operations in response to uncorrectable MCEs on specific cores.

Prevention strategies emphasize hardware and firmware measures to reduce the likelihood of MCEs. ECC memory detects and corrects single-bit errors, significantly lowering the rate of uncorrectable errors that trigger MCEs; it does so by storing additional check bits alongside the data. Regular memory scrubbing, performed by hardware controllers, periodically reads and rewrites memory contents to correct latent single-bit errors before they accumulate into uncorrectable multi-bit faults. Redundancy techniques such as RAID arrays for storage and clustering for compute nodes provide failover capabilities, ensuring system availability even if an MCE causes a node failure. Firmware and microcode updates address CPU errata that may spuriously generate MCEs, with vendors such as Intel and AMD releasing patches to mitigate known hardware defects.

Monitoring approaches enable proactive intervention by tracking MCE patterns. Threshold-based alerts, configured in tools like rasdaemon, notify administrators when error rates exceed predefined limits, such as multiple corrected errors on the same memory bank, triggering actions like logging or component isolation.
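The threshold idea can be sketched as a sliding-window counter keyed by error location. The Python class below is an independent illustration and does not reproduce rasdaemon's own configuration or logic; the class name, the default limit of three errors, and the one-hour window are all assumptions.

```python
from collections import defaultdict, deque


class CorrectedErrorThreshold:
    """Flag a location once `limit` corrected errors fall within `window_seconds`."""

    def __init__(self, limit=3, window_seconds=3600.0):
        self.limit = limit
        self.window = window_seconds
        self._events = defaultdict(deque)  # location -> recent timestamps

    def record(self, location, timestamp):
        """Record one corrected error; return True if the threshold is reached."""
        events = self._events[location]
        events.append(timestamp)
        # Discard events that are older than the window relative to this one.
        while events and timestamp - events[0] > self.window:
            events.popleft()
        return len(events) >= self.limit


tracker = CorrectedErrorThreshold(limit=3, window_seconds=60)
for t in (0, 10, 20):
    alert = tracker.record("bank8", t)
print(alert)  # True: three errors on the same bank within 60 seconds
```

Keying the counter by location (a bank, DIMM label, or page address) keeps scattered, unrelated events from raising an alert while still catching repeated errors on a single component, which is the pattern that usually signals degradation.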
In cloud environments, machine learning models analyze historical sensor data to predict hardware failures, enabling preemptive maintenance in large-scale RAS (Reliability, Availability, and Serviceability) setups.

Best practices include configuring BIOS/UEFI settings to enable MCE reporting and logging, so that the operating system receives full hardware error details for proper handling, as seen in options like processor error reporting on Dell and HPE systems. Tools like mce-inject simulate MCEs in a controlled manner, so that recovery mechanisms can be validated without risking production data.
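The page retirement and CPU offlining actions described in this section ultimately come down to writes to sysfs control files. The Python sketch below shows the two writes, assuming a kernel built with memory-failure and CPU hotplug support and a process running as root; the function names are hypothetical, and the control-file paths are parameters so the logic can be exercised against ordinary files.

```python
from pathlib import Path

# Kernel control files (require root; availability depends on kernel config).
SOFT_OFFLINE = Path("/sys/devices/system/memory/soft_offline_page")
CPU_ROOT = Path("/sys/devices/system/cpu")


def soft_offline_page(phys_addr, control_file=SOFT_OFFLINE):
    """Ask the kernel to migrate data off the page containing phys_addr
    and stop using that page (soft offlining)."""
    Path(control_file).write_text(f"{phys_addr:#x}\n")


def offline_cpu(cpu, cpu_root=CPU_ROOT):
    """Take logical CPU number `cpu` offline via the CPU hotplug interface."""
    (Path(cpu_root) / f"cpu{cpu}" / "online").write_text("0\n")
```

On a real system, soft_offline_page(0x12345000) would request retirement of the page containing that physical address, and offline_cpu(3) would take logical CPU 3 out of service. Both calls raise an OSError if the kernel rejects the request, so a production script would need to handle and log that case; some platforms also do not allow CPU 0 to be taken offline.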
