
Soft error

A soft error is a transient fault in electronic circuits, particularly in integrated circuits and memory devices, caused by ionizing radiation that temporarily alters the state of a logic gate or data bit without inflicting permanent hardware damage. Unlike hard errors, which result from manufacturing defects or wear-out, soft errors are random and non-destructive, manifesting as single event upsets (SEUs) where a particle strike generates electron-hole pairs that flip bit values until overwritten by new data. These errors have become increasingly prevalent in modern computing due to shrinking transistor sizes, which reduce the critical charge required to upset a node, elevating the overall soft error rate (SER) in systems.

The main causes of soft errors stem from terrestrial radiation sources, including alpha particles emitted by trace uranium and thorium impurities in chip packaging materials, as well as high-energy neutrons from cosmic rays that collide with silicon nuclei to produce ionizing secondary particles. Low-energy neutrons can also react with boron-10 in certain layers, such as borophosphosilicate glass (BPSG), generating additional charge. In space or high-altitude environments, direct exposure to galactic cosmic rays exacerbates the risk, but even ground-level systems face failure rates measured in failures in time (FIT), with advanced chips potentially exceeding 50,000 FIT per device. Soft errors can propagate through logic or storage elements, leading to silent data corruption (SDC), where erroneous results go undetected, or detected unrecoverable errors (DUE) that trigger system halts. This vulnerability is particularly acute in memory-intensive applications like servers and embedded systems, where scaling trends have significantly increased the system-level SER, typically by a factor of approximately 2-4 per technology generation due to higher density and reduced critical charge. In field-programmable gate arrays (FPGAs), configuration memory is especially susceptible, as bit flips can alter entire circuit behaviors.

To counter these issues, mitigation techniques span process, design, and system levels, including the purification of materials to minimize alpha emissions and the adoption of silicon-on-insulator (SOI) substrates to limit charge collection. Error-correcting codes (ECC) in dynamic RAM (DRAM) and static RAM (SRAM) can reduce effective SER by over 10,000 times, while architectural redundancies like triple modular redundancy (TMR) provide fault masking in critical logic paths. More recent innovations, such as brain-inspired hybrid-grained scrubbing in SRAM-based FPGAs, combine fine- and coarse-grained repairs to achieve 100% correction of single- and double-bit upsets with reduced recovery time. These approaches balance reliability against performance and cost overheads, essential for applications in aerospace, data centers, and high-reliability computing.

Fundamentals

Definition and Characteristics

A soft error is a transient malfunction in digital circuitry caused by external factors, resulting in incorrect data or signals without any physical damage to the underlying hardware. These errors affect the state of memories, sequential elements, or logic circuits, manifesting as temporary disruptions that do not stem from design mistakes, construction defects, or permanent failures. In contrast to hard errors, which cause irreversible damage through mechanisms like manufacturing flaws or wear-out, soft errors are non-destructive and typically resolve upon data rewrite or system reset.

Key characteristics of soft errors include their probabilistic nature, arising randomly from environmental influences, and their potential to propagate as single-event upsets (SEUs)—where a single bit in memory or a register flips—or, less commonly, multiple-bit errors across adjacent elements. Unlike firm errors, which involve recoverable but repeatable faults often requiring reconfiguration (such as in programmable logic), soft errors do not persist after a rewrite or reset and pose risks primarily through undetected propagation in unmitigated systems. The induction of a soft error occurs when an external event deposits sufficient charge to exceed the critical charge threshold of the affected node, though detailed thresholds are analyzed separately.

Soft errors were first systematically observed in the 1970s during testing of dynamic random-access memory (DRAM) chips, where random data corruptions appeared without evident hardware degradation. A pivotal report came in 1978 from Intel researchers T.C. May and M.H. Woods, who documented these phenomena in DRAM devices and identified alpha particles as the underlying trigger in a foundational study presented at the International Reliability Physics Symposium. Examples include bit flips in memory cells, altering a stored value, or transient state changes in logic gates, which can yield erroneous outputs until the signal corrects naturally.

Critical Charge

The critical charge, denoted as Q_{\text{crit}}, is the minimum amount of charge that must be deposited at a sensitive node within a circuit to produce a voltage disturbance sufficient to alter the logic state, thereby inducing a soft error. This threshold quantifies the device's vulnerability to transient disturbances from charge deposition, serving as a key parameter in assessing susceptibility to single-event effects. The value of Q_{\text{crit}} is fundamentally determined by the relation Q_{\text{crit}} = C \times V_{\text{DD}}, where C represents the capacitance of the affected node and V_{\text{DD}} is the supply voltage. In dynamic nodes, such as those found in memory elements, this basic formula is adjusted to account for leakage currents and charge recovery dynamics, often requiring circuit-level simulations to capture the effective threshold more accurately.

Several factors influence Q_{\text{crit}}. Technology node scaling in sub-micron processes reduces Q_{\text{crit}} primarily through diminished node capacitance and lowered supply voltages, exacerbating soft error risks as feature sizes shrink. Storage type also plays a role: the capacitor-backed dynamic storage in DRAM cells generally yields a higher Q_{\text{crit}} than the static storage in SRAM cells, owing to differences in charge retention mechanisms. Additionally, Q_{\text{crit}} exhibits a temperature dependence, where elevated temperatures can decrease it in certain devices by increasing leakage, though the effect varies by technology and may improve immunity in others through reduced charge collection efficiency.

Empirical measurement of Q_{\text{crit}} typically involves heavy-ion testing, where devices are irradiated with controlled particle beams at accelerators to induce upsets and quantify the charge required for state flips. Complementary simulation-based approaches, such as circuit-level (e.g., SPICE) modeling of charge injection pulses, further refine these values by incorporating process-specific parameters.
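As a concrete illustration of the Q_{\text{crit}} = C \times V_{\text{DD}} relation, the following sketch compares a node's critical charge with the charge generated by a given energy deposit in silicon, using the standard figure of 3.6 eV per electron-hole pair; the node capacitance, supply voltage, and deposited energy are hypothetical example values, not measurements of any particular device.

```python
# Illustrative Q_crit estimate vs. charge deposited by a particle strike.
E_PAIR_SI = 3.6        # eV per electron-hole pair in silicon (standard value)
Q_E = 1.602e-19        # elementary charge, coulombs

def q_crit(node_capacitance_f, vdd_v):
    """Simple Q_crit = C * V_DD model; real thresholds need circuit simulation."""
    return node_capacitance_f * vdd_v  # coulombs

def deposited_charge(energy_mev):
    """Charge generated by depositing `energy_mev` locally in silicon."""
    pairs = energy_mev * 1e6 / E_PAIR_SI
    return pairs * Q_E

# Hypothetical 1 fF node at 1.0 V vs. 0.1 MeV deposited at the node.
qc = q_crit(1e-15, 1.0)          # 1 fC
qd = deposited_charge(0.1)       # ~4.5 fC
print(f"Q_crit = {qc*1e15:.2f} fC, deposited = {qd*1e15:.2f} fC, upset = {qd > qc}")
```

The comparison shows why scaling matters: halving either the capacitance or the supply voltage halves Q_{\text{crit}}, so strikes that were once harmless begin to exceed the threshold.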

Causes

Radioactive Impurities

Soft errors arising from radioactive impurities in semiconductor devices are predominantly caused by alpha particles emitted from the natural decay of trace amounts of uranium (primarily ^{238}U) and thorium (primarily ^{232}Th) present in packaging materials, including lids, lead frames, and solder bumps. These impurities, often at concentrations of parts per trillion to parts per million, release alpha particles with typical energies of 4–9 MeV during decay chains. Upon emission, these positively charged helium nuclei traverse the packaging and enter the die, where they lose energy through ionization of silicon atoms along a short, straight track (typically 10–50 μm in length). This process generates thousands of electron-hole pairs per micrometer, depositing a localized charge that can perturb sensitive nodes if it exceeds the circuit's critical charge threshold. The density of this charge deposition is quantified by the particle's linear energy transfer (LET), which for alpha particles in silicon ranges from about 0.5 to 2 MeV·cm²/mg, resulting in high track densities compared to lighter particles.

The phenomenon was first systematically identified in 1978 by Intel researchers T.C. May and M.H. Woods, who linked unexpected single-bit errors in DRAMs (including the 4-kb Intel 2107) to alpha particles from packaging impurities, rather than external cosmic radiation. This discovery prompted immediate industry action, leading to the development and adoption of low-alpha materials in the early 1980s, including specially purified ceramics and lead-tin solders refined to minimize uranium and thorium content. These mitigation strategies reduced alpha emission rates from typical levels exceeding 100 alphas/cm²/hour to below 0.002 alphas/cm²/hour in qualified materials, lowering soft error rates attributable to alpha particles by 2–4 orders of magnitude and shifting the dominant terrestrial error source to cosmic neutrons by the mid-1990s.

In modern contexts, while low-alpha and ultra-low-alpha (ULA) packaging has become standard, residual impurities continue to pose risks in advanced nanoscale nodes (e.g., below 65 nm), where reduced critical charges amplify susceptibility to even low-flux alpha events. For example, during the 1990s transition to sub-micron processes, manufacturers reported elevated soft error rates in on-chip caches linked to trace alpha emissions from packaging and interconnect materials, necessitating further purification protocols. Ongoing challenges include ensuring ULA compliance in lead-free solders and flip-chip assemblies, where inadvertent contamination can still elevate error rates in high-density memory and logic.

Cosmic Ray Interactions

Galactic cosmic rays, primarily consisting of high-energy protons and heavy ions originating from outside the solar system, interact with the Earth's atmosphere to produce cascades of secondary particles, including neutrons and protons. These secondary particles, generated through processes such as nuclear spallation and fragmentation in the air shower, can penetrate to ground level and induce soft errors in semiconductor devices.

The primary mechanism by which these secondaries cause soft errors involves nuclear reactions within silicon-based materials of integrated circuits. High-energy neutrons, typically in the range of 10-100 MeV, collide with silicon nuclei, triggering spallation events that eject charged particles such as alpha particles, protons, and heavy recoiling ions. These charged fragments create dense ionization tracks, depositing sufficient charge to alter the state of memory cells or logic elements if the deposited charge exceeds the critical charge threshold.

Soft error rates induced by cosmic rays vary significantly with altitude and geographic location due to atmospheric and geomagnetic effects. At higher altitudes, such as the 10 km typical for commercial aircraft, the neutron flux increases dramatically—several hundred times higher than at sea level—leading to elevated error rates in avionics systems. Similarly, rates are higher in polar regions, where reduced geomagnetic shielding allows more cosmic rays to reach the atmosphere, compared to equatorial areas.

Notable examples of cosmic ray-induced soft errors include upsets observed in satellite electronics, such as the 1980s anomaly in NASA's Tracking and Data Relay Satellite (TDRS-1), where high-energy particles caused multiple single-event upsets in the attitude control system's memory. In avionics, a 2008 incident involving Qantas Flight 72, an Airbus A330, saw an uncommanded descent suspected to stem from cosmic ray-induced errors in the air data inertial reference unit. Ground-based observations trace back to 1970s IBM studies, which first linked atmospheric neutrons from cosmic ray cascades to soft errors in dynamic random-access memory (DRAM) chips, demonstrating error rates correlating with measured neutron flux.

Other Environmental Factors

Thermal neutrons contribute to soft errors through their capture by boron-10 isotopes, which are commonly used in p-type doping for semiconductor devices. This nuclear reaction, ^{10}\text{B} + n \rightarrow ^{7}\text{Li} + \alpha, releases high-energy alpha particles and lithium-7 ions that deposit charge sufficient to flip bits in memory cells, particularly in SRAMs at deep submicron scales. The high thermal neutron capture cross-section of boron-10 (approximately 3840 barns) makes this a notable concern in environments with even low neutron fluxes, such as sea-level terrestrial settings. To address this, the semiconductor industry adopted boron isotope purification starting in the early 2000s, enriching sources with boron-11 to reduce the boron-10 content and thereby lower soft error rates from this mechanism.

Electromagnetic interference (EMI) represents a rare, non-ionizing source of soft errors, primarily through induced noise rather than direct charge deposition. Power line transients or radiofrequency signals can couple into integrated circuits, causing voltage glitches that mimic single-event transients and lead to temporary bit flips, especially in sensitive analog or mixed-signal components. These effects are typically mitigated by shielding and filtering, but in unshielded systems exposed to high-EMI environments, such as near high-voltage lines, they can occasionally contribute to system-level faults without permanent damage.

Manufacturing defects arising from process variations create latent weak spots that amplify soft error susceptibility in advanced nodes. In FinFET technologies below 22 nm, fluctuations in fin dimensions, gate work functions, or doping profiles can lower the critical charge threshold, making cells more prone to upset from even low-energy particles. For instance, weak resistive defects in SRAM arrays, introduced during fabrication, increase the likelihood of read/write failures under radiation stress, effectively turning process-induced variability into a reliability hazard.

Emerging research highlights soft error risks in neuromorphic hardware, where radiation interactions can disrupt neuron and synapse dynamics, though photon-specific effects remain underexplored. Post-2020 studies indicate that space-grade neuromorphic processors are susceptible to single-event upsets from cosmic rays and neutrons, potentially causing erroneous synaptic weights or neuron firings that propagate through the network. In photonic neuromorphic systems, radiation-induced transients in optical interconnects may produce analogous errors, underscoring the need for radiation-hardened designs in these brain-inspired architectures.
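To make the cross-section figure concrete, the sketch below estimates a thermal-neutron capture rate from the textbook relation R = Φ × σ × N (flux times cross-section times target atom count); the flux and the number of boron atoms in the sensitive volume are hypothetical placeholders rather than measured device values.

```python
# Rough reaction-rate estimate for thermal-neutron capture on boron-10
# (illustrative numbers; actual fluxes and doping vary widely by device and site).
SIGMA_B10 = 3840e-24        # capture cross-section, cm^2 (3840 barns, from the text)
B10_FRACTION = 0.20         # natural abundance of boron-10

def captures_per_second(thermal_flux, boron_atoms):
    """R = flux * cross-section * number of B-10 atoms in the sensitive volume."""
    return thermal_flux * SIGMA_B10 * boron_atoms * B10_FRACTION

# Hypothetical: thermal flux of ~4 n/cm^2/hr and 1e12 boron atoms
# in a chip's BPSG/doped layers.
flux = 4.0 / 3600.0         # neutrons per cm^2 per second
rate = captures_per_second(flux, 1e12)
print(f"~{rate:.3e} captures/s (~{rate * 3.156e7:.2e} per year)")
```

Even with the enormous cross-section, the absolute rate per chip is tiny; the hazard arises because each capture deposits enough localized charge to upset a scaled cell, and fleets of devices accumulate many captures.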

Mechanisms and Effects

In Storage Elements

Soft errors in storage elements, such as those found in static random-access memory (SRAM) and dynamic random-access memory (DRAM), primarily manifest as single-event upsets (SEUs), where ionizing particles from cosmic rays deposit charge that flips the logical state of individual bits in memory cells. These upsets are transient and non-destructive, altering stored data until it is overwritten or refreshed, and they represent the dominant form of soft error in such components. In SRAM cells, which rely on cross-coupled inverters to maintain state, SEUs disrupt the voltage balance at sensitive nodes, leading to bit inversion.

A single particle strike can also induce multiple-bit upsets (MBUs), corrupting several adjacent bits within the same memory array due to charge sharing or track propagation in the substrate. This phenomenon has been observed in high-density DRAMs, where experimental analysis revealed novel MBU patterns in 16 Mbit and 64 Mbit devices, with error clusters spanning multiple cells from a single impact. MBUs complicate error handling because they exceed the capacity of simple single-bit correction mechanisms, though their occurrence rate is typically hundreds to thousands of times lower than that of single-bit upsets.

In sequential storage elements like latches and flip-flops, which form the basis of registers and pipeline stages in processors, SEUs alter the held state and propagate the erroneous value to downstream logic on the next clock cycle. Such state changes can manifest as timing anomalies, including setup or hold time violations at receiving elements if the upset coincides with critical clock edges, potentially triggering pipeline stalls to maintain system integrity. Propagation of these errors through the pipeline can lead to corrupted instructions or data, amplifying the impact in compute-intensive applications.

SRAM storage elements exhibit higher vulnerability to soft errors compared to DRAM, primarily due to their lower critical charge (Qcrit), the minimum charge required to induce an upset, which decreases with technology scaling and reduced supply voltages. In contrast, DRAM cells maintain higher Qcrit through dedicated storage capacitors, often enhanced by trench or stacked structures, making them relatively more resilient despite their charge-based retention mechanism. Large on-chip caches, composed of vast SRAM arrays, are particularly prone to error bursts, as the sheer volume of bits—often tens of megabits—increases the probability of multiple concurrent upsets during high-radiation events. Historical measurements of unmitigated SRAM at sea level in the pre-2000s era reported soft error rates on the order of 1000 failures in time (FIT) per megabit, equivalent to roughly one upset per 10^12 bit-hours under terrestrial exposure. These rates highlighted the growing reliability challenges as memory densities increased, with system-level impacts becoming evident in multi-gigabit configurations like those in supercomputers.
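The per-megabit FIT figures above translate directly into expected upset counts for a given array. A minimal sketch, assuming the historical 1000 FIT/Mbit rate and a hypothetical 64 Mbit cache:

```python
# Convert a per-megabit FIT rate into expected upsets for a memory array
# (illustrative; 1000 FIT/Mbit is the historical figure cited in the text).
HOURS_PER_YEAR = 8760

def expected_upsets_per_year(fit_per_mbit, size_mbit):
    """FIT = failures per 1e9 device-hours; scale by array size and one year."""
    fit_total = fit_per_mbit * size_mbit
    return fit_total * HOURS_PER_YEAR / 1e9

upsets = expected_upsets_per_year(1000, 64)
print(f"{upsets:.2f} expected upsets per device-year")
# ~0.56/year per device looks benign, but a fleet of 10,000 such devices
# would see roughly 15 upsets per day in aggregate.
```

This fleet-level arithmetic is why unprotected caches that seem reliable on a single desktop become a dominant failure mode in supercomputers and data centers.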

In Combinational Logic

In combinational logic, soft errors primarily arise from single-event transients (SETs), where ionizing particles deposit charge in sensitive nodes of transistors, generating temporary voltage pulses that disrupt normal signal propagation. These pulses occur when the deposited charge exceeds the critical charge threshold of the node, temporarily altering the logic state until the circuit recovers. Unlike permanent damage, SETs are non-destructive and last only nanoseconds, but they can lead to logical errors if the transient propagates to downstream gates without being masked.

The likelihood of an SET causing a detectable error depends on several factors, including the pulse width relative to the clock period and the circuit's masking characteristics. If the pulse duration aligns with the latching window of a downstream storage element, it may be captured as a soft error; shorter pulses are often filtered out by electrical or logical masking. Designs with high fanout, where a single node drives multiple subsequent gates, increase susceptibility because the transient can branch along several paths and amplify propagation risks across broader circuit sections. For example, an SET striking a gate in an adder can invert a bit in the sum or carry output, resulting in erroneous arithmetic computations that propagate through the processor pipeline.

Studies on modern processors indicate that soft errors in combinational logic account for 20-50% of total soft error events, comparable to those in unprotected storage elements at advanced technology nodes like 50 nm and beyond. In contrast to storage elements, where errors persist as bit flips until corrected or overwritten, SETs in combinational logic are inherently transient and vanish without latching, though in deep pipelines they can cascade, affecting sequential instructions and amplifying overall system vulnerability.
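A rough way to see the latching-window effect is a first-order model in which an SET is captured only if it overlaps a flip-flop's setup-and-hold window within the clock period; the pulse width, window width, and clock periods below are assumed example values, and real circuits add electrical and logical masking on top of this.

```python
# First-order model of SET capture: a pulse of width w arriving at a flip-flop
# is latched if it overlaps the capture window (setup + hold time). Assumes a
# uniformly distributed arrival time within one clock period (a simplification).
def capture_probability(pulse_width_ps, clock_period_ps, window_ps):
    overlap = pulse_width_ps + window_ps   # span of arrival times hitting the window
    return min(1.0, overlap / clock_period_ps)

# Hypothetical values: a 100 ps SET and a 30 ps capture window.
for period in (2000, 1000, 500):           # 500 MHz, 1 GHz, 2 GHz clocks
    p = capture_probability(100, period, 30)
    print(f"clock period {period} ps -> per-cycle capture probability {p:.3f}")
```

The model makes explicit why timing masking filters most SETs: only the fraction of arrival times that overlap a latching edge's window ever becomes an architecturally visible error.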

Mitigation and Design

Detection Approaches

Detection approaches for soft errors focus on identifying transient faults, such as bit flips in memory or logic, during operation to enable logging, containment, or subsequent recovery without immediate correction. These methods typically incur lower overhead than full error correction schemes and are essential in reliability-critical environments where undetected errors could propagate silently. Common strategies leverage redundancy in data representation or monitoring mechanisms to flag anomalies promptly.

Parity checks provide a lightweight technique for quick detection by appending a parity bit to data words, allowing identification of odd numbers of bit flips indicative of soft errors. For instance, cross-parity checks in storage elements, such as registers, enable on-line detection of multiple-bit errors by computing parity across rows and columns of bit arrays, signaling faults when inconsistencies arise. More advanced implementations combine parity with re-execution; the P-DETECTOR approach integrates parity checking into a low-level re-execution mechanism, detecting 93.76% of control-flow faults and 87.89% of data-flow errors in evaluations while minimizing performance impact. Similarly, parity-based product codes extend this to multi-bit detection in memory systems, where parity computations across codewords flag errors without decoding the full code.

In error-correcting codes (ECC), syndrome decoding can be used solely for detection by computing the syndrome bits from received data and check bits; a non-zero syndrome indicates an error's presence, even if the exact error location is not resolved for correction. This partial use of ECC hardware, such as Hamming or BCH codes, detects single- and some multi-bit soft errors in SRAM or registers with minimal additional logic beyond the standard ECC generator.

Hardware-based detection often employs built-in self-test (BIST) circuits integrated into processors or memory blocks to periodically scan for soft errors. Memory BIST, for example, applies march algorithms to SRAM arrays, detecting stuck-at or transient faults caused by radiation; Texas Instruments' C2000 CPU implementation uses this to identify soft errors from alpha particles or voltage glitches during runtime self-tests. Processor-level error counters, part of Intel's Machine Check Architecture (MCA) introduced in the Pentium era and enhanced in Xeon processors, log corrected errors like cache or bus parity mismatches from soft events, enabling system administrators to monitor error rates via model-specific registers (MSRs). MCA banks record details such as error type (e.g., uncorrectable ECC) and location, facilitating proactive maintenance without halting operation for single-event upsets.

Software techniques complement hardware by performing runtime monitoring through checksums or kernel-level checks. Algorithm-based fault tolerance (ABFT) uses checksums to verify computations; for embedded systems, software-implemented checksums on critical variables detect bit flips in registers or memory post-operation, with detection coverage exceeding 99% for single errors in matrix multiplications. In operating systems, machine check handlers in kernels process interrupts to detect and log soft errors in CPU components, using checksums on kernel data structures for anomaly flagging. These methods, often inserted via compiler directives, monitor data and control flow during execution, identifying deviations from expected checksum values.

Trade-offs in detection approaches balance detection coverage against area, power, and performance overheads. Parity and checksum methods add 5-10% area overhead in VLSI designs but achieve high single-error detection rates (>95%) with negligible latency, though multi-bit errors may evade them. BIST and MCA logging introduce periodic test cycles (e.g., a 1-5% power increase) but provide comprehensive coverage in processors, with error counters enabling trend analysis over time. Software checksums impose 10-20% runtime overhead in kernels but offer flexibility for non-hardware-protected regions.

In space systems, sensors like embedded particle detectors enhance detection by directly monitoring ionizing particles; CERN's SpaceRadMon uses floating-gate transistors as radiation-sensitive elements to correlate particle hits with soft error occurrences in FPGAs, achieving sub-millisecond event localization with low power (microwatts). These sensors, integrated into self-adaptive systems, detect cosmic ray-induced upsets in onboard electronics, trading minimal mass overhead for improved fault attribution in orbital environments.
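A minimal sketch of the parity idea described above: an even-parity bit stored alongside a data word flags, but does not locate, any odd number of bit flips. The word width and data values are arbitrary examples.

```python
# Minimal parity-based detection: an even-parity bit over a data word flags
# any odd number of bit flips (a single SEU is the common case).
def parity_bit(word: int, width: int = 32) -> int:
    """Even parity: XOR of all bits, so word + parity has even total weight."""
    return bin(word & ((1 << width) - 1)).count("1") & 1

def check(word: int, stored_parity: int) -> bool:
    """True if the word is still consistent with its stored parity bit."""
    return parity_bit(word) == stored_parity

data = 0xDEADBEEF
p = parity_bit(data)          # computed once, at write time
corrupted = data ^ (1 << 7)   # simulate a single-event upset on bit 7
print(check(data, p))         # True  -> no error detected
print(check(corrupted, p))    # False -> soft error flagged (but not located)
```

Because the scheme only detects, a system pairing parity with re-execution or rollback (as P-DETECTOR does) can still recover, at far lower hardware cost than full correction.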

Correction Techniques

Correction techniques for soft errors primarily involve error-correcting codes (ECC) that enable automatic recovery from detected errors by reconstructing the original data from redundant information. These methods build on detection mechanisms to not only identify but also repair bit flips, ensuring data integrity in memory and logic circuits.

Hamming codes form the foundation of single-error correction (SEC), achieving this capability through a minimum Hamming distance of 3, which allows the decoder to distinguish and correct any single-bit error while detecting double-bit errors in extended variants. For a 64-bit data word, a Hamming-based SEC code requires 7 parity bits, but the widely used single-error-correcting, double-error-detecting (SECDED) variant adds an overall parity bit, totaling 8 check bits to form a 72-bit codeword. The parity bits are calculated such that each covers a unique combination of data and parity positions, enabling syndrome computation to pinpoint the erroneous bit during decoding.

For multi-bit errors common in denser memories, Bose-Chaudhuri-Hocquenghem (BCH) codes extend correction capabilities to multiple bits per codeword, using cyclic polynomials over finite fields to generate parity checks that correct up to t errors where the code distance is at least 2t+1. BCH codes are particularly effective in flash and high-density memories, where clustered soft errors from cosmic rays may affect several adjacent bits.

Implementation of these codes often includes scrubbing in dynamic random-access memory (DRAM), a process where the system periodically reads data, applies ECC decoding to detect and correct errors, and writes back the corrected values to prevent error accumulation. In processors, rollback recovery uses checkpoints to restore the pre-error state upon detection, re-executing instructions with corrected data to maintain computational integrity.

ECC has been standard in server memory since the early 1990s, transitioning from parity checks to full correction to handle increasing soft error rates in larger memory capacities. This adoption significantly reduces the effective uncorrectable error rate by correcting all single-bit soft errors, which dominate in typical environments. More recently, low-density parity-check (LDPC) codes have emerged for high-reliability applications like solid-state drives and radiation-tolerant systems, offering superior multi-bit correction with iterative decoding that approaches theoretical limits while minimizing overhead.
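The syndrome mechanism can be shown at toy scale with a Hamming(7,4) code, the same principle that SECDED ECC applies to 72-bit codewords: each parity bit covers a distinct subset of positions, so the recomputed checks spell out the 1-based position of a single flipped bit. This sketch encodes four data bits, injects a single-bit upset, and uses the syndrome to locate and correct it.

```python
# Minimal Hamming(7,4) sketch of syndrome-based single-bit correction.
def encode(d):                     # d: four data bits [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]        # covers codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]        # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]        # covers positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]   # positions 1..7

def correct(c):                    # c: 7-bit codeword, possibly with one flip
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4           # equals the 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1                  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], syndrome

code = encode([1, 0, 1, 1])
code[4] ^= 1                                  # simulate an SEU at position 5
data, pos = correct(code)
print(data, "corrected at position", pos)     # [1, 0, 1, 1] corrected at position 5
```

SECDED hardware works the same way at 72-bit width, with the extra overall parity bit distinguishing a correctable single-bit error from a detectable-but-uncorrectable double-bit error.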

Architectural Strategies

Architectural strategies for mitigating soft errors involve higher-level design decisions that enhance system resilience through redundancy, process modifications, and specialized hardening techniques. These approaches aim to reduce susceptibility across the entire system without relying solely on low-level error correction mechanisms.

Redundancy techniques replicate computation or storage to mask errors. Triple modular redundancy (TMR) employs three identical modules performing the same operation, with a voting circuit selecting the output that appears at least twice, thereby tolerating a single faulty module due to a soft error. Time redundancy in pipelines duplicates computations temporally, re-executing instructions to detect discrepancies, while space redundancy uses parallel paths for comparison, offering improved tolerance in asynchronous self-timed pipelines compared to synchronous ones.

Process-level tweaks adjust fabrication to increase the critical charge (Qcrit), the minimum charge needed to flip a state. Larger feature sizes maintain higher Qcrit by preserving capacitance, countering the exponential SER increase in scaled logic circuits. Silicon-on-insulator (SOI) fabrication reduces charge collection from particle strikes by isolating the active silicon layer, achieving up to a 5x SER reduction in such devices. Low-power modes, such as high-resilience execution configurations, further decrease vulnerability by optimizing pipeline behavior, reducing soft error rates by about 11% in processors like the ARM Cortex-R5 without area or performance penalties.

At the system level, radiation-hardened (rad-hard) designs incorporate hardened components and architectures for extreme environments. NASA has utilized rad-hard electronics since the 1980s to combat soft errors in spacecraft, employing strategies like TMR integration and shielding to ensure reliability during long-duration missions exposed to cosmic rays.

These strategies involve significant trade-offs in area, power, and performance. TMR triples hardware resources, increasing costs, while SOI and larger features may limit density and speed. In automotive electronic control units (ECUs), mitigation like dynamic voltage scaling balances reliability against power overheads, as unoptimized schemes can inflate area and delay, but analytical models enable efficient design for compliance.
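The voting logic at the heart of TMR reduces to a bitwise two-of-three majority. A minimal sketch follows, with the replicas and the injected upset purely illustrative; real TMR uses three physically separate hardware modules so a single particle strike cannot corrupt more than one copy.

```python
# Minimal TMR sketch: three replicas compute the same result and a bitwise
# majority vote masks a single upset in any one replica.
def majority3(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)

def tmr_execute(fn, x):
    """Run three replicas and vote. Here the replicas are identical software
    calls; in hardware they are separate modules feeding a voter circuit."""
    r1, r2, r3 = fn(x), fn(x), fn(x)
    r2 ^= 1 << 3                     # simulate a soft error flipping bit 3 of one replica
    return majority3(r1, r2, r3)

square = lambda v: v * v
print(tmr_execute(square, 12))       # 144, despite the injected single-replica upset
```

The sketch also makes the cost visible: the computation runs three times, which is exactly the area/power tripling cited in the trade-off discussion above.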

Quantification

Soft Error Rate Metrics

The soft error rate (SER) is the primary metric for quantifying the frequency of soft errors in semiconductor devices, defined as the number of errors per unit time under specified conditions. It is commonly expressed in failures in time (FIT), where 1 FIT equals one failure in 10^9 device-hours (for a single device, a mean of one failure in roughly 114,000 years of continuous operation). This unit allows comparison across devices and technologies, with SER often normalized per bit or per device to assess reliability.

The SER can be calculated using the formula \text{SER} = \Phi \times \sigma \times N, where \Phi is the particle flux (particles per unit area per unit time), \sigma is the cross-section representing the effective sensitive area susceptible to particle strikes, and N is the number of susceptible nodes or bits. This model captures the probabilistic nature of radiation-induced errors, with flux typically referencing sea-level atmospheric neutrons (around 13 neutrons/cm²/hour for high-energy particles >10 MeV). Another related metric is the mean time between failures (MTBF), calculated as \text{MTBF} = 10^9 / \text{SER} hours for SER expressed in FIT, providing a practical estimate of system reliability.

Benchmarks for SER vary by technology and particle type, but sea-level neutron-induced SER for 90 nm SRAM cells is approximately 500 FIT per Mbit, reflecting typical measurements from accelerated testing scaled to terrestrial conditions. For context, uncorrected embedded SRAM in advanced processors can exhibit SER values leading to chip-level rates of thousands of FIT, necessitating mitigation for high-reliability applications.

With technology scaling, SER has shown a marked increase due to reduced critical charge and node capacitance, exacerbating susceptibility to particle strikes. Projections indicate an approximately 7-fold rise in per-device SER from 130 nm to 22 nm nodes, driven by denser integration and lower voltages, though architectural and circuit-level optimizations can partially offset this trend. Scaling has continued beyond 22 nm, with per-bit SER roughly doubling per generation into the 3 nm nodes as of 2025, leading to even higher device-level rates without mitigation.
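A minimal sketch applying \text{SER} = \Phi \times \sigma \times N and the FIT/MTBF relation; the per-bit upset cross-section and array size are hypothetical example values, while the 13 neutrons/cm²/hour flux is the sea-level reference quoted above.

```python
# Illustrative SER = flux * cross-section * N and FIT <-> MTBF conversion.
def ser_fit(flux_per_cm2_hr, sigma_cm2_per_bit, n_bits):
    """SER in FIT: errors per 1e9 device-hours."""
    errors_per_hour = flux_per_cm2_hr * sigma_cm2_per_bit * n_bits
    return errors_per_hour * 1e9

def mtbf_hours(fit):
    """MTBF in hours for a rate given in FIT."""
    return 1e9 / fit

# Sea-level high-energy neutron flux ~13 /cm^2/hr (from the text); assume a
# hypothetical 1e-14 cm^2 per-bit upset cross-section and a 64 Mbit array.
fit = ser_fit(13, 1e-14, 64 * 2**20)
print(f"SER ~ {fit:.0f} FIT, MTBF ~ {mtbf_hours(fit) / 8760:.1f} years")
```

Running the numbers gives an SER of several thousand FIT for this single array, which is why unprotected multi-megabit memories dominate chip-level failure budgets.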

Influencing Factors

Soft error rates (SER) are significantly influenced by environmental conditions, with altitude playing a key role due to reduced atmospheric shielding against cosmic ray-induced neutrons. At sea level, the neutron flux is at its baseline, but it increases roughly exponentially with altitude; for instance, at 1.5 km altitude (e.g., Denver, Colorado), the flux is 3 to 5 times higher than at sea level, leading to correspondingly elevated SER in devices. At flight altitudes of approximately 10-12 km, the neutron flux can be roughly 300 to 500 times higher than at ground level, dramatically amplifying SER for avionics and onboard electronics. Geomagnetic latitude further modulates this effect, as higher latitudes experience greater cosmic ray penetration owing to weaker geomagnetic deflection; SER can be up to 2-3 times higher in polar regions compared to equatorial areas under nominal conditions.

Solar activity introduces temporal variability, particularly through events like solar flares and coronal mass ejections (CMEs) that produce ground-level enhancements (GLEs) in particle flux. During extreme solar flares, SER in susceptible devices can increase by 2-3 times the baseline rate, with dose rates rising up to 1250% in high-latitude regions during GLEs, posing risks to flight-critical systems. These spikes are short-lived but critical for applications in aviation and high-altitude operations, where particle showers from such events directly contribute to single event upsets (SEUs).

Technological parameters exacerbate SER as devices scale. With each successive CMOS node shrink, per-bit SER roughly doubles due to reduced critical charge and smaller feature sizes, resulting in an approximate 100-fold increase per decade as multiple generations occur. Voltage reduction, often employed for power efficiency, further heightens vulnerability; lowering the supply voltage by 10-20% can elevate SRAM SER by up to 40%, as the margin against upset diminishes. Clock speed exhibits an inverse relationship, where higher frequencies tend to reduce effective SER in sequential elements by shortening the window for transient errors to propagate to stable states, though this benefit plateaus in advanced nodes.

In application contexts, data centers and cloud environments face elevated SER exposure compared to consumer devices, driven by higher device density, continuous operation, and larger-scale deployments. Studies from the 2010s indicate that 12-45% of servers in large-scale data centers (e.g., Google's fleets) encounter at least one DRAM soft error annually, orders of magnitude more frequent than in typical consumer hardware due to aggregated exposure and lack of frequent resets. Virtualization in cloud infrastructures amplifies vulnerabilities, as virtualization layers propagate errors across virtual machines, with post-2010 analyses highlighting increased soft error susceptibility in multi-tenant setups. Emerging AI accelerators, such as GPUs and TPUs in the 2020s, show heightened SER in memory-intensive components like patch embeddings, where soft errors can degrade model accuracy by up to 10% in vision transformers under radiation exposure.

Modeling tools like CREME (Cosmic Ray Effects on Micro-Electronics) enable prediction of these factors by simulating particle fluxes and their modulation by altitude, latitude, and solar conditions. CREME96 and its updates provide SER estimates for terrestrial and space environments, incorporating geomagnetic and solar-cycle variations to forecast error rates in scaled technologies. These tools also reveal gaps in modeling, where 2020s GPU studies indicate underestimation of SER in high-density compute clusters without altitude-specific adjustments.
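The altitude dependence can be approximated with a barometric-depth model in the spirit of JEDEC's JESD89 approach, where flux grows exponentially as atmospheric depth decreases; the scale height and attenuation length below are approximate assumed values, and production analyses rely on calibrated tools like CREME96 rather than this first-order sketch.

```python
# Rough altitude scaling of atmospheric neutron flux: flux rises as the
# overlying atmospheric depth (g/cm^2) falls. Parameter values are approximate.
import math

SEA_LEVEL_DEPTH = 1033.0   # g/cm^2, atmospheric depth at sea level
SCALE_HEIGHT_KM = 8.4      # barometric scale height (approximation)
ATTEN_LENGTH = 131.3       # g/cm^2, neutron attenuation length (approximate)

def relative_neutron_flux(altitude_km):
    """Flux at altitude relative to sea level (rough first-order model)."""
    depth = SEA_LEVEL_DEPTH * math.exp(-altitude_km / SCALE_HEIGHT_KM)
    return math.exp((SEA_LEVEL_DEPTH - depth) / ATTEN_LENGTH)

for h in (0.0, 1.5, 11.0):   # sea level, Denver, airliner cruise altitude
    print(f"{h:5.1f} km -> ~{relative_neutron_flux(h):6.1f}x sea-level flux")
# -> ~1x, ~4x, and ~3e2x, consistent with the ranges quoted above.
```

Even this crude model reproduces the qualitative picture: a modest elevation like Denver's multiplies exposure severalfold, while cruise altitude multiplies it by hundreds.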

    The CRÈME site provides a tool for SEE rate prediction, using a new modular, physics-based model, and Monte Carlo modules to simulate particle effects.Getting Started · News · Help · Site MoveMissing: soft | Show results with:soft