Fault injection
Fault injection is a validation technique employed to assess the dependability and fault tolerance of computer systems by deliberately introducing controlled faults into hardware, software, or their models, and then observing the system's behavior in response to these perturbations.[1] The method enables engineers to evaluate how systems handle errors, to forecast potential failures, and to verify the effectiveness of fault-tolerance mechanisms, making it essential for designing reliable electronic and software systems in safety-critical domains.[1][2]

The origins of fault injection trace back to early computing efforts focused on error reduction, with foundational research emerging in the 1970s, such as studies on cosmic rays impacting satellite circuits.[3] By the 1990s, it had evolved into a systematic approach for validating dependability in fault-tolerant systems, complementing analytical modeling with empirical experimentation.[1] In the 2000s and beyond, advances in cloud computing spurred practical tools such as Netflix's Chaos Monkey for software resilience testing, while hardware-focused techniques gained traction in embedded and IoT systems.[2] Today, fault injection remains a cornerstone of dependability assessment, adapting to complex distributed environments.[4]

Fault injection techniques are broadly categorized into hardware-based, software-based, simulation-based, emulation-based, and hybrid approaches. Hardware-based methods involve physical interventions, such as voltage glitching or heavy-ion radiation, to induce real faults in circuits.[1][3] Software-based techniques, including code mutation and error emulation, insert faults directly into running programs to simulate hardware malfunctions without physical access.[2] Simulation and emulation leverage models (e.g., VHDL for hardware) or reconfigurable devices like FPGAs to accelerate testing while preserving timing accuracy.[1] Hybrid methods combine these for comprehensive analysis, such as pairing software injection with hardware monitoring.[1]

Applications of fault injection span reliability testing, security analysis, and performance evaluation across industries. In dependability engineering, it identifies design weaknesses, measures fault coverage, and studies error propagation in systems like aerospace and automotive controls.[1] For software systems, it anticipates worst-case scenarios, as seen in robustness testing of the Linux kernel or web services.[4] In cybersecurity, adversarial fault injection—via lasers, electromagnetic pulses, or clock disruptions—exploits vulnerabilities to bypass protections, extract cryptographic keys, or enable code execution, as demonstrated in attacks on secure boot processes or voting machines.[3] Overall, these uses underscore fault injection's role in enhancing system resilience against both accidental and intentional disruptions.[5]

Overview
Definition and Principles
Fault injection is the deliberate introduction of faults into a computer system or component to assess its dependability, robustness, and fault-handling mechanisms under simulated adverse conditions. The technique enables engineers to observe how the system responds to errors, failures, or stresses that might occur in real-world scenarios, thereby identifying weaknesses in design, implementation, or recovery processes.[6][2] At its core, fault injection rests on principles such as defining appropriate fault models—representations of potential errors like crash faults (where a component abruptly stops functioning) or timing faults (where delays or accelerations disrupt synchronization)—and selecting injection points, such as user inputs, code execution paths, memory states, or hardware interfaces. The primary objectives are to verify the effectiveness of fault tolerance mechanisms, to detect vulnerabilities that could lead to system failures, and to inform improvements in system architecture that enhance reliability and resilience. These principles ensure that injected faults mimic realistic error conditions while allowing controlled experimentation to measure metrics such as error detection rates and recovery times.[6][7]

The basic workflow of fault injection typically involves several steps: first, selecting and modeling faults based on the system's expected failure modes; second, injecting the faults at predetermined points during system operation under a representative workload; third, monitoring and observing the system's response, including any propagated errors or recovery actions; and finally, analyzing the results to evaluate performance and refine the design (a minimal sketch of this loop appears at the end of this section). This structured approach provides empirical data on system behavior that traditional testing may overlook.[2][6]

Unlike testing methods that rely on naturally occurring failures or that stress the system through overload without simulating specific errors, fault injection emphasizes the controlled introduction of artificial faults to proactively uncover and mitigate hidden issues in error handling and recovery. This distinction allows for targeted validation of fault tolerance without waiting for unpredictable real-world incidents.[8]
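The workflow can be made concrete with a minimal Python sketch; the toy target function, fault models, and pass/fail oracle below are invented for illustration and do not correspond to any particular tool.

    def target(workload):
        """Toy system under test: returns the sum of a list of sensor readings."""
        return sum(workload)

    def make_fault(index, factor=1000.0):
        """Step 1 - fault model: corrupt one input value by a large factor."""
        def fault(workload):
            workload[index] *= factor
            return workload
        return fault

    def run_campaign(workload, faults, oracle):
        """Steps 2-4: inject each fault, observe the response, classify the outcome."""
        outcomes = []
        for fault in faults:
            perturbed = fault(list(workload))   # step 2: inject at the chosen point
            observed = target(perturbed)        # step 3: monitor the system's response
            outcomes.append(oracle(observed))   # step 4: analyze against expected behavior
        return outcomes

    workload = [1.0, 2.0, 3.0, 4.0, 5.0]
    golden = target(workload)                   # fault-free reference run
    oracle = lambda observed: abs(observed - golden) < 1.0
    faults = [make_fault(i) for i in range(len(workload))]

    outcomes = run_campaign(workload, faults, oracle)
    print(f"{sum(outcomes)} of {len(outcomes)} injected faults were tolerated")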
Types of Faults Injected

Fault injection techniques categorize faults based on their behavior and impact on system components, enabling targeted testing of dependability in distributed and standalone systems. Common classifications include Byzantine faults, where a component exhibits arbitrary or malicious behavior, potentially sending conflicting messages to others; crash faults, characterized by sudden and permanent cessation of operations without further actions; omission faults, involving the failure to deliver or process messages; and timing faults, which manifest as deviations in expected timing, such as delays or accelerations in responses.[9][10] Byzantine faults are particularly challenging in distributed systems, as they can mimic correct behavior intermittently while undermining consensus, and are injected to evaluate protocols such as those in blockchain or replicated databases for resilience against up to one-third faulty nodes.[10] Crash faults simulate hardware or software hangs, testing recovery mechanisms such as checkpointing, and are used when assessing systems where abrupt stops are the primary concern, as in real-time embedded applications.[9] Omission faults target communication layers by dropping packets or ignoring inputs, revealing issues in message-passing protocols, as seen in controller area network (CAN) testing where undelivered frames disrupt coordination.[11] Timing faults, often induced to mimic clock drifts or scheduling anomalies, help validate time-sensitive systems such as avionics, where induced delays can expose synchronization failures without altering data integrity.[9]

In software domains, faults such as bit flips in memory—simulating radiation-induced errors—or invalid inputs like malformed arguments are injected to probe error handling in applications, with bit flips commonly altering single variables to trigger cascading exceptions in safety-critical code.[12] Hardware faults include stuck-at faults, where a circuit line is permanently fixed at logic 0 or 1, emulating manufacturing defects and used to assess digital logic reliability during design validation.[13] Network faults, exemplified by packet corruption that alters data bits in transit, test protocol robustness against transmission errors, often revealing vulnerabilities in TCP/IP stacks where corrupted payloads lead to retransmissions or session drops.[11]

Fault models provide abstract frameworks for these injections. The crash-stop model assumes a process halts indefinitely upon failure, making it suitable for evaluating non-recoverable scenarios in fault-tolerant clusters.[14] The fail-stop model extends this by incorporating fault detection, where the system announces the failure before stopping, facilitating testing of diagnostic mechanisms in environments like parallel computing where undetected crashes could propagate silently.[15] These models are selected based on the system's architecture; for instance, crash-stop is prevalent in simulations of large-scale data centers to quantify downtime impacts, while fail-stop suits environments requiring explicit error signaling, such as fault-tolerant operating systems. Examples include injecting a null pointer dereference in software to mimic a crash-stop failure, causing immediate program termination, or simulating a voltage spike in hardware models to induce a stuck-at fault, altering gate outputs in circuit simulations.[12][13]
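As a simple illustration of two of these fault types, the following Python sketch (illustrative only, not tied to any tool named above) flips a single bit in an integer variable, emulating a radiation-induced memory error, and models a crash-stop process that permanently stops responding once it fails.

    import random

    def flip_bit(value, bit=None):
        """Transient bit-flip fault: invert one bit of an integer variable."""
        bit = random.randrange(32) if bit is None else bit
        return value ^ (1 << bit)

    class CrashStopProcess:
        """Crash-stop fault model: once failed, the process never responds again."""
        def __init__(self):
            self.alive = True

        def crash(self):
            self.alive = False          # abrupt, permanent cessation of service

        def handle(self, request):
            if not self.alive:
                raise ConnectionError("process has crashed (crash-stop model)")
            return request * 2

    counter = 1024
    corrupted = flip_bit(counter, bit=3)     # 1024 -> 1032: a single-event upset
    print(counter, corrupted)

    p = CrashStopProcess()
    p.crash()
    try:
        p.handle(21)
    except ConnectionError as exc:
        print("detected:", exc)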
Historical Development
Early Techniques and Milestones
Fault injection techniques originated in the aerospace and military sectors during the 1970s, driven by the need to ensure reliability in safety-critical systems where natural faults were infrequent but potentially catastrophic, such as in avionics and spacecraft.[7] NASA's space programs emphasized fault-tolerant computing to address computer errors in harsh environments.[16] These efforts laid the groundwork for fault injection as a method to emulate failures in hardware and software, particularly for space systems where redundancy and error recovery were essential.[17]

A key milestone in the 1970s was the formalization of fault-tolerant computing principles, with Algirdas Avizienis publishing seminal work on the architecture of such systems, including strategies for fault detection and recovery in computing environments.[18] This period also saw the initial development of software fault injection techniques, which involved artificially inducing faults to test system robustness, marking a shift from purely hardware-focused methods to software emulation for dependability assessment.[19] By the late 1970s, hardware fault injection using radiation testing emerged in semiconductor laboratories to simulate cosmic-ray-induced errors, providing empirical data on device vulnerabilities under accelerated conditions.[20]

In the 1980s, dedicated tools advanced these techniques further; for instance, FIAT (Fault Injection-based Automated Testing), developed for real-time distributed systems, enabled systematic emulation of faults through code and data mutations to evaluate fault tolerance mechanisms.[21] These early methods, motivated by the rarity of natural faults in controlled environments like military avionics, prioritized conceptual models of fault propagation over exhaustive testing, influencing subsequent reliability engineering practices.[22]

Evolution in Computing Eras
In the 1990s, fault injection techniques gained standardization through software-implemented methods tailored for embedded systems, exemplified by the Xception tool, which enabled precise fault insertion and monitoring in processor functional units to evaluate dependability without hardware modifications.[23] This era marked the integration of fault injection into rigorous certification standards, such as DO-178B for aviation software, where it became essential for verifying robustness in safety-critical airborne systems by simulating faults during development and testing.[24]

The 2000s saw fault injection adapt to emerging distributed and virtualized environments, including grid computing infrastructures, where techniques were applied to assess fault tolerance in large-scale, resource-sharing networks.[25] Notable milestones included 2005 IEEE publications on hybrid fault injection approaches, which combined software and hardware methods to improve detection accuracy in complex systems.[29] Virtualization platforms also facilitated fault injection by allowing isolated experimentation on emulated hardware, bridging the gap between simulation and real-world deployment.[27] A further advance came in cloud systems with the introduction of Chaos Monkey by Netflix in 2011, a tool that randomly terminates virtual machine instances to inject failures and ensure system resilience in production environments.[26]

From the 2010s into the 2020s, fault injection evolved to address AI and machine learning systems, incorporating adversarial perturbations—subtle input modifications that test model robustness against malicious or erroneous data, as pioneered in seminal work on evasion attacks.[28] In the 2020s, the focus extended to cyber-physical systems such as autonomous vehicles, where tools like AVFI enable targeted fault simulation to validate resilience against sensor and actuator failures in dynamic environments.[30] Emerging quantum computing paradigms around 2020 introduced specialized fault models to simulate qubit errors and decoherence, laying groundwork for fault-tolerant quantum architectures.[31]

Implementation Methods
Software-Based Fault Injection
Software-based fault injection involves deliberately introducing faults into software systems at the code or runtime level to evaluate their robustness and fault-handling mechanisms. This approach operates without physical hardware modifications, focusing instead on altering program behavior through programmatic means. It is particularly useful for testing error recovery in applications where hardware access is limited or impractical.[2]

One primary method is code mutation, where faults are inserted by modifying the source code prior to compilation, such as changing operators, variables, or control flow statements to simulate defects like arithmetic overflows or logic errors. This technique, often adapted from mutation testing, allows precise control over fault types and locations, enabling assessment of how well test cases detect and handle injected errors. For instance, in C++ applications, mutating conditional statements can reveal weaknesses in exception handling. Seminal work on mutation-based injection for dependability evaluation traces back to early tools that integrated mutation operators with fault tolerance testing.[32][33]

Runtime injection techniques insert faults during program execution without requiring source code access, using mechanisms like debuggers or interceptors to alter memory, registers, or execution paths in real time. Debuggers, such as those based on ptrace in Unix-like systems or JDB in Java, can flip bits in variables or force exceptions to mimic transient errors. Interceptors, often implemented via dynamic linking mechanisms like LD_PRELOAD, override system calls to simulate failures such as memory allocation errors. In Java distributed applications, runtime injection via debugger-based tools like FAIL-FCI allows high-level fault scenarios to be scripted and executed across nodes, supporting both random and deterministic injections for scalability testing. Similarly, in C++ environments, tools like GOOFI use object-oriented wrappers to inject faults into running processes, facilitating evaluation of fault propagation. These methods enable dynamic testing of live systems but demand careful synchronization to avoid unintended side effects.[2][34]

API hooking represents another runtime approach, where interceptors modify the behavior of application programming interfaces (APIs) to simulate specific errors, such as introducing network delays or corrupting return values (a minimal interception sketch appears at the end of this subsection). By redirecting calls to custom implementations, this technique targets interactions with libraries or operating systems, making it suitable for black-box testing. For example, in C++ benchmarking frameworks like Hovac, DLL-based hooking injects faults into third-party library calls, allowing configurable error modes without recompiling the target application. This method is effective for isolating component-level vulnerabilities in complex software stacks.[35]

Protocol-specific software fault injection focuses on corrupting communication protocols within network stacks, such as altering TCP/IP packet checksums or HTTP response headers to test protocol robustness. Tools like ORCHESTRA insert faults through a dedicated layer between the protocol implementation and the transport mechanism, enabling probing of timing properties and error recovery in distributed systems. Experiments using ORCHESTRA on commercial TCP implementations have revealed specification violations by simulating packet losses or delays, highlighting the technique's value in validating network software dependability.
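The interceptor style described above can be sketched in a few lines of Python; the example below monkey-patches the standard library's socket send call so that a small fraction of outgoing messages is dropped, delayed, or bit-corrupted. It is a minimal sketch of the general idea, not the API of ORCHESTRA, Hovac, or any other tool named here, and the fault probabilities are arbitrary.

    import random
    import socket
    import time

    _original_send = socket.socket.send

    def faulty_send(self, data, *args):
        """Interceptor: inject omission, timing, or corruption faults into outgoing data."""
        roll = random.random()
        if roll < 0.05:                       # omission fault: silently drop the message
            return len(data)                  # report success without transmitting
        if roll < 0.10:                       # timing fault: delay delivery
            time.sleep(0.2)
        elif roll < 0.15 and len(data) > 0:   # corruption fault: flip one bit of the payload
            payload = bytearray(data)
            payload[random.randrange(len(payload))] ^= 0x01
            data = bytes(payload)
        return _original_send(self, data, *args)

    socket.socket.send = faulty_send          # activate the hook for this process

In practice such hooks are enabled only in designated test runs, since they affect every socket created by the process.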
Protocol-specific injection of this kind extends general runtime methods to protocol layers, often combining interceptors with packet filters for targeted injections.[36]

Software-based fault injection offers several advantages, including low implementation cost (it requires only software tools and access to the execution environment), high controllability over fault parameters such as location and timing, and ease of repeatability for reproducible experiments. For example, injecting exceptions in Java applications via runtime tools allows rapid iteration on fault scenarios without hardware setup. These benefits make it well suited to early-stage development and continuous integration testing.[2][32]

However, challenges include performance overhead from instrumentation, which can alter timing-sensitive behaviors and increase execution time by several factors depending on injection density. In code mutation approaches, recompilation is often necessary, complicating workflows for large codebases, while runtime methods may introduce intrusiveness that affects fault representativeness. In addition, ensuring fault realism requires domain expertise to model errors accurately without oversimplifying complex interactions.[2][34]

Hardware-Based Fault Injection
Hardware-based fault injection involves physically perturbing hardware components to simulate faults, providing a realistic assessment of system resilience under real-world conditions. Unlike software methods, these techniques directly manipulate electrical signals, radiation, or environmental factors on actual devices, enabling the study of hardware-level error propagation. This approach is particularly valuable for validating fault tolerance in critical systems where physical faults, such as those induced by cosmic rays or manufacturing defects, must be emulated accurately.[37]

Key techniques include pin-level injection, which alters signals at specific circuit pins, often through voltage glitches that temporarily drop the power supply below operational thresholds to induce computational errors. For instance, voltage glitches on CPU pins can cause transient faults in control logic, mimicking power surges or undervoltage events. Radiation-based methods, such as heavy ion bombardment, simulate cosmic ray impacts by directing particle beams at chips to flip bits in memory or registers, typically resulting in single or multiple bit errors. Modern laser-based variants, advanced post-2010, use pulsed lasers (e.g., diode or YAG types) to target precise locations like SRAM cells, achieving reproducible single-byte faults with high spatial resolution on nodes down to 28 nm. Clock manipulation techniques disrupt timing by introducing glitches—short interruptions or extensions in the clock signal—to create timing faults, such as skipped instructions or metastable states in sequential logic.[6][37][38][37]

Custom hardware setups facilitate these injections, including fault injection boards built with FPGAs for programmable control over glitches and timing, allowing emulation of stuck-at faults in digital circuits by forcing pins to fixed logic levels (0 or 1). Electromagnetic interference (EMI) devices, such as magnetic probes, generate localized pulses to induce faults without direct contact, offering a non-invasive alternative for testing embedded systems. These tools, often integrated with oscilloscopes for precise triggering, enable targeted experiments on ASICs and memory modules, where faults like stuck-at conditions in combinational logic are injected to evaluate detection coverage in prototypes. For example, FPGA-based platforms like TURTLE emulate single-event upsets (SEUs) in SRAM to test mitigation in radiation-hardened designs.[37][39][6][40]

Applications span testing ASICs for cryptographic security, where laser-induced bit flips reveal vulnerabilities in embedded controllers, and memory modules in aerospace systems to assess error-correcting code efficacy against heavy ion faults. In digital circuits, hardware injection of stuck-at faults—permanently fixing a node to a logic value—helps verify manufacturing test patterns and fault-tolerant architectures in embedded devices. These methods are essential for systems requiring high reliability, such as automotive ECUs or satellite processors, by simulating physical defects that software alone cannot replicate.[37][41][6]
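Glitching campaigns of this kind are usually driven by a host script that sweeps glitch parameters and records the target's response after each attempt. The sketch below assumes a hypothetical glitch-injection board reachable over a serial port; the command strings (OFFSET, WIDTH, ARM) and replies are invented for illustration, and only the pyserial calls are standard.

    import serial   # pyserial; the glitcher's command protocol below is hypothetical

    def attempt(port, offset_ns, width_ns):
        """Arm the glitcher with one parameter pair and report the target's response."""
        port.write(f"OFFSET {offset_ns}\n".encode())
        port.write(f"WIDTH {width_ns}\n".encode())
        port.write(b"ARM\n")                     # glitch fires on the target's next trigger
        return port.readline().strip()           # e.g. b'OK', b'FAULT', or b'' on hang/reset

    with serial.Serial("/dev/ttyUSB0", 115200, timeout=1.0) as port:
        faults = 0
        trials = 0
        for offset in range(0, 2000, 50):        # sweep glitch placement relative to trigger
            for width in range(10, 200, 10):     # sweep glitch duration
                trials += 1
                if attempt(port, offset, width) == b"FAULT":
                    faults += 1
        print(f"observed faults in {faults}/{trials} attempts")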
Evaluation metrics emphasize fault coverage, defined as the percentage of injected faults that propagate to observable errors, which can approach 100% with precise laser techniques but drops to 1-2% for broad voltage glitches because of their non-deterministic effects. Physical reproducibility poses challenges, as radiation methods suffer from variability in fault location and timing, while pin-level approaches offer high repeatability but limited reach into internal chip structures. Post-2010 advances in laser fault injection have improved controllability, with success rates exceeding 75% for targeted bit flips, though decapsulation and alignment requirements increase setup complexity.[37][42][38]

Simulation-Based Fault Injection
Simulation-based fault injection introduces faults into virtual models or emulated environments to assess system behavior without risking physical hardware. This method leverages computational simulation to mimic fault effects, enabling early-stage reliability analysis during design. It bridges software and hardware testing by operating at abstraction levels from the circuit up to the system-on-chip (SoC), allowing repeatable experiments under controlled conditions.[43]

Key approaches include model-based simulation, where faults are injected into descriptive models of the system. For instance, SPICE simulators are used for analog circuit fault injection by modifying component parameters to emulate defects like shorts or opens, facilitating mixed-signal design validation.[44] In software contexts, UML dynamic specifications support fault injection through models that target state machine errors or unconnected ports, as demonstrated in analyses of systems like cardiac pacemakers.[45] SystemC models extend this to SoC design, enabling bus-level fault injection to perform failure mode and effects analysis (FMEA) during early prototyping of ARM-based systems.[46]

Emulator-based injection utilizes tools like QEMU to simulate virtual machines and inject faults at the instruction level, abstracting hardware faults such as bit flips in registers or memory. This approach supports multiple architectures, including x86 and ARM, providing non-intrusive analysis of embedded software dependability.[47] Hybrid simulations combine these with higher-fidelity models, such as switching between register-transfer level (RTL) and gate-level simulations, to accelerate fault campaigns while preserving accuracy; for example, frameworks like Simbah-FI achieve over 10x speedups in reliability testing of VLIW processors.[48]

These methods offer significant benefits, including scalability for large, complex systems where physical testing is impractical, and safety, since destructive faults never reach real hardware. In SystemC environments, this allows exhaustive exploration of SoC fault scenarios without prototyping costs, enhancing design reliability in deep-submicron technologies.[46] QEMU-based emulation further demonstrates efficiency, with experiments showing effective fault coverage for transient and permanent errors across processor architectures.[47]

Specific techniques emphasize fault propagation modeling to trace error effects through simulated components. In VHDL and Verilog environments, delay faults are modeled using transition and path delay fault approaches, where slow-to-rise/fall transitions or cumulative path delays are injected to simulate timing defects; these are detected via two-pattern tests in benchmark circuits like ISCAS89, achieving high coverage rates such as 99% in s13207.[49] Such modeling in hardware description languages enables precise propagation analysis, supporting validation of fault-tolerant designs before synthesis.[50]
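A toy Python illustration of the simulation-based approach (real campaigns use HDL simulators or emulators such as those described above): a one-bit full adder is modeled as a small netlist, a stuck-at-0 fault is forced on one internal net, and the faulty runs are compared against a golden run over all input patterns to see which patterns expose the fault.

    from itertools import product

    def full_adder(a, b, cin, stuck=None):
        """Gate-level one-bit full adder; 'stuck' forces one internal net to a fixed value."""
        def drive(name, value):
            # apply the stuck-at override the moment the net is driven
            if stuck is not None and stuck[0] == name:
                return stuck[1]
            return value
        s1 = drive("s1", a ^ b)      # first XOR stage
        c1 = drive("c1", a & b)
        c2 = drive("c2", s1 & cin)
        total = s1 ^ cin             # sum output
        cout = c1 | c2               # carry output
        return total, cout

    detected = 0
    patterns = list(product([0, 1], repeat=3))
    for a, b, cin in patterns:
        golden = full_adder(a, b, cin)
        faulty = full_adder(a, b, cin, stuck=("s1", 0))   # net s1 stuck-at-0
        if golden != faulty:
            detected += 1            # this pattern propagates the fault to an output
    print(f"{detected}/{len(patterns)} input patterns expose the stuck-at fault")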
Key Characteristics and Evaluation
Core Properties of Fault Injection
Fault injection techniques are defined by key properties that determine their utility in validating system dependability. Controllability denotes the precision with which the location, timing, and type of an injected fault can be specified, enabling targeted experimentation that mimics specific failure scenarios. Observability refers to the capability to monitor and capture the system's internal states and outputs in response to injected faults, facilitating detailed analysis of error propagation. Repeatability ensures that repeated injections of the same fault under identical conditions produce consistent results, which is essential for statistical validation and comparison across experiments. Intrusiveness measures the extent to which the fault injection mechanism disrupts the system's normal execution, with lower intrusiveness preserving the authenticity of behavioral observations. These properties vary across techniques; for instance, hardware-based methods often offer high controllability and repeatability but may introduce moderate intrusiveness through physical interfaces, while software methods provide strong observability at the cost of potential timing perturbations.

A fundamental taxonomy of fault injection distinguishes approaches based on the tester's access to system internals. Black-box fault injection operates externally, perturbing inputs or environmental conditions without knowledge of the underlying code or architecture, making it suitable for evaluating end-to-end system resilience in opaque environments. In contrast, white-box fault injection requires detailed internal access, allowing direct modification of code, memory, or hardware registers to inject faults at precise points, which enhances controllability but demands comprehensive system documentation. This classification aligns with broader testing paradigms and influences the choice of method depending on the validation goals, such as holistic system assessment versus component-level scrutiny.[51]

The theoretical underpinnings of fault injection draw from dependability theory, particularly the fault-error-failure chain. A fault is a defect or abnormal condition within the system, such as a hardware transient or a software bug; if activated, it may produce an error, a deviation in the system's internal state from its correct value; an error can then propagate to cause a failure, where the service delivered by the system deviates from its specification. This sequential model, formalized in foundational dependability research, guides fault injection by enabling the simulation of real-world threats to assess tolerance, detection, and recovery mechanisms. By injecting faults at various stages of the chain, practitioners can trace how errors manifest as failures, informing design improvements for reliable computing systems.[52]

Fault injection differs from mutation testing in both objectives and mechanisms. While mutation testing generates syntactic variants of source code (mutants) to evaluate the fault-revealing power of test suites, primarily during development, fault injection emulates operational faults in a running system to probe runtime dependability and resilience under dynamic conditions. This runtime focus allows fault injection to capture interactions with hardware, environment, and concurrency that static code mutations overlook, prioritizing systemic behavior over code-coverage adequacy.[53]
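The chain can be illustrated with a small, purely illustrative Python sketch: a fault (a corrupted calibration constant) produces an error (an out-of-range internal reading), which becomes a failure only if a plausibility check does not catch it first. The function names and thresholds are invented for the example.

    def read_sensor(raw, calibration):
        return raw * calibration                  # an error appears here if calibration is faulty

    def control_output(reading, detect=True):
        """Deliver service; a plausibility check can stop the error before it becomes a failure."""
        if detect and not (0.0 <= reading <= 150.0):
            raise ValueError("error detected: reading out of range, fail-safe engaged")
        return f"actuator set to {reading:.1f}"   # a wrong value here is a failure (incorrect service)

    calibration = 1.5
    faulty_calibration = calibration * 100        # the fault: a corrupted stored value

    good = control_output(read_sensor(20.0, calibration))
    try:
        control_output(read_sensor(20.0, faulty_calibration))            # error detected, no failure
    except ValueError as exc:
        print(exc)
    bad = control_output(read_sensor(20.0, faulty_calibration), detect=False)
    print(good, "|", bad)                         # the undetected error propagates to a failure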
Metrics for Assessing Effectiveness

Fault injection campaigns are evaluated through a set of quantitative metrics that measure the system's ability to detect, contain, and recover from induced faults, providing essential insight into dependability and resilience. These metrics, rooted in dependability engineering, quantify the impact of fault injection on system behavior without relying on qualitative assessment alone. Key among them are fault coverage, latency to recovery, propagation rate, and robustness score, each addressing a distinct aspect of fault-handling effectiveness.

Fault coverage represents the proportion of injected faults that are detected and handled by the system's error detection and tolerance mechanisms before they propagate to cause failures. In dependability engineering, this metric derives from probabilistic models of fault tolerance, where the coverage C is the probability that a randomly injected fault is identified, often estimated empirically through repeated injections. The standard formula is

C = (D / N) × 100%

where D is the number of detected faults and N is the total number of injected faults. This approach, introduced in early fault-tolerant system analyses, allows statistical estimation of detection efficacy across diverse fault models.[54][55]

Latency to recovery measures the duration from fault injection to full system restoration, capturing the responsiveness of recovery processes such as error correction or failover. High-resolution timing in hardware or simulation-based injections enables precise measurement, revealing bottlenecks in fault containment. For instance, transient faults may exhibit latencies in milliseconds, while permanent faults can extend recovery to seconds or longer, directly influencing system availability.[6]

Propagation rate quantifies the likelihood and extent to which an injected fault evolves into an error that affects system outputs or downstream components, often expressed as the percentage of faults reaching critical interfaces. This metric highlights vulnerability to error cascades, with rates varying by architecture; for example, simpler pipelined processors may show 5-10% higher propagation because of fewer mitigation layers. It is particularly useful for identifying weak points in fault isolation.[56]

Robustness score, akin to a system survival rate post-injection, evaluates overall resilience by calculating the percentage of fault scenarios in which the system maintains correct operation, either by masking the fault or by recovering without failure. This composite metric integrates detection and recovery outcomes, providing a holistic view of dependability.[6]

To derive reliable estimates for these metrics, especially propagation probabilities, statistical techniques such as Monte Carlo simulation are applied: faults are injected at random locations and timings over thousands of runs to model variability and compute confidence intervals, ensuring that results reflect real-world stochastic behavior in dependability assessments.[57]
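A minimal Python sketch of how such metrics are computed over a Monte Carlo campaign; the per-run outcomes are randomized stand-ins for real instrumented executions, and the probabilities are arbitrary.

    import math
    import random

    def run_with_fault(location, when):
        """Stand-in for one injection run: returns (detected, propagated, recovery_seconds).
        A real campaign would execute the instrumented system; here outcomes are randomized."""
        detected = random.random() < 0.9
        propagated = (not detected) and random.random() < 0.3
        recovery = random.uniform(0.001, 0.5) if detected else None
        return detected, propagated, recovery

    N = 10_000                                            # Monte Carlo campaign size
    runs = [run_with_fault(random.randrange(1024), random.random()) for _ in range(N)]

    D = sum(1 for d, _, _ in runs if d)
    coverage = 100.0 * D / N                              # C = (D / N) x 100%
    latencies = [r for d, _, r in runs if d]
    mean_latency = sum(latencies) / len(latencies)        # latency to recovery (detected runs)
    P = sum(1 for _, p, _ in runs if p)
    prop_rate = P / N                                     # propagation rate
    half_width = 1.96 * math.sqrt(prop_rate * (1 - prop_rate) / N)   # 95% confidence interval

    print(f"fault coverage: {coverage:.1f}%")
    print(f"mean recovery latency: {mean_latency * 1000:.1f} ms")
    print(f"propagation rate: {prop_rate:.3f} +/- {half_width:.3f}")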
Tools and Frameworks
Research and Open-Source Tools
Research and open-source tools for fault injection have primarily emerged from academic institutions, with significant contributions in the early 2000s from the University of California, Berkeley's Recovery Oriented Computing (ROC) project, which used fault injection to evaluate system availability and recovery mechanisms in distributed environments.[58] This work built on earlier software-based techniques for simulating hardware faults, influencing subsequent tools focused on dependability assessment.[59]

One seminal tool is Xception, a software-implemented fault injection technique developed in the late 1990s for evaluating dependability in embedded systems, particularly those written in C for real-time applications.[60] Xception supports fault injection at the process level by leveraging advanced processor debugging and performance-monitoring features, allowing emulation of memory, timing, and processor faults without hardware modifications, and it has been used in experiments to measure error propagation in embedded software.[61] Similarly, FERRARI, introduced in 1995, is a flexible framework for injecting faults and errors into software to validate system tolerance, emulating hardware faults through dynamic code instrumentation and supporting multiple error models for real-time evaluation.[62] These tools emphasize mutation operators, such as bit flips and value alterations, to mimic realistic failure scenarios in controlled experiments.[63]

In more recent research, open-source tools have extended fault injection to cloud and machine learning domains. For instance, 2015 studies employed custom injectors to assess cloud software dependability, revealing that injected network and VM faults propagate in up to 40% of cases in platforms like OpenStack and highlighting gaps in error handling.[64] Developments include FAIL*, an open-source framework on GitHub (introduced in 2015) for comprehensive fault campaigns in embedded and OS-level systems, supporting configurable injection points and post-analysis for tolerance quantification.[65] In cloud-native environments, tools like Chaos Mesh and LitmusChaos enable fault injection in Kubernetes clusters to test distributed system resilience.[66][67] For ML robustness, TensorFI (2020) and its extension TensorFI+ (2022) provide scalable injection of hardware faults like bit flips into TensorFlow models, enabling evaluation of DNN vulnerability with low overhead (around 7-8x inference slowdown), while MRFI (2023) offers multi-resolution injection for PyTorch networks to test layer-specific resilience.[68][69][70] These tools, often hosted on GitHub, facilitate reproducible research by integrating with debuggers and supporting custom mutation operators for targeted fault analysis.[71]
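In the spirit of these ML-oriented injectors, though not using their APIs, a single bit flip in a model parameter can be emulated directly on a NumPy weight matrix to observe how far an output drifts; the layer shape, flipped index, and bit position below are arbitrary.

    import numpy as np

    def flip_weight_bit(weights, index, bit):
        """Flip one bit in the IEEE-754 float32 representation of one weight."""
        flat = weights.astype(np.float32).ravel()    # work on a copy of the parameters
        as_int = flat.view(np.uint32)
        mask = np.uint32(1 << bit)                   # bit position within the 32-bit word
        as_int[index] ^= mask
        return as_int.view(np.float32).reshape(weights.shape)

    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 8)).astype(np.float32)   # a toy layer's weight matrix
    x = rng.standard_normal(8).astype(np.float32)

    golden = W @ x
    W_faulty = flip_weight_bit(W, index=5, bit=30)        # bit 30 sits in the exponent field
    faulty = W_faulty @ x
    print("max output deviation:", float(np.max(np.abs(golden - faulty))))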
Commercial and Enterprise Tools

Commercial and enterprise fault injection tools provide proprietary solutions tailored for large-scale, production-ready environments, enabling organizations to simulate faults in software, hardware, and hybrid systems to enhance reliability and compliance. These tools emphasize seamless integration into enterprise workflows, robust support for safety standards, and advanced analytics to quantify resilience, distinguishing them from open-source alternatives by offering dedicated support, scalability for distributed architectures, and compliance certifications.[72]

Gremlin is a leading commercial platform for chaos engineering, specializing in fault injection for cloud-native and microservices environments. It allows teams to inject targeted failures such as latency spikes, resource exhaustion, or network partitions to test system resilience in production-like settings. Key features include the Enterprise Fault Injection Suite for replicating real-world incidents, GameDay Manager for orchestrated experiments, and Service Reliability Scores dashboards that track risk remediation progress, supporting enterprise-scale deployments with 24/7 support. Pricing follows a custom model based on deployment size, requiring contact with sales for quotes, and it integrates with monitoring tools like Datadog for observability, though native CI/CD orchestration may require additional setup. Adoption in industries like finance and e-commerce has demonstrated downtime reductions of up to 50% through proactive fault testing.[73][74][75]

For hardware verification, Synopsys offers the Verdi Automated Debug System integrated with VC Z01X fault simulation, providing a comprehensive solution for injecting and analyzing faults in complex SoCs. VC Z01X enables high-performance fault injection to model manufacturing defects and safety-critical failures, measuring testbench quality and coverage for verification. Verdi enhances this by offering graphical analysis of fault simulation results, supporting UVM-based testbenches and HW/SW co-debug with synchronized views, while integrating with VCS simulation for efficient workflows. These tools comply with ISO 26262 for automotive functional safety, facilitating the fault coverage metrics essential for ASIL-D certification. Pricing is enterprise-customized, often bundled in Synopsys verification suites, and case studies in semiconductor design highlight improved debug efficiency as chip complexity scales.[76][77][78]

LDRA Fault Injection, part of the LDRA tool suite, targets safety-critical domains like avionics, injecting faults to verify robustness and compliance with standards such as DO-178C and ISO 26262. It supports dynamic testing for resource constraints and failure modes, including back-to-back model-code validation, to ensure resilience in embedded systems. Features include traceability from requirements to tests, automated fault scenarios at the unit and integration levels, and reporting for certification artifacts, with scalability for large avionics projects. Pricing is quote-based for enterprise licenses, and automotive case studies show its role in achieving ISO 26262 compliance by proving fault tolerance in ECUs, reducing verification time through automated injection.[79][80][81]

Since 2015, adoption of these tools has surged alongside DevOps practices, with chaos engineering platforms like Gremlin gaining traction in reliability testing, driven by the need for continuous integration in agile environments.
By 2025, over 78% of organizations report implementing DevOps, incorporating fault injection into CI/CD pipelines to accelerate feedback loops and to meet safety standards such as ISO 26262, which requires fault injection for verifying fault coverage in high-assurance systems. In the 2020s, tools have evolved for microservices and edge computing, with Gremlin and similar platforms enabling distributed fault scenarios in containerized and IoT setups, supporting hybrid cloud-edge resilience against intermittent connectivity failures.[82][83][84]

Libraries and Integration Frameworks
Libraries and integration frameworks provide modular, programmable interfaces for embedding fault injection into software development pipelines, allowing developers to simulate failures at the code or runtime level without standalone tools. These components typically offer APIs for injecting faults such as exceptions, delays, or mutations, enabling seamless incorporation into testing scripts or continuous integration processes.[85]

In Python, FIT4Python is a prominent library for injecting software faults by applying targeted code mutations to source files, supporting fault models like arithmetic errors and logical operator changes to evaluate error-handling mechanisms.[86] The library parses Python abstract syntax trees to insert faults, making it suitable for assessing dependability in applications like OpenStack, where it revealed gaps in exception coverage during mutation campaigns.[86] For process-level injection, ProFIPy offers a programmable fault injection service that dynamically alters program behavior, such as forcing exceptions or altering return values, via a configuration-driven API.[85]

For Java, libfaultinj serves as a cross-language fault injection library that intercepts application functions to introduce errors, including network delays and resource failures, by wrapping calls at runtime.[87] This enables dependency-level fault simulation, where faulty implementations can replace standard dependencies to test resilience in service-oriented architectures.[87]

LLVM-based tools like Mull extend fault injection to compiled languages by performing mutation testing on intermediate representations, applying operators such as bit flips or negation swaps to C/C++ code during compilation.[88] Mull's API allows specification of mutation sets, with execution revealing test suite effectiveness; for instance, it has been used to achieve over 80% mutation scores in open-source projects by integrating with build systems like CMake.[89]

Modern stacks benefit from language-specific libraries, such as fail-rs in Rust, which implements fail points for runtime error injection without recompilation, supporting custom fault behaviors like panics or value corruption via macros.[90] In Go, the go-fault library provides HTTP middleware for injecting faults like request rejection or latency into services, configurable through standard net/http handlers.[91]

These libraries integrate into broader frameworks for targeted testing; for example, fault injection nodes from the ROS Fault Injection Toolkit can be embedded in Robot Operating System graphs to simulate sensor or communication failures, using ROS topics to propagate injected errors during simulation runs.[92] Similarly, in web testing workflows, APIs from libraries like ProFIPy can be scripted alongside Selenium to inject browser or network faults, such as timeouts, by wrapping WebDriver calls in fault-prone contexts.[85]

A key advantage of these libraries is their flexibility for custom scripting, allowing developers to define fault scenarios programmatically; for instance, Mull's mutation API can be invoked via the command line or embedded scripts to target specific LLVM IR instructions for bit-flip simulations, as in mull-cxx --mutation <bitflip> target.cpp, which alters operand bits to assess propagation.[88] This modularity reduces overhead compared with full tools, enabling rapid iteration in CI/CD pipelines while maintaining precise control over fault timing and location.[89]
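As a standalone illustration of the AST-level mutation style used by libraries such as FIT4Python (the sketch below is not that library's API), a mutation operator can swap an arithmetic operator in a function's source and return a compiled, faulty variant for a test suite to exercise. The operator, helper names, and target function are invented for the example, which is intended to be run as a script.

    import ast
    import copy
    import inspect
    import textwrap

    class SwapAddSub(ast.NodeTransformer):
        """Mutation operator: replace the first binary '+' encountered with '-'."""
        def __init__(self):
            self.done = False
        def visit_BinOp(self, node):
            self.generic_visit(node)
            if not self.done and isinstance(node.op, ast.Add):
                node.op = ast.Sub()
                self.done = True
            return node

    def mutate(func):
        """Parse a function's source, apply the mutation, and return the faulty variant."""
        tree = ast.parse(textwrap.dedent(inspect.getsource(func)))
        mutated = SwapAddSub().visit(copy.deepcopy(tree))
        ast.fix_missing_locations(mutated)
        namespace = {}
        exec(compile(mutated, filename="<mutant>", mode="exec"), namespace)
        return namespace[func.__name__]

    def account_total(balance, deposit):
        return balance + deposit

    faulty_total = mutate(account_total)
    print(account_total(100, 5), faulty_total(100, 5))   # 105 vs. 95: does the test suite notice?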